Information Retrieval in Complex Systems seminal work, state of the art, challenges



Yüklə 180,5 Kb.
tarix19.07.2018
ölçüsü180,5 Kb.
#57265


Information Retrieval in Complex Systems

  • seminal work, state of the art, challenges

  • Holger Bast


Overview

  • Exploiting the link structure

    • Google, HITS, SmartyPants
    • Trawling
  • Semantic web

    • XML, XML-Schema
    • RDF, DAML+OIL
  • Interactive retrieval

    • Scatter/Gather
    • Phrase Browsing


Google

  • Brin and Page (Stanford), 1998

  • PageRank = a web page’s authority

  • Vector of PageRanks = principal eigenvector of (variant of) link matrix



Google - PageRank



HITS

  • Jon Kleinberg, JACM, 1999

  • Find hubs and authorities in some base set

    • good hub = points to good authorities
    • good authority = pointed to by good hubs
  • If A is the link matrix (n x n) and a,h are the authoritativeness/hubbiness vectors, then

    • h ~ A · a
    • a ~ AT · h
  • a,h = principal eigenvector of AT· A, A · AT resp.



HITS – root and base set



SmartyPants

  • Achlioptas, Fiat, Karlin, McSherry, FOCS’01

  • Model link structure, term distribution in documents, and query generation in a common framework

  • Relevance then becomes a well-defined quantity

  • Algorithm: compute truncated singular value decomposition of a huge sparse matrix (LSI style)



SmartyPants

  • Model:

    • k underlying concepts, topics = vectors in Rk
    • authority, hubness of web pages: k  n matrices Adocs,Hdocs
    • link matrix (nn) “instantiated from” HTdocs · Adocs
    • term distribution: k  m matrices Aterms,Hterms
    • term-document matrix (mn) instantiated from ATterms · Adocs + HTterms · Hdocs
    • query instantiated from HTterms · u for some topic u
  • Algorithm SP:



Trawling for Web Communities

  • Kumar, Raghavan, Rajagopalan, Tomkins, 1999

  • Web communities are characterized by dense directed bipartite subgraphs

  • Challenge: fast algorithms for nontrivial graph problem, only few passes over data permitted

  • Algorithm: intensive pruning, some loss, but no false positives

  • Outcome: communities detected before participants themselves were aware of it



Overview

  • Exploiting the link structure

    • Google, HITS, SmartyPants
    • Trawling
  • Semantic Web

    • XML, XML-Schema
    • RDF, DAML+OIL
  • Interactive browsing

    • Scatter/Gather
    • Phrase Browsing


Semantic Web

  • Typical HTML document:

    IR in Complex Systems


    Speaker: H. Bast Room: 024
  • Problem: layout and semantic information intermingled

  • Idea: add structure + semantics to documents in a standardized, machine-readable way



XML

  • XML = eXtensible Markup Language

  • Example: IR in Complex Systems H. Bast MPI Informatik

  • XML is the “structural ASCII”

  • XML-Schema = document specification, itself XML



RDF

  • RDF = Resource Description Framework

  • Example: IR in Complex Systems H. Bast

  • RDF is the “semantic ASCII”



DAML+OIL

  • DAML = DARPA Agent Markup Language OIL = Ontology Inference Layer

  • Example:

  • DAML+OIL extends RDF and adds some second order logic



Overview

  • Exploiting the link structure

    • Google, HITS, SmartyPants
    • Trawling
  • Semantic Web

    • XML, XML-Schema
    • RDF, DAML+OIL
  • Interactive browsing

    • Scatter/Gather
    • Phrase Browsing


Scatter/Gather

  • Cutting, Karger, Pedersen, Tukey, SIGIR’92

  • Motivation: Zooming into a large document collection

  • Realisation: geometric clustering

  • Challenge: extremely fast algorithms required, i.p.

    • linear-time preprocessing
    • constant-time query processing
  • Example: New York Times News Service, articles from August 1990 (~5000 articles, 30MB text)



Scatter/Gather – Example



Scatter/Gather – Example



Phrase Browsing

  • Nevill-Manning,Witten,Moffat, 1997

  • Formulating a good query requires more or less knowledge of the document collection

    • if less, fine
    • if more, interaction is a must
  • Build hierarchy of phrases

  • Example: http://www.nzdl.org/cgi-bin/library

  • Challenge: fast algorithms for finding minimal grammar, e.g. for S  babaabaabaa



Teoma

  • More refined concept of authoritativeness, depending on the specific query (“subject-specific popularity”)

  • More sophisticated query refinement

  • But: Coverage is only 10% of that of Google

  • Example: http://www.teoma.com





SmartyPants

  • Model:

    • k underlying concepts
    • topics = vectors in Rk
    • for each page:
      • authority vector in Rk
      • hubness vector in Rk
      • k  n matrices Adocs,Hdocs
    • Matrix of links “instantiated from” HTdocs · Adocs


SmartyPants

  • Model (continued):

    • for each term:
      • authority vector in Rk
      • hubness vector in Rk
    • gives k  m matrices Aterms,Hterms
    • term-document matrix (m  n) instantiated from ATterms · Adocs + HTterms · Hdocs
    • query instantiated from HTterms · u for some topic u
  • SP recovers authority scores ATdocs · u from given query instance, and ranks pages accordingly



Yüklə 180,5 Kb.

Dostları ilə paylaş:




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2024
rəhbərliyinə müraciət

    Ana səhifə