Information Retrieval in Complex Systems: seminal work, state of the art, challenges

Holger Bast

Overview

Exploiting the link structure

Semantic Web

Interactive retrieval

Google

Brin and Page (Stanford), 1998

PageRank = a web page’s authority

Vector of PageRanks = principal eigenvector of (variant of) link matrix

Google - PageRank
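The eigenvector view of PageRank can be illustrated with a few lines of power iteration. This is only a minimal sketch, not Google's implementation: the `links` dictionary, the damping factor of 0.85, and the even redistribution of rank from pages without out-links are assumptions chosen for illustration (they are one common way to define the "variant of" the link matrix).

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links: dict mapping a page to the list of pages it links to
           (hypothetical toy input format).
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # A page passes its rank on in equal parts to its targets.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
```

In this toy graph, page `c` is linked from both other pages and ends up with the highest rank; the ranks sum to 1 because each iteration only redistributes the existing rank mass.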

HITS

Jon Kleinberg, JACM, 1999

Find hubs and authorities in some base set

If A is the n x n link matrix (A_ij = 1 if page i links to page j) and a, h are the authoritativeness/hubbiness vectors, then

a = principal eigenvector of A^T · A, and h = principal eigenvector of A · A^T.
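The eigenvector characterization corresponds to a simple mutual-reinforcement iteration: authority scores are updated from the hubs pointing to a page, and hub scores from the authorities a page points to. The sketch below assumes a toy `links` dictionary (page to list of link targets) and a fixed number of iterations; both are illustrative choices, not taken from the paper.

```python
import math


def hits(links, iterations=50):
    """Iterate a <- A^T h, h <- A a with normalization after each round."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of q = sum of hub scores of pages linking to q.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        # Hub score of p = sum of authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit Euclidean length.
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub


toy_links = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
authority, hubbiness = hits(toy_links)
```

Here `x` receives the highest authority score because all three hubs point to it, while `h1` and `h2` are better hubs than `h3` because they point to both authorities.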

HITS – root and base set
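In Kleinberg's construction, the root set consists of the top results of a text search for the query, and the base set adds the pages the root set links to plus a bounded number of pages linking into each root page. A minimal sketch of that expansion follows; the function name, the `links` dictionary, and the default bound of 50 in-linking pages per root page are assumptions for illustration.

```python
def base_set(root, links, max_in_per_page=50):
    """Expand a root set of pages into a base set.

    root:  iterable of pages returned by a text search (assumed given).
    links: dict mapping a page to the list of pages it links to.
    """
    root = list(root)
    # Build a reverse index: page -> pages that link to it.
    incoming = {}
    for source, targets in links.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    base = set(root)
    for page in root:
        # Add every page the root page points to ...
        base.update(links.get(page, []))
        # ... and at most max_in_per_page pages pointing to it.
        base.update(incoming.get(page, [])[:max_in_per_page])
    return base
```

Hubs and authorities are then computed on the subgraph induced by the base set rather than on the whole web.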

SmartyPants

Achlioptas, Fiat, Karlin, McSherry, FOCS’01

Model link structure, term distribution in documents, and query generation in a common framework

Relevance then becomes a well-defined quantity

Algorithm: compute truncated singular value decomposition of a huge sparse matrix (LSI style)
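To make the "LSI style" step concrete, the following sketch computes a rank-k truncated singular value decomposition of a tiny term-document matrix with NumPy. It is not the SmartyPants algorithm itself: the matrix, the term names, and the use of a dense `numpy.linalg.svd` are assumptions for illustration, whereas a huge sparse matrix would call for a sparse iterative solver.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Return the k largest singular triplets of a dense matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Hypothetical toy data: rows are terms, columns are documents.
terms = ["web", "link", "graph", "gene", "protein"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U, S, Vt = truncated_svd(A, k=2)
# Each row is one document expressed in the 2-dimensional concept space.
doc_coords = (np.diag(S) @ Vt).T
```

In the reduced space, the first two documents (both about web graphs) become nearly parallel, while documents from the two different topics stay orthogonal; similarity can then be measured with `cosine` on rows of `doc_coords`.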

SmartyPants

Model:

Algorithm SP:

Trawling for Web Communities

Kumar, Raghavan, Rajagopalan, Tomkins, 1999

Web communities are characterized by dense directed bipartite subgraphs

Challenge: fast algorithms for a nontrivial graph problem, with only a few passes over the data permitted

Algorithm: intensive pruning, some loss, but no false positives

Outcome: communities detected before the participants themselves were aware of them
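One pruning idea can be sketched as follows. An (i, j) core is a complete bipartite subgraph with i "fan" pages each linking to the same j "center" pages, so a fan with fewer than j out-links, or a center with fewer than i in-links, cannot be part of any such core and can be discarded, repeatedly. This is a simplified in-memory sketch of degree-based pruning only; the edge-set representation and function name are assumptions, and the actual system works with few sequential passes over data on disk.

```python
from collections import Counter


def prune_for_cores(edges, i, j):
    """Drop edges that cannot belong to any (i, j) bipartite core.

    edges: iterable of (fan, center) pairs, i.e. directed links.
    """
    edges = set(edges)
    while True:
        out_degree, in_degree = Counter(), Counter()
        for fan, center in edges:
            out_degree[fan] += 1
            in_degree[center] += 1
        # Keep an edge only if its fan still has >= j out-links
        # and its center still has >= i in-links.
        kept = {
            (fan, center)
            for fan, center in edges
            if out_degree[fan] >= j and in_degree[center] >= i
        }
        if kept == edges:
            return kept
        edges = kept
```

Pruning of this kind never removes an edge of a genuine core; it only shrinks the graph so that the remaining candidates can be enumerated more cheaply.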

Overview

Exploiting the link structure

Semantic Web

Interactive browsing

Semantic Web

Typical HTML document:

IR in Complex Systems

Problem: layout and semantic information intermingled

Idea: add structure + semantics to documents in a standardized, machine-readable way

XML

XML = eXtensible Markup Language

Example: IR in Complex Systems H. Bast MPI Informatik

XML is the “structural ASCII”

XML-Schema = document specification, itself XML
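As an illustration of structure being made explicit, the sketch below builds and re-parses a small XML record with Python's standard `xml.etree.ElementTree` module. The element names `lecture`, `title`, `speaker`, and `institute` are hypothetical; the tag names of the original example are not recoverable here.

```python
import xml.etree.ElementTree as ET

# Build a small document; all tag names are assumptions for illustration.
lecture = ET.Element("lecture")
ET.SubElement(lecture, "title").text = "IR in Complex Systems"
ET.SubElement(lecture, "speaker").text = "H. Bast"
ET.SubElement(lecture, "institute").text = "MPI Informatik"

xml_text = ET.tostring(lecture, encoding="unicode")

# A program can now address each field by name instead of guessing from layout.
parsed = ET.fromstring(xml_text)
speaker = parsed.findtext("speaker")
```

The markup says what each piece of text is, but not yet what it means; attaching meaning to the names is left to the layers above XML.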

RDF

RDF = Resource Description Framework

Example: IR in Complex Systems H. Bast

RDF is the “semantic ASCII”
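The RDF data model boils down to (subject, predicate, object) statements about resources identified by URIs. The following sketch represents two such statements as plain Python tuples; the document URI is hypothetical, and the Dublin Core `title` and `creator` properties are chosen only as a familiar vocabulary, not because the original example used them.

```python
# Dublin Core element namespace (a widely used metadata vocabulary).
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical URI identifying the resource being described.
talk = "http://example.org/talks/ir-in-complex-systems"

triples = [
    (talk, DC + "title", "IR in Complex Systems"),
    (talk, DC + "creator", "H. Bast"),
]


def objects(statements, subject, predicate):
    """Return all objects of statements matching subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


creators = objects(triples, talk, DC + "creator")
```

Because predicates are globally identified by URI, independently written descriptions that use the same vocabulary can be merged by simply concatenating their statement lists.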

DAML+OIL

DAML = DARPA Agent Markup Language

OIL = Ontology Inference Layer

Example:

DAML+OIL extends RDF and adds some second-order logic

Overview

Exploiting the link structure

Semantic Web

Interactive browsing

Scatter/Gather

Cutting, Karger, Pedersen, Tukey, SIGIR’92

Motivation: Zooming into a large document collection

Realisation: geometric clustering

Challenge: extremely fast algorithms required, i.p.

Example: New York Times News Service, articles from August 1990 (~5000 articles, 30MB text)
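The browsing loop can be sketched as two operations: "scatter" clusters the current document set into a few groups, and "gather" merges the groups the user selects into a smaller set that is scattered again. The sketch below swaps in a plain k-means-style partitioning with cosine similarity on word counts instead of the Buckshot and Fractionation algorithms of the paper; the seeding with the first k documents and the toy headlines are assumptions for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def scatter(docs, k, rounds=5):
    """Partition docs into at most k clusters (simplified k-means style)."""
    vectors = [vectorize(d) for d in docs]
    centroids = vectors[:k]  # assumption: first k documents act as seeds
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for idx, vec in enumerate(vectors):
            best = max(range(len(centroids)),
                       key=lambda c: cosine(vec, centroids[c]))
            clusters[best].append(idx)
        # New centroid = summed word counts of the cluster members.
        centroids = [
            sum((vectors[i] for i in members), Counter()) if members
            else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return [[docs[i] for i in members] for members in clusters if members]


def gather(clusters, chosen):
    """Merge the clusters the user picked into one smaller collection."""
    return [doc for c in chosen for doc in clusters[c]]


# Hypothetical toy headlines, not taken from the actual news corpus.
docs = [
    "iraq kuwait invasion oil",
    "stock market oil prices",
    "iraq troops kuwait border",
    "market prices fall stock",
]
clusters = scatter(docs, k=2)
focused = gather(clusters, [0])
```

Calling `scatter(focused, k)` again zooms further into the selected part of the collection, which is why each clustering step has to be fast enough for interactive use.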

Scatter/Gather – Example

Phrase Browsing

Nevill-Manning, Witten, Moffat, 1997

Formulating a good query requires more or less knowledge of the document collection

Build hierarchy of phrases

Example: http://www.nzdl.org/cgi-bin/library

Challenge: fast algorithms for finding a minimal grammar, e.g. for S → babaabaabaa
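To show what a grammar for such a string looks like, the sketch below uses a simple greedy heuristic: repeatedly replace the most frequent pair of adjacent symbols by a new rule. This is a Re-Pair-style simplification swapped in for illustration, not the authors' algorithm, and it does not guarantee a minimal grammar; the rule names `R1`, `R2`, ... are arbitrary.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_pos.get(pair) == i - 1:
            continue  # overlaps the occurrence just counted (e.g. "aaa")
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def build_grammar(text):
    """Return (start sequence, rules) where rules map a name to a pair."""
    seq, rules = list(text), {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so a new rule would not help
        name = f"R{len(rules) + 1}"
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbols, rules):
    """Expand a symbol sequence back into the original string."""
    return "".join(
        expand(rules[s], rules) if s in rules else s for s in symbols
    )


start, rules = build_grammar("babaabaabaa")
```

The rules form a hierarchy of repeated phrases, and `expand(start, rules)` reproduces the input string, so the grammar is a lossless description of the text.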

Teoma

More refined concept of authoritativeness, depending on the specific query (“subject-specific popularity”)

More sophisticated query refinement

But: Coverage is only 10% of that of Google

Example: http://www.teoma.com
