Intelligent Search by Dimension Reduction Holger Bast Max-Planck-Institut für Informatik ag 1 Algorithmen und Komplexität

Yüklə 153,5 Kb.

Intelligent Search by Dimension Reduction

Short History

Dimension Reduction

Dimension Reduction

Generic Method

Generic Method

Specific Methods

Specific Methods

Specific Methods

Comparing Methods

Why does LSI work so well?

Polysemy and Simonymy

Help from Linear Algebra

A new LSI theorem

Exploiting Link Structure

Model details

Model - Problems

Perspective

Specific Methods

Dimension Reduction Methods

I will talk about …

Overview

Scatter/Gather

Scatter/Gather – Example

Phrase Browsing

Teoma

Yüklə 153,5 Kb.

Dostları ilə paylaş:

Intelligent Search by Dimension Reduction Holger Bast Max-Planck-Institut für Informatik ag 1 Algorithmen und Komplexität

Intelligent Search by Dimension Reduction

Holger Bast

Max-Planck-Institut für Informatik AG 1 - Algorithmen und Komplexität

Short History

I started getting involved in April 2002

I gave a series of lectures in Nov./Dec. 2002

We started a subgroup beginning of this year, current members are:

Dimension Reduction

Given a high-dimensional space of objects, recover the (assumed) underlying low dimensional space

Formally: given an m×n matrix, possibly full rank, find best low-rank approximation

Dimension Reduction

Given a high-dimensional space of objects, recover the (assumed) underlying low dimensional space

Formally: given an m×n matrix, possibly full rank, find best low-rank approximation

Generic Method

Find concepts (vectors in term space) c1,…,ck

Replace each document by a linear combination of the c1,…,ck

That is, replace term document matrix by product C·D’, where

Generic Method

Find concepts (vectors in term space) c1,…,ck

Replace each document by a linear combination of the c1,…,ck

That is, replace term document matrix by product C·D’, where

Specific Methods

Latent semantic indexing (LSI)

Specific Methods

Probabilistic Latent Semantic Indexing (PLSI)

Specific Methods

Concept Indexing (CI)

Comparing Methods

Fundamental question: which method is how good under which circumstances?

Few theoretically founded answers to this question

Why does LSI work so well?

A good method should produce

A formula for angles in the reduced space:

Polysemy and Simonymy

Let Tij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurence of terms i and j)

Without polysems and simonyms we have

A symmetric matrix (Tij) with 1. and 2. is called strictly ultrametric

Help from Linear Algebra

Theorem [Martinez,Michon,San Martin 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e. its diagonal entries are positive and its off-diagonal entries are nonpositive

A new LSI theorem

Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI-space these concepts form large pairwise angles.

Beware: This only holds for the original LSI, not for its widely used variant!

Question: How can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!

Exploiting Link Structure

Achlioptas,Fiat,Karlin,McSherry (FOCS’01):

Model details

Underlying parameters

The input we see

Goal: recover ordering of A1·q,…,An·q

Model - Problems

Link matrix generation L  HT·A

Term document matrix generation D  A·C + H·C

So far, we could solve the special case where A differs from H by only a diagonal matrix (i.e. hub topic = authority topic)

Perspective

Strong theoretical foundations

Make proper use of human intelligence

Specific Methods

Latent semantic indexing (LSI) [Dumais et al. ’89]

Probabilistic Lat. Sem. Ind. (PLSI) [Hofmann ’99]

Concept Indexing (CI) [Karypis & Han ’00]

Dimension Reduction Methods

Main idea: the high-dimensional space of objects is a variant of an underlying low dimensional space

Formally: given an m×n matrix, possibly full rank, find best low-rank approximation

I will talk about …

Dimension reduction techniques

Exploiting link structure

Perspective

Overview

Exploiting the link structure

Semantic Web

Interactive browsing

Scatter/Gather

Cutting, Karger, Pedersen, Tukey, SIGIR’92

Motivation: Zooming into a large document collection

Realisation: geometric clustering

Challenge: extremely fast algorithms required, i.p.

Example: New York Times News Service, articles from August 1990 (~5000 articles, 30MB text)

Scatter/Gather – Example

Scatter/Gather – Example