
Intelligent Search by Dimension Reduction

  • Holger Bast

  • Max-Planck-Institut für Informatik AG 1 - Algorithmen und Komplexität


Short History

  • I started getting involved in April 2002

  • I gave a series of lectures in Nov./Dec. 2002

  • We started a subgroup at the beginning of this year; current members are:

    • Kurt Mehlhorn (director of AG 1)
    • Hisao Tamaki (visiting professor from Japan)
    • Kavitha Telikepalli, Venkatesh Srinivasan (postdocs)
    • Irit Katriel, Debapriyo Majumdar (PhD students, IMPRS)


Dimension Reduction

  • Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space

  • Formally: given an m×n matrix, possibly of full rank, find the best low-rank approximation




Generic Method

  • Find concepts (vectors in term space) c1,…,ck

  • Replace each document by a linear combination of the c1,…,ck

  • That is, replace the term-document matrix by the product C·D’, where

    • the columns of C are c1,…,ck
    • the columns of D’ are the documents expressed in terms of the concepts (a small sketch follows below)
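For concreteness, here is a minimal numpy sketch of this recipe, with a random toy matrix standing in for real data and the concept vectors simply assumed to be given:

```python
# A minimal sketch of the generic method, assuming the concept vectors are
# already given (here: random toy placeholders).
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 50, 5          # terms, documents, concepts

D = rng.random((m, n))        # toy term-document matrix
C = rng.random((m, k))        # toy concept vectors c1,...,ck (columns)

# Express each document as a linear combination of the concepts:
# find D_prime minimizing ||C @ D_prime - D||_F (least squares).
D_prime, *_ = np.linalg.lstsq(C, D, rcond=None)

D_approx = C @ D_prime        # the low-rank replacement of D
print(D_prime.shape)                 # (k, n): documents in concept space
print(np.linalg.norm(D - D_approx))  # approximation error
```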



Specific Methods

  • Latent semantic indexing (LSI)

    • Dumais et al. 1989
    • orthogonal concepts c1,…,ck
    • the span of c1,…,ck is the k-dimensional subspace that minimizes the sum of squared distances to the documents
    • choice of basis not specified (at least two sensible ways)
    • computable in polynomial time via the singular value decomposition (SVD; sketch below)
    • surprisingly good in practice
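A minimal sketch of LSI via the SVD on toy data; taking the documents to be the columns of Σk·Vkᵀ is just one of the two sensible basis choices:

```python
# A minimal LSI sketch via the SVD, on toy data.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 50, 5
D = rng.random((m, n))                    # toy term-document matrix

U, S, Vt = np.linalg.svd(D, full_matrices=False)
C = U[:, :k]                              # orthogonal concepts c1,...,ck
D_prime = np.diag(S[:k]) @ Vt[:k, :]      # documents in concept space

# C @ D_prime is the best rank-k approximation of D in the Frobenius norm
# (Eckart-Young); its span minimizes the sum of squared distances.
print(np.linalg.norm(D - C @ D_prime))
```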


Specific Methods

  • Probabilistic Latent Semantic Indexing (PLSI)

    • Hofmann 1999
    • find the rank-k stochastic matrix that maximizes the probability that the given matrix is an instance of it
    • connects the problem to statistical learning theory
    • hard to compute; approximated by local-search techniques (sketch below)
    • very good results on some test collections
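The standard local-search approach is EM for the aspect model P(w,d) = Σz P(z)·P(w|z)·P(d|z); a rough sketch on toy counts (the initialization and the fixed iteration count are arbitrary choices for illustration):

```python
# A minimal EM sketch for the PLSI aspect model on a toy count matrix.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 20, 3
N = rng.integers(0, 5, size=(m, n)).astype(float)   # toy term-document counts

# Random initialization of P(w|z), P(d|z), P(z)
Pw_z = rng.random((m, k)); Pw_z /= Pw_z.sum(axis=0)
Pd_z = rng.random((n, k)); Pd_z /= Pd_z.sum(axis=0)
Pz = np.full(k, 1.0 / k)

for _ in range(100):
    # E-step: P(z|w,d) for every cell, shape (m, n, k)
    joint = Pz[None, None, :] * Pw_z[:, None, :] * Pd_z[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate parameters from expected counts
    weighted = N[:, :, None] * post
    Pw_z = weighted.sum(axis=1); Pw_z /= Pw_z.sum(axis=0)
    Pd_z = weighted.sum(axis=0); Pd_z /= Pd_z.sum(axis=0)
    Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()

# Rank-k stochastic approximation of the normalized count matrix
P = (Pz[None, None, :] * Pw_z[:, None, :] * Pd_z[None, :, :]).sum(axis=2)
print(np.linalg.norm(N / N.sum() - P))
```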


Specific Methods

  • Concept Indexing (CI)

    • Karypis & Han 2000
    • c1,…,ck = centroid vectors of a k-clustering
    • documents = projections onto these centroids (sketch below)
    • computationally easy (given the clustering)
    • gave surprisingly good results in a recent DFKI project (Arbeitsamt)
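A toy sketch: a few plain k-means iterations supply the centroids, and each document is represented by its dot products with the normalized centroids (the exact projection used in CI may differ; this is an illustrative choice):

```python
# A minimal Concept Indexing sketch: concepts = centroids of a k-clustering.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 50, 5
D = rng.random((m, n))                         # toy term-document matrix

# A few iterations of plain k-means on the document (column) vectors
centroids = D[:, rng.choice(n, k, replace=False)]
for _ in range(20):
    dist = np.linalg.norm(D[:, :, None] - centroids[:, None, :], axis=0)
    labels = dist.argmin(axis=1)               # nearest centroid per document
    for j in range(k):
        if (labels == j).any():
            centroids[:, j] = D[:, labels == j].mean(axis=1)

C = centroids / np.linalg.norm(centroids, axis=0)  # normalized concepts
D_prime = C.T @ D                              # documents in concept space
print(D_prime.shape)                           # (k, n)
```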


Comparing Methods

  • Fundamental question: how good is each method, and under which circumstances?

  • Few theoretically founded answers to this question

    • seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS’98 (ten years after LSI was born!)
    • follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC’01
    • main statement: LSI is robust against addition of (how much?) noise


Why does LSI work so well?

  • A good method should produce concepts that form large pairwise angles in the reduced space

  • A formula for angles in the reduced space (numeric check below):

    • Let D = C·G, and let c1’,…,ck’ be the images of the concepts under LSI
    • Then the k×k matrix of dot products ci’·cj’ is (G·Gᵀ)⁻¹
    • That is, all pairwise angles are ≥ 90 degrees if and only if (G·Gᵀ)⁻¹ has nonpositive off-diagonal entries (M-matrix)
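A numeric check of this formula on random toy data, assuming the LSI document map x ↦ Σ⁻¹·Uᵀ·x (one of the basis choices mentioned above):

```python
# Build D = C.G exactly, map the concepts with x -> inv(S) @ U.T @ x, and
# compare the resulting Gram matrix with inv(G @ G.T).
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 40, 30, 4
C = rng.random((m, k))            # concepts (full column rank w.h.p.)
G = rng.random((k, n))            # mixing matrix (full row rank w.h.p.)
D = C @ G                         # rank-k term-document matrix

U, S, Vt = np.linalg.svd(D, full_matrices=False)
C_img = np.diag(1.0 / S[:k]) @ U[:, :k].T @ C   # images c1',...,ck'

print(np.allclose(C_img.T @ C_img, np.linalg.inv(G @ G.T)))   # True
```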


Polysemy and Simonymy

  • Let Tij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurrence of terms i and j)

    • Call term k a polysem if there exist terms i and j such that, for some threshold t, Tik, Tjk ≥ t but Tij < t
    • Two terms i and j are simonyms if Tij ≥ Tii or Tij ≥ Tjj
  • Without polysems and simonyms we have

    1. Tij ≥ min(Tik, Tjk) for all i, j, k
    2. Tii > Tij for all j ≠ i
  • A symmetric matrix (Tij) satisfying 1. and 2. is called strictly ultrametric



Help from Linear Algebra

  • Theorem [Martínez, Michon, San Martín 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e., its diagonal entries are positive and its off-diagonal entries are nonpositive (numeric verification below)
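A small numeric verification on a hand-picked strictly ultrametric matrix:

```python
# Check conditions 1. and 2. from the previous slide on a toy matrix T,
# then confirm that its inverse is an M-matrix.
import numpy as np

T = np.array([[3.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])
n = len(T)

ultra = all(T[i, j] >= min(T[i, k], T[j, k])
            for i in range(n) for j in range(n) for k in range(n))
strict = all(T[i, i] > T[i, j] for i in range(n) for j in range(n) if i != j)
print(ultra, strict)                                 # True True

Tinv = np.linalg.inv(T)
off = Tinv[~np.eye(n, dtype=bool)]
print((np.diag(Tinv) > 0).all(), (off <= 0).all())   # True True (M-matrix)
```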



A new LSI theorem

  • Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI-space these concepts form large pairwise angles.

  • Beware: This only holds for the original LSI, not for its widely used variant!

  • Question: How can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!



Exploiting Link Structure

  • Achlioptas, Fiat, Karlin, McSherry (FOCS’01):

    • documents have a topic (implicit in the distribution of terms)
    • and a quality (implicit in the link structure)
    • represent each document by a vector
      • direction corresponds to the topic
      • length corresponds to the quality
    • Goal: for a given query, rank documents by their dot product with the topic of the query


Model details

  • Underlying parameters

    • A = [A1 … An] authority topics, one per doc.
    • H = [H1 … Hn] hub topics, one per doc.
    • C = [C1 … Ck] translates topics to terms
    • q = [q1 … qk] query topic
  • The input we see

    • D ~ A·C + H·C   (term-document matrix)
    • L ~ Hᵀ·A   (link matrix)
    • Q ~ q·C   (query terms)
  • Goal: recover the ordering of A1·q,…,An·q (toy sketch below)
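A toy instantiation of the model (all shapes are assumptions chosen for convenience, and the noisy “~” generation is replaced by exact products, purely for illustration):

```python
# Toy parameters and the inputs the model generates from them.
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 20, 3, 50          # documents, topics, terms

A = rng.random((k, n))       # authority topics, one column per document
H = rng.random((k, n))       # hub topics, one column per document
C = rng.random((k, m))       # translates topics to terms
q = rng.random(k)            # query topic

D = A.T @ C + H.T @ C        # document-term matrix the model generates
L = H.T @ A                  # link matrix
Q = q @ C                    # query terms

# The goal: recover the ordering of the authority scores A_i . q
scores = A.T @ q
print(np.argsort(-scores))   # ideal ranking an algorithm should reproduce
```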



Model - Problems

  • Link matrix generation L ~ Hᵀ·A

    • is OK, because the presence of a link is related to the hub/authority value
  • Term-document matrix generation D ~ A·C + H·C

    • very unrealistic: the term distribution gives information on the topic, but not on the quality!
    • more realistic: D ~ A0·C + H0·C, where A0 and H0 contain the normalized columns of A and H
  • So far, we can solve the special case where A differs from H only by a diagonal matrix (i.e., hub topic = authority topic)



Perspective

  • Strong theoretical foundations

    • unifying framework + comparative analysis for large variety of dimension reduction methods
    • realistic models + performance guarantees
  • Make proper use of human intelligence

    • integrate explicit knowledge
    • but only as much as required (automatic detection)
    • combine dimension reduction methods with interactive schemes (e.g. phrase browsing)




Specific Methods

  • Latent semantic indexing (LSI) [Dumais et al. ’89]

    • orthogonal concepts c1,…,ck
    • the span of c1,…,ck is the k-dimensional subspace that minimizes the sum of squared distances to the documents
  • Probabilistic Lat. Sem. Ind. (PLSI) [Hofmann ’99]

    • find the rank-k stochastic matrix that maximizes the probability that the given matrix is an instance of it
  • Concept Indexing (CI) [Karypis & Han ’00]

    • c1,…,ck = centroid vectors of a k-clustering
    • documents = projections onto these centroids


Dimension Reduction Methods

  • Main idea: the high-dimensional space of objects is a variant of an underlying low-dimensional space

  • Formally: given an m×n matrix, possibly of full rank, find the best low-rank approximation



I will talk about …

  • Dimension reduction techniques

    • some methods
    • a new theorem
  • Exploiting link structure

    • state of the art
    • some new ideas
  • Perspective



Overview

  • Exploiting the link structure

    • Google, HITS, SmartyPants
    • Trawling
  • Semantic Web

  • Interactive browsing



Scatter/Gather

  • Cutting, Karger, Pedersen, Tukey, SIGIR’92

  • Motivation: Zooming into a large document collection

  • Realisation: geometric clustering

  • Challenge: extremely fast algorithms are required, in particular:

    • linear-time preprocessing
    • constant-time query processing
  • Example: New York Times News Service, articles from August 1990 (~5000 articles, 30 MB text); a toy sketch follows below
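A toy scatter/gather loop; plain k-means stands in for the paper’s fast Buckshot/Fractionation routines:

```python
# Scatter (cluster), let a "user" gather some clusters, and re-scatter on
# the selection. Random vectors stand in for real document vectors.
import numpy as np

def kmeans(X, k, rng, iters=10):
    """A few iterations of plain k-means; returns cluster labels."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
docs = rng.random((500, 20))              # toy document vectors

labels = kmeans(docs, 5, rng)             # scatter into 5 clusters
gathered = docs[np.isin(labels, [1, 3])]  # "user" gathers clusters 1 and 3
labels2 = kmeans(gathered, 5, rng)        # re-scatter to zoom in
print(np.bincount(labels2))
```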



Scatter/Gather – Example

  [screenshot slides of the Scatter/Gather interface]



Phrase Browsing

  • Nevill-Manning, Witten, Moffat, 1997

  • Formulating a good query requires more or less knowledge of the document collection

    • if less, fine
    • if more, interaction is a must
  • Build hierarchy of phrases

  • Example: http://www.nzdl.org/cgi-bin/library

  • Challenge: fast algorithms for finding a minimal grammar, e.g., for S → babaabaabaa (greedy sketch below)
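A greedy digram-replacement heuristic (in the spirit of Re-Pair) illustrates the problem; it is not the fast exact algorithm asked for:

```python
# Repeatedly replace the most frequent non-overlapping digram by a fresh
# nonterminal until no digram occurs twice.
from collections import Counter

def greedy_grammar(s):
    rules, seq, fresh = {}, list(s), 0
    while True:
        # Count non-overlapping digram occurrences, left to right
        counts, last = Counter(), {}
        for i in range(len(seq) - 1):
            p = (seq[i], seq[i + 1])
            if last.get(p, -2) != i - 1:        # skip overlapping repeats
                counts[p] += 1
                last[p] = i
        if not counts or counts.most_common(1)[0][1] < 2:
            break
        pair = counts.most_common(1)[0][0]
        nt = f"R{fresh}"; fresh += 1
        rules[nt] = pair
        out, i = [], 0                          # rewrite seq with the new rule
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    rules["S"] = tuple(seq)
    return rules

print(greedy_grammar("babaabaabaa"))
# {'R0': ('b', 'a'), 'R1': ('R0', 'a'), 'S': ('R0', 'R1', 'R1', 'R1')}
```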



Teoma

  • More refined concept of authoritativeness, depending on the specific query (“subject-specific popularity”)

  • More sophisticated query refinement

  • But: Coverage is only 10% of that of Google

  • Example: http://www.teoma.com


