
Intelligent Search by Dimension Reduction
Holger Bast
Max-Planck-Institut für Informatik, AG 1: Algorithmen und Komplexität
Short History
I started getting involved in April 2002
I gave a series of lectures in Nov./Dec. 2002
We started a subgroup at the beginning of this year; current members are:
 Kurt Mehlhorn (director of AG 1)
 Hisao Tamaki (visiting professor from Japan)
 Kavitha Telikepalli, Venkatesh Srinivasan (postdocs)
 Irit Katriel, Debapriyo Majumdar (PhD students, IMPRS)
Dimension Reduction
Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space
Formally: given an m×n matrix, possibly of full rank, find the best low-rank approximation
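To make "best low-rank approximation" concrete, here is a minimal numpy sketch (not from the talk; the random matrix and k = 10 are placeholder assumptions): by the Eckart–Young theorem, truncating the SVD after the k largest singular values gives the rank-k matrix closest in Frobenius norm.

import numpy as np

rng = np.random.default_rng(0)
M = rng.random((500, 300))                    # an m x n matrix, typically of full rank
k = 10

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation (Eckart-Young)

print(np.linalg.matrix_rank(M_k))             # k
print(np.linalg.norm(M - M_k))                # Frobenius error of the truncation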
Generic Method
Find concepts (vectors in term space) c1,…,ck
Replace each document by a linear combination of c1,…,ck
That is, replace the term-document matrix by the product C·D’, where
 columns of C are c1,…,ck
 columns of D’ are the documents expressed in terms of the concepts
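As a purely illustrative sketch of the generic scheme (the random data, the least-squares fold-in via the pseudo-inverse, and the cosine comparison are all assumptions; the slides leave these choices open): documents and queries are mapped to concept space and compared there.

import numpy as np

def to_concept_space(C, X):
    # coordinates X' such that C @ X' is the least-squares fit to X
    return np.linalg.pinv(C) @ X

rng = np.random.default_rng(0)
D = rng.random((1000, 200))          # term-document matrix (terms x documents)
C = rng.random((1000, 20))           # columns: 20 concept vectors in term space

D_prime = to_concept_space(C, D)     # documents, now k-dimensional
q = rng.random((1000, 1))            # a query is also a vector in term space
q_prime = to_concept_space(C, q)

# cosine similarity between the query and every document, in concept space
sims = (q_prime.T @ D_prime) / (np.linalg.norm(q_prime) * np.linalg.norm(D_prime, axis=0))
print(np.argsort(-sims.ravel())[:5]) # the five closest documents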
Specific Methods: Latent Semantic Indexing (LSI)
 Dumais et al. 1989
 orthogonal concepts c1,…,ck
 orthogonal concepts c1,…,ck
 span of c1,…,ck is the k-dimensional subspace that minimizes the squared distances
 choice of basis not specified (at least two sensible ways)
 computable in polynomial time via the singular value decomposition (SVD)
 surprisingly good in practice
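A minimal LSI sketch via the SVD D = U·S·Vᵀ (toy data; which of the two representations below the talk considers the "original" LSI is not stated here, so both are shown): the columns of U_k serve as the orthogonal concepts c1,…,ck.

import numpy as np

rng = np.random.default_rng(0)
D = rng.random((1000, 200))           # term-document matrix
k = 20

U, s, Vt = np.linalg.svd(D, full_matrices=False)
C = U[:, :k]                          # concepts: orthonormal vectors in term space
D1 = np.diag(s[:k]) @ Vt[:k, :]       # documents with the singular values folded in
D2 = Vt[:k, :]                        # documents with the singular values dropped

print(np.linalg.norm(D - C @ D1))     # error of the rank-k approximation C·D1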
Specific Methods: Probabilistic Latent Semantic Indexing (PLSI)
 Hofmann 1999
 find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
 connects the problem to statistical learning theory
 hard to compute; approximated by local-search techniques
 very good results on some test collections
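For concreteness, a compact EM sketch for the PLSI aspect model P(d,w) = Σz P(z)·P(d|z)·P(w|z) on toy counts. This is plain textbook EM, not Hofmann's exact procedure (no tempering, no fold-in), and the toy data, k = 4, and random initialization are assumptions.

import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(30, 100)).astype(float)   # toy document-word counts
k = 4

Pz = np.full(k, 1.0 / k)                               # P(z)
Pd_z = rng.random((k, N.shape[0]))                     # P(d|z)
Pd_z /= Pd_z.sum(axis=1, keepdims=True)
Pw_z = rng.random((k, N.shape[1]))                     # P(w|z)
Pw_z /= Pw_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: posterior P(z|d,w), one distribution over z per (document, word) pair
    post = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]
    post /= post.sum(axis=0, keepdims=True) + 1e-12
    # M-step: re-estimate all parameters from the expected counts
    weighted = N[None, :, :] * post
    Pw_z = weighted.sum(axis=1)
    Pw_z /= Pw_z.sum(axis=1, keepdims=True)
    Pd_z = weighted.sum(axis=2)
    Pd_z /= Pd_z.sum(axis=1, keepdims=True)
    Pz = weighted.sum(axis=(1, 2))
    Pz /= Pz.sum()

print(Pz)                                              # mixing weights of the k latent concepts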
Specific Methods: Concept Indexing (CI)
 Karypis & Han 2000
 c1,…,ck = centroid vectors of a k-clustering
 documents = projections onto these centroids
 computationally easy (given the clustering)
 gave surprisingly good results in a recent DFKI project (Arbeitsamt)
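A minimal concept-indexing sketch: take the centroids of a k-clustering of the documents as the concepts and represent each document by its projections onto them. Using scikit-learn's KMeans and plain dot products with the normalized centroids as the "projections" are assumptions made for this sketch.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
D = rng.random((1000, 200))                       # term-document matrix (terms x documents)
k = 10

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(D.T)    # cluster the documents
centroids = km.cluster_centers_                   # k centroid vectors in term space
C = (centroids / np.linalg.norm(centroids, axis=1, keepdims=True)).T   # terms x k

D_prime = C.T @ D                                 # each document as its k projections
print(D_prime.shape)                              # (k, number of documents)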
Comparing Methods
Fundamental question: which method works how well under which circumstances?
There are few theoretically founded answers to this question
 seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS’98 (ten years after LSI was born!)
 follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC’01
 main statement: LSI is robust against the addition of (how much?) noise
Why does LSI work so well?
A good method should produce
 small angles between documents on similar topics
 large angles between documents on different topics
A formula for angles in the reduced space:
 Let D = C·G, and let c1’,…,ck’ be the images of the concepts under LSI
 Then the k×k dot products ci’·cj’ are given by the matrix (G·Gᵀ)⁻¹
 That is, the pairwise angles are ≥ 90 degrees if and only if (G·Gᵀ)⁻¹ has non-positive off-diagonal entries, i.e. it is an M-matrix
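A tiny numeric check of this criterion (toy data, not from the talk): draw a random mixing matrix G and test whether (G·Gᵀ)⁻¹ has non-positive off-diagonal entries.

import numpy as np

rng = np.random.default_rng(0)
G = rng.random((5, 40))                # k x n mixing matrix, so that D = C·G
M = np.linalg.inv(G @ G.T)             # the k x k matrix from the slide
off_diag = M - np.diag(np.diag(M))
print("all reduced angles >= 90 degrees:", bool(np.all(off_diag <= 0)))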
Polysemy and Simonymy
Let Tij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurrence of terms i and j)
 Call term k a polysem if there exist terms i and j such that, for some t, Tik, Tjk ≥ t but Tij < t
 Call two terms i and j simonyms if Tij ≥ Tii or Tij ≥ Tjj
Without polysems and simonyms we have
 1. Tij ≥ min(Tik, Tjk) for all i, j, k
 2. Tii > Tij for all j ≠ i
A symmetric matrix (Tij) satisfying 1. and 2. is called strictly ultrametric
Help from Linear Algebra
Theorem [Martínez, Michon, San Martín 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e. its diagonal entries are positive and its off-diagonal entries are non-positive
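A small numeric illustration of the reasoning above: check conditions 1. and 2. for a toy term-term matrix T and verify that its inverse is an M-matrix. The matrix T is an assumed toy example chosen so that the conditions hold; it is not derived from real data.

import numpy as np

T = np.array([[5.0, 2.0, 2.0],
              [2.0, 6.0, 3.0],
              [2.0, 3.0, 7.0]])        # toy symmetric co-occurrence-style matrix

n = T.shape[0]
cond1 = all(T[i, j] >= min(T[i, k], T[j, k])
            for i in range(n) for j in range(n) for k in range(n))
cond2 = all(T[i, i] > T[i, j] for i in range(n) for j in range(n) if i != j)
print("strictly ultrametric:", cond1 and cond2)

Tinv = np.linalg.inv(T)
is_m_matrix = (all(Tinv[i, i] > 0 for i in range(n)) and
               all(Tinv[i, j] <= 0 for i in range(n) for j in range(n) if i != j))
print("inverse is an M-matrix:", is_m_matrix)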
A New LSI Theorem
Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI space these concepts form large pairwise angles.
Beware: this only holds for the original LSI, not for its widely used variant!
Question: How can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!
Exploiting Link Structure
Achlioptas, Fiat, Karlin, McSherry (FOCS’01):
 documents have a topic (implicit in the distribution of terms)
 and a quality (implicit in the link structure)
 represent each document by a vector
 direction corresponds to the topic
 length corresponds to the quality
 Goal: for a given query, rank documents by their dot product with the topic of the query
Model Details
Underlying parameters:
 A = [A1 … An] authority topics, one per doc.
 H = [H1 … Hn] hub topics, one per doc.
 C = [C1 … Ck] translates topics to terms
 q = [q1 … qk] query topic
The input we see:
 D ≈ A·C + H·C (term-document matrix)
 L ≈ Hᵀ·A (link matrix)
 Q ≈ q·C (query terms)
Goal: recover the ordering of A1·q,…,An·q
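A synthetic-data sketch of this setup, under one dimensionally consistent reading of the notation (the shapes, the matrix order C·A instead of A·C, and the noiseless generation are assumptions, not the paper's exact model): topics live in R^k, A and H are k×n with one authority and one hub topic per document, and C is m×k.

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 50, 5                   # terms, documents, topics

A = rng.random((k, n))                 # authority topic of each document
H = rng.random((k, n))                 # hub topic of each document
C = rng.random((m, k))                 # translates topics to term distributions
q = rng.random(k)                      # query topic

D = C @ A + C @ H                      # term-document matrix (observed)
L = H.T @ A                            # link matrix (hub side points to authority side)
Q = C @ q                              # query term vector (observed)

scores = A.T @ q                       # hidden relevance scores A1·q, …, An·q
print(np.argsort(-scores)[:10])        # the ordering a method should recover from D, L, Q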
Model – Problems
Link matrix generation L ≈ Hᵀ·A
 is OK, because the presence of a link is related to the hub/authority value
Term-document matrix generation D ≈ A·C + H·C
 very unrealistic: the term distribution gives information on the topic, but not on the quality!
 more realistic: D ≈ A0·C + H0·C, where A0 and H0 contain the normalized columns of A and H
So far, we could solve the special case where A differs from H by only a diagonal matrix (i.e. hub topic = authority topic)
Perspective
Strong theoretical foundations
 unifying framework + comparative analysis for a large variety of dimension reduction methods
 realistic models + performance guarantees
Make proper use of human intelligence
 integrate explicit knowledge
 but only as much as required (automatic detection)
 combine dimension reduction methods with interactive schemes (e.g. phrase browsing)
Specific Methods
Latent Semantic Indexing (LSI) [Dumais et al. ’89]
 orthogonal concepts c1,…,ck
 span of c1,…,ck is the k-dimensional subspace that minimizes the squared distances
Probabilistic Latent Semantic Indexing (PLSI) [Hofmann ’99]
 find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
Concept Indexing (CI) [Karypis & Han ’00]
 c1,…,ck = centroid vectors of a k-clustering
 documents = projections onto these centroids
Dimension Reduction Methods
Main idea: the high-dimensional space of objects is a variant of an underlying low-dimensional space
Formally: given an m×n matrix, possibly of full rank, find the best low-rank approximation
I will talk about …
Dimension reduction techniques
 some methods
 a new theorem
Exploiting link structure
 state of the art
 some new ideas
Perspective
Overview
Exploiting the link structure
 Google, HITS, SmartyPants
 Trawling
Semantic Web
 XML, XML Schema
 RDF, DAML+OIL
Interactive browsing
 Scatter/Gather
 Phrase Browsing
Scatter/Gather
Cutting, Karger, Pedersen, Tukey, SIGIR’92
Motivation: zooming into a large document collection
Realisation: geometric clustering
Challenge: extremely fast algorithms required, in particular
 linear-time preprocessing
 constant-time query processing
Example: New York Times News Service, articles from August 1990 (~5000 articles, 30MB text)
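A rough sketch of one Scatter/Gather round on made-up documents (illustration only: it uses off-the-shelf TF-IDF and k-means, whereas the point of the paper is specially designed linear-time preprocessing and constant-time query processing; the documents and cluster counts are assumptions).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock markets fall", "new treatment for cancer", "markets rally on earnings",
        "vaccine trial results", "central bank raises rates", "hospital study on cancer"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

def scatter(doc_ids, k):
    # "scatter": cluster the current selection and summarize each cluster by its top terms
    k = min(k, len(doc_ids))                      # guard for very small selections
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[doc_ids])
    for c in range(k):
        members = [doc_ids[i] for i in range(len(doc_ids)) if km.labels_[i] == c]
        top = terms[np.argsort(-km.cluster_centers_[c])[:3]]
        print(f"cluster {c}: {', '.join(top)} -> docs {members}")
    return km.labels_

labels = scatter(list(range(len(docs))), k=2)
# "gather": the user selects interesting clusters; their union is scattered again
selected = [i for i, lab in enumerate(labels) if lab == 0]
scatter(selected, k=2)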
Scatter/Gather – Example
Phrase Browsing
Nevill-Manning, Witten, Moffat, 1997
Formulating a good query requires more or less knowledge of the document collection
 if less, fine
 if more, interaction is a must
Build a hierarchy of phrases
Example: http://www.nzdl.org/cgibin/library
Challenge: fast algorithms for finding a minimal grammar, e.g. for S → babaabaabaa
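For the grammar challenge, a naive greedy digram-replacement sketch (in the spirit of Re-Pair/Sequitur-style grammar compression). It is not the algorithm from the paper and gives no minimality guarantee; it only makes the task of finding a small grammar for a string such as babaabaabaa concrete.

from collections import Counter

def greedy_grammar(s):
    seq = list(s)                       # right-hand side of the start symbol
    rules = {}                          # nonterminal -> list of symbols
    next_id = 1
    while True:
        pairs = Counter(zip(seq, seq[1:]))          # count adjacent digrams
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:                                # no digram repeats: stop
            break
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = list(pair)
        new_seq, i = [], 0                          # replace occurrences left to right
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                new_seq.append(nt)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return seq, rules

start, rules = greedy_grammar("babaabaabaa")
print("S ->", " ".join(start))
for nt, rhs in rules.items():
    print(nt, "->", " ".join(rhs))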
Teoma
More refined concept of authoritativeness, depending on the specific query (“subject-specific popularity”)
More sophisticated query refinement
But: coverage is only 10% of that of Google
Example: http://www.teoma.com