
Ranking the Web

tarix  26.09.2018  ölçüsü  1,01 Mb.   #70485 

Ranking the Web Gianna M. Del Corso Antonio Gullí IITCNR, Pisa
Overview Web Statistics Some Web Ranking Algorithms Zooming on PageRank  Personalization
 Fast PageRank
Fun Results and Web Comparison
Web Statistics
Web Statistics January 2004, 151 millions active in the U.S. 76% used a SE at least once a month. Time spent searching ~ 40 mins.
Share Of Searches: February 2004 February 2004 1.5Millions US web surfers
Search Referrals March 2004 25 Millions Web Pages
Jupiter Media Metrix estimates Paid Ad will reach as much as $4 billion by 2005 Business growing rate increase of 20% in next five years
Google’s numbers
Web Ranking
Web Ranking The author of p gives a vote to q
Hits Eigenvectors computation can be used by: Where a: Vector of Authorities’ scores h: Vector of Hubs’ scores. W: Adjacency matrix in which wi,j = 1 if points to j.
Hits
Salsa Two separate random walks
Hits vs Salsa H = WrWcT A = WcTWr W is the adjacency matrix of G Wr is W divided by the sum of entries in its rows Wc is W divided by the sum of entries in its cols Stationary distribution proportional to inlinks and outlinks!!
Google’s PageRank
Google’s PageRank “Random Surfer Model”  Rank of page equals to the probability of sitting on that page Where  B(i) : set of pages inlinking to i.
 N(j) : num outgoing links from j.
Google’s PageRank Dangling nodes, i.e. Web pages with no outlinks
Personalized PageRank
Eurekester Create and join SearchGroups to focus your search by area of interest
Fast PageRank
PageRank Standard Algorithm for computing PR: Power Method applied to Takes several days due to the size of Web Graph
Why we need a fast linkbased rank? “…The link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created. After a year, about 80% of the links on the Web are replaced with new ones. This result indicates that search engines need to update linkbased ranking metrics very often…”
Accelerating PageRank Web Graph Compression to fit in internal memory [Boldi et al., 04] Efficient External memory implementation [Haveliwala, 99; Chen et al., 02] Combination of the above strategies
Accelerating PageRank Adaptive Power method:
C = set of pages converged, N = set of pages not yet converged Run PM on detecting converged components. In the paper, many other adapting strategies!! Slowconverging pages have high PageRank
Accelerating PageRank Extrapolation strategies: where ui eigvs
Accelerating PageRank Block Structure  Reordering web pages according to a lexicographical order.
 Compute “local Rank”
 Create a new starting vector

Accelerating PageRank Viewing PR as a linear system problem Transforming it in a sparse formulation Exploiting reducibility via permutations Comparing different scalar and block solvers
“Rich Get Richer” phenomenon “.. From our experimental data, we could observe that the top 20% of the pages with the highest number of incoming links obtained 70% of the new links after 7 months, while the bottom 60% of the pages obtained virtually no new incoming links during that period…”
Web Spamming
Spamming PageRank
Spamming PageRank Setting up sophisticated link structures within a spam farm does not improve the ranking of the target page.
Spamming Hits Easy to spam Create a new page p pointing to many authority pages (e.g., Yahoo, Google, etc.) p becomes a good hub page  … On p, add a link to your home page
Fun Results (aka “Google Bombing”)
Fun Search Resuls and Demo Fun Search Resuls and Demo
Fun Results (aka “Google Bombing”) Some Recent (as of 2004) and popular examples :  “weapons of mass destruction  hoax, IE error lookalike saying “weapons of mass destruction cannot be found”.
 great president  biography of George W. Bush.
 litigious bastards  homepage of the SCO Group.
 Buffone  Facce da culo  Discorsi Folli – Silvio Berlusconi
 out of touch executives – Google’s own corporate info page
 Waffle – John Kerry’s site (blog spamming campaign)
Will Google still dominate search in 2005? Every three years, a new search engine takes the lead and has its 15 minutes of fame. A timeline is at http://www.investors.com/ Open Source alternative [ Nutch ]
Comparing Ranks (Online Demo)
Bibliography K. Bharat, M. Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment, SIGIR Conference, 1998 P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. To appear in Proc. of the Thirteenth International World−Wide Web Conference. S. Brin and L. Page, The anatomy of a largescale hypertextual Web search engine, Computer Networks and ISDN Systems vol. 30 num 17, 1998 S Brin, L. Page: The Anatomy of a LargeScale Hypertextual Web Search Engine, WWW Conference, 1998 M. Bianchini, M. Gori, F. Scarselli, "Inside PageRank". Technical report DII 1/03, Department of Information Engineering, University of Siena, 2001. Y.Y. Chen, Q. Gan, T. Suel: I/OEfficient Techniques for Computing Pagerank", Proceedings of the eleventh international conference on Information and knowledge management J. Cho, S. Roy: Impact of Web Search Engines on Page Popularity In Proceedings of the WorldWide Web Conference (WWW), May 2004. G.M. Del Corso, A. Gulli, F. Romani: Fast PageRank Computation Via a Sparse Linear System, ITTCNR TechReport 2004 C.P.C Lee, G.H. Golub, S.A. Zenios: A Fast two stage algorithm for computing PageRank, Stanford TechReport 2004
Bibliography R. Lempel, S. Moran: SALSA: The Stochastic Approach for LinkStructure Analysis, ACM Transactions on Information Systems Vol. 19 No.2, 2001 T. H. Haveliwala: TopicSensitive PageRank: A ContextSensitive Ranking Algorithm for Web Search, IEEE Trans. on Knowledge and Data Eng, 2003 T. H. Haveliwala, Sepandar D. Kamvar, and Glen Jeh, "An Analytical Comparison of Approaches to Personalizing PageRank", Preprint, June, 2003 S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Extrapolation Methods for Accelerating PageRank Computations, WWW Conf., 2003 S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Exploiting the Block Structure of the Web for Computing PageRank, Stanford Tech.Rep, 2003 S.D. Kamvar, T. H. Haveliwala, and G. H. Golub, "Adaptive Methods for the Computation of PageRank", Linear Algebra and its Applications, Special Issue on the Numerical Solution of Markov Chains, Nov., 2003. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM Vol.46 No.5, 1999 A. Ntoulas, J. Cho, C. Olston "What's New on the Web? The Evolution of the Web from a Search Engine Perspective." WorldWide Web Conference, May 2004. G., Zoltan; GarciaMolina, Hector. Web Spam Taxonomy. Technical Report, Stanford University, 2004
Broder’s Altavista Patented May 2003  A, Attractor Matrix: sites externally endorsed
 N, Non Attractor Matrix: sites deemed to be avoided
 Use a linear combination of A, N and other matrices

 Suggest to also use non principal eigenvectors
Accelerating PageRank The Markov Chain associated with P is lumpable  Combine D nodes into a block. P1 is the transition matrix
 Compute the stationary distribution of P1
 Combine ND nodes into a block. P2 is the transition matrix
 Compute the stationary distribution of P2
 Concatenate the results
 D are the dangling nodes, ND the non dangling nodes
Finally…the perfect search engine?  Sergei Brin: “It would be the mind of God. Larry says it would know exactly what you want and give you back exactly what you need.”
 Chackabarti: “The web grew exponentially from almost zero to 800 million pages between 1991 and 1999. In comparison, it took 3.5 million years for the human brain to grow linearly from 400 to 1400 cubic centimeters. How do we work with the web without getting overwhelmed? We look for relevance and quality. Can we design programs to recognize these properties?”
Dostları ilə paylaş: 

