Why is Page Importance Rating important?



Yüklə 1,23 Mb.
tarix08.08.2018
ölçüsü1,23 Mb.
#61562



Why is Page Importance Rating important?

  • Why is Page Importance Rating important?

    • New challenges for information retrieval on the World Wide Web.
      • Huge number of web pages: 150 million by1998
      • 1000 billion by 2008
      • Diversity of web pages: different topics, different quality, etc.
  • What is PageRank?

      • A method for rating the importance of web pages objectively and mechanically using the link structure of the web.


PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin.

  • PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin.

  • It is first as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998.

  • Shortly after, Page and Brin founded Google.

  • 16 billion…



There are some news about that PageRank will be canceled by Google.

  • There are some news about that PageRank will be canceled by Google.

  • There are large numbers of Search Engine Optimization (SEO).

  • SEO use different trick methods to make a web page more important under the rating of PageRank.



150 million web pages  1.7 billion links

  • 150 million web pages  1.7 billion links



u: a web page

  • u: a web page

  • Bu: the set of u’s backlinks

  • Nv: the number of forward links of page v

  • c: the normalization factor to make ||R||L1 = 1 (||R||L1= |R1 + … + Rn|)

















The Random Surfer Model

  • The Random Surfer Model

    • The simplified model: the standing probability distribution of a random walk on the graph of the web. simply keeps clicking successive links at random
  • The Modified Model

    • The modified model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page based on the distribution of E






Links that point to any page with no outgoing links

  • Links that point to any page with no outgoing links

  • Most are pages that have not been downloaded yet

  • Affect the model since it is not clear where their weight should be distributed

  • Do not affect the ranking of any other page directly

  • Can be simply removed before pagerank calculation and added back afterwards



Convert each URL into a unique integer and store each hyperlink in a database using the integer IDs to identify pages

  • Convert each URL into a unique integer and store each hyperlink in a database using the integer IDs to identify pages

  • Sort the link structure by ID

  • Remove all the dangling links from the database

  • Make an initial assignment of ranks and start iteration

      • Choosing a good initial assignment can speed up the pagerank
  • Adding the dangling links back.



PR (322 Million Links): 52 iterations

  • PR (322 Million Links): 52 iterations

  • PR (161 Million Links): 45 iterations

  • Scaling factor is roughly linear in logn



The Web is an expander-like graph

  • The Web is an expander-like graph

    • Theory of random walk: a random walk on a graph is said to be rapidly-mixing if it quickly converges to a limiting distribution on the set of nodes in the graph. A random walk is rapidly-mixing on a graph if and only if the graph is an expander graph.
    • Expander graph: every subset of nodes S has a neighborhood (set of vertices accessible via outedges emanating from nodes in S) that is larger than some factor α times of |S|. A graph has a good expansion factor if and only if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue.


Two search engines:

  • Two search engines:

  • Title-based search engine

    • Searches only the “Titles”
    • Finds all the web pages whose titles contain all the query words
    • Sorts the results by PageRank
    • Very simple and cheap to implement
    • Title match ensures high precision, and PageRank ensures high quality
  • Full text search engine

    • Called Google
    • Examines all the words in every stored document and also performs PageRank (Rank Merging)
    • More precise but more complicated






Important component of PageRank calculation is E

  • Important component of PageRank calculation is E

    • A vector over the web pages (used as source of rank)
    • Powerful parameter to adjust the page ranks
  • E vector corresponds to the distribution of web pages that a random surfer periodically jumps to

  • Instead in Personalized PageRank E consists of a single web page



Some highly accessed web pages have low page rank possibly because

  • Some highly accessed web pages have low page rank possibly because

    • People do not want to link to these pages from their own web pages (the example in their paper is pornographic sites…)
    • Some important backlinks are omitted
    • use usage data as a start vector for PageRank.




PageRank is a global ranking of all web pages based on their locations in the web graph structure

  • PageRank is a global ranking of all web pages based on their locations in the web graph structure

  • PageRank uses information which is external to the web pages – backlinks

  • Backlinks from important pages are more significant than backlinks from average pages

  • The structure of the web graph is very useful for information retrieval tasks.



Yüklə 1,23 Mb.

Dostları ilə paylaş:




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2024
rəhbərliyinə müraciət

    Ana səhifə