## Why is Page Importance Rating important? ## Why is Page Importance Rating important? - New challenges for information retrieval on the World Wide Web.
- Huge number of web pages: 150 million by1998
- 1000 billion by 2008
- Diversity of web pages: different topics, different quality, etc.
## What is PageRank? - A method for rating the importance of web pages objectively and mechanically using the link structure of the web.
## PageRank was developed by Larry Page (hence the name *Page*-Rank) and Sergey Brin. ## PageRank was developed by Larry Page (hence the name *Page*-Rank) and Sergey Brin. ## It is first as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998. ## Shortly after, Page and Brin founded Google. ## 16 billion…
## There are some news about that PageRank will be canceled by Google. ## There are some news about that PageRank will be canceled by Google. ## There are large numbers of Search Engine Optimization (SEO). ## SEO use different trick methods to make a web page more important under the rating of PageRank.
## 150 million web pages 1.7 billion links ## 150 million web pages 1.7 billion links
## u: a web page ## Bu: the set of u’s backlinks ## Nv: the number of forward links of page v ## c: the normalization factor to make ||R||L1 = 1 (||R||L1= |R1 + … + Rn|)
## The Random Surfer Model ## The Random Surfer Model - The simplified model: the standing probability distribution of a random walk on the graph of the web. simply keeps clicking successive links at random
## The Modified Model - The modified model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page based on the distribution of E
## Links that point to any page with no outgoing links ## Links that point to any page with no outgoing links ## Affect the model since it is not clear where their weight should be distributed ## Do not affect the ranking of any other page directly ## Can be simply removed before pagerank calculation and added back afterwards
## Convert each URL into a unique integer and store each hyperlink in a database using the integer IDs to identify pages ## Convert each URL into a unique integer and store each hyperlink in a database using the integer IDs to identify pages ## Remove all the dangling links from the database ## Make an initial assignment of ranks and start iteration - Choosing a good initial assignment can speed up the pagerank
## Adding the dangling links back.
## PR (322 Million Links): 52 iterations ## PR (322 Million Links): 52 iterations ## PR (161 Million Links): 45 iterations ## Scaling factor is roughly linear in *logn*
## The Web is an expander-like graph ## The Web is an expander-like graph - Theory of random walk: a random walk on a graph is said to be rapidly-mixing if it quickly converges to a limiting distribution on the set of nodes in the graph. A random walk is rapidly-mixing on a graph if and only if the graph is an expander graph.
- Expander graph: every subset of nodes S has a neighborhood (set of vertices accessible via outedges emanating from nodes in S) that is larger than some factor α times of |S|. A graph has a good expansion factor if and only if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue.
## Two search engines: ## Two search engines: ## Title-based search engine - Searches only the “Titles”
- Finds all the web pages whose titles contain all the query words
- Sorts the results by PageRank
- Very simple and cheap to implement
- Title match ensures high precision, and PageRank ensures high quality
## Full text search engine - Called Google
- Examines all the words in every stored document and also performs PageRank (Rank Merging)
- More precise but more complicated
## Important component of PageRank calculation is *E* ## Important component of PageRank calculation is *E* - A vector over the web pages (used as source of rank)
- Powerful parameter to adjust the page ranks
## E vector corresponds to the distribution of web pages that a random surfer periodically jumps to
## Some highly accessed web pages have low page rank possibly because ## Some highly accessed web pages have low page rank possibly because - People do not want to link to these pages from their own web pages (the example in their paper is pornographic sites…)
- Some important backlinks are omitted
- use usage data as a start vector for PageRank.
## PageRank is a global ranking of all web pages based on their locations in the web graph structure ## PageRank is a global ranking of all web pages based on their locations in the web graph structure ## PageRank uses information which is external to the web pages – backlinks ## Backlinks from important pages are more significant than backlinks from average pages ## The structure of the web graph is very useful for information retrieval tasks.
**Dostları ilə paylaş:** |