Jure Leskovec Stanford Univ

Yüklə 4,9 Mb.

Pdf görüntüsü

səhifə	8/196
tarix	08.10.2017
ölçüsü	4,9 Mb.
	#3820

1 ... 4 5 6 7 8 9 10 11 ... 196

1.1. WHAT IS DATA MINING?

2. Extracting the most prominent features of the data and ignoring the rest.

We shall explore these two approaches in the following sections.

1.1.4

Summarization

One of the most interesting forms of summarization is the PageRank idea, which

made Google successful and which we shall cover in Chapter 5. In this form

of Web mining, the entire complex structure of the Web is summarized by a

single number for each page. This number, the “PageRank” of the page, is

(oversimplifying somewhat) the probability that a random walker on the graph

would be at that page at any given time. The remarkable property this ranking

has is that it reﬂects very well the “importance” of the page – the degree to

which typical searchers would like that page returned as an answer to their

search query.

Another important form of summary – clustering – will be covered in Chap-

ter 7. Here, data is viewed as points in a multidimensional space. Points

that are “close” in this space are assigned to the same cluster. The clusters

themselves are summarized, perhaps by giving the centroid of the cluster and

the average distance from the centroid of points in the cluster. These cluster

summaries become the summary of the entire data set.

Example 1.2 :

A famous instance of clustering to solve a problem took place

long ago in London, and it was done entirely without computers.

The physician

John Snow, dealing with a Cholera outbreak plotted the cases on a map of the

city. A small illustration suggesting the process is shown in Fig. 1.1.

Figure 1.1: Plotting cholera cases on a map of London

See http://en.wikipedia.org/wiki/1854 Broad Street cholera outbreak.

CHAPTER 1. DATA MINING

The cases clustered around some of the intersections of roads. These inter-

sections were the locations of wells that had become contaminated; people who

lived nearest these wells got sick, while people who lived nearer to wells that

had not been contaminated did not get sick. Without the ability to cluster the

data, the cause of Cholera would not have been discovered.

✷

1.1.5

Feature Extraction

The typical feature-based model looks for the most extreme examples of a phe-

nomenon and represents the data by these examples. If you are familiar with

Bayes nets, a branch of machine learning and a topic we do not cover in this

book, you know how a complex relationship between objects is represented by

ﬁnding the strongest statistical dependencies among these objects and using

only those in representing all statistical connections. Some of the important

kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “bas-

kets” of small sets of items, as in the market-basket problem that we shall

discuss in Chapter 6. We look for small sets of items that appear together

in many baskets, and these “frequent itemsets” are the characterization of

the data that we seek. The original application of this sort of mining was

true market baskets: the sets of items, such as hamburger and ketchup,

that people tend to buy together when checking out at the cash register

of a store or super market.

2. Similar Items. Often, your data looks like a collection of sets, and the

objective is to ﬁnd pairs of sets that have a relatively large fraction of

their elements in common. An example is treating customers at an on-

line store like Amazon as the set of items they have bought. In order

for Amazon to recommend something else they might like, Amazon can

look for “similar” customers and recommend something many of these

customers have bought. This process is called “collaborative ﬁltering.”

If customers were single-minded, that is, they bought only one kind of

thing, then clustering customers might work. However, since customers

tend to have interests in many diﬀerent things, it is more useful to ﬁnd,

for each customer, a small number of other customers who are similar

in their tastes, and represent the data by these connections. We discuss

similarity in Chapter 3.

1.2

Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events

hidden within massive amounts of data. This section is a discussion of the

problem, including “Bonferroni’s Principle,” a warning against overzealous use

of data mining.

1.2. STATISTICAL LIMITS ON DATA MINING

1.2.1

Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could

ﬁnd, including credit-card receipts, hotel records, travel data, and many other

kinds of information in order to track terrorist activity. This idea naturally

caused great concern among privacy advocates, and the project, called TIA,

or Total Information Awareness, was eventually killed by Congress, although

it is unclear whether the project in fact exists under another name. It is not

the purpose of this book to discuss the diﬃcult issue of the privacy-security

tradeoﬀ. However, the prospect of TIA or a system like it does raise technical

questions about its feasibility and the realism of its assumptions.

The concern raised by many is that if you look at so much data, and you try

to ﬁnd within it activities that look like terrorist behavior, are you not going to

ﬁnd many innocent activities – or even illicit activities that are not terrorism –

that will result in visits from the police and maybe worse than just a visit? The

answer is that it all depends on how narrowly you deﬁne the activities that you

look for. Statisticians have seen this problem in many guises and have a theory,

which we introduce in the next section.

1.2.2

Bonferroni’s Principle

Suppose you have a certain amount of data, and you look for events of a cer-

tain type within that data. You can expect events of this type to occur, even if

the data is completely random, and the number of occurrences of these events

will grow as the size of the data grows. These occurrences are “bogus,” in the

sense that they have no cause other than that random data will always have

some number of unusual features that look signiﬁcant but aren’t. A theorem

of statistics, known as the Bonferroni correction gives a statistically sound way

to avoid most of these bogus positive responses to a search through the data.

Without going into the statistical details, we oﬀer an informal version, Bon-

ferroni’s principle, that helps us avoid treating random occurrences as if they

were real. Calculate the expected number of occurrences of the events you are

looking for, on the assumption that data is random. If this number is signiﬁ-

cantly larger than the number of real instances you hope to ﬁnd, then you must

expect almost anything you ﬁnd to be bogus, i.e., a statistical artifact rather

than evidence of what you are looking for. This observation is the informal

statement of Bonferroni’s principle.

In a situation like searching for terrorists, where we expect that there are

few terrorists operating at any one time, Bonferroni’s principle says that we

may only detect terrorists by looking for events that are so rare that they are

unlikely to occur in random data. We shall give an extended example in the

next section.

Yüklə 4,9 Mb.

Dostları ilə paylaş:

1 ... 4 5 6 7 8 9 10 11 ... 196