CONTENTS

11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
11.6 References for Chapter 11

12 Large-Scale Machine Learning
12.1 The Machine-Learning Model
12.1.1 Training Sets
12.1.2 Some Illustrative Examples
12.1.3 Approaches to Machine Learning
12.1.4 Machine-Learning Architecture
12.1.5 Exercises for Section 12.1
12.2 Perceptrons
12.2.1 Training a Perceptron with Zero Threshold
12.2.2 Convergence of Perceptrons
12.2.3 The Winnow Algorithm
12.2.4 Allowing the Threshold to Vary
12.2.5 Multiclass Perceptrons
12.2.6 Transforming the Training Set
12.2.7 Problems With Perceptrons
12.2.8 Parallel Implementation of Perceptrons
12.2.9 Exercises for Section 12.2
12.3 Support-Vector Machines
12.3.1 The Mechanics of an SVM
12.3.2 Normalizing the Hyperplane
12.3.3 Finding Optimal Approximate Separators
12.3.4 SVM Solutions by Gradient Descent
12.3.5 Stochastic Gradient Descent
12.3.6 Parallel Implementation of SVM
12.3.7 Exercises for Section 12.3
12.4 Learning from Nearest Neighbors
12.4.1 The Framework for Nearest-Neighbor Calculations
12.4.2 Learning with One Nearest Neighbor
12.4.3 Learning One-Dimensional Functions
12.4.4 Kernel Regression
12.4.5 Dealing with High-Dimensional Euclidean Data
12.4.6 Dealing with Non-Euclidean Distances
12.4.7 Exercises for Section 12.4
12.5 Comparison of Learning Methods
12.6 Summary of Chapter 12
12.7 References for Chapter 12
Chapter 1
Data Mining
In this introductory chapter we begin with the essence of data mining and a
discussion of how data mining is treated by the various disciplines that
contribute to this field. We cover “Bonferroni’s Principle,” which is really a
warning about overusing the ability to mine data. This chapter is also the
place where we summarize a few useful ideas that are not data mining but are
useful in understanding some important data-mining concepts. These include the
TF.IDF measure of word importance, behavior of hash functions and indexes, and
identities involving e, the base of natural logarithms. Finally, we give an
outline of the topics covered in the balance of the book.
1.1 What is Data Mining?
The most commonly accepted definition of “data mining” is the discovery of
“models” for data. A “model,” however, can be one of several things. We
mention below the most important directions in modeling.
1.1.1 Statistical Modeling
Statisticians were the first to use the term “data mining.” Originally, “data
mining” or “data dredging” was a derogatory term referring to attempts to
extract information that was not supported by the data. Section 1.2 illustrates
the sort of errors one can make by trying to extract what really isn’t in the data.
Today, “data mining” has taken on a positive meaning. Now, statisticians view
data mining as the construction of a statistical model, that is, an underlying
distribution from which the visible data is drawn.
Example 1.1: Suppose our data is a set of numbers. This data is much
simpler than data that would be data-mined, but it will serve as an example. A
statistician might decide that the data comes from a Gaussian distribution and
use a formula to compute the most likely parameters of this Gaussian. The mean
and standard deviation of this Gaussian distribution completely characterize the
distribution and would become the model of the data.
✷
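
To make the example concrete, here is a minimal sketch in Python (assuming
NumPy is available; the data values are made up purely for illustration) of
how the statistician's model would be computed:

    import numpy as np

    # Hypothetical sample, assumed to be drawn from some Gaussian.
    data = np.array([2.1, 1.9, 2.4, 2.0, 1.7, 2.3, 2.2, 1.8])

    # Maximum-likelihood estimates of the two parameters.
    mu = data.mean()           # estimate of the mean
    sigma = data.std(ddof=0)   # estimate of the standard deviation (divides by n)

    # The pair (mu, sigma) completely characterizes the Gaussian,
    # and so serves as the model of the data.
    print(f"model: Gaussian(mu={mu:.3f}, sigma={sigma:.3f})")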
1.1.2 Machine Learning
There are some who regard data mining as synonymous with machine learning.
There is no question that some data mining appropriately uses algorithms from
machine learning. Machine-learning practitioners use the data as a training
set, to train an algorithm of one of the many types used in machine learning,
such as Bayes nets, support-vector machines, decision trees, hidden Markov
models, and many others.
There are situations where using data in this way makes sense. The typical
case where machine learning is a good approach is when we have little idea of
what we are looking for in the data. For example, it is rather unclear what
it is about movies that makes certain movie-goers like or dislike them. Thus,
in answering the “Netflix challenge” to devise an algorithm that predicts the
ratings of movies by users, based on a sample of their responses, machine-
learning algorithms have proved quite successful. We shall discuss a simple
form of this type of algorithm in Section 9.4.
On the other hand, machine learning has not proved successful in situations
where we can describe the goals of the mining more directly. An interesting
case in point is the attempt by WhizBang! Labs¹ to use machine learning to
locate people’s resumes on the Web. It was not able to do better than
algorithms designed by hand to look for some of the obvious words and phrases
that appear in the typical resume. Since everyone who has looked at or written
a resume has a pretty good idea of what resumes contain, there was no mystery
about what makes a Web page a resume. Thus, there was no advantage to machine
learning over the direct design of an algorithm to discover resumes.
1.1.3 Computational Approaches to Modeling
More recently, computer scientists have looked at data mining as an algorithmic
problem. In this case, the model of the data is simply the answer to a complex
query about it. For instance, given the set of numbers of Example 1.1, we might
compute their average and standard deviation. Note that these values might
not be the parameters of the Gaussian that best fits the data, although they
will almost certainly be very close if the size of the data is large.
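
As a small illustration of that last point, here is a sketch in Python
(assuming NumPy; the data is simulated for the purpose of the demonstration).
The conventional sample standard deviation divides by n − 1, while the
standard deviation of the best-fitting Gaussian divides by n, so the two
answers differ slightly but converge as the data grows:

    import numpy as np

    rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

    for n in (10, 100_000):
        data = rng.normal(loc=0.0, scale=1.0, size=n)  # simulated data set
        s_sample = data.std(ddof=1)  # conventional sample standard deviation (n - 1)
        s_mle = data.std(ddof=0)     # parameter of the best-fitting Gaussian (n)
        print(n, s_sample, s_mle)    # nearly identical once n is large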
There are many different approaches to modeling data. We have already
mentioned the possibility of constructing a statistical process whereby the data
could have been generated. Most other approaches to modeling can be described
as either
1. Summarizing the data succinctly and approximately, or
¹This startup attempted to use machine learning to mine large-scale data, and
hired many of the top machine-learning people to do so. Unfortunately, it was
not able to survive.