The information that the learner is given
takes the form of a set of instances.
In the illustrations in Chapter 1, each instance was an individual, independent
example of the concept to be learned. Of course there are many things you might
like to learn for which the raw data cannot be expressed as individual, inde-
pendent instances. Perhaps background knowledge should be taken into
account as part of the input. Perhaps the raw data is an agglomerated mass that
cannot be fragmented into individual instances. Perhaps it is a single sequence,
say, a time sequence, that cannot meaningfully be cut into pieces. However, this
book is about simple, practical methods of data mining, and we focus on
situations in which the information can be supplied in the form of individual
examples.
Each instance is characterized by the values of attributes that measure dif-
ferent aspects of the instance. There are many different types of attributes,
although typical data mining methods deal only with numeric and nominal, or
categorical, ones.
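Concretely, an instance can be thought of as a fixed set of attribute values. A minimal sketch in Python, with hypothetical weather-style values showing nominal and numeric attributes side by side:

```python
# One instance, represented as attribute-value pairs (values are hypothetical).
instance = {
    "outlook": "sunny",   # nominal (categorical) attribute
    "temperature": 85,    # numeric attribute
    "humidity": 85,       # numeric attribute
    "windy": False,       # nominal attribute with two values
    "play": "no",         # the class attribute, in classification problems
}
```

Every instance in a dataset supplies a value for the same fixed set of attributes, which is what lets learning schemes compare instances with one another.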
Finally, we examine the question of preparing input for data mining and
introduce a simple format—the one that is used by the Java code that accom-
panies this book—for representing the input information as a text file.
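As a preview, a flat-file format of this general kind declares the attributes first and then lists one comma-separated line per instance. The sketch below is purely illustrative, with hypothetical attribute names and values; the actual format is introduced later in the chapter.

```
% attribute declarations, then the instances themselves
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,64,yes
```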
2.1 What’s a concept?
Four basically different styles of learning appear in data mining applications. In
classification learning, the learning scheme is presented with a set of classified
examples from which it is expected to learn a way of classifying unseen exam-
ples. In association learning, any association among features is sought, not just
ones that predict a particular class value. In clustering, groups of examples that
belong together are sought. In numeric prediction, the outcome to be predicted
is not a discrete class but a numeric quantity. Regardless of the type of learning
involved, we call the thing to be learned the concept and the output produced
by a learning scheme the concept description.
Most of the examples in Chapter 1 are classification problems. The weather
data (Tables 1.2 and 1.3) presents a set of days together with a decision for each
as to whether to play the game or not. The problem is to learn how to classify
new days as play or don’t play. Given the contact lens data (Table 1.1), the
problem is to learn how to decide on a lens recommendation for a new patient—
or more precisely, since every possible combination of attributes is present in
the data, the problem is to learn a way of summarizing the given data. For the
irises (Table 1.4), the problem is to learn how to decide whether a new iris flower
is setosa, versicolor, or virginica, given its sepal length and width and petal length
and width. For the labor negotiations data (Table 1.6), the problem is to decide
whether a new contract is acceptable or not, on the basis of its duration; wage
increase in the first, second, and third years; cost of living adjustment; and so
forth.
Classification learning is sometimes called
supervised because, in a sense, the
method operates under supervision by being provided with the actual outcome
for each of the training examples—the play or don’t play judgment, the lens rec-
ommendation, the type of iris, the acceptability of the labor contract. This
outcome is called the class of the example. The success of classification learning
can be judged by trying out the concept description that is learned on an inde-
pendent set of test data for which the true classifications are known but not
made available to the machine. The success rate on test data gives an objective
measure of how well the concept has been learned. In many practical data
mining applications, success is measured more subjectively in terms of how
acceptable the learned description—such as the rules or the decision tree—is
to a human user.
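The train-then-test procedure described above can be sketched in a few lines. The data below is hypothetical, and a deliberately trivial majority-class predictor stands in for a real learning scheme, because the point is only the evaluation protocol: the test classes are known to us but never shown to the learner.

```python
from collections import Counter

# Training examples: (attribute-values, class) pairs; values are hypothetical.
train = [
    ({"outlook": "sunny", "windy": False}, "no"),
    ({"outlook": "overcast", "windy": True}, "yes"),
    ({"outlook": "rainy", "windy": False}, "yes"),
    ({"outlook": "overcast", "windy": False}, "yes"),
]

# A deliberately trivial "learner": always predict the majority training class.
majority = Counter(cls for _, cls in train).most_common(1)[0][0]

# Independent test set: the true classes are known to us, hidden from the learner.
test = [
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast", "windy": False}, "yes"),
]

correct = sum(1 for _, cls in test if majority == cls)
success_rate = correct / len(test)  # objective measure of learning success
```

Any real classification scheme slots into the same protocol: fit on `train`, predict on the attributes of `test`, and compare predictions with the withheld classes.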
Most of the examples in Chapter 1 can be used equally well for association
learning, in which there is no specified class. Here, the problem is to discover
any structure in the data that is “interesting.” Some association rules for the
weather data were given in Section 1.2. Association rules differ from classifica-
tion rules in two ways: they can “predict” any attribute, not just the class, and
they can predict more than one attribute’s value at a time. Because of this there
are far more association rules than classification rules, and the challenge is to
avoid being swamped by them. For this reason, association rules are often
limited to those that apply to a certain minimum number of examples—say
80% of the dataset—and have greater than a certain minimum accuracy level—
say 95% accurate. Even then, there are usually lots of them, and they have to be
examined manually to determine whether they are meaningful or not. Associ-
ation rules usually involve only nonnumeric attributes: thus you wouldn’t nor-
mally look for association rules in the iris dataset.
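The coverage and accuracy filtering described above can be sketched for the simplest case: rules with a single attribute-value pair on each side. The dataset and thresholds below are hypothetical, and real association-rule miners search far larger rule spaces, but the filtering test is the same.

```python
from itertools import permutations

# Hypothetical nominal dataset: each instance is a dict of attribute values.
data = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "overcast", "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "humidity": "normal", "play": "yes"},
]

MIN_COVERAGE = 2    # rule must apply to at least this many instances
MIN_ACCURACY = 1.0  # and be correct every time it applies

rules = []
attrs = ["outlook", "humidity", "play"]
for lhs, rhs in permutations(attrs, 2):          # any attribute may be predicted
    for value in {d[lhs] for d in data}:
        matching = [d for d in data if d[lhs] == value]
        if len(matching) < MIN_COVERAGE:
            continue
        # Most common rhs value among the matching instances.
        best = max({d[rhs] for d in matching},
                   key=lambda v: sum(d[rhs] == v for d in matching))
        hits = sum(d[rhs] == best for d in matching)
        if hits / len(matching) >= MIN_ACCURACY:
            rules.append((f"{lhs}={value}", f"{rhs}={best}", hits))
```

Even on this five-instance toy dataset the loop produces eight rules that pass both thresholds, which illustrates why rule counts balloon so quickly without such filtering.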
When there is no specified class, clustering is used to group items that seem
to fall naturally together. Imagine a version of the iris data in which the type of
iris is omitted, such as in Table 2.1. Then it is likely that the 150 instances fall
into natural clusters corresponding to the three iris types. The challenge is to
find these clusters and assign the instances to them—and to be able to assign
new instances to the clusters as well. It may be that one or more of the iris types
splits naturally into subtypes, in which case the data will exhibit more than three
natural clusters. The success of clustering is often measured subjectively in terms
of how useful the result appears to be to a human user. It may be followed by a
second step of classification learning in which rules are learned that give an
intelligible description of how new instances should be placed into the clusters.
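Grouping instances that fall naturally together can be illustrated with a minimal one-dimensional k-means sketch. The values below are made up (loosely petal-length-like, in centimeters); the algorithm alternates between assigning each point to its nearest centre and moving each centre to the mean of its assigned points.

```python
# Hypothetical one-dimensional measurements with three natural groupings.
lengths = [1.4, 1.3, 1.5, 4.5, 4.7, 4.4, 5.9, 6.1, 5.8]

def kmeans_1d(points, centers, iterations=10):
    """Alternate nearest-centre assignment and centroid update (k-means)."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current centre.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(lengths, centers=[1.0, 4.0, 6.0])
```

Assigning a new instance to a cluster then amounts to finding its nearest centre, and a follow-up classification step could learn explicit rules describing each cluster.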
Numeric prediction is a variant of classification learning in which the
outcome is a numeric value rather than a category. The CPU performance
problem is one example. Another, shown in Table 2.2, is a version of the weather