The information that the learner is given
takes the form of a set of instances.
In the illustrations in Chapter 1, each instance was an individual, independent
example of the concept to be learned. Of course there are many things you might
like to learn for which the raw data cannot be expressed as individual, inde-
pendent instances. Perhaps background knowledge should be taken into
account as part of the input. Perhaps the raw data is an agglomerated mass that
cannot be fragmented into individual instances. Perhaps it is a single sequence,
say, a time sequence, that cannot meaningfully be cut into pieces. However, this
book is about simple, practical methods of data mining, and we focus on
situations in which the information can be supplied in the form of individual
examples.
Each instance is characterized by the values of attributes that measure dif-
ferent aspects of the instance. There are many different types of attributes,
although typical data mining methods deal only with numeric and nominal, or
categorical, ones.
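Concretely, an instance can be thought of as a fixed set of attribute values. A minimal sketch in Python, with hypothetical weather-style values showing nominal and numeric attributes side by side:

```python
# One instance, represented as attribute-value pairs (values are hypothetical).
instance = {
    "outlook": "sunny",   # nominal (categorical) attribute
    "temperature": 85,    # numeric attribute
    "humidity": 85,       # numeric attribute
    "windy": False,       # nominal attribute with two values
    "play": "no",         # the class attribute, in classification problems
}
```

Every instance in a dataset supplies a value for the same fixed set of attributes, which is what lets learning schemes compare instances with one another.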
Finally, we examine the question of preparing input for data mining and
introduce a simple format—the one that is used by the Java code that accom-
panies this book—for representing the input information as a text file.
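As a preview, a flat-file format of this general kind declares the attributes first and then lists one comma-separated line per instance. The sketch below is purely illustrative, with hypothetical attribute names and values; the actual format is introduced later in the chapter.

```
% attribute declarations, then the instances themselves
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,64,yes
```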
2.1 What’s a concept?
Four basically different styles of learning appear in data mining applications. In
classification learning, the learning scheme is presented with a set of classified
examples from which it is expected to learn a way of classifying unseen exam-
ples. In association learning, any association among features is sought, not just
ones that predict a particular class value. In clustering, groups of examples that
belong together are sought. In numeric prediction, the outcome to be predicted
is not a discrete class but a numeric quantity. Regardless of the type of learning
involved, we call the thing to be learned the concept and the output produced
by a learning scheme the concept description.
Most of the examples in Chapter 1 are classification problems. The weather
data (Tables 1.2 and 1.3) presents a set of days together with a decision for each
as to whether to play the game or not. The problem is to learn how to classify
new days as play or don’t play. Given the contact lens data (Table 1.1), the
problem is to learn how to decide on a lens recommendation for a new patient—
or more precisely, since every possible combination of attributes is present in
the data, the problem is to learn a way of summarizing the given data. For the
irises (Table 1.4), the problem is to learn how to decide whether a new iris flower
is setosa, versicolor, or virginica, given its sepal length and width and petal length
and width. For the labor negotiations data (Table 1.6), the problem is to decide
whether a new contract is acceptable or not, on the basis of its duration; wage
increase in the first, second, and third years; cost of living adjustment; and so
forth.
Classification learning is sometimes called
supervised because, in a sense, the
method operates under supervision by being provided with the actual outcome
for each of the training examples—the play or don’t play judgment, the lens rec-
ommendation, the type of iris, the acceptability of the labor contract. This
outcome is called the class of the example. The success of classification learning
can be judged by trying out the concept description that is learned on an inde-
pendent set of test data for which the true classifications are known but not
made available to the machine. The success rate on test data gives an objective
measure of how well the concept has been learned. In many practical data
mining applications, success is measured more subjectively in terms of how
acceptable the learned description—such as the rules or the decision tree—is
to a human user.
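The train-then-test procedure described above can be sketched in a few lines. The data below is hypothetical, and a deliberately trivial majority-class predictor stands in for a real learning scheme, because the point is only the evaluation protocol: the test classes are known to us but never shown to the learner.

```python
from collections import Counter

# Training examples: (attribute-values, class) pairs; values are hypothetical.
train = [
    ({"outlook": "sunny", "windy": False}, "no"),
    ({"outlook": "overcast", "windy": True}, "yes"),
    ({"outlook": "rainy", "windy": False}, "yes"),
    ({"outlook": "overcast", "windy": False}, "yes"),
]

# A deliberately trivial "learner": always predict the majority training class.
majority = Counter(cls for _, cls in train).most_common(1)[0][0]

# Independent test set: the true classes are known to us, hidden from the learner.
test = [
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast", "windy": False}, "yes"),
]

correct = sum(1 for _, cls in test if majority == cls)
success_rate = correct / len(test)  # objective measure of learning success
```

Any real classification scheme slots into the same protocol: fit on `train`, predict on the attributes of `test`, and compare predictions with the withheld classes.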
Most of the examples in Chapter 1 can be used equally well for association
learning, in which there is no specified class. Here, the problem is to discover
any structure in the data that is “interesting.” Some association rules for the
weather data were given in Section 1.2. Association rules differ from classifica-
tion rules in two ways: they can “predict” any attribute, not just the class, and
they can predict more than one attribute’s value at a time. Because of this there
are far more association rules than classification rules, and the challenge is to
avoid being swamped by them. For this reason, association rules are often
limited to those that apply to a certain minimum number of examples—say
80% of the dataset—and have greater than a certain minimum accuracy level—
say 95% accurate. Even then, there are usually lots of them, and they have to be
examined manually to determine whether they are meaningful or not. Associ-
ation rules usually involve only nonnumeric attributes: thus you wouldn’t nor-
mally look for association rules in the iris dataset.
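The coverage and accuracy filtering described above can be sketched for the simplest case: rules with a single attribute-value pair on each side. The dataset and thresholds below are hypothetical, and real association-rule miners search far larger rule spaces, but the filtering test is the same.

```python
from itertools import permutations

# Hypothetical nominal dataset: each instance is a dict of attribute values.
data = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "overcast", "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "humidity": "normal", "play": "yes"},
]

MIN_COVERAGE = 2    # rule must apply to at least this many instances
MIN_ACCURACY = 1.0  # and be correct every time it applies

rules = []
attrs = ["outlook", "humidity", "play"]
for lhs, rhs in permutations(attrs, 2):          # any attribute may be predicted
    for value in {d[lhs] for d in data}:
        matching = [d for d in data if d[lhs] == value]
        if len(matching) < MIN_COVERAGE:
            continue
        # Most common rhs value among the matching instances.
        best = max({d[rhs] for d in matching},
                   key=lambda v: sum(d[rhs] == v for d in matching))
        hits = sum(d[rhs] == best for d in matching)
        if hits / len(matching) >= MIN_ACCURACY:
            rules.append((f"{lhs}={value}", f"{rhs}={best}", hits))
```

Even on this five-instance toy dataset the loop produces eight rules that pass both thresholds, which illustrates why rule counts balloon so quickly without such filtering.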
When there is no specified class, clustering is used to group items that seem
to fall naturally together. Imagine a version of the iris data in which the type of
iris is omitted, such as in Table 2.1. Then it is likely that the 150 instances fall
into natural clusters corresponding to the three iris types. The challenge is to
find these clusters and assign the instances to them—and to be able to assign
new instances to the clusters as well. It may be that one or more of the iris types
splits naturally into subtypes, in which case the data will exhibit more than three
natural clusters. The success of clustering is often measured subjectively in terms
of how useful the result appears to be to a human user. It may be followed by a
second step of classification learning in which rules are learned that give an
intelligible description of how new instances should be placed into the clusters.
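Grouping instances that fall naturally together can be illustrated with a minimal one-dimensional k-means sketch. The values below are made up (loosely petal-length-like, in centimeters); the algorithm alternates between assigning each point to its nearest centre and moving each centre to the mean of its assigned points.

```python
# Hypothetical one-dimensional measurements with three natural groupings.
lengths = [1.4, 1.3, 1.5, 4.5, 4.7, 4.4, 5.9, 6.1, 5.8]

def kmeans_1d(points, centers, iterations=10):
    """Alternate nearest-centre assignment and centroid update (k-means)."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current centre.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(lengths, centers=[1.0, 4.0, 6.0])
```

Assigning a new instance to a cluster then amounts to finding its nearest centre, and a follow-up classification step could learn explicit rules describing each cluster.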
Numeric prediction is a variant of classification learning in which the
outcome is a numeric value rather than a category. The CPU performance
problem is one example. Another, shown in Table 2.2, is a version of the weather