outliers, and so on. Most learning algorithms use
statistical tests when constructing rules or trees and for correcting models that are "overfitted" in that
they depend too strongly on the details of the particular examples used to
produce them (we have already seen an example of this in the two decision trees
of Figure 1.3 for the labor negotiations problem). Statistical tests are used to
validate machine learning models and to evaluate machine learning algorithms.
In our study of practical techniques for data mining, we will learn a great deal
about statistics.
1.5 Generalization as search
One way of visualizing the problem of learning—and one that distinguishes it
from statistical approaches—is to imagine a search through a space of possible
concept descriptions for one that fits the data. Although the idea of generalization as search is a powerful conceptual tool for thinking about machine learning, it is not essential for understanding the practical methods described in this
book. That is why this section is marked optional, as indicated by the gray bar
in the margin.
Suppose, for definiteness, that concepts—the result of learning—are
expressed as rules such as the ones given for the weather problem in Section 1.2
(although other concept description languages would do just as well). Suppose
that we list all possible sets of rules and then look for ones that satisfy a given
set of examples. A big job? Yes. An infinite job? At first glance it seems so because
there is no limit to the number of rules there might be. But actually the number
of possible rule sets is finite. Note first that each individual rule is no greater
than a fixed maximum size, with at most one term for each attribute: for the
weather data of Table 1.2 this involves four terms in all. Because the number of
possible rules is finite, the number of possible rule sets is finite, too, although
extremely large. However, we’d hardly be interested in sets that contained a very
large number of rules. In fact, we’d hardly be interested in sets that had more
rules than there are examples because it is difficult to imagine needing more
than one rule for each example. So if we were to restrict consideration to rule
sets smaller than that, the problem would be substantially reduced, although
still very large.
The threat of an infinite number of possible concept descriptions seems more
serious for the second version of the weather problem in Table 1.3 because these
rules contain numbers. If they are real numbers, you can’t enumerate them, even
in principle. However, on reflection the problem again disappears because the
numbers really just represent breakpoints in the numeric values that appear in
the examples. For instance, consider the temperature attribute in Table 1.3. It
involves the numbers 64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, and 85: 12 different numbers. There are 13 possible places in which we might want to put a breakpoint for a rule involving temperature. The problem isn't infinite after all.
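The breakpoint count can be read straight off the data: n distinct values leave n + 1 gaps where a threshold could go (below the smallest, between each adjacent pair, and above the largest). A minimal sketch, using the temperature values quoted above:

```python
# Candidate breakpoints for a numeric attribute: one position below,
# between, and above the sorted distinct values seen in the data.
temperature = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]

def candidate_breakpoints(values):
    distinct = sorted(set(values))
    # n distinct values leave n + 1 gaps where a threshold could go
    return len(distinct) + 1

print(candidate_breakpoints(temperature))  # 13
```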
So the process of generalization can be regarded as a search through an enormous, but finite, search space. In principle, the problem can be solved by enumerating descriptions and striking out those that do not fit the examples
presented. A positive example eliminates all descriptions that it does not match,
and a negative one eliminates those it does match. With each example the set
of remaining descriptions shrinks (or stays the same). If only one is left, it is the
target description—the target concept.
If several descriptions are left, they may still be used to classify unknown
objects. An unknown object that matches all remaining descriptions should be
classified as matching the target; if it fails to match any description it should be
classified as being outside the target concept. Only when it matches some
descriptions but not others is there ambiguity. In this case if the classification
of the unknown object were revealed, it would cause the set of remaining
descriptions to shrink because rule sets that classified the object the wrong way
would be rejected.
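The elimination and classification rules just described can be sketched directly. Here a "description" is simply a predicate over instances; the single-attribute threshold tests are a toy description space invented for this illustration, not taken from the weather data:

```python
# Elimination over an enumerated space of candidate descriptions.
# A positive example removes every description it does not match;
# a negative example removes every description it does match.

def eliminate(descriptions, example, is_positive):
    if is_positive:
        return [d for d in descriptions if d(example)]
    return [d for d in descriptions if not d(example)]

def classify(descriptions, unknown):
    votes = [d(unknown) for d in descriptions]
    if all(votes):
        return "inside the target concept"
    if not any(votes):
        return "outside the target concept"
    return "ambiguous"  # matches some remaining descriptions but not others

# Toy description space: threshold tests on one numeric attribute.
space = [lambda x, t=t: x >= t for t in (60, 70, 80)]
space = eliminate(space, 75, is_positive=True)   # drops the t = 80 test
space = eliminate(space, 65, is_positive=False)  # drops the t = 60 test
print(classify(space, 72))  # inside the target concept
```

Note how each example shrinks (or leaves unchanged) the set of remaining descriptions, exactly as in the text; here the two examples suffice to narrow the toy space to a single description.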
Enumerating the concept space
Regarding it as search is a good way of looking at the learning process. However,
the search space, although finite, is extremely big, and it is generally quite
impractical to enumerate all possible descriptions and then see which ones fit.
In the weather problem there are 4 × 4 × 3 × 3 × 2 = 288 possibilities for each rule. There are four possibilities for the outlook attribute: sunny, overcast, or rainy, or it may not participate in the rule at all. Similarly, there are four for temperature, three for humidity and windy, and two for the class. If we restrict the rule set to contain no more than 14 rules (because there are 14 examples in the training set), there are around 2.7 × 10^34 possible different rule sets. That's a lot to
enumerate, especially for such a patently trivial problem.
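The arithmetic can be checked in a few lines. The per-attribute value counts below are those of the weather data; the counting convention for rule sets (ordered lists of up to 14 rules, each drawn independently from the 288 possibilities) is an assumption, chosen because it reproduces the figure quoted above:

```python
# Possibilities per rule: each attribute contributes (values + 1) choices
# (one of its values, or absent from the rule); the class contributes 2.
attribute_values = {"outlook": 3, "temperature": 3, "humidity": 2, "windy": 2}
rules = 2  # two possible class values
for n in attribute_values.values():
    rules *= n + 1
print(rules)  # 288

# Rule sets of up to 14 rules, each chosen from the 288 possible rules.
rule_sets = sum(rules ** k for k in range(1, 15))
print(f"{rule_sets:.1e}")  # 2.7e+34
```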
Although there are ways of making the enumeration procedure more feasible, a serious problem remains: in practice, it is rare for the process to converge
on a unique acceptable description. Either many descriptions are still in the
running after the examples are processed, or all the descriptions are eliminated.
The first case arises when the examples are not sufficiently comprehensive to
eliminate all possible descriptions except for the “correct” one. In practice,
people often want a single “best” description, and it is necessary to apply some
other criteria to select the best one from the set of remaining descriptions. The
second problem arises either because the description language is not expressive
enough to capture the actual concept or because of noise in the examples. If an
example comes in with the “wrong” classification because of an error in some
of the attribute values or in the class that is assigned to it, this will likely