eliminate the correct description from the space. The
result is that the set of
remaining descriptions becomes empty. This situation is very likely to happen
if the examples contain any noise at all, which inevitably they do except in
artificial situations.
Another way of looking at generalization as search is to imagine it not as a
process of enumerating descriptions and striking out those that don’t apply but
as a kind of hill-climbing in description space to find the description that best
matches the set of examples according to some prespecified matching criterion.
This is the way that most practical machine learning methods work. However,
except in the most trivial cases, it is impractical to search the whole space
exhaustively; most practical algorithms involve heuristic search and cannot
guarantee to find the optimal description.
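The hill-climbing view can be made concrete with a short sketch. This is not any particular learning algorithm, just an illustration of the idea: `neighbours` and `score` are assumed to be supplied by the learner, where `neighbours(d)` yields descriptions one small edit away from `d` and `score(d)` measures how well `d` matches the training examples.

```python
def hill_climb(initial, neighbours, score, max_steps=1000):
    """Greedy hill-climbing in a space of concept descriptions.

    Repeatedly moves to the best-scoring neighbouring description.
    Returns a local optimum -- not necessarily the best description
    in the whole space, which is exactly why such heuristic search
    cannot guarantee optimality.
    """
    current = initial
    for _ in range(max_steps):
        best = max(neighbours(current), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current  # no neighbour improves the match: local optimum
        current = best
    return current
```

As a toy illustration, take descriptions to be integers with a matching criterion that peaks at 7: starting from 0 with neighbours `x - 1` and `x + 1`, the search climbs step by step and stops at 7.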
Bias
Viewing generalization as a search in a space of possible concepts makes it clear
that the most important decisions in a machine learning system are as follows:
The concept description language
The order in which the space is searched
The way that overfitting to the particular training data is avoided
These three properties are generally referred to as the
bias of the search and are
called language bias, search bias, and overfitting-avoidance bias. You bias the
learning scheme by choosing a language in which to express concepts, by search-
ing in a particular way for an acceptable description, and by deciding when the
concept has become so complex that it needs to be simplified.
Language bias
The most important question for language bias is whether the concept descrip-
tion language is universal or whether it imposes constraints on what concepts can
be learned. If you consider the set of all possible examples, a concept is really just
a division of it into subsets. In the weather example, if you were to enumerate all
possible weather conditions, the play concept is a subset of possible weather con-
ditions. A “universal” language is one that is capable of expressing every possible
subset of examples. In practice, the set of possible examples is generally huge, and
in this respect our perspective is a theoretical, not a practical, one.
If the concept description language permits statements involving logical or,
that is, disjunctions, then any subset can be represented. If the description lan-
guage is rule based, disjunction can be achieved by using separate rules. For
example, one possible concept representation is just to enumerate the examples:
If outlook = overcast and temperature = hot and humidity = high
   and windy = false then play = yes
If outlook = rainy and temperature = mild and humidity = high
   and windy = false then play = yes
If outlook = rainy and temperature = cool and humidity = normal
   and windy = false then play = yes
If outlook = overcast and temperature = cool and humidity = normal
   and windy = true then play = yes
. . .
If none of the above then play = no
This is not a particularly enlightening concept description: it simply records the
positive examples that have been observed and assumes that all the rest are neg-
ative. Each positive example is given its own rule, and the concept is the dis-
junction of the rules. Alternatively, you could imagine having individual rules
for each of the negative examples, too—an equally uninteresting concept. In
either case the concept description does not perform any generalization; it
simply records the original data.
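This rote-learning representation, one rule per positive example with a default of "no", amounts to the following trivial classifier (a sketch; the attribute dictionaries mirror the weather data):

```python
def learn_rote(examples):
    """Memorize the positive examples; everything else is negative.

    `examples` is a list of (attributes, label) pairs, where attributes
    is a dict such as {'outlook': 'overcast', 'windy': 'false'}.
    This performs no generalization at all: it simply records the
    training data as a disjunction of per-example rules.
    """
    positives = [frozenset(attrs.items())
                 for attrs, label in examples if label == 'yes']

    def classify(attrs):
        # "If none of the above then play = no"
        return 'yes' if frozenset(attrs.items()) in positives else 'no'
    return classify
```

Any example not seen verbatim during training is classified "no", which makes the lack of generalization plain.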
On the other hand, if disjunction is not allowed, some possible concepts (that
is, some sets of examples) may not be representable at all. In that case, a
machine learning scheme may simply be unable to achieve good performance.
Another kind of language bias is that obtained from knowledge of the par-
ticular domain being used. For example, it may be that some combinations of
attribute values can never happen. This would be the case if one attribute
implied another. We saw an example of this when considering the rules for the
soybean problem described on page 20. Then, it would be pointless to even con-
sider concepts that involved redundant or impossible combinations of attribute
values. Domain knowledge can be used to cut down the search space. Knowl-
edge is power: a little goes a long way, and even a small hint can reduce the
search space dramatically.
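One way to exploit such knowledge is to prune impossible attribute-value combinations before the search even begins. The sketch below assumes a hypothetical constraint function `impossible` encoding the domain knowledge (the example constraint used afterwards is invented for illustration, not taken from the weather or soybean data):

```python
from itertools import product

def candidate_conditions(attributes, impossible):
    """Enumerate attribute-value combinations, pruning those that
    domain knowledge rules out.

    `attributes` maps each attribute name to its possible values;
    `impossible(combo)` returns True for combinations that can never
    occur (for instance, because one attribute value implies another).
    Filtering here shrinks the space the learner must search.
    """
    names = sorted(attributes)
    for values in product(*(attributes[n] for n in names)):
        combo = dict(zip(names, values))
        if not impossible(combo):
            yield combo
```

With two binary attributes and a (made-up) constraint that overcast outlooks are never windy, one of the four combinations is pruned and only three candidates remain.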
Search bias
In realistic data mining problems, there are many alternative concept descrip-
tions that fit the data, and the problem is to find the “best” one according to
some criterion—usually simplicity. We use the term fit in a statistical sense; we
seek the best description that fits the data reasonably well. Moreover, it is often
computationally infeasible to search the whole space and guarantee that the
description found really is the best. Consequently, the search procedure is
heuristic, and no guarantees can be made about the optimality of the final result.
This leaves plenty of room for bias: different search heuristics bias the search in
different ways.
For example, a learning algorithm might adopt a “greedy” search for rules by
trying to find the best rule at each stage and adding it in to the rule set. However,
it may be that the best pair of rules is not just the two rules that are individu-
ally found to be the best. Or when building a decision tree, a commitment to