outliers, and so on. Most learning algorithms use
statistical tests when constructing rules or trees and for correcting models that are "overfitted" in that
they depend too strongly on the details of the particular examples used to
produce them (we have already seen an example of this in the two decision trees
of Figure 1.3 for the labor negotiations problem). Statistical tests are used to
validate machine learning models and to evaluate machine learning algorithms.
In our study of practical techniques for data mining, we will learn a great deal
about statistics.
1.5 Generalization as search
One way of visualizing the problem of learning—and one that distinguishes it
from statistical approaches—is to imagine a search through a space of possible
concept descriptions for one that fits the data. Although the idea of generalization as search is a powerful conceptual tool for thinking about machine learning, it is not essential for understanding the practical methods described in this
book. That is why this section is marked optional, as indicated by the gray bar
in the margin.
Suppose, for definiteness, that concepts—the result of learning—are
expressed as rules such as the ones given for the weather problem in Section 1.2
(although other concept description languages would do just as well). Suppose
that we list all possible sets of rules and then look for ones that satisfy a given
set of examples. A big job? Yes. An infinite job? At first glance it seems so because
there is no limit to the number of rules there might be. But actually the number
of possible rule sets is finite. Note first that each individual rule is no greater
than a fixed maximum size, with at most one term for each attribute: for the
weather data of Table 1.2 this involves four terms in all. Because the number of
possible rules is finite, the number of possible rule sets is finite, too, although
extremely large. However, we’d hardly be interested in sets that contained a very
large number of rules. In fact, we’d hardly be interested in sets that had more
rules than there are examples because it is difficult to imagine needing more
than one rule for each example. So if we were to restrict consideration to rule
sets smaller than that, the problem would be substantially reduced, although
still very large.
The threat of an infinite number of possible concept descriptions seems more
serious for the second version of the weather problem in Table 1.3 because these
rules contain numbers. If they are real numbers, you can’t enumerate them, even
in principle. However, on reflection the problem again disappears because the
numbers really just represent breakpoints in the numeric values that appear in
the examples. For instance, consider the temperature attribute in Table 1.3. It
involves the numbers 64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, and 85: 12 different numbers. There are 13 possible places in which we might want to put a breakpoint for a rule involving temperature. The problem isn't infinite after all.
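The breakpoint count can be read straight off the data: n distinct values leave n + 1 gaps where a threshold could go (below the smallest, between each adjacent pair, and above the largest). A minimal sketch, using the temperature values quoted above:

```python
# Candidate breakpoints for a numeric attribute: one position below,
# between, and above the sorted distinct values seen in the data.
temperature = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]

def candidate_breakpoints(values):
    distinct = sorted(set(values))
    # n distinct values leave n + 1 gaps where a threshold could go
    return len(distinct) + 1

print(candidate_breakpoints(temperature))  # 13
```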
So the process of generalization can be regarded as a search through an enormous, but finite, search space. In principle, the problem can be solved by enumerating descriptions and striking out those that do not fit the examples
presented. A positive example eliminates all descriptions that it does not match,
and a negative one eliminates those it does match. With each example the set
of remaining descriptions shrinks (or stays the same). If only one is left, it is the
target description—the target concept.
If several descriptions are left, they may still be used to classify unknown
objects. An unknown object that matches all remaining descriptions should be
classified as matching the target; if it fails to match any description it should be
classified as being outside the target concept. Only when it matches some
descriptions but not others is there ambiguity. In this case if the classification
of the unknown object were revealed, it would cause the set of remaining
descriptions to shrink because rule sets that classified the object the wrong way
would be rejected.
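The elimination and classification rules just described can be sketched directly. Here a "description" is simply a predicate over instances; the single-attribute threshold tests are a toy description space invented for this illustration, not taken from the weather data:

```python
# Elimination over an enumerated space of candidate descriptions.
# A positive example removes every description it does not match;
# a negative example removes every description it does match.

def eliminate(descriptions, example, is_positive):
    if is_positive:
        return [d for d in descriptions if d(example)]
    return [d for d in descriptions if not d(example)]

def classify(descriptions, unknown):
    votes = [d(unknown) for d in descriptions]
    if all(votes):
        return "inside the target concept"
    if not any(votes):
        return "outside the target concept"
    return "ambiguous"  # matches some remaining descriptions but not others

# Toy description space: threshold tests on one numeric attribute.
space = [lambda x, t=t: x >= t for t in (60, 70, 80)]
space = eliminate(space, 75, is_positive=True)   # drops the t = 80 test
space = eliminate(space, 65, is_positive=False)  # drops the t = 60 test
print(classify(space, 72))  # inside the target concept
```

Note how each example shrinks (or leaves unchanged) the set of remaining descriptions, exactly as in the text; here the two examples suffice to narrow the toy space to a single description.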
Enumerating the concept space
Regarding it as search is a good way of looking at the learning process. However,
the search space, although finite, is extremely big, and it is generally quite
impractical to enumerate all possible descriptions and then see which ones fit.
In the weather problem there are 4 × 4 × 3 × 3 × 2 = 288 possibilities for each rule. There are four possibilities for the outlook attribute: sunny, overcast, or rainy, or it may not participate in the rule at all. Similarly, there are four for temperature, three for humidity and windy, and two for the class. If we restrict the rule set to contain no more than 14 rules (because there are 14 examples in the training set), there are around 2.7 × 10^34 possible different rule sets. That's a lot to
enumerate, especially for such a patently trivial problem.
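The arithmetic can be checked in a few lines. The per-attribute value counts below are those of the weather data; the counting convention for rule sets (ordered lists of up to 14 rules, each drawn independently from the 288 possibilities) is an assumption, chosen because it reproduces the figure quoted above:

```python
# Possibilities per rule: each attribute contributes (values + 1) choices
# (one of its values, or absent from the rule); the class contributes 2.
attribute_values = {"outlook": 3, "temperature": 3, "humidity": 2, "windy": 2}
rules = 2  # two possible class values
for n in attribute_values.values():
    rules *= n + 1
print(rules)  # 288

# Rule sets of up to 14 rules, each chosen from the 288 possible rules.
rule_sets = sum(rules ** k for k in range(1, 15))
print(f"{rule_sets:.1e}")  # 2.7e+34
```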
Although there are ways of making the enumeration procedure more feasible, a serious problem remains: in practice, it is rare for the process to converge
on a unique acceptable description. Either many descriptions are still in the
running after the examples are processed, or all the descriptions are eliminated.
The first case arises when the examples are not sufficiently comprehensive to
eliminate all possible descriptions except for the “correct” one. In practice,
people often want a single “best” description, and it is necessary to apply some
other criteria to select the best one from the set of remaining descriptions. The
second problem arises either because the description language is not expressive
enough to capture the actual concept or because of noise in the examples. If an
example comes in with the “wrong” classification because of an error in some
of the attribute values or in the class that is assigned to it, this will likely