Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




outliers, and so on. Most learning algorithms use statistical tests when constructing rules or trees and for correcting models that are “overfitted” in that they depend too strongly on the details of the particular examples used to produce them (we have already seen an example of this in the two decision trees of Figure 1.3 for the labor negotiations problem). Statistical tests are used to validate machine learning models and to evaluate machine learning algorithms. In our study of practical techniques for data mining, we will learn a great deal about statistics.



1.5 Generalization as search

One way of visualizing the problem of learning—and one that distinguishes it from statistical approaches—is to imagine a search through a space of possible concept descriptions for one that fits the data. Although the idea of generalization as search is a powerful conceptual tool for thinking about machine learning, it is not essential for understanding the practical methods described in this book. That is why this section is marked optional, as indicated by the gray bar in the margin.

Suppose, for definiteness, that concepts—the result of learning—are expressed as rules such as the ones given for the weather problem in Section 1.2 (although other concept description languages would do just as well). Suppose that we list all possible sets of rules and then look for ones that satisfy a given set of examples. A big job? Yes. An infinite job? At first glance it seems so because there is no limit to the number of rules there might be. But actually the number of possible rule sets is finite. Note first that each individual rule is no greater than a fixed maximum size, with at most one term for each attribute: for the weather data of Table 1.2 this involves four terms in all. Because the number of possible rules is finite, the number of possible rule sets is finite, too, although extremely large. However, we’d hardly be interested in sets that contained a very large number of rules. In fact, we’d hardly be interested in sets that had more rules than there are examples because it is difficult to imagine needing more than one rule for each example. So if we were to restrict consideration to rule sets smaller than that, the problem would be substantially reduced, although still very large.
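To make the “at most one term per attribute” idea concrete, here is a minimal sketch in Python; the dictionary encoding and the attribute names are illustrative assumptions chosen to mirror the weather data, not notation from the book:

    # One rule for the weather data: at most one term per attribute,
    # so a rule is a partial assignment of attribute values plus a class.
    rule = {"outlook": "sunny", "humidity": "high"}   # absent attributes place no constraint
    conclusion = "play = no"

    def matches(rule, example):
        # A rule matches an example if every term it mentions agrees.
        return all(example.get(attr) == value for attr, value in rule.items())

    print(matches(rule, {"outlook": "sunny", "temperature": "hot",
                         "humidity": "high", "windy": "false"}))   # True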

The threat of an infinite number of possible concept descriptions seems more serious for the second version of the weather problem in Table 1.3 because these rules contain numbers. If they are real numbers, you can’t enumerate them, even in principle. However, on reflection the problem again disappears because the numbers really just represent breakpoints in the numeric values that appear in the examples. For instance, consider the temperature attribute in Table 1.3. It involves the numbers 64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, and 85—12 different numbers. There are 13 possible places in which we might want to put a breakpoint for a rule involving temperature. The problem isn’t infinite after all.
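A quick way to see where the 13 comes from (this snippet is just a sanity check, not from the book):

    # The 12 distinct temperature values from Table 1.3.
    temps = sorted([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85])

    # A breakpoint only matters relative to the observed values: one place
    # below the smallest, one between each adjacent pair, one above the largest.
    print(len(temps) + 1)   # 13 possible breakpoints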

So the process of generalization can be regarded as a search through an enormous, but finite, search space. In principle, the problem can be solved by enumerating descriptions and striking out those that do not fit the examples presented. A positive example eliminates all descriptions that it does not match, and a negative one eliminates those it does match. With each example the set of remaining descriptions shrinks (or stays the same). If only one is left, it is the target description—the target concept.
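The striking-out step is easy to state in code. A minimal sketch, assuming each description is a predicate that tests whether an example matches it (the function name is hypothetical):

    def eliminate(descriptions, example, is_positive):
        # A positive example eliminates descriptions it does not match;
        # a negative example eliminates descriptions it does match.
        if is_positive:
            return [d for d in descriptions if d(example)]
        return [d for d in descriptions if not d(example)]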

If several descriptions are left, they may still be used to classify unknown objects. An unknown object that matches all remaining descriptions should be classified as matching the target; if it fails to match any description it should be classified as being outside the target concept. Only when it matches some descriptions but not others is there ambiguity. In this case if the classification of the unknown object were revealed, it would cause the set of remaining descriptions to shrink because rule sets that classified the object the wrong way would be rejected.
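Continuing the same toy sketch, the three-way decision looks like this (again an illustrative assumption, not the book’s code):

    def classify(descriptions, unknown):
        votes = [d(unknown) for d in descriptions]
        if all(votes):
            return "target"          # matches every remaining description
        if not any(votes):
            return "outside target"  # matches none of them
        return "ambiguous"           # matches some but not others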

Enumerating the concept space

Regarding it as search is a good way of looking at the learning process. However, the search space, although finite, is extremely big, and it is generally quite impractical to enumerate all possible descriptions and then see which ones fit. In the weather problem there are 4 × 4 × 3 × 3 × 2 = 288 possibilities for each rule. There are four possibilities for the outlook attribute: sunny, overcast, or rainy, or it may not participate in the rule at all. Similarly, there are four for temperature, three each for humidity and windy, and two for the class. If we restrict the rule set to contain no more than 14 rules (because there are 14 examples in the training set), there are around 2.7 × 10^34 possible different rule sets. That’s a lot to enumerate, especially for such a patently trivial problem.
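The arithmetic is easy to check. Treating a rule set as an ordered list of up to 14 rules reproduces the rough estimate given in the text (an assumption about how the figure was derived):

    per_rule = 4 * 4 * 3 * 3 * 2      # outlook, temperature, humidity, windy, class
    print(per_rule)                   # 288

    # 288 choices for each of 14 rule slots.
    print(f"{per_rule ** 14:.1e}")    # 2.7e+34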

Although there are ways of making the enumeration procedure more feasible, a serious problem remains: in practice, it is rare for the process to converge on a unique acceptable description. Either many descriptions are still in the running after the examples are processed or the descriptions are all eliminated. The first case arises when the examples are not sufficiently comprehensive to eliminate all possible descriptions except for the “correct” one. In practice, people often want a single “best” description, and it is necessary to apply some other criteria to select the best one from the set of remaining descriptions. The second problem arises either because the description language is not expressive enough to capture the actual concept or because of noise in the examples. If an example comes in with the “wrong” classification because of an error in some of the attribute values or in the class that is assigned to it, this will likely eliminate the correct description from the space.

