described overfitting-avoidance bias in Chapter 1 (page 35), and we will
encounter this problem repeatedly in subsequent chapters.
For 1R, overfitting is likely to occur whenever an attribute has a large
number of possible values. Consequently, when discretizing a numeric attrib-
ute a rule is adopted that dictates a minimum number of examples of the
majority class in each partition. Suppose that minimum is set at three. This
eliminates all but two of the preceding partitions. Instead, the partitioning
process begins
yes no yes yes | yes ...
ensuring that there are three occurrences of yes, the majority class, in the first
partition. However, because the next example is also yes, we lose nothing by
including that in the first partition, too. This leads to a new division:
yes no yes yes yes | no no yes yes yes | no yes yes no
where each partition contains at least three instances of the majority class, except
the last one, which will usually have fewer. Partition boundaries always fall
between examples of different classes.
Whenever adjacent partitions have the same majority class, as do the first two
partitions above, they can be merged together without affecting the meaning of
the rule sets. Thus the final discretization is
yes no yes yes yes no no yes yes yes | no yes yes no
which leads to the rule set
temperature:
    ≤ 77.5 → yes
    > 77.5 → no
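The partition-and-merge procedure above lends itself to a short sketch. Everything below is illustrative rather than an actual implementation from the book: the function names are made up, and ties for the majority class are broken in favour of the class encountered first.

```python
from collections import Counter

def majority(part):
    """Most frequent class among (value, class) pairs; ties go to the
    class encountered first."""
    return Counter(c for _, c in part).most_common(1)[0]

def discretize_1r(pairs, min_bucket=3):
    """Partition a sorted list of (value, class) pairs so that each
    partition holds at least `min_bucket` examples of its majority
    class, then merge adjacent partitions with the same majority."""
    partitions, current = [], []
    for i, pair in enumerate(pairs):
        current.append(pair)
        maj, count = majority(current)
        # Cut once the majority class has reached the minimum AND the
        # next example belongs to a different class, so that boundaries
        # fall between examples of different classes.
        if count >= min_bucket and i + 1 < len(pairs) and pairs[i + 1][1] != maj:
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    # Merge adjacent partitions that share the same majority class.
    merged = [partitions[0]]
    for part in partitions[1:]:
        if majority(part)[0] == majority(merged[-1])[0]:
            merged[-1] = merged[-1] + part
        else:
            merged.append(part)
    return merged

# Sorted temperature values of the weather data with the play classes.
pairs = list(zip(
    [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85],
    ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no',
     'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']))
parts = discretize_1r(pairs)
```

Run on the sorted temperature values, this yields the two partitions shown above, with the single breakpoint placed midway between 75 and 80, at 77.5.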
The second rule involved an arbitrary choice; as it happens, no was chosen. If
we had chosen yes instead, there would be no need for any breakpoint at all—
and as this example illustrates, it might be better to use the adjacent categories
to help break ties. In fact, this rule generates five errors on the training set
and so is less effective than the preceding rule for outlook. However, the same
procedure leads to this rule for humidity:
humidity:
    ≤ 82.5 → yes
    > 82.5 and ≤ 95.5 → no
    > 95.5 → yes
This generates only three errors on the training set and is the best “1-rule” for
the data in Table 1.3.
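These error counts are easy to verify mechanically. The helper below is a hypothetical sketch (its name and structure are assumptions, with the temperature values taken from the sorted weather data):

```python
from bisect import bisect_right

def rule_errors(breakpoints, labels, values, classes):
    """Count training-set errors of a one-attribute rule: `breakpoints`
    holds the sorted interval boundaries, and `labels` gives the class
    predicted for each of the len(breakpoints) + 1 intervals."""
    errors = 0
    for v, c in zip(values, classes):
        predicted = labels[bisect_right(breakpoints, v)]
        errors += predicted != c
    return errors

# Sorted temperature values of the weather data and the play classes.
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no',
        'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']

print(rule_errors([77.5], ['yes', 'no'], temps, play))  # prints 5
```

The five misclassified instances are the three no examples below 77.5 and the two yes examples above it, matching the count quoted for the temperature rule.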
Finally, if a numeric attribute has missing values, an additional category is
created for them, and the preceding discretization procedure is applied just to
the instances for which the attribute’s value is defined.
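In code, that handling might be sketched as follows (an assumption for illustration: missing values are encoded as None, and the function name is made up):

```python
def split_by_missing(values, classes):
    """Put instances with a missing value (represented here as None)
    into their own category; the discretization procedure is applied
    only to the instances whose value is defined."""
    defined = [(v, c) for v, c in zip(values, classes) if v is not None]
    missing_category = [c for v, c in zip(values, classes) if v is None]
    return defined, missing_category
```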
Discussion
In a seminal paper titled “Very simple classification rules perform well on most
commonly used datasets” (Holte 1993), a comprehensive study of the perform-
ance of the 1R procedure was reported on 16 datasets frequently used by
machine learning researchers to evaluate their algorithms. Throughout, the
study used cross-validation, an evaluation technique that we will explain in
Chapter 5, to ensure that the results were representative of what independent
test sets would yield. After some experimentation, the minimum number of
examples in each partition of a numeric attribute was set at six, not three as
used for the preceding illustration.
Despite its simplicity, 1R did astonishingly, even embarrassingly, well in
comparison with state-of-the-art learning methods, and the rules
it produced turned out to be just a few percentage points less accurate, on almost
all of the datasets, than the decision trees produced by a state-of-the-art deci-
sion tree induction scheme. These trees were, in general, considerably larger
than 1R’s rules. Rules that test a single attribute are often a viable alternative to
more complex structures, and this strongly encourages a simplicity-first meth-
odology in which the baseline performance is established using simple, rudi-
mentary techniques before progressing to more sophisticated learning methods,
which inevitably generate output that is harder for people to interpret.
The 1R procedure learns a one-level decision tree whose leaves represent the
various different classes. A slightly more expressive technique is to use a differ-
ent rule for each class. Each rule is a conjunction of tests, one for each attribute.
For numeric attributes the test checks whether the value lies within a given inter-
val; for nominal ones it checks whether it is in a certain subset of that attribute’s
values. These two types of test, intervals and subsets, are learned from the
training data pertaining to each class. For a numeric attribute, the endpoints of
the interval are the minimum and maximum values that occur in the training
data for that class. For a nominal one, the subset contains just those values that
occur for that attribute in the training data for the class. Rules representing dif-
ferent classes usually overlap, and at prediction time the one with the most
matching tests is predicted. This simple technique often gives a useful first
impression of a dataset. It is extremely fast and can be applied to very large
quantities of data.
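A sketch of this per-class rule learner might look as follows (all names are hypothetical, and prediction ties are broken arbitrarily):

```python
def learn_class_rules(instances, classes, numeric):
    """Learn one rule per class: for each numeric attribute, the
    [min, max] interval of values seen for that class in the training
    data; for each nominal attribute, the set of values seen."""
    rules = {}
    for inst, cls in zip(instances, classes):
        rule = rules.setdefault(cls, [None] * len(inst))
        for j, v in enumerate(inst):
            if numeric[j]:
                lo, hi = rule[j] or (v, v)
                rule[j] = (min(lo, v), max(hi, v))
            else:
                rule[j] = (rule[j] or set()) | {v}
    return rules

def predict(rules, numeric, inst):
    """Predict the class whose rule has the most matching tests."""
    def score(rule):
        return sum((rule[j][0] <= v <= rule[j][1]) if numeric[j]
                   else (v in rule[j])
                   for j, v in enumerate(inst))
    return max(rules, key=lambda c: score(rules[c]))
```

Because the rules for different classes usually overlap, prediction counts how many tests match rather than requiring all of them to hold.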
4.2 Statistical modeling
The 1R method uses a single attribute as the basis for its decisions and chooses
the one that works best. Another simple technique is to use all attributes and
allow them to make contributions to the decision that are equally important and
independent of one another, given the class. This is unrealistic, of course: what