described overfitting-avoidance bias in Chapter 1 (page 35), and we will
encounter this problem repeatedly in subsequent chapters.
For 1R, overfitting is likely to occur whenever an attribute has a large
number of possible values. Consequently, when discretizing a numeric attrib-
ute a rule is adopted that dictates a minimum number of examples of the
majority class in each partition. Suppose that minimum is set at three. This
eliminates all but two of the preceding partitions. Instead, the partitioning
process begins
yes no yes yes | yes ...
ensuring that there are three occurrences of yes, the majority class, in the first
partition. However, because the next example is also yes, we lose nothing by
including that in the first partition, too. This leads to a new division:
yes no yes yes yes | no no yes yes yes | no yes yes no
where each partition contains at least three instances of the majority class, except
the last one, which will usually have fewer. Partition boundaries always fall
between examples of different classes.
Whenever adjacent partitions have the same majority class, as do the first two
partitions above, they can be merged together without affecting the meaning of
the rule sets. Thus the final discretization is
yes no yes yes yes no no yes yes yes | no yes yes no
which leads to the rule set
temperature:
    ≤ 77.5 → yes
    > 77.5 → no
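The partition-and-merge procedure above lends itself to a short sketch. Everything below is illustrative rather than an actual implementation from the book: the function names are made up, and ties for the majority class are broken in favour of the class encountered first.

```python
from collections import Counter

def majority(part):
    """Most frequent class among (value, class) pairs; ties go to the
    class encountered first."""
    return Counter(c for _, c in part).most_common(1)[0]

def discretize_1r(pairs, min_bucket=3):
    """Partition a sorted list of (value, class) pairs so that each
    partition holds at least `min_bucket` examples of its majority
    class, then merge adjacent partitions with the same majority."""
    partitions, current = [], []
    for i, pair in enumerate(pairs):
        current.append(pair)
        maj, count = majority(current)
        # Cut once the majority class has reached the minimum AND the
        # next example belongs to a different class, so that boundaries
        # fall between examples of different classes.
        if count >= min_bucket and i + 1 < len(pairs) and pairs[i + 1][1] != maj:
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    # Merge adjacent partitions that share the same majority class.
    merged = [partitions[0]]
    for part in partitions[1:]:
        if majority(part)[0] == majority(merged[-1])[0]:
            merged[-1] = merged[-1] + part
        else:
            merged.append(part)
    return merged

# Sorted temperature values of the weather data with the play classes.
pairs = list(zip(
    [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85],
    ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no',
     'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']))
parts = discretize_1r(pairs)
```

Run on the sorted temperature values, this yields the two partitions shown above, with the single breakpoint placed midway between 75 and 80, at 77.5.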
The second rule involved an arbitrary choice; as it happens, no was chosen. If
we had chosen yes instead, there would be no need for any breakpoint at all—
and as this example illustrates, it might be better to use the adjacent categories
to help break ties. In fact, this rule generates five errors on the training set
and so is less effective than the preceding rule for outlook. However, the same
procedure leads to this rule for humidity:
humidity:
    ≤ 82.5 → yes
    > 82.5 and ≤ 95.5 → no
    > 95.5 → yes
This generates only three errors on the training set and is the best “1-rule” for
the data in Table 1.3.
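These error counts are easy to verify mechanically. The helper below is a hypothetical sketch (its name and structure are assumptions, with the temperature values taken from the sorted weather data):

```python
from bisect import bisect_right

def rule_errors(breakpoints, labels, values, classes):
    """Count training-set errors of a one-attribute rule: `breakpoints`
    holds the sorted interval boundaries, and `labels` gives the class
    predicted for each of the len(breakpoints) + 1 intervals."""
    errors = 0
    for v, c in zip(values, classes):
        predicted = labels[bisect_right(breakpoints, v)]
        errors += predicted != c
    return errors

# Sorted temperature values of the weather data and the play classes.
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no',
        'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']

print(rule_errors([77.5], ['yes', 'no'], temps, play))  # prints 5
```

The five misclassified instances are the three no examples below 77.5 and the two yes examples above it, matching the count quoted for the temperature rule.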
Finally, if a numeric attribute has missing values, an additional category is
created for them, and the preceding discretization procedure is applied just to
the instances for which the attribute’s value is defined.
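In code, that handling might be sketched as follows (an assumption for illustration: missing values are encoded as None, and the function name is made up):

```python
def split_by_missing(values, classes):
    """Put instances with a missing value (represented here as None)
    into their own category; the discretization procedure is applied
    only to the instances whose value is defined."""
    defined = [(v, c) for v, c in zip(values, classes) if v is not None]
    missing_category = [c for v, c in zip(values, classes) if v is None]
    return defined, missing_category
```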
Discussion
In a seminal paper titled “Very simple classification rules perform well on most
commonly used datasets” (Holte 1993), a comprehensive study of the perform-
ance of the 1R procedure was reported on 16 datasets frequently used by
machine learning researchers to evaluate their algorithms. Throughout, the
study used cross-validation, an evaluation technique that we will explain in
Chapter 5, to ensure that the results were representative of what independent
test sets would yield. After some experimentation, the minimum number of
examples in each partition of a numeric attribute was set at six, not three as
used for the preceding illustration.
Despite its simplicity, 1R did astonishingly, even embarrassingly, well in
comparison with state-of-the-art learning methods, and the rules
it produced turned out to be just a few percentage points less accurate, on almost
all of the datasets, than the decision trees produced by a state-of-the-art deci-
sion tree induction scheme. These trees were, in general, considerably larger
than 1R’s rules. Rules that test a single attribute are often a viable alternative to
more complex structures, and this strongly encourages a simplicity-first meth-
odology in which the baseline performance is established using simple, rudi-
mentary techniques before progressing to more sophisticated learning methods,
which inevitably generate output that is harder for people to interpret.
The 1R procedure learns a one-level decision tree whose leaves represent the
various different classes. A slightly more expressive technique is to use a differ-
ent rule for each class. Each rule is a conjunction of tests, one for each attribute.
For numeric attributes the test checks whether the value lies within a given inter-
val; for nominal ones it checks whether it is in a certain subset of that attribute’s
values. These two types of test, intervals and subsets, are learned from the
training data pertaining to each class. For a numeric attribute, the endpoints of
the interval are the minimum and maximum values that occur in the training
data for that class. For a nominal one, the subset contains just those values that
occur for that attribute in the training data for the class. Rules representing dif-
ferent classes usually overlap, and at prediction time the one with the most
matching tests is predicted. This simple technique often gives a useful first
impression of a dataset. It is extremely fast and can be applied to very large
quantities of data.
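A sketch of this per-class rule learner might look as follows (all names are hypothetical, and prediction ties are broken arbitrarily):

```python
def learn_class_rules(instances, classes, numeric):
    """Learn one rule per class: for each numeric attribute, the
    [min, max] interval of values seen for that class in the training
    data; for each nominal attribute, the set of values seen."""
    rules = {}
    for inst, cls in zip(instances, classes):
        rule = rules.setdefault(cls, [None] * len(inst))
        for j, v in enumerate(inst):
            if numeric[j]:
                lo, hi = rule[j] or (v, v)
                rule[j] = (min(lo, v), max(hi, v))
            else:
                rule[j] = (rule[j] or set()) | {v}
    return rules

def predict(rules, numeric, inst):
    """Predict the class whose rule has the most matching tests."""
    def score(rule):
        return sum((rule[j][0] <= v <= rule[j][1]) if numeric[j]
                   else (v in rule[j])
                   for j, v in enumerate(inst))
    return max(rules, key=lambda c: score(rules[c]))
```

Because the rules for different classes usually overlap, prediction counts how many tests match rather than requiring all of them to hold.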
4.2 Statistical modeling
The 1R method uses a single attribute as the basis for its decisions and chooses
the one that works best. Another simple technique is to use all attributes and
allow them to make contributions to the decision that are equally important and
independent of one another, given the class. This is unrealistic, of course: what