the first
test in the rule, split the space vertically as shown in the center picture.
This gives the beginnings of a rule:
If x
> 1.2 then class = a
However, the rule covers many
b’s as well as
a’s, so a new test is added to the
rule by further splitting the space horizontally as shown in the third diagram:
If x
> 1.2 and y > 2.6 then class = a
This gives a rule covering all but one of the
a’s. It’s probably appropriate to leave
it at that, but if it were felt necessary to cover the final a, another rule would be
necessary—perhaps
If x
> 1.4 and y < 2.4 then class = a
The same procedure leads to two rules covering the b’s:
If x
£ 1.2 then class = b
If x
> 1.2 and y £ 2.6 then class = b
1 0 6
C H A P T E R 4
|
A LG O R I T H M S : T H E BA S I C M E T H O D S
Figure 4.6 Covering algorithm: (a) covering the instances and (b) the decision tree for
the same problem.
x > 1.2 ?
b
no
y > 2.6 ?
yes
b
no
a
yes
(b)
P088407-Ch004.qxd 4/30/05 11:13 AM Page 106
4 . 4
C OV E R I N G A LG O R I T H M S : C O N S T RU C T I N G RU L E S
1 0 7
Again, one a is erroneously covered by these rules. If it were necessary to exclude
it, more tests would have to be added to the second rule, and additional rules
would need to be introduced to cover the b’s that these new tests exclude.
Rules versus trees
A top-down divide-and-conquer algorithm operates on the same data in a
manner that is, at least superficially, quite similar to a covering algorithm. It
might first split the dataset using the x attribute and would probably end up
splitting it at the same place, x
= 1.2. However, whereas the covering algorithm
is concerned only with covering a single class, the division would take both
classes into account, because divide-and-conquer algorithms create a single
concept description that applies to all classes. The second split might also be at
the same place, y
= 2.6, leading to the decision tree in Figure 4.6(b). This tree
corresponds exactly to the set of rules, and in this case there is no difference in
effect between the covering and the divide-and-conquer algorithms.
But in many situations there is a difference between rules and trees in terms
of the perspicuity of the representation. For example, when we described the
replicated subtree problem in Section 3.3, we noted that rules can be symmet-
ric whereas trees must select one attribute to split on first, and this can lead to
trees that are much larger than an equivalent set of rules. Another difference is
that, in the multiclass case, a decision tree split takes all classes into account,
trying to maximize the purity of the split, whereas the rule-generating method
concentrates on one class at a time, disregarding what happens to the other
classes.
A simple covering algorithm
Covering algorithms operate by adding tests to the rule that is under construc-
tion, always striving to create a rule with maximum accuracy. In contrast, divide-
and-conquer algorithms operate by adding tests to the tree that is under
construction, always striving to maximize the separation among the classes.
Each of these involves finding an attribute to split on. But the criterion for the
best attribute is different in each case. Whereas divide-and-conquer algorithms
such as ID3 choose an attribute to maximize the information gain, the cover-
ing algorithm we will describe chooses an attribute–value pair to maximize the
probability of the desired classification.
Figure 4.7 gives a picture of the situation, showing the space containing all
the instances, a partially constructed rule, and the same rule after a new term
has been added. The new term restricts the coverage of the rule: the idea is to
include as many instances of the desired class as possible and exclude as many
instances of other classes as possible. Suppose the new rule will cover a total of
t instances, of which p are positive examples of the class and t
- p are in other
P088407-Ch004.qxd 4/30/05 11:13 AM Page 107
classes—that is, they are errors made by the rule. Then choose the new term to
maximize the ratio p/t.
An example will help. For a change, we use the contact lens problem of Table
1.1. We will form rules that cover each of the three classes, hard, soft, and none,
in turn. To begin, we seek a rule:
If ? then recommendation
= hard
For the unknown term ?, we have nine choices:
age
= young
2/8
age
= pre-presbyopic
1/8
age
= presbyopic
1/8
spectacle prescription
= myope
3/12
spectacle prescription
= hypermetrope
1/12
astigmatism
= no
0/12
astigmatism
= yes
4/12
tear production rate
= reduced
0/12
tear production rate
= normal
4/12
The numbers on the right show the fraction of “correct” instances in the set
singled out by that choice. In this case, correct means that the recommendation is
hard. For instance, age
= young selects eight instances, two of which recommend
hard contact lenses, so the first fraction is 2/8. (To follow this, you will need to
look back at the contact lens data in Table 1.1 on page 6 and count up the entries
in the table.) We select the largest fraction, 4/12, arbitrarily choosing between
the seventh and the last choice in the preceding list, and create the rule:
If astigmatism
= yes then recommendation = hard
This rule is an inaccurate one, getting only 4 instances correct out of the 12
that it covers, shown in Table 4.8. So we refine it further:
If astigmatism
= yes and ? then recommendation = hard
1 0 8
C H A P T E R 4
|
A LG O R I T H M S : T H E BA S I C M E T H O D S
Figure 4.7 The instance space during operation of a covering algorithm.
P088407-Ch004.qxd 4/30/05 11:13 AM Page 108