attributes in the decision procedure, making a careful selection of which ones
to use. Chapter 7 shows how.
The normal-distribution assumption for numeric attributes is another
restriction on Naïve Bayes as we have formulated it here. Many features simply
aren’t normally distributed. However, there is nothing to prevent us from using
other distributions for the numeric attributes: there is nothing magic about the
normal distribution. If you know that a particular attribute is likely to follow
some other distribution, standard estimation procedures for that distribution
can be used instead. If you suspect it isn’t normal but don’t know the actual
distribution, there are procedures for “kernel density estimation” that do not
assume any particular distribution for the attribute values. Another possibility
is simply to discretize the data first.
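For instance, a class-conditional density for a numeric attribute can be estimated with Gaussian kernels rather than a single fitted normal. The sketch below is illustrative only; the function name, the bandwidth value, and the sample readings are our own choices, not taken from the text:

```python
import math

def kernel_density(x, samples, bandwidth):
    """Gaussian kernel density estimate of p(x) from observed samples.

    Instead of fitting one normal distribution to the attribute,
    place a small Gaussian kernel on every training value and
    average them; no particular overall shape is assumed.
    """
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - s) / bandwidth) for s in samples) / (len(samples) * bandwidth)

# Hypothetical temperature readings for one class of a training set
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
likelihood = kernel_density(66.0, temps_yes, bandwidth=5.0)
```

The bandwidth plays the role that the standard deviation plays for the normal distribution: it controls how smooth the estimated density is, and in practice it would itself be chosen from the data.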
4.3 Divide-and-conquer: Constructing decision trees
The problem of constructing a decision tree can be expressed recursively. First,
select an attribute to place at the root node and make one branch for each possible value. This splits up the example set into subsets, one for every value of
the attribute. Now the process can be repeated recursively for each branch, using
only those instances that actually reach the branch. If at any time all instances
at a node have the same classification, stop developing that part of the tree.
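As a rough sketch (not the book's code), the recursion might look like this in Python, with the attribute-selection criterion left as a parameter to be filled in by a purity measure:

```python
def build_tree(instances, attributes, choose_attribute):
    """Divide-and-conquer construction of a decision tree.

    instances: list of (attribute_value_dict, class_label) pairs.
    choose_attribute: the selection criterion, e.g. purity-based.
    Returns a class label (leaf) or a dict {attr: {value: subtree}}.
    """
    classes = {label for _, label in instances}
    if len(classes) == 1:            # all instances agree: stop splitting
        return classes.pop()
    if not attributes:               # nothing left to split on: majority vote
        labels = [label for _, label in instances]
        return max(set(labels), key=labels.count)
    attr = choose_attribute(instances, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for values, label in instances:  # one subset per observed value
        branches.setdefault(values[attr], []).append((values, label))
    return {attr: {v: build_tree(subset, remaining, choose_attribute)
                   for v, subset in branches.items()}}

# Tiny made-up example; the trivial criterion just takes the first attribute
data = [({"outlook": "sunny", "windy": "false"}, "no"),
        ({"outlook": "overcast", "windy": "true"}, "yes"),
        ({"outlook": "overcast", "windy": "false"}, "yes")]
tree = build_tree(data, ["outlook", "windy"], lambda inst, attrs: attrs[0])
```

The interesting question, taken up next, is what to use for `choose_attribute` in place of the trivial criterion above.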
The only thing left to decide is how to determine which attribute to split on,
given a set of examples with different classes. Consider (again!) the weather data.
There are four possibilities for each split, and at the top level they produce trees
such as those in Figure 4.2. Which is the best choice? The numbers of yes and no classes are shown at the leaves. Any leaf with only one class, yes or no, will not have to be split further, and the recursive process down that branch will terminate. Because we seek small trees, we would like this to happen as soon as possible. If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. Take a moment to look at Figure 4.2 and ponder which attribute you think is the best choice.
The measure of purity that we will use is called the information and is measured in units called bits. Associated with a node of the tree, it represents the
expected amount of information that would be needed to specify whether a new
instance should be classified yes or no, given that the example reached that node.
Unlike the bits in computer memory, the expected amount of information
usually involves fractions of a bit—and is often less than one! We calculate it
based on the number of yes and no classes at the node; we will look at the details
of the calculation shortly. But first let’s see how it’s used. When evaluating the
first tree in Figure 4.2, the numbers of yes and no classes at the leaf nodes are
[2,3], [4,0], and [3,2], respectively, and the information values of these nodes are:

info([2,3]) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits

We can calculate the average information value of these, taking into account the number of instances that go down each branch (five down the first and third and four down the second):

info([2,3],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

This average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure in Figure 4.2(a).
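These figures can be reproduced from the class counts alone. The following Python fragment is an illustrative sketch, not code from the book; it computes the information value of a node and the size-weighted average across leaves:

```python
import math

def info(counts):
    """Information (entropy) in bits for a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_info(leaf_counts):
    """Average information over leaves, weighted by leaf size."""
    total = sum(sum(c) for c in leaf_counts)
    return sum(sum(c) / total * info(c) for c in leaf_counts)

info([2, 3])                             # → 0.9709..., the 0.971 bits above
info([4, 0])                             # → 0.0: a pure node needs no information
weighted_info([[2, 3], [4, 0], [3, 2]])  # → 0.6935..., the 0.693 bits above
```

Note that counts of zero are skipped in the sum, since a class that never occurs contributes nothing to the information.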
Figure 4.2 Tree stumps for the weather data: (a) outlook, with leaves sunny [2 yes, 3 no], overcast [4 yes, 0 no], rainy [3 yes, 2 no]; (b) temperature, with leaves hot [2 yes, 2 no], mild [4 yes, 2 no], cool [3 yes, 1 no]; (c) humidity, with leaves high [3 yes, 4 no], normal [6 yes, 1 no]; (d) windy, with leaves false [6 yes, 2 no], true [3 yes, 3 no].