Of course, there is nothing special
about these particular numbers, and a similar
relationship must hold regardless of the actual values. Thus we can add a further
criterion to the preceding list:
3. The information must obey the multistage property illustrated previously.
Remarkably, it turns out that there is only one function that satisfies all these
properties, and it is known as the information value or entropy:

entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn
The reason for the minus signs is that logarithms of the fractions p1, p2, ..., pn
are negative, so the entropy is actually positive. Usually the logarithms are
expressed in base 2, and then the entropy is in units called bits—just the usual kind
of bits used with computers.
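As a concrete illustration (a minimal Python sketch of my own, not code from this book), the entropy can be computed directly from a list of fractions; the helper name entropy_bits is invented here:

from math import log2

def entropy_bits(fractions):
    # Entropy in bits of a distribution given as fractions that sum to 1.
    # Zero fractions are skipped because 0 log 0 is taken to be 0.
    return -sum(p * log2(p) for p in fractions if p > 0)

# Two instances of one class and three of the other: entropy(2/5, 3/5)
print(f"{entropy_bits([2/5, 3/5]):.3f}")  # 0.971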
The arguments p1, p2, ... of the entropy formula are expressed as fractions
that add up to one, so that, for example,

info([2,3,4]) = entropy(2/9, 3/9, 4/9)
Thus the multistage decision property can be written in general as

entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q + r), r/(q + r))

where p + q + r = 1.
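As a quick numeric check (again an illustrative sketch, reusing the entropy_bits helper above), the two sides of the multistage identity agree for the fractions 2/9, 3/9, and 4/9:

from math import log2

def entropy_bits(fractions):
    return -sum(p * log2(p) for p in fractions if p > 0)

# entropy(2/9, 3/9, 4/9) versus entropy(2/9, 7/9) + (7/9) x entropy(3/7, 4/7)
lhs = entropy_bits([2/9, 3/9, 4/9])
rhs = entropy_bits([2/9, 7/9]) + (7/9) * entropy_bits([3/7, 4/7])
print(abs(lhs - rhs) < 1e-12)  # True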
Because of the way the log function works, you can calculate the information
measure without having to work out the individual fractions:

info([2,3,4]) = -2/9 × log(2/9) - 3/9 × log(3/9) - 4/9 × log(4/9)
             = [-2 log 2 - 3 log 3 - 4 log 4 + 9 log 9]/9
This is the way that the information measure is usually calculated in
practice. So the information value for the first leaf node of the first tree in Figure
4.2 is

info([2,3]) = -2/5 × log(2/5) - 3/5 × log(3/5) = 0.971 bits

as stated on page 98.
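The counts-only shortcut can be sketched the same way (the helper name info_bits is my own choice; this is not the book's code), reproducing the 0.971 bits value above:

from math import log2

def info_bits(counts):
    # Information value of a class distribution given as raw counts,
    # computed as [-sum(c log c) + n log n] / n without forming the fractions.
    n = sum(counts)
    return (-sum(c * log2(c) for c in counts if c > 0) + n * log2(n)) / n

print(f"{info_bits([2, 3]):.3f}")     # 0.971
print(f"{info_bits([2, 3, 4]):.3f}")  # 1.530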
Highly branching attributes
When some attributes have a large number of possible values, giving rise to a
multiway branch with many child nodes, a problem arises with the information
gain calculation. The problem can best be appreciated in the extreme case when
an attribute has a different value for each instance in the dataset—as, for
example, an identification code attribute might.
Table 4.6 gives the weather data with this extra attribute. Branching on ID
code produces the tree stump in Figure 4.5. The information required to specify
the class given the value of this attribute is

info([0,1]) + info([0,1]) + info([1,0]) + ... + info([1,0]) + info([0,1]),
which is zero because each of the 14 terms is zero. This is not surprising: the ID
code attribute identifies the instance, which determines the class without any
ambiguity—just as Table 4.6 shows. Consequently, the information gain of this
attribute is just the information at the root, info([9,5])
= 0.940 bits. This is
greater than the information gain of any other attribute, and so ID code will
inevitably be chosen as the splitting attribute. But branching on the identifica-
tion code is no good for predicting the class of unknown instances and tells
nothing about the structure of the decision, which after all are the twin goals of
machine learning.
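To make the problem concrete, here is a small sketch (my own illustration, not the book's code) of the gain calculation for the ID code split: every branch covers a single instance, so the weighted information after the split is zero and the gain equals the full 0.940 bits at the root.

from math import log2

def info_bits(counts):
    n = sum(counts)
    return (-sum(c * log2(c) for c in counts if c > 0) + n * log2(n)) / n

# Class counts [yes, no] at the root of the weather data: 9 yes, 5 no.
root_info = info_bits([9, 5])  # 0.940 bits

# One branch per ID code a..n; each leaf holds a single instance (Table 4.6),
# so every branch distribution is either [1, 0] or [0, 1] and carries no information.
branches = [[0, 1], [0, 1], [1, 0], [1, 0], [1, 0], [0, 1], [1, 0],
            [0, 1], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
after_split = sum(sum(b) / 14 * info_bits(b) for b in branches)  # 0.0

print(f"gain(ID code) = {root_info - after_split:.3f} bits")  # 0.940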
Table 4.6  The weather data with identification codes.

ID code   Outlook    Temperature   Humidity   Windy   Play
a         sunny      hot           high       false   no
b         sunny      hot           high       true    no
c         overcast   hot           high       false   yes
d         rainy      mild          high       false   yes
e         rainy      cool          normal     false   yes
f         rainy      cool          normal     true    no
g         overcast   cool          normal     true    yes
h         sunny      mild          high       false   no
i         sunny      cool          normal     false   yes
j         rainy      mild          normal     false   yes
k         sunny      mild          normal     true    yes
l         overcast   mild          high       true    yes
m         overcast   hot           normal     false   yes
n         rainy      mild          high       true    no
[Figure 4.5  Tree stump for the ID code attribute: the root tests ID code, with one branch per code a, b, c, ..., m, n, each leading to a single-instance leaf.]