Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




Of course, there is nothing special about these particular numbers, and a similar relationship must hold regardless of the actual values. Thus we can add a further criterion to the preceding list:

3. The information must obey the multistage property illustrated previously.

Remarkably, it turns out that there is only one function that satisfies all these properties, and it is known as the information value or entropy:

$$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$$

The reason for the minus signs is that the logarithms of the fractions $p_1, p_2, \ldots, p_n$ are negative, so the entropy is actually positive. Usually the logarithms are expressed in base 2, and then the entropy is in units called bits—just the usual kind of bits used with computers.
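To make the definition concrete, here is a minimal Python sketch of the entropy function (the function name and the convention that zero fractions contribute nothing, i.e., 0 log 0 = 0, are our choices for illustration, not the book's):

```python
import math

def entropy(*fractions):
    """Entropy in bits of a distribution given as fractions summing to one."""
    # Base-2 logarithms give the answer in bits; zero fractions are
    # skipped, following the usual convention that 0 log 0 = 0.
    return -sum(p * math.log2(p) for p in fractions if p > 0)

print(entropy(2/5, 3/5))  # about 0.971 bits
```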

The arguments $p_1, p_2, \ldots$ of the entropy formula are expressed as fractions that add up to one, so that, for example,

$$\mathrm{info}([2,3,4]) = \mathrm{entropy}\!\left(\tfrac{2}{9}, \tfrac{3}{9}, \tfrac{4}{9}\right).$$

Thus the multistage decision property can be written in general as

$$\mathrm{entropy}(p, q, r) = \mathrm{entropy}(p, q+r) + (q+r)\,\mathrm{entropy}\!\left(\frac{q}{q+r}, \frac{r}{q+r}\right)$$

where $p + q + r = 1$.
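The property is easy to check numerically. The following sketch, reusing the entropy function above, verifies it for p, q, r = 2/9, 3/9, 4/9:

```python
# Both sides of the multistage property should agree.
p, q, r = 2/9, 3/9, 4/9
lhs = entropy(p, q, r)
rhs = entropy(p, q + r) + (q + r) * entropy(q / (q + r), r / (q + r))
print(lhs, rhs)  # both about 1.531 bits
```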

Because of the way the log function works, you can calculate the information measure without having to work out the individual fractions:

$$\mathrm{info}([2,3,4]) = -\tfrac{2}{9}\log\tfrac{2}{9} - \tfrac{3}{9}\log\tfrac{3}{9} - \tfrac{4}{9}\log\tfrac{4}{9} = \frac{-2\log 2 - 3\log 3 - 4\log 4 + 9\log 9}{9}.$$

This is the way that the information measure is usually calculated in practice. So the information value for the first leaf node of the first tree in Figure 4.2 is

$$\mathrm{info}([2,3]) = -\tfrac{2}{5}\log\tfrac{2}{5} - \tfrac{3}{5}\log\tfrac{3}{5} = 0.971 \text{ bits},$$

as stated on page 98.
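In code, the counts-based shortcut might look like the following sketch (the helper name info is ours, chosen to match the notation above):

```python
def info(counts):
    """Information value in bits of a class distribution given as raw counts.

    Uses the identity info = (-sum(c * log c) + n * log n) / n with
    n = sum(counts), so the individual fractions never have to be formed.
    """
    n = sum(counts)
    return (-sum(c * math.log2(c) for c in counts if c > 0)
            + n * math.log2(n)) / n

print(info([2, 3]))     # 0.971 bits, the first leaf node of the first tree
print(info([2, 3, 4]))  # about 1.531 bits
```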



Highly branching attributes

When some attributes have a large number of possible values, giving rise to a multiway branch with many child nodes, a problem arises with the information gain calculation. The problem can best be appreciated in the extreme case when an attribute has a different value for each instance in the dataset—as, for example, an identification code attribute might.


Table 4.6 gives the weather data with this extra attribute. Branching on ID code produces the tree stump in Figure 4.5. The information required to specify the class given the value of this attribute is

$$\mathrm{info}([0,1]) + \mathrm{info}([0,1]) + \mathrm{info}([1,0]) + \cdots + \mathrm{info}([1,0]) + \mathrm{info}([0,1]),$$

which is zero because each of the 14 terms is zero. This is not surprising: the ID code attribute identifies the instance, which determines the class without any ambiguity—just as Table 4.6 shows. Consequently, the information gain of this attribute is just the information at the root, $\mathrm{info}([9,5]) = 0.940$ bits. This is greater than the information gain of any other attribute, and so ID code will inevitably be chosen as the splitting attribute. But branching on the identification code is no good for predicting the class of unknown instances and tells nothing about the structure of the decision, which after all are the twin goals of machine learning.


Table 4.6  The weather data with identification codes.

ID code   Outlook    Temperature   Humidity   Windy   Play
a         sunny      hot           high       false   no
b         sunny      hot           high       true    no
c         overcast   hot           high       false   yes
d         rainy      mild          high       false   yes
e         rainy      cool          normal     false   yes
f         rainy      cool          normal     true    no
g         overcast   cool          normal     true    yes
h         sunny      mild          high       false   no
i         sunny      cool          normal     false   yes
j         rainy      mild          normal     false   yes
k         sunny      mild          normal     true    yes
l         overcast   mild          high       true    yes
m         overcast   hot           normal     false   yes
n         rainy      mild          high       true    no

Figure 4.5  Tree stump for the ID code attribute. (The root tests ID code, with one branch for each value a, b, c, ..., m, n; each branch leads directly to a single-instance leaf labeled with that instance's class, no, no, yes, ..., yes, no.)
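The numbers in this example are easy to reproduce. Here is a short sketch reusing the info helper from above, with the class column transcribed from Table 4.6:

```python
# Class values for instances a through n, read from Table 4.6.
play = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
        'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

root_info = info([play.count('yes'), play.count('no')])  # info([9,5]) = 0.940 bits

# Splitting on ID code gives each instance its own pure, single-instance
# branch, so the weighted information after the split is zero.
split_info = sum((1 / len(play)) * info([1]) for _ in play)

print(root_info - split_info)  # gain = 0.940 bits
```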
