Of course, there is nothing special
about these particular numbers, and a similar
relationship must hold regardless of the actual values. Thus we can add a further
criterion to the preceding list:
3. The information must obey the multistage property illustrated previously.
Remarkably, it turns out that there is only one function that satisfies all these
properties, and it is known as the information value or entropy:

entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn
The reason for the minus signs is that logarithms of the fractions p1, p2, ..., pn
are negative, so the entropy is actually positive. Usually the logarithms are
expressed in base 2, and then the entropy is in units called bits—just the usual kind
of bits used with computers.
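As a concrete illustration (a minimal Python sketch of my own, not code from this book), the entropy can be computed directly from a list of fractions; the helper name entropy_bits is invented here:

from math import log2

def entropy_bits(fractions):
    # Entropy in bits of a distribution given as fractions that sum to 1.
    # Zero fractions are skipped because 0 log 0 is taken to be 0.
    return -sum(p * log2(p) for p in fractions if p > 0)

# Two instances of one class and three of the other: entropy(2/5, 3/5)
print(f"{entropy_bits([2/5, 3/5]):.3f}")  # 0.971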
The arguments p1, p2, ... of the entropy formula are expressed as fractions
that add up to one, so that, for example,

info([2,3,4]) = entropy(2/9, 3/9, 4/9)
Thus the multistage decision property can be written in general as

entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q + r), r/(q + r))

where p + q + r = 1.
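As a quick numeric check (again an illustrative sketch, reusing the entropy_bits helper above), the two sides of the multistage identity agree for the fractions 2/9, 3/9, and 4/9:

from math import log2

def entropy_bits(fractions):
    return -sum(p * log2(p) for p in fractions if p > 0)

# entropy(2/9, 3/9, 4/9) versus entropy(2/9, 7/9) + (7/9) x entropy(3/7, 4/7)
lhs = entropy_bits([2/9, 3/9, 4/9])
rhs = entropy_bits([2/9, 7/9]) + (7/9) * entropy_bits([3/7, 4/7])
print(abs(lhs - rhs) < 1e-12)  # True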
Because of the way the log function works, you can calculate the information
measure without having to work out the individual fractions:

info([2,3,4]) = -2/9 × log(2/9) - 3/9 × log(3/9) - 4/9 × log(4/9)
             = [-2 log 2 - 3 log 3 - 4 log 4 + 9 log 9]/9
This is the way that the information measure is usually calculated in
practice. So the information value for the first leaf node of the first tree in Figure
4.2 is

info([2,3]) = -2/5 × log(2/5) - 3/5 × log(3/5) = 0.971 bits

as stated on page 98.
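The counts-only shortcut can be sketched the same way (the helper name info_bits is my own choice; this is not the book's code), reproducing the 0.971 bits value above:

from math import log2

def info_bits(counts):
    # Information value of a class distribution given as raw counts,
    # computed as [-sum(c log c) + n log n] / n without forming the fractions.
    n = sum(counts)
    return (-sum(c * log2(c) for c in counts if c > 0) + n * log2(n)) / n

print(f"{info_bits([2, 3]):.3f}")     # 0.971
print(f"{info_bits([2, 3, 4]):.3f}")  # 1.530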
Highly branching attributes
When some attributes have a large number of possible values, giving rise to a
multiway branch with many child nodes, a problem arises with the information
gain calculation. The problem can best be appreciated in the extreme case when
an attribute has a different value for each instance in the dataset—as, for
example, an identification code attribute might.
Table 4.6 gives the weather data with this extra attribute. Branching on ID
code produces the tree stump in Figure 4.5. The information required to specify
the class given the value of this attribute is

info([0,1]) + info([0,1]) + info([1,0]) + ... + info([1,0]) + info([0,1]),
which is zero because each of the 14 terms is zero. This is not surprising: the ID
code attribute identifies the instance, which determines the class without any
ambiguity—just as Table 4.6 shows. Consequently, the information gain of this
attribute is just the information at the root, info([9,5])
= 0.940 bits. This is
greater than the information gain of any other attribute, and so ID code will
inevitably be chosen as the splitting attribute. But branching on the identifica-
tion code is no good for predicting the class of unknown instances and tells
nothing about the structure of the decision, which after all are the twin goals of
machine learning.
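To make the problem concrete, here is a small sketch (my own illustration, not the book's code) of the gain calculation for the ID code split: every branch covers a single instance, so the weighted information after the split is zero and the gain equals the full 0.940 bits at the root.

from math import log2

def info_bits(counts):
    n = sum(counts)
    return (-sum(c * log2(c) for c in counts if c > 0) + n * log2(n)) / n

# Class counts [yes, no] at the root of the weather data: 9 yes, 5 no.
root_info = info_bits([9, 5])  # 0.940 bits

# One branch per ID code a..n; each leaf holds a single instance (Table 4.6),
# so every branch distribution is either [1, 0] or [0, 1] and carries no information.
branches = [[0, 1], [0, 1], [1, 0], [1, 0], [1, 0], [0, 1], [1, 0],
            [0, 1], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
after_split = sum(sum(b) / 14 * info_bits(b) for b in branches)  # 0.0

print(f"gain(ID code) = {root_info - after_split:.3f} bits")  # 0.940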
Table 4.6  The weather data with identification codes.

ID code   Outlook    Temperature   Humidity   Windy   Play
a         sunny      hot           high       false   no
b         sunny      hot           high       true    no
c         overcast   hot           high       false   yes
d         rainy      mild          high       false   yes
e         rainy      cool          normal     false   yes
f         rainy      cool          normal     true    no
g         overcast   cool          normal     true    yes
h         sunny      mild          high       false   no
i         sunny      cool          normal     false   yes
j         rainy      mild          normal     false   yes
k         sunny      mild          normal     true    yes
l         overcast   mild          high       true    yes
m         overcast   hot           normal     false   yes
n         rainy      mild          high       true    no
[Figure 4.5  Tree stump for the ID code attribute: the root tests ID code, with one branch per code a, b, c, ..., m, n, each leading to a single-instance leaf.]