Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




The overall effect is that the information gain measure tends to prefer attri-

butes with large numbers of possible values. To compensate for this, a modifi-

cation of the measure called the gain ratio is widely used. The gain ratio is

derived by taking into account the number and size of daughter nodes into

which an attribute splits the dataset, disregarding any information about the

class. In the situation shown in Figure 4.5, all counts have a value of 1, so the

information value of the split is

info([1, 1, . . . , 1]) = -1/14 × log 1/14 × 14

because the same fraction, 1/14, appears 14 times. This amounts to log 14, or

3.807 bits, which is a very high value. This is because the information value of

a split is the number of bits needed to determine to which branch each instance

is assigned, and the more branches there are, the greater this value is. The gain

ratio is calculated by dividing the original information gain, 0.940 in this case,

by the information value of the attribute, 3.807—yielding a gain ratio value of

0.247 for the ID code attribute.
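
To make the arithmetic concrete, here is a minimal Python sketch of the split-information and gain-ratio calculations just described; the function names are ours, and the 0.940 figure is the information gain quoted above.

```python
import math

def split_info(subset_sizes):
    # Bits needed to say which branch each instance is sent down,
    # computed from the subset sizes alone (class labels are ignored).
    total = sum(subset_sizes)
    return -sum((n / total) * math.log2(n / total) for n in subset_sizes if n > 0)

def gain_ratio(info_gain, subset_sizes):
    return info_gain / split_info(subset_sizes)

# Hypothetical ID code attribute: 14 branches, one instance in each.
print(split_info([1] * 14))         # about 3.807 bits (log2 of 14)
print(gain_ratio(0.940, [1] * 14))  # about 0.247
```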

Returning to the tree stumps for the weather data in Figure 4.2, outlook splits

the dataset into three subsets of size 5, 4, and 5 and thus has an intrinsic information value of

info([5, 4, 5]) = 1.577 bits

without paying any attention to the classes involved in the subsets. As we have

seen, this intrinsic information value is higher for a more highly branching

attribute such as the hypothesized ID code. Again we can correct the informa-

tion gain by dividing by the intrinsic information value to get the gain ratio.

The results of these calculations for the tree stumps of Figure 4.2 are sum-

marized in Table 4.7. Outlook still comes out on top, but humidity is now a much

closer contender because it splits the data into two subsets instead of three. In

this particular example, the hypothetical ID code attribute, with a gain ratio of 0.247, would still be preferred to any of these four. However, its advantage is greatly reduced. In practical implementations, we can use an ad hoc test to guard against splitting on such a useless attribute.





Table 4.7  Gain ratio calculations for the tree stumps of Figure 4.2.

             info     gain                     split info               gain ratio
Outlook      0.693    0.940 - 0.693 = 0.247    info([5,4,5]) = 1.577    0.247/1.577 = 0.157
Temperature  0.911    0.940 - 0.911 = 0.029    info([4,6,4]) = 1.557    0.029/1.557 = 0.019
Humidity     0.788    0.940 - 0.788 = 0.152    info([7,7])   = 1.000    0.152/1.000 = 0.152
Windy        0.892    0.940 - 0.892 = 0.048    info([8,6])   = 0.985    0.048/0.985 = 0.049
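
As a cross-check on Table 4.7, the short sketch below recomputes the split information and gain ratio for the four weather attributes; the subset sizes and gain figures are those given in the table, while the code itself is only illustrative.

```python
import math

def split_info(subset_sizes):
    # Intrinsic information of a split, from subset sizes alone.
    total = sum(subset_sizes)
    return -sum((n / total) * math.log2(n / total) for n in subset_sizes if n > 0)

splits = {"outlook": [5, 4, 5], "temperature": [4, 6, 4],
          "humidity": [7, 7], "windy": [8, 6]}
gains = {"outlook": 0.247, "temperature": 0.029,
         "humidity": 0.152, "windy": 0.048}

for name, sizes in splits.items():
    si = split_info(sizes)
    print(f"{name}: split info {si:.3f}, gain ratio {gains[name] / si:.3f}")
# outlook 0.157, temperature 0.019, humidity 0.152, windy 0.049
```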



Unfortunately, in some situations the gain ratio modification overcompen-

sates and can lead to preferring an attribute just because its intrinsic informa-

tion is much lower than that for the other attributes. A standard fix is to choose

the attribute that maximizes the gain ratio, provided that the information gain

for that attribute is at least as great as the average information gain for all the

attributes examined.
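
A minimal sketch of that selection rule follows, assuming two helper callables, info_gain and gain_ratio, that return the respective scores for an attribute; both names are illustrative rather than taken from any particular implementation.

```python
def choose_split_attribute(attributes, info_gain, gain_ratio):
    # Maximize the gain ratio, but only over attributes whose information
    # gain is at least as great as the average gain of all attributes.
    gains = {a: info_gain(a) for a in attributes}
    average_gain = sum(gains.values()) / len(gains)
    candidates = [a for a in attributes if gains[a] >= average_gain]
    return max(candidates, key=gain_ratio)
```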

Discussion

The divide-and-conquer approach to decision tree induction, sometimes called



top-down induction of decision trees, was developed and refined over many years

by J. Ross Quinlan of the University of Sydney, Australia. Although others have

worked on similar methods, Quinlan’s research has always been at the very fore-

front of decision tree induction. The method that has been described using the

information gain criterion is essentially the same as one known as ID3. The use

of the gain ratio was one of many improvements that were made to ID3 over

several years; Quinlan described it as robust under a wide variety of circum-

stances. Although a robust and practical solution, it sacrifices some of the ele-

gance and clean theoretical motivation of the information gain criterion.

A series of improvements to ID3 culminated in a practical and influential

system for decision tree induction called C4.5. These improvements include

methods for dealing with numeric attributes, missing values, noisy data, and

generating rules from trees, and they are described in Section 6.1.

4.4 Covering algorithms: Constructing rules

As we have seen, decision tree algorithms are based on a divide-and-conquer

approach to the classification problem. They work from the top down, seeking

at each stage an attribute to split on that best separates the classes; then recur-

sively processing the subproblems that result from the split. This strategy 

generates a decision tree, which can if necessary be converted into a set of clas-

sification rules—although if it is to produce effective rules, the conversion is not

trivial.


An alternative approach is to take each class in turn and seek a way of cov-

ering all instances in it, at the same time excluding instances not in the class.

This is called a covering approach because at each stage you identify a rule that

“covers” some of the instances. By its very nature, this covering approach leads

to a set of rules rather than to a decision tree.
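
The overall shape of such a covering procedure can be sketched as follows; learn_one_rule stands in for the rule-growing step described later in this section, and the instance and rule interfaces (the cls attribute and the covers method) are assumptions made purely for illustration.

```python
def cover(instances, classes, learn_one_rule):
    # Take each class in turn and keep adding rules until every instance
    # of that class is covered, removing covered instances as we go.
    rules = []
    for cls in classes:
        uncovered = [x for x in instances if x.cls == cls]
        while uncovered:
            rule = learn_one_rule(cls, uncovered, instances)
            rules.append(rule)
            uncovered = [x for x in uncovered if not rule.covers(x)]
    return rules
```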

The covering method can readily be visualized in a two-dimensional space

of instances as shown in Figure 4.6(a). We first make a rule covering the a’s. For




