Data Mining: Practical Machine Learning Tools and Techniques, Second Edition





attributes in the decision procedure, making a careful selection of which ones to use. Chapter 7 shows how.

The normal-distribution assumption for numeric attributes is another restriction on Naïve Bayes as we have formulated it here. Many features simply aren't normally distributed. However, there is nothing to prevent us from using other distributions for the numeric attributes: there is nothing magic about the normal distribution. If you know that a particular attribute is likely to follow some other distribution, standard estimation procedures for that distribution can be used instead. If you suspect it isn't normal but don't know the actual distribution, there are procedures for "kernel density estimation" that do not assume any particular distribution for the attribute values. Another possibility is simply to discretize the data first.
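As an illustration of the kernel-density alternative, here is a minimal Python sketch (not from the book) that estimates a class-conditional likelihood for a numeric attribute with a Gaussian kernel instead of a single fitted normal. The bandwidth rule and the example temperatures (the nine values observed for the yes class in the numeric weather data) are illustrative choices, not anything prescribed by the text.

import math

def normal_density(x, values):
    # Likelihood under the usual assumption: a single normal fitted to the values.
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def kernel_density(x, values, bandwidth=None):
    # Kernel density estimate: the average of small normals centred on each value.
    n = len(values)
    if bandwidth is None:
        mu = sum(values) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
        bandwidth = 1.06 * sd * n ** -0.2        # Silverman's rule-of-thumb default
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-(x - v) ** 2 / (2 * bandwidth ** 2)) / norm
               for v in values) / n

# Temperatures observed when play = yes in the numeric weather data.
yes_temps = [83, 70, 68, 64, 69, 75, 75, 72, 81]
print(normal_density(66, yes_temps))   # single-normal likelihood of temperature 66
print(kernel_density(66, yes_temps))   # kernel-density likelihood of the same value

In a Naïve Bayes classifier the kernel estimate would simply replace the normal-density factor for that attribute when the per-attribute likelihoods are multiplied together.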

4.3 Divide-and-conquer: Constructing decision trees

The problem of constructing a decision tree can be expressed recursively. First, select an attribute to place at the root node and make one branch for each possible value. This splits up the example set into subsets, one for every value of the attribute. Now the process can be repeated recursively for each branch, using only those instances that actually reach the branch. If at any time all instances at a node have the same classification, stop developing that part of the tree.
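To make the recursion concrete, here is a minimal Python sketch (not from the book) of the divide-and-conquer procedure. It assumes instances are dictionaries mapping attribute names to values, and it takes the attribute-selection criterion as a parameter (select_attribute), which could be the information-based measure developed below.

def build_tree(instances, labels, attributes, select_attribute):
    # Divide-and-conquer construction of a decision tree (sketch).
    # Stop when all instances at this node share one class, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return max(set(labels), key=labels.count)   # leaf: majority (or only) class

    # Choose an attribute to split on and make one branch per observed value.
    attr = select_attribute(instances, labels, attributes)
    tree = {"split_on": attr, "branches": {}}
    remaining = [a for a in attributes if a != attr]

    for value in set(inst[attr] for inst in instances):
        subset = [(inst, lab) for inst, lab in zip(instances, labels)
                  if inst[attr] == value]
        sub_insts, sub_labels = zip(*subset)
        # Recurse using only the instances that actually reach this branch.
        tree["branches"][value] = build_tree(list(sub_insts), list(sub_labels),
                                             remaining, select_attribute)
    return tree

A real implementation would also handle numeric attributes and empty branches, but the recursive shape above is the whole of the divide-and-conquer idea.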

The only thing left to decide is how to determine which attribute to split on, given a set of examples with different classes. Consider (again!) the weather data. There are four possibilities for each split, and at the top level they produce trees such as those in Figure 4.2. Which is the best choice? The numbers of yes and no classes are shown at the leaves. Any leaf with only one class (yes or no) will not have to be split further, and the recursive process down that branch will terminate. Because we seek small trees, we would like this to happen as soon as possible. If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. Take a moment to look at Figure 4.2 and ponder which attribute you think is the best choice.

The measure of purity that we will use is called the information and is measured in units called bits. Associated with a node of the tree, it represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node. Unlike the bits in computer memory, the expected amount of information usually involves fractions of a bit, and is often less than one! We calculate it based on the number of yes and no classes at the node; we will look at the details of the calculation shortly. But first let's see how it's used. When evaluating the first tree in Figure 4.2, the numbers of yes and no classes at the leaf nodes are [2,3], [4,0], and [3,2], respectively, and the information values of these nodes are:





info([2,3]) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits

We can calculate the average information value of these, taking into account the number of instances that go down each branch (five down the first and third, and four down the second):

info([2,3],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971
                        = 0.693 bits

This average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure in Figure 4.2(a).
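As a check on these figures, here is a minimal Python sketch (not from the book) of the calculation: info gives the information value of a single node's class counts, and weighted_info the branch-size-weighted average just computed.

from math import log2

def info(counts):
    # Information value, in bits, of a node with the given class counts, e.g. [2, 3].
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def weighted_info(branches):
    # Average information over branches, weighted by how many instances reach each one.
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * info(b) for b in branches)

branches = [[2, 3], [4, 0], [3, 2]]            # outlook = sunny, overcast, rainy
print([round(info(b), 3) for b in branches])   # [0.971, 0.0, 0.971]
print(weighted_info(branches))                 # 0.6935..., the 0.693 bits quoted above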




[Figure 4.2 Tree stumps for the weather data: (a) outlook, (b) temperature, (c) humidity, (d) windy. Each leaf shows the yes and no instances that reach that branch.]




