attributes in the decision procedure, making a careful selection of which ones
to use. Chapter 7 shows how.
The normal-distribution assumption for numeric attributes is another
restriction on Naïve Bayes as we have formulated it here. Many features simply
aren’t normally distributed. However, there is nothing to prevent us from using
other distributions for the numeric attributes: there is nothing magic about the
normal distribution. If you know that a particular attribute is likely to follow
some other distribution, standard estimation procedures for that distribution
can be used instead. If you suspect it isn’t normal but don’t know the actual
distribution, there are procedures for “kernel density estimation” that do not
assume any particular distribution for the attribute values. Another possibility
is simply to discretize the data first.
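For instance, a class-conditional density for a numeric attribute can be estimated with Gaussian kernels rather than a single fitted normal. The sketch below is illustrative only; the function name, the bandwidth value, and the sample readings are our own choices, not taken from the text:

```python
import math

def kernel_density(x, samples, bandwidth):
    """Gaussian kernel density estimate of p(x) from observed samples.

    Instead of fitting one normal distribution to the attribute,
    place a small Gaussian kernel on every training value and
    average them; no particular overall shape is assumed.
    """
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - s) / bandwidth) for s in samples) / (len(samples) * bandwidth)

# Hypothetical temperature readings for one class of a training set
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
likelihood = kernel_density(66.0, temps_yes, bandwidth=5.0)
```

The bandwidth plays the role that the standard deviation plays for the normal distribution: it controls how smooth the estimated density is, and in practice it would itself be chosen from the data.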
4.3 Divide-and-conquer: Constructing decision trees
The problem of constructing a decision tree can be expressed recursively. First,
select an attribute to place at the root node and make one branch for each possible value. This splits up the example set into subsets, one for every value of
the attribute. Now the process can be repeated recursively for each branch, using
only those instances that actually reach the branch. If at any time all instances
at a node have the same classification, stop developing that part of the tree.
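As a rough sketch (not the book's code), the recursion might look like this in Python, with the attribute-selection criterion left as a parameter to be filled in by a purity measure:

```python
def build_tree(instances, attributes, choose_attribute):
    """Divide-and-conquer construction of a decision tree.

    instances: list of (attribute_value_dict, class_label) pairs.
    choose_attribute: the selection criterion, e.g. purity-based.
    Returns a class label (leaf) or a dict {attr: {value: subtree}}.
    """
    classes = {label for _, label in instances}
    if len(classes) == 1:            # all instances agree: stop splitting
        return classes.pop()
    if not attributes:               # nothing left to split on: majority vote
        labels = [label for _, label in instances]
        return max(set(labels), key=labels.count)
    attr = choose_attribute(instances, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for values, label in instances:  # one subset per observed value
        branches.setdefault(values[attr], []).append((values, label))
    return {attr: {v: build_tree(subset, remaining, choose_attribute)
                   for v, subset in branches.items()}}

# Tiny made-up example; the trivial criterion just takes the first attribute
data = [({"outlook": "sunny", "windy": "false"}, "no"),
        ({"outlook": "overcast", "windy": "true"}, "yes"),
        ({"outlook": "overcast", "windy": "false"}, "yes")]
tree = build_tree(data, ["outlook", "windy"], lambda inst, attrs: attrs[0])
```

The interesting question, taken up next, is what to use for `choose_attribute` in place of the trivial criterion above.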
The only thing left to decide is how to determine which attribute to split on,
given a set of examples with different classes. Consider (again!) the weather data.
There are four possibilities for each split, and at the top level they produce trees
such as those in Figure 4.2. Which is the best choice? The numbers of yes and no classes are shown at the leaves. Any leaf with only one class, yes or no, will not have to be split further, and the recursive process down that branch will terminate. Because we seek small trees, we would like this to happen as soon as possible. If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. Take a moment to look at Figure 4.2 and ponder which attribute you think is the best choice.
The measure of purity that we will use is called the information and is measured in units called bits. Associated with a node of the tree, it represents the
expected amount of information that would be needed to specify whether a new
instance should be classified yes or no, given that the example reached that node.
Unlike the bits in computer memory, the expected amount of information
usually involves fractions of a bit—and is often less than one! We calculate it
based on the number of yes and no classes at the node; we will look at the details
of the calculation shortly. But first let’s see how it’s used. When evaluating the
first tree in Figure 4.2, the numbers of yes and no classes at the leaf nodes are
[2,3], [4,0], and [3,2], respectively, and the information values of these nodes are:

info([2,3]) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits

We can calculate the average information value of these, taking into account the number of instances that go down each branch (five down the first and third and four down the second):

info([2,3],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

This average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure in Figure 4.2(a).
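These figures can be reproduced from the class counts alone. The following Python fragment is an illustrative sketch, not code from the book; it computes the information value of a node and the size-weighted average across leaves:

```python
import math

def info(counts):
    """Information (entropy) in bits for a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_info(leaf_counts):
    """Average information over leaves, weighted by leaf size."""
    total = sum(sum(c) for c in leaf_counts)
    return sum(sum(c) / total * info(c) for c in leaf_counts)

info([2, 3])                             # → 0.9709..., the 0.971 bits above
info([4, 0])                             # → 0.0: a pure node needs no information
weighted_info([[2, 3], [4, 0], [3, 2]])  # → 0.6935..., the 0.693 bits above
```

Note that counts of zero are skipped in the sum, since a class that never occurs contributes nothing to the information.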
Figure 4.2 Tree stumps for the weather data: (a) outlook, with leaves sunny [2 yes, 3 no], overcast [4 yes, 0 no], rainy [3 yes, 2 no]; (b) temperature, with leaves hot [2 yes, 2 no], mild [4 yes, 2 no], cool [3 yes, 1 no]; (c) humidity, with leaves high [3 yes, 4 no], normal [6 yes, 1 no]; (d) windy, with leaves false [6 yes, 2 no], true [3 yes, 3 no].