just as we calculated previously. Again, the Pr[E] in the denominator will disappear when we normalize.
This method goes by the name of Naïve Bayes, because it's based on Bayes's rule and "naïvely" assumes independence: it is only valid to multiply probabilities when the events are independent. The assumption that attributes are independent (given the class) is certainly a simplistic one in real life. But despite the disparaging name, Naïve Bayes works very well when tested on actual datasets, particularly when combined with some of the attribute selection procedures introduced in Chapter 7 that eliminate redundant, and hence nonindependent, attributes.
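As a minimal sketch of this calculation, the following Python fragment multiplies the per-attribute conditional probabilities and the class prior and then normalizes, using the weather-data fractions from the earlier worked example for the new day of Table 4.3 (outlook = sunny, temperature = cool, humidity = high, windy = true). The dictionaries and function are purely illustrative and not part of any particular package.

    # Minimal sketch of the Naive Bayes calculation (illustrative values only).
    # Each table maps an attribute value to Pr[value | class]; the numbers are the
    # weather-data fractions used in the worked example earlier in this section.
    cond_prob_yes = {"outlook=sunny": 2/9, "temperature=cool": 3/9,
                     "humidity=high": 3/9, "windy=true": 3/9}
    cond_prob_no = {"outlook=sunny": 3/5, "temperature=cool": 1/5,
                    "humidity=high": 4/5, "windy=true": 3/5}
    prior = {"yes": 9/14, "no": 5/14}

    def likelihood(cond_prob, class_label, instance):
        """Multiply the class prior by the per-attribute conditional probabilities."""
        result = prior[class_label]
        for attribute_value in instance:
            result *= cond_prob[attribute_value]
        return result

    instance = ["outlook=sunny", "temperature=cool", "humidity=high", "windy=true"]
    like_yes = likelihood(cond_prob_yes, "yes", instance)
    like_no = likelihood(cond_prob_no, "no", instance)

    # Normalize so the two numbers sum to 1.
    total = like_yes + like_no
    print(like_yes / total, like_no / total)   # roughly 0.205 and 0.795

Because the final step divides each likelihood by their sum, the Pr[E] term cancels out, just as noted above.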
One thing that can go wrong with Naïve Bayes is that if a particular attribute value does not occur in the training set in conjunction with every class value, things go badly awry. Suppose in the example that the training data was different and the attribute value outlook = sunny had always been associated with the outcome no. Then the probability of outlook = sunny given a yes, that is, Pr[outlook = sunny | yes], would be zero, and because the other probabilities are multiplied by this the final probability of yes would be zero no matter how large they were. Probabilities that are zero hold a veto over the other ones. This is not a good idea. But the bug is easily fixed by minor adjustments to the method of calculating probabilities from frequencies.
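For instance, taking the new day of Table 4.3 with the altered training data just described, the factor for outlook = sunny given yes becomes zero and vetoes everything else in the product (the other counts would shift slightly in such a dataset, but the zero alone is enough):

\[
\text{likelihood of yes} = 0 \times \Pr[\text{temperature = cool} \mid \text{yes}] \times \Pr[\text{humidity = high} \mid \text{yes}] \times \Pr[\text{windy = true} \mid \text{yes}] \times \Pr[\text{yes}] = 0.
\]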
For example, the upper part of Table 4.2 shows that for play = yes, outlook is sunny for two examples, overcast for four, and rainy for three, and the lower part gives these events probabilities of 2/9, 4/9, and 3/9, respectively. Instead, we could add 1 to each numerator and compensate by adding 3 to the denominator, giving probabilities of 3/12, 5/12, and 4/12, respectively. This will ensure that an attribute value that occurs zero times receives a probability which is nonzero, albeit small. The strategy of adding 1 to each count is a standard technique called the Laplace estimator after the great eighteenth-century French mathematician Pierre Laplace. Although it works well in practice, there is no particular reason for adding 1 to the counts: we could instead choose a small constant m and use
\[
\frac{2 + m/3}{9 + m}, \qquad \frac{4 + m/3}{9 + m}, \qquad \text{and} \qquad \frac{3 + m/3}{9 + m}.
\]

The value of m, which was set to 3, effectively provides a weight that determines how influential the a priori values of 1/3, 1/3, and 1/3 are for each of the three possible attribute values. A large m says that these priors are very important compared with the new evidence coming in from the training set, whereas a small one gives them less influence. Finally, there is no particular reason for dividing m into three equal parts in the numerators: we could use

\[
\frac{2 + m p_1}{9 + m}, \qquad \frac{4 + m p_2}{9 + m}, \qquad \text{and} \qquad \frac{3 + m p_3}{9 + m}
\]
instead, where $p_1$, $p_2$, and $p_3$ sum to 1. Effectively, these three numbers are a priori probabilities of the values of the outlook attribute being sunny, overcast, and rainy, respectively.
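As a sketch of how these smoothed estimates could be computed, the following Python fragment implements the general form just described; adding 1 to each of the three outlook counts corresponds to m = 3 with equal priors. The function name, the default value of m, and the unequal priors in the second call are illustrative assumptions, not taken from the text.

    def smoothed_probabilities(counts, m=3.0, priors=None):
        """Turn raw frequency counts into probabilities using the (count + m*prior) / (total + m) rule.

        counts -- counts for each attribute value within one class
        m      -- weight given to the priors
        priors -- a priori probabilities p_1, ..., p_k summing to 1; defaults to equal shares
        """
        k = len(counts)
        if priors is None:
            priors = [1.0 / k] * k
        total = sum(counts)
        return [(c + m * p) / (total + m) for c, p in zip(counts, priors)]

    # Outlook counts for play = yes from Table 4.2: sunny 2, overcast 4, rainy 3.
    print(smoothed_probabilities([2, 4, 3]))                  # 3/12, 5/12, 4/12 as in the text
    print(smoothed_probabilities([2, 4, 3], m=2.0,
                                 priors=[0.5, 0.3, 0.2]))     # unequal priors example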
This is now a fully Bayesian formulation where prior probabilities have been assigned to everything in sight. It has the advantage of being completely rigorous, but the disadvantage that it is not usually clear just how these prior probabilities should be assigned. In practice, the prior probabilities make little difference provided that there are a reasonable number of training instances, and people generally just estimate frequencies using the Laplace estimator by initializing all counts to one instead of to zero.
Missing values and numeric attributes
One of the really nice things about the Bayesian formulation is that missing values are no problem at all. For example, if the value of outlook were missing in the example of Table 4.3, the calculation would simply omit this attribute, yielding

\[
\text{likelihood of yes} = \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.0238
\]
\[
\text{likelihood of no} = \frac{1}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.0343
\]
These two numbers are individually a lot higher than they were before, because one of the fractions is missing. But that's not a problem because a fraction is missing in both cases, and these likelihoods are subject to a further normalization process. This yields probabilities for yes and no of 41% and 59%, respectively.
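Spelling out the normalization arithmetic, each likelihood is simply divided by their sum:

\[
\Pr[\text{yes}] = \frac{0.0238}{0.0238 + 0.0343} \approx 0.41, \qquad
\Pr[\text{no}] = \frac{0.0343}{0.0238 + 0.0343} \approx 0.59.
\]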
If a value is missing in a training instance, it is simply not included in the
frequency counts, and the probability ratios are based on the number of values
that actually occur rather than on the total number of instances.
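A sketch of the corresponding counting step, assuming missing values are represented by None in hypothetical (instance, class) training pairs:

    from collections import defaultdict

    def count_attribute_values(training_pairs, attribute):
        """Count attribute values per class, skipping instances where the value is missing."""
        counts = defaultdict(lambda: defaultdict(int))
        for instance, class_label in training_pairs:
            value = instance.get(attribute)
            if value is not None:          # missing values simply do not contribute to the counts
                counts[class_label][value] += 1
        return counts

    # One instance has a missing outlook, so the yes counts are based on one value, not two.
    data = [({"outlook": "sunny"}, "no"),
            ({"outlook": None}, "yes"),
            ({"outlook": "rainy"}, "yes")]
    result = count_attribute_values(data, "outlook")
    print({c: dict(v) for c, v in result.items()})   # {'no': {'sunny': 1}, 'yes': {'rainy': 1}}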
Numeric values are usually handled by assuming that they have a "normal" or "Gaussian" probability distribution. Table 4.4 gives a summary of the weather data with numeric features from Table 1.3. For nominal attributes, we calculated counts as before, and for numeric ones we simply listed the values that occur. Then, whereas we normalized the counts for the nominal attributes into probabilities, we calculated the mean and standard deviation for each class and each numeric attribute. Thus the mean value of temperature over the yes instances is 73, and its standard deviation is 6.2. The mean is simply the average of the preceding values, that is, the sum divided by the number of values. The standard deviation is the square root of the sample variance, which we can calculate as follows: subtract the mean from each value, square the result, sum them together, and then divide by one less than the number of values. After we have found this sample variance, we take its square root to determine the standard deviation. This is the standard way of calculating the mean and standard deviation of a set of samples.
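A quick sketch of this calculation in Python, assuming the nine temperature values listed for the yes class in Table 4.4, reproduces the figures quoted above:

    import math

    # Temperature values for the play = yes instances (as listed for the weather data).
    temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]

    n = len(temps_yes)
    mean = sum(temps_yes) / n                                   # 73.0

    # Sample variance: squared deviations summed, then divided by n - 1.
    variance = sum((x - mean) ** 2 for x in temps_yes) / (n - 1)
    std_dev = math.sqrt(variance)                               # about 6.16, i.e. 6.2 to one decimal

    print(mean, round(std_dev, 1))                              # 73.0 6.2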