just as we calculated previously. Again, the Pr[E] in the denominator will disappear when we normalize.
This method goes by the name of Naïve Bayes, because it's based on Bayes's rule and "naïvely" assumes independence: it is only valid to multiply probabilities when the events are independent. The assumption that attributes are independent (given the class) is certainly a simplistic one in real life. But despite the disparaging name, Naïve Bayes works very well when tested on actual datasets, particularly when combined with some of the attribute selection procedures introduced in Chapter 7 that eliminate redundant, and hence nonindependent, attributes.
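As a minimal sketch of this calculation, the following Python fragment multiplies the per-attribute conditional probabilities and the class prior and then normalizes, using the weather-data fractions from the earlier worked example for the new day of Table 4.3 (outlook = sunny, temperature = cool, humidity = high, windy = true). The dictionaries and function are purely illustrative and not part of any particular package.

    # Minimal sketch of the Naive Bayes calculation (illustrative values only).
    # Each table maps an attribute value to Pr[value | class]; the numbers are the
    # weather-data fractions used in the worked example earlier in this section.
    cond_prob_yes = {"outlook=sunny": 2/9, "temperature=cool": 3/9,
                     "humidity=high": 3/9, "windy=true": 3/9}
    cond_prob_no = {"outlook=sunny": 3/5, "temperature=cool": 1/5,
                    "humidity=high": 4/5, "windy=true": 3/5}
    prior = {"yes": 9/14, "no": 5/14}

    def likelihood(cond_prob, class_label, instance):
        """Multiply the class prior by the per-attribute conditional probabilities."""
        result = prior[class_label]
        for attribute_value in instance:
            result *= cond_prob[attribute_value]
        return result

    instance = ["outlook=sunny", "temperature=cool", "humidity=high", "windy=true"]
    like_yes = likelihood(cond_prob_yes, "yes", instance)
    like_no = likelihood(cond_prob_no, "no", instance)

    # Normalize so the two numbers sum to 1.
    total = like_yes + like_no
    print(like_yes / total, like_no / total)   # roughly 0.205 and 0.795

Because the final step divides each likelihood by their sum, the Pr[E] term cancels out, just as noted above.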
One thing that can go wrong with Naïve Bayes is that if a particular attribute value does not occur in the training set in conjunction with every class value, things go badly awry. Suppose in the example that the training data was different and the attribute value outlook = sunny had always been associated with the outcome no. Then the probability of outlook = sunny given a yes, that is, Pr[outlook = sunny | yes], would be zero, and because the other probabilities are multiplied by this the final probability of yes would be zero no matter how large they were. Probabilities that are zero hold a veto over the other ones. This is not a good idea. But the bug is easily fixed by minor adjustments to the method of calculating probabilities from frequencies.
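For instance, taking the new day of Table 4.3 with the altered training data just described, the factor for outlook = sunny given yes becomes zero and vetoes everything else in the product (the other counts would shift slightly in such a dataset, but the zero alone is enough):

\[
\text{likelihood of yes} = 0 \times \Pr[\text{temperature = cool} \mid \text{yes}] \times \Pr[\text{humidity = high} \mid \text{yes}] \times \Pr[\text{windy = true} \mid \text{yes}] \times \Pr[\text{yes}] = 0.
\]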
For example, the upper part of Table 4.2 shows that for play = yes, outlook is sunny for two examples, overcast for four, and rainy for three, and the lower part gives these events probabilities of 2/9, 4/9, and 3/9, respectively. Instead, we could add 1 to each numerator and compensate by adding 3 to the denominator, giving probabilities of 3/12, 5/12, and 4/12, respectively. This will ensure that an attribute value that occurs zero times receives a probability which is nonzero, albeit small. The strategy of adding 1 to each count is a standard technique called the Laplace estimator after the great eighteenth-century French mathematician Pierre Laplace. Although it works well in practice, there is no particular reason for adding 1 to the counts: we could instead choose a small constant m and use
\[
\frac{2 + m/3}{9 + m}, \qquad \frac{4 + m/3}{9 + m}, \qquad \text{and} \qquad \frac{3 + m/3}{9 + m}.
\]

The value of m, which was set to 3, effectively provides a weight that determines how influential the a priori values of 1/3, 1/3, and 1/3 are for each of the three possible attribute values. A large m says that these priors are very important compared with the new evidence coming in from the training set, whereas a small one gives them less influence. Finally, there is no particular reason for dividing m into three equal parts in the numerators: we could use

\[
\frac{2 + m p_1}{9 + m}, \qquad \frac{4 + m p_2}{9 + m}, \qquad \text{and} \qquad \frac{3 + m p_3}{9 + m}
\]
instead, where $p_1$, $p_2$, and $p_3$ sum to 1. Effectively, these three numbers are a priori probabilities of the values of the outlook attribute being sunny, overcast, and rainy, respectively.
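As a sketch of how these smoothed estimates could be computed, the following Python fragment implements the general form just described; adding 1 to each of the three outlook counts corresponds to m = 3 with equal priors. The function name, the default value of m, and the unequal priors in the second call are illustrative assumptions, not taken from the text.

    def smoothed_probabilities(counts, m=3.0, priors=None):
        """Turn raw frequency counts into probabilities using the (count + m*prior) / (total + m) rule.

        counts -- counts for each attribute value within one class
        m      -- weight given to the priors
        priors -- a priori probabilities p_1, ..., p_k summing to 1; defaults to equal shares
        """
        k = len(counts)
        if priors is None:
            priors = [1.0 / k] * k
        total = sum(counts)
        return [(c + m * p) / (total + m) for c, p in zip(counts, priors)]

    # Outlook counts for play = yes from Table 4.2: sunny 2, overcast 4, rainy 3.
    print(smoothed_probabilities([2, 4, 3]))                  # 3/12, 5/12, 4/12 as in the text
    print(smoothed_probabilities([2, 4, 3], m=2.0,
                                 priors=[0.5, 0.3, 0.2]))     # unequal priors example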
This is now a fully Bayesian formulation where prior probabilities have been assigned to everything in sight. It has the advantage of being completely rigorous, but the disadvantage that it is not usually clear just how these prior probabilities should be assigned. In practice, the prior probabilities make little difference provided that there are a reasonable number of training instances, and people generally just estimate frequencies using the Laplace estimator by initializing all counts to one instead of to zero.
Missing values and numeric attributes
One of the really nice things about the Bayesian formulation is that missing values are no problem at all. For example, if the value of outlook were missing in the example of Table 4.3, the calculation would simply omit this attribute, yielding

\[
\text{likelihood of yes} = \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.0238
\]
\[
\text{likelihood of no} = \frac{1}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.0343
\]
These two numbers are individually a lot higher than they were before, because one of the fractions is missing. But that's not a problem because a fraction is missing in both cases, and these likelihoods are subject to a further normalization process. This yields probabilities for yes and no of 41% and 59%, respectively.
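Spelling out the normalization arithmetic, each likelihood is simply divided by their sum:

\[
\Pr[\text{yes}] = \frac{0.0238}{0.0238 + 0.0343} \approx 0.41, \qquad
\Pr[\text{no}] = \frac{0.0343}{0.0238 + 0.0343} \approx 0.59.
\]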
If a value is missing in a training instance, it is simply not included in the
frequency counts, and the probability ratios are based on the number of values
that actually occur rather than on the total number of instances.
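A sketch of the corresponding counting step, assuming missing values are represented by None in hypothetical (instance, class) training pairs:

    from collections import defaultdict

    def count_attribute_values(training_pairs, attribute):
        """Count attribute values per class, skipping instances where the value is missing."""
        counts = defaultdict(lambda: defaultdict(int))
        for instance, class_label in training_pairs:
            value = instance.get(attribute)
            if value is not None:          # missing values simply do not contribute to the counts
                counts[class_label][value] += 1
        return counts

    # One instance has a missing outlook, so the yes counts are based on one value, not two.
    data = [({"outlook": "sunny"}, "no"),
            ({"outlook": None}, "yes"),
            ({"outlook": "rainy"}, "yes")]
    result = count_attribute_values(data, "outlook")
    print({c: dict(v) for c, v in result.items()})   # {'no': {'sunny': 1}, 'yes': {'rainy': 1}}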
Numeric values are usually handled by assuming that they have a "normal" or "Gaussian" probability distribution. Table 4.4 gives a summary of the weather data with numeric features from Table 1.3. For nominal attributes, we calculated counts as before, and for numeric ones we simply listed the values that occur. Then, whereas we normalized the counts for the nominal attributes into probabilities, we calculated the mean and standard deviation for each class and each numeric attribute. Thus the mean value of temperature over the yes instances is 73, and its standard deviation is 6.2. The mean is simply the average of the preceding values, that is, the sum divided by the number of values. The standard deviation is the square root of the sample variance, which we can calculate as follows: subtract the mean from each value, square the result, sum them together, and then divide by one less than the number of values. After we have found this sample variance, we take its square root to determine the standard deviation. This is the standard way of calculating the mean and standard deviation of a set of samples.
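A quick sketch of this calculation in Python, assuming the nine temperature values listed for the yes class in Table 4.4, reproduces the figures quoted above:

    import math

    # Temperature values for the play = yes instances (as listed for the weather data).
    temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]

    n = len(temps_yes)
    mean = sum(temps_yes) / n                                   # 73.0

    # Sample variance: squared deviations summed, then divided by n - 1.
    variance = sum((x - mean) ** 2 for x in temps_yes) / (n - 1)
    std_dev = math.sqrt(variance)                               # about 6.16, i.e. 6.2 to one decimal

    print(mean, round(std_dev, 1))                              # 73.0 6.2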