of a document. Instead, a
document can be viewed as a bag of words—a set that
contains all the words in the document, with multiple occurrences of a word
appearing multiple times (technically, a set includes each of its members just
once, whereas a bag can have repeated elements). Word frequencies can be
accommodated by applying a modified form of Naïve Bayes that is sometimes
described as multinomial Naïve Bayes.
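For instance, the bag-of-words representation of a short document can be built with Python's Counter (an illustrative sketch; the sample sentence is my own, not from the text):

```python
from collections import Counter

# A bag (multiset) keeps duplicate words that a plain set would collapse.
doc = "the cat sat on the mat and the dog sat too"
bag = Counter(doc.split())

# bag["the"] is 3 and bag["sat"] is 2, whereas set(doc.split())
# records each of those words only once.
```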
Suppose n1, n2, . . . , nk is the number of times word i occurs in the document, and P1, P2, . . . , Pk is the probability of obtaining word i when sampling from all the documents in category H. Assume that the probability is independent of the word's context and position in the document. These assumptions lead to a multinomial distribution for document probabilities. For this distribution, the probability of a document E given its class H—in other words, the formula for computing the probability Pr[E|H] in Bayes's rule—is

Pr[E|H] ≈ N! × (P1^n1 / n1!) × (P2^n2 / n2!) × . . . × (Pk^nk / nk!)

where N = n1 + n2 + . . . + nk is the number of words in the document. The reason for the factorials is to account for the fact that the ordering of the occurrences of each word is immaterial according to the bag-of-words model. Pi is estimated by computing the relative frequency of word i in the text of all training documents pertaining to category H. In reality there should be a further term that gives the probability that the model for category H generates a document whose length is the same as the length of E (that is why we use the symbol ≈ instead of =), but it is common to assume that this is the same for all classes and hence can be dropped.

For example, suppose there are only the two words, yellow and blue, in the vocabulary, and a particular document class H has Pr[yellow|H] = 75% and Pr[blue|H] = 25% (you might call H the class of yellowish green documents). Suppose E is the document blue yellow blue with a length of N = 3 words. There are four possible bags of three words. One is {yellow yellow yellow}, and its probability according to the preceding formula is

Pr[{yellow yellow yellow}|H] ≈ 3! × (0.75^3 / 3!) × (0.25^0 / 0!) = 27/64

The other three, with their probabilities, are

Pr[{blue blue blue}|H] = 1/64
Pr[{yellow yellow blue}|H] = 27/64
Pr[{yellow blue blue}|H] = 9/64
Here,
E corresponds to the last case (recall that in a bag of words the order is
immaterial); thus its probability of being generated by the yellowish green
document model is 9/64, or 14%. Suppose another class, very bluish green
documents (call it H′), has Pr[yellow|H′] = 10%, Pr[blue|H′] = 90%. The
probability that E is generated by this model is 24%.
If these are the only two classes, does that mean that E is in the very bluish
green document class? Not necessarily. Bayes’s rule, given earlier, says that you
have to take into account the prior probability of each hypothesis. If you know
that in fact very bluish green documents are twice as rare as yellowish green ones,
this would be just sufficient to outweigh the preceding 14% to 24% disparity
and tip the balance in favor of the yellowish green class.
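These numbers can be checked with a few lines of Python (a minimal sketch; the function name and structure are my own, not from the text):

```python
from math import factorial, prod

def multinomial_likelihood(counts, probs):
    """Pr[E|H] ~= N! * prod(P_i^n_i / n_i!) under the bag-of-words model
    (the constant document-length term is dropped, as in the text)."""
    N = sum(counts)
    return factorial(N) * prod(p**n / factorial(n) for n, p in zip(counts, probs))

# E = "blue yellow blue": one yellow, two blues.
e = [1, 2]

# Yellowish green class H: Pr[yellow|H] = 0.75, Pr[blue|H] = 0.25.
p_e_given_h = multinomial_likelihood(e, [0.75, 0.25])   # 9/64, about 14%

# Very bluish green class H': Pr[yellow|H'] = 0.10, Pr[blue|H'] = 0.90.
p_e_given_h2 = multinomial_likelihood(e, [0.10, 0.90])  # 0.243, about 24%

# With very bluish green documents twice as rare (priors 2/3 vs. 1/3),
# the prior-weighted score tips in favor of the yellowish green class.
score_h = (2 / 3) * p_e_given_h
score_h2 = (1 / 3) * p_e_given_h2
```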
The factorials in the preceding probability formula don’t actually need to be
computed because—being the same for every class—they drop out in the nor-
malization process anyway. However, the formula still involves multiplying
together many small probabilities, which soon yields extremely small numbers
that cause underflow on large documents. The problem can be avoided by using
logarithms of the probabilities instead of the probabilities themselves.
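The effect is easy to demonstrate (an illustrative sketch, not code from the text; the per-word probabilities are invented):

```python
from math import log, prod

# Hypothetical per-word probabilities for a 2,000-word document:
# their product, around 1e-4000, is far below the smallest double.
word_probs = [0.01] * 2000

direct = prod(word_probs)                    # underflows to 0.0
log_score = sum(log(p) for p in word_probs)  # fine: 2000 * log(0.01)
```

Because the logarithm is monotonic, comparing log scores across classes picks the same winner as comparing the raw probabilities would.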
In the multinomial Naïve Bayes formulation a document’s class is determined
not just by the words that occur in it but also by the number of times they occur.
In general it performs better than the ordinary Naïve Bayes model for docu-
ment classification, particularly for large dictionary sizes.
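Putting the pieces together, a toy version of such a classifier might look like the following (a sketch under my own simplifying assumptions: whitespace tokenization, Laplace smoothing, equal class priors; none of these names come from the text):

```python
from collections import Counter
from math import log

def train(docs_by_class):
    """Estimate per-class word probabilities as relative frequencies,
    with add-one (Laplace) smoothing so unseen words get nonzero probability."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d.split()}
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        model[cls] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return model

def classify(model, doc):
    """Score each class by summing log word probabilities (the factorials
    and an assumed-equal prior drop out) and return the best class."""
    vocab = next(iter(model.values()))
    words = [w for w in doc.split() if w in vocab]  # skip unknown words
    return max(model, key=lambda cls: sum(log(model[cls][w]) for w in words))

# Tiny corpus echoing the yellow/blue example above.
model = train({
    "yellowish green": ["yellow yellow yellow blue"],
    "very bluish green": ["blue blue blue yellow"],
})
```

Note that each occurrence of a word contributes its own log-probability term, so the score depends on word counts, not just word presence.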
Discussion
Naïve Bayes gives a simple approach, with clear semantics, to representing,
using, and learning probabilistic knowledge. Impressive results can be achieved
using it. It has often been shown that Naïve Bayes rivals, and indeed outper-
forms, more sophisticated classifiers on many datasets. The moral is, always try
the simple things first. Repeatedly in machine learning people have eventually,
after an extended struggle, obtained good results using sophisticated learning
methods only to discover years later that simple methods such as 1R and Naïve
Bayes do just as well—or even better.
There are many datasets for which Naïve Bayes does not do so well, however,
and it is easy to see why. Because attributes are treated as though they were com-
pletely independent, the addition of redundant ones skews the learning process.
As an extreme example, if you were to include a new attribute with the same
values as temperature to the weather data, the effect of the temperature attri-
bute would be multiplied: all of its probabilities would be squared, giving it a
great deal more influence in the decision. If you were to add 10 such attributes,
then the decisions would effectively be made on temperature alone. Dependen-
cies between attributes inevitably reduce the power of Naïve Bayes to discern
what is going on. They can, however, be ameliorated by using a subset of the