of a document. Instead, a
document can be viewed as a bag of words—a set that
contains all the words in the document, with multiple occurrences of a word
appearing multiple times (technically, a set includes each of its members just
once, whereas a bag can have repeated elements). Word frequencies can be
accommodated by applying a modified form of Naïve Bayes that is sometimes
described as multinomial Naïve Bayes.
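For instance, the bag-of-words representation of a short document can be built with Python's Counter (an illustrative sketch; the sample sentence is my own, not from the text):

```python
from collections import Counter

# A bag (multiset) keeps duplicate words that a plain set would collapse.
doc = "the cat sat on the mat and the dog sat too"
bag = Counter(doc.split())

# bag["the"] is 3 and bag["sat"] is 2, whereas set(doc.split())
# records each of those words only once.
```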
Suppose n1, n2, . . . , nk is the number of times word i occurs in the document, and P1, P2, . . . , Pk is the probability of obtaining word i when sampling from all the documents in category H. Assume that the probability is independent of the word's context and position in the document. These assumptions lead to a multinomial distribution for document probabilities. For this distribution, the probability of a document E given its class H—in other words, the formula for computing the probability Pr[E|H] in Bayes's rule—is

Pr[E|H] ≈ N! × (P1^n1 / n1!) × (P2^n2 / n2!) × . . . × (Pk^nk / nk!)

where N = n1 + n2 + . . . + nk is the number of words in the document. The reason for the factorials is to account for the fact that the ordering of the occurrences of each word is immaterial according to the bag-of-words model. Pi is estimated by computing the relative frequency of word i in the text of all training documents pertaining to category H. In reality there should be a further term that gives the probability that the model for category H generates a document whose length is the same as the length of E (that is why we use the symbol ≈ instead of =), but it is common to assume that this is the same for all classes and hence can be dropped.

For example, suppose there are only the two words, yellow and blue, in the vocabulary, and a particular document class H has Pr[yellow|H] = 75% and Pr[blue|H] = 25% (you might call H the class of yellowish green documents). Suppose E is the document blue yellow blue with a length of N = 3 words. There are four possible bags of three words. One is {yellow yellow yellow}, and its probability according to the preceding formula is

Pr[{yellow yellow yellow}|H] ≈ 3! × (0.75^3 / 3!) × (0.25^0 / 0!) = 27/64

The other three, with their probabilities, are

Pr[{blue blue blue}|H] = 1/64
Pr[{yellow yellow blue}|H] = 27/64
Pr[{yellow blue blue}|H] = 9/64
Here,
E corresponds to the last case (recall that in a bag of words the order is
immaterial); thus its probability of being generated by the yellowish green
document model is 9/64, or 14%. Suppose another class, very bluish green
documents (call it H′), has Pr[yellow|H′] = 10%, Pr[blue|H′] = 90%. The
probability that E is generated by this model is 24%.
If these are the only two classes, does that mean that E is in the very bluish
green document class? Not necessarily. Bayes’s rule, given earlier, says that you
have to take into account the prior probability of each hypothesis. If you know
that in fact very bluish green documents are twice as rare as yellowish green ones,
this would be just sufficient to outweigh the preceding 14% to 24% disparity
and tip the balance in favor of the yellowish green class.
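These numbers can be checked with a few lines of Python (a minimal sketch; the function name and structure are my own, not from the text):

```python
from math import factorial, prod

def multinomial_likelihood(counts, probs):
    """Pr[E|H] ~= N! * prod(P_i^n_i / n_i!) under the bag-of-words model
    (the constant document-length term is dropped, as in the text)."""
    N = sum(counts)
    return factorial(N) * prod(p**n / factorial(n) for n, p in zip(counts, probs))

# E = "blue yellow blue": one yellow, two blues.
e = [1, 2]

# Yellowish green class H: Pr[yellow|H] = 0.75, Pr[blue|H] = 0.25.
p_e_given_h = multinomial_likelihood(e, [0.75, 0.25])   # 9/64, about 14%

# Very bluish green class H': Pr[yellow|H'] = 0.10, Pr[blue|H'] = 0.90.
p_e_given_h2 = multinomial_likelihood(e, [0.10, 0.90])  # 0.243, about 24%

# With very bluish green documents twice as rare (priors 2/3 vs. 1/3),
# the prior-weighted score tips in favor of the yellowish green class.
score_h = (2 / 3) * p_e_given_h
score_h2 = (1 / 3) * p_e_given_h2
```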
The factorials in the preceding probability formula don’t actually need to be
computed because—being the same for every class—they drop out in the nor-
malization process anyway. However, the formula still involves multiplying
together many small probabilities, which soon yields extremely small numbers
that cause underflow on large documents. The problem can be avoided by using
logarithms of the probabilities instead of the probabilities themselves.
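The effect is easy to demonstrate (an illustrative sketch, not code from the text; the per-word probabilities are invented):

```python
from math import log, prod

# Hypothetical per-word probabilities for a 2,000-word document:
# their product, around 1e-4000, is far below the smallest double.
word_probs = [0.01] * 2000

direct = prod(word_probs)                    # underflows to 0.0
log_score = sum(log(p) for p in word_probs)  # fine: 2000 * log(0.01)
```

Because the logarithm is monotonic, comparing log scores across classes picks the same winner as comparing the raw probabilities would.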
In the multinomial Naïve Bayes formulation a document’s class is determined
not just by the words that occur in it but also by the number of times they occur.
In general it performs better than the ordinary Naïve Bayes model for docu-
ment classification, particularly for large dictionary sizes.
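Putting the pieces together, a toy version of such a classifier might look like the following (a sketch under my own simplifying assumptions: whitespace tokenization, Laplace smoothing, equal class priors; none of these names come from the text):

```python
from collections import Counter
from math import log

def train(docs_by_class):
    """Estimate per-class word probabilities as relative frequencies,
    with add-one (Laplace) smoothing so unseen words get nonzero probability."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d.split()}
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        model[cls] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return model

def classify(model, doc):
    """Score each class by summing log word probabilities (the factorials
    and an assumed-equal prior drop out) and return the best class."""
    vocab = next(iter(model.values()))
    words = [w for w in doc.split() if w in vocab]  # skip unknown words
    return max(model, key=lambda cls: sum(log(model[cls][w]) for w in words))

# Tiny corpus echoing the yellow/blue example above.
model = train({
    "yellowish green": ["yellow yellow yellow blue"],
    "very bluish green": ["blue blue blue yellow"],
})
```

Note that each occurrence of a word contributes its own log-probability term, so the score depends on word counts, not just word presence.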
Discussion
Naïve Bayes gives a simple approach, with clear semantics, to representing,
using, and learning probabilistic knowledge. Impressive results can be achieved
using it. It has often been shown that Naïve Bayes rivals, and indeed outper-
forms, more sophisticated classifiers on many datasets. The moral is, always try
the simple things first. Repeatedly in machine learning people have eventually,
after an extended struggle, obtained good results using sophisticated learning
methods only to discover years later that simple methods such as 1R and Naïve
Bayes do just as well—or even better.
There are many datasets for which Naïve Bayes does not do so well, however,
and it is easy to see why. Because attributes are treated as though they were com-
pletely independent, the addition of redundant ones skews the learning process.
As an extreme example, if you were to include a new attribute with the same
values as temperature to the weather data, the effect of the temperature attri-
bute would be multiplied: all of its probabilities would be squared, giving it a
great deal more influence in the decision. If you were to add 10 such attributes,
then the decisions would effectively be made on temperature alone. Dependen-
cies between attributes inevitably reduce the power of Naïve Bayes to discern
what is going on. They can, however, be ameliorated by using a subset of the