Data Mining: Practical Machine Learning Tools and Techniques, Second Edition



set of numbers (the “one less than” is to do with the number of degrees of
freedom in the sample, a statistical notion that we don’t want to get into here).
The probability density function for a normal distribution with mean μ and
standard deviation σ is given by the rather formidable expression:

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

But fear not! All this means is that if we are considering a yes outcome when
temperature has a value, say, of 66, we just need to plug x = 66, μ = 73, and
σ = 6.2 into the formula. So the value of the probability density function is

\[ f(\text{temperature} = 66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi} \cdot 6.2} \, e^{-\frac{(66-73)^2}{2 \cdot 6.2^2}} = 0.0340 \]

By the same token, the probability density of a yes outcome when humidity has
a value, say, of 90 is calculated in the same way:

\[ f(\text{humidity} = 90 \mid \text{yes}) = 0.0221 \]

The probability density function for an event is very closely related to its
probability. However, it is not quite the same thing. If temperature is a
continuous scale, the probability of the temperature being exactly 66, or
exactly any other value such as 63.14159262, is zero. The real meaning of the
density function f(x) is that the probability that the quantity lies within a
small region around x, say, between x − ε/2 and x + ε/2, is ε f(x). What we
have written above is correct if temperature is measured to the nearest degree
and humidity is measured to the nearest percentage point. You might think we
ought to factor in the accuracy figure ε when using these probabilities, but
that’s not necessary. The same ε would appear in both the yes and no
likelihoods that follow and cancel out when the probabilities were calculated.

Table 4.4  The numeric weather data with summary statistics.

Attribute     Value / statistic   yes     no
Outlook       sunny               2/9     3/5
              overcast            4/9     0/5
              rainy               3/9     2/5
Temperature   mean                73      74.6
              std. dev.           6.2     7.9
Humidity      mean                79.1    86.2
              std. dev.           10.2    9.7
Windy         false               6/9     2/5
              true                3/9     3/5
Play                              9/14    5/14
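As a numerical check on the discussion above, the following Python sketch evaluates the normal density with the Table 4.4 statistics, compares ε f(x) with the exact probability of an interval of width ε = 1 (temperature measured to the nearest degree), and illustrates that ε cancels when the yes and no likelihoods are normalized. The function names (`normal_density`, `normal_cdf`, `probability_of_yes`) are illustrative choices rather than anything defined in the text, and the example day (outlook sunny, temperature 66, humidity 90, windy true) is assumed for the cancellation check.

```python
import math

def normal_density(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def normal_cdf(x, mean, std):
    """Cumulative distribution function, computed from the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

# Densities for temperature = 66 and humidity = 90, using the means and
# standard deviations from Table 4.4.
f_temp_yes = normal_density(66, mean=73, std=6.2)
f_hum_yes = normal_density(90, mean=79.1, std=10.2)
f_temp_no = normal_density(66, mean=74.6, std=7.9)
f_hum_no = normal_density(90, mean=86.2, std=9.7)
print(f"f(temperature=66 | yes) = {f_temp_yes:.4f}")
print(f"f(humidity=90 | yes)    = {f_hum_yes:.4f}")

# Probability that temperature lies within eps/2 of 66, versus eps * f(66).
eps = 1.0
exact = normal_cdf(66 + eps / 2, 73, 6.2) - normal_cdf(66 - eps / 2, 73, 6.2)
approx = eps * f_temp_yes
print(f"exact interval probability = {exact:.5f}, eps * f(x) = {approx:.5f}")

def probability_of_yes(eps):
    """Normalized probability of yes for the assumed day (sunny, 66, 90, windy = true).

    Each numeric density is multiplied by eps to turn it into a probability;
    the nominal attributes use the fractions from Table 4.4.
    """
    like_yes = (2 / 9) * (eps * f_temp_yes) * (eps * f_hum_yes) * (3 / 9) * (9 / 14)
    like_no = (3 / 5) * (eps * f_temp_no) * (eps * f_hum_no) * (3 / 5) * (5 / 14)
    return like_yes / (like_yes + like_no)

# The same factor eps**2 multiplies both likelihoods, so it drops out of the ratio.
print(probability_of_yes(1.0), probability_of_yes(0.01))
```

The two values printed on the last line agree up to floating-point rounding, which is why the accuracy figure can be left out of the calculation.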

Table 4.5  Another new day.

Outlook   Temperature   Humidity   Windy   Play
sunny     66            90         true    ?

Using these probabilities for the new day in Table 4.5 yields

\[ \text{likelihood of yes} = \frac{2}{9} \times 0.0340 \times 0.0221 \times \frac{3}{9} \times \frac{9}{14} = 0.000036 \]

\[ \text{likelihood of no} = \frac{3}{5} \times 0.0279 \times 0.0381 \times \frac{3}{5} \times \frac{5}{14} = 0.000137 \]

Here 0.0279 and 0.0381 are the densities of temperature = 66 and humidity = 90
under the no statistics of Table 4.4 (means 74.6 and 86.2, standard deviations
7.9 and 9.7). This leads to probabilities

\[ \text{Probability of yes} = \frac{0.000036}{0.000036 + 0.000137} = 20.8\% \]

\[ \text{Probability of no} = \frac{0.000137}{0.000036 + 0.000137} = 79.2\% \]

These figures are very close to the probabilities calculated earlier for the new
day in Table 4.3, because the temperature and humidity values of 66 and 90 yield
similar probabilities to the cool and high values used before.

The normal-distribution assumption makes it easy to extend the Naïve Bayes
classifier to deal with numeric attributes. If the values of any numeric
attributes are missing, the mean and standard deviation calculations are based
only on the ones that are present.

Bayesian models for document classification

One important domain for machine learning is document classification, in
which each instance represents a document and the instance’s class is the
document’s topic. Documents might be news items and the classes might be
domestic news, overseas news, financial news, and sport. Documents are
characterized by the words that appear in them, and one way to apply machine
learning to document classification is to treat the presence or absence of each
word as a Boolean attribute. Naïve Bayes is a popular technique for this
application because it is very fast and quite accurate.

However, this does not take into account the number of occurrences of each
word, which is potentially useful information when determining the category
of a document.
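To make the Boolean-attribute representation concrete, here is a minimal Python sketch. The vocabulary and the two toy documents are invented for illustration and do not come from the text, and `tokenize` and `to_boolean_attributes` are hypothetical helper names. The sketch shows that two documents with quite different word counts can map to the same attribute vector, which is exactly the information that a presence-or-absence encoding discards.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_boolean_attributes(text, vocabulary):
    """Map a document to one Boolean attribute per vocabulary word (present or absent)."""
    words = set(tokenize(text))
    return [word in words for word in vocabulary]

# Toy vocabulary and documents, invented for illustration only.
vocabulary = ["bank", "goal", "election", "market"]
doc_a = "The market fell as the bank cut rates."
doc_b = ("Bank shares led the market; market analysts expect "
         "the bank to recover and the market to rally.")

vec_a = to_boolean_attributes(doc_a, vocabulary)
vec_b = to_boolean_attributes(doc_b, vocabulary)

print(vec_a)           # [True, False, False, True]
print(vec_b)           # [True, False, False, True]
print(vec_a == vec_b)  # True: the word counts differ, the Boolean vectors do not
```

In `doc_b` the word "market" occurs three times and "bank" twice, yet both documents present the same four attribute values to the classifier.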
