Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




set of numbers (the “one less than” is to do with the number of degrees of freedom in the sample, a statistical notion that we don’t want to get into here).
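The “one less than” divisor can be illustrated with a short Python sketch. It uses the nine temperature readings recorded for play = yes in Table 4.4; the function names are illustrative assumptions, not part of any particular library.

```python
import math

# Temperature readings for the days on which play = yes (Table 4.4).
temperature_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]


def sample_mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation: the sum of squared deviations is
    divided by n - 1 (one less than the number of values)."""
    n = len(values)
    mean = sample_mean(values)
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - 1))


mean_yes = sample_mean(temperature_yes)
std_yes = sample_std(temperature_yes)
print(round(mean_yes, 1), round(std_yes, 1))  # 73.0 6.2
```

The rounded results agree with the mean of 73 and standard deviation of 6.2 listed for temperature under yes in Table 4.4.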

The probability density function for a normal distribution with mean μ and standard deviation σ is given by the rather formidable expression:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

But fear not! All this means is that if we are considering a yes outcome when temperature has a value, say, of 66, we just need to plug x = 66, μ = 73, and σ = 6.2 into the formula. So the value of the probability density function is

$$f(\text{temperature} = 66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi} \cdot 6.2}\, e^{-\frac{(66-73)^2}{2 \cdot 6.2^2}} = 0.0340.$$

By the same token, the probability density of a yes outcome when humidity has a value, say, of 90 is calculated in the same way:

$$f(\text{humidity} = 90 \mid \text{yes}) = 0.0221.$$

The probability density function for an event is very closely related to its probability. However, it is not quite the same thing. If temperature is a continuous scale, the probability of the temperature being exactly 66 (or exactly any other value, such as 63.14159262) is zero. The real meaning of the density function f(x) is that the probability that the quantity lies within a small region around x, say, between x − ε/2 and x + ε/2, is ε f(x). What we have written above is correct




Table 4.4  The numeric weather data with summary statistics.

Outlook               Temperature             Humidity                Windy              Play
          yes   no               yes   no                yes   no            yes   no    yes   no
sunny     2     3                83    85                86    85     false  6     2     9     5
overcast  4     0                70    80                96    90     true   3     3
rainy     3     2                68    65                80    70
                                 64    72                65    95
                                 69    71                70    91
                                 75                      80
                                 75                      70
                                 72                      90
                                 81                      75
sunny     2/9   3/5   mean       73    74.6   mean       79.1  86.2   false  6/9   2/5   9/14  5/14
overcast  4/9   0/5   std. dev.  6.2   7.9    std. dev.  10.2  9.7    true   3/9   3/5
rainy     3/9   2/5





if temperature is measured to the nearest degree and humidity is measured to the nearest percentage point. You might think we ought to factor in the accuracy figure ε when using these probabilities, but that’s not necessary. The same ε would appear in both the yes and no likelihoods that follow and cancel out when the probabilities were calculated.
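As a minimal sketch, the normal density can be evaluated directly in Python with the yes-class statistics from Table 4.4. The helper names `normal_density` and `normal_cdf` are assumptions made for this illustration. The sketch also compares ε f(x), with ε = 1 degree, against the probability that a normally distributed temperature falls between 65.5 and 66.5.

```python
import math


def normal_density(x, mean, std):
    """Normal probability density f(x) for the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)


def normal_cdf(x, mean, std):
    """Cumulative normal distribution, expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


# Densities for the yes class, using the statistics in Table 4.4.
f_temperature_yes = normal_density(66, mean=73, std=6.2)
f_humidity_yes = normal_density(90, mean=79.1, std=10.2)
print(round(f_temperature_yes, 4), round(f_humidity_yes, 4))  # 0.034 0.0221

# epsilon * f(x) approximates the probability of an interval of width
# epsilon centered on x; here epsilon is one degree.
epsilon = 1.0
approx_probability = epsilon * f_temperature_yes
exact_probability = normal_cdf(66.5, 73, 6.2) - normal_cdf(65.5, 73, 6.2)
```

For an interval this narrow the two quantities agree closely, which is why the density value can stand in for the probability when readings are rounded to the nearest unit.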

Using these probabilities for the new day in Table 4.5 yields

which leads to probabilities

These figures are very close to the probabilities calculated earlier for the new day in Table 4.3, because the temperature and humidity values of 66 and 90 yield similar probabilities to the cool and high values used before.
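The whole calculation for the new day in Table 4.5 (sunny, temperature 66, humidity 90, windy) can be sketched in Python as follows. The dictionary layout and function names are assumptions made for illustration; the fractions, means, and standard deviations are those of Table 4.4. Evaluated this way, the temperature density for no (mean 74.6, standard deviation 7.9) comes to about 0.0279, and the normalized probabilities come to about 21% for yes and 79% for no. The optional `epsilon` argument shows that the accuracy figure cancels out on normalization.

```python
import math


def normal_density(x, mean, std):
    """Normal probability density f(x) for the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)


# Per-class fractions and (mean, std) pairs from Table 4.4.
model = {
    "yes": {"prior": 9 / 14, "sunny": 2 / 9, "windy_true": 3 / 9,
            "temperature": (73.0, 6.2), "humidity": (79.1, 10.2)},
    "no": {"prior": 5 / 14, "sunny": 3 / 5, "windy_true": 3 / 5,
           "temperature": (74.6, 7.9), "humidity": (86.2, 9.7)},
}


def likelihood(params, temperature, humidity, epsilon=1.0):
    """Likelihood of a sunny, windy day with the given numeric readings."""
    return (params["sunny"]
            * epsilon * normal_density(temperature, *params["temperature"])
            * epsilon * normal_density(humidity, *params["humidity"])
            * params["windy_true"]
            * params["prior"])


def class_probabilities(temperature, humidity, epsilon=1.0):
    """Normalize the per-class likelihoods so that they sum to 1."""
    scores = {label: likelihood(p, temperature, humidity, epsilon)
              for label, p in model.items()}
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


probabilities = class_probabilities(66, 90)
print({label: round(p, 2) for label, p in probabilities.items()})
```

Calling `class_probabilities(66, 90, epsilon=0.5)` returns the same values, because ε multiplies every class likelihood by the same factor.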

The normal-distribution assumption makes it easy to extend the Naïve Bayes classifier to deal with numeric attributes. If the values of any numeric attributes are missing, the mean and standard deviation calculations are based only on the ones that are present.
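One way to handle missing numeric values is sketched below in Python. Representing a missing value as `None` is an assumption of this sketch, and the example column is hypothetical: it is the no-class temperature column of Table 4.4 with one reading blanked out purely for illustration.

```python
import math


def mean_and_std(values):
    """Mean and sample standard deviation over the values that are
    present; entries equal to None are treated as missing and skipped."""
    present = [v for v in values if v is not None]
    n = len(present)
    if n < 2:
        raise ValueError("need at least two present values")
    mean = sum(present) / n
    variance = sum((v - mean) ** 2 for v in present) / (n - 1)
    return mean, math.sqrt(variance)


# Hypothetical column with one missing reading.
temperature_no = [85, 80, None, 72, 71]
mean_no, std_no = mean_and_std(temperature_no)
print(mean_no, round(std_no, 2))
```

The divisor is one less than the number of values actually present, not the number of instances in the class.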

Bayesian models for document classification

One important domain for machine learning is document classification, in which each instance represents a document and the instance’s class is the document’s topic. Documents might be news items and the classes might be domestic news, overseas news, financial news, and sport. Documents are characterized by the words that appear in them, and one way to apply machine learning to document classification is to treat the presence or absence of each word as a Boolean attribute. Naïve Bayes is a popular technique for this application because it is very fast and quite accurate.
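The Boolean-attribute approach can be sketched in Python on a toy corpus. Everything in this example is hypothetical: the four documents, their words, and the two topics are invented for illustration, and the add-one smoothing of the word probabilities is an assumption of the sketch that keeps any probability from being zero.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: (set of words present, topic).
training = [
    ({"election", "minister", "budget"}, "domestic"),
    ({"minister", "parliament"}, "domestic"),
    ({"match", "goal", "team"}, "sport"),
    ({"team", "coach", "budget"}, "sport"),
]
vocabulary = sorted(set().union(*(words for words, _ in training)))


def train(documents):
    """Count documents per class and, per class, documents containing each word."""
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for words, label in documents:
        class_counts[label] += 1
        for word in words:
            word_counts[label][word] += 1
    return class_counts, word_counts


def classify(words, class_counts, word_counts):
    """Pick the class with the highest log-likelihood, treating every
    vocabulary word as a Boolean attribute (present or absent)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total)
        for word in vocabulary:
            # Add-one smoothing (an assumption of this sketch).
            p_present = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p_present if word in words else 1 - p_present)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


class_counts, word_counts = train(training)
predicted = classify({"team", "goal"}, class_counts, word_counts)
print(predicted)  # sport
```

Note that absent words contribute to the score as well as present ones, which is what distinguishes the Boolean treatment from one based on word counts.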

However, the Boolean representation does not take into account the number of occurrences of each word, which is potentially useful information when determining the category

$$\text{likelihood of yes} = \tfrac{2}{9} \times 0.0340 \times 0.0221 \times \tfrac{3}{9} \times \tfrac{9}{14} = 0.000036,$$

$$\text{likelihood of no} = \tfrac{3}{5} \times 0.0279 \times 0.0381 \times \tfrac{3}{5} \times \tfrac{5}{14} = 0.000137;$$

$$\text{Probability of yes} = \frac{0.000036}{0.000036 + 0.000137} = 20.8\%,$$

$$\text{Probability of no} = \frac{0.000137}{0.000036 + 0.000137} = 79.2\%.$$






Table 4.5  Another new day.

Outlook    Temperature    Humidity    Windy    Play
sunny      66             90          true     ?
