HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 95 #13

3.3 Data Integration
χ² Correlation Test for Nominal Data
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a_1, a_2, ..., a_c, and B has r distinct values, namely b_1, b_2, ..., b_r. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (A_i, B_j) denote the joint event that attribute A takes on value a_i and attribute B takes on value b_j, that is, where (A = a_i, B = b_j). Each and every possible (A_i, B_j) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as

$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad (3.1)$$
where o_{ij} is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_{ij} is the expected frequency of (A_i, B_j), which can be computed as

$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}, \qquad (3.2)$$
where n is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B. The sum in Eq. (3.1) is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those for which the actual count is very different from that expected.
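Eqs. (3.1) and (3.2) translate directly into a short computation. The sketch below is illustrative (the function name and table layout are my own, not from the text): it derives each expected frequency from the row and column totals, then accumulates the χ² statistic over all r × c cells.

```python
# Sketch: chi-square statistic for an r x c contingency table,
# following Eqs. (3.1) and (3.2). Names and layout are illustrative.

def chi_square(observed):
    """observed[j][i]: count of the joint event (A = a_i, B = b_j)."""
    r = len(observed)               # number of rows (values of B)
    c = len(observed[0])            # number of columns (values of A)
    n = sum(sum(row) for row in observed)            # total tuple count
    col_totals = [sum(observed[j][i] for j in range(r)) for i in range(c)]
    row_totals = [sum(row) for row in observed]

    chi2 = 0.0
    for j in range(r):
        for i in range(c):
            e = col_totals[i] * row_totals[j] / n    # Eq. (3.2)
            o = observed[j][i]
            chi2 += (o - e) ** 2 / e                 # Eq. (3.1)
    return chi2
```

The degrees of freedom for the subsequent test come from the same table shape, as (r − 1) × (c − 1).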
The χ² statistic tests the hypothesis that A and B are independent, that is, that there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom. We illustrate the use of this statistic in Example 3.1. If the hypothesis can be rejected, then we say that A and B are statistically correlated.
Example 3.1 Correlation analysis of nominal attributes using χ². Suppose that a group of 1500 people was surveyed. The gender of each person was noted. Each person was polled as to whether his or her preferred type of reading material was fiction or nonfiction. Thus, we have two attributes, gender and preferred reading. The observed frequency (or count) of each possible joint event is summarized in the contingency table shown in Table 3.1, where the numbers in parentheses are the expected frequencies. The expected frequencies are calculated based on the data distribution for both attributes using Eq. (3.2).
Using Eq. (3.2), we can verify the expected frequencies for each cell. For example, the expected frequency for the cell (male, fiction) is
$$e_{11} = \frac{count(male) \times count(fiction)}{n} = \frac{300 \times 450}{1500} = 90,$$
and so on. Notice that in any row, the sum of the expected frequencies must equal the
total observed frequency for that row, and the sum of the expected frequencies in any
column must also equal the total observed frequency for that column.
Table 3.1 Example 3.1's 2 × 2 Contingency Table Data

                male        female        Total
fiction         250 (90)    200 (360)       450
nonfiction       50 (210)  1000 (840)     1050
Total           300        1200           1500

Note: Are gender and preferred reading correlated?
Using Eq. (3.1) for χ² computation, we get

$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93.$$
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution, typically available from any textbook on statistics). Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
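The whole decision in Example 3.1 can be replayed in a few lines. The sketch below hard-codes the counts from Table 3.1 and the 10.828 critical value quoted above; the variable names are illustrative:

```python
# Replay Example 3.1: chi-square test of independence for
# gender vs. preferred reading (counts taken from Table 3.1).

observed = {            # (gender, reading) -> observed count
    ("male", "fiction"): 250,    ("female", "fiction"): 200,
    ("male", "nonfiction"): 50,  ("female", "nonfiction"): 1000,
}
n = sum(observed.values())                       # 1500 people

genders = ["male", "female"]
readings = ["fiction", "nonfiction"]
count_g = {g: sum(observed[g, r] for r in readings) for g in genders}
count_r = {r: sum(observed[g, r] for g in genders) for r in readings}

# Eq. (3.1) with expected frequencies from Eq. (3.2)
chi2 = sum(
    (observed[g, r] - count_g[g] * count_r[r] / n) ** 2
    / (count_g[g] * count_r[r] / n)
    for g in genders for r in readings
)

df = (len(genders) - 1) * (len(readings) - 1)    # (2-1)(2-1) = 1
critical = 10.828       # upper 0.001 point of chi-square, 1 df
reject_independence = chi2 > critical            # True: correlated
```

In practice a library routine such as SciPy's `chi2_contingency` performs the same computation and also returns the p-value.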
Correlation Coefficient for Numeric Data
For numeric attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson's product moment coefficient, named after its inventor, Karl Pearson). This is

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}, \qquad (3.3)$$
where n is the number of tuples, a_i and b_i are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, σ_A and σ_B are the respective standard deviations of A and B (as defined in Section 2.2.2), and Σ(a_i b_i) is the sum of the AB cross-product (i.e., for each tuple, the value for A is multiplied by the value for B in that tuple). Note that −1 ≤ r_{A,B} ≤ +1. If r_{A,B} is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value, the stronger the correlation (i.e., the more each attribute implies the other). Hence, a higher value may indicate that A (or B) may be removed as a redundancy.
If the resulting value is equal to 0, then there is no linear correlation between A and B (note, however, that zero correlation does not by itself imply that the attributes are independent). If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease. This means that each attribute discourages the other. Scatter plots can also be used to view correlations between attributes (Section 2.2.3). For example, Figure 2.8's
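Eq. (3.3) also translates directly to code. The sketch below uses population (divide-by-n) standard deviations, matching the σ_A and σ_B of Section 2.2.2; the function name and the sample data in the test are made up for illustration:

```python
import math

# Sketch of Eq. (3.3): Pearson's product moment coefficient,
# using population (divide-by-n) means and standard deviations.

def correlation(a, b):
    n = len(a)
    mean_a = sum(a) / n                              # A-bar
    mean_b = sum(b) / n                              # B-bar
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    # numerator: sum over i of (a_i - A-bar)(b_i - B-bar)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / (n * sigma_a * sigma_b)
```

A perfectly linear increasing pair of attributes yields +1, a perfectly linear decreasing pair yields −1, consistent with the bounds noted above.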