HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 95 #13

3.3 Data Integration
χ² Correlation Test for Nominal Data
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a_1, a_2, ..., a_c, and B has r distinct values, namely b_1, b_2, ..., b_r. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (A_i, B_j) denote the joint event that attribute A takes on value a_i and attribute B takes on value b_j, that is, where (A = a_i, B = b_j). Each and every possible (A_i, B_j) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as

$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad (3.1)$$
where o_{ij} is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_{ij} is the expected frequency of (A_i, B_j), which can be computed as

$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}, \qquad (3.2)$$
where n is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B. The sum in Eq. (3.1) is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those for which the actual count is very different from that expected.
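Eqs. (3.1) and (3.2) translate directly into a short computation. The sketch below is illustrative (the function name and table layout are my own, not from the text): it derives each expected frequency from the row and column totals, then accumulates the χ² statistic over all r × c cells.

```python
# Sketch: chi-square statistic for an r x c contingency table,
# following Eqs. (3.1) and (3.2). Names and layout are illustrative.

def chi_square(observed):
    """observed[j][i]: count of the joint event (A = a_i, B = b_j)."""
    r = len(observed)               # number of rows (values of B)
    c = len(observed[0])            # number of columns (values of A)
    n = sum(sum(row) for row in observed)            # total tuple count
    col_totals = [sum(observed[j][i] for j in range(r)) for i in range(c)]
    row_totals = [sum(row) for row in observed]

    chi2 = 0.0
    for j in range(r):
        for i in range(c):
            e = col_totals[i] * row_totals[j] / n    # Eq. (3.2)
            o = observed[j][i]
            chi2 += (o - e) ** 2 / e                 # Eq. (3.1)
    return chi2
```

The degrees of freedom for the subsequent test come from the same table shape, as (r − 1) × (c − 1).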
The χ² statistic tests the hypothesis that A and B are independent, that is, that there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom. We illustrate the use of this statistic in Example 3.1. If the hypothesis can be rejected, then we say that A and B are statistically correlated.
Example 3.1 Correlation analysis of nominal attributes using χ². Suppose that a group of 1500 people was surveyed. The gender of each person was noted. Each person was polled as to whether his or her preferred type of reading material was fiction or nonfiction. Thus, we have two attributes, gender and preferred reading. The observed frequency (or count) of each possible joint event is summarized in the contingency table shown in Table 3.1, where the numbers in parentheses are the expected frequencies. The expected frequencies are calculated based on the data distribution for both attributes using Eq. (3.2).
Using Eq. (3.2), we can verify the expected frequencies for each cell. For example, the expected frequency for the cell (male, fiction) is
$$e_{11} = \frac{count(male) \times count(fiction)}{n} = \frac{300 \times 450}{1500} = 90,$$
and so on. Notice that in any row, the sum of the expected frequencies must equal the
total observed frequency for that row, and the sum of the expected frequencies in any
column must also equal the total observed frequency for that column.
Table 3.1 Example 3.1's 2 × 2 Contingency Table Data

                male        female        Total
fiction         250 (90)    200 (360)       450
nonfiction       50 (210)  1000 (840)     1050
Total           300        1200           1500

Note: Are gender and preferred reading correlated?
Using Eq. (3.1) for χ² computation, we get

$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93.$$
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution, typically available from any textbook on statistics). Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
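The whole decision in Example 3.1 can be replayed in a few lines. The sketch below hard-codes the counts from Table 3.1 and the 10.828 critical value quoted above; the variable names are illustrative:

```python
# Replay Example 3.1: chi-square test of independence for
# gender vs. preferred reading (counts taken from Table 3.1).

observed = {            # (gender, reading) -> observed count
    ("male", "fiction"): 250,    ("female", "fiction"): 200,
    ("male", "nonfiction"): 50,  ("female", "nonfiction"): 1000,
}
n = sum(observed.values())                       # 1500 people

genders = ["male", "female"]
readings = ["fiction", "nonfiction"]
count_g = {g: sum(observed[g, r] for r in readings) for g in genders}
count_r = {r: sum(observed[g, r] for g in genders) for r in readings}

# Eq. (3.1) with expected frequencies from Eq. (3.2)
chi2 = sum(
    (observed[g, r] - count_g[g] * count_r[r] / n) ** 2
    / (count_g[g] * count_r[r] / n)
    for g in genders for r in readings
)

df = (len(genders) - 1) * (len(readings) - 1)    # (2-1)(2-1) = 1
critical = 10.828       # upper 0.001 point of chi-square, 1 df
reject_independence = chi2 > critical            # True: correlated
```

In practice a library routine such as SciPy's `chi2_contingency` performs the same computation and also returns the p-value.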
Correlation Coefficient for Numeric Data
For numeric attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson's product moment coefficient, named after its inventor, Karl Pearson). This is

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}, \qquad (3.3)$$
where n is the number of tuples, a_i and b_i are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, σ_A and σ_B are the respective standard deviations of A and B (as defined in Section 2.2.2), and Σ(a_i b_i) is the sum of the AB cross-product (i.e., for each tuple, the value for A is multiplied by the value for B in that tuple). Note that −1 ≤ r_{A,B} ≤ +1. If r_{A,B} is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value, the stronger the correlation (i.e., the more each attribute implies the other). Hence, a higher value may indicate that A (or B) may be removed as a redundancy.
If the resulting value is equal to 0, then there is no linear correlation between A and B (note, however, that zero correlation does not by itself imply that the attributes are independent). If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease. This means that each attribute discourages the other. Scatter plots can also be used to view correlations between attributes (Section 2.2.3). For example, Figure 2.8's
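Eq. (3.3) also translates directly to code. The sketch below uses population (divide-by-n) standard deviations, matching the σ_A and σ_B of Section 2.2.2; the function name and the sample data in the test are made up for illustration:

```python
import math

# Sketch of Eq. (3.3): Pearson's product moment coefficient,
# using population (divide-by-n) means and standard deviations.

def correlation(a, b):
    n = len(a)
    mean_a = sum(a) / n                              # A-bar
    mean_b = sum(b) / n                              # B-bar
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    # numerator: sum over i of (a_i - A-bar)(b_i - B-bar)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / (n * sigma_a * sigma_b)
```

A perfectly linear increasing pair of attributes yields +1, a perfectly linear decreasing pair yields −1, consistent with the bounds noted above.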