HAN
10-ch03-083-124-9780123814791
2011/6/1
3:16
Page 97
#15
3.3 Data Integration
97
scatter plots respectively show positively correlated data and negatively correlated data,
while Figure 2.9 displays uncorrelated data.
Note that correlation does not imply causality. That is, if A and B are correlated, this
does not necessarily imply that A causes B or that B causes A. For example, in analyzing a
demographic database, we may find that attributes representing the number of hospitals
and the number of car thefts in a region are correlated. This does not mean that one
causes the other. Both are actually causally linked to a third attribute, namely, population.
Covariance of Numeric Data
In probability theory and statistics, correlation and covariance are two similar measures
for assessing how much two attributes change together. Consider two numeric attributes
A and B, and a set of n observations $\{(a_1, b_1), \ldots, (a_n, b_n)\}$. The mean values of A and B,
respectively, are also known as the expected values on A and B, that is,
$$E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$$
and
$$E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}.$$
The covariance between A and B is defined as
$$\mathrm{Cov}(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}. \tag{3.4}$$
If we compare Eq. (3.3) for $r_{A,B}$ (correlation coefficient) with Eq. (3.4) for covariance,
we see that
$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}, \tag{3.5}$$
where $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B, respectively. It can also be
shown that
$$\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}. \tag{3.6}$$
This equation may simplify calculations.
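The step from Eq. (3.4) to Eq. (3.6) is not spelled out in the text. A short derivation, assuming only linearity of expectation and that $\bar{A} = E(A)$ and $\bar{B} = E(B)$ are constants, runs as follows:

$$
\begin{aligned}
\mathrm{Cov}(A,B) &= E\big((A - \bar{A})(B - \bar{B})\big) \\
&= E(A \cdot B) - \bar{B}\,E(A) - \bar{A}\,E(B) + \bar{A}\,\bar{B} \\
&= E(A \cdot B) - \bar{A}\,\bar{B} - \bar{A}\,\bar{B} + \bar{A}\,\bar{B} \\
&= E(A \cdot B) - \bar{A}\,\bar{B}.
\end{aligned}
$$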
For two attributes A and B that tend to change together, if A is larger than $\bar{A}$ (the
expected value of A), then B is likely to be larger than $\bar{B}$ (the expected value of B).
Therefore, the covariance between A and B is positive. On the other hand, if one of
the attributes tends to be above its expected value when the other attribute is below its
expected value, then the covariance of A and B is negative.
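The definition in Eq. (3.4) and the sign behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `covariance` and the input values are hypothetical, chosen only to show one positive and one negative result.

```python
from typing import Sequence


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    """Covariance per Eq. (3.4): the mean of (a_i - mean_a) * (b_i - mean_b)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Divide by n (not n - 1), matching the definition in Eq. (3.4).
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Hypothetical values, used only to illustrate the sign of the result.
rising = covariance([1, 2, 3], [2, 4, 6])   # both attributes increase together
opposed = covariance([1, 2, 3], [6, 4, 2])  # one increases while the other decreases
print(rising, opposed)
```

Note that the divisor here is n, as in Eq. (3.4); many library routines default to n - 1 (the sample covariance), so results may differ unless that option is changed.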
If A and B are independent (a stronger condition than being uncorrelated), then $E(A \cdot B) = E(A) \cdot E(B)$. Therefore, the covariance is $\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\,\bar{B} = E(A) \cdot E(B) - \bar{A}\,\bar{B} = 0$.
However, the converse is not true. Some pairs of random variables (attributes) may have
a covariance of 0 but are not independent. Only under some additional assumptions
(e.g., the data follow multivariate normal distributions) does a covariance of 0 imply
independence.

Table 3.2 Stock Prices for AllElectronics and HighTech

| Time point | AllElectronics | HighTech |
|------------|----------------|----------|
| t1         | 6              | 20       |
| t2         | 5              | 10       |
| t3         | 4              | 14       |
| t4         | 3              | 5        |
| t5         | 2              | 5        |
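The statement that a covariance of 0 does not imply independence can be checked on a small constructed case. The sketch below assumes a hypothetical attribute A that takes the values -1, 0, and 1 equally often, with B defined as the square of A; these values are illustrative and do not come from the text. Exact fractions are used so that the zero is not an artifact of rounding.

```python
from fractions import Fraction

# Hypothetical data: A takes -1, 0, 1 equally often, and B = A^2,
# so B is completely determined by A.
values_a = [Fraction(-1), Fraction(0), Fraction(1)]
values_b = [a * a for a in values_a]
n = len(values_a)

mean_a = sum(values_a) / n
mean_b = sum(values_b) / n
mean_ab = sum(a * b for a, b in zip(values_a, values_b)) / n

# Covariance via Eq. (3.6): E(A * B) - mean_a * mean_b.
cov_ab = mean_ab - mean_a * mean_b

# Independence would require P(A=0, B=0) == P(A=0) * P(B=0).
p_joint = Fraction(sum(1 for a, b in zip(values_a, values_b) if a == 0 and b == 0), n)
p_a0 = Fraction(sum(1 for a in values_a if a == 0), n)
p_b0 = Fraction(sum(1 for b in values_b if b == 0), n)

print(cov_ab)                # the covariance is exactly 0
print(p_joint, p_a0 * p_b0)  # the two probabilities differ, so A and B are dependent
```

Here B depends on A through a purely nonlinear (symmetric) relationship, which covariance, as a measure of linear co-variation, does not detect.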
Example 3.2 Covariance analysis of numeric attributes. Consider Table 3.2, which presents a simplified example of stock prices observed at five time points for AllElectronics and
HighTech, a high-tech company. If the stocks are affected by the same industry trends,
will their prices rise or fall together?
$$E(\mathit{AllElectronics}) = \frac{6 + 5 + 4 + 3 + 2}{5} = \frac{20}{5} = \$4$$
and
$$E(\mathit{HighTech}) = \frac{20 + 10 + 14 + 5 + 5}{5} = \frac{54}{5} = \$10.80.$$
Thus, using Eq. (3.6), we compute
$$\mathrm{Cov}(\mathit{AllElectronics}, \mathit{HighTech}) = \frac{6 \times 20 + 5 \times 10 + 4 \times 14 + 3 \times 5 + 2 \times 5}{5} - 4 \times 10.80 = 50.2 - 43.2 = 7.$$
Therefore, given the positive covariance, we can say that stock prices for both companies
tend to rise together.
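The arithmetic of Example 3.2 can be reproduced with the Python standard library. The sketch below takes the prices from Table 3.2 and computes the covariance both from the definition and from the shortcut form E(A · B) minus the product of the means. The final step, a correlation coefficient via Eq. (3.5), is an addition that the example itself does not compute; variable names are illustrative.

```python
from statistics import fmean, pstdev

# Stock prices from Table 3.2 (time points t1..t5).
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]
n = len(all_electronics)

mean_a = fmean(all_electronics)
mean_b = fmean(high_tech)

# Definition: mean of the products of deviations from the means.
cov_def = sum(
    (a - mean_a) * (b - mean_b) for a, b in zip(all_electronics, high_tech)
) / n

# Shortcut: E(A * B) minus the product of the means.
cov_short = sum(a * b for a, b in zip(all_electronics, high_tech)) / n - mean_a * mean_b

# Eq. (3.5): divide by the population standard deviations (divisor n).
r = cov_def / (pstdev(all_electronics) * pstdev(high_tech))

print(mean_a, mean_b)
print(round(cov_def, 6), round(cov_short, 6))  # both equal 7 up to floating-point rounding
print(round(r, 3))
```

Both forms agree on a covariance of 7. `pstdev` is used rather than `stdev` so that the standard deviations use the same divisor n as the covariance.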
Variance is a special case of covariance, where the two attributes are identical (i.e., the
covariance of an attribute with itself). Variance was discussed in Chapter 2.
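As a quick check of this special case, the covariance of an attribute with itself can be compared against a library variance routine. The sketch assumes the population form of the variance (divisor n), consistent with the covariance definition used in this section; the helper name `covariance` is hypothetical.

```python
from statistics import pvariance


def covariance(a, b):
    """Covariance with divisor n: mean of the products of deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


prices = [6, 5, 4, 3, 2]  # AllElectronics column of Table 3.2

var_via_cov = covariance(prices, prices)  # covariance of the attribute with itself
var_library = pvariance(prices)           # population variance from the standard library
print(var_via_cov, var_library)
```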
3.3.3 Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level (e.g., where there are two or more identical tuples for a given
unique data entry case). The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy. Inconsistencies often
arise between various duplicates, due to inaccurate data entry or updating some but not
all data occurrences. For example, if a purchase order database contains attributes for