Data Mining. Concepts and Techniques, 3rd Edition

HAN 10-ch03-083-124-9780123814791



HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 97 #15

3.3 Data Integration

scatter plots respectively show positively correlated data and negatively correlated data, while Figure 2.9 displays uncorrelated data.

Note that correlation does not imply causality. That is, if A and B are correlated, this does not necessarily imply that A causes B or that B causes A. For example, in analyzing a demographic database, we may find that attributes representing the number of hospitals and the number of car thefts in a region are correlated. This does not mean that one causes the other. Both are actually causally linked to a third attribute, namely, population.

Covariance of Numeric Data

In probability theory and statistics, correlation and covariance are two similar measures for assessing how much two attributes change together. Consider two numeric attributes A and B, and a set of n observations $\{(a_1, b_1), \ldots, (a_n, b_n)\}$. The mean values of A and B, respectively, are also known as the expected values on A and B, that is,

\[ E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n} \]

and

\[ E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}. \]

The covariance between A and B is defined as

\[ \mathrm{Cov}(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}. \tag{3.4} \]
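As a minimal sketch of Eq. (3.4), the Python below computes the covariance directly from the definition, dividing by n as the equation does. The function names `mean` and `covariance` and the toy values are illustrative assumptions, not part of the text.

```python
def mean(values):
    """Arithmetic mean, i.e. the expected value over the observed tuples."""
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance of two equal-length numeric sequences (Eq. 3.4)."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    mean_a, mean_b = mean(a), mean(b)
    # Average product of the deviations from the two means.
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Made-up toy values, for illustration only.
rising = [1.0, 2.0, 3.0, 4.0]
also_rising = [2.0, 4.0, 6.0, 8.0]
falling = [8.0, 6.0, 4.0, 2.0]

print(covariance(rising, also_rising))  # 2.5
print(covariance(rising, falling))      # -2.5
```

Note that this divides by n, not n - 1, so it is the population covariance of the observed tuples rather than an unbiased sample estimate.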

If we compare Eq. (3.3) for $r_{A,B}$ (correlation coefficient) with Eq. (3.4) for covariance, we see that

\[ r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}, \tag{3.5} \]
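The relationship in Eq. (3.5) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names are hypothetical, the standard deviations are population values (dividing by n, to match Eq. 3.4), and the data are made-up toy values.

```python
import math


def covariance(a, b):
    """Population covariance (divides by n, matching Eq. 3.4)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def correlation(a, b):
    """Correlation coefficient r_{A,B} = Cov(A,B) / (sigma_A * sigma_B), Eq. 3.5."""
    sigma_a = math.sqrt(covariance(a, a))  # population standard deviation of A
    sigma_b = math.sqrt(covariance(b, b))  # population standard deviation of B
    if sigma_a == 0 or sigma_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return covariance(a, b) / (sigma_a * sigma_b)


# Made-up toy values, for illustration only.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 8.0, 6.0, 4.0]  # decreases linearly as a increases
c = [2.0, 1.0, 4.0, 3.0]

print(round(correlation(a, b), 6))  # -1.0
print(round(correlation(a, c), 6))  # 0.6
```

Dividing by the standard deviations rescales the covariance into the range from -1 to 1, which makes values comparable across attribute pairs with different units.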

where $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B, respectively. It can also be shown that

\[ \mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B}. \tag{3.6} \]

This equation may simplify calculations.
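For completeness, Eq. (3.6) follows from Eq. (3.4) by expanding the product and using the definitions of the means, where $E(A \cdot B) = \frac{1}{n}\sum_{i=1}^{n} a_i b_i$. A short derivation sketch:

```latex
\begin{align*}
\mathrm{Cov}(A,B)
  &= \frac{1}{n}\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B}) \\
  &= \frac{1}{n}\sum_{i=1}^{n} a_i b_i
     - \bar{B}\,\frac{1}{n}\sum_{i=1}^{n} a_i
     - \bar{A}\,\frac{1}{n}\sum_{i=1}^{n} b_i
     + \bar{A}\bar{B} \\
  &= E(A \cdot B) - \bar{A}\bar{B} - \bar{A}\bar{B} + \bar{A}\bar{B} \\
  &= E(A \cdot B) - \bar{A}\bar{B}.
\end{align*}
```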

For two attributes A and B that tend to change together, if A is larger than $\bar{A}$ (the expected value of A), then B is likely to be larger than $\bar{B}$ (the expected value of B). Therefore, the covariance between A and B is positive. On the other hand, if one of the attributes tends to be above its expected value when the other attribute is below its expected value, then the covariance of A and B is negative.

Table 3.2 Stock Prices for AllElectronics and HighTech

Time point    AllElectronics    HighTech
t1            6                 20
t2            5                 10
t3            4                 14
t4            3                 5
t5            2                 5

If A and B are independent, then $E(A \cdot B) = E(A) \cdot E(B)$. Therefore, the covariance is $\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B} = E(A) \cdot E(B) - \bar{A}\bar{B} = 0$; that is, independent attributes are uncorrelated. However, the converse is not true. Some pairs of random variables (attributes) may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
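The remark that a covariance of 0 does not imply independence can be checked with a small constructed case. In the sketch below, the values of A are hypothetical and B is set to the square of A, so B is fully determined by A, yet the covariance is exactly 0. The `covariance` helper is an assumed name that follows Eq. (3.4).

```python
from fractions import Fraction


def covariance(a, b):
    """Population covariance (divides by n, matching Eq. 3.4)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Hypothetical attribute A, symmetric around 0; Fractions keep the arithmetic exact.
a = [Fraction(v) for v in (-2, -1, 0, 1, 2)]
# B is fully determined by A, so the two are clearly not independent.
b = [x * x for x in a]

print(covariance(a, b))  # 0
```

The positive and negative deviations of A cancel because B takes the same value for a and -a, which is why covariance only captures linear association.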

Example 3.2 Covariance analysis of numeric attributes. Consider Table 3.2, which presents a simplified example of stock prices observed at five time points for AllElectronics and HighTech, a high-tech company. If the stocks are affected by the same industry trends, will their prices rise or fall together?

\[ E(\mathit{AllElectronics}) = \frac{6 + 5 + 4 + 3 + 2}{5} = \frac{20}{5} = \$4 \]

and

\[ E(\mathit{HighTech}) = \frac{20 + 10 + 14 + 5 + 5}{5} = \frac{54}{5} = \$10.80. \]

Thus, using Eq. (3.6), we compute

\[ \mathrm{Cov}(\mathit{AllElectronics}, \mathit{HighTech}) = \frac{6 \times 20 + 5 \times 10 + 4 \times 14 + 3 \times 5 + 2 \times 5}{5} - 4 \times 10.80 = 50.2 - 43.2 = 7. \]

Therefore, given the positive covariance, we can say that stock prices for both companies rise together.
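The arithmetic of Example 3.2 can be reproduced with a short Python sketch that uses the five price pairs from the example. It computes the covariance both through the shortcut form of Eq. (3.6) and through the definition in Eq. (3.4); the variable names are illustrative assumptions.

```python
# Stock prices at the five time points, as used in Example 3.2.
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]
n = len(all_electronics)

mean_ae = sum(all_electronics) / n
mean_ht = sum(high_tech) / n

# Eq. (3.6): E(A * B) minus the product of the means.
cov_shortcut = (
    sum(x * y for x, y in zip(all_electronics, high_tech)) / n - mean_ae * mean_ht
)

# Eq. (3.4): average product of the deviations from the means.
cov_definition = (
    sum((x - mean_ae) * (y - mean_ht) for x, y in zip(all_electronics, high_tech)) / n
)

print(f"{mean_ae:.2f} {mean_ht:.2f}")              # 4.00 10.80
print(f"{cov_shortcut:.2f} {cov_definition:.2f}")  # 7.00 7.00
```

Both forms agree, as expected, and the shortcut form needs only one pass over the products once the means are known.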

Variance is a special case of covariance, where the two attributes are identical (i.e., the covariance of an attribute with itself). Variance was discussed in Chapter 2.
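As a quick check of this special case, the sketch below compares the covariance of an attribute with itself against the population variance from Python's standard `statistics` module. The `covariance` helper is an assumed name following Eq. (3.4), and the prices are the AllElectronics values from Example 3.2.

```python
import math
import statistics


def covariance(a, b):
    """Population covariance (divides by n, matching Eq. 3.4)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


prices = [6, 5, 4, 3, 2]  # AllElectronics prices from Example 3.2

# Cov(A, A) should equal the population variance of A.
self_covariance = covariance(prices, prices)
print(math.isclose(self_covariance, statistics.pvariance(prices)))  # True
```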

3.3.3 Tuple Duplication

In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case). The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy. Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences. For example, if a purchase order database contains attributes for
