Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791



HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 51 #13

2.2 Basic Statistical Descriptions of Data

The variance of N observations, \(x_1, x_2, \ldots, x_N\), for a numeric attribute X is

\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2, \tag{2.6}
\]

where \(\bar{x}\) is the mean value of the observations, as defined in Eq. (2.1). The standard deviation, σ, of the observations is the square root of the variance, \(\sigma^2\).
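The equality of the two forms in Eq. (2.6) follows from expanding the square and applying the definition of the mean. A short derivation, using only the symbols already defined:

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2
  &= \frac{1}{N}\sum_{i=1}^{N} x_i^2
     \;-\; 2\bar{x}\cdot\frac{1}{N}\sum_{i=1}^{N} x_i
     \;+\; \frac{1}{N}\sum_{i=1}^{N}\bar{x}^2 \\
  &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2 \\
  &= \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2 .
\end{align*}
```

The second line uses \(\frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}\).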

Example 2.12

Variance and standard deviation. In Example 2.6, we found \(\bar{x} = \$58{,}000\) using Eq. (2.1) for the mean. To determine the variance and standard deviation of the data from that example, we set N = 12 and use Eq. (2.6) to obtain (with values in thousands of dollars)

\[
\sigma^2 = \frac{1}{12}\left(30^2 + 36^2 + 47^2 + \cdots + 110^2\right) - 58^2 \approx 379.17
\]

\[
\sigma \approx \sqrt{379.17} \approx 19.47.
\]

The basic properties of the standard deviation, σ, as a measure of spread are as follows:

σ measures spread about the mean and should be considered only when the mean is chosen as the measure of center.

σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0.

Importantly, an observation is unlikely to be more than several standard deviations away from the mean. Mathematically, using Chebyshev’s inequality, it can be shown that at least \(\left(1 - \frac{1}{k^2}\right) \times 100\%\) of the observations are no more than k standard deviations from the mean. Therefore, the standard deviation is a good indicator of the spread of a data set.
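As a quick illustration of the Chebyshev bound, the sketch below compares the guaranteed fraction, 1 − 1/k², with the fraction actually observed in a small data set. The function names and the sample values are hypothetical and chosen only for illustration; σ is computed with the 1/N form of Eq. (2.6).

```python
import math

def chebyshev_lower_bound(k):
    """Guaranteed minimum fraction of observations within k standard deviations."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the bound is vacuous, so it is clipped at 0.
    return max(0.0, 1.0 - 1.0 / k ** 2)

def fraction_within(data, k):
    """Observed fraction of values no more than k standard deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)  # Eq. (2.6), 1/N form
    return sum(1 for x in data if abs(x - mean) <= k * sigma) / n

# Hypothetical sample with one extreme value (not taken from the text).
sample = [2, 4, 4, 4, 5, 5, 7, 9, 40]

for k in (1.5, 2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(sample, k))
```

The observed fraction is never below the bound, but it is often well above it, since Chebyshev's inequality makes no assumption about the shape of the distribution.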

The computation of the variance and standard deviation is scalable in large databases.
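One way to read the scalability remark, offered here as an assumption rather than as the authors' stated method, is that the second form of Eq. (2.6) needs only three running totals: the count, the sum, and the sum of squares. These totals can be computed per partition in a single pass and then added together. The helper names and the partitioned data below are hypothetical.

```python
import math

def summarize(chunk):
    """One pass over a partition: (count, sum, sum of squares)."""
    n, s, sq = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        s += x
        sq += x * x
    return (n, s, sq)

def merge(a, b):
    """Combine the summaries of two partitions by adding them componentwise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def variance_from_summary(summary):
    """Second form of Eq. (2.6): mean of squares minus square of the mean."""
    n, s, sq = summary
    mean = s / n
    return sq / n - mean * mean

# Hypothetical data split across three partitions.
partitions = [[3, 7, 8], [1, 9], [4, 4, 10]]

total = (0, 0.0, 0.0)
for part in partitions:
    total = merge(total, summarize(part))

variance = variance_from_summary(total)
std_dev = math.sqrt(variance)
```

When the mean is very large relative to the spread, subtracting two nearly equal quantities in this form can lose floating-point precision, so a numerically careful implementation may prefer a different update scheme.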

2.2.3 Graphic Displays of Basic Statistical Descriptions of Data

In this section, we study graphic displays of basic statistical descriptions. These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are helpful for the visual inspection of data, which is useful for data preprocessing. The first three of these show univariate distributions (i.e., data for one attribute), while scatter plots show bivariate distributions (i.e., involving two attributes).

Quantile Plot

In this and the following subsections, we cover common graphic displays of data distributions. A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual occurrences). Second, it plots quantile information (see Section 2.2.2). Let \(x_i\), for i = 1 to N, be the data sorted in increasing order so that \(x_1\) is the smallest observation and \(x_N\) is the largest for some ordinal or numeric attribute X. Each observation, \(x_i\), is paired with a percentage, \(f_i\), which indicates that approximately \(f_i \times 100\%\) of the data are below the value, \(x_i\). We say “approximately” because there may not be a value with exactly a fraction, \(f_i\), of the data below \(x_i\). Note that the 0.25 quantile corresponds to quartile \(Q_1\), the 0.50 quantile is the median, and the 0.75 quantile is \(Q_3\).

Let

\[
f_i = \frac{i - 0.5}{N}. \tag{2.7}
\]

These numbers increase in equal steps of 1/N, ranging from \(\frac{1}{2N}\) (which is slightly above 0) to \(1 - \frac{1}{2N}\) (which is slightly below 1). On a quantile plot, \(x_i\) is graphed against \(f_i\). This allows us to compare different distributions based on their quantiles. For example, given the quantile plots of sales data for two different time periods, we can compare their \(Q_1\), median, \(Q_3\), and other \(f_i\) values at a glance.

Example 2.13

Quantile plot. Figure 2.4 shows a quantile plot for the unit price data of Table 2.1.
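The points of a quantile plot can be computed directly from Eq. (2.7). The sketch below sorts the data and pairs each observation with its f-value. The function name and the price values are hypothetical, because Table 2.1 is not reproduced in this section.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, with f_i = (i - 0.5) / N for the sorted data."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

# Hypothetical unit prices (not the data of Table 2.1).
prices = [74, 40, 120, 43]
points = quantile_plot_points(prices)

for f, x in points:
    print(f, x)
```

Passing the f-values as the horizontal coordinates and the sorted observations as the vertical coordinates to any plotting library then gives a plot laid out like Figure 2.4.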

Quantile–Quantile Plot

A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.

Suppose that we have two sets of observations for the attribute or variable unit price, taken from two different branch locations. Let \(x_1, \ldots, x_N\) be the data from the first branch, and \(y_1, \ldots, y_M\) be the data from the second, where each data set is sorted in increasing order. If M = N (i.e., the number of points in each set is the same), then we simply plot \(y_i\) against \(x_i\), where \(y_i\) and \(x_i\) are both (i − 0.5)/N quantiles of their respective data sets. If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot. Here, \(y_i\) is the (i − 0.5)/M quantile of the y

Figure 2.4 A quantile plot for the unit price data of Table 2.1. (Horizontal axis: f-value, from 0.00 to 1.00; vertical axis: unit price ($); Q1, the median, and Q3 are marked.)
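Returning to the q-q plot construction described before the figure, the following is a minimal sketch of how the plotted points might be computed. The M = N case pairs the sorted values directly, as the text states. For M < N, the text in this section breaks off before saying how the matching x quantile is obtained, so the linear interpolation between neighboring sorted x values used here is an assumption. All function names and data values are hypothetical.

```python
import math

def quantile_at(sorted_xs, f):
    """Estimate the f quantile, treating sorted_xs[j-1] as the (j - 0.5)/N quantile.

    Linear interpolation between neighbors is an assumption of this sketch.
    """
    n = len(sorted_xs)
    pos = f * n - 0.5                      # fractional 0-based index
    pos = min(max(pos, 0.0), n - 1.0)      # clamp to the valid index range
    lo = math.floor(pos)
    hi = min(lo + 1, n - 1)
    w = pos - lo
    return sorted_xs[lo] * (1 - w) + sorted_xs[hi] * w

def qq_points(x_data, y_data):
    """Return (x quantile, y quantile) pairs for a q-q plot, assuming M <= N."""
    xs, ys = sorted(x_data), sorted(y_data)
    n, m = len(xs), len(ys)
    if m > n:
        raise ValueError("expected the second data set to be no larger than the first")
    if m == n:
        return list(zip(xs, ys))           # plot y_i against x_i directly
    # M < N: only M points; match each y_i with the (i - 0.5)/M quantile of x.
    return [(quantile_at(xs, (i - 0.5) / m), y) for i, y in enumerate(ys, start=1)]

# Hypothetical unit prices from two branches.
branch_1 = [40, 10, 30, 20]
branch_2 = [2, 1]
points = qq_points(branch_1, branch_2)
```

Points that lie above the line y = x indicate quantiles where the second data set has higher values than the first, and points below it indicate the reverse.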
