HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 51
#13
2.2 Basic Statistical Descriptions of Data
51
The variance of $N$ observations, $x_1, x_2, \ldots, x_N$, for a numeric attribute $X$ is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2, \tag{2.6}$$

where $\bar{x}$ is the mean value of the observations, as defined in Eq. (2.1). The standard
deviation, $\sigma$, of the observations is the square root of the variance, $\sigma^2$.
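The equality of the two forms in Eq. (2.6) is stated without proof. A short derivation, assuming only that Eq. (2.1) defines the mean as $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$, expands the square and substitutes the mean:

$$
\begin{aligned}
\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2
  &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; 2\bar{x}\cdot\frac{1}{N}\sum_{i=1}^{N} x_i \;+\; \bar{x}^2 \\
  &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2 \\
  &= \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2 .
\end{aligned}
$$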
Example 2.12
Variance and standard deviation. In Example 2.6, we found $\bar{x} = \$58{,}000$ using Eq. (2.1)
for the mean. To determine the variance and standard deviation of the data from that
example, we set $N = 12$ and use Eq. (2.6), working in thousands of dollars, to obtain

$$\sigma^2 = \frac{1}{12}\left(30^2 + 36^2 + 47^2 + \cdots + 110^2\right) - 58^2 \approx 379.17$$

$$\sigma \approx \sqrt{379.17} \approx 19.47.$$
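The calculation in Eq. (2.6) can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function name `variance_two_ways` and the list `sample` are hypothetical, and the data are made up rather than taken from Example 2.6. It evaluates both forms of Eq. (2.6) so that their agreement can be checked directly.

```python
import math

def variance_two_ways(values):
    """Return the population variance computed via both forms of Eq. (2.6)."""
    n = len(values)
    mean = sum(values) / n
    # First form: average squared deviation from the mean.
    by_deviations = sum((x - mean) ** 2 for x in values) / n
    # Second form: mean of the squares minus the square of the mean.
    by_squares = sum(x * x for x in values) / n - mean ** 2
    return by_deviations, by_squares

# Hypothetical data (NOT the salary data of Example 2.6).
sample = [4.0, 8.0, 6.0, 5.0, 3.0, 10.0]
var_dev, var_sq = variance_two_ways(sample)
sigma = math.sqrt(var_dev)  # standard deviation is the square root of the variance
print(var_dev, var_sq, sigma)
```

Both forms divide by $N$, matching Eq. (2.6); they agree up to floating-point rounding.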
The basic properties of the standard deviation, $\sigma$, as a measure of spread are as
follows:

$\sigma$ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.

$\sigma = 0$ only when there is no spread, that is, when all observations have the same
value. Otherwise, $\sigma > 0$.

Importantly, an observation is unlikely to be more than several standard deviations
away from the mean. Mathematically, using Chebyshev’s inequality, it can be shown that
at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the observations are no more than $k$ standard deviations
from the mean. Therefore, the standard deviation is a good indicator of the spread of a
data set.
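The Chebyshev bound can be checked empirically. The sketch below is an illustration under stated assumptions: the helper `fraction_within` and the list `data` are hypothetical, and the data are deliberately skewed, made-up values. For several values of $k$ it compares the observed fraction of observations within $k$ standard deviations of the mean against the lower bound $1 - 1/k^2$.

```python
import math

def fraction_within(values, k):
    """Fraction of observations no more than k standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation, as in Eq. (2.6).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(1 for x in values if abs(x - mean) <= k * sigma) / n

# Hypothetical, deliberately skewed data with one extreme value.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 50]
for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2  # Chebyshev lower bound on the fraction
    observed = fraction_within(data, k)
    print(f"k={k}: bound={bound:.3f}, observed={observed:.3f}")
```

The bound is informative only for $k > 1$; for $k \le 1$ it is zero or negative and therefore holds trivially.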
The computation of the variance and standard deviation is scalable in large databases.
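The text does not say how the computation scales. One plausible reading is that the second form of Eq. (2.6) needs only three running totals ($N$, $\sum x_i$, and $\sum x_i^2$), which can be accumulated in a single pass and added together across partitions of the data. The sketch below illustrates that idea; the class `VarianceSummary`, the helper `summarize`, and the two partitions are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class VarianceSummary:
    """Running totals sufficient to evaluate the second form of Eq. (2.6)."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x):
        # Update the three totals with one new observation.
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        """Combine summaries computed independently on two partitions."""
        return VarianceSummary(self.count + other.count,
                               self.total + other.total,
                               self.total_sq + other.total_sq)

    def variance(self):
        mean = self.total / self.count
        return self.total_sq / self.count - mean ** 2

    def std_dev(self):
        # Guard against a tiny negative variance caused by rounding.
        return math.sqrt(max(self.variance(), 0.0))

def summarize(chunk):
    """Scan one partition once and return its summary."""
    summary = VarianceSummary()
    for x in chunk:
        summary.add(x)
    return summary

# Hypothetical partitions of a single data set.
part_a = [4.0, 8.0, 6.0]
part_b = [5.0, 3.0, 10.0]
combined = summarize(part_a).merge(summarize(part_b))
print(combined.variance(), combined.std_dev())
```

A caveat on this design: subtracting two large, nearly equal numbers can lose floating-point precision when the mean is large relative to the spread. Numerically stabler one-pass updates, such as Welford's algorithm, may be preferable in that case.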
2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
In this section, we study graphic displays of basic statistical descriptions. These include
quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are
helpful for the visual inspection of data, which is useful for data preprocessing. The first
three of these show univariate distributions (i.e., data for one attribute), while scatter
plots show bivariate distributions (i.e., involving two attributes).
Quantile Plot
In this and the following subsections, we cover common graphic displays of data
distributions. A quantile plot is a simple and effective way to have a first look at a univariate
data distribution. First, it displays all of the data for the given attribute (allowing the user
to assess both the overall behavior and unusual occurrences). Second, it plots quantile
information (see Section 2.2.2). Let $x_i$, for $i = 1$ to $N$, be the data sorted in increasing
order so that $x_1$ is the smallest observation and $x_N$ is the largest for some ordinal or
numeric attribute $X$. Each observation, $x_i$, is paired with a fraction, $f_i$, which indicates
that approximately $f_i \times 100\%$ of the data are below the value, $x_i$. We say “approximately”
because there may not be a value with exactly a fraction, $f_i$, of the data below $x_i$. Note
that the 0.25 quantile corresponds to quartile $Q_1$, the 0.50 quantile is the median,
and the 0.75 quantile is $Q_3$.
Let

$$f_i = \frac{i - 0.5}{N}. \tag{2.7}$$

These numbers increase in equal steps of $1/N$, ranging from $\frac{1}{2N}$ (which is slightly
above 0) to $1 - \frac{1}{2N}$ (which is slightly below 1). On a quantile plot, $x_i$ is graphed against
$f_i$. This allows us to compare different distributions based on their quantiles. For example,
given the quantile plots of sales data for two different time periods, we can compare
their $Q_1$, median, $Q_3$, and other $f_i$ values at a glance.
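The construction of the plotted points can be sketched as follows. This is a minimal illustration: the function `quantile_plot_points` and the list `prices` are hypothetical, and the values are made up rather than taken from Table 2.1. It sorts the data and pairs each $x_i$ with $f_i = (i - 0.5)/N$ from Eq. (2.7).

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N, as in Eq. (2.7)."""
    ordered = sorted(values)  # x_1 is the smallest observation, x_N the largest
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical unit prices (NOT the data of Table 2.1).
prices = [12, 35, 20, 48, 27, 60, 41, 15]
points = quantile_plot_points(prices)
for f, x in points:
    print(f"{f:.4f}  {x}")
```

To draw the quantile plot, graph each $x_i$ on the vertical axis against its $f_i$ on the horizontal axis with any plotting library.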
Example 2.13
Quantile plot. Figure 2.4 shows a quantile plot for the unit price data of Table 2.1.
Quantile–Quantile Plot
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another. It is a powerful visualization tool in that it
allows the user to view whether there is a shift in going from one distribution to another.
Suppose that we have two sets of observations for the attribute or variable unit price,
taken from two different branch locations. Let $x_1, \ldots, x_N$ be the data from the first
branch, and $y_1, \ldots, y_M$ be the data from the second, where each data set is sorted in
increasing order. If $M = N$ (i.e., the number of points in each set is the same), then we
simply plot $y_i$ against $x_i$, where $y_i$ and $x_i$ are both $(i - 0.5)/N$ quantiles of their
respective data sets. If $M < N$ (i.e., the second branch has fewer observations than the first),
there can be only $M$ points on the q-q plot. Here, $y_i$ is the $(i - 0.5)/M$ quantile of the
$y$
Figure 2.4 A quantile plot for the unit price data of Table 2.1. [Plot: Unit price ($), 0 to 140, on the vertical axis against f-value, 0.00 to 1.00, on the horizontal axis, with $Q_1$, Median, and $Q_3$ marked.]