HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 47
#9
2.2 Basic Statistical Descriptions of Data
lower than the median interval, \(\mathit{freq}_{\mathit{median}}\) is the frequency of the median interval, and \(\mathit{width}\) is the width of the median interval.
The mode is another measure of central tendency. The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode.
Example 2.8
Mode. The data from Example 2.6 are bimodal. The two modes are $52,000 and
$70,000.
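The definition of the mode can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `modes` is hypothetical, and the salary list (in thousands of dollars) is an assumed data set chosen so that its modes match the two values quoted in Example 2.8.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, or [] if each value occurs only once."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Convention from the text: if each data value occurs once, there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)

# Assumed illustrative salaries, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(modes(salaries))                 # two modes, so the data are bimodal
print(modes([1, 2, 3]))                # every value occurs once: no mode
print(modes(["red", "blue", "red"]))   # the mode also works for qualitative data
```

Because the function only counts occurrences, it applies equally to qualitative and quantitative attributes, as the text notes.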
For unimodal numeric data that are moderately skewed (asymmetrical), we have the
following empirical relation:
\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}). \tag{2.4} \]
This implies that the mode for unimodal frequency curves that are moderately skewed
can easily be approximated if the mean and median values are known.
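Solving the empirical relation (2.4) for the mode gives mode ≈ mean − 3 × (mean − median). The sketch below applies that rearrangement; the function name `approximate_mode` and the input numbers are hypothetical, and the result is only meaningful for unimodal, moderately skewed numeric data.

```python
def approximate_mode(mean, median):
    """Estimate the mode from Eq. (2.4): mode ≈ mean - 3 * (mean - median)."""
    return mean - 3 * (mean - median)

# Hypothetical summary statistics for a moderately skewed, unimodal data set.
estimate = approximate_mode(mean=60.0, median=56.0)
print(estimate)  # 60 - 3 * (60 - 56) = 48.0

# For a symmetric distribution (mean == median) the estimate equals the mean.
print(approximate_mode(mean=50.0, median=50.0))
```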
The midrange can also be used to assess the central tendency of a numeric data set.
It is the average of the largest and smallest values in the set. This measure is easy to
compute using the SQL aggregate functions max() and min().
Example 2.9
Midrange. The midrange of the data of Example 2.6 is \(\frac{30{,}000 + 110{,}000}{2} = \$70{,}000\).
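The same calculation can be sketched with Python's built-in `min()` and `max()`, which play the role of the SQL aggregates mentioned in the text. The helper name `midrange` is hypothetical, and only the smallest and largest values in the list (30,000 and 110,000) come from Example 2.9; the values in between are assumed for illustration and do not affect the result.

```python
def midrange(values):
    """Average of the smallest and largest values in a numeric data set."""
    return (min(values) + max(values)) / 2

# Only the extremes matter; the interior values are assumed for illustration.
salaries = [30_000, 52_000, 70_000, 110_000]
print(midrange(salaries))  # (30000 + 110000) / 2 = 70000.0
```

Because the midrange depends only on the two extreme values, a single outlier can shift it substantially.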
In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same center value, as shown in Figure 2.1(a).
Data in most real applications are not symmetric. They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median (Figure 2.1b), or negatively skewed, where the mode occurs at a value greater than the median (Figure 2.1c).
Figure 2.1 Mean, median, and mode of symmetric versus positively and negatively skewed data. (a) Symmetric data: the mean, median, and mode coincide. (b) Positively skewed data: from left to right, mode, median, mean. (c) Negatively skewed data: from left to right, mean, median, mode.
Chapter 2 Getting to Know Your Data
2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
We now look at measures to assess the dispersion or spread of numeric data. The measures include range, quantiles, quartiles, percentiles, and the interquartile range. The five-number summary, which can be displayed as a boxplot, is useful in identifying outliers. Variance and standard deviation also indicate the spread of a data distribution.
Range, Quartiles, and Interquartile Range
To start off, let’s study the range, quantiles, quartiles, percentiles, and the interquartile
range as measures of data dispersion.
Let \(x_1, x_2, \ldots, x_N\) be a set of observations for some numeric attribute, \(X\). The range of the set is the difference between the largest (max()) and smallest (min()) values.
Suppose that the data for attribute \(X\) are sorted in increasing numeric order. Imagine that we can pick certain data points so as to split the data distribution into equal-size consecutive sets, as in Figure 2.2. These data points are called quantiles. Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets. (We say “essentially” because there may not be data values of \(X\) that divide the data into exactly equal-sized subsets. For readability, we will refer to them as equal.) The \(k\)th \(q\)-quantile for a given data distribution is the value \(x\) such that at most \(k/q\) of the data values are less than \(x\) and at most \((q - k)/q\) of the data values are more than \(x\), where \(k\) is an integer such that \(0 < k < q\). There are \(q - 1\) \(q\)-quantiles.
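The definition of the kth q-quantile can be checked directly in code. The sketch below is one possible reading, not the only one: `satisfies_quantile` tests the two "at most" conditions, and `quantile_by_definition` returns the data value at sorted position ceil(N·k/q), which always meets both conditions. Both function names and the sample values are hypothetical.

```python
def satisfies_quantile(data, x, k, q):
    """True if at most k/q of the values are < x and at most (q-k)/q are > x."""
    n = len(data)
    less = sum(1 for v in data if v < x)
    more = sum(1 for v in data if v > x)
    # Cross-multiplied integer comparisons avoid floating-point rounding.
    return less * q <= k * n and more * q <= (q - k) * n

def quantile_by_definition(data, k, q):
    """Return one data value that qualifies as the kth q-quantile."""
    if not 0 < k < q:
        raise ValueError("k must satisfy 0 < k < q")
    if not data:
        raise ValueError("data must be non-empty")
    ordered = sorted(data)
    n = len(ordered)
    position = -((-n * k) // q)   # ceil(n * k / q), a 1-based rank
    return ordered[position - 1]

# Illustrative values (assumed, not taken from the text).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

quartiles = [quantile_by_definition(data, k, 4) for k in (1, 2, 3)]
print(quartiles)  # q - 1 = 3 cut points for q = 4
print(all(satisfies_quantile(data, x, k, 4)
          for k, x in zip((1, 2, 3), quartiles)))
```

Note that this rule always returns an actual data value. Other conventions interpolate between neighboring values (for example, averaging the two middle values for the median of an even-sized set), so results can differ slightly between tools.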
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median. The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles. The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the most widely used forms of quantiles.
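As a brief illustration of how the median, quartiles, and percentiles relate, the sketch below uses `statistics.quantiles` from the Python standard library (available from Python 3.8). The sample values are assumed for illustration. The library interpolates between data points, which is one of several common conventions, so its cut points need not be actual data values.

```python
import statistics

# Illustrative values (assumed, not taken from the text).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

quartiles = statistics.quantiles(data, n=4)      # 3 cut points: Q1, Q2, Q3
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

q1, q2, q3 = quartiles
print("quartiles:", quartiles)
print("median:", statistics.median(data))

# The second quartile and the 50th percentile both coincide with the median,
# and the 25th and 75th percentiles coincide with Q1 and Q3.
print(q2 == statistics.median(data))
print(percentiles[24], percentiles[49], percentiles[74])
```

The list indices are offset by one because `percentiles[0]` holds the 1st percentile.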
Figure 2.2 A plot of the data distribution for some attribute X. The quantiles plotted are quartiles: Q1 (the 25th percentile), Q2 (the median), and Q3 (the 75th percentile). The three quartiles divide the distribution into four equal-size consecutive subsets, each holding 25% of the data. The second quartile corresponds to the median.