HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 49
#11
2.2 Basic Statistical Descriptions of Data
49
The quartiles give an indication of a distribution’s center, spread, and shape. The first
quartile, denoted by \(Q_1\), is the 25th percentile. It cuts off the lowest 25% of the data.
The third quartile, denoted by \(Q_3\), is the 75th percentile; it cuts off the lowest 75% (or
highest 25%) of the data. The second quartile is the 50th percentile. As the median, it
gives the center of the data distribution.
The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
\[ \mathrm{IQR} = Q_3 - Q_1. \tag{2.5} \]
Example 2.10
Interquartile range. The quartiles are the three values that split the sorted data set into
four equal parts. The data of Example 2.6 contain 12 observations, already sorted in
increasing order. Thus, the quartiles for this data set are the third, sixth, and ninth values,
respectively, in the sorted list. Therefore, \(Q_1\) = $47,000 and \(Q_3\) = $63,000. Thus,
the interquartile range is IQR = $63,000 − $47,000 = $16,000. (Note that the sixth value,
$52,000, is a median, although this data set has two medians since the number of data values
is even.)
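The positional rule used in this example can be sketched in Python as follows. This is a minimal sketch, not a general quartile routine: the function name `positional_quartiles` is hypothetical, it only handles sorted input whose length is a multiple of 4, and the salary list is assumed to match Example 2.6 (the text here confirms only the third, sixth, and ninth values).

```python
# Salaries in thousands of dollars. ASSUMPTION: this list is meant to match
# Example 2.6, which lies outside this section; the text here confirms only
# that there are 12 sorted values whose 3rd, 6th, and 9th are 47, 52, and 63.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def positional_quartiles(sorted_values):
    """Return (Q1, Q2, Q3) as the values at 1-based positions n/4, n/2, 3n/4.

    Mirrors the simple rule used in the example. Assumes the input is
    already sorted and that n is a multiple of 4.
    """
    n = len(sorted_values)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch only handles n divisible by 4")
    return (
        sorted_values[n // 4 - 1],
        sorted_values[n // 2 - 1],
        sorted_values[3 * n // 4 - 1],
    )


q1, q2, q3 = positional_quartiles(salaries)
iqr = q3 - q1  # Eq. (2.5)
print(q1, q2, q3, iqr)  # 47 52 63 16
```

Library routines typically interpolate between neighboring values, so they may report slightly different quartiles for the same data.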
Five-Number Summary, Boxplots, and Outliers
No single numeric measure of spread (e.g., IQR) is very useful for describing skewed
distributions. Consider the symmetric and skewed data distributions of Figure 2.1.
In the symmetric distribution, the median (and other measures of central tendency)
splits the data into equal-size halves. This does not occur for skewed distributions.
Therefore, it is more informative to also provide the two quartiles \(Q_1\) and \(Q_3\), along
with the median. A common rule of thumb for identifying suspected outliers is to
single out values falling at least 1.5 × IQR above the third quartile or below the first
quartile.
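As a rough illustration of this rule of thumb, the Python sketch below flags suspected outliers. The function name `suspected_outliers`, the "fence" variable names, and the sample list are illustrative assumptions rather than part of the text; the list is assumed to match the salary data of Example 2.6 (in thousands of dollars), and the quartiles are those reported in Example 2.10.

```python
def suspected_outliers(values, q1, q3, k=1.5):
    """Return the values lying at least k * IQR below q1 or above q3."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # "At least" in the rule of thumb is read as an inclusive comparison.
    return [v for v in values if v <= lower_fence or v >= upper_fence]


# ASSUMPTION: sample data taken to match Example 2.6 (thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles from Example 2.10: Q1 = 47, Q3 = 63, so IQR = 16 and the
# cutoffs are 47 - 24 = 23 and 63 + 24 = 87.
flagged = suspected_outliers(salaries, q1=47, q3=63)
print(flagged)  # [110]
```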
Because \(Q_1\), the median, and \(Q_3\) together contain no information about the endpoints
(e.g., tails) of the data, a fuller summary of the shape of a distribution can be
obtained by providing the lowest and highest data values as well. This is known as
the five-number summary. The five-number summary of a distribution consists of the
median (\(Q_2\)), the quartiles \(Q_1\) and \(Q_3\), and the smallest and largest individual observations,
written in the order Minimum, \(Q_1\), Median, \(Q_3\), Maximum.
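A five-number summary might be assembled along the following lines. This is a sketch under stated assumptions: the function name `five_number_summary` is hypothetical, the quartiles follow the positional rule of Example 2.10 (so the length must be a multiple of 4), and the sample list is assumed to match the salary data of Example 2.6.

```python
def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum) for the given values.

    ASSUMPTION: quartiles are the values at 1-based positions n/4, n/2,
    and 3n/4 of the sorted data, as in Example 2.10, so n must be a
    multiple of 4. For even n this picks the lower of the two medians.
    """
    ordered = sorted(values)  # sorting dominates the cost of the summary
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch only handles n divisible by 4")
    q1 = ordered[n // 4 - 1]
    median = ordered[n // 2 - 1]
    q3 = ordered[3 * n // 4 - 1]
    return ordered[0], q1, median, q3, ordered[-1]


# ASSUMPTION: sample data taken to match Example 2.6 (thousands of dollars),
# deliberately shuffled here to show that the function sorts its input.
salaries = [70, 30, 52, 110, 47, 63, 36, 50, 56, 60, 52, 70]
summary = five_number_summary(salaries)
print(summary)  # (30, 47, 52, 63, 110)
```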
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the
five-number summary as follows:
- Typically, the ends of the box are at the quartiles so that the box length is the interquartile range.
- The median is marked by a line within the box.
- Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
Chapter 2 Getting to Know Your Data
Figure 2.3
Boxplot for the unit price data for items sold at four branches of AllElectronics during a given
time period. [Figure not reproduced: one boxplot for each of Branch 1 through Branch 4, on a unit price ($) axis running from 20 to 220.]
When dealing with a moderate number of observations, it is worthwhile to plot
potential outliers individually. To do this in a boxplot, the whiskers are extended to the
extreme low and high observations only if these values are less than 1.5 × IQR beyond
the quartiles. Otherwise, the whiskers terminate at the most extreme observations occurring
within 1.5 × IQR of the quartiles. The remaining cases are plotted individually.
Boxplots can be used to compare several sets of compatible data.
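The whisker rule described above can be sketched in Python as follows. The function name `whiskers_and_outliers` is hypothetical, and treating a value exactly 1.5 × IQR from a quartile as an outlier is an assumption about the boundary case. The sample list is assumed to match the salary data of Example 2.6, with the quartiles taken from Example 2.10.

```python
def whiskers_and_outliers(values, q1, q3, k=1.5):
    """Return (low_whisker, high_whisker, outliers).

    Whiskers end at the most extreme observations lying less than
    k * IQR beyond the quartiles; everything else is returned as an
    outlier to be plotted individually.
    """
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # ASSUMPTION: values exactly on a limit count as outliers.
    inside = [v for v in values if lower_limit < v < upper_limit]
    outliers = [v for v in values if not (lower_limit < v < upper_limit)]
    if not inside:
        raise ValueError("no observations fall inside the whisker limits")
    return min(inside), max(inside), outliers


# ASSUMPTION: sample data taken to match Example 2.6 (thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# With Q1 = 47 and Q3 = 63 the limits are 23 and 87, so the low whisker
# reaches the minimum (30) while the high whisker stops at 70.
low, high, outliers = whiskers_and_outliers(salaries, q1=47, q3=63)
print(low, high, outliers)  # 30 70 [110]
```

When no observation lies beyond the limits, the whiskers simply reach the minimum and maximum, which recovers the plain five-number-summary boxplot.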
Example 2.11
Boxplot. Figure 2.3 shows boxplots for unit price data for items sold at four branches of
AllElectronics during a given time period. For branch 1, we see that the median price of
items sold is $80, \(Q_1\) is $60, and \(Q_3\) is $100. Notice that two outlying observations for
this branch were plotted individually, as their values of 175 and 202 lie more than 1.5
times the IQR (here 40) above \(Q_3\), that is, above 100 + 1.5 × 40 = 160.
Boxplots can be computed in \(O(n \log n)\) time. Approximate boxplots can be computed
in linear or sublinear time depending on the quality guarantee required.
Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data observations
tend to be very close to the mean, while a high standard deviation indicates that
the data are spread out over a large range of values.
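To make the contrast concrete, the short Python sketch below compares two made-up data sets that share the same mean but differ in spread. Both lists are invented for illustration, and the choice of the population standard deviation (`statistics.pstdev` from the standard library) is an assumption, since the formula has not yet been stated at this point.

```python
import statistics

# Hypothetical data sets with the same mean (50) but different spread.
tight = [49, 50, 51]
wide = [10, 50, 90]

mean_tight = statistics.mean(tight)
mean_wide = statistics.mean(wide)

# Population standard deviation: square root of the mean squared
# deviation from the mean.
sd_tight = statistics.pstdev(tight)
sd_wide = statistics.pstdev(wide)

print(mean_tight, mean_wide)            # 50 50
print(round(sd_tight, 2), round(sd_wide, 2))
print(sd_tight < sd_wide)               # True
```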