HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 53
#15
2.2 Basic Statistical Descriptions of Data
Table 2.1
A Set of Unit Price Data for Items Sold at a Branch of AllElectronics

Unit price ($)    Count of items sold
40                275
43                300
47                250
−                 −
74                360
75                515
78                540
−                 −
115               320
117               270
120               350
Figure 2.5
A q-q plot for unit price data from two AllElectronics branches. [Plot not reproduced. Axes: Branch 1 (unit price $) and Branch 2 (unit price $), each ranging from 40 to 120; the points for Q1, the median, and Q3 are marked.]
data, which is plotted against the
(i − 0.5)/M quantile of the x data. This computation
typically involves interpolation.
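
The pairing of quantiles can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive procedure: the function names `quantile` and `qq_points` are hypothetical, the sample values are invented for the example, and linear interpolation between neighboring sorted values is only one of several reasonable interpolation rules.

```python
def quantile(sorted_vals, f):
    """Return an interpolated f-quantile of already-sorted data.

    Assumption: the i-th of n sorted values (1-based) is treated as the
    (i - 0.5)/n quantile, and values in between are linearly interpolated.
    """
    n = len(sorted_vals)
    # Fractional 0-based position whose quantile level equals f.
    pos = f * n - 0.5
    if pos <= 0:
        return sorted_vals[0]
    if pos >= n - 1:
        return sorted_vals[-1]
    lo = int(pos)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[lo + 1] - sorted_vals[lo])


def qq_points(x, y):
    """Pair the (i - 0.5)/M quantiles of x and y, with M the smaller size."""
    xs, ys = sorted(x), sorted(y)
    m = min(len(xs), len(ys))
    levels = [(i - 0.5) / m for i in range(1, m + 1)]
    return [(quantile(xs, f), quantile(ys, f)) for f in levels]


# Invented sample data: 4 observations versus 8 observations.
points = qq_points([1, 2, 3, 4], [10, 20, 30, 40, 50, 60, 70, 80])
```

Because the smaller data set has M = 4 values, only four points are produced; the quantiles of the larger data set fall between observed values and must be interpolated.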
Example 2.14
Quantile–quantile plot. Figure 2.5 shows a quantile–quantile plot for unit price data of
items sold at two branches of AllElectronics during a given time period. Each point cor-
responds to the same quantile for each data set and shows the unit price of items sold at
branch 1 versus branch 2 for that quantile. (To aid in comparison, the straight line rep-
resents the case where, for each given quantile, the unit price at each branch is the same.
The darker points correspond to the data for Q1, the median, and Q3, respectively.)
We see, for example, that at Q1, the unit price of items sold at branch 1 was slightly
less than that at branch 2. In other words, 25% of items sold at branch 1 were less than or
equal to $60, while 25% of items sold at branch 2 were less than or equal to $64. At the
50th percentile (marked by the median, which is also Q2), we see that 50% of items
sold at branch 1 were less than $78, while 50% of items at branch 2 were less than $85.
In general, we note that there is a shift in the distribution of branch 1 with respect to
branch 2 in that the unit prices of items sold at branch 1 tend to be lower than those at
branch 2.
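
The reading of Example 2.14 amounts to checking on which side of the reference line each point falls. The short Python sketch below uses only the two quantile pairs quoted in the example (Q1 and the median); the name `side_of_reference_line` is a hypothetical helper introduced for illustration.

```python
# (branch 1, branch 2) unit prices at the quantiles quoted in Example 2.14.
quoted_points = {"Q1": (60, 64), "median": (78, 85)}


def side_of_reference_line(branch1_price, branch2_price):
    """Compare one q-q point with the line where both branches are equal."""
    if branch1_price < branch2_price:
        return "branch 1 lower"
    if branch1_price > branch2_price:
        return "branch 1 higher"
    return "equal"


summary = {
    name: side_of_reference_line(b1, b2)
    for name, (b1, b2) in quoted_points.items()
}
```

Both quoted points fall on the same side of the reference line, which is consistent with the shift toward lower unit prices at branch 1 noted in the example.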
Histograms
Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of
poles. Plotting histograms is a graphical method for summarizing the distribution of a
given attribute, X. If X is nominal, such as automobile model or item type, then a pole
or vertical bar is drawn for each known value of X. The height of the bar indicates the
frequency (i.e., count) of that X value. The resulting graph is more commonly known as
a bar chart.
If X is numeric, the term histogram is preferred. The range of values for X is parti-
tioned into disjoint consecutive subranges. The subranges, referred to as buckets or bins,
are disjoint subsets of the data distribution for X. The range of a bucket is known as
the width. Typically, the buckets are of equal width. For example, a price attribute with
a value range of $1 to $200 (rounded up to the nearest dollar) can be partitioned into
subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with a
height that represents the total count of items observed within the subrange. Histograms
and partitioning rules are further discussed in Chapter 3 on data reduction.
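
The equal-width partitioning just described can be sketched in Python. This is an illustrative sketch only: the function name `equal_width_histogram` and its parameters are hypothetical, prices are assumed to be whole dollars, and the input uses only the rows that are visible in Table 2.1 (the elided rows are not filled in, so the counts are partial).

```python
def equal_width_histogram(price_counts, low=1, high=200, width=20):
    """Sum item counts into buckets low..low+width-1, and so on up to high.

    Assumes integer prices, matching the subranges 1 to 20, 21 to 40, ...
    """
    buckets = {}
    start = low
    while start <= high:
        buckets[(start, min(start + width - 1, high))] = 0
        start += width
    for price, count in price_counts:
        if not low <= price <= high:
            raise ValueError(f"price {price} is outside {low}..{high}")
        # Index of the bucket that contains this price.
        index = (price - low) // width
        bucket_low = low + index * width
        buckets[(bucket_low, min(bucket_low + width - 1, high))] += count
    return buckets


# Visible rows of Table 2.1 only: (unit price in $, count of items sold).
visible_rows = [
    (40, 275), (43, 300), (47, 250),
    (74, 360), (75, 515), (78, 540),
    (115, 320), (117, 270), (120, 350),
]
hist = equal_width_histogram(visible_rows)
```

Each key of `hist` is a bucket given as a (lower, upper) pair, and each value is the bar height, that is, the total count of items observed in that subrange.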
Example 2.15
Histogram. Figure 2.6 shows a histogram for the data set of Table 2.1, where buckets (or
bins) are defined by equal-width ranges representing $20 increments and the frequency
is the count of items sold.
Although histograms are widely used, they may not be as effective as the quantile
plot, q-q plot, and boxplot methods in comparing groups of univariate observations.
Scatter Plots and Data Correlation
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes. To con-
struct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic
sense and plotted as points in the plane. Figure 2.7 shows a scatter plot for the set of data
in Table 2.1.
The scatter plot is a useful method for providing a first look at bivariate data to see
clusters of points and outliers, or to explore the possibility of correlation relationships.
Two attributes, X and Y, are correlated if one attribute implies the other. Correlations
can be positive, negative, or null (uncorrelated). Figure 2.8 shows examples of positive
and negative correlations between two attributes. If the plotted points pattern slopes