Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791



2.2 Basic Statistical Descriptions of Data

The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the 75th percentile; it cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the 50th percentile. As the median, it gives the center of the data distribution.

The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as

IQR = Q3 − Q1. (2.5)

Example 2.10 Interquartile range. The quartiles are the three values that split the sorted data set into four equal parts. The data of Example 2.6 contain 12 observations, already sorted in increasing order. Thus, the quartiles for this data are the third, sixth, and ninth values, respectively, in the sorted list. Therefore, Q1 = $47,000 and Q3 = $63,000. Thus, the interquartile range is IQR = $63,000 − $47,000 = $16,000. (Note that the sixth value, $52,000, is a median, although this data set has two medians since the number of data values is even.)
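
The calculation in Example 2.10 can be sketched in a few lines of Python. The helper name quartile_nearest_rank is hypothetical, and the salary list is an assumed stand-in for the Example 2.6 data, which is not reproduced in this section; it was chosen so that its third, sixth, and ninth values match the figures quoted above. The sketch takes the k-th quartile to be the value at position ceil(k·n/4) in the sorted list, which agrees with the example when n is divisible by 4. Other quartile conventions interpolate between neighbors and can give slightly different results.

```python
def quartile_nearest_rank(sorted_values, k):
    """Return the k-th quartile (k = 1, 2, 3) of a non-empty sorted list.

    Uses the value at 1-based position ceil(k * n / 4).
    """
    n = len(sorted_values)
    rank = -(-k * n // 4)  # integer ceiling of k * n / 4
    return sorted_values[rank - 1]


# Salaries in thousands of dollars (assumed values, see the note above).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

q1 = quartile_nearest_rank(salaries, 1)  # third value
q3 = quartile_nearest_rank(salaries, 3)  # ninth value
iqr = q3 - q1                            # Eq. (2.5)
print(q1, q3, iqr)  # prints: 47 63 16
```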

Five-Number Summary, Boxplots, and Outliers

No single numeric measure of spread (e.g., IQR) is very useful for describing skewed distributions. Have a look at the symmetric and skewed data distributions of Figure 2.1. In the symmetric distribution, the median (and other measures of central tendency) splits the data into equal-size halves. This does not occur for skewed distributions. Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median. A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
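
As a rough illustration of this rule of thumb, the following Python sketch computes the two cutoffs (often called fences, a term not used in the text) and lists the values at or beyond them. The function names are hypothetical. The quartiles are those quoted in Example 2.10, and the salary list is an assumed stand-in for the Example 2.6 data.

```python
def outlier_fences(q1, q3, factor=1.5):
    """Return the (lower, upper) cutoffs: factor * IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def suspected_outliers(values, q1, q3, factor=1.5):
    """List values lying at least factor * IQR below Q1 or above Q3."""
    low, high = outlier_fences(q1, q3, factor)
    return [v for v in values if v <= low or v >= high]


# Salaries in thousands of dollars (assumed values); Q1 = 47 and Q3 = 63.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(outlier_fences(47, 63))                # prints: (23.0, 87.0)
print(suspected_outliers(salaries, 47, 63))  # prints: [110]
```

Under these assumptions only the largest salary is singled out, since 110 lies more than 1.5 × 16 = 24 above Q3 = 63.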

Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary. The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
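
A minimal sketch of the five-number summary in Python follows. The names FiveNumberSummary and five_number_summary are hypothetical, the quartiles are taken by the simple positional convention of Example 2.10 (the value at position ceil(k·n/4) of the sorted data), and the salary list is an assumed stand-in for the Example 2.6 data. Libraries that interpolate between neighboring values may report slightly different quartiles.

```python
from typing import NamedTuple, Sequence


class FiveNumberSummary(NamedTuple):
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float


def five_number_summary(values: Sequence[float]) -> FiveNumberSummary:
    """Return (Minimum, Q1, Median, Q3, Maximum) for a non-empty data set."""
    data = sorted(values)
    if not data:
        raise ValueError("five_number_summary requires at least one value")
    n = len(data)

    def quartile(k: int) -> float:
        # Value at 1-based position ceil(k * n / 4) in the sorted data.
        return data[-(-k * n // 4) - 1]

    return FiveNumberSummary(data[0], quartile(1), quartile(2), quartile(3), data[-1])


# Salaries in thousands of dollars (assumed values).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))
# prints: FiveNumberSummary(minimum=30, q1=47, median=52, q3=63, maximum=110)
```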

Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows:

Typically, the ends of the box are at the quartiles so that the box length is the interquartile range.

The median is marked by a line within the box.

Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.


Chapter 2 Getting to Know Your Data

Figure 2.3 Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time period. (Axis: unit price ($), with ticks from 40 to 220; one boxplot per branch, Branch 1 to Branch 4.)

When dealing with a moderate number of observations, it is worthwhile to plot potential outliers individually. To do this in a boxplot, the whiskers are extended to the extreme low and high observations only if these values are less than 1.5 × IQR beyond the quartiles. Otherwise, the whiskers terminate at the most extreme observations occurring within 1.5 × IQR of the quartiles. The remaining cases are plotted individually. Boxplots can be used in the comparisons of several sets of compatible data.
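
The whisker rule just described can be sketched as follows. The function name whiskers_and_outliers and the price list are hypothetical. The list was made up so that its quartiles ($60 and $100) and its two largest values ($175 and $202) match the branch 1 figures quoted in Example 2.11; the remaining prices are assumptions, not data from the book.

```python
def whiskers_and_outliers(values, q1, q3, factor=1.5):
    """Return (low whisker end, high whisker end, individually plotted values).

    Whiskers stop at the most extreme observations lying strictly within
    factor * IQR of the quartiles; everything else is plotted individually.
    Assumes at least one observation falls inside that range.
    """
    iqr = q3 - q1
    low_cut = q1 - factor * iqr
    high_cut = q3 + factor * iqr
    inside = [v for v in values if low_cut < v < high_cut]
    outside = sorted(v for v in values if not (low_cut < v < high_cut))
    return min(inside), max(inside), outside


# Hypothetical unit prices in dollars for one branch.
prices = [40, 50, 60, 65, 70, 80, 85, 90, 100, 120, 175, 202]

print(whiskers_and_outliers(prices, q1=60, q3=100))
# prints: (40, 120, [175, 202])
```

With Q1 = 60 and Q3 = 100 the cutoffs are 0 and 160, so the upper whisker ends at 120 rather than at the maximum, and 175 and 202 are drawn as separate points.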

Example 2.11 Boxplot. Figure 2.3 shows boxplots for unit price data for items sold at four branches of AllElectronics during a given time period. For branch 1, we see that the median price of items sold is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted individually: the IQR here is $40, so 1.5 × IQR is $60, and their values of $175 and $202 lie more than $60 above Q3.

Boxplots can be computed in O(n log n) time. Approximate boxplots can be computed in linear or sublinear time depending on the quality guarantee required.

Variance and Standard Deviation

Variance and standard deviation are measures of data dispersion. They indicate how spread out a data distribution is. A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
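
To make the contrast concrete, the short Python sketch below compares two made-up data sets that share the same mean but differ in spread. Both lists are hypothetical. The sketch uses the population forms from Python's standard statistics module (dividing by the number of observations N), which is an assumption here because this section has not yet stated the formula.

```python
import statistics

# Two hypothetical data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)  # population variance (divide by N)
    std_dev = statistics.pstdev(data)      # square root of the variance
    print(f"{name}: mean={mean}, variance={variance}, std={std_dev:.2f}")
```

The first set has a variance of 2 and the second a variance of 800, so the second standard deviation is about 20 times larger even though both sets are centered on 50.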
