Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791



2.2 Basic Statistical Descriptions of Data

lower than the median interval, \(freq_{median}\) is the frequency of the median interval, and width is the width of the median interval.

The mode is another measure of central tendency. The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode.
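The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `modes` is hypothetical, and the salary values (in thousands of dollars) are assumed to be those of Example 2.6, which is not reproduced in this section.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the greatest frequency, sorted.

    Returns an empty list when each value occurs only once (no mode).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Every value occurs exactly once, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)


# Assumed to match the salary data of Example 2.6 (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(modes(salaries))                 # [52, 70]  (bimodal)
print(modes(["red", "blue", "red"]))   # ['red']   (works for nominal data too)
print(modes([1, 2, 3]))                # []        (no mode)
```

Because the function only counts occurrences, it applies equally to qualitative and quantitative attributes, as the text notes.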

Example 2.8 Mode. The data from Example 2.6 are bimodal. The two modes are $52,000 and $70,000.

For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical relation:

\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}). \tag{2.4} \]

This implies that the mode for unimodal frequency curves that are moderately skewed can easily be approximated if the mean and median values are known.
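Solving Eq. (2.4) for the mode gives mode ≈ mean − 3 × (mean − median). The sketch below applies that rearrangement. The function name `approximate_mode` and the input values are hypothetical, chosen only for illustration, and the result is an estimate that is meaningful only for unimodal, moderately skewed data.

```python
def approximate_mode(mean, median):
    """Estimate the mode from Eq. (2.4): mean - mode ≈ 3 * (mean - median)."""
    return mean - 3 * (mean - median)


# Hypothetical summary statistics for a moderately skewed, unimodal data set.
mean_value = 50.0
median_value = 48.0

print(approximate_mode(mean_value, median_value))  # 44.0
```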

The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the largest and smallest values in the set. This measure is easy to compute using the SQL aggregate functions, max() and min().

Example 2.9 Midrange. The midrange of the data of Example 2.6 is

\[ \frac{30{,}000 + 110{,}000}{2} = \$70{,}000. \]
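As a rough illustration of the SQL approach, the following sketch loads salary values into an in-memory SQLite database and computes the midrange with MAX() and MIN(). The table name `employee` and column name `salary` are hypothetical, and the values are assumed to be those of Example 2.6.

```python
import sqlite3

# Assumed to match the salary data of Example 2.6 (in dollars).
salaries = [30000, 36000, 47000, 50000, 52000, 52000,
            56000, 60000, 63000, 70000, 70000, 110000]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (salary INTEGER)")  # hypothetical schema
conn.executemany("INSERT INTO employee (salary) VALUES (?)",
                 [(s,) for s in salaries])

# Midrange = average of the largest and smallest values.
# Dividing by 2.0 forces floating-point division in SQLite.
(midrange,) = conn.execute(
    "SELECT (MAX(salary) + MIN(salary)) / 2.0 FROM employee"
).fetchone()
conn.close()

print(midrange)  # 70000.0
```

Only the two extreme values enter the computation, so the midrange is cheap to obtain but highly sensitive to outliers.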

In a unimodal frequency curve with perfect symmetric data distribution, the mean, median, and mode are all at the same center value, as shown in Figure 2.1(a).

Data in most real applications are not symmetric. They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median (Figure 2.1b), or negatively skewed, where the mode occurs at a value greater than the median (Figure 2.1c).
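The comparison of mode and median described above can be turned into a simple check. This is only a sketch under the assumption that the data are unimodal. The function name `skew_direction` and the three small data sets are hypothetical.

```python
import statistics


def skew_direction(values):
    """Classify skew by comparing the mode with the median.

    Assumes the data are unimodal; the result is not meaningful otherwise.
    """
    mode = statistics.mode(values)
    median = statistics.median(values)
    if mode < median:
        return "positive"   # mode lies below the median
    if mode > median:
        return "negative"   # mode lies above the median
    return "no skew detected"


# Hypothetical data sets chosen for illustration.
right_tailed = [1, 2, 2, 2, 3, 4, 5, 9, 12]      # mode 2, median 3
left_tailed = [1, 4, 8, 9, 10, 11, 11, 11, 12]   # mode 11, median 10
balanced = [1, 2, 3, 3, 3, 4, 5]                 # mode 3, median 3

print(skew_direction(right_tailed))  # positive
print(skew_direction(left_tailed))   # negative
print(skew_direction(balanced))      # no skew detected
```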

Figure 2.1 Mean, median, and mode of symmetric versus positively and negatively skewed data: (a) symmetric data; (b) positively skewed data; (c) negatively skewed data.


Chapter 2 Getting to Know Your Data

2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range

We now look at measures to assess the dispersion or spread of numeric data. The measures include range, quantiles, quartiles, percentiles, and the interquartile range. The five-number summary, which can be displayed as a boxplot, is useful in identifying outliers. Variance and standard deviation also indicate the spread of a data distribution.

Range, Quartiles, and Interquartile Range

To start off, let’s study the range, quantiles, quartiles, percentiles, and the interquartile range as measures of data dispersion.

Let \(x_1, x_2, \ldots, x_N\) be a set of observations for some numeric attribute, X. The range of the set is the difference between the largest (max()) and smallest (min()) values.
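In code, the range is a one-line computation. The sketch below uses a hypothetical helper name, `value_range`, and salary values (in thousands of dollars) assumed to be those of Example 2.6.

```python
def value_range(values):
    """Return the difference between the largest and smallest values."""
    return max(values) - min(values)


# Assumed to match the salary data of Example 2.6 (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(value_range(salaries))  # 80
```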

Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that we can pick certain data points so as to split the data distribution into equal-size consecutive sets, as in Figure 2.2. These data points are called quantiles. Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets. (We say “essentially” because there may not be data values of X that divide the data into exactly equal-sized subsets. For readability, we will refer to them as equal.) The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q − k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
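One way to make the definition concrete is to pick, from the sorted data, the value at position ⌈N·k/q⌉ (counting from 1). That value always satisfies both "at most" conditions. The definition may admit more than one valid value, and many software packages interpolate between neighboring values instead, so the code below is just one possible convention. The function names `q_quantile` and `satisfies_definition` are hypothetical, and the salary values are assumed to be those of Example 2.6.

```python
def q_quantile(values, k, q):
    """Return a data value that is a kth q-quantile of `values` (0 < k < q)."""
    if not 0 < k < q:
        raise ValueError("k must satisfy 0 < k < q")
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    n = len(ordered)
    # ceil(n * k / q) in integer arithmetic, shifted to a 0-based index.
    index = -(-n * k // q) - 1
    return ordered[index]


def satisfies_definition(values, x, k, q):
    """Check: at most k/q of the values are < x and at most (q-k)/q are > x."""
    n = len(values)
    less = sum(1 for v in values if v < x)
    more = sum(1 for v in values if v > x)
    # Cross-multiplied so the comparison stays in exact integers.
    return less * q <= k * n and more * q <= (q - k) * n


# Assumed to match the salary data of Example 2.6 (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

quartiles = [q_quantile(salaries, k, 4) for k in (1, 2, 3)]
print(quartiles)  # [47, 52, 63]
print(all(satisfies_definition(salaries, x, k, 4)
          for k, x in zip((1, 2, 3), quartiles)))  # True
```

Note that this convention returns 52 as the 2-quantile, whereas averaging the two middle values would give 54. Both values satisfy the definition.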

The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median. The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles. The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the most widely used forms of quantiles.

Figure 2.2 A plot of the data distribution for some attribute X. The quantiles plotted are quartiles. The three quartiles divide the distribution into four equal-size consecutive subsets. The second quartile corresponds to the median. [Figure labels: Q1 (25th percentile), Q2 (median), Q3 (75th percentile); each subset holds 25% of the data.]
