Data Mining. Concepts and Techniques, 3rd Edition

HAN 09-ch02-039-082-9780123814791

Page 37/343
Date 08.10.2017
Size 7.95 MB

Example 2.14 Quantile–quantile plot.
Histograms
Example 2.15 Histogram.
Scatter Plots and Data Correlation

HAN 09-ch02-039-082-9780123814791 2011/6/1 3:15 Page 53 #15

Chapter 2 Getting to Know Your Data

2.2 Basic Statistical Descriptions of Data

Table 2.1 A Set of Unit Price Data for Items Sold at a Branch of AllElectronics

Unit price ($)    Count of items sold
                  275
                  300
                  250
−                 −
                  360
                  515
                  540
−                 −
115               320
117               270
120               350

Figure 2.5 A q-q plot for unit price data from two AllElectronics branches. (Axes: Branch 1 (unit price $) and Branch 2 (unit price $); labeled points include Q1 and the median.)

data, which is plotted against the (i − 0.5)/M quantile of the x data. This computation typically involves interpolation.
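As a rough sketch of the computation described above, the following Python pairs up quantiles from two samples of different sizes. It assumes that the i-th sorted observation of an n-point sample sits at the (i − 0.5)/n quantile and that values between observations are found by linear interpolation. The function names `quantile_at` and `qq_pairs` and the sample values are illustrative and do not come from the text.

```python
def quantile_at(sorted_vals, f):
    """Value at fraction f, assuming sorted_vals[k] (0-based) sits at (k + 0.5) / n."""
    n = len(sorted_vals)
    pos = f * n - 0.5              # fractional 0-based index that corresponds to f
    if pos <= 0:
        return sorted_vals[0]      # clamp below the first observation
    if pos >= n - 1:
        return sorted_vals[-1]     # clamp above the last observation
    lo = int(pos)
    frac = pos - lo
    # Linear interpolation between the two neighboring observations.
    return sorted_vals[lo] + frac * (sorted_vals[lo + 1] - sorted_vals[lo])


def qq_pairs(x, y):
    """Return (x quantile, y quantile) pairs at the smaller sample's quantile levels."""
    xs, ys = sorted(x), sorted(y)
    m = min(len(xs), len(ys))
    levels = [(i - 0.5) / m for i in range(1, m + 1)]
    return [(quantile_at(xs, f), quantile_at(ys, f)) for f in levels]


# Hypothetical samples: four x values, two y values.
pairs = qq_pairs([10, 20, 30, 40], [15, 35])
print(pairs)
```

Only the larger sample needs interpolation here; the smaller sample's own observations are used directly, because its quantile levels are the ones chosen for the plot.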

Example 2.14

Quantile–quantile plot. Figure 2.5 shows a quantile–quantile plot for unit price data of items sold at two branches of AllElectronics during a given time period. Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile. (To aid in comparison, the straight line represents the case where, for each given quantile, the unit price at each branch is the same. The darker points correspond to the data for Q1, the median, and Q3, respectively.)

We see, for example, that at Q1, the unit price of items sold at branch 1 was slightly less than that at branch 2. In other words, 25% of items sold at branch 1 were priced at or below $60, while 25% of items sold at branch 2 were priced at or below $64. At the 50th percentile (marked by the median, which is also Q2), we see that 50% of items sold at branch 1 were priced below $78, while 50% of items at branch 2 were priced below $85. In general, we note that there is a shift in the distribution of branch 1 with respect to branch 2 in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.

Histograms

Histograms (or frequency histograms) are at least a century old and are widely used. “Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X. If X is nominal, such as automobile model or item type, then a pole or vertical bar is drawn for each known value of X. The height of the bar indicates the frequency (i.e., count) of that X value. The resulting graph is more commonly known as a bar chart.
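For a nominal attribute, the bar heights are simply per-value counts. A minimal sketch, assuming a made-up list of item types (the values below are hypothetical, not data from the text):

```python
from collections import Counter

# Hypothetical nominal attribute values (item types); not data from the text.
item_types = ["TV", "camera", "TV", "laptop", "camera", "TV"]

# One bar per distinct value; the bar height is that value's count.
bar_heights = Counter(item_types)

# Crude text rendering of the bar chart, one row per value.
for value, count in sorted(bar_heights.items()):
    print(f"{value:<8}{'#' * count} ({count})")
```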

If X is numeric, the term histogram is preferred. The range of values for X is partitioned into disjoint consecutive subranges. The subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X. The range of a bucket is known as the width. Typically, the buckets are of equal width. For example, a price attribute with a value range of $1 to $200 (rounded up to the nearest dollar) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with a height that represents the total count of items observed within the subrange. Histograms and partitioning rules are further discussed in Chapter 3 on data reduction.
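The equal-width partitioning in the price example can be sketched as follows. The function name `equal_width_histogram` and the sample prices are assumptions made for illustration; the sketch also assumes whole-dollar integer values, as in the $1 to $200 example.

```python
def equal_width_histogram(values, low, high, width):
    """Count integer values per bucket: low..low+width-1, low+width.., up to high."""
    n_buckets = -(-(high - low + 1) // width)      # ceiling division
    counts = [0] * n_buckets
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"{v} is outside {low}..{high}")
        counts[(v - low) // width] += 1            # index of the bucket containing v
    # Inclusive (start, end) bounds of each bucket, e.g. (1, 20), (21, 40), ...
    bounds = [
        (low + k * width, min(low + (k + 1) * width - 1, high))
        for k in range(n_buckets)
    ]
    return list(zip(bounds, counts))


# Hypothetical whole-dollar prices in the range $1 to $200.
prices = [1, 20, 21, 35, 40, 41, 200]
hist = equal_width_histogram(prices, low=1, high=200, width=20)
for (start, end), count in hist:
    print(f"{start:>3} to {end:<3} {count}")
```

Each returned count is the bar height for its subrange, so buckets with no observations still appear with a height of zero.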

Example 2.15

Histogram. Figure 2.6 shows a histogram for the data set of Table 2.1, where buckets (or bins) are defined by equal-width ranges representing $20 increments and the frequency is the count of items sold.

Although histograms are widely used, they may not be as effective as the quantile plot, q-q plot, and boxplot methods in comparing groups of univariate observations.

Scatter Plots and Data Correlation

A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numeric attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane. Figure 2.7 shows a scatter plot for the set of data in Table 2.1.
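A small sketch of the construction: the two attributes are zipped into coordinate pairs, which are the points a plotting library would draw. To put a number on the trend such a plot reveals, the sketch also computes Pearson's correlation coefficient. That measure is not defined in this passage and is used here only as an assumed, commonly used summary; the data values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attribute values; not the data of Table 2.1.
prices = [10, 20, 30, 40]
counts = [8, 6, 4, 2]

# Each (price, count) pair is one point in the plane.
points = list(zip(prices, counts))

# A value near +1 suggests an upward trend, near -1 a downward trend,
# and near 0 no linear trend.
r = pearson_r(prices, counts)
print(points, r)
```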

The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points and outliers, or to explore the possibility of correlation relationships. Two attributes, X and Y, are correlated if one attribute implies the other. Correlations can be positive, negative, or null (uncorrelated). Figure 2.8 shows examples of positive and negative correlations between two attributes. If the plotted points pattern slopes
