attributes with initially large ranges (e.g., income) from outweighing attributes with
initially smaller ranges (e.g., binary attributes). It is also useful when given no prior
knowledge of the data.
There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let $A$ be a numeric attribute with $n$ observed values, $v_1, v_2, \ldots, v_n$.
Min-max normalization performs a linear transformation on the original data. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute, $A$. Min-max normalization maps a value, $v_i$, of $A$ to $v_i'$ in the range $[new\_min_A, new\_max_A]$ by computing
$$v_i' = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A. \tag{3.8}$$
Min-max normalization preserves the relationships among the original data values. It
will encounter an “out-of-bounds” error if a future input case for normalization falls
outside of the original data range for A.
Example 3.4 Min-max normalization. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range $[0.0, 1.0]$. By min-max normalization, a value of $73,600 for income is transformed to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716.$$
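To make Eq. (3.8) concrete, here is a minimal Python sketch of min-max normalization; the function name min_max_normalize and the sample income list are illustrative, not from the text.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each value v_i to the range [new_min, new_max] per Eq. (3.8).

    Note: a future value outside [min_a, max_a] would map outside
    [new_min, new_max] -- the "out-of-bounds" issue noted above.
    """
    min_a, max_a = min(values), max(values)
    return [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
            for v in values]

incomes = [12_000, 54_000, 73_600, 98_000]
print(min_max_normalize(incomes))  # 73,600 maps to about 0.716
```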
In z-score normalization (or zero-mean normalization), the values for an attribute, $A$, are normalized based on the mean (i.e., average) and standard deviation of $A$. A value, $v_i$, of $A$ is normalized to $v_i'$ by computing
$$v_i' = \frac{v_i - \bar{A}}{\sigma_A}, \tag{3.9}$$
where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation, respectively, of attribute $A$. The mean and standard deviation were discussed in Section 2.2, where $\bar{A} = \frac{1}{n}(v_1 + v_2 + \cdots + v_n)$ and $\sigma_A$ is computed as the square root of the variance of $A$ (see Eq. (2.6)). This method of normalization is useful when the actual minimum and maximum of attribute $A$ are unknown, or when there are outliers that dominate the min-max normalization.
Example 3.5 z-score normalization. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225.$$
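As a companion sketch of Eq. (3.9), the function below is illustrative (the name zscore_normalize is not from the text), and the population standard deviation (dividing by n) is assumed, matching the variance definition referenced above.

```python
import math

def zscore_normalize(values):
    """Map each value v_i to (v_i - mean) / std, per Eq. (3.9)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    return [(v - mean) / std for v in values]

# With the book's parameters (mean $54,000, std $16,000), a single value:
print((73_600 - 54_000) / 16_000)  # 1.225
```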
A variation of this z-score normalization replaces the standard deviation of Eq. (3.9) by the mean absolute deviation of $A$. The mean absolute deviation of $A$, denoted $s_A$, is
$$s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right). \tag{3.10}$$
Thus, z-score normalization using the mean absolute deviation is
$$v_i' = \frac{v_i - \bar{A}}{s_A}. \tag{3.11}$$
The mean absolute deviation, $s_A$, is more robust to outliers than the standard deviation, $\sigma_A$. When computing the mean absolute deviation, the deviations from the mean (i.e., $|v_i - \bar{A}|$) are not squared; hence, the effect of outliers is somewhat reduced.
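A minimal sketch of this variant follows; the name zscore_normalize_mad is hypothetical.

```python
def zscore_normalize_mad(values):
    """Map each v_i to (v_i - mean) / s_A, where s_A is the mean
    absolute deviation of Eq. (3.10); more outlier-robust than Eq. (3.9)."""
    n = len(values)
    mean = sum(values) / n
    s_a = sum(abs(v - mean) for v in values) / n
    return [(v - mean) / s_a for v in values]
```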
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute $A$. The number of decimal places moved depends on the maximum absolute value of $A$. A value, $v_i$, of $A$ is normalized to $v_i'$ by computing
$$v_i' = \frac{v_i}{10^j}, \tag{3.12}$$
where $j$ is the smallest integer such that $\max(|v_i'|) < 1$.
Example 3.6
Decimal scaling. Suppose that the recorded values of
A range from −986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore
divide each value by 1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917
normalizes to 0.917.
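The following sketch implements Eq. (3.12); the helper name decimal_scale is hypothetical.

```python
import math

def decimal_scale(values):
    """Divide each value by 10**j, where j is the smallest integer
    making max(|v_i'|) < 1, per Eq. (3.12)."""
    max_abs = max(abs(v) for v in values)
    j = math.ceil(math.log10(max_abs)) if max_abs > 0 else 0
    # If max_abs is an exact power of 10 (e.g., 100), |v'| would equal 1,
    # so bump j by one in that case.
    if max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scale([-986, 917]))  # [-0.986, 0.917], with j = 3
```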
Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
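One way to keep future data consistent is to separate fitting the parameters from applying them; the class below is a hypothetical sketch of that pattern, not an API from the text.

```python
import math

class ZScoreScaler:
    """Illustrative fit/transform split: fit() stores the mean and
    standard deviation so transform() can normalize future data with
    the same saved parameters."""
    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        self.std = math.sqrt(sum((v - self.mean) ** 2 for v in values) / n)
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

scaler = ZScoreScaler().fit([12_000, 54_000, 73_600, 98_000])
future = scaler.transform([60_000])  # normalized with the saved parameters
```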
3.5.3 Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins.
Section 3.2.2 discussed binning methods for data smoothing. These methods are also
used as discretization methods for data reduction and concept hierarchy generation. For
example, attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median, as in smoothing
by bin means or
smoothing by bin medians, respectively. These techniques can be applied
recursively to the resulting partitions to generate concept hierarchies.
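As a rough sketch of equal-width binning with smoothing by bin means (the function name and sample data are illustrative):

```python
def equal_width_bin_means(values, n_bins):
    """Equal-width binning followed by smoothing by bin means: each
    value is replaced by the mean of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins

    def bin_index(v):
        # Clamp so the maximum value lands in the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    bins = [[] for _ in range(n_bins)]
    for v in values:
        bins[bin_index(v)].append(v)
    means = [sum(b) / len(b) if b else None for b in bins]
    return [means[bin_index(v)] for v in values]

print(equal_width_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [6.0, 6.0, 19.0, 19.0, 19.0, 27.75, 27.75, 27.75, 27.75]
```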
Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
3.5.4 Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. Histograms were introduced in Section 2.2.3. A histogram partitions the values of an attribute, $A$, into disjoint ranges called buckets or bins.