attributes with initially large ranges (e.g., income) from outweighing attributes with
initially smaller ranges (e.g., binary attributes). It is also useful when given no prior
knowledge of the data.
There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let $A$ be a numeric attribute with $n$ observed values, $v_1, v_2, \ldots, v_n$.
Min-max normalization performs a linear transformation on the original data. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute, $A$. Min-max normalization maps a value, $v_i$, of $A$ to $v_i'$ in the range $[new\_min_A, new\_max_A]$ by computing
$$v_i' = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A. \tag{3.8}$$
Min-max normalization preserves the relationships among the original data values. It
will encounter an “out-of-bounds” error if a future input case for normalization falls
outside of the original data range for A.
Example 3.4 Min-max normalization. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range $[0.0, 1.0]$. By min-max normalization, a value of $73,600 for income is transformed to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716.$$
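To make Eq. (3.8) concrete, here is a minimal Python sketch of min-max normalization; the function name min_max_normalize and the sample income list are illustrative, not from the text.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each value v_i to the range [new_min, new_max] per Eq. (3.8).

    Note: a future value outside [min_a, max_a] would map outside
    [new_min, new_max] -- the "out-of-bounds" issue noted above.
    """
    min_a, max_a = min(values), max(values)
    return [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
            for v in values]

incomes = [12_000, 54_000, 73_600, 98_000]
print(min_max_normalize(incomes))  # 73,600 maps to about 0.716
```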
In z-score normalization (or zero-mean normalization), the values for an attribute, $A$, are normalized based on the mean (i.e., average) and standard deviation of $A$. A value, $v_i$, of $A$ is normalized to $v_i'$ by computing
$$v_i' = \frac{v_i - \bar{A}}{\sigma_A}, \tag{3.9}$$
where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation, respectively, of attribute $A$. The mean and standard deviation were discussed in Section 2.2, where $\bar{A} = \frac{1}{n}(v_1 + v_2 + \cdots + v_n)$ and $\sigma_A$ is computed as the square root of the variance of $A$ (see Eq. (2.6)). This method of normalization is useful when the actual minimum and maximum of attribute $A$ are unknown, or when there are outliers that dominate the min-max normalization.
Example 3.5 z-score normalization. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225.$$
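As a companion sketch of Eq. (3.9), the function below is illustrative (the name zscore_normalize is not from the text), and the population standard deviation (dividing by n) is assumed, matching the variance definition referenced above.

```python
import math

def zscore_normalize(values):
    """Map each value v_i to (v_i - mean) / std, per Eq. (3.9)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    return [(v - mean) / std for v in values]

# With the book's parameters (mean $54,000, std $16,000), a single value:
print((73_600 - 54_000) / 16_000)  # 1.225
```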
A variation of this z-score normalization replaces the standard deviation of Eq. (3.9) by the mean absolute deviation of $A$. The mean absolute deviation of $A$, denoted $s_A$, is
$$s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right). \tag{3.10}$$
Thus, z-score normalization using the mean absolute deviation is
$$v_i' = \frac{v_i - \bar{A}}{s_A}. \tag{3.11}$$
The mean absolute deviation, $s_A$, is more robust to outliers than the standard deviation, $\sigma_A$. When computing the mean absolute deviation, the deviations from the mean (i.e., $|v_i - \bar{A}|$) are not squared; hence, the effect of outliers is somewhat reduced.
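A minimal sketch of this variant follows; the name zscore_normalize_mad is hypothetical.

```python
def zscore_normalize_mad(values):
    """Map each v_i to (v_i - mean) / s_A, where s_A is the mean
    absolute deviation of Eq. (3.10); more outlier-robust than Eq. (3.9)."""
    n = len(values)
    mean = sum(values) / n
    s_a = sum(abs(v - mean) for v in values) / n
    return [(v - mean) / s_a for v in values]
```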
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute $A$. The number of decimal places moved depends on the maximum absolute value of $A$. A value, $v_i$, of $A$ is normalized to $v_i'$ by computing
$$v_i' = \frac{v_i}{10^j}, \tag{3.12}$$
where $j$ is the smallest integer such that $\max(|v_i'|) < 1$.
Example 3.6
Decimal scaling. Suppose that the recorded values of
A range from −986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore
divide each value by 1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917
normalizes to 0.917.
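The following sketch implements Eq. (3.12); the helper name decimal_scale is hypothetical.

```python
import math

def decimal_scale(values):
    """Divide each value by 10**j, where j is the smallest integer
    making max(|v_i'|) < 1, per Eq. (3.12)."""
    max_abs = max(abs(v) for v in values)
    j = math.ceil(math.log10(max_abs)) if max_abs > 0 else 0
    # If max_abs is an exact power of 10 (e.g., 100), |v'| would equal 1,
    # so bump j by one in that case.
    if max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scale([-986, 917]))  # [-0.986, 0.917], with j = 3
```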
Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
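One way to keep future data consistent is to separate fitting the parameters from applying them; the class below is a hypothetical sketch of that pattern, not an API from the text.

```python
import math

class ZScoreScaler:
    """Illustrative fit/transform split: fit() stores the mean and
    standard deviation so transform() can normalize future data with
    the same saved parameters."""
    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        self.std = math.sqrt(sum((v - self.mean) ** 2 for v in values) / n)
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

scaler = ZScoreScaler().fit([12_000, 54_000, 73_600, 98_000])
future = scaler.transform([60_000])  # normalized with the saved parameters
```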
3.5.3 Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins.
Section 3.2.2 discussed binning methods for data smoothing. These methods are also
used as discretization methods for data reduction and concept hierarchy generation. For
example, attribute values can be discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by the bin mean or median, as in smoothing
by bin means or
smoothing by bin medians, respectively. These techniques can be applied
recursively to the resulting partitions to generate concept hierarchies.
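As a rough sketch of equal-width binning with smoothing by bin means (the function name and sample data are illustrative):

```python
def equal_width_bin_means(values, n_bins):
    """Equal-width binning followed by smoothing by bin means: each
    value is replaced by the mean of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins

    def bin_index(v):
        # Clamp so the maximum value lands in the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    bins = [[] for _ in range(n_bins)]
    for v in values:
        bins[bin_index(v)].append(v)
    means = [sum(b) / len(b) if b else None for b in bins]
    return [means[bin_index(v)] for v in values]

print(equal_width_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [6.0, 6.0, 19.0, 19.0, 19.0, 27.75, 27.75, 27.75, 27.75]
```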
Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
3.5.4 Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. Histograms were introduced in Section 2.2.3. A histogram partitions the values of an attribute, $A$, into disjoint ranges called buckets or bins.