Figure 3.5 Principal components analysis. Y1 and Y2 are the first two principal components for the given data. (The figure plots the data on the original axes X1 and X2, with the principal component axes Y1 and Y2 overlaid.)
providing important information about variance. That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on. For example, Figure 3.5 shows the first two principal components, Y1 and Y2, for the given set of data originally mapped to the axes X1 and X2. This information helps identify groups or patterns within the data.
4. Because the components are sorted in decreasing order of “significance,” the data size can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
PCA can be applied to ordered and unordered attributes, and can handle sparse data
and skewed data. Multidimensional data of more than two dimensions can be han-
dled by reducing the problem to two dimensions. Principal components may be used
as inputs to multiple regression and cluster analysis. In comparison with wavelet trans-
forms, PCA tends to be better at handling sparse data, whereas wavelet transforms are
more suitable for data of high dimensionality.
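To make this concrete, the following is a minimal Python sketch of PCA via eigendecomposition of the covariance matrix; the function name pca_reduce, the choice of k, and the example data are illustrative rather than taken from the text.

    import numpy as np

    def pca_reduce(X, k):
        """Project the n-by-d data matrix X onto its k strongest
        principal components, returning the reduced n-by-k data."""
        # Normalize by centering: give each attribute zero mean.
        X_centered = X - X.mean(axis=0)
        # Eigenvectors of the d-by-d covariance matrix are the principal
        # components; the eigenvalues are the variances along them.
        eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
        # eigh returns eigenvalues in ascending order, so reverse to get
        # components in decreasing order of "significance."
        order = np.argsort(eigvals)[::-1]
        components = eigvecs[:, order[:k]]  # d-by-k projection matrix
        # Project the centered data onto the top-k components.
        return X_centered @ components

    # Example: reduce correlated 2-D data (axes X1, X2) to its first
    # principal component (axis Y1), as in Figure 3.5.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
    Y = pca_reduce(X, k=1)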
3.4.4 Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrel-
evant to the mining task or redundant. For example, if the task is to classify customers
based on whether or not they are likely to purchase a popular new CD at AllElectronics
when notified of a sale, attributes such as the customer’s telephone number are likely to
be irrelevant, unlike attributes such as age or music taste. Although it may be possible for
a domain expert to pick out some of the useful attributes, this can be a difficult and time-
consuming task, especially when the data’s behavior is not well known. (Hence, a reason
behind its analysis!) Leaving out relevant attributes or keeping irrelevant attributes may
be detrimental, causing confusion for the mining algorithm employed. This can result
in discovered patterns of poor quality. In addition, the added volume of irrelevant or
redundant attributes can slow down the mining process.
Attribute subset selection⁴ reduces the data set size by removing irrelevant or redundant attributes (or dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
“How can we find a ‘good’ subset of the original attributes?” For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n and the number of data classes increase. Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection. These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time. Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution. Such greedy methods are effective in practice and may come close to estimating an optimal solution.
The “best” (and “worst”) attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another. Many other attribute evaluation measures can be used, such as the information gain measure used in building decision trees for classification.⁵
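As a rough sketch of how such a measure can be computed in Python (the helper names entropy and information_gain and the toy data are illustrative; the measure itself is defined in Chapter 8), the information gain of a categorical attribute A is H(D) minus the expected entropy after partitioning D by the values of A:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy H(D) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(values, labels):
        """Gain(A) = H(D) - sum over values v of (|Dv|/|D|) * H(Dv),
        where Dv holds the tuples for which attribute A equals v."""
        n = len(labels)
        partitions = {}
        for v, y in zip(values, labels):
            partitions.setdefault(v, []).append(y)
        expected = sum(len(part) / n * entropy(part)
                       for part in partitions.values())
        return entropy(labels) - expected

    # Toy example: an attribute that perfectly separates the classes
    # has gain equal to the full entropy H(D) = 1.0.
    age = ["youth", "youth", "senior", "senior", "youth", "senior"]
    buys = ["no", "no", "yes", "yes", "no", "yes"]
    print(information_gain(age, buys))  # 1.0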
Basic heuristic methods of attribute subset selection include stepwise forward selection, stepwise backward elimination, and decision tree induction, which are illustrated in Figure 3.6.
Forward selection:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    Initial reduced set: {}
    => {A1}
    => {A1, A4}
    => Reduced attribute set: {A1, A4, A6}

Backward elimination:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    => {A1, A3, A4, A5, A6}
    => {A1, A4, A5, A6}
    => Reduced attribute set: {A1, A4, A6}

Decision tree induction:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    => Induced tree: A4? at the root; its Y branch tests A1? (Y: Class 1,
       N: Class 2) and its N branch tests A6? (Y: Class 1, N: Class 2)
    => Reduced attribute set: {A1, A4, A6}

Figure 3.6 Greedy (heuristic) methods for attribute subset selection.
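As a minimal sketch of the forward-selection strategy shown above (the names forward_select and score_subset are illustrative; score_subset stands for any attribute evaluation measure, such as information gain or a significance test):

    def forward_select(attributes, score_subset, max_attrs=None):
        """Greedy stepwise forward selection: start from the empty set
        and, at each step, add the one attribute that most improves the
        subset score, e.g. {} => {A1} => {A1, A4} => {A1, A4, A6}."""
        selected = []
        remaining = list(attributes)
        best_score = score_subset(selected)
        while remaining and (max_attrs is None or len(selected) < max_attrs):
            # Evaluate each remaining attribute added to the current set.
            candidate, candidate_score = None, best_score
            for a in remaining:
                s = score_subset(selected + [a])
                if s > candidate_score:
                    candidate, candidate_score = a, s
            if candidate is None:  # no attribute improves the score: stop
                break
            selected.append(candidate)
            remaining.remove(candidate)
            best_score = candidate_score
        return selected

Backward elimination is the mirror image: start from the full attribute set and repeatedly remove the attribute whose elimination most improves the score.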
⁴ In machine learning, attribute subset selection is known as feature subset selection.
⁵ The information gain measure is described in detail in Chapter 8.