Data Mining. Concepts and Techniques, 3rd Edition

HAN 10-ch03-083-124-9780123814791

Yüklə 7,95 Mb.

Pdf görüntüsü

səhifə	58/343
tarix	08.10.2017
ölçüsü	7,95 Mb.
	#3817

1 ... 54 55 56 57 58 59 60 61 ... 343

Principal Components Analysis
Principal components analysis (PCA

HAN

10-ch03-083-124-9780123814791

2011/6/1

3:16

Page 101

#19

3.4 Data Reduction

101

cleaning as well. Given a set of coefﬁcients, an approximation of the original data can be

constructed by applying the inverse of the DWT used.

The DWT is closely related to the discrete Fourier transform (DFT), a signal process-

ing technique involving sines and cosines. In general, however, the DWT achieves better

lossy compression. That is, if the same number of coefﬁcients is retained for a DWT and

a DFT of a given data vector, the DWT version will provide a more accurate approxima-

tion of the original data. Hence, for an equivalent approximation, the DWT requires less

space than the DFT. Unlike the DFT, wavelets are quite localized in space, contributing

to the conservation of local detail.

There is only one DFT, yet there are several families of DWTs. Figure 3.4 shows

some wavelet families. Popular wavelet transforms include the Haar-2, Daubechies-4,

and Daubechies-6. The general procedure for applying a discrete wavelet transform uses

a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast

computational speed. The method is as follows:

1.

The length, L, of the input data vector must be an integer power of 2. This condition

can be met by padding the data vector with zeros as necessary (L ≥ n).

2.

Each transform involves applying two functions. The ﬁrst applies some data smooth-

ing, such as a sum or weighted average. The second performs a weighted difference,

which acts to bring out the detailed features of the data.

The two functions are applied to pairs of data points in X, that is, to all pairs of

measurements

, x

2i+1

). This results in two data sets of length L/2. In general,

these represent a smoothed or low-frequency version of the input data and the high-

frequency content of it, respectively.

The two functions are recursively applied to the data sets obtained in the previous

loop, until the resulting data sets obtained are of length 2.

5.

Selected values from the data sets obtained in the previous iterations are designated

the wavelet coefﬁcients of the transformed data.

0.8

0.6

0.4

0.2

0.0

Ϫ1.0 Ϫ0.5 0.0 0.5

(a) Haar-2

(b) Daubechies-4

1.0 1.5 2.0

0.6

0.4

0.2

0.0

Figure 3.4

Examples of wavelet families. The number next to a wavelet name is the number of vanishing

moments of the wavelet. This is a set of mathematical relationships that the coefﬁcients must

satisfy and is related to the number of coefﬁcients.

HAN

10-ch03-083-124-9780123814791

2011/6/1

3:16

Page 102

#20

102

Chapter 3 Data Preprocessing

Equivalently, a matrix multiplication can be applied to the input data in order to

obtain the wavelet coefﬁcients, where the matrix used depends on the given DWT. The

matrix must be orthonormal, meaning that the columns are unit vectors and are mutu-

ally orthogonal, so that the matrix inverse is just its transpose. Although we do not have

room to discuss it here, this property allows the reconstruction of the data from the

smooth and smooth-difference data sets. By factoring the matrix used into a product of

a few sparse matrices, the resulting “fast DWT” algorithm has a complexity of O

(n) for

an input vector of length n.

Wavelet transforms can be applied to multidimensional data such as a data cube. This

is done by ﬁrst applying the transform to the ﬁrst dimension, then to the second, and so

on. The computational complexity involved is linear with respect to the number of cells

in the cube. Wavelet transforms give good results on sparse or skewed data and on data

with ordered attributes. Lossy compression by wavelets is reportedly better than JPEG

compression, the current commercial standard. Wavelet transforms have many real-

world applications, including the compression of ﬁngerprint images, computer vision,

analysis of time-series data, and data cleaning.

3.4.3

Principal Components Analysis

In this subsection we provide an intuitive introduction to principal components analy-

sis as a method of dimesionality reduction. A detailed theoretical explanation is beyond

the scope of this book. For additional references, please see the bibliographic notes

(Section 3.8) at the end of this chapter.

Suppose that the data to be reduced consist of tuples or data vectors described

by n attributes or dimensions. Principal components analysis (PCA; also called the

Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that

can best be used to represent the data, where k ≤ n. The original data are thus projected

onto a much smaller space, resulting in dimensionality reduction. Unlike attribute sub-

set selection (Section 3.4.4), which reduces the attribute set size by retaining a subset of

the initial set of attributes, PCA “combines” the essence of attributes by creating an alter-

native, smaller set of variables. The initial data can then be projected onto this smaller

set. PCA often reveals relationships that were not previously suspected and thereby

allows interpretations that would not ordinarily result.

The basic procedure is as follows:

The input data are normalized, so that each attribute falls within the same range. This

step helps ensure that attributes with large domains will not dominate attributes with

smaller domains.

PCA computes k orthonormal vectors that provide a basis for the normalized input

data. These are unit vectors that each point in a direction perpendicular to the others.

These vectors are referred to as the principal components. The input data are a linear

combination of the principal components.

3.

The principal components are sorted in order of decreasing “signiﬁcance” or

strength. The principal components essentially serve as a new set of axes for the data,

Yüklə 7,95 Mb.

Dostları ilə paylaş:

1 ... 54 55 56 57 58 59 60 61 ... 343