Data Mining. Concepts and Techniques, 3rd Edition

Chapter 3 Data Preprocessing

was proposed in Siedlecki and Sklansky [SS88]. A wrapper approach to attribute selection is described in Kohavi and John [KJ97]. Unsupervised attribute subset selection is described in Dash, Liu, and Yao [DLY97].

For a description of wavelets for dimensionality reduction, see Press, Teukolsky, Vetterling, and Flannery [PTVF07]. A general account of wavelets can be found in Hubbard [Hub96]. For a list of wavelet software packages, see Bruce, Donoho, and Gao [BDG96]. Daubechies transforms are described in Daubechies [Dau92]. The book by Press et al. [PTVF07] includes an introduction to singular value decomposition for principal components analysis. Routines for PCA are included in most statistical software packages such as SAS (www.sas.com/SASHome.html).

An introduction to regression and log-linear models can be found in several textbooks such as James [Jam85]; Dobson [Dob90]; Johnson and Wichern [JW92]; Devore [Dev95]; and Neter, Kutner, Nachtsheim, and Wasserman [NKNW96]. For log-linear models (known as multiplicative models in the computer science literature), see Pearl [Pea88]. For a general introduction to histograms, see Barbará et al. [BDF+97] and Devore and Peck [DP97]. For extensions of single-attribute histograms to multiple attributes, see Muralikrishna and DeWitt [MD88] and Poosala and Ioannidis [PI97]. Several references to clustering algorithms are given in Chapters 10 and 11 of this book, which are devoted to the topic.

A survey of multidimensional indexing structures is given in Gaede and Günther [GG98]. The use of multidimensional index trees for data aggregation is discussed in Aoki [Aok98]. Index trees include R-trees (Guttman [Gut84]), quad-trees (Finkel and Bentley [FB74]), and their variations. For discussion on sampling and data mining, see Kivinen and Mannila [KM94] and John and Langley [JL96].

There are many methods for assessing attribute relevance. Each has its own bias. The information gain measure is biased toward attributes with many values. Many alternatives have been proposed, such as gain ratio (Quinlan [Qui93]), which considers the probability of each attribute value. Other relevance measures include the Gini index (Breiman, Friedman, Olshen, and Stone [BFOS84]), the χ² contingency table statistic, and the uncertainty coefficient (Johnson and Wichern [JW92]). For a comparison of attribute selection measures for decision tree induction, see Buntine and Niblett [BN92]. For additional methods, see Liu and Motoda [LM98a], Dash and Liu [DL97], and Almuallim and Dietterich [AD91].
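
The bias of information gain, and the way gain ratio counteracts it, can be illustrated with a minimal sketch. The toy data and the attribute names `row_id` and `income` are hypothetical and not taken from the book; gain ratio is computed here as information gain divided by the entropy of the attribute's own value distribution (the split information).

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in a list."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def partition(values, labels):
    """Group the class labels by the attribute value of each row."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def info_gain(values, labels):
    """Reduction in class entropy obtained by splitting on the attribute."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the attribute's split information."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: 8 rows with a binary class label.
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
row_id = list(range(8))               # unique per row: many values, no real meaning
income = ["high"] * 5 + ["low"] * 3   # two values, one misclassified row

for name, attr in (("row_id", row_id), ("income", income)):
    print(name, round(info_gain(attr, labels), 3), round(gain_ratio(attr, labels), 3))
```

On this data, information gain ranks the meaningless `row_id` above `income`, whereas gain ratio reverses the order because the eight-valued attribute has a much larger split information.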

Liu et al. [LHTD02] performed a comprehensive survey of data discretization methods. Entropy-based discretization with the C4.5 algorithm is described in Quinlan [Qui93]. In Catlett [Cat91], the D-2 system binarizes a numeric feature recursively. ChiMerge by Kerber [Ker92] and Chi2 by Liu and Setiono [LS95] are methods for the automatic discretization of numeric attributes that both employ the χ² statistic. Fayyad and Irani [FI93] apply the minimum description length principle to determine the number of intervals for numeric discretization. Concept hierarchies and their automatic generation from categorical data are described in Han and Fu [HF94].
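
As a rough illustration of the χ² statistic that ChiMerge-style methods rely on, the sketch below computes the Pearson χ² value for a small contingency table whose rows are two adjacent intervals and whose columns are class counts. The function name `chi_square` and the counts are assumptions made for this example; this is not Kerber's full algorithm, only the statistic used to compare adjacent intervals.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts.

    `table` is a list of rows; here each row is an interval and each
    column is a class.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:  # skip empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical class counts for two pairs of adjacent intervals.
similar = [[4, 1], [8, 2]]    # same class proportions in both intervals
different = [[5, 0], [0, 5]]  # class depends entirely on the interval

print(chi_square(similar), chi_square(different))
```

A low value suggests that the class distribution does not depend on which of the two intervals a value falls in, so the pair is a candidate for merging; a high value suggests the boundary between them is informative.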

Chapter 4 Data Warehousing and Online Analytical Processing

Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and OLAP and will provide an effective platform for data mining. Therefore, data warehousing and OLAP form an essential step in the knowledge discovery process. This chapter presents an overview of data warehouse and OLAP technology. This overview is essential for understanding the overall data mining and knowledge discovery process.

In this chapter, we study a well-accepted definition of the data warehouse and see why more and more organizations are building data warehouses for the analysis of their data (Section 4.1). In particular, we study the data cube, a multidimensional data model for data warehouses and OLAP, as well as OLAP operations such as roll-up, drill-down, slicing, and dicing (Section 4.2). We also look at data warehouse design and usage (Section 4.3). In addition, we discuss multidimensional data mining, a powerful paradigm that integrates data warehouse and OLAP technology with that of data mining. An overview of data warehouse implementation examines general strategies for efficient data cube computation, OLAP data indexing, and OLAP query processing (Section 4.4). Finally, we study data generalization by attribute-oriented induction (Section 4.5). This method uses concept hierarchies to generalize data to multiple levels of abstraction.
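
To make the OLAP operations named above more concrete, here is a minimal sketch over a tiny in-memory fact table. The data, the dimension names, and the helper functions `aggregate` and `select` are hypothetical and chosen only for illustration; a real warehouse would use a dedicated OLAP engine rather than plain Python lists.

```python
from collections import defaultdict

# Hypothetical fact table: (city, country, quarter, item, units_sold)
DIMS = ("city", "country", "quarter", "item")
facts = [
    ("Vancouver", "Canada", "Q1", "phone", 10),
    ("Vancouver", "Canada", "Q2", "phone", 14),
    ("Toronto", "Canada", "Q1", "laptop", 7),
    ("Chicago", "USA", "Q1", "phone", 5),
    ("Chicago", "USA", "Q2", "laptop", 9),
]

def aggregate(rows, group_by):
    """Sum the measure over every dimension not listed in group_by."""
    idx = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def select(rows, **allowed):
    """Keep rows whose dimension values lie in the given sets of allowed values."""
    return [r for r in rows
            if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())]

by_city = aggregate(facts, ["city", "quarter"])        # finer granularity
by_country = aggregate(facts, ["country", "quarter"])  # roll-up: city -> country
q1_slice = select(facts, quarter={"Q1"})               # slice: fix one dimension
diced = select(facts, quarter={"Q1", "Q2"}, item={"phone"})  # dice: restrict several

print(by_country)
```

Going from `by_country` back to `by_city` corresponds to a drill-down: the same measure viewed at a lower level of the location hierarchy.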

4.1 Data Warehouse: Basic Concepts

This section gives an introduction to data warehouses. We begin with a definition of the data warehouse (Section 4.1.1). We outline the differences between operational database
