Chapter 3 Data Preprocessing
was proposed in Siedlecki and Sklansky [SS88]. A wrapper approach to attribute selec-
tion is described in Kohavi and John [KJ97]. Unsupervised attribute subset selection is
described in Dash, Liu, and Yao [DLY97].
For a description of wavelets for dimensionality reduction, see Press, Teukolsky,
Vetterling, and Flannery [PTVF07]. A general account of wavelets can be found in Hubbard
[Hub96]. For a list of wavelet software packages, see Bruce, Donoho, and Gao [BDG96].
Daubechies transforms are described in Daubechies [Dau92]. The book by Press et al.
[PTVF07] includes an introduction to singular value decomposition for principal com-
ponents analysis. Routines for PCA are included in most statistical software packages
such as SAS (www.sas.com/SASHome.html).
An introduction to regression and log-linear models can be found in several
textbooks such as James [Jam85]; Dobson [Dob90]; Johnson and Wichern [JW92];
Devore [Dev95]; and Neter, Kutner, Nachtsheim, and Wasserman [NKNW96]. For log-
linear models (known as multiplicative models in the computer science literature), see
Pearl [Pea88]. For a general introduction to histograms, see Barbará et al. [BDF+97]
and Devore and Peck [DP97]. For extensions of single-attribute histograms to multiple
attributes, see Muralikrishna and DeWitt [MD88] and Poosala and Ioannidis [PI97].
Several references to clustering algorithms are given in Chapters 10 and 11 of this book,
which are devoted to the topic.
A survey of multidimensional indexing structures is given in Gaede and Günther
[GG98]. The use of multidimensional index trees for data aggregation is discussed in
Aoki [Aok98]. Index trees include R-trees (Guttman [Gut84]), quad-trees (Finkel and
Bentley [FB74]), and their variations. For discussion on sampling and data mining, see
Kivinen and Mannila [KM94] and John and Langley [JL96].
There are many methods for assessing attribute relevance. Each has its own bias. The
information gain measure is biased toward attributes with many values. Many alterna-
tives have been proposed, such as gain ratio (Quinlan [Qui93]), which considers the
probability of each attribute value. Other relevance measures include the Gini index
(Breiman, Friedman, Olshen, and Stone [BFOS84]), the χ² contingency table statistic,
and the uncertainty coefficient (Johnson and Wichern [JW92]). For a comparison
of attribute selection measures for decision tree induction, see Buntine and Niblett
[BN92]. For additional methods, see Liu and Motoda [LM98a], Dash and Liu [DL97],
and Almuallim and Dietterich [AD91].
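The bias of information gain toward many-valued attributes, and the way gain ratio counteracts it, can be illustrated with a minimal Python sketch. The toy data set, the attribute names (record_id, outlook), and the function names below are hypothetical and chosen only for illustration; the gain ratio is computed as information gain divided by the entropy of the attribute-value distribution, which is one common formulation.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def partition(values, labels):
    """Group class labels by the attribute value they occur with."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return groups

def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on the attribute."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g)
                    for g in partition(values, labels).values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the attribute values."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: eight records, two candidate attributes.
labels = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no"]
record_id = list(range(8))                 # a distinct value for every record
outlook = ["sunny"] * 4 + ["rain"] * 4     # two values, imperfect predictor

for name, attr in [("record_id", record_id), ("outlook", outlook)]:
    print(name, round(information_gain(attr, labels), 3),
          round(gain_ratio(attr, labels), 3))
```

On this toy data, information gain ranks the uninformative identifier attribute first because every partition it produces is trivially pure, whereas gain ratio ranks outlook first.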
Liu et al. [LHTD02] performed a comprehensive survey of data discretization
methods. Entropy-based discretization with the C4.5 algorithm is described in Quin-
lan [Qui93]. In Catlett [Cat91], the D-2 system binarizes a numeric feature recursively.
ChiMerge by Kerber [Ker92] and Chi2 by Liu and Setiono [LS95] are methods for the
automatic discretization of numeric attributes that both employ the χ² statistic. Fayyad
and Irani [FI93] apply the minimum description length principle to determine the num-
ber of intervals for numeric discretization. Concept hierarchies and their automatic
generation from categorical data are described in Han and Fu [HF94].
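As a rough illustration of the χ² statistic that such discretization methods rely on, the following Python sketch computes the Pearson χ² value for the class counts of two adjacent intervals. It is not the full ChiMerge or Chi2 procedure; the function name and the example counts are hypothetical. A small value suggests that the two intervals have similar class distributions and are candidates for merging.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    table is a list of rows; each row holds the class counts of one interval.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:  # skip empty classes to avoid division by zero
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical class counts (class A, class B) for pairs of adjacent intervals.
similar = [[8, 2], [7, 3]]     # nearly identical class distributions
different = [[9, 1], [1, 9]]   # clearly different class distributions

print(chi_square(similar), chi_square(different))
```

The pair with similar class distributions yields the smaller statistic, so a bottom-up method would consider merging it before the other pair.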
4 Data Warehousing and Online Analytical Processing
Data warehouses generalize and consolidate data in multidimensional space. The construction
of data warehouses involves data cleaning, data integration, and data transformation,
and can be viewed as an important preprocessing step for data mining. Moreover, data
warehouses provide online analytical processing (OLAP) tools for the interactive analysis
of multidimensional data of varied granularities, which facilitates effective data gene-
ralization and data mining. Many other data mining functions, such as association,
classification, prediction, and clustering, can be integrated with OLAP operations to
enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the
data warehouse has become an increasingly important platform for data analysis and
OLAP, and it provides an effective platform for data mining. Data warehousing
and OLAP therefore form an essential step in the knowledge discovery process. This chapter
presents an overview of data warehouse and OLAP technology, which is essential
for understanding the overall data mining and knowledge discovery process.
In this chapter, we study a well-accepted definition of the data warehouse and see
why more and more organizations are building data warehouses for the analysis of
their data (Section 4.1). In particular, we study the data cube, a multidimensional data
model for data warehouses and OLAP, as well as OLAP operations such as roll-up, drill-
down, slicing, and dicing (Section 4.2). We also look at data warehouse design and
usage (Section 4.3). In addition, we discuss multidimensional data mining, a power-
ful paradigm that integrates data warehouse and OLAP technology with that of data
mining. An overview of data warehouse implementation examines general strategies
for efficient data cube computation, OLAP data indexing, and OLAP query process-
ing (Section 4.4). Finally, we study data generalization by attribute-oriented induction
(Section 4.5). This method uses concept hierarchies to generalize data to multiple levels
of abstraction.
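To give a first feel for the OLAP operations named above before they are treated in Section 4.2, here is a minimal Python sketch on a tiny fact table. The data, the city-to-country hierarchy, and the helper name aggregate are all hypothetical; a real OLAP system would operate on a data cube, not on a list of tuples.

```python
from collections import defaultdict

# Hypothetical fact table: (city, quarter, item, units_sold).
facts = [
    ("Vancouver", "Q1", "phone", 10),
    ("Vancouver", "Q2", "phone", 12),
    ("Toronto", "Q1", "phone", 7),
    ("Toronto", "Q1", "laptop", 3),
    ("Chicago", "Q1", "phone", 9),
    ("Chicago", "Q2", "laptop", 4),
]

# Hypothetical concept hierarchy for the location dimension.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def aggregate(rows, key):
    """Sum units_sold over rows, grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Roll-up: climb the location hierarchy from city to country.
by_country = aggregate(facts, lambda r: CITY_TO_COUNTRY[r[0]])

# Drill-down: step back down to the more detailed city level.
by_city = aggregate(facts, lambda r: r[0])

# Slice: fix a single value on one dimension (quarter = Q1).
q1_by_city = aggregate([r for r in facts if r[1] == "Q1"], lambda r: r[0])

# Dice: select on two or more dimensions (quarter in {Q1, Q2}, item = phone).
phone_by_city = aggregate(
    [r for r in facts if r[1] in {"Q1", "Q2"} and r[2] == "phone"],
    lambda r: r[0],
)

print(by_country, by_city, q1_by_city, phone_by_city)
```

Each operation either changes the level of aggregation along a dimension (roll-up, drill-down) or restricts the cells that are considered (slice, dice).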
4.1 Data Warehouse: Basic Concepts
This section gives an introduction to data warehouses. We begin with a definition of the
data warehouse (Section 4.1.1). We outline the differences between operational database
© 2012 Elsevier Inc. All rights reserved.
Data Mining: Concepts and Techniques