HAN
09-ch02-039-082-9780123814791
2011/6/1
3:15
Page 81
#43
(c) Numeric attributes
(d) Term-frequency vectors
2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
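The four distances in Exercise 2.6 are all members of the Minkowski family, so a single helper covers parts (a) through (c). The following is a minimal sketch for checking hand calculations; the function names `minkowski` and `supremum` are illustrative choices, not notation from the text.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length tuples."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

print("Euclidean (q = 2):", round(minkowski(x, y, 2), 3))  # sqrt(45), about 6.708
print("Manhattan (q = 1):", minkowski(x, y, 1))            # 2 + 1 + 6 + 2 = 11
print("Minkowski (q = 3):", round(minkowski(x, y, 3), 3))  # cube root of 233, about 6.153
print("Supremum:", supremum(x, y))                         # max(2, 1, 6, 2) = 6
```

The attribute-wise differences are 2, 1, 6, and 2, so each result can be verified by hand from those four numbers.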
2.7 The median is one of the most important holistic measures in data analysis.
Propose several methods for median approximation. Analyze their respective complexity
under different parameter settings and decide to what extent the real value can be
approximated. Moreover, suggest a heuristic strategy to balance accuracy against
complexity and then apply it to all methods you have given.
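As a starting point for Exercise 2.7, one possible method (among several the exercise asks for) is to group the values into equal-width bins and interpolate linearly inside the bin that contains the middle value. The sketch below assumes numeric data held in memory; the names `approx_median` and `num_bins` are illustrative, and the test data are made up.

```python
import statistics


def approx_median(values, num_bins):
    """Approximate the median from an equal-width histogram.

    Takes O(n + num_bins) time and O(num_bins) extra memory. The estimate
    lies inside the bin that holds the middle value, so a larger num_bins
    gives a finer estimate at the cost of more memory.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return float(lo)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    half = len(values) / 2
    cumulative = 0
    for i, count in enumerate(counts):
        if cumulative + count >= half:
            # Linear interpolation within the median bin.
            return lo + i * width + (half - cumulative) / count * width
        cumulative += count
    return float(hi)


data = list(range(1, 1001))  # made-up data for illustration
for bins in (5, 50, 500):
    estimate = approx_median(data, bins)
    print(bins, estimate, abs(estimate - statistics.median(data)))
```

Here `num_bins` is the parameter that trades accuracy against cost, which is the kind of balance the exercise asks you to analyze.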
2.8 It is important to define or select similarity measures in data analysis. However, there
is no commonly accepted subjective similarity measure. Results can vary depending on
the similarity measures used. Nonetheless, seemingly different similarity measures may
be equivalent after some transformation.
Suppose we have the following 2-D data set:
        A1     A2
x1      1.5    1.7
x2      2      1.9
x3      1.6    1.8
x4      1.2    1.5
x5      1.5    1.0
(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6), as a
query, rank the database points based on similarity with the query using Euclidean
distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
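The rankings asked for in Exercise 2.8 can be checked with a short script. This is a sketch under assumptions: the helper names are illustrative, and in part (b) the query is normalized to unit length along with the data points, which the exercise does not state explicitly.

```python
import math

points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def norm(a):
    return math.sqrt(sum(u * u for u in a))


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def supremum(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


def cosine_similarity(a, b):
    return sum(u * v for u, v in zip(a, b)) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit Euclidean norm."""
    n = norm(a)
    return tuple(u / n for u in a)


def rank(score, reverse=False):
    """Order point names by score against the query (most similar first)."""
    return sorted(points, key=lambda name: score(points[name], query),
                  reverse=reverse)


print("Euclidean:", rank(euclidean))
print("Manhattan:", rank(manhattan))
print("Supremum: ", rank(supremum))
print("Cosine:   ", rank(cosine_similarity, reverse=True))  # larger is more similar

# Part (b): Euclidean distance after normalizing every vector to unit length.
unit_query = normalize(query)
print("Normalized Euclidean:",
      sorted(points, key=lambda name: euclidean(normalize(points[name]), unit_query)))
```

Two observations are worth checking against the output. Under supremum distance, x3 and x4 are tied at 0.2 and x2 and x5 are tied at 0.6, so their relative order in the printed list depends on floating-point rounding and should not be read as meaningful. For unit-length vectors the squared Euclidean distance equals 2 - 2 cos(a, b), so the part (b) ranking coincides with the cosine ranking, which illustrates the remark that seemingly different measures may be equivalent after a transformation.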
2.7 Bibliographic Notes
Methods for descriptive data summarization were studied in the statistics literature
long before the advent of computers. Good summaries of statistical descriptive data
mining methods include Freedman, Pisani, and Purves [FPP07] and Devore [Dev95]. For
statistics-based visualization of data using boxplots, quantile plots, quantile–quantile
plots, scatter plots, and loess curves, see Cleveland [Cle93].
Pioneering work on data visualization techniques is described in The Visual
Display of Quantitative Information [Tuf83], Envisioning Information [Tuf90], and Visual
Explanations: Images and Quantities, Evidence and Narrative [Tuf97], all by Tufte, in
addition to Graphics and Graphic Information Processing by Bertin [Ber81], Visualizing
Data by Cleveland [Cle93], and Information Visualization in Data Mining and Knowledge
Discovery edited by Fayyad, Grinstein, and Wierse [FGW01].
Major conferences and symposiums on visualization include ACM Human Factors
in Computing Systems (CHI), Visualization, and the International Symposium on
Information Visualization. Research on visualization is also published in Transactions on
Visualization and Computer Graphics, Journal of Computational and Graphical Statistics,
and IEEE Computer Graphics and Applications.
Many graphical user interfaces and visualization tools have been developed and can
be found in various data mining products. Several books on data mining (e.g., Data
Mining Solutions by Westphal and Blaxton [WB98]) present many good examples and
visual snapshots. For a survey of visualization techniques, see “Visual techniques for
exploring databases” by Keim [Kei97].
Similarity and distance measures among various variables have been introduced in
many textbooks that study cluster analysis, including Hartigan [Har75]; Jain and Dubes
[JD88]; Kaufman and Rousseeuw [KR90]; and Arabie, Hubert, and de Soete [AHS96].
Methods for combining attributes of different types into a single dissimilarity matrix
were introduced by Kaufman and Rousseeuw [KR90].
3 Data Preprocessing
Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent
data due to their typically huge size (often several gigabytes or more) and their likely
origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining
results. “How can the data be preprocessed in order to help improve the quality of the data
and, consequently, of the mining results? How can the data be preprocessed so as to improve
the efficiency and ease of the mining process?”
There are several data preprocessing techniques. Data cleaning can be applied to
remove noise and correct inconsistencies in data. Data integration merges data from
multiple sources into a coherent data store such as a data warehouse. Data reduction
can reduce data size by, for instance, aggregating, eliminating redundant features, or
clustering. Data transformations (e.g., normalization) may be applied, where data are
scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and
efficiency of mining algorithms involving distance measurements. These techniques are
not mutually exclusive; they may work together. For example, data cleaning can involve
transformations to correct wrong data, such as by transforming all entries for a date field
to a common format.
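As an illustration of the normalization mentioned above, the following minimal sketch rescales a list of values linearly into the range 0.0 to 1.0 (min-max scaling). The function name `min_max_normalize`, its parameters, and the sample income values are assumptions made for this example, not definitions from the text.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


incomes = [12000, 73600, 98000]  # hypothetical attribute values
print(min_max_normalize(incomes))
```

After scaling, attributes measured in very different units contribute on a comparable footing to a distance computation, which is why such a transformation can help distance-based mining algorithms.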
In Chapter 2, we learned about the different attribute types and how to use basic
statistical descriptions to study data characteristics. These can help identify erroneous
values and outliers, which will be useful in the data cleaning and integration steps.
Data preprocessing techniques, when applied before mining, can substantially improve the
overall quality of the patterns mined and reduce the time required for the actual mining.
In this chapter, we introduce the basic concepts of data preprocessing in Section 3.1.
The methods for data preprocessing are organized into the following categories: data
cleaning (Section 3.2), data integration (Section 3.3), data reduction (Section 3.4), and
data transformation (Section 3.5).
© 2012 Elsevier Inc. All rights reserved.
Data Mining: Concepts and Techniques