Data Mining. Concepts and Techniques, 3rd Edition



(d) Term-frequency vectors

2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):

(a) Compute the Euclidean distance between the two objects.

(b) Compute the Manhattan distance between the two objects.

(d) Compute the supremum distance between the two objects.
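As a rough check on the computations requested in Exercise 2.6, the three distances listed above can be sketched in a few lines of Python. The function names are illustrative choices, not names from the text.

```python
import math

# The two objects from Exercise 2.6.
x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

def euclidean(a, b):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute attribute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def supremum(a, b):
    """Largest absolute attribute difference."""
    return max(abs(p - q) for p, q in zip(a, b))

print(euclidean(x, y))  # sqrt(45), roughly 6.708
print(manhattan(x, y))  # 11
print(supremum(x, y))   # 6
```

The attribute differences are (2, 1, 6, 2), so the third attribute dominates all three measures.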

2.7 The median is one of the most important holistic measures in data analysis. Propose several methods for median approximation. Analyze their respective complexity under different parameter settings and decide to what extent the real value can be approximated. Moreover, suggest a heuristic strategy to balance between accuracy and complexity and then apply it to all methods you have given.
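One possible starting point for Exercise 2.7, offered as a sketch and not as the intended answer, is to group the values into equal-width intervals and interpolate linearly inside the interval that contains the median. The function name and the `num_bins` parameter are assumptions made for illustration. The number of bins is the knob that trades accuracy for cost: one pass over N values with B counters takes O(N + B) time and O(B) memory.

```python
import random
import statistics

def approx_median_binned(values, num_bins):
    """Approximate the median from equal-width bin counts by interpolation."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return float(lo)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    half = len(values) / 2
    below = 0  # number of values in the bins before the current one
    for i, c in enumerate(counts):
        if c > 0 and below + c >= half:
            # Median bin found: interpolate between its boundaries.
            return lo + i * width + (half - below) / c * width
        below += c
    return float(hi)  # not reached for non-empty input

# Synthetic data for illustration only (seeded so runs are repeatable).
random.seed(0)
data = [random.gauss(50, 10) for _ in range(10001)]
exact = statistics.median(data)
for bins in (10, 100, 1000):
    estimate = approx_median_binned(data, bins)
    print(bins, abs(estimate - exact))
```

For an odd number of values the estimate and the true median fall in the same bin, so the error is at most one bin width. Sampling-based approximations are another option that the exercise leaves open.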

2.8 It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure. Results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation.

Suppose we have the following 2-D data set:

        A1     A2
x1      1.5    1.7
x2      2      1.9
x3      1.6    1.8
x4      1.2    1.5
x5      1.5    1.0

(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.

(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.
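The rankings asked for in Exercise 2.8 can be explored with a short Python sketch. Two assumptions are made here: the first coordinate of x2 is taken to be 2, and the query point is normalized along with the data in part (b). Helper names such as `rank` and `unit` are hypothetical.

```python
import math

# Data set from Exercise 2.8 (x2's first coordinate is assumed to be 2).
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def supremum(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def unit(v):
    """Scale a vector so that its Euclidean norm equals 1."""
    norm = math.hypot(*v)
    return tuple(c / norm for c in v)

def rank(measure, data, q, descending=False):
    """Order point names by the measure; rounding keeps exact ties stable."""
    return sorted(data, key=lambda n: round(measure(data[n], q), 9),
                  reverse=descending)

# (a) Distances rank ascending; cosine similarity ranks descending.
print("Euclidean:", rank(euclidean, points, query))
print("Manhattan:", rank(manhattan, points, query))
# x3 and x4 tie at 0.2, and x2 and x5 tie at 0.6, under supremum distance.
print("Supremum: ", rank(supremum, points, query))
print("Cosine:   ", rank(cosine, points, query, descending=True))

# (b) Normalize every point (and the query) to unit length, then re-rank.
unit_points = {name: unit(p) for name, p in points.items()}
print("Normalized Euclidean:", rank(euclidean, unit_points, unit(query)))
```

For unit vectors the squared Euclidean distance equals 2 − 2·cos, so the ranking in part (b) matches the cosine ranking. This is an instance of two seemingly different measures becoming equivalent after a transformation.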

2.7 Bibliographic Notes

Methods for descriptive data summarization were studied in the statistics literature long before the onset of computers. Good summaries of statistical descriptive data mining methods include Freedman, Pisani, and Purves [FPP07] and Devore [Dev95]. For statistics-based visualization of data using boxplots, quantile plots, quantile–quantile plots, scatter plots, and loess curves, see Cleveland [Cle93].

Pioneering work on data visualization techniques is described in The Visual Display of Quantitative Information [Tuf83], Envisioning Information [Tuf90], and Visual Explanations: Images and Quantities, Evidence and Narrative [Tuf97], all by Tufte, in addition to Graphics and Graphic Information Processing by Bertin [Ber81], Visualizing Data by Cleveland [Cle93], and Information Visualization in Data Mining and Knowledge Discovery edited by Fayyad, Grinstein, and Wierse [FGW01].

Major conferences and symposiums on visualization include ACM Human Factors in Computing Systems (CHI), Visualization, and the International Symposium on Information Visualization. Research on visualization is also published in Transactions on Visualization and Computer Graphics, Journal of Computational and Graphical Statistics, and IEEE Computer Graphics and Applications.

Many graphical user interfaces and visualization tools have been developed and can be found in various data mining products. Several books on data mining (e.g., Data Mining Solutions by Westphal and Blaxton [WB98]) present many good examples and visual snapshots. For a survey of visualization techniques, see “Visual techniques for exploring databases” by Keim [Kei97].

Similarity and distance measures among various variables have been introduced in many textbooks that study cluster analysis, including Hartigan [Har75]; Jain and Dubes [JD88]; Kaufman and Rousseeuw [KR90]; and Arabie, Hubert, and de Soete [AHS96]. Methods for combining attributes of different types into a single dissimilarity matrix were introduced by Kaufman and Rousseeuw [KR90].


3 Data Preprocessing

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. “How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”

There are several data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in data. Data integration merges data from multiple sources into a coherent data store such as a data warehouse. Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering. Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format.
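Two of the transformations mentioned above can be illustrated with a minimal Python sketch: scaling values into the range 0.0 to 1.0 with min-max normalization, and rewriting date entries into one common format. The function names, the sample values, and the list of accepted input date formats are assumptions made for illustration; real data would need its own format list.

```python
from datetime import datetime

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Input formats assumed for this example only.
ASSUMED_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

def to_iso_date(text):
    """Rewrite a date string in YYYY-MM-DD form, trying each assumed format."""
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

# Hypothetical attribute values and date entries.
incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))
print([to_iso_date(d) for d in ("2011-06-01", "06/01/2011", "01.06.2011")])
```

After normalization, an attribute measured in the tens of thousands no longer dominates a distance computation that also involves attributes with small ranges.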

In Chapter 2, we learned about the different attribute types and how to use basic statistical descriptions to study data characteristics. These can help identify erroneous values and outliers, which will be useful in the data cleaning and integration steps. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or reduce the time required for the actual mining.

In this chapter, we introduce the basic concepts of data preprocessing in Section 3.1. The methods for data preprocessing are organized into the following categories: data cleaning (Section 3.2), data integration (Section 3.3), data reduction (Section 3.4), and data transformation (Section 3.5).
