Data Mining. Concepts and Techniques, 3rd Edition

HAN 08-ch01-001-038-9780123814791



HAN

08-ch01-001-038-9780123814791

2011/6/1

3:12

Page 23

#23


Methods to assess pattern interestingness, and their use to improve data mining efficiency, are discussed throughout the book with respect to each kind of pattern that can be mined.

1.5 Which Technologies Are Used?

As a highly application-driven domain, data mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high-performance computing, and many application domains (Figure 1.11). The interdisciplinary nature of data mining research and development contributes significantly to the success of data mining and its extensive applications. In this section, we give examples of several disciplines that strongly influence the development of data mining methods.

1.5.1 Statistics

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics.

Figure 1.11 Data mining adopts techniques from many domains. (Diagram: Data Mining at the center, surrounded by statistics, machine learning, pattern recognition, visualization, algorithms, high-performance computing, applications, information retrieval, data warehouse, and database systems.)

A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions. Statistical models are widely used to model data and data classes. For example, in data mining tasks like data characterization and classification, statistical models of target classes can be built. In other words, such statistical models can be the outcome of a data mining task. Alternatively, data mining tasks can be built on top of statistical models. For example, we can use statistics to model noise and missing data values. Then, when mining patterns in a large data set, the data mining process can use the model to help identify and handle noisy or missing values in the data.
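As a rough illustration of building a data mining step on top of a statistical model, the sketch below fits a simple location/scale model to the observed values and then uses it to replace missing and noisy entries. The function names, the threshold of 3, and the use of the median and median absolute deviation (MAD) are assumptions made for this example only; they are one of many possible noise models, not a method prescribed by the text.

```python
import statistics

def fit_noise_model(values):
    """Estimate a robust center and spread from the non-missing values.

    Assumption: values are roughly normal apart from a few noisy entries,
    so the scaled MAD (factor 1.4826) approximates a standard deviation.
    """
    observed = [v for v in values if v is not None]
    center = statistics.median(observed)
    mad = statistics.median([abs(v - center) for v in observed])
    return center, 1.4826 * mad

def clean(values, threshold=3.0):
    """Replace missing (None) and noisy values with the model's center."""
    center, scale = fit_noise_model(values)
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(center)      # impute a missing value
        elif scale > 0 and abs(v - center) / scale > threshold:
            cleaned.append(center)      # treat an extreme value as noise
        else:
            cleaned.append(v)           # keep a plausible value as is
    return cleaned

# Hypothetical sensor readings with one missing and one noisy entry.
readings = [10.1, 9.8, None, 10.3, 55.0, 9.9, 10.0]
print(clean(readings))
```

Here the missing entry and the reading of 55.0 are both replaced by the estimated center, while the remaining readings pass through unchanged. Replacing noise with the center is only one handling strategy; dropping or flagging such values would be equally reasonable.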

Statistics research develops tools for prediction and forecasting using data and statistical models. Statistical methods can be used to summarize or describe a collection of data. Basic statistical descriptions of data are introduced in Chapter 2. Statistics is useful for mining various patterns from data as well as for understanding the underlying mechanisms generating and affecting the patterns. Inferential statistics (or predictive statistics) models data in a way that accounts for randomness and uncertainty in the observations and is used to draw inferences about the process or population under investigation.
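The difference between describing a data collection and drawing an inference from it can be sketched in a few lines. The example below first summarizes a sample and then computes an approximate confidence interval for the population mean. The function names and the sample are hypothetical, and the interval relies on a normal approximation, which is an assumption; for small samples a t-distribution would usually be more appropriate.

```python
import math
import statistics

def describe(sample):
    """Descriptive statistics: summarize the sample itself."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),   # sample standard deviation
    }

def mean_confidence_interval(sample, confidence=0.95):
    """Inferential statistics: an approximate interval for the population mean.

    Assumption: the sample mean is approximately normally distributed.
    """
    summary = describe(sample)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * summary["stdev"] / math.sqrt(summary["n"])
    return summary["mean"] - half_width, summary["mean"] + half_width

# Hypothetical sample, e.g., items per customer transaction.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
print(describe(sample))
print(mean_confidence_interval(sample))
```

The first call only restates what is in the sample; the second makes a hedged claim about the process that generated it, which is the step that accounts for randomness and uncertainty.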

Statistical methods can also be used to verify data mining results. For example, after a classification or prediction model is mined, the model should be verified by statistical hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data analysis) makes statistical decisions using experimental data. A result is called statistically significant if it is unlikely to have occurred by chance. If the classification or prediction model holds true, then the descriptive statistics of the model increase the soundness of the model.
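One simple way to check whether a mined classifier does better than chance is an exact one-sided binomial test on its held-out accuracy. The sketch below is an illustration under stated assumptions: the function name, the counts, and the chance level of 0.5 (a balanced two-class problem) are all hypothetical, and test instances are assumed to be independent.

```python
import math

def binomial_p_value(correct, total, chance=0.5):
    """Probability of at least `correct` successes out of `total` trials
    if the classifier were only guessing with success rate `chance`."""
    return sum(
        math.comb(total, i) * chance**i * (1 - chance) ** (total - i)
        for i in range(correct, total + 1)
    )

# Hypothetical result: 70 of 100 test instances classified correctly.
p = binomial_p_value(70, 100)
print("p-value:", p)
print("significant at the 0.05 level:", p < 0.05)
```

A small p-value indicates that accuracy this high would be unlikely under pure guessing, so the result is statistically significant at the chosen level. It does not by itself show that the model is useful in practice.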

Applying statistical methods in data mining is far from trivial. Often, a serious challenge is how to scale up a statistical method over a large data set. Many statistical methods have high computational complexity. When such methods are applied to large data sets that are also distributed across multiple logical or physical sites, algorithms should be carefully designed and tuned to reduce the computational cost. This challenge becomes even tougher for online applications, such as online query suggestions in search engines, where data mining is required to continuously handle fast, real-time data streams.
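To make the scaling issue concrete, consider computing a mean and variance. A naive implementation needs all values in one place, whereas a one-pass formulation handles a stream in constant memory and lets partial results from different sites be combined. The sketch below uses Welford's online update together with the standard pairwise merge formula; the class name `RunningStats` and the two-site scenario are assumptions made for illustration.

```python
class RunningStats:
    """Running count, mean, and variance in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the current mean

    def update(self, x):
        """Incorporate one value from a stream."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine summaries computed independently, e.g., at two sites."""
        merged = RunningStats()
        merged.n = self.n + other.n
        if merged.n == 0:
            return merged
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.n / merged.n
        merged.m2 = (self.m2 + other.m2
                     + delta * delta * self.n * other.n / merged.n)
        return merged

    @property
    def variance(self):
        """Sample variance; 0.0 when fewer than two values were seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical data held at two different sites.
site_a, site_b = RunningStats(), RunningStats()
for x in [1, 2, 3, 4]:
    site_a.update(x)
for x in [5, 6, 7, 8, 9, 10]:
    site_b.update(x)

combined = site_a.merge(site_b)
print(combined.n, combined.mean, combined.variance)
```

Only three numbers per site need to be exchanged, regardless of how many values each site holds. Many statistical methods do not decompose this neatly, which is why careful algorithm design is needed.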

1.5.2 Machine Learning

Machine learning investigates how computers can learn (or improve their performance) based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data. For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples.

Machine learning is a fast-growing discipline. Here, we illustrate classic problems in machine learning that are highly related to data mining.

Supervised learning is basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.
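The role of labeled examples can be shown with a toy version of the postal code problem. The sketch below is a minimal nearest-neighbor classifier over tiny 3-by-5 binary "images" of digits. The bitmaps, the function names, and the choice of Hamming distance are all assumptions made for illustration; real handwriting recognition uses far richer data and models.

```python
def hamming(a, b):
    """Number of pixel positions at which two equal-length bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

def train(examples):
    """For a nearest-neighbor model, training just stores the labeled examples."""
    return list(examples)

def classify(model, image):
    """Predict the label of the stored example closest to `image`."""
    _, label = min(model, key=lambda example: hamming(example[0], image))
    return label

# Hypothetical training set: (flattened 3x5 bitmap, label) pairs.
training_examples = [
    ("111" "101" "101" "101" "111", "0"),
    ("010" "110" "010" "010" "111", "1"),
    ("111" "001" "010" "010" "010", "7"),
]
model = train(training_examples)

# An unseen, slightly distorted "1" (one pixel differs from the stored example).
unseen = "010" "110" "010" "010" "110"
print(classify(model, unseen))  # prints 1
```

The labels attached to the training bitmaps are the supervision: without them, the program could group similar images but could not name the digit.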
