HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 23
#23
Methods to assess pattern interestingness, and their use to improve data mining
efficiency, are discussed throughout the book with respect to each kind of pattern
that can be mined.
1.5 Which Technologies Are Used?
As a highly application-driven domain, data mining has incorporated many techniques
from other domains such as statistics, machine learning, pattern recognition, database
and data warehouse systems, information retrieval, visualization, algorithms,
high-performance computing, and many application domains (Figure 1.11). The
interdisciplinary nature of data mining research and development contributes
significantly to the success of data mining and its extensive applications. In this
section, we give examples of several disciplines that strongly influence the
development of data mining methods.
1.5.1 Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation
of data. Data mining has an inherent connection with statistics.
[Figure 1.11: Data mining adopts techniques from many domains. Domains shown around
"Data Mining": statistics, machine learning, pattern recognition, visualization,
algorithms, high-performance computing, applications, information retrieval, data
warehouse, and database systems.]
A statistical model is a set of mathematical functions that describe the behavior of
the objects in a target class in terms of random variables and their associated
probability distributions. Statistical models are widely used to model data and data
classes. For example, in data mining tasks like data characterization and
classification, statistical models of target classes can be built. In other words,
such statistical models can be the outcome of a data mining task. Alternatively, data
mining tasks can be built on top of statistical models. For example, we can use
statistics to model noise and missing data values. Then, when mining patterns in a
large data set, the data mining process can use the model to help identify and handle
noisy or missing values in the data.
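As a rough illustration of building a mining step on top of a statistical model, the sketch below fits a single Gaussian (mean and standard deviation) to the observed values, fills missing entries with the fitted mean, and flags values far from the mean as possibly noisy. The function names, the sample readings, and the z-score threshold of 2.0 are assumptions made for this example; they are not taken from the text, and a real application would likely need a more robust model.

```python
import statistics


def fit_gaussian(values):
    """Estimate mean and sample standard deviation from the non-missing values."""
    observed = [v for v in values if v is not None]
    return statistics.mean(observed), statistics.stdev(observed)


def clean(values, z_threshold=2.0):
    """Fill missing values (None) with the fitted mean and flag likely noise.

    z_threshold is an illustrative assumption, not a recommended setting.
    Returns the cleaned list and the indices of values flagged as noisy.
    """
    mu, sigma = fit_gaussian(values)
    cleaned, noisy_indices = [], []
    for i, v in enumerate(values):
        if v is None:
            cleaned.append(mu)  # handle a missing value using the model
            continue
        if sigma > 0 and abs(v - mu) / sigma > z_threshold:
            noisy_indices.append(i)  # identify a value the model finds unlikely
        cleaned.append(v)
    return cleaned, noisy_indices


# Hypothetical sensor readings with two missing entries and one suspicious spike.
readings = [10.1, 9.8, None, 10.3, 9.9, 10.0, 25.0, 10.2, None, 9.7]
cleaned, noisy = clean(readings)
print(noisy)  # index of the spike (25.0)
```

Note that the spike itself inflates the fitted standard deviation, which is one reason a median-based estimate may be preferable in practice.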
Statistics research develops tools for prediction and forecasting using data and
statistical models. Statistical methods can be used to summarize or describe a
collection of data. Basic statistical descriptions of data are introduced in
Chapter 2. Statistics is useful for mining various patterns from data as well as for
understanding the underlying mechanisms generating and affecting the patterns.
Inferential statistics (or predictive statistics) models data in a way that accounts
for randomness and uncertainty in the observations and is used to draw inferences
about the process or population under investigation.
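The difference between describing a sample and drawing an inference from it can be sketched as follows. The first function summarizes the data; the second estimates an approximate 95% confidence interval for the population mean. The sample values and the use of a normal approximation (z = 1.96) are assumptions for illustration; for small samples a t-distribution quantile would usually be more appropriate.

```python
import math
import statistics


def describe(sample):
    """Descriptive statistics: summarize the sample itself."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation
    }


def mean_confidence_interval(sample, z=1.96):
    """Inferential statistics: approximate interval for the population mean.

    Uses a normal approximation; z = 1.96 corresponds to roughly 95% coverage.
    """
    center = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return center - z * standard_error, center + z * standard_error


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
summary = describe(sample)
low, high = mean_confidence_interval(sample)
print(summary)
print(low, high)
```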
Statistical methods can also be used to verify data mining results. For example, after
a classification or prediction model is mined, the model should be verified by
statistical hypothesis testing. A statistical hypothesis test (sometimes called
confirmatory data analysis) makes statistical decisions using experimental data. A
result is called statistically significant if it is unlikely to have occurred by
chance. If the classification or prediction model holds true, then the descriptive
statistics of the model increase the soundness of the model.
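One simple way to check whether a mined classifier performs better than chance is a one-sided exact binomial test, sketched below. The null hypothesis is that each test prediction is correct with probability equal to the chance level. The test choice, the function name, and the numbers (16 correct out of 20, chance level 0.5) are assumptions for illustration; the text does not prescribe a particular test.

```python
from math import comb


def binomial_p_value(correct, total, chance=0.5):
    """Probability of at least `correct` successes in `total` trials under chance.

    This is the one-sided p-value of an exact binomial test.
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical evaluation: a binary classifier gets 16 of 20 test cases right.
p_value = binomial_p_value(16, 20, chance=0.5)
print(p_value)         # about 0.0059
print(p_value < 0.05)  # significant at the conventional 5% level
```

A small p-value indicates that the observed accuracy is unlikely under pure guessing; it does not by itself show that the model generalizes to other data.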
Applying statistical methods in data mining is far from trivial. Often, a serious
challenge is how to scale up a statistical method over a large data set. Many
statistical methods have high computational complexity. When such methods are applied
to large data sets that are also distributed across multiple logical or physical
sites, algorithms should be carefully designed and tuned to reduce the computational
cost. This challenge becomes even tougher for online applications, such as online
query suggestions in search engines, where data mining is required to continuously
handle fast, real-time data streams.
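To make the scaling concern concrete, the sketch below computes a mean and variance in a single pass over a stream (Welford's update) and combines partial results from separate sites without moving the raw data (the pairwise merge attributed to Chan et al.). These particular algorithms, the class name `RunningStats`, and the two example sites are choices made for this illustration, not methods named in the text.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass mean and variance using constant memory."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new stream element into the summary (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two summaries, e.g. from two sites, into a new one."""
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 for fewer than two elements."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Two hypothetical sites each summarize their own local data stream.
site_a, site_b = RunningStats(), RunningStats()
for x in [1, 2, 3, 4]:
    site_a.update(x)
for x in [5, 6, 7, 8, 9]:
    site_b.update(x)

merged = site_a.merge(site_b)
print(merged.n, merged.mean, merged.variance)
```

Only three numbers per site need to be exchanged, which is the kind of design that keeps communication and computation costs low for distributed or streaming data.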
1.5.2 Machine Learning
Machine learning investigates how computers can learn (or improve their performance)
based on data. A main research area is how computer programs can automatically learn
to recognize complex patterns and make intelligent decisions based on data. For
example, a typical machine learning problem is to program a computer so that it can
automatically recognize handwritten postal codes on mail after learning from a set of
examples.
Machine learning is a fast-growing discipline. Here, we illustrate classic problems in
machine learning that are highly related to data mining.
Supervised learning is basically a synonym for classification. The supervision in the
learning comes from the labeled examples in the training data set. For example, in
the postal code recognition problem, a set of handwritten postal code images and
their corresponding machine-readable translations are used as the training examples,
which supervise the learning of the classification model.
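The role of labeled examples can be sketched with a one-nearest-neighbor classifier, one of the simplest supervised methods: a new item receives the label of the most similar training example. The classifier choice and the two-number feature vectors standing in for digit images are hypothetical and chosen only for illustration; a real postal code recognizer would work on image data with a far richer model.

```python
import math


def nearest_neighbor_classify(training_examples, query):
    """Return the label of the training example closest to `query`.

    training_examples is a list of (feature_vector, label) pairs; the labels
    are the supervision that guides the prediction.
    """
    _, label = min(training_examples, key=lambda ex: math.dist(ex[0], query))
    return label


# Hypothetical features per handwritten digit: (closed loops, height/width ratio).
training_examples = [
    ((1.0, 1.1), "0"),
    ((1.0, 0.9), "0"),
    ((0.0, 2.8), "1"),
    ((0.0, 3.1), "1"),
]

print(nearest_neighbor_classify(training_examples, (0.1, 2.9)))  # closest to a "1"
print(nearest_neighbor_classify(training_examples, (0.9, 1.0)))  # closest to a "0"
```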