HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 18
#18
18
Chapter 1 Introduction
Typically, association rules are discarded as uninteresting if they do not satisfy both a
minimum support threshold and a minimum confidence threshold. Additional analysis
can be performed to uncover interesting statistical correlations between associated
attribute–value pairs.
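The two thresholds can be made concrete with a minimal sketch. The transactions, item names, and function names below are hypothetical and chosen only for illustration; support is taken as the fraction of transactions containing all items of a rule, and confidence as the fraction of transactions containing the antecedent that also contain the consequent.

```python
# Hypothetical transactions; the item names are illustrative assumptions.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return joint / base if base else 0.0


def is_interesting(antecedent, consequent, transactions, min_sup, min_conf):
    """Keep a rule only if it satisfies both thresholds."""
    sup = support(set(antecedent) | set(consequent), transactions)
    conf = confidence(antecedent, consequent, transactions)
    return sup >= min_sup and conf >= min_conf
```

With these five transactions, the rule computer => software has support 2/5 and confidence 2/4, so it passes thresholds of 0.3 and 0.5 but is discarded if the minimum confidence is raised to 0.6.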
Frequent itemset mining is a fundamental form of frequent pattern mining. The mining
of frequent patterns, associations, and correlations is discussed in Chapters 6 and 7,
where particular emphasis is placed on efficient algorithms for frequent itemset mining.
Sequential pattern mining and structured pattern mining are considered advanced
topics.
1.4.3 Classification and Regression for Predictive Analysis
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts. The model is derived based on the analysis of a set of
training data (i.e., data objects for which the class labels are known). The model is used
to predict the class label of objects for which the class label is unknown.
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical
formulae, or neural networks (Figure 1.9). A decision tree is a flowchart-like tree structure,
where each node denotes a test on an attribute value, each branch represents an outcome
of the test, and tree leaves represent classes or class distributions. Decision trees can easily
be converted to classification rules. A neural network, when used for classification, is typically
a collection of neuron-like processing units with weighted connections between the
units. There are many other methods for constructing classification models, such as naïve
Bayesian classification, support vector machines, and k-nearest-neighbor classification.
Figure 1.9
A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision
tree, or (c) a neural network.
(a) IF-THEN rules:
age(X, “youth”) AND income(X, “high”) → class(X, “A”)
age(X, “youth”) AND income(X, “low”) → class(X, “B”)
age(X, “middle_aged”) → class(X, “C”)
age(X, “senior”) → class(X, “C”)
(b) Decision tree: the root tests age?. The youth branch leads to a test on income?, whose
high branch ends in class A and whose low branch ends in class B. The middle_aged, senior
branch ends in class C.
(c) Neural network: inputs age and income, units f1 through f8, and outputs class A, class B,
and class C.
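The conversion from a decision tree to classification rules can be sketched in a few lines. The nested `(attribute, {value: subtree_or_label})` encoding and the function names are assumptions made for this illustration; the tree itself follows Figure 1.9(b).

```python
# Decision tree of Figure 1.9(b). The encoding (attribute, branches) with
# string leaves for class labels is an assumption made for illustration.
tree = ("age", {
    "youth": ("income", {"high": "A", "low": "B"}),
    "middle_aged": "C",
    "senior": "C",
})


def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


def to_rules(node, conditions=()):
    """Return one (conditions, class_label) rule per root-to-leaf path."""
    if isinstance(node, str):
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(to_rules(child, conditions + ((attribute, value),)))
    return rules


for conditions, label in to_rules(tree):
    antecedent = " AND ".join(f'{a}(X, "{v}")' for a, v in conditions)
    print(f'IF {antecedent} THEN class(X, "{label}")')
```

Each root-to-leaf path becomes one rule, so the loop prints the four IF-THEN rules of Figure 1.9(a), and `classify` returns the same label that the matching rule would assign.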
Whereas classification predicts categorical (discrete, unordered) labels, regression
models continuous-valued functions. That is, regression is used to predict missing or
unavailable numerical data values rather than (discrete) class labels. The term prediction
refers to both numeric prediction and class label prediction. Regression analysis is a
statistical methodology that is most often used for numeric prediction, although other
methods exist as well. Regression also encompasses the identification of distribution
trends based on the available data.
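As a minimal sketch of numeric prediction, the following fits a straight line by ordinary least squares. Simple linear regression is only one of many regression techniques, and the data values and function names here are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict a continuous value for x from a fitted (slope, intercept)."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations that lie exactly on y = 2x + 10.
periods = [1, 2, 3, 4, 5]
values = [12.0, 14.0, 16.0, 18.0, 20.0]
model = fit_line(periods, values)
```

Here `predict(model, 6)` extrapolates the trend to an unseen period and yields a numeric value rather than a class label, which is the distinction drawn above between regression and classification.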
Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process. Such attributes will be selected for the classification and regression
process. Other attributes, which are irrelevant, can then be excluded from consideration.
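One possible way to carry out relevance analysis is to rank attributes by information gain, that is, by how much knowing an attribute's value reduces the entropy of the class labels. The text does not prescribe a particular measure, so this choice, the records, and the function names below are all assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(records, attribute, label_key):
    """Reduction in class-label entropy after splitting on attribute."""
    labels = [r[label_key] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[label_key])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical labeled records; attribute names and values are illustrative.
items = [
    {"price": "low", "brand": "X", "response": "good"},
    {"price": "low", "brand": "Y", "response": "good"},
    {"price": "high", "brand": "X", "response": "none"},
    {"price": "high", "brand": "Y", "response": "none"},
]

# Attributes ordered from most to least relevant under this measure.
ranked = sorted(
    ["price", "brand"],
    key=lambda a: information_gain(items, a, "response"),
    reverse=True,
)
```

In this toy data set, price separates the two classes perfectly (gain of 1 bit) while brand carries no information about the class (gain of 0), so brand would be a candidate for exclusion.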
Example 1.8
Classification and regression. Suppose, as a sales manager of AllElectronics, you want to
classify a large set of items in the store, based on three kinds of responses to a sales campaign:
good response, mild response, and no response. You want to derive a model for each
of these three classes based on the descriptive features of the items, such as price, brand,
place made, type, and category. The resulting classification should maximally distinguish
each class from the others, presenting an organized picture of the data set.
Suppose that the resulting classification is expressed as a decision tree. The decision
tree, for instance, may identify price as being the single factor that best distinguishes the
three classes. The tree may reveal that, in addition to price, other features that help to
further distinguish objects of each class from one another include brand and place made.
Such a decision tree may help you understand the impact of the given sales campaign
and design a more effective campaign in the future.
Suppose instead that, rather than predicting categorical response labels for each store
item, you would like to predict the amount of revenue that each item will generate
during an upcoming sale at AllElectronics, based on the previous sales data. This is an
example of regression analysis because the regression model constructed will predict a
continuous function (or ordered value).
Chapters 8 and 9 discuss classification in further detail. Regression analysis is beyond
the scope of this book. Sources for further information are given in the bibliographic
notes.
1.4.4 Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels. In many cases,
class-labeled data may simply not exist at the beginning. Clustering can be used to generate