Data Mining. Concepts and Techniques, 3rd Edition

HAN 08-ch01-001-038-9780123814791



HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 18

Chapter 1 Introduction

Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. Additional analysis can be performed to uncover interesting statistical correlations between associated attribute–value pairs.
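The two thresholds can be illustrated with a minimal Python sketch. It assumes the usual definitions: the support of a rule is the fraction of transactions containing all of its items, and its confidence is that support divided by the support of the antecedent. The function names, the toy transactions, and the threshold values are hypothetical and chosen only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / base


def is_interesting(antecedent, consequent, transactions, min_sup, min_conf):
    """Keep a rule only if it meets BOTH minimum thresholds."""
    joint = support(set(antecedent) | set(consequent), transactions)
    conf = confidence(antecedent, consequent, transactions)
    return joint >= min_sup and conf >= min_conf


# Hypothetical toy data: each transaction is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

# Rule: computer => software (support 2/4, confidence 2/3).
print(support({"computer", "software"}, transactions))
print(confidence({"computer"}, {"software"}, transactions))
print(is_interesting({"computer"}, {"software"}, transactions, 0.4, 0.6))
```

With these toy values the rule passes a 40% support and 60% confidence threshold, but would be discarded if the confidence threshold were raised to 70%.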

Frequent itemset mining is a fundamental form of frequent pattern mining. The mining of frequent patterns, associations, and correlations is discussed in Chapters 6 and 7, where particular emphasis is placed on efficient algorithms for frequent itemset mining. Sequential pattern mining and structured pattern mining are considered advanced topics.

1.4.3 Classification and Regression for Predictive Analysis

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known). The model is used to predict the class label of objects for which the class label is unknown.
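The train-then-predict workflow can be sketched in a few lines of Python. The example below uses a k-nearest-neighbor vote, which is only one of many possible classification methods. The function name `knn_predict`, the two-dimensional feature vectors, and the labels are assumptions made for this sketch.

```python
from collections import Counter
import math


def knn_predict(training_data, query, k=3):
    """Predict the label of `query` by majority vote among the k closest
    training objects. `training_data` is a list of (features, label) pairs."""
    by_distance = sorted(training_data, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: objects whose class labels are known.
training_data = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
    ((4.8, 5.1), "B"),
]

# Objects whose class label is unknown.
print(knn_predict(training_data, (1.1, 1.0)))  # prints "A"
print(knn_predict(training_data, (5.1, 5.0)))  # prints "B"
```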

“How is the derived model presented?” The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks (Figure 1.9). A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest-neighbor classification.

Figure 1.9 A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision tree, or (c) a neural network.

(a) age(X, “youth”) AND income(X, “high”) → class(X, “A”)
    age(X, “youth”) AND income(X, “low”) → class(X, “B”)
    age(X, “middle_aged”) → class(X, “C”)
    age(X, “senior”) → class(X, “C”)

(b) age? youth: income? (high: class A; low: class B); middle_aged, senior: class C

(c) Inputs age and income; units f1 to f8; outputs class A, class B, class C
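The conversion from a decision tree to classification rules can be sketched as follows. The nested-tuple encoding of the tree in Figure 1.9(b) and the function names `classify` and `tree_to_rules` are assumptions made for this illustration; the text does not prescribe a data structure.

```python
# Figure 1.9(b) encoded as (attribute, {outcome: subtree or class label}).
# A plain string stands for a leaf, i.e., a class.
tree = ("age", {
    "youth": ("income", {"high": "A", "low": "B"}),
    "middle_aged": "C",
    "senior": "C",
})


def classify(node, record):
    """Follow the branch matching the record's attribute values to a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


def tree_to_rules(node, conditions=()):
    """Emit one IF-THEN rule for every root-to-leaf path of the tree."""
    if isinstance(node, str):
        antecedent = " AND ".join(f'{a}(X, "{v}")' for a, v in conditions)
        return [f'IF {antecedent} THEN class(X, "{node}")']
    attribute, branches = node
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules


for rule in tree_to_rules(tree):
    print(rule)

print(classify(tree, {"age": "youth", "income": "low"}))  # prints "B"
```

Each path from the root to a leaf becomes one rule, so the four leaves of the tree yield the four rules of Figure 1.9(a).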

Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-valued functions. That is, regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels. The term prediction refers to both numeric prediction and class label prediction. Regression analysis is a statistical methodology that is most often used for numeric prediction, although other methods exist as well. Regression also encompasses the identification of distribution trends based on the available data.
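As a minimal illustration of numeric prediction, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. Simple linear regression is only one form of regression analysis; the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # numeric prediction for an unseen x
print(slope, intercept, predicted)
```

Unlike a classifier, the fitted model returns a continuous value for any input rather than one of a fixed set of labels.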

Classification and regression may need to be preceded by relevance analysis, which attempts to identify attributes that are significantly relevant to the classification and regression process. Such attributes will be selected for the classification and regression process. Other attributes, which are irrelevant, can then be excluded from consideration.
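One possible way to carry out relevance analysis is to score each attribute by its information gain with respect to the class label and keep only the highest-scoring attributes. The text above does not name a specific measure, so the use of information gain here is an assumption, as are the function names and the toy records.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(records, attribute, label_key):
    """Reduction in label entropy obtained by splitting on `attribute`."""
    labels = [r[label_key] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[label_key])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical records: price determines the label, brand is irrelevant.
records = [
    {"price": "high", "brand": "x", "response": "good"},
    {"price": "high", "brand": "y", "response": "good"},
    {"price": "low", "brand": "x", "response": "none"},
    {"price": "low", "brand": "y", "response": "none"},
]

gains = {a: information_gain(records, a, "response") for a in ("price", "brand")}
ranked = sorted(gains, key=gains.get, reverse=True)
print(gains)
print(ranked)  # most relevant attribute first
```

In this toy data set, price separates the labels perfectly while brand carries no information, so brand could be excluded before building the model.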

Example 1.8 Classification and regression. Suppose, as a sales manager of AllElectronics, you want to classify a large set of items in the store, based on three kinds of responses to a sales campaign: good response, mild response, and no response. You want to derive a model for each of these three classes based on the descriptive features of the items, such as price, brand, place made, type, and category. The resulting classification should maximally distinguish each class from the others, presenting an organized picture of the data set.

Suppose that the resulting classification is expressed as a decision tree. The decision tree, for instance, may identify price as being the single factor that best distinguishes the three classes. The tree may reveal that, in addition to price, other features that help to further distinguish objects of each class from one another include brand and place made. Such a decision tree may help you understand the impact of the given sales campaign and design a more effective campaign in the future.

Suppose instead that, rather than predicting categorical response labels for each store item, you would like to predict the amount of revenue that each item will generate during an upcoming sale at AllElectronics, based on the previous sales data. This is an example of regression analysis because the regression model constructed will predict a continuous function (or ordered value).

Chapters 8 and 9 discuss classification in further detail. Regression analysis is beyond the scope of this book. Sources for further information are given in the bibliographic notes.

1.4.4 Cluster Analysis

Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels. In many cases, class-labeled data may simply not exist at the beginning. Clustering can be used to generate
