Data Mining. Concepts and Techniques, 3rd Edition

HAN 10-ch03-083-124-9780123814791

Yüklə 7,95 Mb.

Pdf görüntüsü

səhifə	65/343
tarix	08.10.2017
ölçüsü	7,95 Mb.
	#3817

1 ... 61 62 63 64 65 66 67 68 ... 343

Discretization by Cluster, Decision Tree, and Correlation Analyses
Concept Hierarchy Generation for Nominal Data

HAN

10-ch03-083-124-9780123814791

2011/6/1

3:16

Page 116

#34

116

Chapter 3 Data Preprocessing

Various partitioning rules can be used to deﬁne histograms (Section 3.4.6). In an

equal-width histogram, for example, the values are partitioned into equal-size partitions

or ranges (e.g., earlier in Figure 3.8 for price, where each bucket has a width of $10).

With an equal-frequency histogram, the values are partitioned so that, ideally, each par-

tition contains the same number of data tuples. The histogram analysis algorithm can be

applied recursively to each partition in order to automatically generate a multilevel con-

cept hierarchy, with the procedure terminating once a prespeciﬁed number of concept

levels has been reached. A minimum interval size can also be used per level to control the

recursive procedure. This speciﬁes the minimum width of a partition, or the minimum

number of values for each partition at each level. Histograms can also be partitioned

based on cluster analysis of the data distribution, as described next.

3.5.5

Discretization by Cluster, Decision Tree,

and Correlation Analyses

Clustering, decision tree analysis, and correlation analysis can be used for data dis-

cretization. We brieﬂy study each of these approaches.

Cluster analysis is a popular data discretization method. A clustering algorithm can

be applied to discretize a numeric attribute, A, by partitioning the values of A into clus-

ters or groups. Clustering takes the distribution of A into consideration, as well as the

closeness of data points, and therefore is able to produce high-quality discretization

results.

Clustering can be used to generate a concept hierarchy for A by following either a

top-down splitting strategy or a bottom-up merging strategy, where each cluster forms

a node of the concept hierarchy. In the former, each initial cluster or partition may

be further decomposed into several subclusters, forming a lower level of the hiera-

rchy. In the latter, clusters are formed by repeatedly grouping neighboring clusters in

order to form higher-level concepts. Clustering methods for data mining are studied in

Chapters 10 and 11.

Techniques to generate decision trees for classiﬁcation (Chapter 8) can be applied to

discretization. Such techniques employ a top-down splitting approach. Unlike the other

methods mentioned so far, decision tree approaches to discretization are supervised,

that is, they make use of class label information. For example, we may have a data set of

patient symptoms (the attributes) where each patient has an associated diagnosis class

label. Class distribution information is used in the calculation and determination of

split-points (data values for partitioning an attribute range). Intuitively, the main idea

is to select split-points so that a given resulting partition contains as many tuples of the

same class as possible. Entropy is the most commonly used measure for this purpose. To

discretize a numeric attribute, A, the method selects the value of A that has the minimum

entropy as a split-point, and recursively partitions the resulting intervals to arrive at a

hierarchical discretization. Such discretization forms a concept hierarchy for A.

Because decision tree–based discretization uses class information, it is more likely

that the interval boundaries (split-points) are deﬁned to occur in places that may help

improve classiﬁcation accuracy. Decision trees and the entropy measure are described in

greater detail in Section 8.2.2.

HAN

10-ch03-083-124-9780123814791

2011/6/1

3:16

Page 117

#35

3.5 Data Transformation and Data Discretization

117

Measures of correlation can be used for discretization. ChiMerge is a

2

-based

discretization method. The discretization methods that we have studied up to this

point have all employed a top-down, splitting strategy. This contrasts with ChiMerge,

which employs a bottom-up approach by ﬁnding the best neighboring intervals and

then merging them to form larger intervals, recursively. As with decision tree analysis,

ChiMerge is supervised in that it uses class information. The basic notion is that for

accurate discretization, the relative class frequencies should be fairly consistent within

an interval. Therefore, if two adjacent intervals have a very similar distribution of classes,

then the intervals can be merged. Otherwise, they should remain separate.

ChiMerge proceeds as follows. Initially, each distinct value of a numeric attribute A is

considered to be one interval.

2

tests are performed for every pair of adjacent intervals.

Adjacent intervals with the least

values are merged together, because low

values

for a pair indicate similar class distributions. This merging process proceeds recursively

until a predeﬁned stopping criterion is met.

3.5.6

Concept Hierarchy Generation for Nominal Data

We now look at data transformation for nominal data. In particular, we study concept

hierarchy generation for nominal attributes. Nominal attributes have a ﬁnite (but pos-

sibly large) number of distinct values, with no ordering among the values. Examples

include geographic location, job category, and item type.

Manual deﬁnition of concept hierarchies can be a tedious and time-consuming task

for a user or a domain expert. Fortunately, many hierarchies are implicit within the

database schema and can be automatically deﬁned at the schema deﬁnition level. The

concept hierarchies can be used to transform the data into multiple levels of granular-

ity. For example, data mining patterns regarding sales may be found relating to speciﬁc

regions or countries, in addition to individual branch locations.

We study four methods for the generation of concept hierarchies for nominal data,

as follows.

1.

Speciﬁcation of a partial ordering of attributes explicitly at the schema level by

users or experts: Concept hierarchies for nominal attributes or dimensions typically

involve a group of attributes. A user or expert can easily deﬁne a concept hierarchy by

specifying a partial or total ordering of the attributes at the schema level. For exam-

ple, suppose that a relational database contains the following group of attributes:

street, city, province or state, and country. Similarly, a data warehouse location dimen-

sion may contain the same attributes. A hierarchy can be deﬁned by specifying the

total ordering among these attributes at the schema level such as street

< city <

province or state

< country.

2.

Speciﬁcation of a portion of a hierarchy by explicit data grouping: This is essen-

tially the manual deﬁnition of a portion of a concept hierarchy. In a large database,

it is unrealistic to deﬁne an entire concept hierarchy by explicit value enumera-

tion. On the contrary, we can easily specify explicit groupings for a small portion

of intermediate-level data. For example, after specifying that province and country

Yüklə 7,95 Mb.

Dostları ilə paylaş:

1 ... 61 62 63 64 65 66 67 68 ... 343