Data Mining. Concepts and Techniques, 3rd Edition

HAN 08-ch01-001-038-9780123814791

Yüklə 7,95 Mb.

Pdf görüntüsü

səhifə	21/343
tarix	08.10.2017
ölçüsü	7,95 Mb.
	#3817

1 ... 17 18 19 20 21 22 23 24 ... 343

HAN

08-ch01-001-038-9780123814791

2011/6/1

3:12

Page 20

#20

20

Chapter 1 Introduction

Figure 1.10

A 2-D plot of customer data with respect to customer locations in a city, showing three data

clusters.

class labels for a group of data. The objects are clustered or grouped based on the princi-

ple of maximizing the intraclass similarity and minimizing the interclass similarity. That is,

clusters of objects are formed so that objects within a cluster have high similarity in com-

parison to one another, but are rather dissimilar to objects in other clusters. Each cluster

so formed can be viewed as a class of objects, from which rules can be derived. Clus-

tering can also facilitate taxonomy formation, that is, the organization of observations

into a hierarchy of classes that group similar events together.

Example 1.9

Cluster analysis. Cluster analysis can be performed on AllElectronics customer data to

identify homogeneous subpopulations of customers. These clusters may represent indi-

vidual target groups for marketing. Figure 1.10 shows a 2-D plot of customers with

respect to customer locations in a city. Three clusters of data points are evident.

Cluster analysis forms the topic of Chapters 10 and 11.

1.4.5

Outlier Analysis

A data set may contain objects that do not comply with the general behavior or model

of the data. These data objects are outliers. Many data mining methods discard outliers

as noise or exceptions. However, in some applications (e.g., fraud detection) the rare

HAN

08-ch01-001-038-9780123814791

2011/6/1

3:12

Page 21

#21

1.4 What Kinds of Patterns Can Be Mined?

events can be more interesting than the more regularly occurring ones. The analysis of

outlier data is referred to as outlier analysis or anomaly mining.

Outliers may be detected using statistical tests that assume a distribution or proba-

bility model for the data, or using distance measures where objects that are remote from

any other cluster are considered outliers. Rather than using statistical or distance mea-

sures, density-based methods may identify outliers in a local region, although they look

normal from a global statistical distribution view.

Example 1.10

Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by

detecting purchases of unusually large amounts for a given account number in compari-

son to regular charges incurred by the same account. Outlier values may also be detected

with respect to the locations and types of purchase, or the purchase frequency.

Outlier analysis is discussed in Chapter 12.

1.4.6

Are All Patterns Interesting?

A data mining system has the potential to generate thousands or even millions of

patterns, or rules.

You may ask, “Are all of the patterns interesting?” Typically, the answer is no—only

a small fraction of the patterns potentially generated would actually be of interest to a

given user.

This raises some serious questions for data mining. You may wonder, “What makes a

pattern interesting? Can a data mining system generate all of the interesting patterns? Or,

Can the system generate only the interesting ones?”

To answer the ﬁrst question, a pattern is interesting if it is (1) easily understood by

humans, (2) valid on new or test data with some degree of certainty, (3) potentially

useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user

sought to conﬁrm. An interesting pattern represents knowledge.

Several objective measures of pattern interestingness exist. These are based on

the structure of discovered patterns and the statistics underlying them. An objective

measure for association rules of the form X ⇒ Y is rule support, representing the per-

centage of transactions from a transaction database that the given rule satisﬁes. This is

taken to be the probability P

(X ∪ Y), where X ∪ Y indicates that a transaction contains

both X and Y , that is, the union of itemsets X and Y . Another objective measure for

association rules is conﬁdence, which assesses the degree of certainty of the detected

association. This is taken to be the conditional probability P

(Y|X), that is, the prob-

ability that a transaction containing X also contains Y . More formally, support and

conﬁdence are deﬁned as

support

(X ⇒ Y) = P(X ∪ Y),

conﬁdence

(X ⇒ Y) = P(Y|X).

In general, each interestingness measure is associated with a threshold, which may be

controlled by the user. For example, rules that do not satisfy a conﬁdence threshold of,

HAN

08-ch01-001-038-9780123814791

2011/6/1

3:12

Page 22

#22

22

Chapter 1 Introduction

say, 50% can be considered uninteresting. Rules below the threshold likely reﬂect noise,

exceptions, or minority cases and are probably of less value.

Other objective interestingness measures include accuracy and coverage for classiﬁca-

tion (IF-THEN) rules. In general terms, accuracy tells us the percentage of data that are

correctly classiﬁed by a rule. Coverage is similar to support, in that it tells us the per-

centage of data to which a rule applies. Regarding understandability, we may use simple

objective measures that assess the complexity or length in bits of the patterns mined.

Although objective measures help identify interesting patterns, they are often insufﬁ-

cient unless combined with subjective measures that reﬂect a particular user’s needs and

interests. For example, patterns describing the characteristics of customers who shop

frequently at AllElectronics should be interesting to the marketing manager, but may be

of little interest to other analysts studying the same database for patterns on employee

performance. Furthermore, many patterns that are interesting by objective standards

may represent common sense and, therefore, are actually uninteresting.

Subjective interestingness measures are based on user beliefs in the data. These

measures ﬁnd patterns interesting if the patterns are unexpected (contradicting a user’s

belief) or offer strategic information on which the user can act. In the latter case, such

patterns are referred to as actionable. For example, patterns like “a large earthquake

often follows a cluster of small quakes” may be highly actionable if users can act on the

information to save lives. Patterns that are expected can be interesting if they conﬁrm a

hypothesis that the user wishes to validate or they resemble a user’s hunch.

The second question—“Can a data mining system generate all of the interesting pat-

terns?”—refers to the completeness of a data mining algorithm. It is often unrealistic

and inefﬁcient for data mining systems to generate all possible patterns. Instead, user-

provided constraints and interestingness measures should be used to focus the search.

For some mining tasks, such as association, this is often sufﬁcient to ensure the com-

pleteness of the algorithm. Association rule mining is an example where the use of

constraints and interestingness measures can ensure the completeness of mining. The

methods involved are examined in detail in Chapter 6.

Finally, the third question—“Can a data mining system generate only interesting pat-

terns?”—is an optimization problem in data mining. It is highly desirable for data

mining systems to generate only interesting patterns. This would be efﬁcient for users

and data mining systems because neither would have to search through the patterns gen-

erated to identify the truly interesting ones. Progress has been made in this direction;

however, such optimization remains a challenging issue in data mining.

Measures of pattern interestingness are essential for the efﬁcient discovery of patterns

by target users. Such measures can be used after the data mining step to rank the dis-

covered patterns according to their interestingness, ﬁltering out the uninteresting ones.

More important, such measures can be used to guide and constrain the discovery pro-

cess, improving the search efﬁciency by pruning away subsets of the pattern space that

do not satisfy prespeciﬁed interestingness constraints. Examples of such a constraint-

based mining process are described in Chapter 7 (with respect to pattern discovery) and

Chapter 11 (with respect to clustering).

Yüklə 7,95 Mb.

Dostları ilə paylaş:

1 ... 17 18 19 20 21 22 23 24 ... 343