HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 20
#20
20
Chapter 1 Introduction
Figure 1.10
A 2-D plot of customer data with respect to customer locations in a city, showing three data
clusters.
class labels for a group of data. The objects are clustered or grouped based on the princi-
ple of maximizing the intraclass similarity and minimizing the interclass similarity. That is,
clusters of objects are formed so that objects within a cluster have high similarity in com-
parison to one another, but are rather dissimilar to objects in other clusters. Each cluster
so formed can be viewed as a class of objects, from which rules can be derived. Clus-
tering can also facilitate taxonomy formation, that is, the organization of observations
into a hierarchy of classes that group similar events together.
Example 1.9
Cluster analysis. Cluster analysis can be performed on
AllElectronics customer data to
identify homogeneous subpopulations of customers. These clusters may represent indi-
vidual target groups for marketing. Figure 1.10 shows a 2-D plot of customers with
respect to customer locations in a city. Three clusters of data points are evident.
Cluster analysis forms the topic of Chapters 10 and 11.
1.4.5
Outlier Analysis
A data set may contain objects that do not comply with the general behavior or model
of the data. These data objects are outliers. Many data mining methods discard outliers
as noise or exceptions. However, in some applications (e.g., fraud detection) the rare
HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 21
#21
1.4 What Kinds of Patterns Can Be Mined?
21
events can be more interesting than the more regularly occurring ones. The analysis of
outlier data is referred to as outlier analysis or anomaly mining.
Outliers may be detected using statistical tests that assume a distribution or proba-
bility model for the data, or using distance measures where objects that are remote from
any other cluster are considered outliers. Rather than using statistical or distance mea-
sures, density-based methods may identify outliers in a local region, although they look
normal from a global statistical distribution view.
Example 1.10
Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of unusually large amounts for a given account number in compari-
son to regular charges incurred by the same account. Outlier values may also be detected
with respect to the locations and types of purchase, or the purchase frequency.
Outlier analysis is discussed in Chapter 12.
1.4.6
Are All Patterns Interesting?
A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
You may ask, “Are all of the patterns interesting?” Typically, the answer is no—only
a small fraction of the patterns potentially generated would actually be of interest to a
given user.
This raises some serious questions for data mining. You may wonder, “What makes a
pattern interesting? Can a data mining system generate all of the interesting patterns? Or,
Can the system generate only the interesting ones?”
To answer the first question, a pattern is interesting if it is (1) easily understood by
humans, (2) valid on new or test data with some degree of certainty, (3) potentially
useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user
sought to confirm. An interesting pattern represents knowledge.
Several objective measures of pattern interestingness exist. These are based on
the structure of discovered patterns and the statistics underlying them. An objective
measure for association rules of the form X ⇒ Y is rule support, representing the per-
centage of transactions from a transaction database that the given rule satisfies. This is
taken to be the probability P
(X ∪ Y), where X ∪ Y indicates that a transaction contains
both X and Y , that is, the union of itemsets X and Y . Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability P
(Y|X), that is, the prob-
ability that a transaction containing X also contains Y . More formally, support and
confidence are defined as
support
(X ⇒ Y) = P(X ∪ Y),
confidence
(X ⇒ Y) = P(Y|X).
In general, each interestingness measure is associated with a threshold, which may be
controlled by the user. For example, rules that do not satisfy a confidence threshold of,
HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 22
#22
22
Chapter 1 Introduction
say, 50% can be considered uninteresting. Rules below the threshold likely reflect noise,
exceptions, or minority cases and are probably of less value.
Other objective interestingness measures include accuracy and coverage for classifica-
tion (IF-THEN) rules. In general terms, accuracy tells us the percentage of data that are
correctly classified by a rule. Coverage is similar to support, in that it tells us the per-
centage of data to which a rule applies. Regarding understandability, we may use simple
objective measures that assess the complexity or length in bits of the patterns mined.
Although objective measures help identify interesting patterns, they are often insuffi-
cient unless combined with subjective measures that reflect a particular user’s needs and
interests. For example, patterns describing the characteristics of customers who shop
frequently at AllElectronics should be interesting to the marketing manager, but may be
of little interest to other analysts studying the same database for patterns on employee
performance. Furthermore, many patterns that are interesting by objective standards
may represent common sense and, therefore, are actually uninteresting.
Subjective interestingness measures are based on user beliefs in the data. These
measures find patterns interesting if the patterns are unexpected (contradicting a user’s
belief) or offer strategic information on which the user can act. In the latter case, such
patterns are referred to as actionable. For example, patterns like “a large earthquake
often follows a cluster of small quakes” may be highly actionable if users can act on the
information to save lives. Patterns that are expected can be interesting if they confirm a
hypothesis that the user wishes to validate or they resemble a user’s hunch.
The second question—“Can a data mining system generate all of the interesting pat-
terns?”—refers to the
completeness of a data mining algorithm. It is often unrealistic
and inefficient for data mining systems to generate all possible patterns. Instead, user-
provided constraints and interestingness measures should be used to focus the search.
For some mining tasks, such as association, this is often sufficient to ensure the com-
pleteness of the algorithm. Association rule mining is an example where the use of
constraints and interestingness measures can ensure the completeness of mining. The
methods involved are examined in detail in Chapter 6.
Finally, the third question—“Can a data mining system generate only interesting pat-
terns?”—is an optimization problem in data mining. It is highly desirable for data
mining systems to generate only interesting patterns. This would be efficient for users
and data mining systems because neither would have to search through the patterns gen-
erated to identify the truly interesting ones. Progress has been made in this direction;
however, such optimization remains a challenging issue in data mining.
Measures of pattern interestingness are essential for the efficient discovery of patterns
by target users. Such measures can be used after the data mining step to rank the dis-
covered patterns according to their interestingness, filtering out the uninteresting ones.
More important, such measures can be used to guide and constrain the discovery pro-
cess, improving the search efficiency by pruning away subsets of the pattern space that
do not satisfy prespecified interestingness constraints. Examples of such a constraint-
based mining process are described in Chapter 7 (with respect to pattern discovery) and
Chapter 11 (with respect to clustering).