buyer card when the
customer does not supply one, either so that the customer
can get a discount that would otherwise be unavailable or simply to accumulate
credit points in the cashier’s account. Only a deep semantic knowledge of what is
going on will be able to explain systematic data errors such as these.
Finally, data goes stale. Many items change as circumstances change. For
example, items in mailing lists—names, addresses, telephone numbers, and so
on—change frequently. You need to consider whether the data you are mining
is still current.
Getting to know your data
There is no substitute for getting to know your data. Simple tools that show his-
tograms of the distribution of values of nominal attributes, and graphs of the
values of numeric attributes (perhaps sorted or simply graphed against instance
number), are very helpful. These graphical visualizations of the data make it
easy to identify outliers, which may well represent errors in the data file—or
arcane conventions for coding unusual situations, such as a missing year as 9999
or a missing weight as
-1 kg, that no one has thought to tell you about. Domain
experts need to be consulted to explain anomalies, missing values, the signifi-
cance of integers that represent categories rather than numeric quantities, and
so on. Pairwise plots of one attribute against another, or each attribute against
the class value, can be extremely revealing.
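The kind of histogram such tools draw for a nominal attribute can be approximated in a few lines. The sketch below is purely illustrative, using an invented list of outlook values and only Python's standard library:

```python
# A quick text histogram of a nominal attribute's value distribution.
# The outlook values here are an invented illustrative sample.
from collections import Counter

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
           "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
           "overcast", "rainy"]

counts = Counter(outlook)
for value, n in counts.most_common():
    # One '#' per occurrence gives a crude horizontal bar chart.
    print(f"{value:10s} {'#' * n}  ({n})")
```

Even a crude display like this makes a skewed distribution, or an unexpected value that appears only once or twice, immediately visible.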
Data cleaning is a time-consuming and labor-intensive procedure but one
that is absolutely necessary for successful data mining. With a large dataset,
people often give up—how can they possibly check it all? Instead, you should
sample a few instances and examine them carefully. You’ll be surprised at what
you find. Time looking at your data is always well spent.
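Spot-checking can be as simple as drawing a small random sample and reading the instances one by one. The sketch below is illustrative, with invented records that include the kind of coded missing values discussed above:

```python
# Drawing a few instances at random for careful manual inspection.
# The dataset here is invented; note the suspicious -1 and 9999 values.
import random

dataset = [{"id": i, "weight_kg": w}
           for i, w in enumerate([70, 82, -1, 65, 9999, 58, 74, 91])]

random.seed(0)                        # make the spot-check reproducible
for instance in random.sample(dataset, 3):
    print(instance)
```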
2.5 Further reading
Pyle (1999) provides an extensive guide to data preparation for data mining.
There is also a great deal of current interest in data warehousing and the prob-
lems it entails. Kimball (1996) offers the best introduction to these that we know
of. Cabena et al. (1998) estimate that data preparation accounts for 60% of the
effort involved in a data mining application, and they write at some length about
the problems involved.
The area of inductive logic programming, which deals with finite and infi-
nite relations, is covered by Bergadano and Gunetti (1996). The different “levels
of measurement” for attributes were introduced by Stevens (1946) and are well
described in the manuals for statistical packages such as SPSS (Nie et al. 1970).
3 Output: Knowledge Representation

Most of the techniques in this book produce easily comprehensible descriptions
of the structural patterns in the data. Before looking at how these techniques
work, we have to see how structural patterns can be expressed. There are many
different ways of representing the patterns that can be discovered by machine
learning, and each one dictates the kind of technique that can be used to infer
that output structure from data. Once you understand how the output is
represented, you have come a long way toward understanding how it can be
generated.

We saw many examples of data mining in Chapter 1. In these cases the output
took the form of decision trees and classification rules, which are basic
knowledge representation styles that many machine learning methods use.
Knowledge is really too imposing a word for a decision tree or a collection of
rules, and by using it we don’t really mean to imply that these structures vie
with the real kind of knowledge that we carry in our heads: it’s just that we
need some word to refer to the structures that learning methods produce. There
are more complex varieties of rules that allow exceptions to be specified, and
ones that can express relations among the values of the attributes of different
instances. Special forms of trees can be used for numeric prediction, too.
Instance-based representations focus on the instances themselves rather than
rules that govern their attribute values. Finally, some learning methods
generate clusters of instances. These different knowledge representation
methods parallel the different kinds of learning problems introduced in
Chapter 2.
3.1 Decision tables
The simplest, most rudimentary way of representing the output from machine
learning is to make it just the same as the input—a decision table. For example,
Table 1.2 is a decision table for the weather data: you just look up the appro-
priate conditions to decide whether or not to play. Less trivially, creating a deci-
sion table might involve selecting some of the attributes. If temperature is
irrelevant to the decision, for example, a smaller, condensed table with that
attribute missing would be a better guide. The problem is, of course, to decide
which attributes to leave out without affecting the final decision.
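A decision table is nothing more than a lookup structure. The sketch below is illustrative, keyed on the familiar weather attributes with temperature omitted on the assumption that it is irrelevant; the entries are consistent with the weather data, not copied from any particular table:

```python
# A decision table as a plain lookup: (outlook, humidity, windy) -> play?
# Temperature is left out, assuming it does not affect the decision.
decision_table = {
    ("sunny",    "high",   False): "no",
    ("sunny",    "high",   True):  "no",
    ("sunny",    "normal", False): "yes",
    ("sunny",    "normal", True):  "yes",
    ("overcast", "high",   False): "yes",
    ("overcast", "high",   True):  "yes",
    ("overcast", "normal", False): "yes",
    ("overcast", "normal", True):  "yes",
    ("rainy",    "high",   False): "yes",
    ("rainy",    "high",   True):  "no",
    ("rainy",    "normal", False): "yes",
    ("rainy",    "normal", True):  "no",
}

def decide(outlook, humidity, windy):
    """Look up the decision for one combination of conditions."""
    return decision_table[(outlook, humidity, windy)]

print(decide("sunny", "normal", False))  # yes
```

Dropping an attribute shrinks the table: here 12 entries suffice instead of the 24 that would be needed if temperature (hot, mild, cool) were retained.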
3.2 Decision trees
A “divide-and-conquer” approach to the problem of learning from a set of inde-
pendent instances leads naturally to a style of representation called a decision
tree. We have seen some examples of decision trees, for the contact lens (Figure
1.2) and labor negotiations (Figure 1.3) datasets. Nodes in a decision tree involve
testing a particular attribute. Usually, the test at a node compares an attribute
value with a constant. However, some trees compare two attributes with each
other, or use some function of one or more attributes. Leaf nodes give a classi-
fication that applies to all instances that reach the leaf, or a set of classifications,
or a probability distribution over all possible classifications. To classify an
unknown instance, it is routed down the tree according to the values of the
attributes tested in successive nodes, and when a leaf is reached the instance is
classified according to the class assigned to the leaf.
If the attribute that is tested at a node is a nominal one, the number of chil-
dren is usually the number of possible values of the attribute. In this case,
because there is one branch for each possible value, the same attribute will not
be retested further down the tree. Sometimes the attribute values are divided
into two subsets, and the tree branches just two ways depending on which subset
the value lies in; in that case, the attribute might be tested more than
once in a path.
If the attribute is numeric, the test at a node usually determines whether its
value is greater or less than a predetermined constant, giving a two-way split.
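The routing procedure just described can be sketched in a few lines. The node structure and the small weather-data tree below are illustrative (they are not any particular system's representation), and the sketch handles nominal tests only; a numeric test would compare the value against a threshold and branch two ways:

```python
# A minimal decision-tree sketch: internal nodes test one nominal
# attribute; leaves hold a class label.

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this node
        self.branches = branches or {}  # attribute value -> child node
        self.label = label              # class label if this is a leaf

def classify(node, instance):
    """Route an instance down the tree until a leaf is reached."""
    while node.label is None:
        node = node.branches[instance[node.attribute]]
    return node.label

# An illustrative tree for the weather data: test outlook first, then
# humidity (sunny branch) or windy (rainy branch); overcast is a leaf.
tree = Node("outlook", {
    "sunny":    Node("humidity", {"high":   Node(label="no"),
                                  "normal": Node(label="yes")}),
    "overcast": Node(label="yes"),
    "rainy":    Node("windy",    {True:  Node(label="no"),
                                  False: Node(label="yes")}),
})

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

Note that each path from root to leaf tests an attribute at most once, because every branch from the outlook node covers one of its possible values.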