buyer card when the
customer does not supply one, either so that the customer
can get a discount that would otherwise be unavailable or simply to accumulate
credit points in the cashier’s account. Only a deep semantic knowledge of what is
going on will be able to explain systematic data errors such as these.
Finally, data goes stale. Many items change as circumstances change. For
example, items in mailing lists—names, addresses, telephone numbers, and so
on—change frequently. You need to consider whether the data you are mining
is still current.
Getting to know your data
There is no substitute for getting to know your data. Simple tools that show his-
tograms of the distribution of values of nominal attributes, and graphs of the
values of numeric attributes (perhaps sorted or simply graphed against instance
number), are very helpful. These graphical visualizations of the data make it
easy to identify outliers, which may well represent errors in the data file—or
arcane conventions for coding unusual situations, such as a missing year as 9999
or a missing weight as
-1 kg, that no one has thought to tell you about. Domain
experts need to be consulted to explain anomalies, missing values, the signifi-
cance of integers that represent categories rather than numeric quantities, and
so on. Pairwise plots of one attribute against another, or each attribute against
the class value, can be extremely revealing.
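The kind of histogram such tools draw for a nominal attribute can be approximated in a few lines. The sketch below is purely illustrative, using an invented list of outlook values and only Python's standard library:

```python
# A quick text histogram of a nominal attribute's value distribution.
# The outlook values here are an invented illustrative sample.
from collections import Counter

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
           "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
           "overcast", "rainy"]

counts = Counter(outlook)
for value, n in counts.most_common():
    # One '#' per occurrence gives a crude horizontal bar chart.
    print(f"{value:10s} {'#' * n}  ({n})")
```

Even a crude display like this makes a skewed distribution, or an unexpected value that appears only once or twice, immediately visible.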
Data cleaning is a time-consuming and labor-intensive procedure but one
that is absolutely necessary for successful data mining. With a large dataset,
people often give up—how can they possibly check it all? Instead, you should
sample a few instances and examine them carefully. You’ll be surprised at what
you find. Time looking at your data is always well spent.
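Spot-checking can be as simple as drawing a small random sample and reading the instances one by one. The sketch below is illustrative, with invented records that include the kind of coded missing values discussed above:

```python
# Drawing a few instances at random for careful manual inspection.
# The dataset here is invented; note the suspicious -1 and 9999 values.
import random

dataset = [{"id": i, "weight_kg": w}
           for i, w in enumerate([70, 82, -1, 65, 9999, 58, 74, 91])]

random.seed(0)                        # make the spot-check reproducible
for instance in random.sample(dataset, 3):
    print(instance)
```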
2.5 Further reading
Pyle (1999) provides an extensive guide to data preparation for data mining.
There is also a great deal of current interest in data warehousing and the prob-
lems it entails. Kimball (1996) offers the best introduction to these that we know
of. Cabena et al. (1998) estimate that data preparation accounts for 60% of the
effort involved in a data mining application, and they write at some length about
the problems involved.
The area of inductive logic programming, which deals with finite and infi-
nite relations, is covered by Bergadano and Gunetti (1996). The different “levels
of measurement” for attributes were introduced by Stevens (1946) and are well
described in the manuals for statistical packages such as SPSS (Nie et al. 1970).
3 Output: Knowledge Representation

Most of the techniques in this book produce easily comprehensible descriptions
of the structural patterns in the data. Before looking at how these techniques
work, we have to see how structural patterns can be expressed. There are many
different ways of representing the patterns that can be discovered by machine
learning, and each one dictates the kind of technique that can be used to infer
that output structure from data. Once you understand how the output is
represented, you have come a long way toward understanding how it can be
generated.

We saw many examples of data mining in Chapter 1. In these cases the output
took the form of decision trees and classification rules, which are basic
knowledge representation styles that many machine learning methods use.
Knowledge is really too imposing a word for a decision tree or a collection of
rules, and by using it we don’t really mean to imply that these structures vie
with the real kind of knowledge that we carry in our heads: it’s just that we
need some word to refer to the structures that learning methods produce. There
are more complex varieties of rules that allow exceptions to be specified, and
ones that can express relations among the values of the attributes of different
instances. Special forms of trees can be used for numeric prediction, too.
Instance-based representations focus on the instances themselves rather than
rules that govern their attribute values. Finally, some learning methods
generate clusters of instances. These different knowledge representation
methods parallel the different kinds of learning problems introduced in
Chapter 2.
3.1 Decision tables
The simplest, most rudimentary way of representing the output from machine
learning is to make it just the same as the input—a decision table. For example,
Table 1.2 is a decision table for the weather data: you just look up the appro-
priate conditions to decide whether or not to play. Less trivially, creating a deci-
sion table might involve selecting some of the attributes. If temperature is
irrelevant to the decision, for example, a smaller, condensed table with that
attribute missing would be a better guide. The problem is, of course, to decide
which attributes to leave out without affecting the final decision.
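A decision table is nothing more than a lookup structure. The sketch below is illustrative, keyed on the familiar weather attributes with temperature omitted on the assumption that it is irrelevant; the entries are consistent with the weather data, not copied from any particular table:

```python
# A decision table as a plain lookup: (outlook, humidity, windy) -> play?
# Temperature is left out, assuming it does not affect the decision.
decision_table = {
    ("sunny",    "high",   False): "no",
    ("sunny",    "high",   True):  "no",
    ("sunny",    "normal", False): "yes",
    ("sunny",    "normal", True):  "yes",
    ("overcast", "high",   False): "yes",
    ("overcast", "high",   True):  "yes",
    ("overcast", "normal", False): "yes",
    ("overcast", "normal", True):  "yes",
    ("rainy",    "high",   False): "yes",
    ("rainy",    "high",   True):  "no",
    ("rainy",    "normal", False): "yes",
    ("rainy",    "normal", True):  "no",
}

def decide(outlook, humidity, windy):
    """Look up the decision for one combination of conditions."""
    return decision_table[(outlook, humidity, windy)]

print(decide("sunny", "normal", False))  # yes
```

Dropping an attribute shrinks the table: here 12 entries suffice instead of the 24 that would be needed if temperature (hot, mild, cool) were retained.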
3.2 Decision trees
A “divide-and-conquer” approach to the problem of learning from a set of inde-
pendent instances leads naturally to a style of representation called a decision
tree. We have seen some examples of decision trees, for the contact lens (Figure
1.2) and labor negotiations (Figure 1.3) datasets. Nodes in a decision tree involve
testing a particular attribute. Usually, the test at a node compares an attribute
value with a constant. However, some trees compare two attributes with each
other, or use some function of one or more attributes. Leaf nodes give a classi-
fication that applies to all instances that reach the leaf, or a set of classifications,
or a probability distribution over all possible classifications. To classify an
unknown instance, it is routed down the tree according to the values of the
attributes tested in successive nodes, and when a leaf is reached the instance is
classified according to the class assigned to the leaf.
If the attribute that is tested at a node is a nominal one, the number of chil-
dren is usually the number of possible values of the attribute. In this case,
because there is one branch for each possible value, the same attribute will not
be retested further down the tree. Sometimes the attribute values are divided
into two subsets, and the tree branches just two ways depending on which subset
the value lies in; in that case, the attribute might be tested more than
once in a path.
If the attribute is numeric, the test at a node usually determines whether its
value is greater or less than a predetermined constant, giving a two-way split.
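The routing procedure just described can be sketched in a few lines. The node structure and the small weather-data tree below are illustrative (they are not any particular system's representation), and the sketch handles nominal tests only; a numeric test would compare the value against a threshold and branch two ways:

```python
# A minimal decision-tree sketch: internal nodes test one nominal
# attribute; leaves hold a class label.

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this node
        self.branches = branches or {}  # attribute value -> child node
        self.label = label              # class label if this is a leaf

def classify(node, instance):
    """Route an instance down the tree until a leaf is reached."""
    while node.label is None:
        node = node.branches[instance[node.attribute]]
    return node.label

# An illustrative tree for the weather data: test outlook first, then
# humidity (sunny branch) or windy (rainy branch); overcast is a leaf.
tree = Node("outlook", {
    "sunny":    Node("humidity", {"high":   Node(label="no"),
                                  "normal": Node(label="yes")}),
    "overcast": Node(label="yes"),
    "rainy":    Node("windy",    {True:  Node(label="no"),
                                  False: Node(label="yes")}),
})

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

Note that each path from root to leaf tests an attribute at most once, because every branch from the outlook node covers one of its possible values.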