3.9 Clusters
When clusters rather than a classifier are learned, the output takes the form of a
diagram that shows how the instances fall into clusters. In the simplest case this
involves associating a cluster number with each instance, which might be
depicted by laying the instances out in two dimensions and partitioning the
space to show each cluster, as illustrated in Figure 3.9(a).
Some clustering algorithms allow one instance to belong to more than one
cluster, so the diagram might lay the instances out in two dimensions and draw
overlapping subsets representing each cluster (a Venn diagram), as in Figure
3.9(b). Some algorithms associate instances with clusters probabilistically
rather than categorically. In this case, for every instance there is a probability
or degree of membership with which it belongs to each of the clusters, as shown
in Figure 3.9(c). Because this particular association is meant to be probabilistic,
the numbers for each instance sum to one, although that is not always the
case. Other algorithms produce a hierarchical structure of clusters so that at
the top level the instance space divides into just a few clusters, each of which
divides into its own subclusters at the next level down, and so on. In this case a
diagram such as the one in Figure 3.9(d) is used, in which elements joined
together at lower levels are more tightly clustered than ones joined together at
higher levels. Diagrams such as this are called dendrograms. This term means
just the same thing as tree diagrams (the Greek word dendron means "a tree"),
but in clustering the more exotic version seems to be preferred, perhaps
because biological species are a prime application area for clustering techniques,
and ancient languages are often used for naming in biology.

[Figure 3.9 Different ways of representing clusters: (a) a partition of the instance space; (b) overlapping clusters drawn as a Venn diagram; (c) probabilistic degrees of membership, one row per instance; (d) a dendrogram.]
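The structure a dendrogram depicts can be sketched in a few lines of code. The following is a minimal illustration of single-linkage agglomerative clustering on one-dimensional points (the data values and the function name are invented for illustration): each point starts in its own cluster, and the two closest clusters are repeatedly merged. Pairs that merge earlier, at smaller distances, are the ones joined together at lower levels of the dendrogram.

```python
# Illustrative sketch: single-linkage agglomerative clustering on a few
# one-dimensional points, recording the order in which clusters merge.
# Merges that happen at smaller distances correspond to joins lower
# down in a dendrogram.

def single_linkage(points):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # (cluster_a, cluster_b, distance), in merge order
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest
        # to each other (the "single linkage" criterion).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = single_linkage([1, 2, 8, 9, 20])
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

On these five points the tight pairs {1, 2} and {8, 9} merge first at distance 1, the two resulting clusters join at distance 6, and the outlier 20 is absorbed last at distance 11, exactly the nesting a dendrogram would show.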
Clustering is often followed by a stage in which a decision tree or rule set is
inferred that allocates each instance to the cluster in which it belongs. Then, the
clustering operation is just one step on the way to a structural description.
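This two-stage idea can be sketched concretely. The toy example below (the attribute values, function names, and the single-threshold rule form are all invented for illustration) first runs a simple two-cluster k-means on one numeric attribute, then infers an explicit rule, here a threshold halfway between the two centroids, that allocates each instance to its cluster:

```python
# Stage one: a minimal two-cluster k-means on a single numeric attribute.
# (Toy data chosen so both groups stay non-empty during the iterations.)

def two_means(values, iterations=10):
    # Initialize the two centroids at the extremes of the data.
    c0, c1 = min(values), max(values)
    for _ in range(iterations):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

values = [1.0, 2.0, 1.5, 8.0, 9.0, 8.5]
c0, c1 = two_means(values)

# Stage two: turn the cluster assignment into an explicit rule.
# With one attribute and two clusters, the inferred "rule set" is a
# single threshold midway between the centroids.
threshold = (c0 + c1) / 2
rule = lambda v: 0 if v < threshold else 1

labels = [rule(v) for v in values]
print("centroids:", c0, c1)
print("rule: cluster 0 if value <", threshold)
print("labels:", labels)
```

The rule reproduces the clustering exactly here, but the point is the form of the output: once the threshold rule exists, the clusters have become a structural description that can classify new instances without rerunning the clustering.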
3.10 Further reading
Knowledge representation is a key topic in classical artificial intelligence and is
well represented by a comprehensive series of papers edited by Brachman and
Levesque (1985). However, these are about ways of representing handcrafted
rather than learned knowledge, and the kinds of representations that can be learned from
examples are quite rudimentary in comparison. In particular, the shortcomings
of propositional rules, which in logic are referred to as the propositional calcu-
lus, and the extra expressive power of relational rules, or the predicate calculus,
are well described in introductions to logic such as that in Chapter 2 of the book
by Genesereth and Nilsson (1987).
We mentioned the problem of dealing with conflict among different rules.
Various ways of doing this, called conflict resolution strategies, have been devel-
oped for use with rule-based programming systems. These are described in
books on rule-based programming, such as that by Brownston et al. (1985).
Again, however, they are designed for use with handcrafted rule sets rather than
ones that have been learned. The use of handcrafted rules with exceptions for
a large dataset has been studied by Gaines and Compton (1995), and Richards
and Compton (1998) describe their role as an alternative to classic knowledge
engineering.
Further information on the various styles of concept representation can be
found in the papers that describe machine learning methods of inferring con-
cepts from examples, and these are covered in the Further reading section of
Chapter 4 and the Discussion sections of Chapter 6.
Chapter 4
Algorithms: The Basic Methods
Now that we’ve seen how the inputs and outputs can be represented, it’s time
to look at the learning algorithms themselves. This chapter explains the basic
ideas behind the techniques that are used in practical data mining. We will not
delve too deeply into the trickier issues—advanced versions of the algorithms,
optimizations that are possible, complications that arise in practice. These topics
are deferred to Chapter 6, where we come to grips with real implementations
of machine learning methods such as the ones included in data mining toolkits
and used for real-world applications. It is important to understand these more
advanced issues so that you know what is really going on when you analyze a
particular dataset.
In this chapter we look at the basic ideas. One of the most instructive lessons
is that simple ideas often work very well, and we strongly recommend the adop-
tion of a “simplicity-first” methodology when analyzing practical datasets. There
are many different kinds of simple structure that datasets can exhibit. In one
dataset, there might be a single attribute that does all the work and the others
may be irrelevant or redundant. In another dataset, the attributes might