few exemplars are needed inside stable regions. For example, you
might expect
the required density of exemplars that lie well inside class boundaries to be
much less than the density that is needed near class boundaries. Deciding which
instances to save and which to discard is another key problem in instance-based
learning.
An apparent drawback to instance-based representations is that they do not
make explicit the structures that are learned. In a sense this violates the notion
of “learning” that we presented at the beginning of this book; instances do not
really “describe” the patterns in data. However, the instances combine with the
distance metric to carve out boundaries in instance space that distinguish one
class from another, and this is a kind of explicit representation of knowledge.
For example, given a single instance of each of two classes, the nearest-neighbor rule effectively splits the instance space along the perpendicular bisector of the line joining the instances. Given several instances of each class, the space is divided by a set of lines that represent the perpendicular bisectors of selected lines joining an instance of one class to one of another class. Figure 3.8(a) illustrates a nine-sided polygon that separates the filled-circle class from the open-circle class. This polygon is implicit in the operation of the nearest-neighbor rule.
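The rule just described can be sketched in a few lines of code. This is a minimal illustration, not the book's implementation: it assumes Euclidean distance and two-dimensional points, and the coordinates and class names are made up. With one instance per class, choosing the smaller distance is exactly the same as asking which side of the perpendicular bisector the query point falls on.

```python
# Minimal sketch of the 1-nearest-neighbor rule (illustrative data).
import math

def nearest_neighbor(point, instances):
    """Return the class of the training instance closest to `point`.

    `instances` is a list of ((x, y), label) pairs; ties go to the
    first instance in the list.
    """
    return min(instances, key=lambda inst: math.dist(point, inst[0]))[1]

# One instance of each class: the decision boundary is the
# perpendicular bisector, here the vertical line x = 2.
training = [((0.0, 0.0), "filled"), ((4.0, 0.0), "open")]
print(nearest_neighbor((1.0, 1.0), training))   # filled
print(nearest_neighbor((3.0, -1.0), training))  # open
```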
When training instances are discarded, the result is to save just a few prototypical examples of each class. Figure 3.8(b) shows as dark circles only the examples that actually get used in nearest-neighbor decisions: the others (the light gray ones) can be discarded without affecting the result. These prototypical examples serve as a kind of explicit knowledge representation.
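One concrete way to decide which instances to keep, sketched below, is the condensed nearest-neighbor idea attributed to Hart; the text does not name a particular algorithm, so treat this as one illustrative possibility. An instance is added to the prototype set only if the prototypes chosen so far would misclassify it, so instances deep inside stable regions are discarded.

```python
# A sketch of condensed nearest-neighbor prototype selection
# (illustrative; assumes Euclidean distance and toy data).
import math

def condense(instances):
    """Return a subset of ((x, y), label) pairs that still classifies
    every training instance correctly under the 1-NN rule."""
    prototypes = [instances[0]]
    changed = True
    while changed:
        changed = False
        for coords, label in instances:
            nearest = min(prototypes, key=lambda p: math.dist(coords, p[0]))
            if nearest[1] != label:        # current prototypes get it wrong
                prototypes.append((coords, label))
                changed = True
    return prototypes

data = [((0, 0), "a"), ((0.1, 0), "a"), ((5, 5), "b"), ((5.1, 5), "b")]
print(condense(data))  # → [((0, 0), 'a'), ((5, 5), 'b')]
```

On this toy data one instance per cluster suffices; near a genuine class boundary the method keeps more instances, matching the intuition that exemplar density should be highest near boundaries.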
Some instance-based representations go further and explicitly generalize the
instances. Typically, this is accomplished by creating rectangular regions that
enclose examples of the same class. Figure 3.8(c) shows the rectangular regions
that might be produced. Unknown examples that fall within one of the rectangles will be assigned the corresponding class; ones that fall outside all rectangles will be subject to the usual nearest-neighbor rule. Of course this produces different decision boundaries from the straightforward nearest-neighbor rule, as can be seen by superimposing the polygon in Figure 3.8(a) onto the rectangles. Any part of the polygon that lies within a rectangle will be chopped off and replaced by the rectangle’s boundary.

[Figure 3.8 Different ways of partitioning the instance space: (a) the nearest-neighbor decision boundary, (b) the prototype examples, (c) rectangular generalizations, and (d) nested rectangular regions.]
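The rectangle-plus-fallback scheme just described can be sketched as follows. The rectangle bounds, instances, and class names are illustrative, and axis-parallel rectangles with Euclidean fallback distance are assumptions of this sketch.

```python
# Sketch of classification by rectangular generalizations with a
# nearest-neighbor fallback (illustrative data and bounds).
import math

def classify(point, rectangles, instances):
    """Assign the class of the first rectangle enclosing `point`;
    otherwise fall back on the 1-NN rule over the raw instances."""
    x, y = point
    for (xlo, xhi, ylo, yhi), label in rectangles:
        if xlo <= x <= xhi and ylo <= y <= yhi:
            return label
    return min(instances, key=lambda inst: math.dist(point, inst[0]))[1]

rects = [((0, 2, 0, 2), "filled"), ((5, 7, 5, 7), "open")]
instances = [((1, 1), "filled"), ((6, 6), "open")]
print(classify((1.5, 0.5), rects, instances))  # inside a rectangle: filled
print(classify((4.0, 4.0), rects, instances))  # outside all: falls back to 1-NN
```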
Rectangular generalizations in instance space are just like rules with a special
form of condition, one that tests a numeric variable against an upper and lower
bound and selects the region in between. Different dimensions of the rectangle
correspond to tests on different attributes being ANDed together. Choosing
snugly fitting rectangular regions as tests leads to much more conservative rules
than those generally produced by rule-based machine learning methods,
because for each boundary of the region, there is an actual instance that lies on
(or just inside) that boundary. Tests such as x < a (where x is an attribute value and a is a constant) encompass an entire half-space—they apply no matter how
small x is as long as it is less than a. When doing rectangular generalization in instance space you can afford to be conservative, because if a new example is encountered that lies outside all regions you can fall back on the nearest-neighbor metric. With rule-based methods, an example to which no rules apply either cannot be classified at all or receives just a default classification. The advantage of more conservative rules is that, although incomplete, they may be more perspicuous
than a complete set of rules that covers all cases. Finally, ensuring that the
regions do not overlap is tantamount to ensuring that at most one rule can apply
to an example, eliminating another of the difficulties of rule-based systems—
what to do when several rules apply.
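The correspondence between rectangles and rules can be made explicit in code. The sketch below is illustrative: the attribute names and bounds are invented, and the rule is rendered as a plain string in the IF–THEN style used elsewhere in this chapter. Each dimension of the rectangle becomes an interval test on one attribute, and the tests are ANDed together.

```python
# Sketch: turning a hyperrectangle into a conjunctive rule
# (attribute names and bounds are illustrative).

def rectangle_to_rule(bounds, label):
    """`bounds` maps an attribute name to its (lower, upper) pair."""
    tests = [f"{lo} <= {attr} <= {hi}" for attr, (lo, hi) in bounds.items()]
    return "IF " + " AND ".join(tests) + f" THEN class = {label}"

rule = rectangle_to_rule({"x": (1.2, 3.4), "y": (0.5, 2.0)}, "filled")
print(rule)
# IF 1.2 <= x <= 3.4 AND 0.5 <= y <= 2.0 THEN class = filled
```

Note that each test bounds the attribute on both sides, which is what makes these rules more conservative than open-ended half-space tests like x < a.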
A more complex kind of generalization is to permit rectangular regions to
nest one within another. Then a region that is basically all one class can contain
an inner region of a different class, as illustrated in Figure 3.8(d). It is possible
to allow nesting within nesting so that the inner region can itself contain its own
inner region of a different class—perhaps the original class of the outer region.
This is analogous to allowing rules to have exceptions and exceptions to the
exceptions, as in Section 3.5.
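Nesting can be sketched by checking regions from the innermost outward, so that an inner region overrides the outer one exactly as an exception overrides a rule. The region layout below is illustrative, loosely matching the idea of Figure 3.8(d).

```python
# Sketch of nested rectangular regions: the first (innermost) enclosing
# region wins, implementing rules-with-exceptions (illustrative layout).

def classify_nested(point, regions):
    """`regions` is ordered innermost-first; returns None when the point
    lies outside all regions (where 1-NN fallback would apply)."""
    x, y = point
    for (xlo, xhi, ylo, yhi), label in regions:
        if xlo <= x <= xhi and ylo <= y <= yhi:
            return label
    return None

regions = [
    ((4, 6, 4, 6), "open"),      # inner exception region
    ((0, 10, 0, 10), "filled"),  # outer region
]
print(classify_nested((5, 5), regions))  # open (the exception applies)
print(classify_nested((1, 1), regions))  # filled
```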
It is worth pointing out a slight danger to the technique of visualizing
instance-based learning in terms of boundaries in example space: it makes the
implicit assumption that attributes are numeric rather than nominal. If the
various values that a nominal attribute can take on were laid out along a
line, generalizations involving a segment of that line would make no sense: each
test involves either one value for the attribute or all values for it (or perhaps an
arbitrary subset of values). Although you can more or less easily imagine extending the examples in Figure 3.8 to several dimensions, it is much harder to
imagine how rules involving nominal attributes will look in multidimensional
instance space. Many machine learning situations involve numerous attributes,
and our intuitions tend to lead us astray when extended to high-dimensional
spaces.