3.7 Trees for numeric prediction
The kinds of decision trees and rules that we have been looking at are designed
for predicting categories rather than numeric quantities. When it comes to pre-
dicting numeric quantities, as with the CPU performance data in Table 1.5, the
same kind of tree or rule representation can be used, but the leaf nodes of the
tree, or the right-hand side of the rules, would contain a numeric value that is
the average of all the training set values to which the leaf, or rule, applies.
Because statisticians use the term regression for the process of computing an
expression that predicts a numeric quantity, decision trees with averaged
numeric values at the leaves are called regression trees.
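To make this concrete, here is a minimal sketch in Python of how a regression tree forms its predictions. The root test on CHMIN echoes Figure 3.7, but the helper functions and the training values are illustrative, not taken from the actual CPU data.

    def leaf_value(targets):
        """A regression-tree leaf predicts the mean of the training
        target values that reach it."""
        return sum(targets) / len(targets)

    def predict(instance, split_attr, threshold, left_targets, right_targets):
        """A one-split regression 'stump': route the instance by the test,
        then return the averaged value stored at the chosen leaf."""
        if instance[split_attr] <= threshold:
            return leaf_value(left_targets)
        return leaf_value(right_targets)

    # Hypothetical usage: split on CHMIN <= 7.5 (the root test in Figure
    # 3.7) and predict the average PRP of the instances in each leaf.
    left = [19.3, 29.8, 18.3]   # made-up PRP values with CHMIN <= 7.5
    right = [157.0, 281.0]      # made-up PRP values with CHMIN > 7.5
    print(predict({"CHMIN": 4}, "CHMIN", 7.5, left, right))  # about 22.47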
Figure 3.7(a) shows a regression equation for the CPU performance data, and
Figure 3.7(b) shows a regression tree. The leaves of the tree are numbers that
represent the average outcome for instances that reach the leaf. The tree is much
larger and more complex than the regression equation, and if we calculate the
average of the absolute values of the errors between the predicted and the actual
CPU performance measures, it turns out to be significantly less for the tree than
for the regression equation. The regression tree is more accurate because a
simple linear model poorly represents the data in this problem. However, the
tree is cumbersome and difficult to interpret because of its large size.
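The error measure used here is simply the mean of the absolute differences between predicted and actual values, which might be computed as in this one-function sketch (not code from any particular system):

    def mean_absolute_error(actual, predicted):
        """Average of |actual - predicted| over all instances."""
        return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)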
It is possible to combine regression equations with regression trees. Figure
3.7(c) is a tree whose leaves contain linear expressions—that is, regression equa-
tions—rather than single predicted values. This is (slightly confusingly) called
a model tree. Figure 3.7(c) contains the six linear models that belong at the six
leaves, labeled LM1 through LM6. The model tree approximates continuous
functions by linear “patches,” a more sophisticated representation than either
linear regression or regression trees. Not only is the model tree smaller and
more comprehensible than the regression tree, but its average error on the
training data is also lower. (However, we will see in Chapter 5 that calculating the
average error on the training set is not in general a good way of assessing
the performance of models.)
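In code, the only change from a regression tree is that each leaf stores a function rather than a number. The following toy sketch collapses Figure 3.7(c) to two leaves: the coefficients of LM1 and LM5 are taken from the figure, but the routing is deliberately simplified and should not be read as the actual tree structure.

    def lm1(x):
        """LM1 from Figure 3.7(c): PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN."""
        return 8.29 + 0.004 * x["MMAX"] + 2.77 * x["CHMIN"]

    def lm5(x):
        """LM5 from Figure 3.7(c):
        PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN."""
        return 285 - 1.46 * x["MYCT"] + 1.02 * x["CACH"] - 9.39 * x["CHMIN"]

    def model_tree_predict(x):
        """Toy two-leaf model tree: route on the root test, then evaluate
        the linear model stored at the chosen leaf. (The real tree in
        Figure 3.7(c) has six leaves and further splits.)"""
        return lm1(x) if x["CHMIN"] <= 7.5 else lm5(x)

    print(model_tree_predict({"MMAX": 6000, "CHMIN": 4, "MYCT": 30,
                              "CACH": 8}))  # about 43.4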
3.8 Instance-based representation
The simplest form of learning is plain memorization, or rote learning. Once a
set of training instances has been memorized, on encountering a new instance
the memory is searched for the training instance that most strongly resembles
the new one. The only problem is how to interpret “resembles”: we will explain
that shortly. First, however, note that this is a completely different way of rep-
resenting the “knowledge” extracted from a set of instances: just store the
instances themselves and operate by relating new instances whose class is
unknown to existing ones whose class is known.
Figure 3.7 Models for the CPU performance data: (a) linear regression, (b) regression tree, and (c) model tree.

(a) Linear regression:
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

(b) Regression tree splitting on CHMIN, CACH, MMAX, MMIN, MYCT, and CHMAX. Each leaf shows the average PRP value of the training instances that reach it, followed by the number of instances and the percentage error, e.g., 64.6 (24/19.2%).

(c) Model tree with splits on the same attributes and six linear models at its leaves:
LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.012 MMIN
LM4: PRP = 19.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
Instead of trying to create rules,
work directly from the examples themselves. This is known as instance-based
learning. In a sense all the other learning methods are “instance-based,” too,
because we always start with a set of instances as the initial training informa-
tion. But the instance-based knowledge representation uses the instances them-
selves to represent what is learned, rather than inferring a rule set or decision
tree and storing it instead.
In instance-based learning, all the real work is done when the time comes to
classify a new instance rather than when the training set is processed. In a sense,
then, the difference between this method and the others we have seen is the time
at which the “learning” takes place. Instance-based learning is lazy, deferring the
real work as long as possible, whereas other methods are eager, producing a gen-
eralization as soon as the data has been seen. In instance-based learning, each
new instance is compared with existing ones using a distance metric, and the
closest existing instance is used to assign the class to the new one. This is called
the nearest-neighbor classification method. Sometimes more than one nearest
neighbor is used, and the majority class of the closest k neighbors (or the dis-
tance-weighted average, if the class is numeric) is assigned to the new instance.
This is termed the k-nearest-neighbor method.
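A minimal sketch of the k-nearest-neighbor method, in plain Python with illustrative data, might look like this:

    import math
    from collections import Counter

    def euclidean(a, b):
        """Standard Euclidean distance over numeric attribute vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(query, training, k=3):
        """training is a list of (attribute_vector, class_label) pairs.
        Find the k closest instances and return their majority class."""
        neighbors = sorted(training, key=lambda inst: euclidean(query, inst[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Hypothetical usage with two numeric attributes:
    data = [((1.0, 2.0), "yes"), ((1.2, 1.9), "yes"), ((5.0, 5.5), "no"),
            ((4.8, 5.1), "no"), ((0.9, 2.2), "yes")]
    print(knn_classify((1.1, 2.1), data, k=3))  # -> "yes"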
Computing the distance between two examples is trivial when examples have
just one numeric attribute: it is just the difference between the two attribute
values. It is almost as straightforward when there are several numeric attributes:
generally, the standard Euclidean distance is used. However, this assumes that
the attributes are normalized and are of equal importance, and one of the main
problems in learning is to determine which are the important features.
When nominal attributes are present, it is necessary to come up with a “dis-
tance” between different values of that attribute. What are the distances between,
say, the values red, green, and blue? Usually a distance of zero is assigned if the
values are identical; otherwise, the distance is one. Thus the distance between
red and red is zero but that between red and green is one. However, it may be
desirable to use a more sophisticated representation of the attributes. For
example, with more colors one could use a numeric measure of hue in color
space, making yellow closer to orange than it is to green and ocher closer still.
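The usual zero/one convention can be folded into a single per-attribute distance function, as in this sketch (which assumes nominal values are represented as strings and numeric values are already normalized to the range [0,1]):

    def attribute_distance(x, y):
        """Per-attribute distance: 0/1 for nominal values (identical or
        not), absolute difference for numeric values (assumed normalized)."""
        if isinstance(x, str):      # nominal attribute
            return 0.0 if x == y else 1.0
        return abs(x - y)           # numeric attribute

    def mixed_distance(a, b):
        """Euclidean-style distance over mixed nominal/numeric instances."""
        return sum(attribute_distance(x, y) ** 2 for x, y in zip(a, b)) ** 0.5

    print(mixed_distance(("red", 0.3), ("green", 0.5)))
    # about 1.02: the nominal mismatch contributes 1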
Some attributes will be more important than others, and this is usually
reflected in the distance metric by some kind of attribute weighting. Deriving
suitable attribute weights from the training set is a key problem in instance-
based learning.
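In code, weighting simply multiplies each attribute difference before it enters the sum; the hard part, as just noted, is choosing the weights. In this sketch they are set by hand purely for illustration:

    def weighted_distance(a, b, weights):
        """Euclidean distance with a per-attribute weight w_i, i.e.
        sqrt(sum_i (w_i * (a_i - b_i))^2). In practice the weights
        would be derived from the training set, not fixed by hand."""
        return sum((w * (x - y)) ** 2
                   for w, x, y in zip(weights, a, b)) ** 0.5

    # Hypothetical: the first attribute matters twice as much as the second.
    print(weighted_distance((0.2, 0.9), (0.5, 0.4), weights=(2.0, 1.0)))
    # about 0.78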
It may not be necessary, or desirable, to store all the training instances. For
one thing, this may make the nearest-neighbor calculation unbearably slow. For
another, it may consume unrealistic amounts of storage. Generally, some regions
of attribute space are more stable than others with regard to class, and just a