3.7 Trees for numeric prediction
The kinds of decision trees and rules that we have been looking at are designed
for predicting categories rather than numeric quantities. When it comes to pre-
dicting numeric quantities, as with the CPU performance data in Table 1.5, the
same kind of tree or rule representation can be used, but the leaf nodes of the
tree, or the right-hand side of the rules, would contain a numeric value that is
the average of all the training set values to which the leaf, or rule, applies.
Because statisticians use the term regression for the process of computing an
expression that predicts a numeric quantity, decision trees with averaged
numeric values at the leaves are called regression trees.
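To make this concrete, here is a minimal sketch in Python of how a regression tree forms its predictions. The root test on CHMIN echoes Figure 3.7, but the helper functions and the training values are illustrative, not taken from the actual CPU data.

    def leaf_value(targets):
        """A regression-tree leaf predicts the mean of the training
        target values that reach it."""
        return sum(targets) / len(targets)

    def predict(instance, split_attr, threshold, left_targets, right_targets):
        """A one-split regression 'stump': route the instance by the test,
        then return the averaged value stored at the chosen leaf."""
        if instance[split_attr] <= threshold:
            return leaf_value(left_targets)
        return leaf_value(right_targets)

    # Hypothetical usage: split on CHMIN <= 7.5 (the root test in Figure
    # 3.7) and predict the average PRP of the instances in each leaf.
    left = [19.3, 29.8, 18.3]   # made-up PRP values with CHMIN <= 7.5
    right = [157.0, 281.0]      # made-up PRP values with CHMIN > 7.5
    print(predict({"CHMIN": 4}, "CHMIN", 7.5, left, right))  # about 22.47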
Figure 3.7(a) shows a regression equation for the CPU performance data, and
Figure 3.7(b) shows a regression tree. The leaves of the tree are numbers that
represent the average outcome for instances that reach the leaf. The tree is much
larger and more complex than the regression equation, and if we calculate the
average of the absolute values of the errors between the predicted and the actual
CPU performance measures, it turns out to be significantly less for the tree than
for the regression equation. The regression tree is more accurate because a
simple linear model poorly represents the data in this problem. However, the
tree is cumbersome and difficult to interpret because of its large size.
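The error measure used here is simply the mean of the absolute differences between predicted and actual values, which might be computed as in this one-function sketch (not code from any particular system):

    def mean_absolute_error(actual, predicted):
        """Average of |actual - predicted| over all instances."""
        return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)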
It is possible to combine regression equations with regression trees. Figure
3.7(c) is a tree whose leaves contain linear expressions—that is, regression equa-
tions—rather than single predicted values. This is (slightly confusingly) called
a model tree. Figure 3.7(c) contains the six linear models that belong at the six
leaves, labeled LM1 through LM6. The model tree approximates continuous
functions by linear “patches,” a more sophisticated representation than either
linear regression or regression trees. Not only is the model tree smaller and
more comprehensible than the regression tree, but its average error on the
training data is also lower. (However, we will see in Chapter 5 that calculating the
average error on the training set is not in general a good way of assessing
the performance of models.)
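In code, the only change from a regression tree is that each leaf stores a function rather than a number. The following toy sketch collapses Figure 3.7(c) to two leaves: the coefficients of LM1 and LM5 are taken from the figure, but the routing is deliberately simplified and should not be read as the actual tree structure.

    def lm1(x):
        """LM1 from Figure 3.7(c): PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN."""
        return 8.29 + 0.004 * x["MMAX"] + 2.77 * x["CHMIN"]

    def lm5(x):
        """LM5 from Figure 3.7(c):
        PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN."""
        return 285 - 1.46 * x["MYCT"] + 1.02 * x["CACH"] - 9.39 * x["CHMIN"]

    def model_tree_predict(x):
        """Toy two-leaf model tree: route on the root test, then evaluate
        the linear model stored at the chosen leaf. (The real tree in
        Figure 3.7(c) has six leaves and further splits.)"""
        return lm1(x) if x["CHMIN"] <= 7.5 else lm5(x)

    print(model_tree_predict({"MMAX": 6000, "CHMIN": 4, "MYCT": 30,
                              "CACH": 8}))  # about 43.4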
3.8 Instance-based representation
The simplest form of learning is plain memorization, or rote learning. Once a
set of training instances has been memorized, on encountering a new instance
the memory is searched for the training instance that most strongly resembles
the new one. The only problem is how to interpret “resembles”: we will explain
that shortly. First, however, note that this is a completely different way of rep-
resenting the “knowledge” extracted from a set of instances: just store the
instances themselves and operate by relating new instances whose class is
unknown to existing ones whose class is known.
Figure 3.7 Models for the CPU performance data: (a) linear regression, (b) regression tree, and (c) model tree.

(a) Linear regression:
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

(b) Regression tree splitting on CHMIN, CACH, MMAX, MMIN, MYCT, and CHMAX. Each leaf shows the average PRP value of the training instances that reach it, followed by the number of instances and the percentage error, e.g., 64.6 (24/19.2%).

(c) Model tree with splits on the same attributes and six linear models at its leaves:
LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.012 MMIN
LM4: PRP = 19.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
Instead of trying to create rules,
work directly from the examples themselves. This is known as instance-based
learning. In a sense all the other learning methods are “instance-based,” too,
because we always start with a set of instances as the initial training informa-
tion. But the instance-based knowledge representation uses the instances them-
selves to represent what is learned, rather than inferring a rule set or decision
tree and storing it instead.
In instance-based learning, all the real work is done when the time comes to
classify a new instance rather than when the training set is processed. In a sense,
then, the difference between this method and the others we have seen is the time
at which the “learning” takes place. Instance-based learning is lazy, deferring the
real work as long as possible, whereas other methods are eager, producing a gen-
eralization as soon as the data has been seen. In instance-based learning, each
new instance is compared with existing ones using a distance metric, and the
closest existing instance is used to assign the class to the new one. This is called
the nearest-neighbor classification method. Sometimes more than one nearest
neighbor is used, and the majority class of the closest k neighbors (or the dis-
tance-weighted average, if the class is numeric) is assigned to the new instance.
This is termed the k-nearest-neighbor method.
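A minimal sketch of the k-nearest-neighbor method, in plain Python with illustrative data, might look like this:

    import math
    from collections import Counter

    def euclidean(a, b):
        """Standard Euclidean distance over numeric attribute vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(query, training, k=3):
        """training is a list of (attribute_vector, class_label) pairs.
        Find the k closest instances and return their majority class."""
        neighbors = sorted(training, key=lambda inst: euclidean(query, inst[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Hypothetical usage with two numeric attributes:
    data = [((1.0, 2.0), "yes"), ((1.2, 1.9), "yes"), ((5.0, 5.5), "no"),
            ((4.8, 5.1), "no"), ((0.9, 2.2), "yes")]
    print(knn_classify((1.1, 2.1), data, k=3))  # -> "yes"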
Computing the distance between two examples is trivial when examples have
just one numeric attribute: it is just the difference between the two attribute
values. It is almost as straightforward when there are several numeric attributes:
generally, the standard Euclidean distance is used. However, this assumes that
the attributes are normalized and are of equal importance, and one of the main
problems in learning is to determine which are the important features.
When nominal attributes are present, it is necessary to come up with a “dis-
tance” between different values of that attribute. What are the distances between,
say, the values red, green, and blue? Usually a distance of zero is assigned if the
values are identical; otherwise, the distance is one. Thus the distance between
red and red is zero but that between red and green is one. However, it may be
desirable to use a more sophisticated representation of the attributes. For
example, with more colors one could use a numeric measure of hue in color
space, making yellow closer to orange than it is to green and ocher closer still.
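The usual zero/one convention can be folded into a single per-attribute distance function, as in this sketch (which assumes nominal values are represented as strings and numeric values are already normalized to the range [0,1]):

    def attribute_distance(x, y):
        """Per-attribute distance: 0/1 for nominal values (identical or
        not), absolute difference for numeric values (assumed normalized)."""
        if isinstance(x, str):      # nominal attribute
            return 0.0 if x == y else 1.0
        return abs(x - y)           # numeric attribute

    def mixed_distance(a, b):
        """Euclidean-style distance over mixed nominal/numeric instances."""
        return sum(attribute_distance(x, y) ** 2 for x, y in zip(a, b)) ** 0.5

    print(mixed_distance(("red", 0.3), ("green", 0.5)))
    # about 1.02: the nominal mismatch contributes 1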
Some attributes will be more important than others, and this is usually
reflected in the distance metric by some kind of attribute weighting. Deriving
suitable attribute weights from the training set is a key problem in instance-
based learning.
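In code, weighting simply multiplies each attribute difference before it enters the sum; the hard part, as just noted, is choosing the weights. In this sketch they are set by hand purely for illustration:

    def weighted_distance(a, b, weights):
        """Euclidean distance with a per-attribute weight w_i, i.e.
        sqrt(sum_i (w_i * (a_i - b_i))^2). In practice the weights
        would be derived from the training set, not fixed by hand."""
        return sum((w * (x - y)) ** 2
                   for w, x, y in zip(weights, a, b)) ** 0.5

    # Hypothetical: the first attribute matters twice as much as the second.
    print(weighted_distance((0.2, 0.9), (0.5, 0.4), weights=(2.0, 1.0)))
    # about 0.78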
It may not be necessary, or desirable, to store all the training instances. For
one thing, this may make the nearest-neighbor calculation unbearably slow. For
another, it may consume unrealistic amounts of storage. Generally, some regions
of attribute space are more stable than others with regard to class, and just a