through the dataset for each different size of item set. Sometimes the dataset is
too large to read into main memory and must be kept on disk; then it may be
worth reducing the number of passes by checking item sets of two consecutive
sizes in one go. For example, once sets with two items have been generated, all
sets of three items could be generated from them before going through the
instance set to count how often each candidate set actually occurs. More three-item
sets than necessary would be considered, but the number of passes through the
entire dataset would be reduced.
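That candidate-generation step can be sketched as follows; the item names are invented for illustration, and a real implementation would also prune candidates containing any non-frequent two-item subset:

```python
from itertools import combinations

def candidate_three_item_sets(two_item_sets):
    """Join pairs of frequent two-item sets to propose three-item
    candidates.  Some will turn out to have too little coverage, but
    they can all be counted in a single pass over the instances."""
    two = [frozenset(s) for s in two_item_sets]
    candidates = set()
    for a, b in combinations(two, 2):
        union = a | b
        # Two two-item sets yield a three-item set only if they overlap
        # in exactly one item.
        if len(union) == 3:
            candidates.add(union)
    return candidates

pairs = [{"bread", "milk"}, {"bread", "eggs"}, {"milk", "eggs"}]
cands = candidate_three_item_sets(pairs)
# cands contains the single set {bread, milk, eggs}
```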
In practice, the amount of computation needed to generate association rules
depends critically on the minimum coverage specified. The accuracy has less
influence because it does not affect the number of passes that we must make
through the dataset. In many situations we will want to obtain a certain num-
ber of rules—say 50—with the greatest possible coverage at a prespecified
minimum accuracy level. One way to do this is to begin with a rather high
minimum coverage and then reduce it successively, reexecuting the entire
rule-finding algorithm for each coverage value until the desired number of
rules has been generated.
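In outline, this repeated reexecution might look as follows; find_rules and its thresholds are hypothetical stand-ins for whatever rule finder is actually used:

```python
def rules_with_best_coverage(find_rules, wanted=50, min_accuracy=0.8,
                             start_coverage=0.5, step=0.05):
    """Repeatedly lower the minimum coverage until enough rules appear.

    find_rules(min_coverage, min_accuracy) is assumed to return the
    list of association rules meeting both thresholds; the whole
    rule-finding algorithm is re-run for each coverage value.
    """
    coverage = start_coverage
    while coverage > 0:
        rules = find_rules(coverage, min_accuracy)
        if len(rules) >= wanted:
            return rules, coverage
        coverage -= step
    return find_rules(0.0, min_accuracy), 0.0

# Toy stand-in: pretend that lowering the coverage reveals more rules.
fake = lambda cov, acc: ["rule"] * int((1.0 - cov) * 100)
rules, cov = rules_with_best_coverage(fake, wanted=50)
```

The cost of the rerun-from-scratch approach is that all the early passes are repeated for each coverage value, which is why the minimum coverage dominates the running time.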
The tabular input format that we use throughout this book, and in particu-
lar a standard ARFF file based on it, is very inefficient for many association-rule
problems. Association rules are often used when attributes are binary—either
present or absent—and most of the attribute values associated with a given
instance are absent. This is a case for the sparse data representation described
in Section 2.4; the same algorithm for finding association rules applies.
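As a sketch of that representation, a sparse file in ARFF's brace notation lists only the attributes that are present; the item names here are invented. Each entry gives a zero-based attribute index and its value, and omitted attributes default to the first value in their definition:

```
@relation market-basket
@attribute bread {f, t}
@attribute milk  {f, t}
@attribute eggs  {f, t}
@attribute beer  {f, t}
@data
{0 t, 1 t}
{2 t, 3 t}
{1 t}
```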
4.6 Linear models
The methods we have been looking at for decision trees and rules work most
naturally with nominal attributes. They can be extended to numeric attributes
either by incorporating numeric-value tests directly into the decision tree or rule
induction scheme, or by prediscretizing numeric attributes into nominal ones.
We will see how in Chapters 6 and 7, respectively. However, there are methods
that work most naturally with numeric attributes. We look at simple ones here,
ones that form components of more complex learning methods, which we will
examine later.
Numeric prediction: Linear regression
When the outcome, or class, is numeric, and all the attributes are numeric, linear
regression is a natural technique to consider. This is a staple method in statis-
tics. The idea is to express the class as a linear combination of the attributes,
with predetermined weights:
x = w_0 + w_1 a_1 + w_2 a_2 + . . . + w_k a_k
where x is the class; a_1, a_2, . . ., a_k are the attribute values; and w_0, w_1,
. . ., w_k are weights.
The weights are calculated from the training data. Here the notation gets a
little heavy, because we need a way of expressing the attribute values for each
training instance. The first instance will have a class, say x^{(1)}, and attribute
values a_1^{(1)}, a_2^{(1)}, . . ., a_k^{(1)}, where the superscript denotes that it is the
first example. Moreover, it is notationally convenient to assume an extra
attribute a_0 whose value is always 1.
The predicted value for the first instance’s class can be written as

w_0 a_0^{(1)} + w_1 a_1^{(1)} + w_2 a_2^{(1)} + . . . + w_k a_k^{(1)} = Σ_{j=0}^{k} w_j a_j^{(1)}.
This is the predicted, not the actual, value for the first instance’s class. Of inter-
est is the difference between the predicted and the actual values. The method of
linear regression is to choose the coefficients w_j (there are k + 1 of them) to
minimize the sum of the squares of these differences over all the training
instances. Suppose there are n training instances; denote the ith one with a
superscript (i). Then the sum of the squares of the differences is

Σ_{i=1}^{n} ( x^{(i)} - Σ_{j=0}^{k} w_j a_j^{(i)} )^2
where the expression inside the parentheses is the difference between the ith
instance’s actual class and its predicted class. This sum of squares is what we
have to minimize by choosing the coefficients appropriately.
This is all starting to look rather formidable. However, the minimization
technique is straightforward if you have the appropriate math background.
Suffice it to say that given enough examples—roughly speaking, more examples
than attributes—choosing weights to minimize the sum of the squared differ-
ences is really not difficult. It does involve a matrix inversion operation, but this
is readily available as prepackaged software.
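As an illustration of the calculation, here is a minimal sketch using NumPy's linear algebra routines; the data are invented, and for ill-conditioned problems a library routine such as numpy.linalg.lstsq is preferable to forming the normal equations explicitly:

```python
import numpy as np

def linear_regression_weights(A, x):
    """Least-squares weights for class values x given attribute matrix A.

    A column of 1s is prepended to play the role of the extra attribute
    a_0 = 1, so the returned vector is (w_0, w_1, ..., w_k).
    """
    A = np.column_stack([np.ones(len(A)), np.asarray(A, dtype=float)])
    # Normal equations: (A^T A) w = A^T x.  Solving them involves the
    # matrix inversion mentioned in the text; np.linalg.solve performs
    # the equivalent factorization.
    return np.linalg.solve(A.T @ A, A.T @ np.asarray(x, dtype=float))

# Tiny example: data generated from the exactly linear rule x = 1 + 2*a1.
A = [[0.0], [1.0], [2.0], [3.0]]
x = [1.0, 3.0, 5.0, 7.0]
w = linear_regression_weights(A, x)          # close to [1.0, 2.0]

# Predicting a new instance with a1 = 4 means taking the weighted sum.
pred = float(np.dot([1.0, 4.0], w))
```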
Once the math has been accomplished, the result is a set of numeric weights,
based on the training data, which we can use to predict the class of new
instances. We saw an example of this when looking at the CPU performance
data, and the actual numeric weights are given in Figure 3.7(a). This formula
can be used to predict the CPU performance of new test instances.
Linear regression is an excellent, simple method for numeric prediction, and
it has been widely used in statistical applications for decades. Of course, linear
models suffer from the disadvantage of, well, linearity. If the data exhibits a non-
linear dependency, the best-fitting straight line will be found, where “best” is
interpreted as the least mean-squared difference. This line may not fit very well.
CHAPTER 4 | ALGORITHMS: THE BASIC METHODS