Data Mining for the Masses
Figure 8-2. Value ranges for the training data set’s attributes.
Figure 8-3. Value ranges for the scoring data set’s attributes.
3)
Comparing Figures 8-2 and 8-3, we can see that the ranges are the same for all attributes
except Avg_Age. In the scoring data set, some observations have an Avg_Age slightly below
the training data set's lower bound of 15.1, and some have an Avg_Age slightly above the
training set's upper bound of 72.2. You might think that these values are so close to the
training data set's values that it would not matter if we used our training data set to predict
heating oil usage for the homes represented by these observations. While it is likely that such
a slight deviation from the range on this attribute would not yield wildly inaccurate results,
we cannot use linear regression prediction values as evidence to support such an assumption.
Thus, we will need to remove these observations from our data set. Add two Filter Examples
operators to the scoring stream, each with the condition class attribute_value_filter: set one
operator's parameter string to Avg_Age>=15.1 and the other's to Avg_Age<=72.2. When you run
your model now, you should have 42,042 observations remaining. Check the ranges again
to ensure that none of the scoring attributes now have ranges outside those of the training
attributes. Then return to design perspective.
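Conceptually, this step can be sketched in plain Python. The sketch below is not how RapidMiner works internally; it only illustrates the filtering logic, with a handful of made-up rows standing in for the chapter's data files:

```python
# Sketch of step 3: keep only scoring rows whose Avg_Age falls inside the
# training data's observed range [15.1, 72.2]. Chaining the two comparisons
# mirrors chaining the two Filter Examples operators.

training = [{"Avg_Age": 15.1}, {"Avg_Age": 44.0}, {"Avg_Age": 72.2}]
scoring = [{"Avg_Age": 14.9}, {"Avg_Age": 30.0}, {"Avg_Age": 72.8}]

lo = min(row["Avg_Age"] for row in training)   # 15.1
hi = max(row["Avg_Age"] for row in training)   # 72.2

filtered = [row for row in scoring if lo <= row["Avg_Age"] <= hi]
print(len(filtered))  # only the one in-range observation remains
```

Rows at 14.9 and 72.8 are dropped for the same reason the two filters drop them in RapidMiner: they fall outside the range the model was trained on.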
4)
As was the case with discriminant analysis, linear regression is a predictive model, and thus
needs an attribute designated as the label: the target, the thing we want to predict. Search
for the Set Role operator in the Operators tab and drag it into your training stream. Change
the parameters to designate Heating_Oil as the label for this model (Figure 8-4).
Figure 8-4. Adding an operator to designate Heating_Oil as our label.
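In code-based tools, the same role assignment amounts to splitting the label column off from the predictor columns. The attribute names below come from this chapter; the values are made up for illustration:

```python
# Sketch of Set Role: designate Heating_Oil as the label and treat the
# remaining attributes as predictors.
row = {"Insulation": 5, "Temperature": 62, "Num_Occupants": 4,
       "Avg_Age": 44.2, "Home_Size": 3, "Heating_Oil": 180}

label_name = "Heating_Oil"                 # the attribute Set Role designates
y = row[label_name]                        # the label: the thing we predict
x = {k: v for k, v in row.items() if k != label_name}  # predictor attributes

print(y)          # 180
print(sorted(x))  # the five predictor attribute names
```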
With this step complete, our data sets are now prepared for…
MODELING
5)
Using the search field in the Operators tab again, locate the Linear Regression operator and
drag and drop it into your training data set's stream (Figure 8-5).
Figure 8-5. Adding the Linear Regression model operator to our stream.
6)
Note that the Linear Regression operator uses a default tolerance of .05 (also known in
statistical language as the significance level or alpha level). This value of .05 is very
common in statistical analysis of this type, so we will accept this default. The final step to
complete our model is to use an Apply Model operator to connect our training stream to
our scoring stream. Be sure to connect both the lab and mod ports coming from the Apply
Model operator to res ports. This is illustrated in Figure 8-6.
Figure 8-6. Applying the model to the scoring data set.
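The train-then-apply pattern of steps 5 and 6 can be sketched with a one-predictor least-squares fit in plain Python. This is not RapidMiner's implementation, and the numbers are invented; it only shows the idea of learning m and b from training data and then applying them to scoring data:

```python
# Fit y = m*x + b by ordinary least squares on training pairs, then
# "apply the model" to new scoring values -- the same idea as connecting
# Linear Regression to Apply Model in RapidMiner.

train_x = [2.0, 4.0, 6.0, 8.0]            # e.g. a predictor attribute
train_y = [140.0, 148.0, 156.0, 164.0]    # e.g. Heating_Oil values

n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) \
    / sum((x - mean_x) ** 2 for x in train_x)
b = mean_y - m * mean_x

scoring = [3.0, 5.0]                      # new, unlabeled observations
predictions = [m * x + b for x in scoring]
print(m, b)        # slope and intercept learned from the training data
print(predictions)
```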
7)
Run the model. Having two splines coming from the Apply Model operator and
connecting to res ports will result in two tabs in results perspective. Let's examine the
LinearRegression tab first, as we begin our…
EVALUATION
Figure 8-7. Linear regression coefficients.
Chapter 8: Linear Regression
Linear regression modeling is all about determining how close a given observation is to an imaginary
line representing the average, or center, of all points in the data set. That imaginary line gives us the
first part of the term "linear regression". The formula for calculating a prediction using linear
regression is y=mx+b. You may recognize this from a former algebra class as the slope-intercept
equation of a line. In this formula, the variable y is the target, the label, the thing we
want to predict. So in this chapter's example, y is the amount of Heating_Oil we expect each
home to consume. But how will we predict y? We need to know what m, x, and b are. The
variable x is the value of a given predictor attribute, or what is sometimes referred to as an
independent variable. Insulation, for example, is a predictor of heating oil usage, so Insulation is
a predictor attribute. The variable m is that attribute's coefficient, shown in the second column of
Figure 8-7. The coefficient is the amount of weight the attribute is given in the formula.
Insulation, with a coefficient of 3.323, is weighted more heavily than any of the other predictor
attributes in this data set. Each observation will have its Insulation value multiplied by the Insulation
coefficient to properly weight that attribute when calculating y (heating oil usage). The variable b is
a constant that is added to all linear regression calculations. It is represented by the intercept,
shown in Figure 8-7 as 134.511. So suppose we had a house with an insulation density of 5; our
formula using these Insulation values would be y=(5*3.323)+134.511.
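That arithmetic can be checked directly, using the Insulation coefficient and intercept from Figure 8-7:

```python
# y = m*x + b for a single predictor, with values from Figure 8-7.
coefficient = 3.323   # m: the weight on the Insulation attribute
intercept = 134.511   # b: the constant added to every prediction
insulation = 5        # x: this home's Insulation value

y = coefficient * insulation + intercept
print(round(y, 3))  # 151.126 units of heating oil
```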
But wait! We had more than one predictor attribute. We started out using a combination of five
attributes to try to predict heating oil usage. The formula described in the previous paragraph only
uses one. Furthermore, our LinearRegression result set tab pictured in Figure 8-7 only has four
predictor variables. What happened to Num_Occupants?
The answer to the latter question is that Num_Occupants was not a statistically significant
predictor of heating oil usage in this data set, and therefore RapidMiner removed it as a predictor.
In other words, when RapidMiner evaluated the amount of influence each attribute in the data set
had on heating oil usage for each home represented in the training data set, the number of
occupants was so non-influential that its weight in the formula was set to zero. As an example of
why this might occur, two older people living in a house may use the same amount of heating oil
as a young family of five: the older couple might take longer showers and prefer to keep their house
much warmer in the winter than would the young family. The variability in the number of occupants
in the house doesn't help to explain each home's heating oil usage very well, and so it was removed
as a predictor in our model.
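The full multi-predictor formula the chapter builds up to is a weighted sum of all the predictor attributes plus the intercept. In the sketch below, only the Insulation coefficient and the intercept come from Figure 8-7; the other coefficients and the home's values are placeholders for illustration. Num_Occupants is given a weight of zero to mirror RapidMiner dropping it as statistically insignificant:

```python
# Multiple linear regression prediction: y = b + sum(m_i * x_i).
coefficients = {
    "Insulation": 3.323,    # from Figure 8-7
    "Temperature": -0.9,    # placeholder value for illustration
    "Avg_Age": 1.1,         # placeholder value for illustration
    "Home_Size": 2.0,       # placeholder value for illustration
    "Num_Occupants": 0.0,   # zero weight: removed as a predictor
}
intercept = 134.511         # from Figure 8-7

home = {"Insulation": 5, "Temperature": 60, "Avg_Age": 44.2,
        "Home_Size": 3, "Num_Occupants": 4}

y = intercept + sum(coefficients[a] * home[a] for a in coefficients)
print(round(y, 3))

# Changing Num_Occupants does not change the prediction at all,
# because its weight in the formula is zero.
home["Num_Occupants"] = 10
y_again = intercept + sum(coefficients[a] * home[a] for a in coefficients)
```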