Data Mining for the Masses
But what about the former question, the one about having multiple independent variables in this
model? How can we set up our linear formula when we have multiple predictors? This is done by
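using the formula: y=(m1*x1)+(m2*x2)+(m3*x3)…+b, where each predictor x is multiplied by its own coefficient m and b is the intercept. Let’s take an example. Suppose we wanted to predict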
heating oil usage, using our model, for a home with the following attributes:
Insulation: 6
Temperature: 67
Avg_Age: 35.4
Home_Size: 5
Our formula for this home would be: y=(6*3.323)+(67*-0.869)+(35.4*1.968)+(5*3.173)+134.511
Our prediction for this home’s annual number of heating oil units ordered (y) is 181.758, or
basically 182 units. Let’s check our model’s predictions as we discuss possibilities for…
DEPLOYMENT
While still in results perspective, switch to the ExampleSet tab, and select the Data View radio
button. We can see in this view (Figure 8-8) that RapidMiner has quickly and efficiently predicted
the number of units of heating oil each of Sarah’s company’s new customers will likely use in their
first year. This is seen in the prediction(Heating_Oil) attribute.
Figure 8-8. Heating oil predictions for 42,042 new clients.
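The multi-predictor formula can also be checked outside RapidMiner. The following is a minimal Python sketch, assuming the rounded coefficients and intercept quoted in this chapter; the names COEFFICIENTS, INTERCEPT, and predict_heating_oil are hypothetical and are not part of RapidMiner. It reproduces the calculation for the example home with Insulation 6, Temperature 67, Avg_Age 35.4, and Home_Size 5.

```python
# Coefficients and intercept as displayed (rounded) in this chapter's model.
COEFFICIENTS = {
    "Insulation": 3.323,
    "Temperature": -0.869,
    "Avg_Age": 1.968,
    "Home_Size": 3.173,
}
INTERCEPT = 134.511


def predict_heating_oil(home):
    """Return y = (m1*x1) + (m2*x2) + ... + b for one home (a dict of attribute values)."""
    return sum(COEFFICIENTS[name] * home[name] for name in COEFFICIENTS) + INTERCEPT


# The example home used earlier in the chapter.
example_home = {"Insulation": 6, "Temperature": 67, "Avg_Age": 35.4, "Home_Size": 5}
prediction = predict_heating_oil(example_home)
print(round(prediction, 3))  # prints 181.758
```

The same function can be applied to any row of the scoring data, as long as the row supplies the four predictive attributes.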
Chapter 8: Linear Regression
Let’s check the first of our 42,042 households by running the linear regression formula for row 1:
(5*3.323)+(69*-0.869)+(70.1*1.968)+(7*3.173)+134.511 = 251.333
Note that in this formula we skipped the Num_Occupants attribute because it is not predictive.
Using the rounded coefficients shown, the formula’s result closely matches RapidMiner’s prediction for this home. Sarah now has a
prediction for each of the new clients’ homes, with the exception of those that had Avg_Age
values that were out of range. How might Sarah use this data? She could start by summing the
prediction attribute. This will tell her the total new units of heating oil her company is going to
need to be able to provide in the coming year. This can be accomplished by exporting her data to
a spreadsheet and summing the column, or it can even be done within RapidMiner using an
Aggregate operator. We will demonstrate this briefly.
1) Switch back to design perspective.
2) Search for the Aggregate operator in the Operators tab and add it between the lab and res
ports, as shown in Figure 8-9. It is not depicted in Figure 8-9, but if you wish to generate a
tab in results perspective that shows all of your observations and their predictions, you can
connect the ori port on the Aggregate operator to a res port.
Figure 8-9. Adding an Aggregate operator to our linear regression model.
3) Click on the Edit List button. A window similar to Figure 8-10 will appear. Set the
prediction(Heating_Oil) attribute as the aggregation attribute, and the aggregation function
to ‘sum’. If you would like, you can add other aggregations. In the Figure 8-10 example, we
have added an average for prediction(Heating_Oil) as well.
Figure 8-10. Configuring aggregations in RapidMiner.
4) When you are satisfied with your aggregations, click OK to return to your main process
window, then run the model. In results perspective, select the ExampleSet(Aggregate) tab,
then select the Data View radio button. The sum and average for the prediction attribute
will be shown, as depicted in Figure 8-11.
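For readers who export the scored data instead of using the Aggregate operator, the same sum and average can be sketched in Python. This is only an illustration: the CSV rows below are hypothetical stand-ins for an exported example set, and the column name prediction(Heating_Oil) is assumed to survive the export unchanged.

```python
import csv
import io
from statistics import mean

# Hypothetical export: a few rows standing in for the scored example set.
exported_csv = io.StringIO(
    "Insulation,Temperature,Avg_Age,Home_Size,prediction(Heating_Oil)\n"
    "5,69,70.1,7,251.3\n"
    "6,67,35.4,5,181.8\n"
    "3,72,22.0,3,134.7\n"
)

# Read each row as a dict keyed by column name, then pull out the prediction column.
rows = list(csv.DictReader(exported_csv))
predictions = [float(row["prediction(Heating_Oil)"]) for row in rows]

# The two aggregations configured in the steps above: sum and average.
aggregates = {"sum": sum(predictions), "average": mean(predictions)}
print(aggregates)
```

With a real export, the io.StringIO object would be replaced by an opened file; the aggregation logic stays the same.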
Figure 8-11. Aggregate descriptive statistics for our predicted attribute.
From this image, we can see that Sarah’s company is likely to sell some 8,368,088 units of heating
oil to these new customers. The company can expect that on average, their new customers will
order about 200 units each. These figures are for all 42,042 clients together, but Sarah is probably
going to be more interested in regional trends. In order to deploy this model to help her more
specifically address her new customers’ needs, she should probably extract the predictions and match
them back to their source records, which might contain the new clients’ addresses, enabling her to
break the predictions down by city, county, or region of the country. Sarah could then work with
her colleagues in Operations and Order Fulfillment to ensure that regional heating oil distribution
centers around the country have appropriate amounts of stock on hand to meet anticipated need.
If Sarah wanted to get even more granular in her analysis of these data, she could break her
training and scoring data sets down into months using a month attribute, and then run the
predictions again to reveal fluctuations in usage throughout the course of the year.
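The idea of matching predictions back to source records and breaking them down by region can be sketched as follows. Everything in this example is an assumption made for illustration: the customer identifiers, the region names, and the prediction values are hypothetical, and the chapter does not say which key Sarah’s source records would actually share with the scored data.

```python
from collections import defaultdict

# Hypothetical predictions keyed by a customer identifier (assumed to exist in the export).
predictions = {101: 251.3, 102: 181.8, 103: 134.7, 104: 220.0}

# Hypothetical source records holding each new client's region.
source_records = {
    101: {"region": "Northeast"},
    102: {"region": "Midwest"},
    103: {"region": "Midwest"},
    104: {"region": "Northeast"},
}

# Match each prediction to its source record and group the units by region.
units_by_region = defaultdict(list)
for customer_id, units in predictions.items():
    region = source_records[customer_id]["region"]
    units_by_region[region].append(units)

# Sum and average per region, mirroring the overall aggregation.
summary = {
    region: {"sum": sum(values), "average": sum(values) / len(values)}
    for region, values in units_by_region.items()
}
for region in sorted(summary):
    print(region, round(summary[region]["sum"], 1), round(summary[region]["average"], 2))
```

A month attribute could be handled the same way by grouping on month instead of region.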
CHAPTER SUMMARY
Linear regression is a predictive model that uses training and scoring data sets to generate numeric
predictions in data. It is important to remember that linear regression uses numeric data types for
all of its attributes. It uses the algebraic formula for calculating the slope of a line to determine
where an observation would fall along an imaginary line through the scoring data. Each attribute
in the data set is evaluated statistically for its ability to predict the target attribute. Attributes that
are not strong predictors are removed from the model. Those attributes that are good predictors
are assigned coefficients which give them weight in the prediction formula. Any observations
whose attribute values fall in the range of corresponding training attribute values can be plugged
into the formula in order to predict the target.
Once linear regression predictions are calculated, the results can be summarized in order to
determine if there are differences in the predictions in subsets of the scoring data. As more data
are collected, they can be added into the training data set in order to create a more robust training
data set, or to expand the ranges of some attributes to include even more values. It is very
important to remember that the ranges for the scoring attributes must fall within the ranges for the
training attributes in order to ensure valid predictions.
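The range requirement can be checked before scoring. The sketch below is a minimal illustration, assuming hypothetical minimum and maximum values for each training attribute; the real ranges would come from the training data set, and the function name out_of_range_attributes is invented for this example.

```python
# Hypothetical (min, max) ranges per attribute; real values come from the training data.
training_ranges = {
    "Insulation": (2, 10),
    "Temperature": (38, 90),
    "Avg_Age": (15.0, 72.0),
    "Home_Size": (1, 8),
}


def out_of_range_attributes(observation, ranges):
    """Return the names of attributes whose values fall outside the training range."""
    return [
        name
        for name, (low, high) in ranges.items()
        if not low <= observation[name] <= high
    ]


# One observation inside every range, and one with an Avg_Age above the assumed maximum.
in_range_home = {"Insulation": 6, "Temperature": 67, "Avg_Age": 35.4, "Home_Size": 5}
too_old_home = {"Insulation": 6, "Temperature": 67, "Avg_Age": 85.0, "Home_Size": 5}

print(out_of_range_attributes(in_range_home, training_ranges))  # prints []
print(out_of_range_attributes(too_old_home, training_ranges))   # prints ['Avg_Age']
```

Observations that return a non-empty list would be set aside rather than scored, since the model has no training evidence for values outside those ranges.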
REVIEW QUESTIONS
1) What data type does linear regression expect for all attributes? What data type will the
predicted attribute be when it is calculated?
2) Why are the attribute ranges so important when doing linear regression data mining?