Data Mining for the Masses


But what about the former question, the one about having multiple independent variables in this
model? How can we set up our linear formula when we have multiple predictors? This is done by
using the formula y = m1x1 + m2x2 + m3x3 + … + b, in which each predictor x has its own
coefficient m. Let’s take an example. Suppose we wanted to predict heating oil usage, using our
model, for a home with the following attributes:


Insulation: 6
Temperature: 67
Avg_Age: 35.4
Home_Size: 5

Our formula for this home would be: y=(6*3.323)+(67*-0.869)+(35.4*1.968)+(5*3.173)+134.511
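As a quick arithmetic check, the formula above can be evaluated in a few lines of Python. This is only a minimal sketch: the names coefficients, intercept, home and predict are illustrative and are not part of RapidMiner, and the coefficient values are the rounded ones printed in the formula.

```python
# Rounded coefficients and intercept copied from the formula above.
coefficients = {
    "Insulation": 3.323,
    "Temperature": -0.869,
    "Avg_Age": 1.968,
    "Home_Size": 3.173,
}
intercept = 134.511


def predict(home):
    """Return y = m1*x1 + m2*x2 + ... + b for one home (hypothetical helper)."""
    return sum(coefficients[name] * home[name] for name in coefficients) + intercept


home = {"Insulation": 6, "Temperature": 67, "Avg_Age": 35.4, "Home_Size": 5}
print(round(predict(home), 3))  # 181.758
```

Keeping the coefficients in a dictionary keyed by attribute name makes it harder to pair a value with the wrong weight than it is when typing the formula out by hand.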

Our prediction for this home’s annual number of heating oil units ordered (y) is 181.758, or
basically 182 units. Let’s check our model’s predictions as we discuss possibilities for…

DEPLOYMENT

While still in results perspective, switch to the ExampleSet tab, and select the Data View radio
button. We can see in this view (Figure 8-8) that RapidMiner has quickly and efficiently predicted
the number of units of heating oil each of Sarah’s company’s new customers will likely use in their
first year. This is seen in the prediction(Heating_Oil) attribute.

Figure 8-8. Heating oil predictions for 42,042 new clients.

Chapter 8: Linear Regression

Let’s check the first of our 42,042 households by running the linear regression formula for row 1:

(5*3.323)+(69*-0.869)+(70.1*1.968)+(7*3.173)+134.511 = 251.321
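The same check can be scripted. The sketch below pairs each row 1 value with its coefficient in the order used in the formula above; the attribute labels in the comments are inferred from the earlier worked example, and the variable names are illustrative. With the rounded coefficients printed here the arithmetic comes to roughly 251.33, a shade above the 251.321 quoted, presumably because the printed coefficients are rounded (that explanation is an assumption).

```python
# (attribute value, rounded coefficient) pairs in the order used in the formula above.
terms = [
    (5, 3.323),     # Insulation
    (69, -0.869),   # Temperature
    (70.1, 1.968),  # Avg_Age
    (7, 3.173),     # Home_Size
]
intercept = 134.511

# Sum of value * coefficient over all predictors, plus the intercept.
row1_prediction = sum(value * coef for value, coef in terms) + intercept
print(round(row1_prediction, 2))
```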

Note that in this formula we skipped the Num_Occupants attribute because it is not predictive.
The formula’s result does indeed match RapidMiner’s prediction for this home. Sarah now has a
prediction for each of the new clients’ homes, with the exception of those that had Avg_Age
values that were out of range. How might Sarah use this data? She could start by summing the
prediction attribute. This will tell her the total new units of heating oil her company is going to
need to be able to provide in the coming year. This can be accomplished by exporting her data to
a spreadsheet and summing the column, or it can even be done within RapidMiner using an
Aggregate operator. We will demonstrate this briefly.

1) Switch back to design perspective.

2) Search for the Aggregate operator in the Operators tab and add it between the lab and res
ports, as shown in Figure 8-9. Although it is not depicted in Figure 8-9, if you wish to generate a
tab in results perspective that shows all of your observations and their predictions, you can
connect the ori port on the Aggregate operator to a res port.

Figure 8-9. Adding an Aggregate operator to our linear regression model.

3) Click on the Edit List button. A window similar to Figure 8-10 will appear. Set the
prediction(Heating_Oil) attribute as the aggregation attribute, and the aggregation function
to ‘sum’. If you would like, you can add other aggregations. In the Figure 8-10 example, we
have added an average for prediction(Heating_Oil) as well.

Figure 8-10. Configuring aggregations in RapidMiner.

4) When you are satisfied with your aggregations, click OK to return to your main process
window, then run the model. In results perspective, select the ExampleSet(Aggregate) tab,
then select the Data View radio button. The sum and average for the prediction attribute
will be shown, as depicted in Figure 8-11.

Figure 8-11. Aggregate descriptive statistics for our predicted attribute.
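For readers who export the predictions instead of using the Aggregate operator, the same two aggregations can be sketched in Python. The values in the list below are made up for illustration and merely stand in for the prediction(Heating_Oil) column; the variable names are likewise assumptions.

```python
from statistics import mean

# Hypothetical predictions standing in for the prediction(Heating_Oil) column.
predictions = [251.3, 181.8, 199.0, 168.4]

total_units = sum(predictions)     # counterpart of the 'sum' aggregation
average_units = mean(predictions)  # counterpart of the 'average' aggregation

print(f"sum={total_units:.1f} average={average_units:.1f}")
```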

From this image, we can see that Sarah’s company is likely to sell some 8,368,088 units of heating
oil to these new customers. The company can expect that on average, their new customers will
order about 200 units each. These figures are for all 42,042 clients together, but Sarah is probably
going to be more interested in regional trends. In order to deploy this model to help her more
specifically address her new customers’ needs, she should probably extract the predictions and
match them back to their source records, which might contain the new clients’ addresses. This
would enable her to break the predictions down by city, county, or region of the country. Sarah
could then work with her colleagues in Operations and Order Fulfillment to ensure that regional
heating oil distribution centers around the country have appropriate amounts of stock on hand to
meet anticipated need. If Sarah wanted to get even more granular in her analysis of these data, she
could break her training and scoring data sets down into months using a month attribute, and then
run the predictions again to reveal fluctuations in usage throughout the course of the year.
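A regional breakdown of this kind amounts to grouping the predictions by a location field and summing within each group. The sketch below assumes the predictions have already been matched to a region label; the records, region names and variable names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical scored records: (region, predicted heating oil units).
scored = [
    ("Northeast", 251.3),
    ("Northeast", 199.0),
    ("Midwest", 181.8),
    ("Midwest", 168.4),
    ("South", 120.5),
]

# Accumulate the predicted units for each region.
units_by_region = defaultdict(float)
for region, units in scored:
    units_by_region[region] += units

for region, units in sorted(units_by_region.items()):
    print(f"{region}: {units:.1f}")
```

The same pattern would apply to a month attribute: replace the region label with the month and the totals show the predicted usage per month.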

CHAPTER SUMMARY

Linear regression is a predictive model that uses training and scoring data sets to generate numeric
predictions in data. It is important to remember that linear regression uses numeric data types for
all of its attributes. It uses the algebraic formula for calculating the slope of a line to determine
where an observation in the scoring data would fall along an imaginary line fitted to the training
data. Each attribute in the data set is evaluated statistically for its ability to predict the target
attribute. Attributes that are not strong predictors are removed from the model. Those attributes
that are good predictors are assigned coefficients which give them weight in the prediction
formula. Any observations whose attribute values fall in the range of corresponding training
attribute values can be plugged into the formula in order to predict the target.

Once linear regression predictions are calculated, the results can be summarized in order to
determine if there are differences in the predictions in subsets of the scoring data. As more data
are collected, they can be added into the training data set in order to create a more robust training
data set, or to expand the ranges of some attributes to include even more values. It is very
important to remember that the ranges for the scoring attributes must fall within the ranges for the
training attributes in order to ensure valid predictions.
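The range requirement can be expressed as a simple filter: find the minimum and maximum of each attribute in the training data, then keep only the scoring rows whose values fall inside those bounds. The following is a minimal sketch with made-up rows; the helper names training_ranges and in_range are illustrative and do not correspond to any RapidMiner operator.

```python
def training_ranges(training_rows):
    """Return {attribute: (min, max)} observed in the training rows."""
    names = training_rows[0].keys()
    return {
        n: (min(r[n] for r in training_rows), max(r[n] for r in training_rows))
        for n in names
    }


def in_range(row, ranges):
    """True if every attribute of a scoring row lies within the training range."""
    return all(lo <= row[n] <= hi for n, (lo, hi) in ranges.items())


# Hypothetical training and scoring rows, for illustration only.
training = [
    {"Insulation": 2, "Temperature": 38, "Avg_Age": 15.1, "Home_Size": 1},
    {"Insulation": 10, "Temperature": 90, "Avg_Age": 72.2, "Home_Size": 8},
]
scoring = [
    {"Insulation": 6, "Temperature": 67, "Avg_Age": 35.4, "Home_Size": 5},
    {"Insulation": 5, "Temperature": 69, "Avg_Age": 81.0, "Home_Size": 7},
]

ranges = training_ranges(training)
valid = [row for row in scoring if in_range(row, ranges)]
print(len(valid), "of", len(scoring), "scoring rows are within the training ranges")
```

In this made-up data the second scoring row is excluded because its Avg_Age exceeds the largest Avg_Age seen in training.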

REVIEW QUESTIONS

1) What data type does linear regression expect for all attributes? What data type will the
predicted attribute be when it is calculated?

2) Why are the attribute ranges so important when doing linear regression data mining?
