Data Mining for the Masses
Figure 8-2. Value ranges for the training data set’s attributes.
Figure 8-3. Value ranges for the scoring data set’s attributes.
3)
Comparing Figures 8-2 and 8-3, we can see that the ranges are the same for all attributes
except Avg_Age. In the scoring data set, some observations have an Avg_Age slightly below
the training data set's lower bound of 15.1, and some have an Avg_Age slightly above the
training set's upper bound of 72.2. You might think that these values are so close to the
training data set's values that it would not matter if we used our training data set to predict
heating oil usage for the homes represented by these observations. While it is likely that such
a slight deviation from the range on this attribute would not yield wildly inaccurate results,
we cannot use linear regression prediction values as evidence to support such an assumption.
Thus, we will need to remove these observations from our data set. Add two Filter Examples
operators to the scoring stream, each with the condition class attribute_value_filter: set one
operator's parameter string to Avg_Age>=15.1 and the other's to Avg_Age<=72.2. When you run
your model now, you should have 42,042 observations remaining. Check the ranges again
to ensure that none of the scoring attributes now have ranges outside those of the training
attributes. Then return to design perspective.
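Conceptually, this step can be sketched in plain Python. The sketch below is not how RapidMiner works internally; it only illustrates the filtering logic, with a handful of made-up rows standing in for the chapter's data files:

```python
# Sketch of step 3: keep only scoring rows whose Avg_Age falls inside the
# training data's observed range [15.1, 72.2]. Chaining the two comparisons
# mirrors chaining the two Filter Examples operators.

training = [{"Avg_Age": 15.1}, {"Avg_Age": 44.0}, {"Avg_Age": 72.2}]
scoring = [{"Avg_Age": 14.9}, {"Avg_Age": 30.0}, {"Avg_Age": 72.8}]

lo = min(row["Avg_Age"] for row in training)   # 15.1
hi = max(row["Avg_Age"] for row in training)   # 72.2

filtered = [row for row in scoring if lo <= row["Avg_Age"] <= hi]
print(len(filtered))  # only the one in-range observation remains
```

Rows at 14.9 and 72.8 are dropped for the same reason the two filters drop them in RapidMiner: they fall outside the range the model was trained on.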
4)
As was the case with discriminant analysis, linear regression is a predictive model, and thus
needs an attribute designated as the label: the target, the thing we want to predict. Search
for the Set Role operator in the Operators tab and drag it into your training stream. Change
the parameters to designate Heating_Oil as the label for this model (Figure 8-4).
Figure 8-4. Adding an operator to designate Heating_Oil as our label.
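In code-based tools, the same role assignment amounts to splitting the label column off from the predictor columns. The attribute names below come from this chapter; the values are made up for illustration:

```python
# Sketch of Set Role: designate Heating_Oil as the label and treat the
# remaining attributes as predictors.
row = {"Insulation": 5, "Temperature": 62, "Num_Occupants": 4,
       "Avg_Age": 44.2, "Home_Size": 3, "Heating_Oil": 180}

label_name = "Heating_Oil"                 # the attribute Set Role designates
y = row[label_name]                        # the label: the thing we predict
x = {k: v for k, v in row.items() if k != label_name}  # predictor attributes

print(y)          # 180
print(sorted(x))  # the five predictor attribute names
```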
With this step complete, our data sets are now prepared for…
MODELING
5)
Using the search field in the Operators tab again, locate the Linear Regression operator and
drag and drop it into your training data set's stream (Figure 8-5).
Figure 8-5. Adding the Linear Regression model operator to our stream.
6)
Note that the Linear Regression operator uses a default tolerance of .05 (also known in
statistical language as the significance level or alpha level). This value of .05 is very
common in statistical analysis of this type, so we will accept this default. The final step to
complete our model is to use an Apply Model operator to connect our training stream to
our scoring stream. Be sure to connect both the lab and mod ports coming from the Apply
Model operator to res ports. This is illustrated in Figure 8-6.
Figure 8-6. Applying the model to the scoring data set.
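The train-then-apply pattern of steps 5 and 6 can be sketched with a one-predictor least-squares fit in plain Python. This is not RapidMiner's implementation, and the numbers are invented; it only shows the idea of learning m and b from training data and then applying them to scoring data:

```python
# Fit y = m*x + b by ordinary least squares on training pairs, then
# "apply the model" to new scoring values -- the same idea as connecting
# Linear Regression to Apply Model in RapidMiner.

train_x = [2.0, 4.0, 6.0, 8.0]            # e.g. a predictor attribute
train_y = [140.0, 148.0, 156.0, 164.0]    # e.g. Heating_Oil values

n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) \
    / sum((x - mean_x) ** 2 for x in train_x)
b = mean_y - m * mean_x

scoring = [3.0, 5.0]                      # new, unlabeled observations
predictions = [m * x + b for x in scoring]
print(m, b)        # slope and intercept learned from the training data
print(predictions)
```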
7)
Run the model. Having two splines coming from the Apply Model operator and
connecting to res ports will result in two tabs in results perspective. Let's examine the
LinearRegression tab first, as we begin our…
EVALUATION
Figure 8-7. Linear regression coefficients.
Chapter 8: Linear Regression
Linear regression modeling is all about determining how close a given observation is to an imaginary
line representing the average, or center, of all points in the data set. That imaginary line gives us the
first part of the term "linear regression". The formula for calculating a prediction using linear
regression is y=mx+b. You may recognize this from a former algebra class as the slope-intercept
equation of a line. In this formula, the variable y is the target, the label, the thing we
want to predict. So in this chapter's example, y is the amount of Heating_Oil we expect each
home to consume. But how will we predict y? We need to know what m, x, and b are. The
variable x is the value of a given predictor attribute, or what is sometimes referred to as an
independent variable. Insulation, for example, is a predictor of heating oil usage, so Insulation is
a predictor attribute. The variable m is that attribute's coefficient, shown in the second column of
Figure 8-7. The coefficient is the amount of weight the attribute is given in the formula.
Insulation, with a coefficient of 3.323, is weighted more heavily than any of the other predictor
attributes in this data set. Each observation will have its Insulation value multiplied by the Insulation
coefficient to properly weight that attribute when calculating y (heating oil usage). The variable b is
a constant that is added to all linear regression calculations. It is represented by the intercept,
shown in Figure 8-7 as 134.511. So suppose we had a house with an insulation density of 5; our
formula using these Insulation values would be y=(5*3.323)+134.511.
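That arithmetic can be checked directly, using the Insulation coefficient and intercept from Figure 8-7:

```python
# y = m*x + b for a single predictor, with values from Figure 8-7.
coefficient = 3.323   # m: the weight on the Insulation attribute
intercept = 134.511   # b: the constant added to every prediction
insulation = 5        # x: this home's Insulation value

y = coefficient * insulation + intercept
print(round(y, 3))  # 151.126 units of heating oil
```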
But wait! We had more than one predictor attribute. We started out using a combination of five
attributes to try to predict heating oil usage. The formula described in the previous paragraph only
uses one. Furthermore, our LinearRegression result set tab pictured in Figure 8-7 only has four
predictor variables. What happened to Num_Occupants?
The answer to the latter question is that Num_Occupants was not a statistically significant
predictor of heating oil usage in this data set, and therefore RapidMiner removed it as a predictor.
In other words, when RapidMiner evaluated the amount of influence each attribute in the data set
had on heating oil usage for each home represented in the training data set, the number of
occupants was so non-influential that its weight in the formula was set to zero. As an example of
why this might occur, two older people living in a house may use the same amount of heating oil
as a young family of five: the older couple might take longer showers and prefer to keep their house
much warmer in the winter than would the young family. The variability in the number of occupants
in the house doesn't help to explain each home's heating oil usage very well, and so it was removed
as a predictor in our model.
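The full multi-predictor formula the chapter builds up to is a weighted sum of all the predictor attributes plus the intercept. In the sketch below, only the Insulation coefficient and the intercept come from Figure 8-7; the other coefficients and the home's values are placeholders for illustration. Num_Occupants is given a weight of zero to mirror RapidMiner dropping it as statistically insignificant:

```python
# Multiple linear regression prediction: y = b + sum(m_i * x_i).
coefficients = {
    "Insulation": 3.323,    # from Figure 8-7
    "Temperature": -0.9,    # placeholder value for illustration
    "Avg_Age": 1.1,         # placeholder value for illustration
    "Home_Size": 2.0,       # placeholder value for illustration
    "Num_Occupants": 0.0,   # zero weight: removed as a predictor
}
intercept = 134.511         # from Figure 8-7

home = {"Insulation": 5, "Temperature": 60, "Avg_Age": 44.2,
        "Home_Size": 3, "Num_Occupants": 4}

y = intercept + sum(coefficients[a] * home[a] for a in coefficients)
print(round(y, 3))

# Changing Num_Occupants does not change the prediction at all,
# because its weight in the formula is zero.
home["Num_Occupants"] = 10
y_again = intercept + sum(coefficients[a] * home[a] for a in coefficients)
```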