Data Mining for the Masses


Chapter 7: Discriminant Analysis
Challenge Step!

12) Change your LDA operator to a different type of discriminant analysis (e.g. Quadratic)
operator.  Re-run your model.  Consider doing some research to learn about the difference
between linear and quadratic discriminant analysis.  Compare your new results to the LDA
results and report any interesting findings or differences.

Chapter 8: Linear Regression

CHAPTER EIGHT:
LINEAR REGRESSION

CONTEXT AND PERSPECTIVE

Sarah, the regional sales manager from the Chapter 4 example, is back for more help.  Business is
booming: her sales team is signing up thousands of new clients, and she wants to be sure the
company will be able to meet this new level of demand.  She was so pleased with our assistance in
finding correlations in her data that she now hopes we can help her do some prediction as well.
She knows that there is some correlation between the attributes in her data set (things like
temperature, insulation, and occupant ages), and she’s now wondering if she can use the data set
from Chapter 4 to predict heating oil usage for new customers.  These new customers haven’t
begun consuming heating oil yet, and there are a lot of them (42,650 to be exact), so she wants
to know how much oil she should expect to keep in stock in order to meet their demand.  Can she
use data mining to examine household attributes and known past consumption quantities to
anticipate and meet her new customers’ needs?

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:


• Explain what linear regression is, how it is used and the benefits of using it.
• Recognize the necessary format for data in order to perform predictive linear regression.
• Explain the basic algebraic formula for calculating linear regression.
• Develop a linear regression data mining model in RapidMiner using a training data set.
• Interpret the model’s coefficients and apply them to a scoring data set in order to deploy
  the model.

ORGANIZATIONAL UNDERSTANDING

Sarah’s new data mining objective is pretty clear: she wants to anticipate demand for a consumable
product.  We will use a linear regression model to help her with her desired predictions.  Her
data consist of 1,218 observations from the Chapter 4 data set, which give an attribute profile
for each home along with that home’s annual heating oil consumption.  She wants to use this data
set as training data to predict the usage that 42,650 new clients will bring to her company.  She
knows that these new clients’ homes are similar in nature to those of her existing client base, so
the existing customers’ usage behavior should serve as a solid gauge for predicting future usage
by new customers.
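As a rough illustration of what "using a linear regression model to predict" means, the sketch below fits a straight line to a single attribute with ordinary least squares. It is a simplified, one-attribute stand-in, not the RapidMiner model this chapter develops. The function names and the four temperature/usage pairs are invented for illustration and are not taken from Sarah's data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Apply the fitted line to a new observation."""
    return intercept + slope * x


# Invented toy values: average outdoor temperature vs. annual heating oil units.
temperatures = [30, 40, 50, 60]
heating_oil = [260, 220, 180, 140]

intercept, slope = fit_line(temperatures, heating_oil)
print("intercept:", intercept, "slope:", slope)
print("predicted usage at 45 degrees:", predict(intercept, slope, 45))
```

With several attributes the same idea applies, except that the model estimates one coefficient per attribute rather than a single slope.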

DATA UNDERSTANDING

As a review, our data set from Chapter 4 contains the following attributes:


• Insulation:  This is a density rating, ranging from one to ten, indicating the thickness of
  each home’s insulation.  A home with a density rating of one is poorly insulated, while a
  home with a density rating of ten has excellent insulation.
• Temperature:  This is the average outdoor ambient temperature at each home for the
  most recent year, measured in degrees Fahrenheit.
• Heating_Oil:  This is the total number of units of heating oil purchased by the owner of
  each home in the most recent year.
• Num_Occupants:  This is the total number of occupants living in each home.
• Avg_Age:  This is the average age of those occupants.
• Home_Size:  This is a rating, on a scale of one to eight, of the home’s overall size.  The
  higher the number, the larger the home.

We will use the Chapter 4 data set as our training data set in this chapter.  Sarah has assembled a
separate Comma Separated Values file containing all of these same attributes, except of course for
Heating_Oil, for her 42,650 new clients.  She has provided this data set to us to use as the scoring
data set in our model.
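The relationship between the two data sets can be sketched as a quick layout check: a scoring row should carry every training attribute except the target, Heating_Oil. The snippet below is a minimal sketch in plain Python, not part of the RapidMiner workflow. The attribute names come from the list above, while the helper names and the example row values are hypothetical.

```python
TARGET = "Heating_Oil"
TRAINING_ATTRIBUTES = [
    "Insulation", "Temperature", "Heating_Oil",
    "Num_Occupants", "Avg_Age", "Home_Size",
]


def expected_scoring_attributes(training_attributes, target):
    """Return the training attributes with the target attribute removed."""
    return [name for name in training_attributes if name != target]


def check_scoring_row(row, training_attributes, target):
    """Return a list of layout problems found in one scoring row (empty if none)."""
    problems = []
    for name in expected_scoring_attributes(training_attributes, target):
        if name not in row:
            problems.append(f"missing attribute: {name}")
    if target in row:
        problems.append(f"unexpected target attribute: {target}")
    return problems


# Hypothetical scoring row; the values are made up for illustration.
scoring_row = {
    "Insulation": 5, "Temperature": 60,
    "Num_Occupants": 4, "Avg_Age": 35.0, "Home_Size": 5,
}

problems = check_scoring_row(scoring_row, TRAINING_ATTRIBUTES, TARGET)
print(problems or "scoring row matches the expected layout")
```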

DATA PREPARATION

You should already have downloaded and imported the Chapter 4 data set, but if not, you can get
it from the book’s companion web site (https://sites.google.com/site/dataminingforthemasses/).
Download and import the Chapter 8 data set from the companion web site as well.  Once you
have both the Chapter 4 and Chapter 8 data sets imported into your RapidMiner data repository,
complete the following steps:

1) Drag and drop both data sets into a new process window in RapidMiner.  Rename the
Chapter 4 data set to ‘Training (CH4)’, and the Chapter 8 data set to ‘Scoring (CH8)’.
Connect both out ports to res ports, as shown in Figure 8-1, and then run your model.

Figure 8-1.  Using both Chapter 4 and 8 data sets to set up a linear regression model.

2) Figures 8-2 and 8-3 show side-by-side comparisons of the training and scoring data sets.
When using linear regression as a predictive model, it is extremely important to remember
that the ranges for all attributes in the scoring data must be within the ranges for the
corresponding attributes in the training data.  This is because a training data set cannot be
relied upon to predict a target attribute for observations whose values fall outside the
training data set’s values.
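The range rule can be sketched outside RapidMiner as well. The following is a minimal illustration in plain Python, assuming each data set is held as a list of dictionaries. The helper names and all of the values are invented for the example and are not drawn from the Chapter 4 or Chapter 8 files.

```python
def attribute_ranges(rows, attributes):
    """Return {attribute: (min, max)} as observed in the training rows."""
    return {
        a: (min(r[a] for r in rows), max(r[a] for r in rows))
        for a in attributes
    }


def split_by_range(scoring_rows, ranges):
    """Split scoring rows into (in_range, out_of_range) lists."""
    in_range, out_of_range = [], []
    for row in scoring_rows:
        ok = all(lo <= row[a] <= hi for a, (lo, hi) in ranges.items())
        (in_range if ok else out_of_range).append(row)
    return in_range, out_of_range


# Invented toy rows using two of the chapter's attribute names.
training = [
    {"Insulation": 2, "Temperature": 40},
    {"Insulation": 9, "Temperature": 75},
    {"Insulation": 5, "Temperature": 60},
]
scoring = [
    {"Insulation": 4, "Temperature": 50},   # inside both training ranges
    {"Insulation": 10, "Temperature": 55},  # Insulation above training max
    {"Insulation": 3, "Temperature": 80},   # Temperature above training max
]

ranges = attribute_ranges(training, ["Insulation", "Temperature"])
usable, excluded = split_by_range(scoring, ranges)
print("training ranges:", ranges)
print("usable rows:", len(usable), "excluded rows:", len(excluded))
```

Only the rows in `usable` would be safe to score under this rule; the rows in `excluded` ask the model to extrapolate beyond what the training data covered.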
