Chapter 7:
Discriminant Analysis
Challenge Step!
12) Change your LDA operator to a different type of discriminant analysis (e.g. Quadratic)
operator. Re-run your model. Consider doing some research to learn about the difference
between linear and quadratic discriminant analysis. Compare your new results to the LDA
results and report any interesting findings or differences.
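As a starting point for that research, here is a minimal Python sketch of the key difference, using a single attribute and two made-up classes. It assumes the textbook Gaussian form of both methods: linear discriminant analysis shares one pooled variance across all classes, while quadratic discriminant analysis estimates a separate variance for each class. The class labels, the values, and the function names are all hypothetical, and RapidMiner's operators may differ in their details.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit(samples, pooled):
    """Return {label: (prior, mean, variance)}.

    pooled=True  -> one shared variance for every class (linear)
    pooled=False -> each class keeps its own variance (quadratic)
    """
    n_total = sum(len(v) for v in samples.values())
    pooled_var = sum(sum_sq_dev(v) for v in samples.values()) / (n_total - len(samples))
    model = {}
    for label, values in samples.items():
        var = pooled_var if pooled else sum_sq_dev(values) / (len(values) - 1)
        model[label] = (len(values) / n_total, mean(values), var)
    return model

def classify(model, x):
    """Pick the class with the highest Gaussian discriminant score."""
    def score(params):
        prior, mu, var = params
        return math.log(prior) - 0.5 * math.log(var) - (x - mu) ** 2 / (2 * var)
    return max(model, key=lambda label: score(model[label]))

# Made-up data: class A is tightly clustered, class B is widely spread.
samples = {"A": [1.0, 2.0, 2.0, 3.0], "B": [-4.0, 0.0, 8.0, 12.0]}
lda = fit(samples, pooled=True)
qda = fit(samples, pooled=False)

for x in (-3.0, 2.0, 10.0):
    print(x, "linear:", classify(lda, x), "quadratic:", classify(qda, x))
```

With a shared variance, the linear model can only draw a single cut point between the two class means. With per-class variances, the quadratic model can treat a narrow band around the tight class differently from everything on either side of it, so the two models can disagree about observations far from both means.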
CHAPTER EIGHT:
LINEAR REGRESSION
CONTEXT AND PERSPECTIVE
Sarah, the regional sales manager from the Chapter 4 example, is back for more help. Business is
booming, her sales team is signing up thousands of new clients, and she wants to be sure the
company will be able to meet this new level of demand. She was so pleased with our assistance in
finding correlations in her data that she is now hoping we can help her do some prediction as well.
She knows that there is some correlation between the attributes in her data set (things like
temperature, insulation, and occupant ages), and she’s now wondering if she can use the data set
from Chapter 4 to predict heating oil usage for new customers. You see, these new customers
haven’t begun consuming heating oil yet, there are a lot of them (42,650 to be exact), and she
wants to know how much oil she should expect to keep in stock in order to meet these new
customers’ demand. Can she use data mining to examine household attributes and known past
consumption quantities to anticipate and meet her new customers’ needs?
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what linear regression is, how it is used, and the benefits of using it.
Recognize the necessary format for data in order to perform predictive linear regression.
Explain the basic algebraic formula for calculating linear regression.
Develop a linear regression data mining model in RapidMiner using a training data set.
Interpret the model’s coefficients and apply them to a scoring data set in order to deploy
the model.
Data Mining for the Masses
ORGANIZATIONAL UNDERSTANDING
Sarah’s new data mining objective is pretty clear: she wants to anticipate demand for a consumable
product. We will use a linear regression model to help her with her desired predictions. She has
1,218 observations from the Chapter 4 data set, each giving an attribute profile for a home
along with that home’s annual heating oil consumption. She wants to use this data set as training
data to predict the usage that 42,650 new clients will bring to her company. She knows that these
new clients’ homes are similar in nature to her existing client base, so the existing customers’ usage
behavior should serve as a solid gauge for predicting future usage by new customers.
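To make the goal concrete, the general shape of a linear regression prediction can be written as below. This is the standard textbook form rather than anything specific to Sarah's data: the symbols are assumptions for illustration, with each x standing for one attribute in a home's profile, each b for a coefficient learned from the training data, and y-hat for the predicted annual heating oil usage of one home. Summing the predictions over the 42,650 new homes gives an estimate of the extra stock Sarah should plan for.

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
\qquad
\text{Total new demand} \approx \sum_{i=1}^{42650} \hat{y}_i
```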
DATA UNDERSTANDING
As a review, our data set from Chapter 4 contains the following attributes:
Insulation: This is a density rating, ranging from one to ten, indicating the thickness of
each home’s insulation. A home with a density rating of one is poorly insulated, while a
home with a density of ten has excellent insulation.
Temperature: This is the average outdoor ambient temperature at each home for the most
recent year, measured in degrees Fahrenheit.
Heating_Oil: This is the total number of units of heating oil purchased by the owner of
each home in the most recent year.
Num_Occupants: This is the total number of occupants living in each home.
Avg_Age: This is the average age of those occupants.
Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
higher the number, the larger the home.
We will use the Chapter 4 data set as our training data set in this chapter. Sarah has assembled a
separate Comma Separated Values file containing all of these same attributes, except of course for
Heating_Oil, for her 42,650 new clients. She has provided this data set to us to use as the scoring
data set in our model.
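The relationship between the two data sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are made-up values, not Sarah's data; the function names are hypothetical; and the fit uses plain ordinary least squares, which may not match the settings of RapidMiner's Linear Regression operator. The sketch checks that the scoring data carries every training attribute except Heating_Oil, fits coefficients on the training rows, and then predicts usage for the scoring rows.

```python
import csv
import io

LABEL = "Heating_Oil"
PREDICTORS = ["Insulation", "Temperature", "Num_Occupants", "Avg_Age", "Home_Size"]

# Hypothetical training rows (invented for illustration only).
TRAINING_CSV = """\
Insulation,Temperature,Heating_Oil,Num_Occupants,Avg_Age,Home_Size
4,70,78,3,30,5
8,40,159,5,45,7
2,60,43,1,25,2
10,35,197,6,60,8
6,50,112,2,50,4
3,75,42,4,20,3
7,45,155,2,65,6
5,55,69,7,35,1
"""

# Hypothetical scoring rows: same attributes, but no Heating_Oil column.
SCORING_CSV = """\
Insulation,Temperature,Num_Occupants,Avg_Age,Home_Size
5,50,3,40,4
9,38,4,55,6
3,65,2,28,3
"""

def read_rows(text):
    """Parse CSV text into a list of dicts with float values."""
    return [{k: float(v) for k, v in row.items()}
            for row in csv.DictReader(io.StringIO(text))]

def check_schema(training, scoring):
    """Scoring data must have every training attribute except the label."""
    if set(training[0]) - {LABEL} != set(scoring[0]):
        raise ValueError("scoring attributes do not match training attributes")

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(training):
    """Ordinary least squares via the normal equations; returns [intercept, b1, ...]."""
    xs = [[1.0] + [row[p] for p in PREDICTORS] for row in training]
    ys = [row[LABEL] for row in training]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Apply the intercept and coefficients to one home's attributes."""
    return coefs[0] + sum(c * row[p] for c, p in zip(coefs[1:], PREDICTORS))

training = read_rows(TRAINING_CSV)
scoring = read_rows(SCORING_CSV)
check_schema(training, scoring)
coefs = fit(training)
predictions = [predict(coefs, row) for row in scoring]
total_demand = sum(predictions)
print("Predicted usage per home:", [round(p, 1) for p in predictions])
print("Total predicted demand:", round(total_demand, 1))
```

The schema check matters because a model can only score rows that supply a value for every attribute it was trained on. The label is the one column the scoring data is expected to lack, since it is the quantity being predicted.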