Chapter 4: Correlation
CHAPTER FOUR: CORRELATION
CONTEXT AND PERSPECTIVE
Sarah is a regional sales manager for a nationwide supplier of fossil fuels for home heating. Recent
volatility in market prices for heating oil specifically, coupled with wide variability in the size of
each order for home heating oil, has Sarah concerned. She feels a need to understand the types of
behaviors and other factors that may influence the demand for heating oil in the domestic market.
What factors are related to heating oil usage, and how might she use a knowledge of such factors
to better manage her inventory, and anticipate demand? Sarah believes that data mining can help
her begin to formulate an understanding of these factors and interactions.
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what correlation is, and what it isn’t.
Recognize the necessary format for data in order to perform correlation analysis.
Develop a correlation model in RapidMiner.
Interpret the coefficients in a correlation matrix and explain their significance, if any.
ORGANIZATIONAL UNDERSTANDING
Sarah’s goal is to better understand how her company can succeed in the home heating oil market.
She recognizes that there are many factors that influence heating oil consumption, and believes
that by investigating the relationship between a number of those factors, she will be able to better
monitor and respond to heating oil demand. She has selected correlation as a way to model the
relationship between the factors she wishes to investigate.
Correlation is a statistical measure of how strong the relationships are between attributes in a data set.
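To make this definition concrete, the sketch below computes one common correlation measure, the Pearson coefficient, for two attributes. It is an illustration only: the chapter builds its model in RapidMiner, the choice of Pearson's formula is an assumption made here, and the function name `pearson` and the sample values are hypothetical rather than taken from the chapter's data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT taken from the chapter's data set.
temperature = [30, 40, 50, 60, 70]
heating_oil = [250, 210, 170, 130, 90]

r = pearson(temperature, heating_oil)
print(f"Correlation between temperature and heating oil: {r:.2f}")
```

The result always falls between -1 and 1. In this made-up sample the oil values fall in a perfectly straight line as temperature rises, so the coefficient is -1; real data will rarely be this tidy.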
Data Mining for the Masses
DATA UNDERSTANDING
In order to investigate her question, Sarah has enlisted our help in creating a correlation matrix of
six attributes. Working together, and using her employer’s data resources, which are drawn primarily
from the company’s billing database, we create a data set composed of the following
attributes:
Insulation: This is a density rating, ranging from one to ten, indicating the thickness of
each home’s insulation. A home with a density rating of one is poorly insulated, while a
home with a density of ten has excellent insulation.
Temperature: This is the average outdoor ambient temperature at each home for the most
recent year, measured in degrees Fahrenheit.
Heating_Oil: This is the total number of units of heating oil purchased by the owner of
each home in the most recent year.
Num_Occupants: This is the total number of occupants living in each home.
Avg_Age: This is the average age of those occupants.
Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
higher the number, the larger the home.
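The attribute descriptions above imply a simple format check: all six attributes should be present and numeric, and the two rating attributes should stay within their stated scales. The sketch below is one way to express that check in Python, under stated assumptions: the function name `find_problems` is hypothetical, the three sample rows are invented to trigger the checks, and nothing here reflects the contents of the chapter's actual CSV file.

```python
import csv
import io

# The six attribute names described above.
EXPECTED = ["Insulation", "Temperature", "Heating_Oil",
            "Num_Occupants", "Avg_Age", "Home_Size"]

# Rating scales stated in the attribute descriptions.
RATING_RANGES = {"Insulation": (1, 10), "Home_Size": (1, 8)}

# Hypothetical rows, invented for illustration only.
SAMPLE_CSV = """Insulation,Temperature,Heating_Oil,Num_Occupants,Avg_Age,Home_Size
4,65,140,3,32.5,4
11,48,210,5,41.0,6
7,,180,2,58.2,9
"""


def find_problems(csv_text):
    """Return (line_number, attribute, issue) tuples for suspect values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    absent = [name for name in EXPECTED if name not in (reader.fieldnames or [])]
    if absent:
        raise ValueError(f"missing attributes: {absent}")
    problems = []
    # Data rows start on line 2 because line 1 holds the attribute names.
    for line_no, row in enumerate(reader, start=2):
        for name in EXPECTED:
            value = (row[name] or "").strip()
            if value == "":
                problems.append((line_no, name, "missing"))
                continue
            try:
                number = float(value)
            except ValueError:
                problems.append((line_no, name, "not numeric"))
                continue
            if name in RATING_RANGES:
                low, high = RATING_RANGES[name]
                if not low <= number <= high:
                    problems.append((line_no, name, "out of range"))
    return problems


for problem in find_problems(SAMPLE_CSV):
    print(problem)
```

An empty result would suggest the data are in the numeric, complete form that correlation analysis needs; any reported tuple points to a value worth inspecting before modeling.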
DATA PREPARATION
A CSV data set for this chapter’s example is available for download at the book’s companion web
site (https://sites.google.com/site/dataminingforthemasses/). If you wish to follow along with
the example, go ahead and download the Chapter04DataSet.csv file now and save it into your
RapidMiner data folder. Then, complete the following steps to prepare the data set for correlation
mining:
1) Import the Chapter 4 CSV data set into your RapidMiner data repository. Save it with the
name Chapter4. If you need a refresher on how to bring this data set into your
RapidMiner repository, refer to steps 7 through 14 of the Hands On Exercise in Chapter 3.
The steps will be the same, with the exception of which file you select to import. Import
all attributes, and accept the default data types. When you are finished, your repository
should look similar to Figure 4-1.
Figure 4-1. The chapter four data set added to the author’s RapidMiner Book repository.
2) If your RapidMiner application is not open to a new, blank process window, click the new
process icon, or click File > New to create a new process. Drag your Chapter4 data set
into your main process window. Go ahead and click the run (play) button to examine the
data set’s meta data. If you are prompted, you may choose to save your new model. For
this book’s example, we’ll save the model as Chapter4_Process.
Figure 4-2. Meta Data view of the chapter four data set.
We can see in Figure 4-2 that our six attributes are shown. There are a total of 1,218
homes represented in the data set. Our data set appears to be very clean, with no missing
values in any of the six attributes, and no inconsistent data apparent in our ranges or other
descriptive statistics. If you wish, you can take a minute to switch to Data View to
familiarize yourself with the data. These data appear to be in good shape and need no
further data preparation operators, so we are ready to move on to…