Data Mining for the Masses

SECTION TWO: DATA MINING MODELS AND METHODS

CHAPTER FOUR:
CORRELATION

CONTEXT AND PERSPECTIVE

Sarah is a regional sales manager for a nationwide supplier of fossil fuels for home heating. Recent
volatility in market prices for heating oil specifically, coupled with wide variability in the size of
each order for home heating oil, has Sarah concerned. She feels a need to understand the types of
behaviors and other factors that may influence the demand for heating oil in the domestic market.
What factors are related to heating oil usage, and how might she use knowledge of such factors
to better manage her inventory and anticipate demand? Sarah believes that data mining can help
her begin to formulate an understanding of these factors and interactions.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:

• Explain what correlation is, and what it isn’t.
• Recognize the necessary format for data in order to perform correlation analysis.
• Develop a correlation model in RapidMiner.
• Interpret the coefficients in a correlation matrix and explain their significance, if any.

ORGANIZATIONAL UNDERSTANDING

Sarah’s goal is to better understand how her company can succeed in the home heating oil market.
She recognizes that there are many factors that influence heating oil consumption, and believes
that by investigating the relationship between a number of those factors, she will be able to better
monitor and respond to heating oil demand. She has selected correlation as a way to model the
relationship between the factors she wishes to investigate. Correlation is a statistical measure of
how strong the relationships are between attributes in a data set.
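
To make this definition concrete, the following is a minimal sketch of the Pearson correlation coefficient. Pearson is assumed here because it is the most common meaning of "correlation" in this kind of analysis; the text has not yet named a specific coefficient. The function name and the sample values are hypothetical and are not drawn from Sarah's data.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from each mean (how the attributes vary together).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (how each attribute varies on its own).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when an attribute is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustration values, not taken from the chapter's data set.
temperature = [30, 40, 50, 60, 70]
heating_oil = [250, 210, 190, 140, 100]
r = pearson(temperature, heating_oil)
print(round(r, 3))
```

The coefficient always falls between -1 and 1. Values near either extreme indicate a strong linear relationship, and values near 0 indicate a weak one.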

DATA UNDERSTANDING

In order to investigate her question, Sarah has enlisted our help in creating a correlation matrix of
six attributes. Working together, using Sarah’s employer’s data resources, which are primarily
drawn from the company’s billing database, we create a data set composed of the following
attributes:


• Insulation: This is a density rating, ranging from one to ten, indicating the thickness of
  each home’s insulation. A home with a density rating of one is poorly insulated, while a
  home with a density rating of ten has excellent insulation.
• Temperature: This is the average outdoor ambient temperature at each home for the
  most recent year, measured in degrees Fahrenheit.
• Heating_Oil: This is the total number of units of heating oil purchased by the owner of
  each home in the most recent year.
• Num_Occupants: This is the total number of occupants living in each home.
• Avg_Age: This is the average age of those occupants.
• Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The
  higher the number, the larger the home.
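
As an illustration of the attribute definitions above, here is a small sketch of one home's record as a typed structure, with a check against the two rating scales the chapter states. The class name, function name, and the choice of numeric types are assumptions made for this example; the chapter itself works with these attributes inside RapidMiner.

```python
from dataclasses import dataclass

@dataclass
class HomeRecord:
    """One row of the data set (hypothetical structure; types are assumed)."""
    insulation: int      # density rating, 1 (poor) to 10 (excellent)
    temperature: float   # average outdoor temperature, degrees Fahrenheit
    heating_oil: float   # units purchased in the most recent year
    num_occupants: int   # people living in the home
    avg_age: float       # average age of those occupants
    home_size: int       # size rating, 1 (smallest) to 8 (largest)

def range_problems(home):
    """Return messages for values outside the rating scales the chapter describes."""
    problems = []
    if not 1 <= home.insulation <= 10:
        problems.append("Insulation must be between 1 and 10")
    if not 1 <= home.home_size <= 8:
        problems.append("Home_Size must be between 1 and 8")
    return problems

# Made-up example values, not taken from the chapter's data set.
ok_home = HomeRecord(6, 45.0, 180.0, 4, 38.5, 5)
bad_home = HomeRecord(11, 45.0, 180.0, 4, 38.5, 9)
print(range_problems(ok_home))
print(range_problems(bad_home))
```

All six attributes are numeric, which is the property that matters most for correlation analysis.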

DATA PREPARATION

A CSV data set for this chapter’s example is available for download at the book’s companion web
site (https://sites.google.com/site/dataminingforthemasses/). If you wish to follow along with
the example, go ahead and download the Chapter04DataSet.csv file now and save it into your
RapidMiner data folder. Then, complete the following steps to prepare the data set for correlation
mining:

1) Import the Chapter 4 CSV data set into your RapidMiner data repository. Save it with the
name Chapter4. If you need a refresher on how to bring this data set into your
RapidMiner repository, refer to steps 7 through 14 of the Hands On Exercise in Chapter 3.
The steps will be the same, with the exception of which file you select to import. Import
all attributes, and accept the default data types. When you are finished, your repository
should look similar to Figure 4-1.

Figure 4-1. The chapter four data set added to the author’s RapidMiner Book repository.

2) If your RapidMiner application is not open to a new, blank process window, click the new
process icon, or click File > New to create a new process. Drag your Chapter4 data set
into your main process window. Go ahead and click the run (play) button to examine the
data set’s meta data. If you are prompted, you may choose to save your new model. For
this book’s example, we’ll save the model as Chapter4_Process.

Figure 4-2. Meta Data view of the chapter four data set.

We can see in Figure 4-2 that our six attributes are shown. There are a total of 1,218
homes represented in the data set. Our data set appears to be very clean, with no missing
values in any of the six attributes, and no inconsistent data apparent in our ranges or other
descriptive statistics. If you wish, you can take a minute to switch to Data View to
familiarize yourself with the data. These data appear to be in good shape and need no
further data preparation operators, so we are ready to move on to…
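
The checks described in step 2 (row count, missing values, and value ranges) can also be sketched outside RapidMiner. The following is a minimal example using only Python's standard library. It assumes the CSV has a header row whose column names match the attribute names listed earlier, and that every non-empty value is numeric. The function name is hypothetical, and the inline sample rows are invented for illustration, not copied from Chapter04DataSet.csv.

```python
import csv
import io

def summarize(csv_file):
    """Return (row_count, stats), where stats maps each column name to its
    missing-value count, minimum, and maximum."""
    reader = csv.DictReader(csv_file)
    stats = {name: {"missing": 0, "min": None, "max": None}
             for name in (reader.fieldnames or [])}
    row_count = 0
    for row in reader:
        row_count += 1
        for name, s in stats.items():
            text = (row.get(name) or "").strip()
            if text == "":
                # An empty cell counts as a missing value.
                s["missing"] += 1
                continue
            value = float(text)  # assumes numeric attributes
            s["min"] = value if s["min"] is None else min(s["min"], value)
            s["max"] = value if s["max"] is None else max(s["max"], value)
    return row_count, stats

# Invented sample rows; the third row deliberately omits Heating_Oil.
sample = io.StringIO(
    "Insulation,Temperature,Heating_Oil,Num_Occupants,Avg_Age,Home_Size\n"
    "6,74,132,4,23.8,4\n"
    "10,43,263,4,56.7,4\n"
    "3,81,,2,28.0,6\n"
)
row_count, stats = summarize(sample)
print(row_count, stats["Heating_Oil"])
```

To run the same summary on the downloaded file, pass a file opened with `open("Chapter04DataSet.csv", newline="")` in place of the sample, assuming the file sits in the working directory.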
