Data Mining for the Masses

Chapter 4: Correlation
DEPLOYMENT

The concept of deployment in data mining means doing something with what you’ve learned from
your model: taking some action based upon what your model tells you.  In this chapter’s example,
we conducted some basic, exploratory analysis for our fictional figure, Sarah.  There are several
possible outcomes from this investigation.

We learned through our investigation that the two most strongly correlated attributes in our data
set are Heating_Oil and Avg_Age, with a coefficient of 0.848.  Thus, we know that in this data set,
as the average age of the occupants in a home increases, so too does the heating oil usage in that
home.  What we do not know is why that occurs.  Data analysts often make the mistake of
interpreting correlation as causation.  The assumption that correlation proves causation is
dangerous and often false.
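
As a rough illustration of what a correlation matrix computes, the sketch below calculates the
Pearson coefficient for every pair of attributes and picks out the pair with the largest absolute
value.  The values in `homes` are hypothetical toy numbers invented for this example, not the
chapter's data set, and the helper name `pearson` is our own; RapidMiner performs the equivalent
calculation for you.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values (assumption): NOT the book's data set.
homes = {
    "Heating_Oil": [120, 150, 180, 210, 260, 300],
    "Avg_Age": [25, 33, 38, 47, 55, 62],
    "Temperature": [70, 66, 61, 55, 50, 42],
}

# One coefficient per unordered pair of attributes.
pairs = {
    (a, b): pearson(homes[a], homes[b])
    for a, b in combinations(homes, 2)
}

# The "most strongly correlated" pair is the one farthest from zero,
# whether the sign is positive or negative.
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))

for (a, b), r in pairs.items():
    print(f"{a} vs {b}: {r:.3f}")
print("Strongest pair:", strongest)
```

Note that the code only reports how tightly two columns move together; nothing in the
calculation knows or cares which attribute, if either, drives the other.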

Consider for a moment the correlation coefficient between Avg_Age and Temperature: -0.673.
Referring back to Figure 4-7, we see that this is considered to be a relatively strong negative
correlation.  As the age of a home’s residents increases, the average temperature outside decreases,
and as the temperature rises, the age of the folks inside goes down.  But could the average age of a
home’s occupants have any effect on that home’s average yearly outdoor temperature?  Certainly
not.  If it did, we could control the temperature by simply moving people of different ages in and
out of homes.  This of course is silly.  While there is a statistical correlation between these two
attributes in our data set, there is no logical reason that movement in one causes movement in the
other.  The relationship is probably coincidental, but if not, there must be some other explanation
that our model cannot offer.  Such limitations must be recognized and accepted in all data mining
deployment decisions.
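
One kind of "other explanation" is a third factor that the data set does not record.  The sketch
below is a made-up illustration of that idea, not a claim about Sarah's data: a hypothetical
`hidden` factor drives two attributes, `attr_a` and `attr_b`, which never influence each other,
yet the two still come out strongly negatively correlated.  All names and numbers here are
assumptions chosen for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical unrecorded factor (assumption), invented for illustration.
hidden = [1, 2, 3, 4, 5, 6, 7, 8]

# Small fixed wiggles so the attributes are not perfect functions of `hidden`.
wiggle_a = [1, -1, 2, -2, 1, -1, 2, -2]
wiggle_b = [-1, 2, 1, -2, 2, -1, -2, 1]

# Each attribute depends on the hidden factor only, never on the other attribute.
attr_a = [10 + 3 * h + w for h, w in zip(hidden, wiggle_a)]
attr_b = [80 - 4 * h + w for h, w in zip(hidden, wiggle_b)]

r = pearson(attr_a, attr_b)
print(f"Correlation between attr_a and attr_b: {r:.3f}")
```

A correlation matrix built only from `attr_a` and `attr_b` would show a strong relationship and
give no hint that `hidden` exists, which is why the coefficient alone cannot settle questions of
cause.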

Another false interpretation of correlations is that the coefficients are percentages, as if to say
that a correlation coefficient of 0.776 between two attributes indicates that there is 77.6%
shared variability between those two attributes.  This is not correct.  While the coefficients do tell a
story about the shared variability between attributes, the underlying mathematical formula used to
calculate correlation coefficients measures only the strength of the interaction between
attributes, as indicated by proximity to 1 or -1.  No percentage is calculated or intended.
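
To see why reading a coefficient as a percentage misleads, it can help to compare each coefficient
with its square.  In standard statistics (this goes beyond what the chapter states), the square of
a Pearson coefficient is the figure usually read as the proportion of variance two attributes
share through a linear relationship.  The short sketch below simply does that arithmetic for the
three coefficients mentioned in this section; the variable names are our own.

```python
# Coefficients quoted in this section of the chapter.
coefficients = {
    "Heating_Oil vs Avg_Age": 0.848,
    "Avg_Age vs Temperature": -0.673,
    "percentage example": 0.776,
}

# Squaring removes the sign and shrinks every value that is not exactly 1 or -1.
squared = {name: r ** 2 for name, r in coefficients.items()}

for name, r in coefficients.items():
    print(f"{name}: r = {r:+.3f}, r squared = {squared[name]:.3f}")
```

For the 0.776 example the square is roughly 0.602, so even under the variance-based reading the
figure would be about 60%, not 77.6%.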

With these points of interpretation explained, there are several things that Sarah can do to
take action based upon our model.  A few options might include:



• Dropping the Num_Occupants attribute.  While the number of people living in a home
might logically seem like a variable that would influence energy usage, in our model it did
not correlate in any significant way with anything else.  Sometimes there are attributes that
don’t turn out to be very interesting.

• Investigating the role of home insulation.  The Insulation rating attribute was fairly strongly
correlated with a number of other attributes.  There may be some opportunity there to
partner with a company (or start one…?) that specializes in adding insulation to existing
homes.  If she is interested in contributing to conservation, working on a marketing
promotion to show the benefits of adding insulation to a home might be a good course of
action; however, if she wishes to continue to sell as much heating oil as she can, she may
feel conflicted about participating in such a campaign.

• Adding greater granularity to the data set.  This data set has yielded some interesting
results, but frankly, it’s pretty general.  We have used average yearly temperatures and the total
annual number of heating oil units in this model.  But we also know that temperatures
fluctuate throughout the year in most areas of the world, so monthly or even weekly
measures would likely show demand and usage over time in more detail, and the
correlations between attributes would probably be more interesting.  From
our model, Sarah now knows how certain attributes interact with one another, but in the
day-to-day business of doing her job, she’ll probably want to know about usage over time
periods shorter than one year.

• Adding more attributes to the data set.  It turned out that the number of occupants in
the home didn’t correlate much with other attributes, but that doesn’t mean that other
attributes would be equally uninteresting.  For example, what if Sarah had access to the
number of furnaces and/or boilers in each home?  Home_size was slightly correlated with
Heating_Oil usage, so perhaps the number of instruments that consume heating oil in each
home would tell an interesting story, or at least add to her insight.
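
The first option above, dropping an attribute that does not correlate with anything, can be
sketched as a simple screening step.  The function name `weak_attributes`, the cutoff of 0.4, and
the toy values below are all assumptions made for illustration; they are not taken from the
chapter's data set or from RapidMiner.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def weak_attributes(table, threshold):
    """Return attributes whose strongest |r| with every other attribute is below threshold."""
    weak = []
    for name, values in table.items():
        strongest = max(
            abs(pearson(values, other_values))
            for other_name, other_values in table.items()
            if other_name != name
        )
        if strongest < threshold:
            weak.append(name)
    return weak


# Hypothetical toy values (assumption): NOT the book's data set.
table = {
    "Heating_Oil": [120, 150, 180, 210, 260, 300],
    "Avg_Age": [25, 33, 38, 47, 55, 62],
    "Num_Occupants": [4, 2, 5, 2, 5, 3],
}

# The 0.4 cutoff is an arbitrary choice for this example.
print(weak_attributes(table, threshold=0.4))
```

A screen like this only suggests candidates.  Whether to actually drop an attribute remains a
judgment call, since an attribute that looks uninteresting in yearly totals might behave
differently once finer-grained data are available.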

Sarah would also be wise to remember that the CRISP-DM approach is cyclical in nature.  Each
month, as new orders come in, new bills go out, and new customers sign up for a heating oil
account, additional data become available to add to the model.  As she learns more about how
each attribute in her data set interacts with others, she can improve our correlation model by
adding not only new attributes but also new observations.

CHAPTER SUMMARY

This chapter has introduced the concept of correlation as a data mining model.  It has been chosen
as the first model for this book because it is relatively simple to construct, run and interpret, thus
serving as an easy starting point upon which to build.  Future models will become more complex,
but continuing to develop your skills in RapidMiner and getting comfortable with the tools will
make the more complex models easier for you to achieve as we move forward.

Recall from Chapter 1 (Figure 1-2) that data mining has two somewhat interconnected sides:
Classification and Prediction.  Correlation has been shown to be primarily on the side of
Classification.  We do not infer causation using correlation metrics, nor do we use correlation
coefficients to predict one attribute’s value based on another’s.  We can, however, quickly find
general trends in data sets using correlations, and we can anticipate how strongly an observed
movement in one attribute will occur in conjunction with movement in another.

Correlation can be a quick and easy way to see how elements of a given problem may be
interacting with one another.  Whenever you find yourself asking how certain factors in a problem
you’re trying to solve interact with one another, consider building a correlation matrix to find out.
For example, does customer satisfaction change based on time of year?  Does the amount of
rainfall change the price of a crop?  Does household income influence which restaurants a person
patronizes?  The answer to each of these questions is probably ‘yes’, but correlation not only
helps us confirm whether that’s true, it also helps us learn how strong the interactions are when,
and if, they occur.
