Chapter 4: Correlation
DEPLOYMENT
The concept of deployment in data mining means doing something with what you’ve learned from
your model: taking some action based upon what your model tells you. In this chapter’s example,
we conducted some basic, exploratory analysis for our fictional figure, Sarah. There are several
possible outcomes from this investigation.
We learned through our investigation that the two most strongly correlated attributes in our data
set are Heating_Oil and Avg_Age, with a coefficient of 0.848. Thus, we know that in this data set,
as the average age of the occupants in a home increases, so too does the heating oil usage in that
home. What we do not know is why that occurs. Data analysts often make the mistake of
interpreting correlation as causation. The assumption that correlation proves causation is
dangerous and often false.
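To make the coefficient itself more concrete, here is a minimal Python sketch that computes a Pearson correlation coefficient by hand. The `avg_age` and `heating_oil` values are invented for illustration and are not Sarah's data set, and the helper name `pearson` is our own; only the formula is standard.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical values, NOT taken from the chapter's data set.
avg_age = [23.8, 31.5, 42.0, 47.2, 56.7, 64.1]
heating_oil = [132, 150, 201, 198, 245, 270]

r = pearson(avg_age, heating_oil)
print(f"r = {r:.3f}")
```

A result close to 1 says only that the two columns rise together in these made-up numbers. Nothing in the calculation says which attribute, if either, drives the other.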
Consider for a moment the correlation coefficient between Avg_Age and Temperature: -0.673.
Referring back to Figure 4-7, we see that this is considered to be a relatively strong negative
correlation. As the age of a home’s residents increases, the average temperature outside decreases;
and as the temperature rises, the age of the folks inside goes down. But could the average age of a
home’s occupants have any effect on that home’s average yearly outdoor temperature? Certainly
not. If it did, we could control the temperature by simply moving people of different ages in and
out of homes. This, of course, is silly. While there is a statistical correlation between these two
attributes in our data set, there is no logical reason that movement in one causes movement in the
other. The relationship is probably coincidental, but if not, there must be some other explanation
that our model cannot offer. Such limitations must be recognized and accepted in all data mining
deployment decisions.
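One kind of "other explanation" is a third factor that the model never sees. The simulation below is a sketch under that assumption: a hypothetical `hidden` factor drives two observed attributes, which then correlate strongly even though neither affects the other. The names, noise levels, and sample size are all invented, and nothing here claims that this is what actually links Avg_Age and Temperature.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical hidden factor that is absent from the data set.
hidden = [random.gauss(0, 1) for _ in range(2000)]

# Two observed attributes: each depends on the hidden factor plus its own
# independent noise. Neither attribute is computed from the other.
attr_a = [h + random.gauss(0, 0.5) for h in hidden]
attr_b = [h + random.gauss(0, 0.5) for h in hidden]

r = pearson(attr_a, attr_b)
print(f"r between attr_a and attr_b = {r:.3f}")
```

The two attributes come out strongly correlated, yet changing `attr_a` directly would do nothing to `attr_b`. A correlation matrix alone cannot distinguish this situation from a genuine cause-and-effect relationship.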
Another common misinterpretation of correlations is that the coefficients are percentages, as if to say
that a correlation coefficient of 0.776 between two attributes is an indication that there is 77.6%
shared variability between those two attributes. This is not correct. While the coefficients do tell a
story about the shared variability between attributes, the underlying mathematical formula used to
calculate correlation coefficients measures only the strength, as indicated by proximity to 1 or -1,
and the direction, as indicated by the sign, of the relationship between attributes. No percentage is
calculated or intended.
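A small arithmetic check may help here. For a linear relationship, the quantity usually read as a proportion of shared variance is the square of the coefficient (often called the coefficient of determination), not the coefficient itself. The helper name `shared_variance` below is made up for this illustration.

```python
def shared_variance(r):
    """Square a correlation coefficient to get the proportion of shared variance."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r * r

# The three coefficients mentioned in this chapter.
for r in (0.776, -0.673, 0.848):
    print(f"r = {r:+.3f} -> r^2 = {shared_variance(r):.3f}")
```

Under this reading, a coefficient of 0.776 corresponds to roughly 60% shared variance rather than 77.6%, and the sign of the coefficient disappears entirely once it is squared.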
Data Mining for the Masses
With these interpretation guidelines explained, there are several things that Sarah can do to
take action based upon our model. A few options might include:
• Dropping the Num_Occupants attribute. While the number of people living in a home
  might logically seem like a variable that would influence energy usage, in our model it did
  not correlate in any significant way with anything else. Sometimes there are attributes that
  don’t turn out to be very interesting.
• Investigating the role of home insulation. The Insulation rating attribute was fairly strongly
  correlated with a number of other attributes. There may be some opportunity there to
  partner with a company (or start one…?) that specializes in adding insulation to existing
  homes. If she is interested in contributing to conservation, working on a marketing
  promotion to show the benefits of adding insulation to a home might be a good course of
  action. However, if she wishes to continue to sell as much heating oil as she can, she may
  feel conflicted about participating in such a campaign.
• Adding greater granularity to the data set. This data set has yielded some interesting
  results, but frankly, it’s pretty general. We have used average yearly temperatures and total
  annual number of heating oil units in this model. But we also know that temperatures
  fluctuate throughout the year in most areas of the world, so monthly or even weekly
  measures would likely show demand and usage over time in more detail, and the
  correlations between attributes would probably be more interesting. From
  our model, Sarah now knows how certain attributes interact with one another, but in the
  day-to-day business of doing her job, she’ll probably want to know about usage over time
  periods shorter than one year.
• Adding attributes to the data set. It turned out that the number of occupants in
  the home didn’t correlate much with other attributes, but that doesn’t mean that other
  attributes would be equally uninteresting. For example, what if Sarah had access to the
  number of furnaces and/or boilers in each home? Home_size was slightly correlated with
  Heating_Oil usage, so perhaps the number of instruments that consume heating oil in each
  home would tell an interesting story, or at least add to her insight.
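The first option above, dropping an attribute that correlates with nothing, can be sketched as a simple screen over a correlation matrix. Everything in this example is an assumption made for illustration: the attribute names are placeholders, the coefficients are invented rather than taken from the chapter's matrix, and the 0.2 cutoff is an arbitrary choice, not a rule.

```python
# Hypothetical coefficients for illustration only; they are NOT the values
# from the chapter's correlation matrix.
correlations = {
    ("attr_a", "attr_b"): 0.81,
    ("attr_a", "attr_c"): -0.64,
    ("attr_a", "attr_d"): 0.03,
    ("attr_b", "attr_c"): -0.55,
    ("attr_b", "attr_d"): -0.08,
    ("attr_c", "attr_d"): 0.05,
}

def weak_attributes(pairs, threshold=0.2):
    """Return attributes whose strongest |r| with any other attribute is below threshold."""
    strongest = {}
    for (a, b), r in pairs.items():
        # Track the largest absolute coefficient seen for each attribute.
        for name in (a, b):
            strongest[name] = max(strongest.get(name, 0.0), abs(r))
    return sorted(name for name, best in strongest.items() if best < threshold)

print(weak_attributes(correlations))  # prints ['attr_d']
```

A screen like this only flags candidates. Whether to drop an attribute remains a judgment call, since a weak linear correlation does not rule out other kinds of relationships.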
Sarah would also be wise to remember that the CRISP-DM approach is cyclical in nature. Each
month, as new orders come in and new bills go out, and as new customers sign up for a heating oil
account, there are additional data available to add into the model. As she learns more about how
each attribute in her data set interacts with others, she can extend our correlation model by
adding not only new attributes but also new observations.
CHAPTER SUMMARY
This chapter has introduced the concept of correlation as a data mining model. It has been chosen
as the first model for this book because it is relatively simple to construct, run and interpret, thus
serving as an easy starting point upon which to build. Future models will become more complex,
but continuing to develop your skills in RapidMiner and getting comfortable with the tools will
make the more complex models easier for you to achieve as we move forward.
Recall from Chapter 1 (Figure 1-2) that data mining has two somewhat interconnected sides:
Classification and Prediction. Correlation has been shown to be primarily on the side of
Classification. We do not infer causation using correlation metrics, nor do we use correlation
coefficients to predict one attribute’s value based on another’s. We can, however, quickly find
general trends in data sets using correlations, and we can anticipate how strongly an observed
movement in one attribute will occur in conjunction with movement in another.
Correlation can be a quick and easy way to see how elements of a given problem may be
interacting with one another. Whenever you find yourself asking how certain factors in a problem
you’re trying to solve interact with one another, consider building a correlation matrix to find out.
For example, does customer satisfaction change based on time of year? Does the amount of
rainfall change the price of a crop? Does household income influence which restaurants a person
patronizes? The answer to each of these questions is probably ‘yes’, but correlation can not only
help us know if that’s true, it can also help us learn how strong the interactions are when, and if,
they occur.
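Outside of RapidMiner, a correlation matrix of this kind can be sketched in a few lines of Python. The column names and monthly figures below are hypothetical, loosely echoing the questions above, and the helper names `pearson` and `correlation_matrix` are our own; this is one simple way to build such a matrix, not a description of how RapidMiner does it.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every ordered pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical monthly observations, invented for this example.
data = {
    "rainfall_mm": [80, 65, 50, 42, 30, 12, 8, 10, 25, 48, 70, 85],
    "crop_price": [2.1, 2.2, 2.4, 2.5, 2.9, 3.4, 3.6, 3.5, 3.0, 2.6, 2.3, 2.0],
    "satisfaction": [7.1, 6.8, 7.4, 7.0, 7.3, 6.9, 7.2, 7.5, 6.7, 7.1, 7.0, 7.2],
}

matrix = correlation_matrix(data)
names = list(data)

# Print the matrix as a small table, one row per attribute.
print(" " * 14 + "".join(f"{n:>14}" for n in names))
for a in names:
    print(f"{a:<14}" + "".join(f"{matrix[(a, b)]:>14.3f}" for b in names))
```

Each attribute correlates perfectly with itself, so the diagonal is always 1, and the matrix is symmetric, so only the values on one side of the diagonal need to be read.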