Data Mining for the Masses
MODELING
3) Switch back to design perspective. On the Operators tab in the lower left-hand corner, use
the search box and begin typing the word correlation. The tool we are looking for is called
Correlation Matrix. You may be able to find it before you even finish typing the full search
term. Once you’ve located it, drag it over into your process window and drop it into your
stream. By default, the exa port will connect to the res port, but in this chapter’s example
we are interested in creating a matrix of correlation coefficients that we can analyze. Thus,
it is important for you to connect the mat (matrix) port to a res port, as illustrated in
Figure 4-3.
Figure 4-3. The addition of a Correlation Matrix to our stream, with the mat (matrix) port
connected to a result set (res) port.
4) Correlation is a relatively simple statistical analysis tool, so there are few parameters to
modify. We will accept the defaults, and run the model. The results will be similar to
Figure 4-4.
Figure 4-4. Results of a Correlation Matrix.
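For readers who want to see what a correlation matrix contains outside of RapidMiner, the following is a minimal sketch in plain Python. It assumes the coefficients are Pearson correlation coefficients, the most common definition. The helper names `pearson` and `correlation_matrix` and the three small columns of values are hypothetical illustrations; they are not the chapter's data set and will not reproduce the numbers in Figure 4-4.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict holding the coefficient for every pair of attributes."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical sample values, invented for illustration only.
data = {
    "Insulation": [4, 6, 9, 3, 7],
    "Temperature": [70, 55, 40, 75, 50],
    "Heating_Oil": [130, 190, 260, 110, 220],
}

matrix = correlation_matrix(data)

# Print one row per attribute, one column per attribute.
for name, row in matrix.items():
    print(name.ljust(12), "  ".join(f"{row[other]:6.3f}" for other in data))
```

With three attributes the sketch produces a three-by-three matrix; the same function applied to six attributes would produce a six-by-six matrix.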
Chapter 4: Correlation
5) In Figure 4-4, we have our correlation coefficients in a matrix. Correlation coefficients
are relatively easy to decipher. They are simply a measure of the strength of the
relationship between each possible pair of attributes in the data set. Because we have six
attributes in this data set, our matrix is six columns wide by six rows tall. In the location
where an attribute intersects with itself, the correlation coefficient is ‘1’, because everything
compared to itself has a perfectly matched relationship. All other pairs of attributes will
have a correlation coefficient of less than one. To complicate matters a bit, correlation
coefficients can actually be negative as well, so all correlation coefficients will fall
somewhere between -1 and 1. We can see that this is the case in Figure 4-4, and so we can
now move on to the CRISP-DM step of…
EVALUATION
All correlation coefficients between 0 and 1 represent positive correlations, while all coefficients
between 0 and -1 are negative correlations. While this may seem straightforward, there is an
important distinction to be made when interpreting the matrix’s values. This distinction has to do
with the direction of movement between the two attributes being analyzed. Let’s consider the
relationship between the Heating_Oil consumption attribute and the Insulation rating level
attribute. The coefficient there, as seen in our matrix in Figure 4-4, is 0.736. This is a positive
number, and therefore, a positive correlation. But what does that mean? Correlations that are
positive mean that as one attribute’s value rises, the other attribute’s value also rises. But a
positive correlation also means that as one attribute’s value falls, the other’s also falls. Data
analysts sometimes make the mistake of thinking that a negative correlation exists if an attribute’s
values are decreasing, but if its corresponding attribute’s values are also decreasing, the
correlation is still a positive one. This is illustrated in Figure 4-5.
Heating Oil use rises → Insulation rating also rises
Heating Oil use falls → Insulation rating also falls
Whenever both attribute values move in the same direction, the correlation is positive.
Figure 4-5. Illustration of positive correlations.
Next, consider the relationship between the Temperature attribute and the Insulation rating
attribute. In our Figure 4-4 matrix, we see that the coefficient there is -0.794. In this example, the
correlation is negative, as illustrated in Figure 4-6.
Temperature rises → Insulation rating falls
Temperature falls → Insulation rating rises
Whenever attribute values move in opposite directions, the correlation is negative.
Figure 4-6. Illustration of negative correlations.
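The point about direction can be checked with a few lines of Python. This is only a sketch under stated assumptions: it uses Pearson's coefficient, a hypothetical helper named `pearson`, and invented values that are not taken from the chapter's data set. It shows that the coefficient stays positive whether two attributes rise together or fall together, and turns negative only when they move in opposite directions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical observations, invented for illustration only.
insulation = [2, 4, 6, 8, 10]
heating_oil = [100, 150, 210, 240, 300]   # rises as insulation rises
temperature = [80, 70, 55, 45, 30]        # falls as insulation rises

# Both attributes rising together: positive coefficient.
both_rising = pearson(insulation, heating_oil)

# The same pairs listed in reverse order, so both attributes now fall together.
# The pairing is unchanged, so the coefficient is still positive.
both_falling = pearson(insulation[::-1], heating_oil[::-1])

# One attribute rises while the other falls: negative coefficient.
opposite = pearson(temperature, insulation)

print(f"both rising:  {both_rising:.3f}")
print(f"both falling: {both_falling:.3f}")
print(f"opposite:     {opposite:.3f}")
```

Reversing the order of the observations changes the direction in which we read the data, but not how the two attributes move relative to each other, which is all the sign of the coefficient reflects.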
So correlation coefficients tell us something about the direction of the relationship between
attributes, and this is helpful, but they also tell us something about the strength of the
correlation. As previously mentioned, all correlations will fall between 0 and 1 or 0 and -1. The
closer a correlation coefficient is to 1 or to -1, the stronger it is. Figure 4-7 illustrates the
correlation strength along the continuum from -1 to 1.
Coefficient range    Correlation strength
-1.0 to -0.8         Very strong correlation
-0.8 to -0.6         Strong correlation
-0.6 to -0.4         Some correlation
-0.4 to 0            No correlation
0 to 0.4             No correlation
0.4 to 0.6           Some correlation
0.6 to 0.8           Strong correlation
0.8 to 1.0           Very strong correlation
Figure 4-7. Correlation strengths between -1 and 1.
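The bands in Figure 4-7 can be expressed as a small lookup function. The sketch below is an illustration, not part of RapidMiner: the function name `correlation_strength` is hypothetical, and because the figure does not say which band a boundary value such as 0.6 belongs to, the code assumes each boundary falls into the stronger band.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the bands of Figure 4-7.

    Assumption: a value exactly on a boundary (0.4, 0.6, 0.8) is placed
    in the stronger of the two adjacent bands.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength ignores the sign; only distance from 0 matters
    if size >= 0.8:
        return "very strong correlation"
    if size >= 0.6:
        return "strong correlation"
    if size >= 0.4:
        return "some correlation"
    return "no correlation"


# The two coefficients discussed in this chapter.
for coefficient in (0.736, -0.794):
    print(coefficient, "->", correlation_strength(coefficient))
```

Both 0.736 and -0.794 land in the strong band; the sign affects only the direction of the relationship, not its strength.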
RapidMiner attempts to help us recognize correlation strengths through color coding. In the
Figure 4-4 matrix, we can see that some of the cells are tinted with shades of purple in graduated
colors, in order to more strongly highlight those with stronger correlations. It is important to
recognize that these are only general guidelines and not hard-and-fast rules. A correlation
coefficient around 0.2 does show some interaction between attributes, even if it is not statistically
significant. This should be kept in mind as we proceed to…