Data Mining for the Masses

MODELING

3) Switch back to design perspective.  On the Operators tab in the lower left-hand corner, use
the search box and begin typing the word correlation.  The tool we are looking for is called
Correlation Matrix.  You may be able to find it before you even finish typing the full search
term.  Once you’ve located it, drag it over into your process window and drop it into your
stream.  By default, the exa port will connect to the res port, but in this chapter’s example
we are interested in creating a matrix of correlation coefficients that we can analyze.  Thus,
it is important for you to connect the mat (matrix) port to a res port, as illustrated in Figure
4-3.

Figure 4-3. The addition of a Correlation Matrix to our stream, with the
mat (matrix) port connected to a result set (res) port.

4) Correlation is a relatively simple statistical analysis tool, so there are few parameters to
modify.  We will accept the defaults, and run the model.  The results will be similar to
Figure 4-4.

Figure 4-4. Results of a Correlation Matrix.

Chapter 4: Correlation

5) In Figure 4-4, we have our correlation coefficients in a matrix.  Correlation coefficients
are relatively easy to decipher.  They are simply a measure of the strength of the
relationship between each possible pair of attributes in the data set.  Because we have six
attributes in this data set, our matrix is six columns wide by six rows tall.  In the location
where an attribute intersects with itself, the correlation coefficient is ‘1’, because everything
compared to itself has a perfectly matched relationship.  All other pairs of attributes will
typically have a correlation coefficient of less than one.  To complicate matters a bit, correlation
coefficients can actually be negative as well, so all correlation coefficients will fall
somewhere between -1 and 1.  We can see that this is the case in Figure 4-4, and so we can
now move on to the CRISP-DM step of…
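
For readers who would like to see the arithmetic behind such a matrix, the following is a minimal sketch in plain Python.  It is not RapidMiner's implementation: it assumes the coefficient is the Pearson coefficient, the helper names pearson and correlation_matrix are hypothetical, and the attribute values are made up for illustration rather than taken from the chapter's data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict holding the coefficient for every attribute pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up illustrative values; NOT the chapter's data set.
data = {
    "Insulation": [4, 6, 8, 3, 9, 5],
    "Temperature": [70, 55, 40, 75, 35, 60],
    "Heating_Oil": [120, 180, 250, 100, 270, 150],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in data])
```

With three attributes the result is a three-by-three matrix.  Each attribute paired with itself yields 1 (up to floating-point rounding), and every other value falls between -1 and 1.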

EVALUATION

All correlation coefficients between 0 and 1 represent positive correlations, while all coefficients
between 0 and -1 are negative correlations.  While this may seem straightforward, there is an
important distinction to be made when interpreting the matrix’s values.  This distinction has to do
with the direction of movement between the two attributes being analyzed.  Let’s consider the
relationship between the Heating_Oil consumption attribute and the Insulation rating level
attribute.  The coefficient there, as seen in our matrix in Figure 4-4, is 0.736.  This is a positive
number, and therefore a positive correlation.  But what does that mean?  Correlations that are
positive mean that as one attribute’s value rises, the other attribute’s value also rises.  But a positive
correlation also means that as one attribute’s value falls, the other’s also falls.  Data analysts
sometimes make the mistake of thinking that a negative correlation exists if an attribute’s values are
decreasing, but if its corresponding attribute’s values are also decreasing, the correlation is still a
positive one.  This is illustrated in Figure 4-5.

[Diagram: Heating Oil use rises, Insulation rating also rises.  Heating Oil use falls, Insulation rating also falls.]
Whenever both attribute values move in the same direction, the correlation is positive.
Figure 4-5. Illustration of positive correlations.

Next, consider the relationship between the Temperature attribute and the Insulation rating
attribute.  In our Figure 4-4 matrix, we see that the coefficient there is -0.794.  In this example, the
correlation is negative, as illustrated in Figure 4-6.

[Diagram: Temperature rises, Insulation rating falls.  Temperature falls, Insulation rating rises.]
Whenever attribute values move in opposite directions, the correlation is negative.
Figure 4-6. Illustration of negative correlations.
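
The point about direction can be checked with a few lines of plain Python.  This is only a sketch: it assumes the Pearson coefficient, the pearson helper is hypothetical, and the number series are invented for illustration.  It compares two series that rise together, the same two series falling together, and one rising while the other falls.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values: both series increase from first to last observation.
rising_a = [1, 2, 3, 4, 5]
rising_b = [10, 12, 15, 19, 24]

# The same observations in reverse order, so both series now decrease.
falling_a = list(reversed(rising_a))
falling_b = list(reversed(rising_b))

r_rising = pearson(rising_a, rising_b)      # both rise together
r_falling = pearson(falling_a, falling_b)   # both fall together
r_opposite = pearson(rising_a, falling_b)   # one rises while the other falls

print("rise together:", round(r_rising, 3))
print("fall together:", round(r_falling, 3))
print("opposite directions:", round(r_opposite, 3))
```

The first two coefficients are positive and equal to each other, since falling together is the same relationship seen in the other order.  Only the third, where the series move in opposite directions, is negative.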

So correlation coefficients tell us something about the direction of the relationship between
attributes, and this is helpful, but they also tell us something about the strength of the
correlation.  As previously mentioned, all correlations will fall between -1 and 1.  The closer a
correlation coefficient is to 1 or to -1, the stronger it is.  Figure 4-7 illustrates the correlation
strength along the continuum from -1 to 1.

Coefficient range     Strength
-1.0 to -0.8          Very strong correlation
-0.8 to -0.6          Strong correlation
-0.6 to -0.4          Some correlation
-0.4 to  0            No correlation
 0   to  0.4          No correlation
 0.4 to  0.6          Some correlation
 0.6 to  0.8          Strong correlation
 0.8 to  1.0          Very strong correlation
Figure 4-7. Correlation strengths between -1 and 1.
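
The bands in Figure 4-7 can be written as a small lookup function.  This is a sketch only: the function name correlation_strength is hypothetical, and because the figure does not say which band a boundary value such as exactly 0.6 belongs to, the code assumes boundary values fall into the weaker band.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the strength labels of Figure 4-7.

    Assumption: a value sitting exactly on a boundary (0.4, 0.6, 0.8)
    is assigned to the weaker of the two neighboring bands.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    magnitude = abs(r)  # strength ignores the sign; the sign only gives direction
    if magnitude > 0.8:
        return "very strong correlation"
    if magnitude > 0.6:
        return "strong correlation"
    if magnitude > 0.4:
        return "some correlation"
    return "no correlation"


# The two coefficients discussed in this chapter, plus a weak one.
for value in (0.736, -0.794, 0.2):
    print(value, "->", correlation_strength(value))
```

Under these bands, both 0.736 (Heating_Oil and Insulation) and -0.794 (Temperature and Insulation) are strong correlations; the sign does not affect the strength label.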

RapidMiner attempts to help us recognize correlation strengths through color coding.  In the
Figure 4-4 matrix, we can see that some of the cells are tinted with shades of purple in graduated
colors, in order to more strongly highlight those with stronger correlations.  It is important to
recognize that these are only general guidelines and not hard-and-fast rules.  A correlation
coefficient around 0.2 does show some interaction between attributes, even if it is too weak to
count as a meaningful correlation under these guidelines.  This should be kept in mind as we proceed to…
