Data Mining
for the Masses
68
REVIEW QUESTIONS
1) What are some of the limitations of correlation models?
2) What is a correlation coefficient? How is it interpreted?
3) What is the difference between a positive and a negative correlation? If two attributes have
values that decrease at essentially the same rate, is that a negative correlation? Why or why
not?
4) How is correlation strength measured? What are the ranges for strengths of correlation?
5) The number of heating oil consuming devices was suggested as a possibly interesting
attribute that could be added to the example data set for this chapter. Can you think of
others? Why might they be interesting? To what other attributes in the data set do you
think your suggested attributes might be correlated? What would be the value in knowing
if they are?
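As a reference point for the questions above, the coefficient usually meant by "correlation coefficient" is Pearson's r. The definition below is the standard one, not a formula quoted from this chapter; the symbols are assumptions chosen for illustration, with x and y standing for two numeric attributes measured over n observations, and the barred symbols for their means.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The result always lies between -1 and 1.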
EXERCISE
It is now your turn to develop a correlation model, generate a coefficient matrix, and analyze the
results. To complete this chapter’s exercise, follow the steps below.
1) Select a professional sporting organization that you enjoy, or of which you are aware.
Locate that organization’s web site and search it for statistics, facts and figures about the
athletes in that organization.
2) Open OpenOffice Calc and, starting in cell A1 and working across Row 1 of the spreadsheet,
define some attributes (at least three or four) to hold data about each athlete. Some possible
attributes you may wish to consider are annual_salary, points_per_game,
years_as_pro, height, weight, age, etc. The list is potentially unlimited, will vary based on
the type of sport you choose, and will depend on the data available to you on the web site
you’ve selected. Measurements of the athletes’ salaries and performance in competition are
likely to be the most interesting. You may include the athletes’ names; however, keep in
mind that correlations can only be calculated on numeric data, so the name attribute
would need to be filtered out of your data set before creating your correlation matrix.
(Remember the Select Attributes operator!)
3) Look up the statistics for each of your selected attributes and enter them as observations
into your spreadsheet. Try to find as many as you can—at least thirty is a good rule of
thumb in order to achieve at least a basic level of statistical validity. More is better.
4) Once you’ve created your data set, use the menu to save it as a CSV file. Click File, then
Save As. Enter a file name, and change ‘Save as type:’ to be Text CSV (.csv). Be sure to
save the file in your data mining data folder.
5) Open RapidMiner and import your data set into your RapidMiner repository. Name it
Chapter4Exercise, or something descriptive so that you will remember what data are
contained in the data set when you look in your repository.
6) Add the data set to a new process in RapidMiner. Ensure that the out port is connected to
a res port and run your model. Save your process with a descriptive name if you wish.
Examine your data in results perspective and ensure there are no missing, inconsistent, or
other potentially problematic data that might need to be handled as part of your Data
Preparation phase. Return to design perspective and handle any data preparation tasks that
may be necessary.
7) Add a Correlation Matrix operator to your stream and ensure that the mat port is
connected to a res port. Run your model again. Interpret your correlation coefficients as
displayed on the matrix tab.
8) Document your findings. What correlations exist? How strong are they? Are they
surprising to you and if so, why? What other attributes would you like to add? Are there
any you’d eliminate now that you’ve mined your data?
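If you would like to cross-check the matrix that RapidMiner produces, the same steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the chapter's method: the player names and figures are made up, the attribute names are borrowed from the suggestions in step 2, and the helper functions (load_numeric_columns, pearson, correlation_matrix) are hypothetical names invented for this sketch. It reads CSV text, drops the non-numeric name attribute, stops on missing values, and computes a Pearson correlation coefficient for every pair of attributes.

```python
import csv
import io
import math

# Made-up observations standing in for the CSV file saved from the spreadsheet.
CSV_TEXT = """name,annual_salary,points_per_game,years_as_pro
Player A,1.0,10.0,1
Player B,2.0,14.0,3
Player C,3.0,15.0,4
Player D,4.0,21.0,8
"""


def load_numeric_columns(text, drop=("name",)):
    """Read CSV text into {attribute: [float, ...]}, skipping dropped attributes."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {field: [] for field in reader.fieldnames if field not in drop}
    for row in reader:
        for field in columns:
            value = row[field]
            # Treat empty or absent cells as missing data to be fixed first.
            if value is None or value.strip() == "":
                raise ValueError(f"missing value for {field!r} in row {row}")
            columns[field].append(float(value))
    return columns


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {attribute: {attribute: coefficient}} for every attribute pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


columns = load_numeric_columns(CSV_TEXT)
matrix = correlation_matrix(columns)
for attribute, row in matrix.items():
    print(attribute, " ".join(f"{value:+.3f}" for value in row.values()))
```

To try it on your own data, replace CSV_TEXT with the contents of the file you saved in step 4. Four rows are used here only to keep the sketch short; the thirty-observation rule of thumb from step 3 still applies.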