Data Mining
for the Masses
5) How might the presence of outliers in the attributes of a data set influence the usefulness
of a k-Means clustering model? What could be done to address the problem?
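As a starting point for thinking about this question, recall that a k-Means centroid is the mean of the observations assigned to a cluster. The short Python sketch below uses invented values (they do not come from any data set in this text) to show how far a single extreme value can pull a mean, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical values of one attribute for the observations in one cluster.
typical = [48, 50, 52, 49, 51]
# The same observations plus one invented outlier.
with_outlier = typical + [950]

print(mean(typical))         # 50
print(mean(with_outlier))    # 200
print(median(with_outlier))  # 50.5
```

With the outlier included, the mean lands at 200, a value that describes none of the observations well. A centroid computed this way may be equally unrepresentative of its cluster.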
EXERCISE
Think of an example of a problem that could be at least partially addressed by being able to group
observations in a data set into clusters. Some examples might be grouping kids who might be at
risk for delinquency, grouping product sale volumes, grouping workers by productivity and
effectiveness, etc. Search the Internet or other resources available to you for a data set that would
allow you to investigate your question using a k-means model. As with all exercises in this text,
please ensure that you have permission to use any data set that might belong to your employer or
another entity. When you have secured your data set, complete the following steps:
1) Ensure that your data set is saved as a CSV file. Import your data set into your
RapidMiner repository and save it with a meaningful name. Drag it into a new process
window in RapidMiner.
2) Conduct any data preparation that you need for your data set. This may include handling
inconsistent data, dealing with missing values, or changing data types. Remember that in
order to calculate means, each attribute in your data set will need to be numeric. If, for
example, one of your attributes contains the values ‘yes’ and ‘no’, you may need to change
these to be 1 and 0 respectively, in order for the k-Means operator to work.
3) Connect a k-Means operator to your data set, configure your parameters (especially set
your k to something meaningful for your question) and then run your model.
4) Investigate your Centroid Table, Folder View, and the other evaluation tools.
5) Report your findings for your clusters. Discuss what is interesting about them and
describe what iterations of modeling you went through, such as experimentation with
different parameter values, to generate the clusters. Explain how your findings are relevant
to your original question.
Chapter 6:
k-Means Clustering
Challenge Step!
6) Experiment with the other k-Means operators in RapidMiner, such as Kernel or Fast.
How are they different from your original model? Did the use of these operators change
your clusters, and if so, how?
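For readers who would like to see the exercise steps outside of RapidMiner, the following is a minimal Python sketch of steps 1 through 4. It is not how RapidMiner's k-Means operator is implemented. It is a basic version of the general k-means procedure, written with only the standard library. The CSV contents, the column names (hours_trained, units_per_hour, certified) and the helper functions are all hypothetical and exist only for illustration.

```python
import csv
import io
import random

# Hypothetical CSV contents standing in for the file saved in step 1.
CSV_TEXT = """hours_trained,units_per_hour,certified
2,11,no
3,12,no
2.5,10,no
9,30,yes
10,31,yes
9.5,29,yes
"""

def encode(value):
    """Step 2: turn 'yes'/'no' into 1/0 so every attribute is numeric."""
    text = value.strip().lower()
    if text in ("yes", "no"):
        return 1.0 if text == "yes" else 0.0
    return float(text)

def squared_distance(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(rows, k, max_iterations=100, seed=0):
    """Step 3: assign each row to its nearest centroid, then re-average."""
    # Start from k observations picked at random (seeded for repeatability).
    centroids = random.Random(seed).sample(rows, k)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignments == assignments:  # no observation changed cluster
            break
        assignments = new_assignments
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

reader = csv.DictReader(io.StringIO(CSV_TEXT))
attribute_names = reader.fieldnames
rows = [[encode(record[name]) for name in attribute_names] for record in reader]

centroids, assignments = k_means(rows, k=2)

# Step 4: a plain-text stand-in for a centroid table.
for cluster_id, centroid in enumerate(centroids):
    size = assignments.count(cluster_id)
    print(cluster_id, size, dict(zip(attribute_names, centroid)))
```

Each printed line shows a cluster number, the count of observations in that cluster, and the mean of each attribute for its members. Because the distance calculation treats all attributes equally, an attribute measured on a much larger scale than the others can dominate the clustering. That is worth keeping in mind during data preparation.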
CHAPTER SEVEN:
DISCRIMINANT ANALYSIS
CONTEXT AND PERSPECTIVE
Gill runs a sports academy designed to help high school aged athletes achieve their maximum
athletic potential. On the boys’ side of his academy, he focuses on four major sports: Football,
Basketball, Baseball and Hockey. He has found that while many high school athletes enjoy
participating in a number of sports in high school, as they begin to consider playing a sport at the
college level, they would prefer to specialize in one sport. As he’s worked with athletes over the
years, Gill has developed an extensive data set, and he now is wondering if he can use past
performance from some of his previous clients to predict prime sports for up-and-coming high
school athletes. Ultimately, he hopes he can make a recommendation to each athlete as to the
sport in which they should most likely choose to specialize. By evaluating each athlete’s
performance across a battery of tests, Gill hopes we can help him figure out for which sport each
athlete has the highest aptitude.
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what discriminant analysis is, how it is used and the benefits of using it.
Recognize the necessary format for data in order to perform discriminant analysis.
Explain the differences and similarities between k-Means clustering and discriminant
analysis.
Develop a discriminant analysis data mining model in RapidMiner using a training data
set.
Interpret the model output and apply it to a scoring data set in order to deploy the model.
ORGANIZATIONAL UNDERSTANDING
Gill’s objective is to examine young athletes and, based upon their performance across a number
of metrics, help them decide which sport offers them the best chance of success as a specialty. Gill
recognizes that all of his clients possess some measure of athleticism, and that they enjoy
participating in a number of sports. Being young, athletic, and adaptive, most of his clients are
quite good at a number of sports, and he has seen over the years that some people are so naturally
gifted that they would excel in any sport they choose for specialization. Thus, he recognizes, as a
limitation of this data mining exercise, that he may not be able to use data to determine an athlete’s
“best” sport. Still, he has seen metrics and evaluations work in the past, and has seen that some of
his previous athletes really were predisposed to a certain sport, and that they were successful as
they went on to specialize in that sport. Based on his industry experience, he has decided to go
ahead with an experiment in mining data for athletic aptitude, and has enlisted our help.
DATA UNDERSTANDING
In order to begin to formulate a plan, we sit down with Gill to review his data assets. Every athlete
who has enrolled at Gill’s academy over the past several years has taken a test battery, which tested
for a number of athletic and personal traits. The battery has been administered to both boys and
girls participating in a number of different sports, but for this preliminary study we have decided
with Gill that we will look at data only for boys. Because the academy has been operating for
some time, Gill has the benefit of knowing which of his former pupils have gone on to specialize
in a single sport, and which sport it was for each of them. Working with Gill, we gather the results
of the batteries for all former clients who have gone on to specialize; Gill adds the sport each
person specialized in, and we have a data set comprising 493 observations containing the
following attributes:
Age: This is the age in years (one decimal precision for the part of the year since the
client’s last birthday) at the time that the athletic and personality trait battery test was
administered. Participants ranged in age from 13-19 years old at the time they took the
battery.
Strength: This is the participant’s strength measured through a series of weight lifting
exercises and recorded on a scale of 0-10, with 0 being limited strength and 10 being