Chapter 7:
Discriminant Analysis
sufficient strength to perform all lifts without any difficulty. No participant scored 8, 9 or
10, but some participants did score 0.
Quickness: This is the participant’s performance on a series of responsiveness tests.
Participants were timed on how quickly they were able to press buttons when they were
illuminated or to jump when a buzzer sounded. Their response times were tabulated on a
scale of 0-6, with 6 being extremely quick response and 0 being very slow. Participants
scored all along the spectrum for this attribute.
Injury: This is a simple yes (1) / no (0) column indicating whether or not the young athlete
had already suffered an athletic-related injury that was severe enough to require surgery or
other major medical intervention. Common injuries treated with ice, rest, stretching, etc.
were entered as 0. Injuries that took more than three weeks to heal, or that required physical
therapy or surgery, were flagged as 1.
Vision: Athletes were not only tested on the usual 20/20 vision scale using an eye chart,
but were also tested using eye-tracking technology to see how well they were able to pick
up objects visually. This test challenged participants to identify items that moved quickly
across their field of vision, and to estimate speed and direction of moving objects. Their
scores were recorded on a 0 to 4 scale with 4 being perfect vision and identification of
moving objects. No participant scored a perfect 4, but the scores did range from 0 to 3.
Endurance: Participants were subjected to an array of physical fitness tests including
running, calisthenics, aerobic and cardiovascular exercise, and distance swimming. Their
performance was rated on a scale of 0-10, with 10 representing the ability to perform all
tasks without fatigue of any kind. Scores ranged from 0 to 6 on this attribute. Gill has
acknowledged to us that even finely tuned professional athletes would not be able to score
a 10 on this portion of the battery, as it is specifically designed to test the limits of human
endurance.
Agility: This is the participant’s score on a series of tests of their ability to move, twist,
turn, jump, change direction, etc. The test checked the athlete’s ability to move nimbly,
precisely, and powerfully in a full range of directions. This metric is comprehensive in
nature, and is influenced by some of the other metrics, as agility is often dictated by one’s
strength, quickness, etc. Participants were scored between 0 and 100 on this attribute, and in
our data set from Gill, we have found performance between 13 and 80.
Decision_Making: This portion of the battery tests the athlete’s process of deciding what
to do in athletic situations. Athletes participated in simulations that tested their choices of
whether or not to swing a bat, pass a ball, move to a potentially advantageous location of a
playing surface, etc. Their scores were to have been recorded on a scale of 0 to 100,
though Gill has indicated that no one who completed the test should have been able to
score lower than a 3, as three points are awarded simply for successfully entering and
exiting the decision-making part of the battery. Gill knows that all 493 of his former
athletes represented in this data set successfully entered and exited this portion, but there
are a few scores lower than 3, and also a few over 100 in the data set, so we know we have
some data preparation in our future.
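The range check implied above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used in RapidMiner; the athlete IDs and scores below are made up, and the function name `split_by_range` is a hypothetical helper. The only values taken from the text are the bounds 3 and 100.

```python
# Hypothetical rows: (athlete_id, Decision_Making score). All values are invented.
scores = [(1, 55), (2, 2), (3, 100), (4, 104), (5, 3), (6, 0)]

# Valid range described in the text: at least 3 points, at most 100.
DM_MIN, DM_MAX = 3, 100


def split_by_range(rows, low=DM_MIN, high=DM_MAX):
    """Separate rows whose score lies in [low, high] from rows that do not."""
    valid = [row for row in rows if low <= row[1] <= high]
    invalid = [row for row in rows if not (low <= row[1] <= high)]
    return valid, invalid


valid_rows, invalid_rows = split_by_range(scores)
print("keep:  ", valid_rows)
print("review:", invalid_rows)
```

Whether the out-of-range observations should be dropped, corrected, or investigated is a judgment call; the sketch only separates them so that the decision can be made deliberately.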
Prime_Sport: This attribute is the sport each of the 493 athletes went on to specialize in
after they left Gill’s academy. This is the attribute Gill is hoping to be able to predict for
his current clients. For the boys in this study, this attribute will be one of four sports:
Football (American, not soccer; sorry soccer fans), Basketball, Baseball, or Hockey.
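Because Prime_Sport is known for Gill's former athletes but not for his current clients, the prediction task can be pictured as learning from records that carry a sport and then assigning a sport to records that do not. The Python sketch below illustrates that idea with a nearest-class-mean rule. This is a deliberately simplified stand-in, not the discriminant analysis model that RapidMiner builds: it ignores attribute scaling and covariance. The records use only two attributes (Quickness and Agility), and every value is invented for illustration.

```python
from collections import defaultdict
from math import dist

# Hypothetical labelled records: ((Quickness, Agility), Prime_Sport). Invented values.
labelled = [
    ((5, 70), "Basketball"),
    ((6, 74), "Basketball"),
    ((2, 30), "Football"),
    ((1, 26), "Football"),
]


def class_means(records):
    """Average each attribute separately for every sport."""
    groups = defaultdict(list)
    for features, sport in records:
        groups[sport].append(features)
    return {
        sport: tuple(sum(column) / len(column) for column in zip(*rows))
        for sport, rows in groups.items()
    }


def predict(features, means):
    """Return the sport whose attribute means are closest (Euclidean distance)."""
    return min(means, key=lambda sport: dist(features, means[sport]))


means = class_means(labelled)

# A hypothetical athlete whose Prime_Sport is not yet known.
prediction = predict((5, 68), means)
print(means)
print(prediction)
```

The point of the sketch is the division of labor: the labelled records determine the per-sport means, and the unlabelled record is only compared against them.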
As we analyze and familiarize ourselves with these data, we realize that all of the attributes with the
exception of Prime_Sport are numeric, and as such, we could exclude Prime_Sport and conduct a
k-means clustering data mining exercise on the data set. Doing this, we might be able to group
individuals into one sport cluster or another based on the means for each of the attributes in the
data set. However, having the Prime_Sport attribute gives us the ability to use a different type of
data mining model: Discriminant Analysis. Discriminant analysis is a lot like k-means clustering,
in that it groups observations together into like-types of values, but it also gives us something
more, and that is the ability to predict. Discriminant analysis then helps us cross that intersection
seen in the Venn diagram in Chapter 1 (Figure 1-2). It is still a data mining methodology for
classifying observations, but it classifies them in a predictive way. When we have a data set that
contains an attribute that we know is useful in predicting the same value for other observations
that do not yet have that attribute, then we can use training data and scoring data to mine
predictively. Training data are simply data sets that have that known prediction attribute. For the
observations in the training data set, the outcome of the prediction attribute is already known. The
prediction attribute is also sometimes referred to as the dependent attribute (or variable) or the
target attribute. It is the thing you are trying to predict. RapidMiner will ask us to set this
attribute to be the label when we build our model. Scoring data are the observations which have
all of the same attributes as the training data set, with the exception of the prediction attribute. We
can use the training data set to allow RapidMiner to evaluate the values of all our attributes in the
context of the resulting prediction variable (in this case, Prime_Sport), and then compare those
values to the scoring data set and predict the Prime_Sport for each observation in the scoring data