Data Mining for the Masses

Chapter 7: Discriminant Analysis
sufficient strength to perform all lifts without any difficulty.  No participant scored 8, 9 or
10, but some participants did score 0.


Quickness:  This is the participant’s performance on a series of responsiveness tests.
Participants were timed on how quickly they were able to press buttons when they were
illuminated or to jump when a buzzer sounded.  Their response times were tabulated on a
scale of 0-6, with 6 being extremely quick response and 0 being very slow.  Participants
scored all along the spectrum for this attribute.


Injury:  This is a simple yes (1) / no (0) column indicating whether or not the young athlete
had already suffered an athletic-related injury that was severe enough to require surgery or
other major medical intervention.  Common injuries treated with ice, rest, stretching, etc.
were entered as 0.  Injuries that took more than three weeks to heal, or that required physical
therapy or surgery, were flagged as 1.


Vision:  Athletes were not only tested on the usual 20/20 vision scale using an eye chart,
but were also tested using eye-tracking technology to see how well they were able to pick
up objects visually.  This test challenged participants to identify items that moved quickly
across their field of vision, and to estimate speed and direction of moving objects.  Their
scores were recorded on a 0 to 4 scale with 4 being perfect vision and identification of
moving objects.  No participant scored a perfect 4, but the scores did range from 0 to 3.


Endurance:  Participants were subjected to an array of physical fitness tests including
running, calisthenics, aerobic and cardiovascular exercise, and distance swimming.  Their
performance was rated on a scale of 0-10, with 10 representing the ability to perform all
tasks without fatigue of any kind.  Scores ranged from 0 to 6 on this attribute.  Gill has
acknowledged to us that even finely tuned professional athletes would not be able to score
a 10 on this portion of the battery, as it is specifically designed to test the limits of human
endurance.


Agility:  This is the participant’s score on a series of tests of their ability to move, twist,
turn, jump, change direction, etc.  The test checked the athlete’s ability to move nimbly,
precisely, and powerfully in a full range of directions.  This metric is comprehensive in
nature, and is influenced by some of the other metrics, as agility is often dictated by one’s
strength, quickness, etc.  Participants were scored between 0 and 100 on this attribute, and
in our data set from Gill, we have found performance between 13 and 80.


Decision_Making:  This portion of the battery tests the athlete’s process of deciding what
to do in athletic situations.  Athletes participated in simulations that tested their choices of
whether or not to swing a bat, pass a ball, move to a potentially advantageous location of a
playing surface, etc.  Their scores were to have been recorded on a scale of 0 to 100,
though Gill has indicated that no one who completed the test should have been able to
score lower than a 3, as three points are awarded simply for successfully entering and
exiting the decision-making part of the battery.  Gill knows that all 493 of his former
athletes represented in this data set successfully entered and exited this portion, but there
are a few scores lower than 3, and also a few over 100 in the data set, so we know we have
some data preparation in our future.


Prime_Sport:  This attribute is the sport each of the 493 athletes went on to specialize in
after they left Gill’s academy.  This is the attribute Gill is hoping to be able to predict for
his current clients.  For the boys in this study, this attribute will be one of four sports:
Football (American, not soccer; sorry soccer fans), Basketball, Baseball, or Hockey.
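
The score ranges given in the attribute descriptions above can be turned into a simple range check, which is one way to find the Decision_Making values below 3 or above 100.  The following is only a minimal sketch in Python, not the RapidMiner process used in this book.  The function names and the three sample observations are hypothetical and were invented for illustration; only the attribute names and ranges come from the descriptions above.

```python
# Documented score ranges, taken from the attribute descriptions above.
# Decision_Making starts at 3 because three points are awarded simply for
# entering and exiting that part of the battery.
VALID_RANGES = {
    "Quickness": (0, 6),
    "Injury": (0, 1),
    "Vision": (0, 4),
    "Endurance": (0, 10),
    "Agility": (0, 100),
    "Decision_Making": (3, 100),
}


def out_of_range(row):
    """Return the names of attributes whose values fall outside their documented range."""
    return [
        name
        for name, (low, high) in VALID_RANGES.items()
        if name in row and not low <= row[name] <= high
    ]


def keep_valid(rows):
    """Keep only the rows in which every checked attribute is in range."""
    return [row for row in rows if not out_of_range(row)]


# Hypothetical observations, invented for illustration only.
sample_rows = [
    {"Quickness": 4, "Injury": 0, "Vision": 2, "Endurance": 5, "Agility": 45, "Decision_Making": 62},
    {"Quickness": 3, "Injury": 1, "Vision": 1, "Endurance": 4, "Agility": 30, "Decision_Making": 2},
    {"Quickness": 5, "Injury": 0, "Vision": 3, "Endurance": 6, "Agility": 71, "Decision_Making": 104},
]

clean_rows = keep_valid(sample_rows)
print(len(clean_rows), "of", len(sample_rows), "rows are within the documented ranges")
```

Dropping the inconsistent rows is only one option; whether to remove, correct, or keep such observations is a judgment call that depends on how many there are and on what Gill can tell us about them.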

As we analyze and familiarize ourselves with these data, we realize that all of the attributes with the
exception of Prime_Sport are numeric, and as such, we could exclude Prime_Sport and conduct a
k-means clustering data mining exercise on the data set.  Doing this, we might be able to group
individuals into one sport cluster or another based on the means for each of the attributes in the
data set.  However, having the Prime_Sport attribute gives us the ability to use a different type of
data mining model: Discriminant Analysis.  Discriminant analysis is a lot like k-means clustering,
in that it groups observations together into like-types of values, but it also gives us something
more, and that is the ability to predict.  Discriminant analysis then helps us cross that intersection
seen in the Venn diagram in Chapter 1 (Figure 1-2).  It is still a data mining methodology for
classifying observations, but it classifies them in a predictive way.  When we have a data set that
contains an attribute whose values we already know, and we want to predict that same attribute for
other observations that do not yet have it, then we can use training data and scoring data to mine
predictively.  Training data are simply data sets that have that known prediction attribute.  For the
observations in the training data set, the outcome of the prediction attribute is already known.  The
prediction attribute is also sometimes referred to as the dependent attribute (or variable) or the
target attribute.  It is the thing you are trying to predict.  RapidMiner will ask us to set this
attribute to be the label when we build our model.  Scoring data are the observations which have
all of the same attributes as the training data set, with the exception of the prediction attribute.  We
can use the training data set to allow RapidMiner to evaluate the values of all our attributes in the
context of the resulting prediction variable (in this case, Prime_Sport), and then compare those
values to the scoring data set and predict the Prime_Sport for each observation in the scoring data
set.
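
The split between training data and scoring data can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not what RapidMiner does internally: it uses a nearest-class-mean rule as a simplified stand-in for discriminant analysis (the two agree only when every group is assumed to have the same spherical spread and to be equally likely).  The observations, the choice of four attributes, and the function names are all hypothetical.

```python
from math import dist

# Attributes used as predictors in this sketch (an arbitrary subset).
FEATURES = ["Quickness", "Vision", "Endurance", "Agility"]

# Training data: the label (Prime_Sport) is already known.  Values are invented.
training = [
    {"Quickness": 5, "Vision": 3, "Endurance": 4, "Agility": 70, "Prime_Sport": "Basketball"},
    {"Quickness": 6, "Vision": 2, "Endurance": 5, "Agility": 74, "Prime_Sport": "Basketball"},
    {"Quickness": 2, "Vision": 1, "Endurance": 3, "Agility": 30, "Prime_Sport": "Football"},
    {"Quickness": 3, "Vision": 1, "Endurance": 2, "Agility": 34, "Prime_Sport": "Football"},
]

# Scoring data: same attributes, but no Prime_Sport yet.  Values are invented.
scoring = [
    {"Quickness": 5, "Vision": 2, "Endurance": 5, "Agility": 68},
    {"Quickness": 2, "Vision": 1, "Endurance": 3, "Agility": 35},
]


def class_means(rows, label="Prime_Sport"):
    """Average each feature separately for every value of the label."""
    sums, counts = {}, {}
    for row in rows:
        sport = row[label]
        counts[sport] = counts.get(sport, 0) + 1
        totals = sums.setdefault(sport, [0.0] * len(FEATURES))
        for i, name in enumerate(FEATURES):
            totals[i] += row[name]
    return {sport: [t / counts[sport] for t in totals] for sport, totals in sums.items()}


def predict(row, means):
    """Assign the label whose mean vector is closest (Euclidean distance) to the row."""
    point = [row[name] for name in FEATURES]
    return min(means, key=lambda sport: dist(point, means[sport]))


means = class_means(training)                            # learn from the training data
predictions = [predict(row, means) for row in scoring]   # apply to the scoring data
print(predictions)
```

Because Agility is measured on a much wider scale than the other attributes, it dominates the distance in this sketch; a real analysis would account for the spread of each attribute rather than using raw distances.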
