Data Mining for the Masses


Chapter 7: Discriminant Analysis
confidence percentages for each of our four target sports. If RapidMiner had found some
significant possibility that an observation might have more than one possible Prime_Sport, it
would have calculated the percent probability that the person represented by that observation
would succeed in each of those sports. For example, if an observation yielded a statistical
possibility that the Prime_Sport for a person could have been any of the four, but Baseball was the
strongest statistically, the confidence attributes on that observation might be: confidence(Football):
8%; confidence(Baseball): 69%; confidence(Hockey): 12%; confidence(Basketball): 11%. In some
predictive data mining models (including some later in this text), your data will yield partial
confidence percentages such as these. This phenomenon did not occur, however, in the data sets we
used for this chapter’s example. This is most likely explained by the fact discussed earlier in the
chapter: all athletes will display some measure of aptitude in many sports, and so their battery test
scores will likely be varied across the specializations. In statistical language, this is often referred to
as heterogeneity.
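
To make the arithmetic in the example concrete, here is a minimal Python sketch that uses the hypothetical confidence values quoted above. The dictionary and function names are illustrative assumptions, not RapidMiner output. The sketch only shows that the four confidences account for the whole observation and that, under the assumption that the prediction is the class with the highest confidence, the predicted Prime_Sport would be Baseball.

```python
# Hypothetical confidence values taken from the example in the text.
confidences = {
    "Football": 0.08,
    "Baseball": 0.69,
    "Hockey": 0.12,
    "Basketball": 0.11,
}


def predicted_class(conf):
    """Return the label with the highest confidence (assumed decision rule)."""
    return max(conf, key=conf.get)


total = sum(confidences.values())
prediction = predicted_class(confidences)

print(f"Total confidence: {total:.0%}")
print(f"Predicted Prime_Sport: {prediction}")
```

When one class carries all of the confidence, as happened with this chapter's data, the same rule still applies; there is simply no runner-up to compare against.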

Not finding confidence percentages does not mean that our experiment has been a failure,
however. The fifth new attribute, generated by RapidMiner when we applied our LDA model to
our scoring data, is the prediction of Prime_Sport for each of our 1,767 boys. Click on the Data
View radio button, and you will see that RapidMiner has applied our discriminant analysis model to
our scoring data, resulting in a predicted Prime_Sport for each boy based on the specialization
sport of previous academy attendees (Figure 7-17).

Figure 7-17. Prime_Sport predictions for each boy in the scoring data set.

DEPLOYMENT

Gill now has a data set with a prediction for each boy who has been tested using the athletic battery
at his academy. What to do with these predictions will be a matter of some thought and
discussion. Gill can extract these data from RapidMiner and relate them back to each boy
individually. For relatively small data sets, such as this one, we could move the results into a
spreadsheet by simply copying and pasting them. Just as a quick exercise in moving results to
other formats, try this:

1) Open a blank OpenOffice Calc spreadsheet.

2) In RapidMiner, click on the 1 under Row No. in Data View of the results perspective (the cell
will turn gray).

3) Press Ctrl+A (the keyboard command for ‘select all’ in Windows; you can use the equivalent
keyboard command for Mac or Linux as well). All cells in Data View will turn gray.

4) Press Ctrl+C (or the equivalent keyboard command for ‘copy’ if not using Windows).

5) In your blank OpenOffice Calc spreadsheet, right click in cell A1 and choose Paste
Special… from the context menu.

6) In the pop up dialog box, select Unformatted Text, then click OK.

7) A Text Import pop up dialog box will appear with a preview of the RapidMiner data.
Accept the defaults by clicking OK. The data will be pasted into the spreadsheet. The
attribute names will have to be transcribed and added to the top row of the spreadsheet,
but the data are now available outside of RapidMiner. Gill can match each prediction back
to each boy in the scoring data set. The data are still in order, but remember that a few
were removed because of inconsistent data, so care should be exercised when matching
the predictions back to the boys represented by each observation. Bringing a unique
identifying number into the training and scoring data sets might aid the matching once
predictions have been generated. This will be demonstrated in an upcoming chapter’s
example.
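
As a rough illustration of why a unique identifier helps, the Python sketch below joins predictions to a roster by ID rather than by row position, and writes the result with a header row so the attribute names travel with the data. The IDs, names, and predictions are made up, and the column names are assumptions for illustration only; this is not how RapidMiner exports data.

```python
import csv
import io

# Hypothetical roster of boys in the scoring data, keyed by a unique ID.
roster = {
    101: "Boy A",
    102: "Boy B",
    103: "Boy C",  # assume this observation was removed for inconsistent data
    104: "Boy D",
}

# Hypothetical predictions copied out of the results; ID 103 is absent.
predictions = [
    (101, "Baseball"),
    (102, "Hockey"),
    (104, "Football"),
]


def match_predictions(roster, predictions):
    """Join predictions to the roster by ID instead of by row order."""
    predicted = dict(predictions)
    rows = []
    for boy_id, name in sorted(roster.items()):
        # Boys without a prediction are kept and marked explicitly.
        rows.append((boy_id, name, predicted.get(boy_id, "no prediction")))
    return rows


matched = match_predictions(roster, predictions)

# Write a header row first so the attribute names do not need retyping.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "Name", "prediction(Prime_Sport)"])
writer.writerows(matched)
print(buffer.getvalue())
```

Had the predictions been lined up against the roster by position, Boy C would have received Boy D's prediction; matching on the ID avoids that kind of silent shift.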

Chapter 14 of this book will spend some time talking about ethics in data mining. As previously
mentioned, Gill’s use of these predictions is going to require some thought and discussion. Is it
ethical to push one of his young clients in the direction of one specific sport based on our model’s
prediction that that activity is a good match for the boy? Simply because previous academy
attendees went on to specialize in one sport or another, can we assume that current clients would
follow the same path? The final chapter will offer some suggestions for ways to answer such
questions, but it is wise for us to at least consider them now in the context of the chapter
examples.

It is likely that Gill, being experienced at working with young athletes and recognizing their
strengths and weaknesses, will be able to use our predictions in an ethical way. Perhaps he can
begin by grouping his clients by their predicted Prime_Sports and administering more
‘sport-specific’ drills (say, jumping tests for basketball, skating for hockey, throwing and catching
for baseball, etc.). This may allow him to capture more specific data on each athlete, or even to simply
observe whether or not the predictions based on the data are in fact consistent with observable
performance on the field, court, or ice. This is an excellent example of why the CRISP-DM
approach is cyclical: the predictions we’ve generated for Gill are a starting point for a new round of
assessment and evaluation, not the ending or culminating point. Discriminant analysis has given
Gill some idea about where his young proteges may have strengths, and this can point him in
certain directions when working with each of them, but he will inevitably gather more data and
learn whether or not the use of this data mining methodology and approach is helpful in guiding
his clients to a sport in which they might choose to specialize as they mature.
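
The follow-up described above, grouping clients by predicted Prime_Sport and then checking the predictions against observed performance, could be sketched in Python as follows. All records are invented for illustration, and the "observed" column is an assumption about what Gill might record after sport-specific drills; nothing here comes from the chapter's data set.

```python
from collections import defaultdict

# Hypothetical records: (boy ID, predicted Prime_Sport, sport in which the
# boy performed best in follow-up drills). All values are made up.
records = [
    (101, "Baseball", "Baseball"),
    (102, "Hockey", "Basketball"),
    (104, "Football", "Football"),
    (105, "Baseball", "Baseball"),
    (106, "Hockey", "Hockey"),
]


def group_by_prediction(records):
    """Collect boy IDs under the sport the model predicted for them."""
    groups = defaultdict(list)
    for boy_id, predicted, _observed in records:
        groups[predicted].append(boy_id)
    return dict(groups)


def agreement_rate(records):
    """Share of boys whose observed best sport matches the prediction."""
    matches = sum(1 for _id, predicted, observed in records if predicted == observed)
    return matches / len(records)


groups = group_by_prediction(records)
rate = agreement_rate(records)

for sport, boy_ids in groups.items():
    print(f"{sport}: {boy_ids}")
print(f"Predictions consistent with observed performance: {rate:.0%}")
```

A low agreement rate would be a signal to return to earlier CRISP-DM phases rather than a verdict on any individual boy.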

CHAPTER SUMMARY

Discriminant analysis helps us to cross the threshold between Classification and Prediction in data
mining. Prior to Chapter 7, our data mining models and methodologies focused primarily on
categorization of data. With Discriminant Analysis, we can take a process that is very similar in
nature to k-means clustering, and with the right target attribute in a training data set, generate
