Chapter 7:
Discriminant Analysis
confidence percentages for each of our four target sports. If RapidMiner had found some
significant possibility that an observation might have more than one possible Prime_Sport, it
would have calculated the percent probability that the person represented by an observation would
succeed in one sport and in the others. For example, if an observation yielded a statistical
possibility that the Prime_Sport for a person could have been any of the four, but Baseball was the
strongest statistically, the confidence attributes on that observation might be: confidence(Football):
8%; confidence(Baseball): 69%; confidence(Hockey): 12%; confidence(Basketball): 11%. In some
predictive data mining models (including some later in this text), your data will yield partial
confidence percentages such as these. This phenomenon did not occur, however, in the data sets we
used for this chapter’s example. This is most likely explained by the fact discussed earlier in the
chapter: all athletes will display some measure of aptitude in many sports, and so their battery test
scores will likely be varied across the specializations. In statistical language, this is often referred to
as heterogeneity.
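To make the idea of confidence percentages concrete, here is a minimal Python sketch built on the hypothetical values from the example above. It assumes that the predicted label is simply the class with the highest confidence, which is a common decision rule for classifiers; the variable and function names are illustrative and are not part of RapidMiner.

```python
# Hypothetical confidence values taken from the example in the text.
confidences = {
    "Football": 0.08,
    "Baseball": 0.69,
    "Hockey": 0.12,
    "Basketball": 0.11,
}


def predict_from_confidences(conf):
    """Return the label with the highest confidence (assumed decision rule)."""
    return max(conf, key=conf.get)


# The confidences for a single observation should add up to 100%.
total = sum(confidences.values())
prediction = predict_from_confidences(confidences)
print(prediction, round(total, 2))
```

When one class receives a confidence of 100% and the others 0%, as in this chapter's results, the same rule still applies; there is simply no ambiguity to resolve.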
Not finding partial confidence percentages does not mean that our experiment has been a failure,
however. The fifth new attribute, generated by RapidMiner when we applied our LDA model to
our scoring data, is the prediction of Prime_Sport for each of our 1,767 boys. Click on the Data
View radio button, and you will see that RapidMiner has applied our discriminant analysis model to
our scoring data, resulting in a predicted Prime_Sport for each boy based on the specialization
sport of previous academy attendees (Figure 7-17).
Figure 7-17. Prime_Sport predictions for each boy in the scoring data set.
Data Mining
for the Masses
DEPLOYMENT
Gill now has a data set with a prediction for each boy who has been tested using the athletic battery
at his academy. What to do with these predictions will be a matter of some thought and
discussion. Gill can extract these data from RapidMiner and relate them back to each boy
individually. For relatively small data sets, such as this one, we could move the results into a
spreadsheet by simply copying and pasting them. Just as a quick exercise in moving results to
other formats, try this:
1) Open a blank OpenOffice Calc spreadsheet.
2) In RapidMiner, click on the 1 under Row No. in Data View of the results perspective (the cell
will turn gray).
3) Press Ctrl+A (the keyboard command for ‘select all’ in Windows; you can use the equivalent
keyboard command for Mac or Linux as well). All cells in Data View will turn gray.
4) Press Ctrl+C (or the equivalent keyboard command for ‘copy’ if not using Windows).
5) In your blank OpenOffice Calc spreadsheet, right click in cell A1 and choose Paste
Special… from the context menu.
6) In the pop up dialog box, select Unformatted Text, then click OK.
7) A Text Import pop up dialog box will appear with a preview of the RapidMiner data.
Accept the defaults by clicking OK. The data will be pasted into the spreadsheet. The
attribute names will have to be transcribed and added to the top row of the spreadsheet,
but the data are now available outside of RapidMiner. Gill can match each prediction back
to each boy in the scoring data set. The data are still in order, but remember that a few
observations were removed because of inconsistent data, so care should be exercised when matching
the predictions back to the boys represented by each observation. Bringing a unique
identifying number into the training and scoring data sets might aid the matching once
predictions have been generated. This will be demonstrated in an upcoming chapter’s
example.
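The matching step described above can be sketched in Python. This is only an illustration under stated assumptions: the Athlete_ID column, the names, the prediction(Prime_Sport) column label, and the sample rows are all hypothetical and do not come from the chapter's data sets. Matching on a unique identifier rather than on row position means that observations removed during data preparation cannot shift a prediction onto the wrong boy.

```python
import csv
import io

# Hypothetical scoring roster; Athlete_ID is an assumed unique identifier.
roster_csv = """Athlete_ID,Name
101,Boy A
102,Boy B
103,Boy C
"""

# Hypothetical exported predictions. ID 102 is absent, as if that
# observation had been removed because of inconsistent data.
predictions_csv = """Athlete_ID,prediction(Prime_Sport)
101,Baseball
103,Hockey
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


# Index predictions by ID so that lookups do not depend on row order.
predicted = {
    row["Athlete_ID"]: row["prediction(Prime_Sport)"]
    for row in read_rows(predictions_csv)
}

matched = []
for boy in read_rows(roster_csv):
    # Boys without a prediction are kept and flagged with None.
    sport = predicted.get(boy["Athlete_ID"])
    matched.append((boy["Athlete_ID"], boy["Name"], sport))

for record in matched:
    print(record)
```

Keeping unmatched boys in the output, flagged with None, makes it obvious which observations were dropped instead of silently omitting them.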
Chapter 14 of this book will spend some time talking about ethics in data mining. As previously
mentioned, Gill’s use of these predictions is going to require some thought and discussion. Is it
ethical to push one of his young clients in the direction of one specific sport based on our model’s
prediction that that activity is a good match for the boy? Simply because previous academy
attendees went on to specialize in one sport or another, can we assume that current clients would
follow the same path? The final chapter will offer some suggestions for ways to answer such
questions, but it is wise for us to at least consider them now in the context of the chapter
examples.
It is likely that Gill, being experienced at working with young athletes and recognizing their
strengths and weaknesses, will be able to use our predictions in an ethical way. Perhaps he can
begin by grouping his clients by their predicted Prime_Sports and administering more ‘sport-
specific’ drills—say, jumping tests for basketball, skating for hockey, throwing and catching for
baseball, etc. This may allow him to capture more specific data on each athlete, or even to simply
observe whether or not the predictions based on the data are in fact consistent with observable
performance on the field, court, or ice. This is an excellent example of why the CRISP-DM approach is
cyclical: the predictions we’ve generated for Gill are a starting point for a new round of
assessment and evaluation, not the ending or culminating point. Discriminant analysis has given
Gill some idea about where his young proteges may have strengths, and this can point him in
certain directions when working with each of them, but he will inevitably gather more data and
learn whether or not the use of this data mining methodology and approach is helpful in guiding
his clients to a sport in which they might choose to specialize as they mature.
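Grouping clients by predicted sport, as suggested above, is a simple operation once the predictions are outside RapidMiner. Here is a minimal Python sketch, assuming the predictions are available as pairs of an identifier and a predicted Prime_Sport; the identifiers and rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical (athlete identifier, predicted Prime_Sport) pairs.
predictions = [
    (101, "Baseball"),
    (102, "Hockey"),
    (103, "Baseball"),
    (104, "Basketball"),
]

# Collect athlete identifiers under each predicted sport.
groups = defaultdict(list)
for athlete_id, sport in predictions:
    groups[sport].append(athlete_id)

# List each sport with the athletes assigned to its drill group.
for sport in sorted(groups):
    print(sport, groups[sport])
```

Each resulting group could then be given its own sport-specific drills, and the new observations fed back into the next round of modeling.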
CHAPTER SUMMARY
Discriminant analysis helps us to cross the threshold between Classification and Prediction in data
mining. Prior to Chapter 7, our data mining models and methodologies focused primarily on
categorization of data. With Discriminant Analysis, we can take a process that is very similar in
nature to k-means clustering, and with the right target attribute in a training data set, generate