Chapter 7: Discriminant Analysis
set. That may seem a little confusing, but our chapter example should help clarify it, so let’s move
on to the next CRISP-DM step.
DATA PREPARATION
This chapter’s example will be a slight divergence from other chapters. Instead of there being a
single example data set in CSV format for you to download, there are two this time. You can
access the Chapter 7 data sets on the book’s companion web site
(https://sites.google.com/site/dataminingforthemasses/).
They are labeled Chapter07DataSet_Scoring.csv and Chapter07DataSet_Training.csv. Go ahead
and download those now, and import both of them into your RapidMiner repository as you have
in past chapters. Be sure to designate the first row of each data set as the attribute names as you
import them. Also give each of the two data sets a descriptive name, so that you can tell
they are for Chapter 7 and can tell the training data set apart from
the scoring data set. After importing them, drag only the training data set into a new process
window, and then follow the steps below to prepare for and create a discriminant analysis data
mining model.
1) Thus far, when we have added data to a new process, we have allowed the operator to
simply be labeled ‘Retrieve’, which is done by RapidMiner by default. For the first time, we
will have more than one Retrieve operator in our model, because we have a training data
set and a scoring data set. In order to easily differentiate between the two, let’s start by
renaming the Retrieve operator for the training data set that you’ve dragged and dropped
into your main process window. Right click on this operator and select Rename. You will
then be able to type in a new name for this operator. For this example, we will name the
operator ‘Training’, as is depicted in Figure 7-1.
Figure 7-1. Our Retrieve operator renamed as ‘Training’.
2) We know from our Data Preparation phase that we have some data that need to be fixed
before we can mine this data set. Specifically, Gill noticed some inconsistencies in the
Decision_Making attribute. Run your model and let’s examine the meta data, as seen in
Figure 7-2.
Figure 7-2. Identifying inconsistent data in the Decision_Making attribute.
3) While still in results perspective, switch to the Data View radio button. Click on the
column heading for the Decision_Making attribute. This will sort the attribute from
smallest to largest (note the small triangle indicating that the data are sorted in ascending
order using this attribute). In this view (Figure 7-3) we see that we have three observations
with scores smaller than three. We will need to handle these observations.
Figure 7-3. The data set sorted in ascending order by the Decision_Making attribute.
4) Click on the Decision_Making attribute again. This will re-sort the attribute in descending
order. Again, we have some values that need to be addressed (Figure 7-4).
Figure 7-4. The Decision_Making variable, re-sorted in descending order.
5) Switch back to design perspective. Let’s address these inconsistent data by removing them
from our training data set. We could set these inconsistent values to missing, and then set
missing values to another value, such as the mean, but in this instance we don’t really know
what should have been in this variable, so changing these to the mean seems a bit arbitrary.
Removing these inconsistencies means removing only 11 of our 493 observations, so rather
than risk using bad data, we will simply remove them. To do this, add two Filter Examples
operators in a row to your stream. For each of these, set the condition class to
attribute_value_filter, and for the parameter strings, enter ‘Decision_Making>=3’ (without
single quotes) for the first one, and ‘Decision_Making<=100’ for the second one. This
will reduce our training data set down to 482 observations. The set-up described in this
step is shown in Figure 7-5.
Figure 7-5. Filtering out observations with inconsistent data.
6) If you would like, you can run the model to confirm that your number of observations
(examples) has been reduced to 482. Then, in design perspective, use the search field in
the Operators tab to look for ‘Discriminant’ and locate the operator for Linear
Discriminant Analysis. Add this operator to your stream, as shown in Figure 7-6.
Figure 7-6. Addition of the Linear Discriminant Analysis operator to the model.
7) The tra port on the LDA (or Linear Discriminant Analysis) operator indicates that this tool
does expect to receive input from a training data set like the one we’ve provided, but
despite this, we still have received two errors, as indicated by the black arrow at the bottom
of the Figure 7-6 image. The first error is because of our Prime_Sport attribute. It is data
typed as polynominal, and LDA expects its predictor attributes to be numeric. This is OK, because the
attribute being predicted can have a polynominal data type, and Prime_Sport is the
one we want to predict. This error will be resolved shortly, because it is related to
the second error, which tells us that the LDA operator wants one of our attributes to be
designated as a ‘label’. In RapidMiner, the label is the attribute that you want to predict.
At the time that we imported our data set, we could have designated the Prime_Sport
attribute as a label, rather than as a normal attribute, but it is very simple to change an
attribute’s role right in your stream. Using the search field in the Operators tab, search for
an operator called Set Role. Add this to your stream and then in the parameters area on
the right side of the window, select Prime_Sport in the name field, and in target role, select
label. We still have a warning (which does not prevent us from continuing), but you will
see the errors have now disappeared at the bottom of the RapidMiner window (Figure 7-7).
Figure 7-7. Setting an attribute’s role in RapidMiner.
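For readers who would like to see the logic of these steps outside of RapidMiner, the
preparation above can be sketched in a few lines of Python. This is only an illustration,
not how RapidMiner works internally. The rows are synthetic stand-ins rather than the
contents of the chapter’s training file, the sport names are placeholders, and the split of
the 11 out-of-range observations into three low values and eight high values is an
assumption (the chapter only states that three scores fall below three). The function
names are hypothetical and chosen to echo the operators used in steps 3 through 7.

```python
import random

VALID_MIN, VALID_MAX = 3, 100  # bounds used by the two Filter Examples operators


def make_synthetic_training(n_valid=482, n_low=3, n_high=8, seed=7):
    """Return stand-in rows; every value here is invented for illustration."""
    rng = random.Random(seed)
    sports = ["Sport_A", "Sport_B", "Sport_C", "Sport_D"]  # placeholder label values
    values = (
        [rng.randint(VALID_MIN, VALID_MAX) for _ in range(n_valid)]
        + [rng.randint(0, VALID_MIN - 1) for _ in range(n_low)]     # too small
        + [rng.randint(VALID_MAX + 1, 150) for _ in range(n_high)]  # too large
    )
    rng.shuffle(values)
    return [{"Decision_Making": v, "Prime_Sport": rng.choice(sports)} for v in values]


def out_of_range(rows):
    """Mimic steps 3 and 4: sort the attribute and inspect both extremes."""
    ordered = sorted(r["Decision_Making"] for r in rows)
    low = [v for v in ordered if v < VALID_MIN]
    high = [v for v in ordered if v > VALID_MAX]
    return low, high


def filter_examples(rows, keep):
    """Mimic one Filter Examples operator: keep rows where the condition holds."""
    return [r for r in rows if keep(r)]


def set_label(rows, label_name):
    """Mimic Set Role: separate the predictor attributes from the label."""
    predictors = [{k: v for k, v in r.items() if k != label_name} for r in rows]
    labels = [r[label_name] for r in rows]
    return predictors, labels


training = make_synthetic_training()
low, high = out_of_range(training)

# Two filters in a row, as in step 5.
at_least_min = filter_examples(training, lambda r: r["Decision_Making"] >= VALID_MIN)
cleaned = filter_examples(at_least_min, lambda r: r["Decision_Making"] <= VALID_MAX)

predictors, labels = set_label(cleaned, "Prime_Sport")
print(len(training), len(low) + len(high), len(cleaned))  # prints: 493 11 482
```

Applying the two conditions one after the other mirrors the two Filter Examples operators
placed in a row; a single combined condition would keep the same 482 rows.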
With our inconsistent data removed and our errors resolved, we are now prepared to move on
to…