Data Mining for the Masses


Chapter 7: Discriminant Analysis
set.  That may seem a little confusing, but our chapter example should help clarify it, so let’s move
on to the next CRISP-DM step.

DATA PREPARATION

This chapter’s example will be a slight divergence from other chapters. Instead of there being a
single example data set in CSV format for you to download, there are two this time. You can
access the Chapter 7 data sets on the book’s companion web site
(https://sites.google.com/site/dataminingforthemasses/).

They are labeled Chapter07DataSet_Scoring.csv and Chapter07DataSet_Training.csv. Go ahead
and download those now, and import both of them into your RapidMiner repository as you have
in past chapters. As you import them, be sure to designate the first row of each data set as the
attribute names. Also give each of the two data sets a descriptive name, so that you can tell
they are for Chapter 7 and can tell the training data set apart from the scoring data set.
After importing them, drag only the training data set into a new process window, and then
follow the steps below to prepare for and create a discriminant analysis data mining model.
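For readers who want to peek at a file outside RapidMiner, the idea of treating the first row as attribute names can be sketched in Python. This is only an illustration under stated assumptions: the helper name load_examples and the two sample rows are made up, and only the attribute names Decision_Making and Prime_Sport come from this chapter.

```python
import csv
import io


def load_examples(file_obj):
    """Read a CSV file whose first row holds the attribute names."""
    # DictReader uses the first row as the keys for every later row.
    return list(csv.DictReader(file_obj))


# Made-up stand-in for the contents of Chapter07DataSet_Training.csv.
sample_training_csv = io.StringIO(
    "Decision_Making,Prime_Sport\n"
    "45,Football\n"
    "62,Basketball\n"
)

training = load_examples(sample_training_csv)
attribute_names = list(training[0].keys())
print(attribute_names)
```

With the real download, the same helper could be given an open file handle, for example one obtained from open("Chapter07DataSet_Training.csv", newline="").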

1) Thus far, when we have added data to a new process, we have allowed the operator to
simply be labeled ‘Retrieve’, which RapidMiner does by default. For the first time, we
will have more than one Retrieve operator in our model, because we have a training data
set and a scoring data set. In order to easily differentiate between the two, let’s start by
renaming the Retrieve operator for the training data set that you’ve dragged and dropped
into your main process window. Right click on this operator and select Rename. You will
then be able to type in a new name for this operator. For this example, we will name the
operator ‘Training’, as is depicted in Figure 7-1.

Figure 7-1. Our Retrieve operator renamed as ‘Training’.

2) We know from our Data Preparation phase that we have some data that need to be fixed
before we can mine this data set. Specifically, Gill noticed some inconsistencies in the
Decision_Making attribute. Run your model and examine the meta data, as seen in
Figure 7-2.

Figure 7-2. Identifying inconsistent data in the Decision_Making attribute.

3) While still in results perspective, switch to the Data View radio button. Click on the
column heading for the Decision_Making attribute. This will sort the attribute from
smallest to largest (note the small triangle indicating that the data are sorted in ascending
order using this attribute). In this view (Figure 7-3) we see that we have three observations
with scores smaller than three. We will need to handle these observations.

Figure 7-3. The data set sorted in ascending order by the Decision_Making attribute.

4) Click on the Decision_Making column heading again. This will re-sort the attribute in
descending order. Again, we have some values that need to be addressed (Figure 7-4).

Figure 7-4. The Decision_Making variable, re-sorted in descending order.
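The sort-and-inspect check from the two preceding steps can be sketched in a few lines of Python. The scores below are made up for illustration; only the attribute name and the two bounds (3 and 100, which are the bounds used when the data are filtered) come from this chapter.

```python
# Made-up Decision_Making scores; they are not taken from the training data set.
scores = [45, 2, 62, 101, 30, 1, 88]

# Ascending order puts suspiciously small values at the top of the list.
ascending = sorted(scores)
# Descending order puts suspiciously large values at the top of the list.
descending = sorted(scores, reverse=True)

too_low = [s for s in ascending if s < 3]
too_high = [s for s in descending if s > 100]
print("below 3:", too_low, "above 100:", too_high)
```

Sorting in both directions is simply a quick way to bring the extremes of an attribute into view, which is what clicking the column heading twice accomplishes in the Data View.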

5) Switch back to design perspective. Let’s address these inconsistent data by removing them
from our training data set. We could set these inconsistent values to missing and then replace
the missing values with another value, such as the mean, but in this instance we don’t really
know what should have been in this variable, so changing these to the mean seems a bit arbitrary.
Removing these inconsistencies means removing only 11 of our 493 observations, so rather
than risk using bad data, we will simply remove them. To do this, add two Filter Examples
operators in a row to your stream. For each of these, set the condition class to
attribute_value_filter, and for the parameter strings, enter ‘Decision_Making>=3’ (without
single quotes) for the first one, and ‘Decision_Making<=100’ for the second one. This
will reduce our training data set to 482 observations. The set-up described in this
step is shown in Figure 7-5.

Figure 7-5. Filtering out observations with inconsistent data.
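The effect of chaining the two filters can be sketched in Python as two passes over the rows. This is an analogy, not how RapidMiner implements Filter Examples; the function name filter_examples and the sample rows are hypothetical, while the two conditions are the ones given in step 5.

```python
def filter_examples(examples, attribute, low, high):
    """Keep rows whose attribute value lies between low and high, in two passes."""
    # First pass mirrors the condition Decision_Making>=3.
    at_least_low = [row for row in examples if float(row[attribute]) >= low]
    # Second pass mirrors the condition Decision_Making<=100.
    return [row for row in at_least_low if float(row[attribute]) <= high]


# Made-up rows for illustration; they are not taken from the chapter's data set.
examples = [
    {"Decision_Making": "45", "Prime_Sport": "Football"},
    {"Decision_Making": "2", "Prime_Sport": "Hockey"},
    {"Decision_Making": "100", "Prime_Sport": "Baseball"},
    {"Decision_Making": "101", "Prime_Sport": "Basketball"},
    {"Decision_Making": "3", "Prime_Sport": "Football"},
]

kept = filter_examples(examples, "Decision_Making", 3, 100)
print(len(examples) - len(kept), "removed,", len(kept), "kept")
```

Both bounds are inclusive, so scores of exactly 3 and exactly 100 stay in the data set. On the chapter's real training data the same logic accounts for the drop from 493 observations to 482.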

6) If you would like, you can run the model to confirm that your number of observations
(examples) has been reduced to 482. Then, in design perspective, use the search field in
the Operators tab to look for ‘Discriminant’ and locate the operator for Linear
Discriminant Analysis. Add this operator to your stream, as shown in Figure 7-6.

Figure 7-6. Addition of the Linear Discriminant Analysis operator to the model.

7) The tra port on the LDA (or Linear Discriminant Analysis) operator indicates that this tool
does expect to receive input from a training data set like the one we’ve provided, but
despite this, we still have received two errors, as indicated by the black arrow at the bottom
of the Figure 7-6 image. The first error is because of our Prime_Sport attribute. It is data
typed as polynominal, and LDA likes attributes that are numeric. This is OK, because the
attribute being predicted can have a polynominal data type, and Prime_Sport is the one
we want to predict. This error will be resolved shortly, because it is related to
the second error, which tells us that the LDA operator wants one of our attributes to be
designated as a ‘label’. In RapidMiner, the label is the attribute that you want to predict.
At the time that we imported our data set, we could have designated the Prime_Sport
attribute as a label, rather than as a normal attribute, but it is very simple to change an
attribute’s role right in your stream. Using the search field in the Operators tab, search for
an operator called Set Role. Add this to your stream and then, in the parameters area on
the right side of the window, select Prime_Sport in the name field, and in target role, select
label. We still have a warning (which does not prevent us from continuing), but you will
see the errors have now disappeared at the bottom of the RapidMiner window (Figure 7-7).

Figure 7-7. Setting an attribute’s role in RapidMiner.
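Conceptually, giving one attribute the label role just separates the attribute to be predicted from the attributes used to predict it. A minimal Python sketch of that idea follows; the function split_label and the sample rows are hypothetical, and this is an analogy rather than a description of what the Set Role operator does internally.

```python
def split_label(examples, label_name):
    """Separate the attribute to predict (the label) from the regular attributes."""
    regular = [
        {name: value for name, value in row.items() if name != label_name}
        for row in examples
    ]
    labels = [row[label_name] for row in examples]
    return regular, labels


# Made-up rows for illustration; only the attribute names come from the chapter.
examples = [
    {"Decision_Making": "45", "Prime_Sport": "Football"},
    {"Decision_Making": "62", "Prime_Sport": "Basketball"},
]

regular, labels = split_label(examples, "Prime_Sport")
print(regular)
print(labels)
```

After the split, the label values can remain text categories while the remaining attributes are the ones expected to be numeric, which matches the distinction drawn in step 7.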

With our inconsistent data removed and our errors resolved, we are now prepared to move on
to…
