Data Mining for the Masses


MODELING

8)

We now have a functional stream.  Go ahead and run the model as it is now.  With the mod
port connected to the res port, RapidMiner will generate Discriminant Analysis output for
us.

Figure 7-8. The results of discriminant analysis on our training data set.

9)

The probabilities given in the results will total to 1. This is because at this stage of our
Discriminant Analysis model, all that has been calculated is the likelihood of an observation
landing in one of the four categories in our target attribute of Prime_Sport. Because this is
our training data set, RapidMiner can calculate these probabilities easily, since every
observation is already classified. Football has a probability of 0.3237. If you refer back to
Figure 7-2, you will see that Football as Prime_Sport comprised 160 of our 493
observations. Thus, the probability of an observation having Football as its Prime_Sport
would be 160/493, or 0.3245. But in steps 3 and 4 (Figures 7-3 and 7-4), we removed 11
observations that had inconsistent data in their Decision_Making attribute. Four of these
were Football observations (Figure 7-4), so our Football count dropped to 156 and our total
count dropped to 482: 156/482 = 0.3237. Since we have no observations where the value for
Prime_Sport is missing, each possible value in Prime_Sport will have some portion of the
total count, and the sum of these portions will equal 1, as is the case in Figure 7-8. These
probabilities, coupled with the values for each attribute, will be used to predict the
Prime_Sport classification for each of Gill’s current clients represented in our scoring data
set. Return now to design perspective and, in the Repositories tab, drag the Chapter 7
scoring data set over and drop it in the main process window. Do not connect it to your
existing stream; rather, allow it to connect directly to a res port. Right click the
operator and rename it to ‘Scoring’. These steps are illustrated in Figure 7-9.

Figure 7-9. Adding the scoring data set to our model.
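
The Football probability arithmetic from step 9 can be checked outside of RapidMiner. The short Python sketch below is only an illustration and not part of the RapidMiner process. It uses the counts quoted in the text (160 of 493 before cleaning, with 11 observations removed, 4 of them Football), and the variable names are our own.

```python
# Counts quoted in step 9: 160 Football observations out of 493 before cleaning.
football_before, total_before = 160, 493
# Steps 3 and 4 removed 11 observations, 4 of which were Football.
removed_total, removed_football = 11, 4

football_after = football_before - removed_football
total_after = total_before - removed_total

# Each probability is simply the class count divided by the total count.
prior_before = football_before / total_before
prior_after = football_after / total_after

print(f"before cleaning: {football_before}/{total_before} = {prior_before:.4f}")
print(f"after cleaning:  {football_after}/{total_after} = {prior_after:.4f}")
```

The same division applied to each of the other three Prime_Sport values would give the remaining probabilities, and because every observation belongs to exactly one value, the four results sum to 1.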

10)

Run the model again.  RapidMiner will give you an additional tab in results perspective this
time which will show the meta data for the scoring data set (Figure 7-10).

Figure 7-10. Results perspective meta data for our scoring data set.

11)

The scoring data set contains 1,841 observations; however, as indicated by the black arrow
in the Range column of Figure 7-10, the Decision_Making attribute has some inconsistent
data again. Repeating the process previously outlined in steps 3 and 4, return to design
perspective and use two consecutive Filter Examples operators to remove any observations
that have values below 3 or above 100 in the Decision_Making attribute (Figure 7-11). This
will leave us with 1,767 observations, and you can check this by running the model again
(Figure 7-12).

Figure 7-11. Filtering out observations containing inconsistent Decision_Making values.

Figure 7-12. Verification that observations with inconsistent values have been removed.
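
For readers who want to see the logic of the two consecutive filters written out, here is a minimal Python sketch. It is not how RapidMiner implements Filter Examples. The five rows and the id field are invented for illustration, and only the thresholds (below 3, above 100) come from the text.

```python
# Hypothetical scoring rows; only Decision_Making matters for this sketch.
scoring = [
    {"id": 1, "Decision_Making": 2},
    {"id": 2, "Decision_Making": 3},
    {"id": 3, "Decision_Making": 57},
    {"id": 4, "Decision_Making": 100},
    {"id": 5, "Decision_Making": 101},
]

# First filter: remove values below 3 (keep Decision_Making >= 3).
step_one = [row for row in scoring if row["Decision_Making"] >= 3]
# Second filter: remove values above 100 (keep Decision_Making <= 100).
step_two = [row for row in step_one if row["Decision_Making"] <= 100]

kept_ids = [row["id"] for row in step_two]
print(kept_ids)
```

Note that the boundary values 3 and 100 themselves are kept. In the real scoring data set, the two filters together reduce 1,841 observations to 1,767, so 74 observations are removed.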

12)

We now have just one step remaining to complete our model and predict the Prime_Sport
for the 1,767 boys represented in our scoring data set.  Return to design perspective, and
use the search field in the Operators tab to locate an operator called Apply Model.  Drag
this operator over and place it in the Scoring data set’s stream, as is shown in Figure 7-13.


Figure 7-13. Adding the Apply Model operator to our Discriminant Analysis model.

13)

As you can see in Figure 7-13, the Apply Model operator has given us an error.  This is
because the Apply Model operator expects the output of a model generation operator as its
input.  This is an easy fix, because our LDA operator (which generated a model for us) has
a mod port for its output.  We simply need to disconnect the LDA’s mod port from the res
port it’s currently connected to, and connect it instead to the Apply Model operator’s mod
input port.  To do this, click on the mod port for the LDA operator, and then click on the
mod port for the Apply Model operator.  When you do this, the following warning will pop
up:

Figure 7-14. The port reconnection warning in RapidMiner.

14)

Click OK to indicate to RapidMiner that you do in fact wish to reconfigure the spline to
connect mod port to mod port.  The error message will disappear and your scoring model
will be ready for prediction (Figure 7-15).


Figure 7-15. Discriminant analysis model with training and scoring data streams.
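
The two-stream layout, in which a model is built from training data and then handed to Apply Model along with the scoring data, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not RapidMiner's implementation: it uses a single numeric attribute, two classes instead of four, and invented training values. The function names fit_lda_1d and apply_model are our own.

```python
import math
from collections import defaultdict

def fit_lda_1d(values, labels):
    """Fit a one-attribute LDA: class priors, class means, pooled variance."""
    groups = defaultdict(list)
    for x, y in zip(values, labels):
        groups[y].append(x)
    n = len(values)
    priors = {k: len(v) / n for k, v in groups.items()}
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    # Pooled within-class variance, shared by all classes (the LDA assumption).
    ss = sum((x - means[k]) ** 2 for k, v in groups.items() for x in v)
    variance = ss / (n - len(groups))
    return {"priors": priors, "means": means, "variance": variance}

def apply_model(model, x):
    """Return (predicted class, confidence per class) for one observation."""
    var = model["variance"]
    # Linear discriminant score for each class, including the log prior.
    scores = {
        k: x * m / var - m * m / (2 * var) + math.log(model["priors"][k])
        for k, m in model["means"].items()
    }
    # Convert the scores to confidences that sum to 1.
    top = max(scores.values())
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    conf = {k: e / total for k, e in exps.items()}
    return max(conf, key=conf.get), conf

# Hypothetical training data: one numeric attribute and a Prime_Sport label.
train_values = [20, 25, 30, 60, 65, 70]
train_labels = ["Football"] * 3 + ["Baseball"] * 3

model = fit_lda_1d(train_values, train_labels)      # the "mod" output
prediction, confidences = apply_model(model, 28)    # one scoring observation
print(prediction, confidences)
```

The point of the sketch is the hand-off: apply_model cannot run without the model object produced by the fitting step, which mirrors why the Apply Model operator reports an error until the LDA operator's mod port is connected to it.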

15)

Run the model by clicking the play button.  RapidMiner will generate five new attributes
and add them to our results perspective (Figure 7-16), preparing us for…

EVALUATION

Figure 7-16. Prediction attributes generated by RapidMiner.

The  first  four  attributes  created  by  RapidMiner  are  confidence  percentages,  which  indicate  the
relative strength of RapidMiner’s prediction when compared to the other values the software might
have  predicted  for  each  observation.    In  this  example  data  set,  RapidMiner  has  not  generated
