Data Mining for the Masses
114
MODELING
8) We now have a functional stream. Go ahead and run the model as it is now. With the mod
port connected to the res port, RapidMiner will generate Discriminant Analysis output for
us.
Figure 7-8. The results of discriminant analysis on our training data set.
9) The probabilities given in the results will total to 1. This is because at this stage of our
Discriminant Analysis model, all that has been calculated is the likelihood of an observation
landing in one of the four categories in our target attribute of Prime_Sport. Because this is
our training data set, RapidMiner can calculate these probabilities easily, since every
observation is already classified. Football has a probability of 0.3237. If you refer back to
Figure 7-2, you will see that Football as Prime_Sport comprised 160 of our 493
observations. On that basis, the probability of an observation having Football as its Prime_Sport would be 160/493, or
0.3245, not 0.3237. But in steps 3 and 4 (Figures 7-3 and 7-4), we removed 11 observations that had
inconsistent data in their Decision_Making attribute. Four of these were Football
observations (Figure 7-4), so our Football count dropped to 156 and our total count
dropped to 482: 156/482 = 0.3237. Since we have no observations where the value for
Prime_Sport is missing, each possible value in Prime_Sport will have some portion of the
total count, and the sum of these portions will equal 1, as is the case in Figure 7-8. These
probabilities, coupled with the values for each attribute, will be used to predict the
Prime_Sport classification for each of Gill’s current clients represented in our scoring data
set. Return now to design perspective and in the Repositories tab, drag the Chapter 7
scoring data set over and drop it in the main process window. Do not connect it to your
Chapter 7: Discriminant Analysis
existing stream, but rather, allow it to connect directly to a res port. Right click the
operator and rename it to ‘Scoring’. These steps are illustrated in Figure 7-9.
Figure 7-9. Adding the scoring data set to our model.
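The arithmetic behind the Football probability in step 9 can be checked with a few lines of Python. This is only a sketch of the calculation described in the text, not RapidMiner's internal code. The Football and total counts come from the text; the names Sport_B, Sport_C and Sport_D and their counts are hypothetical placeholders, chosen only so that the four categories add up to 482.

```python
# Counts taken from the text: Football observations and total observations,
# before removing the 11 rows with inconsistent Decision_Making values.
football_before, total_before = 160, 493
removed_total, removed_football = 11, 4

football_after = football_before - removed_football  # 156
total_after = total_before - removed_total           # 482

prior_before = football_before / total_before
prior_after = football_after / total_after

print(f"Before filtering: {prior_before:.4f}")  # 0.3245
print(f"After filtering:  {prior_after:.4f}")   # 0.3237

# Hypothetical split of the other 326 observations across three placeholder
# categories, used only to show that the class proportions sum to 1.
counts = {"Football": football_after, "Sport_B": 120, "Sport_C": 110, "Sport_D": 96}
priors = {sport: n / total_after for sport, n in counts.items()}
total_probability = sum(priors.values())
print(f"Sum of all four proportions: {total_probability:.4f}")
```

Because every observation belongs to exactly one category and none is missing, the proportions must sum to 1 regardless of how the non-Football observations are actually distributed.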
10) Run the model again. RapidMiner will give you an additional tab in results perspective this
time which will show the meta data for the scoring data set (Figure 7-10).
Figure 7-10. Results perspective meta data for our scoring data set.
11) The scoring data set contains 1,841 observations; however, as indicated by the black arrow in the Range
column of Figure 7-10, the Decision_Making attribute has some inconsistent data again.
Repeating the process previously outlined in steps 3 and 4, return to design perspective and
use two consecutive Filter Examples operators to remove any observations that have
values below 3 or above 100 in the Decision_Making attribute (Figure 7-11). This will
leave us with 1,767 observations, and you can check this by running the model again
(Figure 7-12).
Figure 7-11. Filtering out observations containing inconsistent Decision_Making values.
Figure 7-12. Verification that observations with inconsistent values have been removed.
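The logic of the two consecutive filters in step 11 can be sketched in plain Python. This is an illustration under stated assumptions rather than a reproduction of the Filter Examples operator: the helper filter_examples and the sample rows are hypothetical, and only the attribute name Decision_Making and the bounds 3 and 100 come from the text.

```python
LOW, HIGH = 3, 100  # valid Decision_Making range given in the text


def filter_examples(rows, keep):
    """Return only the rows for which the condition `keep` is true."""
    return [row for row in rows if keep(row)]


# Hypothetical scoring rows; the real data set has 1,841 observations.
scoring = [
    {"Decision_Making": 2},    # below the valid range
    {"Decision_Making": 3},    # lowest valid value
    {"Decision_Making": 57},
    {"Decision_Making": 100},  # highest valid value
    {"Decision_Making": 101},  # above the valid range
]

# First filter: drop values below 3.
step1 = filter_examples(scoring, lambda r: r["Decision_Making"] >= LOW)
# Second filter: drop values above 100 from what remains.
step2 = filter_examples(step1, lambda r: r["Decision_Making"] <= HIGH)

kept_values = [r["Decision_Making"] for r in step2]
print(len(scoring), "rows before,", len(step2), "rows after")
print(kept_values)
```

Chaining two simple conditions mirrors the two operators placed one after the other in the stream: each one removes a single kind of inconsistent value.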
12) We now have just one step remaining to complete our model and predict the Prime_Sport
for the 1,767 boys represented in our scoring data set. Return to design perspective, and
use the search field in the Operators tab to locate an operator called Apply Model. Drag
this operator over and place it in the Scoring data set’s stream, as is shown in Figure 7-13.
Figure 7-13. Adding the Apply Model operator to our Discriminant Analysis model.
13) As you can see in Figure 7-13, the Apply Model operator has given us an error. This is
because the Apply Model operator expects the output of a model generation operator as its
input. This is an easy fix, because our LDA operator (which generated a model for us) has
a mod port for its output. We simply need to disconnect the LDA’s mod port from the res
port it’s currently connected to, and connect it instead to the Apply Model operator’s mod
input port. To do this, click on the mod port for the LDA operator, and then click on the
mod port for the Apply Model operator. When you do this, the following warning will pop
up:
Figure 7-14. The port reconnection warning in RapidMiner.
14) Click OK to indicate to RapidMiner that you do in fact wish to reconfigure the spline to
connect mod port to mod port. The error message will disappear and your scoring model
will be ready for prediction (Figure 7-15).
Figure 7-15. Discriminant analysis model with training and scoring data streams.
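The reason Apply Model needs the LDA operator's mod output can be illustrated with a small Python sketch: one function builds a model from labeled training data, and a second function needs both that model and the unlabeled scoring data before it can predict anything. To keep the sketch free of external libraries, a simple nearest-centroid classifier stands in for discriminant analysis; it is a swapped-in technique, not what RapidMiner computes. The functions train and apply_model, the attribute Other_Attribute, the category Sport_B and all data values are hypothetical.

```python
import math
from collections import defaultdict


def train(rows, label_key):
    """Build a 'model': the mean of each numeric attribute per category."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        label = row[label_key]
        counts[label] += 1
        for name, value in row.items():
            if name != label_key:
                sums[label][name] += value
    return {
        label: {name: total / counts[label] for name, total in attrs.items()}
        for label, attrs in sums.items()
    }


def apply_model(model, rows):
    """Add a 'prediction' to each unlabeled row using a trained model."""
    scored = []
    for row in rows:
        def distance(label):
            centroid = model[label]
            return math.sqrt(sum((row[n] - centroid[n]) ** 2 for n in centroid))
        scored.append({**row, "prediction": min(model, key=distance)})
    return scored


# Hypothetical training rows (already classified) and scoring rows (not yet).
training = [
    {"Decision_Making": 80, "Other_Attribute": 10, "Prime_Sport": "Football"},
    {"Decision_Making": 90, "Other_Attribute": 20, "Prime_Sport": "Football"},
    {"Decision_Making": 20, "Other_Attribute": 60, "Prime_Sport": "Sport_B"},
    {"Decision_Making": 30, "Other_Attribute": 70, "Prime_Sport": "Sport_B"},
]
scoring = [
    {"Decision_Making": 75, "Other_Attribute": 25},
    {"Decision_Making": 35, "Other_Attribute": 55},
]

model = train(training, "Prime_Sport")  # plays the role of the mod output
scored = apply_model(model, scoring)    # needs the model and the scoring data
predictions = [row["prediction"] for row in scored]
print(predictions)
```

Calling apply_model without a model would fail in the same way the operator reports an error before its mod input port is connected.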
15) Run the model by clicking the play button. RapidMiner will generate five new attributes
and add them to our results perspective (Figure 7-16), preparing us for…
EVALUATION
Figure 7-16. Prediction attributes generated by RapidMiner.
The first four attributes created by RapidMiner are confidence percentages, which indicate the
relative strength of RapidMiner’s prediction when compared to the other values the software might
have predicted for each observation. In this example data set, RapidMiner has not generated