Data Mining for the Masses

Date: 08.10.2017

‘label’ (the thing we want to predict).  Back in Chapter 10, we did this by adding another
Set Role operator, but this time, click on the ‘set additional roles: Edit List’ button in the
Parameters area. This is indicated by the black arrow in Figure 13-2.

Figure 13-2. Setting multiple roles with a single Set Role operator.

4)

In the resulting pop-up window, set the name field to be eReader_Adoption and the target
role field to be label.  (Note that we could use the Add Entry button to use this single Set
Role operator to handle role assignments for many attributes all at once.)

Figure 13-3. Setting additional roles by editing the parameters
of a single Set Role operator.
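
The idea of assigning several roles in one place can be sketched outside RapidMiner as a simple mapping from attribute name to role. This is only an illustration, not RapidMiner's API: the attribute names other than eReader_Adoption and the helper split_by_role are hypothetical.

```python
# One mapping holds every role assignment, much like several entries
# in a single Set Role operator. "User_ID" is a hypothetical attribute name.
roles = {
    "User_ID": "id",
    "eReader_Adoption": "label",
}


def split_by_role(row, roles):
    """Separate one observation into predictor attributes and the label value."""
    label = None
    features = {}
    for name, value in row.items():
        role = roles.get(name, "regular")  # attributes without an entry stay regular
        if role == "label":
            label = value
        elif role == "regular":
            features[name] = value
        # attributes with the id role are kept out of the predictors
    return features, label


# Hypothetical observation, for illustration only.
row = {"User_ID": 1, "Age": 34, "eReader_Adoption": "Late Majority"}
features, label = split_by_role(row, roles)
print(features, label)
```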

Chapter 13: Evaluation and Deployment

5)

When we used this data set previously, we added our Decision Tree operator at this point.
This time, we will use the search field in the Operators tab to find x-Validation operators.
There are four of them, but we will use the basic cross-validation operator in this example:

Figure 13-4. Adding a cross-validation operator to our stream.

6)

The cross-validation operator requires a two-part sub-process. In the first part of the sub-
process, we will add our Decision Tree operator to build a model, and in the second part
we will apply our model and check its performance. Double click the Validation operator
to enter the sub-process window.

Figure 13-5. Modeling and applying the model in the cross-validation sub-process.
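
The two-part structure of cross-validation can be sketched in plain Python. This is a minimal illustration under stated assumptions, not RapidMiner's implementation: the data is a made-up toy set, the function names are hypothetical, and a trivial "predict the most common label" model stands in for the decision tree.

```python
from collections import Counter


def train_majority_model(training_rows):
    """Training side: 'learn' the most common label (a stand-in for a decision tree)."""
    counts = Counter(label for _, label in training_rows)
    return counts.most_common(1)[0][0]


def apply_and_score(model, testing_rows):
    """Testing side: apply the model, then measure accuracy on the held-out fold."""
    correct = sum(1 for _, label in testing_rows if label == model)
    return correct / len(testing_rows)


def cross_validate(rows, k):
    """Split rows into k folds; each fold serves exactly once as the testing set."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, testing_rows in enumerate(folds):
        training_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_majority_model(training_rows)
        scores.append(apply_and_score(model, testing_rows))
    return sum(scores) / k  # average performance across the k folds


# Hypothetical toy data: (attributes, label) pairs; the stand-in model ignores the attributes.
rows = [({"id": n}, "Late Majority" if n % 3 else "Innovator") for n in range(12)]
average_accuracy = cross_validate(rows, k=4)
print(f"average accuracy over 4 folds: {average_accuracy:.2f}")
```

Every observation is used for testing exactly once and for training in the remaining rounds, which is why the averaged result is a fairer estimate than scoring the model on the same data that built it.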

7)

As shown in Figure 13-5, add the Decision Tree operator on the Training side of the
cross-validation sub-process, and the Apply Model operator on the Testing side.  Leave the
Decision Tree’s criterion parameter as gain_ratio for now.  The splines depicted here are
drawn automatically when you drag these operators into these two areas.  If for any reason
your splines are not configured in this way, connect the ports as shown so that your
sub-process matches Figure 13-5.  We must now complete the Testing portion of the
sub-process.  In the Operators search field, search for an operator called ‘Performance’.
There are a number of these.  We will use the first one: Performance (Classification).  The
reason for this is that a decision tree predicts a classification in an attribute; in our
example, the adopter class (innovator, early adopter, etc.).

Figure 13-6. The configuration of the cross-validation sub-process.

8)

Once your sub-process is configured, click the blue up arrow to return to the main process.
Connect the mod, tra and avg ports to res ports as shown in Figure 13-7.  The mod port will
generate the visual depiction of our decision tree, the tra port will create the training data
set’s attribute table, and the avg port will calculate a True Positive table showing the training
data set’s ability to predict accurately.

Figure 13-7. Splines to create the three desired outputs from our
cross-validated Decision Tree data mining model.

9)

Run the model.  The ExampleSet (tra port) and Tree (mod port) tabs will be familiar to you.
The PerformanceVector (avg port) is new, and in the context of Evaluation and
Deployment, this tab is the most interesting to us.  We see that using this training data set
and Decision Tree algorithm (gain_ratio), RapidMiner calculates a 54% accuracy rate for
this model.  This overall accuracy rate reflects the class precision rates for each possible
value in our eReader_Adoption attribute.  For pred. Late Majority as an example, the class
precision (the share of Late Majority predictions that were correct) is 69.8%, which means
that 30.2% of the predictions for this value were false positives.  If all of the possible
eReader_Adoption values had class precisions of 69.8%, then our model’s overall accuracy
would be 69.8% as well, but they don’t.  Some are lower, and so when they are weighted by
the number of predictions made for each value and averaged, our model’s overall accuracy
is only 54%.

Figure 13-8. Evaluating the predictive quality of our decision tree model.
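
The relationship between class precision and overall accuracy can be checked with a few lines of Python. The counts below are invented for illustration and are not the values in Figure 13-8. The layout assumes rows are predicted classes and columns are true classes.

```python
# Hypothetical confusion matrix: rows = predicted class, columns = true class.
# These counts are made up; they are NOT the book's results.
classes = ["Innovator", "Early Adopter", "Late Majority"]
confusion = [
    [20, 15, 5],   # pred. Innovator
    [10, 30, 10],  # pred. Early Adopter
    [2, 8, 50],    # pred. Late Majority
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total  # share of all predictions that were right

precision = {}
weighted_sum = 0.0
for i, name in enumerate(classes):
    predicted_count = sum(confusion[i])  # how often this class was predicted
    precision[name] = confusion[i][i] / predicted_count
    # weight each class precision by its share of all predictions
    weighted_sum += precision[name] * predicted_count / total
    print(f"pred. {name}: class precision = {precision[name]:.1%}")

print(f"overall accuracy = {accuracy:.1%}")
print(f"weighted average of class precisions = {weighted_sum:.1%}")
```

In this sketch the best class precision is well above the overall accuracy, because the weaker classes pull the weighted average down.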

10)

An overall accuracy of 54% might seem alarming, and even individual class precisions in
the 40-60% range might seem discouraging, but remember, life is unpredictable, or at least
inconsistent, so 100% true positives are probably a pipe dream.  The probability of false
positives shouldn’t even be that surprising to us, because back in Chapter 10, we evaluated
our Confidence Percentage attributes, and we knew back then that most of our
observations had partial confidences in the predicted value.  In Figure 10-10, person 77373
had some chance of landing in any one of three of the four possible adopter categories, so
of course there is a chance of a false positive!  But that doesn’t render our model useless,
and perhaps we can improve it.  Return to design perspective and double click on the
Validation operator to re-open the sub-process.  Click on the Decision Tree operator and
change its criterion parameter from gain_ratio to gini_index.

Figure 13-9. Changing the Decision Tree operator to use Gini.
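
To give a feel for what the two criteria measure, here is a minimal sketch using the common textbook definitions of gain ratio and Gini impurity. It assumes these definitions correspond to RapidMiner's gain_ratio and gini_index options; the function names and the toy split are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: chance that two randomly drawn labels differ."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def weighted(measure, groups):
    """Average an impurity measure over child groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * measure(g) for g in groups)


def gain_ratio(parent, groups):
    """Information gain divided by the entropy of the split itself."""
    info_gain = entropy(parent) - weighted(entropy, groups)
    split_info = entropy([i for i, g in enumerate(groups) for _ in g])
    return info_gain / split_info if split_info > 0 else 0.0


def gini_gain(parent, groups):
    """Reduction in Gini impurity achieved by the split."""
    return gini(parent) - weighted(gini, groups)


# Hypothetical split of eight observations into two branches.
left = ["Innovator"] * 3 + ["Late Majority"]
right = ["Late Majority"] * 3 + ["Innovator"]
parent = left + right

print(f"gain ratio: {gain_ratio(parent, [left, right]):.4f}")
print(f"gini gain:  {gini_gain(parent, [left, right]):.4f}")
```

Both criteria reward splits that make the branches purer, but they score candidate splits differently, so switching the criterion can produce a different tree and a different accuracy.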
