4)
One of the nice side-effects of setting an attribute’s role to ‘id’ rather than removing it
using a Select Attributes operator is that it makes each record easier to match back to
individual people later, when viewing predictions in results perspective. Thinking back to
some of our other predictive models in previous chapters (e.g. Discriminant Analysis), you
could use such an approach to leave in people's names or ID numbers so that you could
easily know who to contact during the deployment phase of data mining projects.
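RapidMiner handles roles entirely through its GUI, but if it helps to see the same idea expressed in code, here is a minimal sketch in Python with pandas. The column names here are hypothetical, not from Richard's data set; the point is only the pattern of keeping the identifier out of the predictors while carrying it through to the results:

    import pandas as pd

    # The 'id' role idea: keep the identifier out of the predictors, but
    # carry it through so each prediction can be matched back to a person.
    df = pd.DataFrame({
        "Customer_Name": ["Ann", "Ben", "Carla"],  # 'id' role: retained, never used to predict
        "Age":           [23, 41, 35],             # a regular predictor
    })

    X = df.drop(columns=["Customer_Name"])  # only true predictors go to the model
    ids = df["Customer_Name"]               # kept aside for the deployment phase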
Before adding a Decision Tree operator, we still need to do another data preparation step.
The Decision Tree operator, as with other predictive model operators we’ve used to this
point in the text, expects the training stream to supply a ‘label’ attribute. For this example,
we want to predict which adopter group Richard’s next-gen eReader customers are likely to
be in. So our label will be eReader_Adoption (Figure 10-4).
Figure 10-4. Setting the eReader_Adoption attribute as the label in our training stream.
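There is nothing to code here in RapidMiner, but as a rough analogy, the 'label' role corresponds to the target vector in a programmatic tool. A minimal Python sketch, assuming a hypothetical training file ("eReader_Train.csv") and a hypothetical id attribute ("User_ID") alongside the chapter's label attribute:

    import pandas as pd

    # Hypothetical file standing in for Richard's training data set.
    train = pd.read_csv("eReader_Train.csv")

    y = train["eReader_Adoption"]                            # the 'label' we want to predict
    X = train.drop(columns=["eReader_Adoption", "User_ID"])  # everything else is a predictor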
5)
Next, search in the Operators tab for 'Decision Tree'. Select the basic Decision Tree
operator and add it to your training stream, as shown in Figure 10-5.
Figure 10-5. The Decision Tree operator added to our model.
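For readers following the Python analogy, the sketch below continues the earlier example by fitting a decision tree on the training predictors. Note that scikit-learn's tree algorithm (CART) differs in its details from RapidMiner's default decision tree, so its splits will not exactly match the figures in this chapter, and the depth cap here is an arbitrary choice for readability, not a setting from the book:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    X_enc = pd.get_dummies(X)   # scikit-learn trees need numeric inputs,
                                # so one-hot encode categorical predictors

    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X_enc, y)          # the code analogue of running the model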
6)
Run the model and switch to the Tree (Decision Tree) tab in results perspective. You will
see our preliminary tree (Figure 10-6).
Figure 10-6. Decision tree results.
7)
In Figure 10-6, we can see what are referred to as nodes and leaves. The nodes are the
gray oval shapes. They are attributes which serve as good predictors for our label attribute.
The leaves are the multicolored end points that show us the distribution of categories from
our label attribute that follow the branch of the tree to the point of that leaf. We can see in
this tree that Website_Activity is our best predictor of whether or not a customer is going
to adopt (buy) the company’s new eReader. If the person’s activity is frequent or regular,
we see that they are likely to be an Innovator or Early Adopter, respectively. If, however,
they seldom use the web site, then whether or not they’ve bought digital books becomes
the next best predictor of their eReader adoption category. If they have not bought digital
books through the web site in the past, Age is another predictive attribute which forms a
node, with younger folks adopting sooner than older ones. This is seen on the branches
for the two leaves coming from the Age node in Figure 10-6. Those who seldom use the
company’s website, have never bought digital books on the site, and are older than 25 ½
are most likely to land in the Late Majority category, while those with the same profile
who are under 25 ½ are bumped to the Early Majority prediction. This example shows how
to read the nodes, leaves, and branch labels as you move down through the tree.
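If you are following along with the Python analogue, a rough stand-in for RapidMiner's Tree view is scikit-learn's text export, which prints each node's split condition and each leaf's predicted class:

    from sklearn.tree import export_text

    # Each indented line is a node (a split such as 'Age <= 25.5');
    # lines ending in 'class: ...' are the leaves described above.
    print(export_text(tree, feature_names=list(X_enc.columns)))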
Before returning to design perspective, take a minute to try some of the tools on the left-
hand side of the screen. The magnifying glass icons can help you see your tree better,
spreading out or compacting the nodes and leaves to enhance readability or to view more
of a large tree at one time. Also, try using the ‘hand’ icon under Mode (see the arrow on
Figure 10-6). This allows you to click and hold on individual leaves or nodes and drag
them around to enhance your tree’s readability. Finally, try hovering your mouse over one
of the leaves in the tree. In Figure 10-7, we see a tool-tip hover box showing details of this
leaf. Although our training data leads the model to predict that 'regular' web site users
will be Early Adopters, that prediction is not unanimous. In the hover, we read that in
the training data set, 9 people who fit this profile are in the Late Majority, 58 are
Innovators, 75 are Early Adopters, and 41 are in the Early Majority. When we get to the
Evaluation phase, we will see that this uncertainty in our data will translate into
confidence percentages, similar to what we saw in Chapter 9 with logistic regression.
Figure 10-7. A tool-tip hover showing expanded leaf detail in our tree.
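In code terms, those leaf counts are exactly what becomes a confidence score. Continuing the hedged Python sketch, predict_proba reports, for each record, the share of training examples of each class in the leaf where that record lands. A leaf like the one in Figure 10-7 (9 + 58 + 75 + 41 = 183 people) would yield roughly 75/183 ≈ 41% confidence for Early Adopter:

    import pandas as pd

    proba = tree.predict_proba(X_enc.head())           # score the first few records
    print(pd.DataFrame(proba, columns=tree.classes_))  # one confidence column per adopter group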
With our predictor attributes prepared, we are now ready to move on to…
MODELING
8)
Return to design perspective. In the Operators tab, search for and add an Apply Model
operator, bringing your training and scoring streams together. Ensure that both the lab and
mod ports are connected to res ports in order to generate our desired outputs (Figure 10-8).
Figure 10-8. Applying the model to our scoring data, and outputting label predictions (lab)
and a decision tree model (mod).
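As a final piece of the running Python analogy, the Apply Model step looks like the sketch below: score a hypothetical scoring file with the fitted tree and attach the predictions to the retained ids, mirroring the lab output. The file and column names are assumptions for illustration:

    import pandas as pd

    scoring = pd.read_csv("eReader_Scoring.csv")   # hypothetical scoring data set
    X_score = pd.get_dummies(scoring.drop(columns=["User_ID"]))
    X_score = X_score.reindex(columns=X_enc.columns, fill_value=0)  # align with training columns

    results = scoring[["User_ID"]].copy()          # the 'id' role pays off here
    results["prediction"] = tree.predict(X_score)  # the 'lab' output: predicted adopter group
    print(results.head())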