Figure 10-10. Predictions and their associated confidence percentages using our decision tree.
11) We've already begun to evaluate our model's results, but what if we'd like to see greater detail, or granularity, in our model? Surely some of our other attributes are also predictive in nature. Remember that CRISP-DM is cyclical in nature, and that in some modeling techniques, especially those with less structured data, some back-and-forth trial-and-error can reveal more interesting patterns in data. Switch back to design perspective, click on the Decision Tree operator, and in the Parameters area, change the 'criterion' parameter to 'gini_index', as shown in Figure 10-11.
Figure 10-11. Constructing our decision tree model using the gini_index algorithm rather than the gain_ratio algorithm.
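If you like to tinker outside of RapidMiner, the same criterion switch can be sketched in Python with scikit-learn. This is only an illustrative sketch: the file and column names below are hypothetical stand-ins for Richard's data set, and scikit-learn offers 'gini' and 'entropy' (plain information gain) as split criteria rather than RapidMiner's gain_ratio.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training file and column names standing in for
    # Richard's data set.
    train = pd.read_csv("adopter_training.csv")
    X = pd.get_dummies(train[["Age", "Gender", "Website_Activity"]])
    y = train["eReader_Adoption"]

    # Switching the split criterion is a one-word change, much like
    # editing the 'criterion' parameter on the Decision Tree operator.
    tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)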
Now, re-run the model and we will move on to…
EVALUATION
Figure 10-12. Tree resulting from a gini_index algorithm.
We see that this tree offers much more detail and granularity when we use the Gini algorithm as the criterion for our decision tree. We could further modify the tree by going back to design
view and changing the minimum number of items to form a node (size for split) or the minimum
size for a leaf. Even accepting the defaults for those parameters though, we can see that the Gini
algorithm alone is much more sensitive than is the Gain Ratio algorithm in identifying nodes and
leaves. Take a minute to explore this new tree model. You will find that it is extensive, and that you will need to use both the Zoom and Mode tools to see it all. You should find that most of
our other independent variables (predictor attributes) are now being used, and the granularity with
which Richard can identify each customer’s likely adoption category is much greater. How active
the person is on Richard’s employer’s web site is still the single best predictor, but gender, and
multiple levels of age have now also come into play. You will also find that a single attribute is
sometimes used more than once in a single branch of the tree. Decision trees are a lot of fun to
experiment with, and with a sensitive algorithm like Gini generating them, they can be
tremendously interesting as well.
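Continuing the scikit-learn sketch from above (same hypothetical X and y), the parameters that play roughly the same role as RapidMiner's minimum size for split and minimum leaf size are min_samples_split and min_samples_leaf:

    from sklearn.tree import DecisionTreeClassifier

    # Raising these values prunes the tree back toward something coarser;
    # the defaults (2 and 1) permit the very detailed trees Gini produces.
    coarser_tree = DecisionTreeClassifier(
        criterion="gini",
        min_samples_split=20,  # a node needs at least 20 examples to split
        min_samples_leaf=10,   # every leaf must retain at least 10 examples
        random_state=0,
    ).fit(X, y)

Larger values yield a smaller, less granular tree, which is one way to tame an overly sensitive Gini tree.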
Switch to the ExampleSet tab in Data View. We see here (Figure 10-13) that changing our tree's underlying algorithm has, in some cases, also changed our confidence in the prediction.
Figure 10-13. New predictions and confidence percentages using Gini.
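To see confidence percentages like those in Figure 10-13 from the sketch above, scikit-learn exposes predict_proba, which returns one probability per adopter category for each person. Again, the scoring file name here is a hypothetical placeholder:

    # Hypothetical scoring file; reindex keeps its one-hot columns
    # aligned with the training columns.
    scoring = pd.read_csv("adopter_scoring.csv")
    X_score = pd.get_dummies(scoring).reindex(columns=X.columns, fill_value=0)

    # One probability per adopter category per person -- the analogue
    # of RapidMiner's confidence columns.
    print(tree_gini.classes_)                      # category order for the columns
    print(tree_gini.predict_proba(X_score)[0])     # row 1 under gini
    print(tree_entropy.predict_proba(X_score)[0])  # row 1 under entropy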
Let’s take the person on Row 1 (ID 56031) as an example. In Figure 10-10, this person was
calculated as having at least some percentage chance of landing in any one of the four adopter
categories. Under the Gain Ratio algorithm, we were 41% sure he’d be an early adopter, but
almost 32% sure he might also turn out to be an innovator. In other words, we feel confident he’ll
buy the eReader early on, but we’re not sure how early. Maybe that matters to Richard, maybe not.
He’ll have to decide during the deployment phase. But perhaps using Gini, we can help him
decide. In Figure 10-13, this same man is now shown to have a 60% chance of being an early
adopter and only a 20% chance of being an innovator. The odds of him becoming part of the late
majority crowd under the Gini model have dropped to zero. We know he will adopt (or at least we are predicting with 100% confidence that he will adopt), and that he will adopt early. While he may
not be at the top of Richard’s list when deployment rolls around, he’ll probably be higher than he
otherwise would have been under gain_ratio. Note that while Gini has changed some of our
predictions, it hasn’t affected all of them. Re-check person ID 77373 briefly. There is no
difference in this person’s predictions under either algorithm—RapidMiner is quite certain in its
predictions for this young man. Sometimes the level of confidence in a prediction through a