Figure 10-10. Predictions and their associated confidence percentages using our decision tree.
11) We've already begun to evaluate our model's results, but what if we'd like to see greater detail, or granularity, in our model? Surely some of our other attributes are also predictive in nature. Remember that CRISP-DM is cyclical in nature, and that in some modeling techniques, especially those with less structured data, some back-and-forth trial-and-error can reveal more interesting patterns in data. Switch back to design perspective, click on the Decision Tree operator, and in the Parameters area, change the 'criterion' parameter to 'gini_index', as shown in Figure 10-11.
Figure 10-11. Constructing our decision tree model using the gini_index algorithm rather than the gain_ratio algorithm.
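If you like to tinker outside of RapidMiner, the same criterion switch can be sketched in Python with scikit-learn. This is only an illustrative sketch: the file and column names below are hypothetical stand-ins for Richard's data set, and scikit-learn offers 'gini' and 'entropy' (plain information gain) as split criteria rather than RapidMiner's gain_ratio.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training file and column names standing in for
    # Richard's data set.
    train = pd.read_csv("adopter_training.csv")
    X = pd.get_dummies(train[["Age", "Gender", "Website_Activity"]])
    y = train["eReader_Adoption"]

    # Switching the split criterion is a one-word change, much like
    # editing the 'criterion' parameter on the Decision Tree operator.
    tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)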
Now, re-run the model and we will move on to…
EVALUATION
Figure 10-12. Tree resulting from a gini_index algorithm.
We see that this tree offers much more detail and granularity when we use the Gini algorithm as the criterion for our decision tree. We could further modify the tree by going back to design
view and changing the minimum number of items to form a node (size for split) or the minimum
size for a leaf. Even accepting the defaults for those parameters though, we can see that the Gini
algorithm alone is much more sensitive than is the Gain Ratio algorithm in identifying nodes and
leaves. Take a minute to explore this new tree model. You will find that it is extensive, and that you will need to use both the Zoom and Mode tools to see it all. You should find that most of
our other independent variables (predictor attributes) are now being used, and the granularity with
which Richard can identify each customer’s likely adoption category is much greater. How active
the person is on Richard’s employer’s web site is still the single best predictor, but gender, and
multiple levels of age have now also come into play. You will also find that a single attribute is
sometimes used more than once in a single branch of the tree. Decision trees are a lot of fun to
experiment with, and with a sensitive algorithm like Gini generating them, they can be
tremendously interesting as well.
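Continuing the scikit-learn sketch from above (same hypothetical X and y), the parameters that play roughly the same role as RapidMiner's minimum size for split and minimum leaf size are min_samples_split and min_samples_leaf:

    from sklearn.tree import DecisionTreeClassifier

    # Raising these values prunes the tree back toward something coarser;
    # the defaults (2 and 1) permit the very detailed trees Gini produces.
    coarser_tree = DecisionTreeClassifier(
        criterion="gini",
        min_samples_split=20,  # a node needs at least 20 examples to split
        min_samples_leaf=10,   # every leaf must retain at least 10 examples
        random_state=0,
    ).fit(X, y)

Larger values yield a smaller, less granular tree, which is one way to tame an overly sensitive Gini tree.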
Switch to the ExampleSet tab in Data View. We see here (Figure 10-13) that changing our tree's underlying algorithm has, in some cases, also changed our confidence in the prediction.
Figure 10-13. New predictions and confidence percentages using Gini.
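To see confidence percentages like those in Figure 10-13 from the sketch above, scikit-learn exposes predict_proba, which returns one probability per adopter category for each person. Again, the scoring file name here is a hypothetical placeholder:

    # Hypothetical scoring file; reindex keeps its one-hot columns
    # aligned with the training columns.
    scoring = pd.read_csv("adopter_scoring.csv")
    X_score = pd.get_dummies(scoring).reindex(columns=X.columns, fill_value=0)

    # One probability per adopter category per person -- the analogue
    # of RapidMiner's confidence columns.
    print(tree_gini.classes_)                      # category order for the columns
    print(tree_gini.predict_proba(X_score)[0])     # row 1 under gini
    print(tree_entropy.predict_proba(X_score)[0])  # row 1 under entropy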
Let’s take the person on Row 1 (ID 56031) as an example. In Figure 10-10, this person was
calculated as having at least some percentage chance of landing in any one of the four adopter
categories. Under the Gain Ratio algorithm, we were 41% sure he’d be an early adopter, but
almost 32% sure he might also turn out to be an innovator. In other words, we feel confident he’ll
buy the eReader early on, but we’re not sure how early. Maybe that matters to Richard, maybe not.
He’ll have to decide during the deployment phase. But perhaps using Gini, we can help him
decide. In Figure 10-13, this same man is now shown to have a 60% chance of being an early
adopter and only a 20% chance of being an innovator. The odds of him becoming part of the late
majority crowd under the Gini model have dropped to zero. We know he will adopt (or at least we are predicting with 100% confidence that he will adopt), and that he will adopt early. While he may
not be at the top of Richard’s list when deployment rolls around, he’ll probably be higher than he
otherwise would have been under gain_ratio. Note that while Gini has changed some of our
predictions, it hasn’t affected all of them. Re-check person ID 77373 briefly. There is no
difference in this person’s predictions under either algorithm—RapidMiner is quite certain in its
predictions for this young man. Sometimes the level of confidence in a prediction through a