Data Mining for the Masses
CHAPTER SUMMARY
Decision trees are excellent predictive models when the target attribute is categorical in nature and
the data set is of mixed types. Although this chapter’s data sets did not contain any examples,
decision trees are also better than more statistics-based approaches at handling attributes with
missing or inconsistent values: a decision tree will work around such data and still generate usable
results.
Decision trees are made of nodes and leaves, connected by labeled branch arrows, that represent
the best predictor attributes in a data set. These nodes and leaves lead to confidence percentages
based on the actual attributes in the training data set; the tree can then be applied to similarly
structured scoring data in order to generate predictions for the scoring observations. Decision
trees tell us what is predicted, how confident we can be in the prediction, and how we arrived at the
prediction. The ‘how we arrived at’ portion of a decision tree’s output is shown in a graphical view
of the tree.
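As a rough sketch, the structure described above can be modeled in Python as nested dictionaries, with each leaf holding class counts from the training data; the confidence percentage is simply the majority class's share of that leaf's counts. The tree below is entirely made up for illustration and is not RapidMiner's internal representation.

```python
# Hypothetical toy tree (illustrative only): inner nodes name a predictor
# attribute and branch on its values; leaves store training-class counts.
tree = {
    "attribute": "Class_of_Service",
    "branches": {
        "1st": {"counts": {"Survived": 18, "Lost": 6}},   # leaf
        "3rd": {                                          # inner node
            "attribute": "Gender",
            "branches": {
                "F": {"counts": {"Survived": 9, "Lost": 4}},
                "M": {"counts": {"Survived": 3, "Lost": 21}},
            },
        },
    },
}

def predict(node, observation):
    """Walk the tree to a leaf; return (prediction, confidence)."""
    while "attribute" in node:                 # still at an inner node
        node = node["branches"][observation[node["attribute"]]]
    counts = node["counts"]
    label = max(counts, key=counts.get)        # majority class at the leaf
    confidence = counts[label] / sum(counts.values())
    return label, confidence

label, conf = predict(tree, {"Class_of_Service": "3rd", "Gender": "M"})
# → ("Lost", 0.875), i.e. 21 of the 24 training passengers at this leaf
```

The same walk that produces the prediction also records the path taken, which is the "how we arrived at" portion a graphical tree view displays.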
REVIEW QUESTIONS
1) What characteristics of a data set’s attributes might prompt you to choose a decision tree
data mining methodology, rather than a logistic or linear regression approach? Why?
2) Run this chapter’s model using the gain_ratio algorithm and make a note of three or four
individuals’ predictions and confidences. Then re-run the model under gini_index. Locate
the people you noted. Did their predictions and/or confidences change? Look at their
attribute values and compare them to the nodes and leaves in the decision tree. Explain
why you think at least one person’s prediction changed under Gini, based on that person’s
attributes and the tree’s nodes.
3) What are confidence percentages used for, and why would they be important to consider,
in addition to just considering the prediction attribute?
4) How can you retain an attribute, such as a person’s name or ID number, that should not
be treated as predictive in a process’s model but is still useful to have in the data mining
results?
Chapter 10: Decision Trees
5) If your decision tree is large or hard to read, how can you adjust its visual layout to
improve readability?
EXERCISE
For this chapter’s exercise, you will make a decision tree to predict whether you, and others
you know, would have lived, died, or been lost if you had been on the Titanic. Complete the
following steps.
1) Conduct an Internet search for passenger lists for the Titanic. The search term ‘Titanic
passenger list’ in your favorite search engine will yield a number of web sites containing
lists of passengers.
2) From the sources you find, select a sample of passengers. You do not need to construct a
training data set of every passenger on the Titanic (unless you want to), but get at least 30,
and preferably more. The more robust your training data set is, the more interesting your
results will be.
3) In a spreadsheet in OpenOffice Calc, enter these passengers’ data.
a. Record attributes such as their name, age, gender, class of service they traveled in,
race or nationality if known, or other attributes that may be available to you
depending on the detail level of the data source you find.
b. Be sure to have at least four attributes, preferably more. Remember that the
passengers’ names or ID numbers won’t be predictive, so that attribute shouldn’t
be counted as one of your predictor attributes.
c. Add to your data set whether the person lived (i.e. was rescued from a life boat or
from the water), died (i.e. their body was recovered), or was lost (i.e. was on the
Titanic’s manifest but was never accounted for and therefore presumed dead after
the ship’s sinking). Call this attribute ‘Survival_Result’.
d. Save this spreadsheet as a CSV file and then import it into your RapidMiner
repository. Set the Survival_Result attribute’s role to be your label. Set other
attributes which are not predictive, such as names, to not be considered in the
decision tree model.
e. Add a Decision Tree operator to your stream.
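The label-and-exclusion setup in step 3 can be sketched outside RapidMiner as well. The minimal Python below (file contents, names, and attribute values are all made up for illustration) separates each training row into a Survival_Result label and predictor attributes, with Name kept for reporting but excluded from the predictors:

```python
import csv
import io

# A miniature stand-in for the training CSV from step 3
# (the passengers and values here are invented for illustration).
csv_text = """Name,Gender,Age,Class_of_Service,Survival_Result
John Smith,M,40,3rd,Lost
Mary Jones,F,29,1st,Survived
"""

LABEL = "Survival_Result"   # the role RapidMiner calls the 'label'
EXCLUDED = {"Name"}         # useful in results, never used as a predictor

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    label = row[LABEL]
    predictors = {k: v for k, v in row.items()
                  if k != LABEL and k not in EXCLUDED}
```

Excluding identifiers this way mirrors setting their role in RapidMiner: they still appear alongside each prediction in the output, but the tree never splits on them.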
4) In a new, blank spreadsheet in OpenOffice Calc, duplicate the attribute names from your
training data set, with the exception of Survival_Result. You will predict this attribute
using your decision tree.
5) Enter data for yourself and people you know into this spreadsheet.
a. For some attributes, you may have to decide what to put. For example, the author
acknowledges that, based on how relentlessly he searches for the absolutely
cheapest ticket when shopping for airfare, he almost certainly would have been in
3rd class if he had been on the Titanic. He further knows some people who very
likely would have been in 1st class.
b. If you want to include some people in your data set but you don’t know every single
attribute for them, remember that decision trees can handle some missing values.
c. Save this spreadsheet as a CSV file and import it into your RapidMiner repository.
d. Drag this data set into your process and ensure that attributes that are not predictive,
such as names, will not be included as predictors in the model.
6) Apply your decision tree model to your scoring data set.
7) Run your model using gain_ratio. Report your tree’s nodes, and discuss whether you and
the people you know would have lived, died, or been lost.
8) Re-run your model using gini_index. Report any differences in your tree’s structure, and
discuss whether your chances of survival increase under Gini.
9) Experiment with changing leaf and split sizes, and other decision tree algorithm criteria,
such as information_gain. Analyze and report your results.
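To see why gain_ratio, gini_index, and information_gain can choose different splits, the measures behind them can be computed by hand. The sketch below uses made-up class counts for a hypothetical Gender split of 20 training passengers; the formulas are the standard Gini impurity, entropy-based information gain, and C4.5-style gain ratio (information gain divided by the split's own entropy):

```python
import math

def gini(counts):
    """Gini impurity of a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Hypothetical split on Gender (counts invented for illustration):
# parent node: 8 survived, 12 lost; F branch: 6/2; M branch: 2/10.
parent, f_branch, m_branch = [8, 12], [6, 2], [2, 10]
n = sum(parent)

def weighted(measure):
    """Child impurity weighted by each branch's share of the rows."""
    return (sum(f_branch) / n) * measure(f_branch) \
         + (sum(m_branch) / n) * measure(m_branch)

gini_gain = gini(parent) - weighted(gini)          # gini_index criterion
info_gain = entropy(parent) - weighted(entropy)    # information_gain criterion
split_info = entropy([sum(f_branch), sum(m_branch)])
gain_ratio = info_gain / split_info                # gain_ratio criterion
```

Because gain ratio divides by split_info, attributes that fragment the data into many small branches (such as names or ID numbers) are penalized, while plain information gain and Gini can rank the same candidate splits differently, which is why re-running the model under another criterion can reshape the tree.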