Data Mining for the Masses
CHAPTER SUMMARY
Decision trees are excellent predictive models when the target attribute is categorical in nature and
the data set is of mixed types. Although this chapter’s data sets did not contain any examples,
decision trees are also better than more statistics-based approaches at handling attributes with
missing or inconsistent values: a decision tree will work around such data and still generate usable
results.
Decision trees are made of nodes and leaves, connected by labeled branch arrows, that represent
the best predictor attributes in a data set. These nodes and leaves lead to confidence percentages
based on the actual attributes in the training data set; the tree can then be applied to similarly
structured scoring data in order to generate predictions for the scoring observations. Decision
trees tell us what is predicted, how confident we can be in the prediction, and how we arrived at the
prediction. The ‘how we arrived at’ portion of a decision tree’s output is shown in a graphical view
of the tree.
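As a rough sketch, the structure described above can be modeled in Python as nested dictionaries, with each leaf holding class counts from the training data; the confidence percentage is simply the majority class's share of that leaf's counts. The tree below is entirely made up for illustration and is not RapidMiner's internal representation.

```python
# Hypothetical toy tree (illustrative only): inner nodes name a predictor
# attribute and branch on its values; leaves store training-class counts.
tree = {
    "attribute": "Class_of_Service",
    "branches": {
        "1st": {"counts": {"Survived": 18, "Lost": 6}},   # leaf
        "3rd": {                                          # inner node
            "attribute": "Gender",
            "branches": {
                "F": {"counts": {"Survived": 9, "Lost": 4}},
                "M": {"counts": {"Survived": 3, "Lost": 21}},
            },
        },
    },
}

def predict(node, observation):
    """Walk the tree to a leaf; return (prediction, confidence)."""
    while "attribute" in node:                 # still at an inner node
        node = node["branches"][observation[node["attribute"]]]
    counts = node["counts"]
    label = max(counts, key=counts.get)        # majority class at the leaf
    confidence = counts[label] / sum(counts.values())
    return label, confidence

label, conf = predict(tree, {"Class_of_Service": "3rd", "Gender": "M"})
# → ("Lost", 0.875), i.e. 21 of the 24 training passengers at this leaf
```

The same walk that produces the prediction also records the path taken, which is the "how we arrived at" portion a graphical tree view displays.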
REVIEW QUESTIONS
1) What characteristics of a data set’s attributes might prompt you to choose a decision tree
data mining methodology, rather than a logistic or linear regression approach? Why?
2) Run this chapter’s model using the gain_ratio algorithm and make a note of three or four
individuals’ predictions and confidences. Then re-run the model under gini_index. Locate
the people you noted. Did their predictions and/or confidences change? Look at their
attribute values and compare them to the nodes and leaves in the decision tree. Explain
why you think at least one person’s prediction changed under Gini, based on that person’s
attributes and the tree’s nodes.
3) What are confidence percentages used for, and why would they be important to consider,
in addition to just considering the prediction attribute?
4) How can you retain an attribute, such as a person’s name or ID number, that should not
be treated as predictive in a process’s model but is still useful to have in the data mining
results?
Chapter 10: Decision Trees
5) If your decision tree is large or hard to read, how can you adjust its visual layout to
improve readability?
EXERCISE
For this chapter’s exercise, you will make a decision tree to predict whether you, and others
you know, would have lived, died, or been lost if you had been on the Titanic. Complete the
following steps.
1) Conduct an Internet search for passenger lists for the Titanic. The search term ‘Titanic
passenger list’ in your favorite search engine will yield a number of web sites containing
lists of passengers.
2) From the sources you find, select a sample of passengers. You do not need to construct a
training data set of every passenger on the Titanic (unless you want to), but get at least 30,
and preferably more. The more robust your training data set is, the more interesting your
results will be.
3) In a spreadsheet in OpenOffice Calc, enter these passengers’ data.
a. Record attributes such as their name, age, gender, class of service they traveled in,
race or nationality if known, or other attributes that may be available to you
depending on the detail level of the data source you find.
b. Be sure to have at least four attributes, preferably more. Remember that the
passengers’ names or ID numbers won’t be predictive, so that attribute shouldn’t
be counted as one of your predictor attributes.
c. Add to your data set whether the person lived (i.e. was rescued from a life boat or
from the water), died (i.e. their body was recovered), or was lost (i.e. was on the
Titanic’s manifest but was never accounted for and therefore presumed dead after
the ship’s sinking). Call this attribute ‘Survival_Result’.
d. Save this spreadsheet as a CSV file and then import it into your RapidMiner
repository. Set the Survival_Result attribute’s role to be your label. Set other
attributes which are not predictive, such as names, to not be considered in the
decision tree model.
e. Add a Decision Tree operator to your stream.
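The label-and-exclusion setup in step 3 can be sketched outside RapidMiner as well. The minimal Python below (file contents, names, and attribute values are all made up for illustration) separates each training row into a Survival_Result label and predictor attributes, with Name kept for reporting but excluded from the predictors:

```python
import csv
import io

# A miniature stand-in for the training CSV from step 3
# (the passengers and values here are invented for illustration).
csv_text = """Name,Gender,Age,Class_of_Service,Survival_Result
John Smith,M,40,3rd,Lost
Mary Jones,F,29,1st,Survived
"""

LABEL = "Survival_Result"   # the role RapidMiner calls the 'label'
EXCLUDED = {"Name"}         # useful in results, never used as a predictor

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    label = row[LABEL]
    predictors = {k: v for k, v in row.items()
                  if k != LABEL and k not in EXCLUDED}
```

Excluding identifiers this way mirrors setting their role in RapidMiner: they still appear alongside each prediction in the output, but the tree never splits on them.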
4) In a new, blank spreadsheet in OpenOffice Calc, duplicate the attribute names from your
training data set, with the exception of Survival_Result. You will predict this attribute
using your decision tree.
5) Enter data for yourself and people you know into this spreadsheet.
a. For some attributes, you may have to decide what to put. For example, the author
acknowledges that, based on how relentlessly he searches for the absolutely
cheapest ticket when shopping for airfare, he almost certainly would have been in
3rd class if he had been on the Titanic. He further knows some people who very
likely would have been in 1st class.
b. If you want to include some people in your data set but you don’t know every single
attribute for them, remember that decision trees can handle some missing values.
c. Save this spreadsheet as a CSV file and import it into your RapidMiner repository.
d. Drag this data set into your process and ensure that attributes that are not predictive,
such as names, will not be included as predictors in the model.
6) Apply your decision tree model to your scoring data set.
7) Run your model using gain_ratio. Report your tree’s nodes, and discuss whether you and
the people you know would have lived, died, or been lost.
8) Re-run your model using gini_index. Report any differences in your tree’s structure, and
discuss whether your chances of survival increase under Gini.
9) Experiment with changing leaf and split sizes, and other decision tree algorithm criteria,
such as information_gain. Analyze and report your results.
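To see why gain_ratio, gini_index, and information_gain can choose different splits, the measures behind them can be computed by hand. The sketch below uses made-up class counts for a hypothetical Gender split of 20 training passengers; the formulas are the standard Gini impurity, entropy-based information gain, and C4.5-style gain ratio (information gain divided by the split's own entropy):

```python
import math

def gini(counts):
    """Gini impurity of a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Hypothetical split on Gender (counts invented for illustration):
# parent node: 8 survived, 12 lost; F branch: 6/2; M branch: 2/10.
parent, f_branch, m_branch = [8, 12], [6, 2], [2, 10]
n = sum(parent)

def weighted(measure):
    """Child impurity weighted by each branch's share of the rows."""
    return (sum(f_branch) / n) * measure(f_branch) \
         + (sum(m_branch) / n) * measure(m_branch)

gini_gain = gini(parent) - weighted(gini)          # gini_index criterion
info_gain = entropy(parent) - weighted(entropy)    # information_gain criterion
split_info = entropy([sum(f_branch), sum(m_branch)])
gain_ratio = info_gain / split_info                # gain_ratio criterion
```

Because gain ratio divides by split_info, attributes that fragment the data into many small branches (such as names or ID numbers) are penalized, while plain information gain and Gini can rank the same candidate splits differently, which is why re-running the model under another criterion can reshape the tree.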