The ARBORETUM Procedure
Overview
A decision tree is a type of predictive model developed independently in the statistics
and artificial intelligence communities. A tree partitions large amounts of data into
segments called terminal nodes or leaves. The data in a leaf determine estimates of
the value of a target variable, the dependent variable to be predicted. These estimates
are subsequently applied to predict the target of a new observation assigned to the
leaf. The ARBORETUM procedure searches for partitions that fit the training data,
the data used to compute the estimates. If these estimates fit new data well, the tree is
said to generalize well. Good generalization is the primary goal for predictive tasks.
A tree might fit the training data well but generalize poorly.
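The distinction between fitting the training data and generalizing can be sketched in a few lines of Python. This is an illustration only, not PROC ARBORETUM syntax: the one-split tree, the threshold of 5.0, the function names, and the data are all hypothetical. Each leaf estimate is taken to be the mean target value of the training observations in that leaf, and the same estimates are then applied to new observations.

```python
from statistics import mean

def assign_leaf(x, threshold=5.0):
    """Hypothetical one-split tree: route an observation to a leaf by its input value."""
    return "left" if x < threshold else "right"

def fit_leaf_estimates(train):
    """Estimate the target in each leaf as the mean target of its training observations."""
    by_leaf = {}
    for x, y in train:
        by_leaf.setdefault(assign_leaf(x), []).append(y)
    return {leaf: mean(ys) for leaf, ys in by_leaf.items()}

def predict(estimates, x):
    """Predict a new observation with the estimate of the leaf it is assigned to."""
    return estimates[assign_leaf(x)]

def mean_squared_error(estimates, data):
    """Average squared difference between actual and predicted target values."""
    return mean((y - predict(estimates, x)) ** 2 for x, y in data)

# (input, target) pairs; made-up values for illustration
train = [(1, 10.0), (2, 12.0), (6, 20.0), (8, 22.0)]
new = [(3, 11.0), (7, 25.0)]

estimates = fit_leaf_estimates(train)
train_error = mean_squared_error(estimates, train)  # fit to the training data
new_error = mean_squared_error(estimates, new)      # fit to new data (generalization)
```

In this made-up example the error on the new data is larger than the error on the training data, which is the situation the paragraph above warns about.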
Decision trees may also help prepare data for other predictive models by suggesting
which variables to use, suggesting interactions of variables, and providing segments
for stratified modeling.
Trees are popular because they seem easy to use and understand. They seem easy to
use because trees accept interval, ordinal, and nominal variables and tolerate missing
values well. No understanding of statistical distributions is necessary because tree
construction relies on frequencies of values in the training data set. Trees seem easy
to understand because small trees clearly depict how a few variables characterize a
target variable.
However, trees have shortcomings. Small trees, though easy to understand, are too
simplistic to represent complex relationships in data, and large trees have many
partitions that, collectively, may be difficult to comprehend. Trees require a lot of data
to discover complex relationships, and even to fit a simple linear relationship well.
Trees are therefore less efficient and less intuitive than a simple regression when a
linear relationship exists. Even when a tree provides a simple and accurate
description, other equally simple and accurate descriptions may exist. The tree would then
give the false impression that certain variables uniquely explain the variations in the
target values, whereas different variables would suggest a different interpretation that
might generalize just as well.
Note: this document describes syntax for the ARBORETUM procedure in SAS 9.1.
The syntax is likely to change and be incompatible in future releases.
Terminology
The ARBORETUM procedure uses recursive partitioning to create a decision tree.
Recursive partitioning partitions the data into subsets and then partitions each of the
subsets, and so on. In the terminology of the tree metaphor, the subsets are nodes,
the original data set is the root node, and the final, unpartitioned subsets are terminal
nodes or leaves. Nodes that are not terminal nodes are sometimes called internal
nodes. The subsets of a single partition are commonly called child nodes, thereby
mixing the metaphor with genealogy, which also provides the terms descendant and
ancestor nodes. A branch of a node consists of a child node and its descendants.
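The tree vocabulary above maps naturally onto a small recursive data structure. The following Python sketch is illustrative only; the class name, helper functions, and node names are hypothetical and do not correspond to anything in PROC ARBORETUM.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A subset of the data; `children` holds the subsets of its partition."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A terminal node (leaf) is a node that has not been partitioned.
        return not self.children

def descendants(node):
    """All nodes below `node`: its children, their children, and so on."""
    found = []
    for child in node.children:
        found.append(child)
        found.extend(descendants(child))
    return found

def branch(child):
    """A branch consists of a child node and its descendants."""
    return [child] + descendants(child)

def leaves(node):
    """Terminal nodes at or below `node`."""
    return [n for n in [node] + descendants(node) if n.is_leaf()]

# The root is split into A and B; A is split again into A1 and A2.
a = Node("A", [Node("A1"), Node("A2")])
root = Node("root", [a, Node("B")])

leaf_names = [n.name for n in leaves(root)]
internal_names = [n.name for n in [root] + descendants(root) if not n.is_leaf()]
branch_names = [n.name for n in branch(a)]
```

Here A1, A2, and B are leaves, root and A are internal nodes, and the branch of root that begins at A contains A, A1, and A2.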
The ARBORETUM procedure defines a partition in terms of values of a single
variable selected from a set of available variables called input variables. A rule assigning
the variable values to branches is called a splitting rule. It is sometimes called the
primary splitting rule when discussing alternative partitions of the same node. A
competing rule refers to a partition considered on an input variable other than the one
used in the primary rule. A leaf has no primary or competing rule, but may have a
candidate rule ready for splitting the leaf. A surrogate rule is one chosen to emulate
a primary rule, and is used when the primary rule cannot be used, most commonly
when an observation is missing a value for the primary input variable.
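The role of a surrogate rule can be sketched as a fallback chain. The Python below is a minimal illustration, not the procedure's implementation: the variable names (income, age), the thresholds, the two-branch rule form, and the default branch used when every rule fails are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

Observation = Dict[str, Optional[float]]
Rule = Callable[[Observation], Optional[str]]

def make_rule(variable: str, threshold: float) -> Rule:
    """Build a two-branch rule on one input variable.

    The rule returns None when the observation has no value for the
    variable, meaning the rule cannot be used.
    """
    def rule(obs: Observation) -> Optional[str]:
        value = obs.get(variable)
        if value is None:
            return None
        return "left" if value < threshold else "right"
    return rule

def assign_branch(obs: Observation, primary: Rule,
                  surrogates: List[Rule], default: str = "left") -> str:
    """Try the primary rule, then each surrogate in order, then a default branch."""
    for rule in [primary] + list(surrogates):
        chosen = rule(obs)
        if chosen is not None:
            return chosen
    return default  # assumption: some fixed branch receives unassignable observations

primary = make_rule("income", 50000.0)   # hypothetical primary splitting rule
surrogate = make_rule("age", 40.0)       # hypothetical surrogate rule

complete = {"income": 30000.0, "age": 55.0}
missing_income = {"income": None, "age": 55.0}
missing_both = {"income": None, "age": None}

branch_complete = assign_branch(complete, primary, [surrogate])
branch_missing_income = assign_branch(missing_income, primary, [surrogate])
branch_missing_both = assign_branch(missing_both, primary, [surrogate])
```

The observation with a missing income is routed by the surrogate rule on age, while the fully observed one is routed by the primary rule.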
The ARBORETUM procedure searches for a splitting rule that maximizes an
association between the target variable and the node partitions. The splitting criterion
defines the measure of worth of the rule.
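To make the idea of a measure of worth concrete, the following Python sketch scores binary splits on one interval input by the reduction in Gini impurity, a widely used criterion for categorical targets. It is an assumption chosen for illustration; the text above does not say which criterion the procedure applies, and the function names and data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of categorical target values."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_worth(parent, branches):
    """Worth of a partition: parent impurity minus the weighted branch impurity."""
    n = len(parent)
    weighted = sum(len(b) / n * gini(b) for b in branches)
    return gini(parent) - weighted

def best_threshold(values, labels):
    """Search binary splits of the form `value < t`; return (worth, t) of the best."""
    best = (0.0, None)
    for t in sorted(set(values))[1:]:
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        worth = split_worth(labels, [left, right])
        if worth > best[0]:
            best = (worth, t)
    return best

values = [1, 2, 3, 4]            # made-up interval input
labels = ["a", "a", "b", "b"]    # made-up categorical target
worth, threshold = best_threshold(values, labels)
```

With these values the split `value < 3` separates the two target categories perfectly, so it has the largest worth of the three candidate thresholds.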
A nominal variable is a numeric or character categorical variable in which the
categories are unordered. An ordinal variable is a numeric or character categorical
variable in which the categories are ordered. An interval variable is a numeric variable for
which differences of values are informative. The measurement level of the variable is
the property of being nominal, ordinal, or interval.
The ARBORETUM procedure uses normalized, formatted values of categorical
variables, and considers two categorical values the same if the normalized values are the
same. Normalization removes any leading blank spaces from a value, converts lower
case characters to upper case, and truncates the value to 32 characters. The FORMAT
procedure in the SAS Procedures Guide explains how to define a format. A FORMAT
statement in the current run of PROC ARBORETUM or in the DATA step that created
the training data associates a format with a variable. By default, numeric variables
use the BEST12 format, and the formatted values of character variables are the same
as the unformatted ones.
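The three normalization steps described above can be mimicked in Python as follows. This is only a sketch of the stated behavior applied to an already formatted value; the function name is hypothetical, and the handling of anything not mentioned above (for example, non-ASCII characters) is not addressed.

```python
def normalize(formatted_value: str) -> str:
    """Remove leading blanks, convert to upper case, and truncate to 32 characters."""
    return formatted_value.lstrip(" ").upper()[:32]

# Two formatted values are treated as the same category when their
# normalized forms are equal.
same_category = normalize("  yes") == normalize("Yes")
long_value = normalize("x" * 40)
```

Under this rule "  yes" and "Yes" fall into one category, and two long values that differ only after the 32nd character are indistinguishable.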
The relative proportions of categorical target values in the training data may differ
from the proportions in the data to which the tree will be applied. Estimates of these
latter proportions should be specified as prior probabilities when the tree is
created. If the prior probabilities are the same as the proportions of the target values in
the training data, then the predicted probabilities for an observation, also called the
posterior probabilities, equal the proportions of the target values in the training data
in the leaf to which the observation is assigned. If the prior probabilities differ from
the training data proportions, then the posterior probabilities also differ from the
proportions in the leaf.
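One common way to incorporate prior probabilities, shown here as a Python sketch, is to weight each target value's count in the leaf by the ratio of its prior probability to its count in the training data and then rescale so the results sum to one. Whether PROC ARBORETUM computes posterior probabilities in exactly this way is an assumption; the function name and the counts are hypothetical, and the leaf is assumed to be non-empty.

```python
def posterior_probabilities(leaf_counts, train_counts, priors):
    """Adjust the target proportions in a leaf for prior probabilities.

    The posterior for target value j is proportional to
    priors[j] * leaf_counts[j] / train_counts[j].
    """
    weights = {
        j: priors[j] * leaf_counts.get(j, 0) / train_counts[j]
        for j in train_counts
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

train_counts = {"yes": 500, "no": 500}   # target counts in the training data
leaf_counts = {"yes": 30, "no": 10}      # target counts in one leaf

# Priors equal to the training proportions reproduce the leaf proportions.
matching = posterior_probabilities(leaf_counts, train_counts, {"yes": 0.5, "no": 0.5})

# Priors that differ from the training proportions shift the posteriors.
shifted = posterior_probabilities(leaf_counts, train_counts, {"yes": 0.1, "no": 0.9})
```

With matching priors the posterior probabilities are the leaf proportions, 0.75 and 0.25. With a prior of 0.1 for "yes" they become 0.25 and 0.75, even though the leaf itself is unchanged.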