Retrospective Pruning
49
Surrogate Splitting Rules
A surrogate splitting rule is a backup to the main splitting rule. For example, the main
rule might use the variable CITY and the surrogate might use the variable REGION.
If CITY is missing and REGION is not missing, the surrogate rule is applied to the
observation.
The measure of agreement between a main splitting rule and a surrogate is the pro-
portion of observations in the within-node training sample that the two rules assign to
the same branch. The definition excludes observations with missing values or unseen
categorical values of the variable used in the main splitting rule. However, remaining
observations with missing or unseen values of the surrogate variable count as obser-
vations not assigned to the same branch. Therefore, an observation whose value is
missing for the variable used in the surrogate rule but not the variable in the main rule
diminishes the measure of agreement between the two rules.
The search for a surrogate rule treats infrequent categorical values as missing values.
A categorical value is considered infrequent when it appears in fewer observations
than the number specified in the MINCATSIZE= option. This policy does not dimin-
ish the agreement measure because the search for the main splitting rule also treats
infrequent values as missing.
The ARBORETUM procedure discards a surrogate unless its agreement is larger than
the proportion of nonmissing observations that the main rule assigns to any single
branch. This ensures that a surrogate has a better agreement than the trivial rule that
assigns all observations to the largest branch. The MAXSURROGATES= option in
the PROC statement specifies the maximum number of surrogate rules to use with a
splitting rule.
The ARBORETUM procedure always finds a surrogate rule achieving the maximum
possible agreement with the main splitting rule, except when the surrogate variable
is an interval and the main rule creates more than two branches. A best surrogate is
usually found even in this situation. The search begins by finding the best surrogate
among binary splits, creating two intervals, and proceeds recursively by finding the
best binary split of one of the new intervals.
Tree Assessment and the Subtree Sequence
The ASSESS statement declares an assessment measure with which to evaluate trees.
The ARBORETUM procedure anticipates creating more nodes than are worthwhile,
and therefore considers subtrees obtainable by choosing which nodes to be leaves and
then deleting their descendents. For every feasible number of leaves, the procedure
chooses a subtree with the best sequencing assessment measure (discussed in the
next paragraph), and then organizes these subtrees into a sequence, beginning with
the subtree consisting only of the original root node, and ending with the largest tree
consisting of all the nodes.
The sequencing assessment measure is the same as the assessment measure in the
ASSESS statement except when the latter is LIFT or LIFTPROFIT, in which case the
sequencing measure is ASE or PROFIT, respectively. The ARBORETUM procedure
only uses the sequencing measure to create the sequence of subtrees.
50
The ARBORETUM Procedure
Retrospective Pruning
The ASSESS statement and the SUBTREE statement select one subtree in the se-
quence. Tree results output from the CODE, DESCRIBE, SAVE, and SCORE state-
ments reflect the selected subtree, and ignore any nodes not in the subtree.
This strategy of intentionally creating more nodes than will be used is called ret-
rospective pruning
, and seeks a subtree that generalizes well, that is, a subtree that
predicts better on new data than the tree with all the nodes would. A primary purpose
of developing a predictive model after all is said and done is to make a prediction
on an observation for which the target value is not already known. Predicting vali-
dation data, observations withheld from training, is worthwhile for estimating how
well the model will do when predicting observations with unknown target value, but
is not initself an important objective because the target values in the validation data
are already known.
Splitting rules are usually more effective on the training data than on new data; the tar-
get values of training data are more homogeneous within a branch and more different
between branches than the target values of validation data. The effect accumulates.
A plot of the assessment value of each subtree in the sequence versus the number of
leaves portrays the growth of unfounded optimism as an increasing gap between the
curve computed with the training data and the curve computed with the validation
data. Larger trees may overfit, producing more errors when applied to new data than
would occur with a smaller tree. Proponents of retrospective pruning contend that the
right number of nodes for a tree cannot be known while the nodes are being created.
Retrospective pruning originated with cost–complexity pruning, described in
Breiman et al. (1984). Cost-complexity pruning uses the training data to find a
nested
sequence of subtrees, each of which evaluates best among other subtrees
with the same number of leaves. Nesting requires each subtree in the sequence to
be a subtree of all larger subtrees in the sequence. A best subtree with three leaves
might not be a subtree of a best subtree with four leaves. Consequently, a nested
sequence of best subtrees may have to omit subtrees for some specific numbers of
leaves. The nesting restriction, however, makes possible a fast method of finding
the subtrees. The ARBORETUM procedure foregoes the nesting restriction to find
a best subtree for every possible number of leaves. The PRUNEDATA=TRAIN
option in the ASSESS statement chooses the same subtrees for the sequence that
cost–complexity pruning would find, as well as non-nested subtrees for all the tree
sizes that cost–complexity pruning would omit.
Retrospective pruning based entirely on validation data was introduced with re-
duced–error pruning
in Quinlan (1987). Reduced–error pruning finds the subtree
that is best for a validation data set.
It creates no sequence of subtrees.
The
PRUNEDATA=VALID option in the ASSESS statement finds the same best subtree
and provides an entire sequence.
Formulas for Assessment Measures
All assessment measures are of the form
τ ∈Λ
ω(τ, χ)λ(τ, χ)ψ(τ, χ)