The DUMMY option in the SCORE statement specifies that the OUT= data set contain one dummy variable per leaf. The value of _i_ equals the proportion of the observation assigned to the leaf with leaf
identification number i. The sum of these variables equals one for each observation.
Unless the MISSING=DISTRIBUTE option is specified in some INPUT statement
or in the PROC statement, exactly one of the variables _i_ equals one, and the rest
are zero. When the MISSING=DISTRIBUTE option is specified, observations are
distributed over more than one leaf, and _i_ equals the proportion of the observation
assigned to leaf i.
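The per-leaf dummy representation can be sketched outside of SAS. The following Python illustration is not procedure output; the leaf numbers and assignment proportions are invented to show that the _i_ columns always sum to one per observation:

```python
# Hypothetical sketch of the DUMMY representation: one column per leaf,
# holding the proportion of the observation assigned to that leaf.
# Leaf ids and assignments below are invented for illustration.

def dummy_columns(assignments, leaf_ids):
    """assignments: list of dicts mapping leaf id -> proportion assigned."""
    rows = []
    for a in assignments:
        rows.append({f"_{i}_": a.get(i, 0.0) for i in leaf_ids})
    return rows

leaf_ids = [2, 3, 5]
assignments = [
    {3: 1.0},          # typical case: exactly one _i_ equals one
    {2: 0.4, 5: 0.6},  # MISSING=DISTRIBUTE: spread over two leaves
]
rows = dummy_columns(assignments, leaf_ids)
for r in rows:
    assert abs(sum(r.values()) - 1.0) < 1e-12  # each row sums to one
```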
The SEQUENCE= option in the SAVE statement specifies a data set to contain fit
statistics for all subtrees in the subtree sequence. See the
"Tree Assessment and the Subtree Sequence"
section beginning on page 49 for an explanation of the subtree
sequence. Each observation describes a subtree with a different number of leaves.
The variables are
• _ASSESS_, the assessment value
• _VASSESS_, the assessment value based on validation data
• _SEQUENCE_, the assessment value used for creating the subtree sequence if different from _ASSESS_
• _VSEQUENCE_, the validation assessment value used for creating the subtree sequence if different from _VASSESS_
• fit statistics variables output by the OUTFIT= option of the SCORE statement
Example 1. Prior Probabilities with Biased Samples
This example illustrates the need for prior probabilities when the training data con-
tains different proportions of categorical target values than does the data to which the
model is intended to apply. A common situation is that of a binary target in which one
value occurs infrequently. Some analysts will remove a portion of the observations
with the more frequent target value from the training data. One reason is to reduce
the volume of data without changing the predictions very much. Another stems from
the belief that the algorithm performs better when starting with equal proportions of
the target values.
This example compares four approaches to analyzing data in which one value of a
binary target dominates:
One approach uses prior probabilities in the split search as well, producing results close to those
that would obtain from analyzing the original data. Using prior probabilities in the
split search produces splits similar to those found in the original data, undermining
any attempt to train the tree on roughly equal amounts of the two target values.
Suppose variables COIN1 and COIN2 represent the outcomes of tossing two coins in
which heads and tails are equally likely. Let the variable HEADS be 1 if both COIN1
and COIN2 are heads, and 0 otherwise. Evidently, the probability that HEADS
equals 1 is 1/4.
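Since the coins are independent and fair, this follows from multiplying 1/2 by 1/2. A short Python enumeration (not part of the original example) confirms the figure:

```python
from itertools import product

# Enumerate the four equally likely outcomes of two fair coin tosses and
# count how often both come up heads, i.e. how often HEADS equals 1.
outcomes = list(product(["H", "T"], repeat=2))
p_heads1 = sum(1 for c1, c2 in outcomes if (c1, c2) == ("H", "H")) / len(outcomes)
```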
COIN1 and COIN2 completely determine HEADS, and a decision tree should predict
HEADS perfectly, regardless of what prior probabilities are specified or what
proportion of target values are in the training data, as long as at least one instance
of each of the four possible combinations of COIN1 and COIN2 is in the training data.
Prior probabilities become necessary when the input variables influence the target
only imperfectly. Suppose the data contain a random variable X1 equal to COIN1 75 percent of the time, and a random variable
X2 generated similarly from COIN2. When COIN1 and COIN2 are both heads, the chance that both X1 and
X2 are heads is 0.75 times 0.75, which equals 0.5625.
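Under this noise model the joint distribution of (X1, X2) given the coins is easy to enumerate. The following Python sketch (an illustration of the arithmetic, not procedure output) computes that 0.5625 figure and, as it happens, the same value falls out as the posterior probability P(HEADS=1 | X1=H, X2=H):

```python
from itertools import product

def p_x_heads(coin):
    """P(X shows heads | underlying coin), with 75% agreement."""
    return 0.75 if coin == "H" else 0.25

# Joint probability that X1 and X2 both show heads, split by whether
# HEADS = 1 (both coins heads) or HEADS = 0. Each coin pair has prob 1/4.
joint = {0: 0.0, 1: 0.0}
for c1, c2 in product(["H", "T"], repeat=2):
    heads = 1 if (c1, c2) == ("H", "H") else 0
    joint[heads] += 0.25 * p_x_heads(c1) * p_x_heads(c2)

p_both_given_heads1 = 0.75 * 0.75             # chance both X's show heads
posterior = joint[1] / (joint[0] + joint[1])  # P(HEADS=1 | X1=H, X2=H)
```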
The proportion of observations in the data set ORIGINAL for which HEADS equals
1 is about 1/4. By keeping only every third observation in which HEADS equals
0 (that is, removing two of every three), the DATA step below creates a data set named BIASED, in which the proportion
of observations with HEADS equal to 1 is about 1/2.
data biased;
   set original;
   retain n_tails 0;
   if heads eq 0 then do;
      n_tails + 1;
      if mod( n_tails, 3) ne 0 then delete;
   end;
run;
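The arithmetic can be checked directly: keeping one in three of the HEADS=0 observations leaves about 3/4 × 1/3 = 1/4 of the original data as tails, matching the 1/4 that is heads. A small Python check mirroring the DATA step logic (on synthetic counts, not the actual data) illustrates:

```python
# Synthetic data in the expected 1:3 ratio: 1000 heads, 3000 tails.
original = [1] * 1000 + [0] * 3000

# Mirror the DATA step: keep every third observation with heads == 0,
# keep all observations with heads == 1.
biased, n_tails = [], 0
for heads in original:
    if heads == 0:
        n_tails += 1
        if n_tails % 3 == 0:
            biased.append(heads)
    else:
        biased.append(heads)

proportion_heads = sum(biased) / len(biased)
```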
In the following code, the ARBORETUM procedure creates a tree from the data set
ORIGINAL and saves three data sets containing results:
ORIGINAL_PATH description of each path to a leaf
ORIGINAL_NODES counts and predictions in each leaf
ORIGINAL_SEQ assessment of each subtree
The KEEP= and RENAME= data set options specify which variables to keep and
what to name them. Selecting and renaming variables now will make it easy to merge
and print results from different runs of the procedure later. The procedure is also run
using the BIASED data set. The code is not shown because it is identical to that
below except that ‘biased’ replaces ‘original’ everywhere.
proc arboretum data=original;
   input x1 x2 / level=binary;
   target heads / level=binary;
   save path = original_path
        nodes = original_nodes(keep=leaf n p_heads1
                               where=(leaf ne .))
        sequence = original_seq(keep=_assess_
                                rename=(_assess_ = original));
run;
The following DATA step creates a data set, PRIORS, containing a variable, PRIOR, equal to the prior probability
of the corresponding value of HEADS: 1/4 for HEADS equal to 1 and 3/4 for HEADS equal to 0, the proportions in the original data.

data priors;
   input heads prior;
   datalines;
1 0.25
0 0.75
;
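The adjustment that priors make to a leaf's posterior probabilities can be sketched generically: the class proportions observed in the biased training data are reweighted by prior(class) / training proportion(class) and renormalized. This is a standard Bayes reweighting sketch in Python, not the procedure's own code; the leaf proportions below are hypothetical:

```python
def adjust_posteriors(p_leaf, priors, train_props):
    """Reweight leaf class proportions by prior / training proportion,
    then renormalize so the adjusted posteriors sum to one."""
    weighted = {c: p_leaf[c] * priors[c] / train_props[c] for c in p_leaf}
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}

# Biased training data is roughly half heads, half tails, but the
# population priors are 1/4 heads and 3/4 tails (values from the example).
priors = {1: 0.25, 0: 0.75}
train_props = {1: 0.5, 0: 0.5}

# A hypothetical leaf in which 80% of training observations have HEADS = 1:
adjusted = adjust_posteriors({1: 0.8, 0: 0.2}, priors, train_props)
```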
The ARBORETUM procedure is run a second time on the BIASED data set, this time
with a DECISION statement to include the prior probabilities specified in the PRIOR
variable of the PRIORS data set. The procedure uses the priors to adjust the posterior
probabilities, but not to adjust the overall evaluation of a subtree unless explicitly
requested, as discussed in the section
"Incorporating Prior Probabilities in the Tree Assessment."
Hence the two ASSESS statements here. The first ASSESS statement computes the subtree