The ARBORETUM Procedure
sequence without using priors, and the second ASSESS statement computes it using
priors. The two SAVE statements save the assessment values for each subtree. The
first SAVE statement saves the values for the first subtree in the variable PRIORS,
and the second SAVE statement saves the values in the variable PASSESS, indicating
that the overall assessment uses priors.
proc arboretum data=biased ;
input x1 x2
target heads / level=binary;
decision decdata=priors priorvar=prior;
= leaf n npriors p_heads1
where =(leaf ne .)
=priors_seq(keep=_assess_ rename=(_assess_ = priors))
=assess_seq(keep=_assess_ rename=(_assess_ = passess))
The three runs of the ARBORETUM procedure produced the same splitting
rules, depicted in the ORIGINAL_PATH data set shown below. (The BIASED_PATH
and PRIORS_PATH data sets are the same and are not shown.)
Path to Each Leaf
The output describes three leaves defined as follows:
X1 equals 0 and X2 equals 1
both X1 and X2 equal 1
X2 equals 0
The output titled Expected Statistics for Each Leaf shows the proportion of
observations expected to appear in each leaf, as well as the probability that
HEADS equals 1.
Expected Statistics for Each Leaf
Proportion of N
These expected numbers will be compared with the actual numbers from the three
different ARBORETUM runs. The DATA step below merges the leaf statistics from
the three runs.
The following code creates output showing the actual count of observations in
each leaf. Only counts in the first and last columns are in the expected proportions.
The first column results from training on the ORIGINAL data set. The last two
columns show the variables N and NPRIORS, respectively, from the execution of the
ARBORETUM procedure with prior probabilities on the BIASED training data set.
The counts in the last column (variable NPRIORS) incorporate the prior probabilities
to adjust the counts of the observations displayed in the previous column.
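As a rough illustration of how prior probabilities can adjust leaf counts, the following Python sketch reweights each class in a leaf by the ratio of its prior to its share of the training sample. The function name, the class totals, the leaf counts, and the prior values are all hypothetical and are not taken from the example data. The weighting rule is a common convention assumed here, not a statement of the exact formula the ARBORETUM procedure uses to compute NPRIORS.

```python
def prior_adjusted_counts(class_counts, priors, sample_totals):
    """Reweight the per-class counts of one leaf so that classes follow the priors.

    Each class is weighted by prior / (share of that class in the training sample).
    This weighting rule is an assumption made for illustration.
    """
    n_total = sum(sample_totals.values())
    adjusted = {}
    for cls, count in class_counts.items():
        sample_share = sample_totals[cls] / n_total
        weight = priors[cls] / sample_share
        adjusted[cls] = count * weight
    return adjusted


# Hypothetical biased training sample: both classes equally represented.
sample_totals = {0: 500, 1: 500}
# Hypothetical priors: HEADS=1 is assumed to occur a quarter of the time.
priors = {0: 0.75, 1: 0.25}
# Hypothetical raw counts of each class in a single leaf (raw N = 400).
leaf_counts = {0: 100, 1: 300}

adjusted = prior_adjusted_counts(leaf_counts, priors, sample_totals)
adjusted_n = sum(adjusted.values())
print(adjusted, adjusted_n)
```

With these assumed numbers, the overrepresented class is weighted down and the underrepresented class is weighted up, so the leaf's raw count of 400 becomes an adjusted count of 300.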
Count in Each Leaf for Each ARBORETUM Run
The following output shows the predicted probability that HEADS equals 1 for each
leaf in each ARBORETUM run. The column labeled P_ORIGINAL uses the ORIGINAL
training data, the next column uses the BIASED data, and the last uses the BIASED data
with prior probabilities. Only P_ORIGINAL and the last column agree with the
expected probabilities.
Predicted Probability (HEADS = 1) by Leaf and ARBORETUM Run
The ARBORETUM procedure uses the misclassification rate by default for assessing
how well a tree fits data with a categorical target. After creating a tree, the procedure
creates a sequence of subtrees such that each subtree in the sequence has the
lowest misclassification rate among all subtrees with the same number of leaves. The
SEQUENCE= option to the SAVE statement creates a data set containing statistics
about each subtree.
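The selection rule described above can be sketched in Python as a brute-force pass over an explicit list of candidate subtrees, keeping the candidate with the lowest misclassification rate for each leaf count. The candidate names and rates below are hypothetical, and the sketch says nothing about how the ARBORETUM procedure enumerates or searches subtrees internally.

```python
def best_subtree_per_size(candidates):
    """Return, for each leaf count, the candidate with the lowest misclassification rate.

    `candidates` is an iterable of (name, n_leaves, misclassification_rate) tuples.
    The result is a list of (n_leaves, name, rate) tuples ordered by leaf count.
    """
    best = {}
    for name, n_leaves, rate in candidates:
        current = best.get(n_leaves)
        if current is None or rate < current[1]:
            best[n_leaves] = (name, rate)
    return [(n, best[n][0], best[n][1]) for n in sorted(best)]


# Hypothetical candidate subtrees with made-up misclassification rates.
candidates = [
    ("root_only", 1, 0.25),
    ("split_on_x1", 2, 0.25),
    ("split_on_x2", 2, 0.20),
    ("full_tree", 3, 0.15),
]

sequence = best_subtree_per_size(candidates)
for n_leaves, name, rate in sequence:
    print(n_leaves, name, rate)
```

Each row of the resulting sequence corresponds to one subtree size, which is the shape of the data set that the SEQUENCE= option is described as producing.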
The following DATA step code combines the subtree sequence statistics from the
three ARBORETUM runs. The third run computes the misclassification rates in two
different ways: first without and then with incorporating prior probabilities into the
misclassification counts. The DATA step therefore merges four sequences.
The following output shows the data set of four subtree sequences.
The column labeled
ORIGINAL results from the ARBORETUM run using the ORIGINAL training data
set, while the other columns are based on the BIASED sample. Prior probabilities
were specified in the run producing the last two columns, PRIORS and PASSESS. Only
for the last column, PASSESS, were the misclassification counts adjusted with prior
probabilities.
The rows represent subtrees with 1, 2, 3, and 4 leaves, respectively. The first row
represents the consequence of predicting the same value of HEADS for every
observation. The manner of generating the data results in an expected proportion of
misclassified observations of 1/4 for this subtree. Only the first and last columns
match this expected proportion. In fact, the first and last columns agree for every subtree.
proc print data=sequence;
run;
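To make the difference between unadjusted and prior-adjusted misclassification rates concrete, the following Python sketch computes both for a one-leaf subtree that predicts the same class for every observation. The class totals and prior values are hypothetical, and the adjustment shown (each class's error rate weighted by its prior) is an assumed convention, not a confirmed description of the ARBORETUM procedure's computation.

```python
def misclassification_rate(wrong, total, priors=None):
    """Misclassification rate, optionally reweighted so that classes follow the priors.

    `wrong` and `total` map each class to its misclassified and total counts.
    """
    if priors is None:
        # Plain rate: misclassified observations over all observations.
        return sum(wrong.values()) / sum(total.values())
    # Prior-adjusted rate: per-class error rate weighted by the class prior.
    return sum(priors[c] * wrong[c] / total[c] for c in total)


# Hypothetical biased sample with both classes equally represented.
total = {0: 500, 1: 500}
# One-leaf subtree that predicts HEADS=0 everywhere: every HEADS=1 case is wrong.
wrong = {0: 0, 1: 500}
# Hypothetical priors: HEADS=1 is assumed to occur a quarter of the time.
priors = {0: 0.75, 1: 0.25}

raw_rate = misclassification_rate(wrong, total)
adjusted_rate = misclassification_rate(wrong, total, priors)
print(raw_rate, adjusted_rate)
```

Under these assumptions the unadjusted rate is 0.5, reflecting the biased sample, while the adjusted rate is 0.25, the rate that would be expected if the classes occurred in the proportions given by the priors.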