The ARBORETUM Procedure
sequence without using priors, and the second ASSESS statement computes it using
priors. The two SAVE statements save the assessment values for each subtree. The
first SAVE statement saves the values for the first subtree in the variable PRIORS,
and the second SAVE statement saves the values in the variable PASSESS, indicating
that the overall assessment uses priors.
proc arboretum data=biased ;
   input x1 x2 / level=binary;
   target heads / level=binary;
   decision decdata=priors priorvar=prior;
   assess;
   save path      = priors_path
        nodestats = priors_nodes(keep   = leaf n npriors p_heads1
                                 where  = (leaf ne .)
                                 rename = (n        = n_priors
                                           npriors  = np_priors
                                           p_heads1 = p_priors))
        sequence  = priors_seq(keep=_assess_ rename=(_assess_ = priors))
        rules     = priors_rules
        ;
   assess priors;
   save sequence  = assess_seq(keep=_assess_ rename=(_assess_ = passess))
        ;
run;
The three runs of the ARBORETUM procedure have produced the same splitting
rules, which are depicted in the ORIGINAL_PATH data set shown below. (The BIASED_PATH
and PRIORS_PATH data sets are the same and are not shown.)
proc print data=original_path;
   var leaf variable character_value;
   where relation eq '=';
run;
Output 1.1. Path to Each Leaf

                           CHARACTER_
Obs    LEAF    VARIABLE    VALUE
  1       1    x1          0
  3       1    x2          1
  4       2    x1          1
  5       2    x2          1
  6       3    x2          0
The output describes three leaves defined as follows:
   1   X1 equals 0 and X2 equals 1
   2   both X1 and X2 equal 1
   3   X2 equals 0
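The three leaf definitions can be restated as a small lookup. The following Python sketch is only an illustration of the definitions listed above; the function name assign_leaf is hypothetical, and the order of the tests is chosen for readability and does not claim to reproduce the order in which the procedure made its splits.

```python
def assign_leaf(x1: int, x2: int) -> int:
    """Return the leaf number for binary inputs x1 and x2 (hypothetical helper)."""
    if x2 == 0:
        return 3  # leaf 3: X2 equals 0, whatever the value of X1
    if x1 == 0:
        return 1  # leaf 1: X1 equals 0 and X2 equals 1
    return 2      # leaf 2: both X1 and X2 equal 1


# Enumerate all four input combinations and the leaf each one falls into.
leaves = {(x1, x2): assign_leaf(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, leaf in sorted(leaves.items()):
    print(inputs, "-> leaf", leaf)
```

Two of the four input combinations fall into leaf 3, and one each into leaves 1 and 2.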
Example 1. Prior Probabilities with Biased Samples
Table 10 shows the proportion of observations expected to appear in each leaf, as well
as the probability that HEADS equals 1.
Table 10. Expected Statistics for Each Leaf

Leaf    Proportion of N    Prob(HEADS=1)
   1    0.25               0.1875 (= 0.25 * 0.75)
   2    0.25               0.5625 (= 0.75 * 0.75)
   3    0.50               0.1250 (= 0.25 * 0.50)
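As a quick consistency check on Table 10, the leaf proportions should sum to 1, and weighting each leaf probability by its proportion gives the overall probability that HEADS equals 1. The Python sketch below uses only the numbers in the table; the variable names are illustrative.

```python
# (proportion of observations, Prob(HEADS=1)) for each leaf, copied from Table 10
table_10 = {
    1: (0.25, 0.1875),
    2: (0.25, 0.5625),
    3: (0.50, 0.1250),
}

# The leaf proportions should account for every observation.
total_proportion = sum(share for share, _ in table_10.values())

# Overall Prob(HEADS=1) is the proportion-weighted average of the leaf probabilities.
overall_p_heads = sum(share * p for share, p in table_10.values())

print("total proportion:", total_proportion)
print("overall Prob(HEADS=1):", overall_p_heads)
```

The weighted average works out to 0.25, so three quarters of the observations are expected to have HEADS equal to 0.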
These expected numbers will be compared with the actual numbers from the three
different ARBORETUM runs. The DATA step below merges the leaf statistics from
the three runs.
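data nodes;
   /* One-to-one merge: each SET statement reads the next observation */
   /* from its data set, so the leaf statistics line up row by row.   */
   set original_nodes;
   set biased_nodes;
   set priors_nodes;
run;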
The following code creates Output 1.2, which shows the actual count of observations in
each leaf. Only the counts in the first and last columns are in the expected proportions.
The first column results from training on the ORIGINAL data set. The last two
columns show the variables N and NPRIORS, respectively, from the execution of the
ARBORETUM procedure with prior probabilities on the BIASED training data set.
The counts in the last column (variable NPRIORS) are the counts in the previous
column adjusted by the prior probabilities.
proc print data=nodes;
   var leaf n_: np_:;
run;
Output 1.2. Count in Each Leaf for Each ARBORETUM Run

Obs    LEAF    n_original    n_biased    n_priors    np_priors
  1       1          2529        1152        1152      1250.30
  2       2          2468        1722        1722      1223.63
  3       3          5003        2072        2072      2472.07
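To see which columns are in the expected proportions, each count can be divided by its column total and compared with the expected shares of 0.25, 0.25, and 0.50. The following Python sketch does this with the counts printed in Output 1.2; the helper name proportions is illustrative.

```python
# Leaf counts copied from Output 1.2 (leaves 1, 2, 3 in order)
counts = {
    "n_original": [2529, 2468, 5003],
    "n_biased": [1152, 1722, 2072],
    "n_priors": [1152, 1722, 2072],
    "np_priors": [1250.30, 1223.63, 2472.07],
}
expected = [0.25, 0.25, 0.50]  # expected leaf proportions


def proportions(values):
    """Convert a list of counts into shares of the column total."""
    total = sum(values)
    return [v / total for v in values]


shares = {name: proportions(values) for name, values in counts.items()}

# Largest absolute deviation from the expected proportions, per column
max_gap = {
    name: max(abs(s - e) for s, e in zip(share, expected))
    for name, share in shares.items()
}

for name in counts:
    rounded = [round(s, 3) for s in shares[name]]
    print(f"{name:11s} shares={rounded} max gap={max_gap[name]:.3f}")
```

The N_ORIGINAL and NP_PRIORS columns stay within about 0.01 of the expected shares, while the unadjusted counts from the BIASED data are off by several percentage points in every leaf.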
Output 1.3 shows the predicted probability that HEADS equals 1 for each leaf in
each ARBORETUM run. The column labeled P_ORIGINAL uses the ORIGINAL
training data, the next column uses the BIASED data, and the last uses the BIASED data
with prior probabilities. Only the P_ORIGINAL column and the last column agree with the
expected probabilities.
proc print data=nodes;
   var leaf p_:;
run;
Output 1.3. Prob(HEADS Equals 1) by Leaf and ARBORETUM Run

Obs    LEAF    p_original    p_biased    p_priors
  1       1          0.18        0.40        0.19
  2       2          0.55        0.79        0.57
  3       3          0.12        0.29        0.12
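The adjusted values in the P_PRIORS and NP_PRIORS columns can be approximated from the unadjusted BIASED results by reweighting each target class by the ratio of its prior probability to its proportion in the training sample. The Python sketch below is an illustration under stated assumptions, not a description of the procedure's internal computation: the prior of 0.25 for HEADS equal to 1 is assumed (the contents of the PRIORS data set are not shown here), and the sample proportion is estimated from the rounded values printed in Outputs 1.2 and 1.3.

```python
# Values copied from Outputs 1.2 and 1.3 (leaves 1, 2, 3 in order)
n_biased = [1152, 1722, 2072]
p_biased = [0.40, 0.79, 0.29]
p_priors_reported = [0.19, 0.57, 0.12]
np_priors_reported = [1250.30, 1223.63, 2472.07]

prior_heads = 0.25  # ASSUMPTION: prior probability that HEADS equals 1

# Proportion of HEADS=1 in the BIASED sample, estimated from the rounded outputs
sample_heads = sum(n * p for n, p in zip(n_biased, p_biased)) / sum(n_biased)

# Class weights: prior probability divided by sample proportion
w1 = prior_heads / sample_heads
w0 = (1 - prior_heads) / (1 - sample_heads)


def adjust_probability(p):
    """Reweight a leaf's Prob(HEADS=1) by the class weights and renormalize."""
    return p * w1 / (p * w1 + (1 - p) * w0)


def adjust_count(n, p):
    """Reweight a leaf's count: HEADS=1 cases by w1, HEADS=0 cases by w0."""
    return n * (p * w1 + (1 - p) * w0)


p_adjusted = [adjust_probability(p) for p in p_biased]
n_adjusted = [adjust_count(n, p) for n, p in zip(n_biased, p_biased)]

for leaf, (p, n) in enumerate(zip(p_adjusted, n_adjusted), start=1):
    print(f"leaf {leaf}: adjusted prob={p:.3f} adjusted count={n:.1f}")
```

Because the inputs are rounded to two decimals, the results match the reported columns only approximately: the probabilities agree to within 0.01 and the counts to within a few observations.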
Incorporating Prior Probabilities in the Tree Assessment
The ARBORETUM procedure uses the misclassification rate by default for assessing
how well a tree fits data with a categorical target. After creating a tree, the
procedure creates a sequence of subtrees such that each subtree in the sequence has the
lowest misclassification rate among all subtrees with the same number of leaves. The
SEQUENCE= option to the SAVE statement creates a data set containing statistics
about each subtree.
The following DATA step code combines the subtree sequence statistics from the
three ARBORETUM runs. The third run computes the misclassification rates in two
different ways: first without and then with incorporating prior probabilities into the
misclassification counts. The DATA step therefore merges four sequences.
data sequence;
   /* One-to-one merge of the four subtree sequences, row by row */
   set original_seq;
   set biased_seq;
   set priors_seq;
   set assess_seq;
run;
Output 1.4 shows the data set of four subtree sequences. The column labeled
ORIGINAL results from the ARBORETUM run using the ORIGINAL training data
set, while the other columns are based on the BIASED sample. Prior probabilities
were specified in the run producing the last two columns, PRIORS and PASSESS. Only
for the last column, PASSESS, were the misclassification counts adjusted with prior
probabilities.
The rows represent the successive subtrees in the sequence, ordered by number of
leaves. The first row, the single-leaf subtree, represents the consequence of
predicting every observation to have HEADS equal to 0. The data are generated so
that the expected proportion of observations with HEADS equal to 0 is 3/4, so
the expected misclassification rate is 1/4. Only the first and last columns match
this expected rate. In fact, the first and last columns agree for every subtree.
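The expected misclassification rates follow directly from the expected leaf statistics in Table 10. The Python sketch below assumes that each leaf predicts its more probable HEADS value, which is how a misclassification rate is normally evaluated; it computes the rate for the single-leaf subtree and for the tree with all three leaves. The variable names are illustrative.

```python
# (proportion of observations, Prob(HEADS=1)) for each leaf, copied from Table 10
table_10 = {
    1: (0.25, 0.1875),
    2: (0.25, 0.5625),
    3: (0.50, 0.1250),
}

# Single-leaf subtree: one prediction for every observation, so the error
# rate is the probability of the less likely HEADS value overall.
overall_p_heads = sum(share * p for share, p in table_10.values())
root_error = min(overall_p_heads, 1 - overall_p_heads)

# Three-leaf tree: each leaf predicts its more probable value, so its error
# contribution is its proportion times the probability of the other value.
tree_error = sum(share * min(p, 1 - p) for share, p in table_10.values())

print("single-leaf expected misclassification rate:", root_error)
print("three-leaf expected misclassification rate:", tree_error)
```

The single-leaf rate is 0.25, matching the 1/4 derived in the text. The three-leaf rate is 0.21875, which is lower because leaf 2 is the only leaf in which HEADS equal to 1 is the more probable value.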
proc print data=sequence;
run;