The DUMMY option in the SCORE statement specifies that the OUT= data set contain
numeric variables _i_ for integers i ranging from 1 to the number of leaves. The
value of _i_ equals the proportion of the observation assigned to the leaf with leaf
identification number i. The sum of these variables equals one for each observation.
Unless the MISSING=DISTRIBUTE option is specified in an INPUT statement
or in the PROC statement, exactly one of the variables _i_ equals one, and the rest
are zero. When the MISSING=DISTRIBUTE option is specified, observations may be
distributed over more than one leaf, and _i_ equals the proportion of the observation
assigned to leaf i.
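As an illustration of this layout (hypothetical values sketched in plain Python, not actual PROC ARBORETUM output):

```python
# Hypothetical sketch of the _i_ dummy variables described above.
# The function name and values are illustrative only.

def leaf_dummies(proportions, n_leaves):
    """Return the _1_ .. _n_ values for one observation.

    `proportions` maps a leaf identification number to the proportion
    of the observation assigned there: a single {leaf: 1.0} entry in
    the default case, or several entries when MISSING=DISTRIBUTE
    spreads the observation over more than one leaf.
    """
    return [proportions.get(i, 0.0) for i in range(1, n_leaves + 1)]

# Default behavior: the whole observation lands in one leaf.
one_leaf = leaf_dummies({3: 1.0}, n_leaves=4)        # [0.0, 0.0, 1.0, 0.0]

# MISSING=DISTRIBUTE: proportions over several leaves, still summing to 1.
distributed = leaf_dummies({1: 0.25, 3: 0.75}, n_leaves=4)

print(one_leaf, sum(one_leaf))
print(distributed, sum(distributed))
```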
SEQUENCE= Output Data Set
The SEQUENCE= option in the SAVE statement specifies a data set to contain fit
statistics for all subtrees in the subtree sequence. See the "Tree Assessment and the
Subtree Sequence" section beginning on page 49 for an explanation of the subtree
sequence. Each observation describes a subtree with a different number of leaves.
The variables are
• _ASSESS_, the assessment value
• _VASSESS_, the assessment value based on validation data
• _SEQUENCE_, the assessment value used for creating the subtree sequence
if different from _ASSESS_
• _VSEQUENCE_, the validation assessment value used for creating the subtree
sequence if different from _VASSESS_
• fit statistics variables output by the OUTFIT= option of the SCORE statement
Examples
Example 1. Prior Probabilities with Biased Samples
This example illustrates the need for prior probabilities when the training data
contains different proportions of categorical target values than does the data to which
the model is intended to apply. A common situation is that of a binary target in which
one value occurs infrequently. Some analysts will remove a portion of the observations
with the more frequent target value from the training data. One reason is to reduce
the volume of data without changing the predictions very much. Another stems from
the belief that the algorithm performs better when starting with equal proportions of
the target values.
This example compares four approaches to analyzing data in which one value of a
binary target dominates:
ORIGINAL
no sampling or use of prior probabilities
BIASED
sampling of observations with the majority target value
PRIORS
using prior probabilities in prediction and assessment
PSEARCH
using prior probabilities in the split search as well
Prior probabilities convert the predictions from the BIASED analysis to predictions
that would obtain from analyzing the original data. Using prior probabilities in the
split search produces splits similar to those found in the original data, undermining
any attempt to train the tree on roughly equal amounts of the two target values.
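The conversion amounts to reweighting each posterior probability by the ratio of its prior to its proportion in the biased training data, and then renormalizing. A minimal sketch of that standard adjustment (the function name and the numbers are illustrative, not part of PROC ARBORETUM, which performs the adjustment internally):

```python
def adjust_posteriors(posteriors, train_props, priors):
    """Reweight class posteriors estimated from a biased training sample.

    posteriors[i]  - model's posterior for class i on the biased data
    train_props[i] - proportion of class i in the biased training data
    priors[i]      - prior (true) probability of class i
    """
    weighted = [p * pr / tp
                for p, tp, pr in zip(posteriors, train_props, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# The biased sample has roughly equal target proportions (1/2, 1/2),
# while the true priors are (3/4 tails, 1/4 heads).
adjusted = adjust_posteriors(posteriors=[0.5, 0.5],
                             train_props=[0.5, 0.5],
                             priors=[0.75, 0.25])
print(adjusted)  # [0.75, 0.25]
```

A 50/50 posterior on the biased data maps back to the priors themselves, exactly as a leaf with no information about the target should.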
Suppose variables COIN1 and COIN2 represent the outcomes of tossing two coins in
which heads and tails are equally likely. Let the variable HEADS be 1 if both COIN1
and COIN2 are heads, and 0 otherwise. Evidently, the probability that HEADS
equals 1 is 1/4.
COIN1 and COIN2 completely determine HEADS, and a decision tree should predict
HEADS perfectly, regardless of what prior probabilities are specified or what
proportion of target values is in the training data, as long as at least one instance
of each of the four possible combinations of COIN1 and COIN2 is in the training
data.
Prior probabilities become necessary when the input variables influence the target
without completely determining it. The SAS DATA step below generates a random
variable X1 equal to COIN1 75 percent of the time, and generates a random variable
X2 similarly. When COIN1 and COIN2 are both heads, the chance that both X1 and
X2 are heads is 0.75 times 0.75, which equals 0.5625.
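The 0.75 times 0.75 calculation can be checked by exact enumeration of the joint distribution (plain Python, illustrative only, not part of the SAS program):

```python
from itertools import product

P_FLIP = 0.25  # chance that Xi differs from its coin

def p_x_given_coin(x, coin):
    return 1 - P_FLIP if x == coin else P_FLIP

# Joint distribution of (coin1, coin2, x1, x2); the coins are fair
# and independent, and each Xi depends only on its own coin.
joint = {
    (c1, c2, x1, x2): 0.25 * p_x_given_coin(x1, c1) * p_x_given_coin(x2, c2)
    for c1, c2, x1, x2 in product((0, 1), repeat=4)
}

# P(X1 = 1 and X2 = 1 | HEADS = 1): HEADS = 1 means both coins are 1,
# an event of probability 1/4.
p_xx_given_heads = sum(p for (c1, c2, x1, x2), p in joint.items()
                       if c1 == c2 == 1 and x1 == x2 == 1) / 0.25
print(p_xx_given_heads)  # 0.5625
```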
data original;
   keep heads x1 x2;
   call streaminit(9754321);
   do i = 1 to 10000;
      coin1 = rand('bernoulli', 0.5);
      coin2 = rand('bernoulli', 0.5);
      heads = coin1 eq 1 and coin2 eq 1;
      x1 = coin1;
      if rand('bernoulli', 0.25) ne 0 then x1 = 1 - x1;
      x2 = coin2;
      if rand('bernoulli', 0.25) ne 0 then x2 = 1 - x2;
      output;
   end;
run;
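As a cross-check of the generating process, the same logic can be mirrored in plain Python (this is not part of the SAS example; the seed is arbitrary):

```python
import random

random.seed(9754321)  # any seed works; this one echoes the SAS example

rows = []
for _ in range(10000):
    coin1 = random.random() < 0.5
    coin2 = random.random() < 0.5
    heads = int(coin1 and coin2)
    # Each Xi copies its coin, but is flipped 25 percent of the time.
    x1 = (not coin1) if random.random() < 0.25 else coin1
    x2 = (not coin2) if random.random() < 0.25 else coin2
    rows.append((heads, int(x1), int(x2)))

p_heads = sum(h for h, _, _ in rows) / len(rows)
print(round(p_heads, 3))  # close to 0.25
```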
The proportion of observations in the data set ORIGINAL for which HEADS equals 1
is about 1/4. By keeping only every third observation in which HEADS is not equal
to 1, the DATA step below creates a data set named BIASED, in which the proportion
of observations with HEADS equal to 1 is about 1/2.
data biased;
   retain n_tails 0;
   drop n_tails;
   set original;
   if heads eq 0 then do;
      n_tails + 1;
      if mod(n_tails, 3) eq 0 then output;
   end;
   else do;
      output;
   end;
run;
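The counts work out as follows (an illustrative plain-Python tally, assuming the nominal 2500/7500 split in ORIGINAL):

```python
# ORIGINAL has about 2500 heads (1/4 of 10000) and 7500 tails.
n_heads, n_tails_orig = 2500, 7500

# The DATA step keeps a tails observation only when its running count
# is a multiple of 3, i.e. one tails observation in every three.
n_tails_kept = sum(1 for i in range(1, n_tails_orig + 1) if i % 3 == 0)

p_heads_biased = n_heads / (n_heads + n_tails_kept)
print(n_tails_kept, round(p_heads_biased, 3))  # 2500 0.5
```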
In the following code, the ARBORETUM procedure creates a tree from the data set
ORIGINAL and saves three data sets containing results:

ORIGINAL_PATH    description of each path to a leaf
ORIGINAL_NODES   counts and predictions in each leaf
ORIGINAL_SEQ     assessment of each subtree

The KEEP= and RENAME= data set options specify which variables to keep and
what to name them. Selecting and renaming variables now will make it easy to merge
and print results from different runs of the procedure later. The procedure is also run
using the BIASED data set. That code is not shown because it is identical to the code
below except that 'biased' replaces 'original' everywhere.
proc arboretum data=original;
   input x1 x2 / level=binary;
   target heads / level=binary;
   assess;
   save path=original_path
        nodestats=original_nodes(keep=leaf n npriors p_heads1
                                 where=(leaf ne .)
                                 rename=(n=n_original
                                         npriors=np_original
                                         p_heads1=p_original))
        sequence=original_seq(keep=_assess_ rename=(_assess_=original));
run;
The following DATA step creates a variable, PRIOR, equal to the prior probability
of the corresponding value of HEADS.
data priors;
   input heads prior;
   datalines;
0 0.75
1 0.25
;
run;
The ARBORETUM procedure is run a second time on the BIASED data set, this time
with a DECISION statement to include the prior probabilities specified in the PRIOR
variable of the PRIORS data set. The procedure uses the priors to adjust the posterior
probabilities, but not to adjust the overall evaluation of a subtree unless explicitly
requested. The "Incorporating Prior Probabilities in the Tree Assessment" section
on page 66 compares the two approaches using the assessment values saved from
the two ASSESS statements here. The first ASSESS statement computes the subtree