Output 1.4. Subtree Sequences of Misclassification Rates

Obs    original     biased     priors    passess
  1      0.2420    0.48928    0.48928    0.25000
  2      0.2420    0.33279    0.48928    0.25000
  3      0.2162    0.28629    0.28629    0.21579
  4      0.2162    0.28629    0.28629    0.21579

Incorporating Prior Probabilities in the Split Search
The worth of a splitting rule for a categorical target depends on the relative proportions of the different target values within each branch (and within the node being split). When the PRIORSSEARCH option is specified in the PROC ARBORETUM statement, the ARBORETUM procedure uses the prior probabilities to modify the proportions and consequently modify the worth of a splitting rule. This section compares the worth computed for the same split in four situations:
   ORIGINAL   4,946 unbiased observations from the ORIGINAL data
   BIASED     the BIASED subsample, no priors
   PRIORS     the BIASED subsample with prior probabilities
   PSEARCH    the BIASED subsample with the PRIORSSEARCH option
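
This section does not spell out how the proportions are modified. The following DATA step is a minimal sketch, assuming the textbook adjustment in which each class count in a branch is weighted by its prior probability divided by the class frequency in the whole training sample. The counts, the priors, and the names ADJUSTED_PROPORTIONS, P1_RAW, and P1_ADJUSTED are hypothetical and are not taken from the example data; the ARBORETUM procedure may compute its adjustment differently.

```sas
/* Sketch only: counts and priors are hypothetical, not from the   */
/* ORIGINAL or BIASED data sets.                                    */
data adjusted_proportions;
   /* observations with HEADS=1 and HEADS=0 in the whole sample */
   n_all_1 = 2473;
   n_all_0 = 2473;
   /* observations with HEADS=1 and HEADS=0 in one branch */
   n_branch_1 = 1500;
   n_branch_0 = 600;
   /* assumed prior probabilities of HEADS=1 and HEADS=0 */
   prior_1 = 0.25;
   prior_0 = 0.75;

   /* unadjusted proportion of HEADS=1 in the branch */
   p1_raw = n_branch_1 / (n_branch_1 + n_branch_0);

   /* weight each class count by prior / class frequency in the sample */
   w_1 = prior_1 * n_branch_1 / n_all_1;
   w_0 = prior_0 * n_branch_0 / n_all_0;
   p1_adjusted = w_1 / (w_1 + w_0);
run;

proc print data=adjusted_proportions;
   var p1_raw p1_adjusted;
run;
```

With these made-up numbers the proportion of HEADS=1 in the branch drops from about 0.71 to about 0.45, which illustrates why a worth computed from adjusted proportions can differ from one computed from raw counts.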
The worth of a split depends on the number of observations. To make the split worth comparable, the number of observations in the ORIGINAL training data is reduced here to the same number (4,946) as in the BIASED data. All candidate splitting rules are saved in a data set named ORIGINAL_RULES.

   proc arboretum data=original(obs=4946);
      input x1 x2 / level=binary;
      target heads / level=binary;
      assess;
      save rules=original_rules;

The splitting rules for the BIASED and PRIORS runs have already been saved in the data sets BIASED_RULES and PRIORS_RULES, respectively.

The following code uses the PRIORSSEARCH option and saves the splitting rules in the data set PSEARCH_RULES.

   proc arboretum data=biased priorssearch;
      input x1 x2 / level=binary;
      target heads / level=binary;
      decision decdata=priors priorvar=prior;
      assess priors;
      save rules=psearch_rules;

The WORTHDS macro extracts the worth of the best splitting rule on X1 and on X2 in the root node from the data set saved with the RULES= option in the SAVE statement of the ARBORETUM procedure.

   %macro worthds(prefix);
      data &prefix._worth;
         length var $ 2;
         length variable $ 2;
         keep &prefix variable;
         set &prefix._rules end=the_end;
         retain x1 x2 var;
         /* remember which input variable the current root-node rule uses */
         if node eq 1 and stat = 'VARIABLE' then
            var = left(trim(character_value));
         /* store the worth of that rule under the matching variable */
         if node eq 1 and stat = 'WORTH' then do;
            if var eq 'x1' then x1 = numeric_value;
            else x2 = numeric_value;
         end;
         /* after the last rule, write one observation per input variable */
         if the_end ne 0 then do;
            variable = 'x1';
            &prefix = x1;
            output;
            variable = 'x2';
            &prefix = x2;
            output;
         end;
      run;
   %mend;

The following code extracts the split worth from each of four ARBORETUM runs
and combines them into one data set, which is then printed.
%worthds( original);
%worthds( biased);
%worthds( priors);
%worthds( psearch);
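   /* one-to-one reading: each SET contributes its own worth column */
   data worth;
      set original_worth;
      set biased_worth;
      set priors_worth;
      set psearch_worth;
   run;

   proc print data=worth;
   run;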
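
The DATA step above combines the four data sets by one-to-one reading, which pairs observations purely by position. As a hedged alternative, the sketch below matches observations on VARIABLE with a MERGE statement. It assumes that each of the four data sets is already in ascending order of VARIABLE, which holds here because WORTHDS writes x1 before x2. The data set name WORTH_MERGED is hypothetical.

```sas
/* Sketch: match-merge the four worth data sets on VARIABLE.       */
/* Assumes each input is sorted by VARIABLE (x1, then x2).          */
data worth_merged;
   merge original_worth
         biased_worth
         priors_worth
         psearch_worth;
   by variable;
run;
```

Matching on VARIABLE should give the same two observations as the positional approach, while guarding against a data set whose rows happen to be in a different order.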

Output 1.5. Split Worth for Four ARBORETUM Runs

Obs    variable    original     biased     priors    psearch
  1       x1        88.7930    127.151    127.151    93.5635
  2       x2        97.8053    127.467    127.467    93.4600

Output 1.5 shows the worth of the same split on X1 in the four situations, and similarly for X2. X1 and X2 produce splits of nearly equal worth, as expected, except in the truncated ORIGINAL training data, where chance variation in the sample presumably resulted in a better split for X2. With the BIASED training data, only the run with the PRIORSSEARCH option produces a split worth comparable to that of the ORIGINAL data set.
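
One way to make "comparable" concrete is to express each worth as a ratio to the worth from the ORIGINAL run. The following sketch assumes the WORTH data set created earlier; the data set name WORTH_RATIO and the three ratio variables are hypothetical names introduced for illustration.

```sas
/* Sketch: worth of each BIASED-based run relative to ORIGINAL */
data worth_ratio;
   set worth;
   biased_ratio  = biased  / original;
   priors_ratio  = priors  / original;
   psearch_ratio = psearch / original;
run;

proc print data=worth_ratio;
   var variable biased_ratio priors_ratio psearch_ratio;
run;
```

For X1 in Output 1.5, for example, the ratio is about 1.43 for the BIASED and PRIORS runs and about 1.05 for the PSEARCH run.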