The SPLIT Procedure
Example 2: Creating a Decision Tree with an Interval Target
(Baseball Data)
Features
• Specifying the Input Variables and the Target Variable
• Setting the Splitting Criterion
• Setting the P-value Adjustment Method
• Outputting Fit Statistics
• Outputting Leaf Node Statistics
• Outputting Sub-Tree Statistics
• Outputting the Decision Tree Information Data Set
• Scoring Data with the SCORE Statement
• Creating Diagnostic Scatter Plots
This example demonstrates how to create a decision tree for an interval target. The default PROBF splitting criterion is used to
search for and evaluate candidate splitting rules.
The example DMDB training data set SAMPSIO.DMBASE contains performance measures and salary levels for regular hitters
and leading substitute hitters in major league baseball for the year 1986 (Collier 1987). There is one observation per hitter. The
continuous target variable is log of salary (logsalar).
The SAMPSIO.DMTBASE data set is a test data set, which is scored using the scoring formula from the trained model. The
SAMPSIO.DMBASE and SAMPSIO.DMTBASE data sets and the SAMPSIO.DMDBASE training catalog are stored in the
sample library.
Program
proc split data=sampsio.dmdbase
           dmdbcat=sampsio.dmdbase
           criterion=probf      /* splitting criterion */
           padjust=depth        /* p-value adjustment method */
           outmatrix=trtree     /* summary tree statistics */
           outtree=treedata     /* decision tree information data set */
           outleaf=leafdata     /* leaf node statistics */
           outseq=subtree;      /* sub-tree statistics */
   input league division position / level=nominal;
   input no_atbat no_hits no_home no_runs no_rbi no_bb
         yr_major cr_atbat cr_hits cr_home cr_runs cr_rbi cr_bb
         no_outs no_assts no_error / level=interval;
   target logsalar;
   /* Score the test data; rename the prediction and residual columns */
   score data=sampsio.dmtbase nodmdb
         outfit=splfit
         out=splout(rename=(p_logsal=predict r_logsal=residual));
   title 'Decision Tree: Baseball Data';
run;
proc print data=trtree noobs label;
   title2 'Summary Tree Statistics for the Training Data';
run;
proc print data=leafdata noobs label;
   title2 'Leaf Node Summary Statistics';
run;
proc print data=subtree noobs label;
   title2 'Subtree Summary Statistics';
run;
proc print data=splfit noobs label;
   title2 'Summary Statistics for the Scored Test Data';
run;
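proc gplot data=splout;
   /* Symbol and axis definitions shared by both plots */
   symbol c=black i=none v=dot h=3 pct;
   axis1 minor=none color=black width=2.5;
   axis2 minor=none color=black width=2.5;
   /* Actual versus predicted target */
   title2 'Log of Salary versus the Predicted Log of Salary';
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
run;
   /* Residuals versus predicted target, in its own RUN group so that
      this TITLE2 does not also replace the title of the first plot */
   title2 'Plot of the Residuals versus the Predicted Log of Salary';
   plot residual*predict / haxis=axis1 vaxis=axis2;
run;
quit;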
Output
Summary Tree Statistics for the Training Data Set
The OUTMATRIX= data set contains the following summary statistics:
• N - the number of observations in the training data set
• AVERAGE - the target average
• AVE SQ ERR - the average squared prediction error (the sum of squared errors / N)
• R SQUARED - the R-square statistic (1 - AVE SQ ERR / average squared deviation of the target from its average)
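The definitions above can be checked by hand on any scored data set. The following is a minimal sketch, not part of the original example. It assumes that the SPLOUT data set created by the SCORE statement is available in the WORK library, that it contains the target LOGSALAR and the column RESIDUAL (renamed from R_LOGSAL), and that the residual is the target minus the predicted value. Because SPLOUT holds the scored test data, the values it produces are not expected to match the training statistics in the OUTMATRIX= data set.

```sas
proc sql;
   /* Assumes WORK.SPLOUT from the SCORE statement, with the target   */
   /* LOGSALAR and the column RESIDUAL (renamed from R_LOGSAL).       */
   select count(residual)                   as n          label='N',
          mean(logsalar)                    as average    label='AVERAGE',
          mean(residual**2)                 as ave_sq_err label='AVE SQ ERR',
          /* USS = sum of squared errors; CSS = sum of squared         */
          /* deviations of the target from its average                 */
          1 - uss(residual) / css(logsalar) as r_squared  label='R SQUARED'
      from splout
      where residual is not missing;
quit;
```

Dividing the numerator and the denominator of the R-square ratio by N gives the form used in the list above: 1 minus the average squared error over the average squared deviation from the target average.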
Decision Tree: Baseball Data
Summary Tree Statistics for the Training Data

STATISTIC     ==> AVE
N             163.000
AVERAGE         5.956
AVE SQ ERR      0.062
R SQUARED       0.920
Leaf Node Summary Statistics for the Training Data Set
The OUTLEAF= data set contains the following statistics:
• LEAF ID - the leaf identification number
• N - the number of observations in each leaf node
• AVERAGE - the target average for each leaf node
• ROOT ASE - the root average squared error for each leaf node
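A comparable per-leaf summary can be sketched for the scored test data. This is an illustration under stated assumptions, not the procedure's own computation: it assumes the SPLOUT data set from the SCORE statement with the renamed PREDICT and RESIDUAL columns, and it treats each distinct predicted value as one leaf, which holds only if no two leaves share the same target average. The grouping value stands in for the leaf ID, which this sketch does not recover.

```sas
proc sql;
   /* Assumption: every observation in a leaf receives the leaf's     */
   /* training average as its prediction, so grouping by PREDICT      */
   /* groups the test observations by leaf.                           */
   select predict                          label='Leaf prediction',
          count(*)                as n        label='N',
          mean(logsalar)          as average  label='AVERAGE',
          sqrt(mean(residual**2)) as root_ase label='ROOT ASE'
      from splout
      group by predict
      order by predict;
quit;
```

Comparing the test AVERAGE in each group with the leaf prediction gives a rough view of how well each leaf generalizes beyond the training data.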
Decision Tree: Baseball Data
Leaf Node Summary Statistics

LEAF
 ID     N   AVERAGE        ROOT ASE
  8     9   4.2885299792   0.0810310161
 16     9   4.581814362    0.0861344155
 17     1   5.1647859739   0
 36    14   5.0581554082   0.1033292134
 37     1   4.6539603502   0
 29     1   4.3820266347   0
 19     2   5.5274252033   0.0893458944
 20     7   5.5534989096   0.120153823
 21     5   5.200846713    0.1327198925
 38     4   6.0965134867   0.0684629543
 39     7   5.7171876218   0.1860772093
 40     4   5.9148579892   0.1708160326
 41     8   6.3897459513   0.150561041
 23    13   6.4853119091   0.3839130342
 13    16   5.7881752468   0.2598459864
 32     6   5.902511632    0.2696864743
 33     9   6.4895454866   0.1092807452
 25     1   4.4998096703   0
 34    27   6.6125466821   0.3489469023
 35     2   7.6883712553   0.1000475779