The ARBORETUM Procedure




The SPLIT Procedure

Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)

Features

- Specifying the Input Variables and the Target Variable
- Setting the Splitting Criterion
- Setting the P-value Adjustment Method
- Outputting Fit Statistics
- Outputting Leaf Node Statistics
- Outputting Sub-Tree Statistics
- Outputting the Decision Tree Information Data Set
- Scoring Data with the Score Statement
- Creating Diagnostic Scatter Plots

This example demonstrates how to create a decision tree for an interval target. The default PROBF splitting criterion is used to search for and evaluate candidate splitting rules.
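For intuition, with an interval target the PROBF worth of a candidate split corresponds to the p-value of a one-way ANOVA F test that compares the target means of the proposed branches; smaller p-values indicate better splits. The sketch below only illustrates that idea on made-up data (the BRANCH variable and its values are hypothetical) and is not part of the example program, nor a reproduction of the procedure's internal search.

/* Illustration only: the ANOVA F-test p-value for a hypothetical
   two-branch candidate split of an interval target.              */
data candidate;
   input branch $ logsalar @@;
   datalines;
A 5.1  A 5.4  A 5.0  B 6.2  B 6.5  B 6.1
;
run;

proc glm data=candidate;
   class branch;
   model logsalar = branch;   /* smaller Pr > F suggests a better split */
run;
quit;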

The example DMDB training data set SAMPSIO.DMDBASE contains performance measures and salary levels for regular hitters and leading substitute hitters in major league baseball for the year 1986 (Collier 1987). There is one observation per hitter. The continuous target variable is the log of salary (logsalar).

The SAMPSIO.DMTBASE data set is a test data set, which is scored using the scoring formula from the trained model. The SAMPSIO.DMDBASE and SAMPSIO.DMTBASE data sets and the SAMPSIO.DMDBASE training catalog are stored in the sample library.
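Before running the example, it can help to list the variables in the training data set. The short step below is not part of the original program; it only assumes that the SAMPSIO library is assigned, as in a standard installation.

/* Optional: list the variables in the DMDB training data set. */
proc contents data=sampsio.dmdbase;
run;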

Program

 

proc split data=sampsio.dmdbase      /* DMDB-encoded training data set      */
           dmdbcat=sampsio.dmdbase   /* DMDB training catalog               */
           criterion=probf           /* F-test p-value splitting criterion  */
           padjust=depth             /* depth adjustment of the p-values    */
           outmatrix=trtree          /* summary fit statistics              */
           outtree=treedata          /* decision tree information data set  */
           outleaf=leafdata          /* leaf node statistics                */
           outseq=subtree;           /* sub-tree statistics                 */
   input league division position / level=nominal;
   input no_atbat no_hits no_home no_runs no_rbi no_bb
         yr_major cr_atbat cr_hits cr_home cr_runs cr_rbi cr_bb
         no_outs no_assts no_error / level=interval;
   target logsalar;
   score data=sampsio.dmtbase nodmdb  /* score the test data set            */
         outfit=splfit                /* fit statistics for the scored data */
         out=splout(rename=(p_logsal=predict r_logsal=residual));
   title 'Decision Tree: Baseball Data';
run;

 

proc print data=trtree noobs label;
   title2 'Summary Tree Statistics for the Training Data';
run;

proc print data=leafdata noobs label;
   title2 'Leaf Node Summary Statistics';
run;

proc print data=subtree noobs label;
   title2 'Subtree Summary Statistics';
run;

proc print data=splfit noobs label;
   title2 'Summary Statistics for the Scored Test Data';
run;


 

proc gplot data=splout;
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
   symbol c=black i=none v=dot h=3 pct;
   axis1 minor=none color=black width=2.5;
   axis2 minor=none color=black width=2.5;
   title2 'Log of Salary versus the Predicted Log of Salary';
run;

   plot residual*predict / haxis=axis1 vaxis=axis2;
   title2 'Plot of the Residuals versus the Predicted Log of Salary';
run;
quit;

Output


Summary Tree Statistics for the Training Data Set

The OUTMATRIX= data set contains the following summary statistics:

- N - the number of observations in the training data set
- AVERAGE - the target average
- AVE SQ ERR - the average squared prediction error (the sum of squared errors divided by N)
- R SQUARED - the R-square statistic: 1 minus AVE SQ ERR divided by the average squared deviation of the target from its overall average (equivalently, 1 - SSE/SST); a cross-check sketch follows the output listing below


 

                         Decision Tree: Baseball Data

                 Summary Tree Statistics for the Training Data

                             STATISTIC     ==> AVE

                             N             163.000

                             AVERAGE         5.956

                             AVE SQ ERR      0.062

                             R SQUARED       0.920
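The same two quantities can be recomputed for the scored test data from the SPLOUT residuals. The sketch below is a cross-check rather than part of the original example; it relies on the RENAME= option in the SCORE statement above, so RESIDUAL holds the target minus its prediction, and it assumes logsalar is carried into the scored data set.

/* Cross-check: average squared error and R-square for the scored test data. */
proc sql;
   select mean(residual**2)                                as ave_sq_err,
          1 - mean(residual**2) / (css(logsalar)/count(*)) as r_squared
   from splout;
quit;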



Leaf Node Summary Statistics for the Training Data Set

The OUTLEAF= data set contains the following statistics:

- the LEAF ID number
- N, the number of observations in each leaf node
- the target AVERAGE for each leaf node
- the root average squared error (ROOT ASE) for each leaf node (see the worked sketch after the table below)


 

                         Decision Tree: Baseball Data

                          Leaf Node Summary Statistics

                LEAF

                  ID           N         AVERAGE        ROOT ASE

                   8           9    4.2885299792    0.0810310161

                  16           9     4.581814362    0.0861344155

                  17           1    5.1647859739               0

                  36          14    5.0581554082    0.1033292134

                  37           1    4.6539603502               0

                  29           1    4.3820266347               0

                  19           2    5.5274252033    0.0893458944

                  20           7    5.5534989096     0.120153823

                  21           5     5.200846713    0.1327198925

                  38           4    6.0965134867    0.0684629543

                  39           7    5.7171876218    0.1860772093

                  40           4    5.9148579892    0.1708160326

                  41           8    6.3897459513     0.150561041

                  23          13    6.4853119091    0.3839130342

                  13          16    5.7881752468    0.2598459864

                  32           6     5.902511632    0.2696864743

                  33           9    6.4895454866    0.1092807452

                  25           1    4.4998096703               0

                  34          27    6.6125466821    0.3489469023

                  35           2    7.6883712553    0.1000475779
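ROOT ASE for a leaf is the square root of the average squared deviation of the target from that leaf's AVERAGE. As a worked illustration, the sketch below uses two hypothetical log-salaries (the individual observations are not listed in the output) chosen so that their average and ROOT ASE roughly match leaf 35 above.

/* Worked illustration of ROOT ASE with hypothetical values for a
   two-observation leaf (compare leaf 35: N=2, AVERAGE 7.688, ROOT ASE 0.100). */
data _null_;
   y1 = 7.588;
   y2 = 7.788;
   avg      = mean(y1, y2);
   root_ase = sqrt( ((y1 - avg)**2 + (y2 - avg)**2) / 2 );
   put avg= root_ase=;
run;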


