 # The arboretum procedure

## The SPLIT Procedure, Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)

### Features

- Specifying the Input Variables and the Target Variable
- Setting the Splitting Criterion
- Outputting Fit Statistics
- Outputting Leaf Node Statistics
- Outputting Sub-Tree Statistics
- Outputting the Decision Tree Information Data Set
- Scoring Data with the Score Statement
- Creating Diagnostic Scatter Plots

This example demonstrates how to create a decision tree for an interval target. The default PROBF splitting criterion is used to search for and evaluate candidate splitting rules.

The example DMDB training data set SAMPSIO.DMBASE contains performance measures and salary levels for regular hitters and leading substitute hitters in major league baseball for the year 1986 (Collier 1987). There is one observation per hitter. The continuous target variable is the log of salary (logsalar). The SAMPSIO.DMTBASE data set is a test data set, which is scored using the scoring formula from the trained model. The SAMPSIO.DMBASE and SAMPSIO.DMTBASE data sets and the SAMPSIO.DMDBASE training catalog are stored in the sample library.
### Program

```sas
/* Train the tree and write the summary, tree, leaf, and subtree data sets */
proc split data=sampsio.dmdbase
           dmdbcat=sampsio.dmdbase
           criterion=probf
           padjust=depth
           outmatrix=trtree
           outtree=treedata
           outleaf=leafdata
           outseq=subtree;

   input league division position / level=nominal;
   input no_atbat no_hits no_home no_runs no_rbi no_bb
         yr_major cr_atbat cr_hits cr_home cr_runs cr_rbi cr_bb
         no_outs no_assts no_error / level=interval;
   target logsalar;   /* interval target: log of salary */

   /* Score the test data; rename the predicted and residual columns */
   score data=sampsio.dmtbase nodmdb
         outfit=splfit
         out=splout(rename=(p_logsal=predict r_logsal=residual));
   title 'Decision Tree: Baseball Data';
run;

proc print data=trtree noobs label;
   title2 'Summary Tree Statistics for the Training Data';
run;

proc print data=leafdata noobs label;
   title2 'Leaf Node Summary Statistics';
run;

proc print data=subtree noobs label;
   title2 'Subtree Summary Statistics';
run;

proc print data=splfit noobs label;
   title2 'Summary Statistics for the Scored Test Data';
run;

/* Diagnostic scatter plots for the scored test data */
proc gplot data=splout;
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
   symbol c=black i=none v=dot h=3 pct;
   axis1 minor=none color=black width=2.5;
   axis2 minor=none color=black width=2.5;
   title2 'Log of Salary versus the Predicted Log of Salary';
   plot residual*predict / haxis=axis1 vaxis=axis2;
   title2 'Plot of the Residuals versus the Predicted Log of Salary';
run;
quit;
```

### Output

#### Summary Tree Statistics for the Training Data Set

The OUTMATRIX= data set contains the following summary statistics:

- N: the number of observations in the training data set
- AVERAGE: the target average
- AVERAGE SQ ERR: the average squared prediction error (the sum of squared errors / n), shown as AVE SQ ERR in the listing
- R SQUARED: the R-square statistic (1 - AVERAGE SQ ERR / sum of squares from the average)

Decision Tree: Baseball Data
Summary Tree Statistics for the Training Data

| STATISTIC  | ==> AVE |
|------------|--------:|
| N          | 163.000 |
| AVERAGE    |   5.956 |
| AVE SQ ERR |   0.062 |
| R SQUARED  |   0.920 |

#### Leaf Node Summary Statistics for the Training Data Set

The OUTLEAF= data set contains the following statistics:

- Leaf ID number
- N, the number of observations in each leaf node
- The target AVERAGE for each leaf node
- The root average squared error (ROOT ASE) for each leaf node

Decision Tree: Baseball Data

| LEAF ID | N  | AVERAGE      | ROOT ASE     |
|--------:|---:|-------------:|-------------:|
| 8       | 9  | 4.2885299792 | 0.0810310161 |
| 16      | 9  | 4.581814362  | 0.0861344155 |
| 17      | 1  | 5.1647859739 | 0            |
| 36      | 14 | 5.0581554082 | 0.1033292134 |
| 37      | 1  | 4.6539603502 | 0            |
| 29      | 1  | 4.3820266347 | 0            |
| 19      | 2  | 5.5274252033 | 0.0893458944 |
| 20      | 7  | 5.5534989096 | 0.120153823  |
| 21      | 5  | 5.200846713  | 0.1327198925 |
| 38      | 4  | 6.0965134867 | 0.0684629543 |
| 39      | 7  | 5.7171876218 | 0.1860772093 |
| 40      | 4  | 5.9148579892 | 0.1708160326 |
| 41      | 8  | 6.3897459513 | 0.150561041  |
| 23      | 13 | 6.4853119091 | 0.3839130342 |
| 13      | 16 | 5.7881752468 | 0.2598459864 |
| 32      | 6  | 5.902511632  | 0.2696864743 |
| 33      | 9  | 6.4895454866 | 0.1092807452 |
| 25      | 1  | 4.4998096703 | 0            |
| 34      | 27 | 6.6125466821 | 0.3489469023 |
| 35      | 2  | 7.6883712553 | 0.1000475779 |
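
The fit statistics described above can be written out explicitly. What follows is a sketch under stated assumptions, not the procedure's documented formulas. The symbols are introduced here for illustration only: $y_i$ is the observed log salary, $\hat{y}_i$ is the tree's prediction for observation $i$ (assumed to be the training average of the leaf it falls in), $\bar{y}$ is the overall target average, and $n_\ell$ and $\bar{y}_\ell$ are the size and target average of leaf $\ell$. The source wording for R SQUARED ("sum of squares from the average") is ambiguous; the denominator is read here as the average squared deviation from the target mean, so that the ratio is dimensionless.

```latex
\begin{aligned}
\text{ASE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \\[4pt]
R^2 &= 1 - \frac{\text{ASE}}{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
     = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}, \\[4pt]
\text{ROOT ASE}_\ell &= \sqrt{\frac{1}{n_\ell}\sum_{i \in \ell}\left(y_i - \bar{y}_\ell\right)^2}.
\end{aligned}
```

Under this reading, the reported values ASE = 0.062 and R SQUARED = 0.920 imply an average squared deviation from the mean of about 0.062 / (1 - 0.920) ≈ 0.78, allowing for rounding in the listing. The per-leaf formula also agrees with the leaf table in one visible respect: leaves 17, 37, 29, and 25 each hold a single observation, so the leaf average equals that observation and ROOT ASE is 0.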
