The arboretum procedure

Date: 30.04.2018

The SPLIT Procedure

Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)

Features

Specifying the Input Variables and the Target Variable

Setting the Splitting Criterion

Setting the P-value Adjustment Method

Outputting Fit Statistics

Outputting Leaf Node Statistics

Outputting Sub-Tree Statistics

Outputting the Decision Tree Information Data Set

Scoring Data with the Score Statement

Creating Diagnostic Scatter Plots

This example demonstrates how to create a decision tree for an interval target. The default PROBF splitting criterion is used to search for and evaluate candidate splitting rules.
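As a rough illustration of what an F-test splitting criterion evaluates, the block below writes out the usual one-way ANOVA F statistic for a candidate split of a node into B branches. This assumes that PROBF refers to the p-value of this standard F test; the symbols (n, B, n_b, and the branch and node means) are notation introduced here and are not taken from the procedure documentation.

```latex
\[
F \;=\; \frac{\displaystyle \sum_{b=1}^{B} n_b\,(\bar{y}_b - \bar{y})^2 \,/\, (B - 1)}
             {\displaystyle \sum_{b=1}^{B} \sum_{i \in b} (y_i - \bar{y}_b)^2 \,/\, (n - B)},
\qquad
p \;=\; \Pr\!\left( F_{B-1,\; n-B} \,\ge\, F \right)
\]
```

Here n is the number of observations in the node, n_b is the number in branch b, and the bars denote the node and branch means of the target. Under this reading, a smaller p-value indicates a split that separates the target means more clearly, and the PADJUST= option then adjusts that p-value before candidate rules are compared.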

The example DMDB training data set SAMPSIO.DMBASE contains performance measures and salary levels for regular hitters and leading substitute hitters in major league baseball for the year 1986 (Collier 1987). There is one observation per hitter. The continuous target variable is log of salary (logsalar).

The SAMPSIO.DMTBASE data set is a test data set, which is scored using the scoring formula from the trained model. The SAMPSIO.DMBASE and SAMPSIO.DMTBASE data sets and the SAMPSIO.DMDBASE training catalog are stored in the sample library.

Program

/* Train the tree, write the output data sets, and score the test data */
proc split data=sampsio.dmdbase
           dmdbcat=sampsio.dmdbase
           criterion=probf
           padjust=depth
           outmatrix=trtree
           outtree=treedata
           outleaf=leafdata
           outseq=subtree;
   input league division position / level=nominal;
   input no_atbat no_hits no_home no_runs no_rbi no_bb
         yr_major cr_atbat cr_hits cr_home cr_runs cr_rbi cr_bb
         no_outs no_assts no_error / level=interval;
   target logsalar;
   score data=sampsio.dmtbase nodmdb
         outfit=splfit
         out=splout(rename=(p_logsal=predict r_logsal=residual));
   title 'Decision Tree: Baseball Data';
run;

/* Print the summary, leaf, subtree, and test-data fit statistics */
proc print data=trtree noobs label;
   title2 'Summary Tree Statistics for the Training Data';
run;

proc print data=leafdata noobs label;
   title2 'Leaf Node Summary Statistics';
run;

proc print data=subtree noobs label;
   title2 'Subtree Summary Statistics';
run;

proc print data=splfit noobs label;
   title2 'Summary Statistics for the Scored Test Data';
run;

/* Diagnostic scatter plots; each plot gets its own RUN group
   so that each TITLE2 applies only to its own plot */
proc gplot data=splout;
   symbol c=black i=none v=dot h=3 pct;
   axis1 minor=none color=black width=2.5;
   axis2 minor=none color=black width=2.5;
   title2 'Log of Salary versus the Predicted Log of Salary';
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
run;
   title2 'Plot of the Residuals versus the Predicted Log of Salary';
   plot residual*predict / haxis=axis1 vaxis=axis2;
run;
quit;
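As a cross-check on the OUTFIT= statistics, the summary measures can also be recomputed directly from the scored data. The following is a minimal sketch, not part of the original example. It assumes that the SPLOUT data set exists as created by the SCORE statement above and contains LOGSALAR and the renamed RESIDUAL variable. The column names n, average, ave_sq_err, and r_squared are illustrative choices, and the R-square here is computed as 1 - SSE/SST, which may differ from the procedure's own definition.

```sas
/* Sketch: recompute fit statistics for the scored test data.     */
/* Assumes SPLOUT has LOGSALAR (target) and RESIDUAL (renamed     */
/* from R_LOGSAL); the output column names are hypothetical.      */
title2 'Recomputed Fit Statistics for the Scored Test Data';
proc sql;
   select n(residual)                        as n,
          mean(logsalar)                     as average,
          uss(residual) / n(residual)        as ave_sq_err,
          1 - uss(residual) / css(logsalar)  as r_squared
   from splout
   where residual is not missing;
quit;
```

USS returns the uncorrected sum of squares of the residuals, and CSS returns the sum of squared deviations of the target from its mean, so the ratio of the two is the unexplained share of the target's variation.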

Output

Summary Tree Statistics for the Training Data Set

The OUTMATRIX= data set contains the following summary statistics:

N - the number of observations in the training data set

AVERAGE - the target average

AVERAGE SQ ERR - the average squared prediction error (the sum of squared errors / n)

R SQUARED - the R-square statistic (1 - AVERAGE SQ ERR / sum of squares from the average)
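The two definitions above can be written out as follows. For the ratio to be scale-free, this sketch takes the "sum of squares from the average" to be divided by n as well, which is an assumption rather than something the text states. The symbols y_i (observed target), \hat{y}_i (predicted target), and \bar{y} (target average) are notation introduced here.

```latex
\[
\mathrm{ASE} \;=\; \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
R^2 \;=\; 1 \;-\; \frac{\mathrm{ASE}}{\dfrac{1}{n} \displaystyle\sum_{i=1}^{n} (y_i - \bar{y})^2}
\;=\; 1 \;-\; \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
\]
```

With the rounded values reported for this example (AVE SQ ERR 0.062 and R SQUARED 0.920), the implied average squared deviation of the target from its mean is about 0.062 / 0.080, or roughly 0.78.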

Decision Tree: Baseball Data

Summary Tree Statistics for the Training Data

STATISTIC     ==> AVE
N             163.000
AVERAGE         5.956
AVE SQ ERR      0.062
R SQUARED       0.920

Leaf Node Summary Statistics for the Training Data Set

The OUTLEAF= data set contains the following statistics:

LEAF ID - the leaf identification number

N - the number of observations in each leaf node

AVERAGE - the target average for each leaf node

ROOT ASE - the root average squared error for each leaf node
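For a single leaf, the root average squared error can be sketched as below. This assumes that the leaf's prediction is its own target average, which is the usual convention for a regression tree. The symbols n_\ell (leaf count) and \bar{y}_\ell (leaf average) are notation introduced here.

```latex
\[
\mathrm{ROOT\ ASE}_\ell \;=\; \sqrt{ \frac{1}{n_\ell} \sum_{i \in \ell} \left( y_i - \bar{y}_\ell \right)^2 }
\]
```

Under this definition a leaf with a single observation has a root ASE of exactly 0, which matches the leaves with N equal to 1 in the listing for this example.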

Decision Tree: Baseball Data

Leaf Node Summary Statistics

LEAF ID    N     AVERAGE         ROOT ASE
8          9     4.2885299792    0.0810310161
16         9     4.581814362     0.0861344155
17         1     5.1647859739    0
36         14    5.0581554082    0.1033292134
37         1     4.6539603502    0
29         1     4.3820266347    0
19         2     5.5274252033    0.0893458944
20         7     5.5534989096    0.120153823
21         5     5.200846713     0.1327198925
38         4     6.0965134867    0.0684629543
39         7     5.7171876218    0.1860772093
40         4     5.9148579892    0.1708160326
41         8     6.3897459513    0.150561041
23         13    6.4853119091    0.3839130342
13         16    5.7881752468    0.2598459864
32         6     5.902511632     0.2696864743
33         9     6.4895454866    0.1092807452
25         1     4.4998096703    0
34         27    6.6125466821    0.3489469023
35         2     7.6883712553    0.1000475779
