The arboretum procedure




Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.


 

The PROC SPLIT statement invokes the procedure. The DATA= option identifies
the training data set that is used to fit the model. The DMDBCAT= option identifies
the training data catalog.

proc split data=sampsio.dmdbase
           dmdbcat=sampsio.dmdbase




 

The CRITERION= option specifies PROBF as the method of searching for and evaluating
candidate splitting rules. For interval targets, PROBF (the p-value of the F test
associated with node variance) is the default method.

           criterion=probf



 

The PADJUST= option specifies the DEPTH method for adjusting p-values. DEPTH adjusts
for the number of ancestor splits.

           padjust=depth




 

The OUTMATRIX= option names the output data set that contains tree summary
statistics for the training data.

           outmatrix=trtree




 

The OUTTREE= option names the data set that contains tree information.
You can use the INTREE= option to read the OUTTREE= data set in a subsequent
execution of PROC SPLIT.

           outtree=treedata



 

The OUTLEAF= option names the data set that contains statistics for
each leaf node.

           outleaf=leafdata




 

The OUTSEQ= option names the data set that contains subtree statistics.

           outseq=subtree;



 

Each INPUT statement specifies a set of input variables that have the
same measurement level. The LEVEL= option identifies the measurement level
of each input set.

   input league division position / level=nominal;
   input no_atbat no_hits no_home no_runs no_rbi no_bb
     yr_major cr_atbat cr_hits cr_home cr_runs cr_rbi cr_bb
     no_outs no_assts no_error / level=interval;




 

The TARGET statement specifies the target (response) variable.

   target logsalar;



 

The SCORE statement specifies the data set that you want to score in
conjunction with training. The DATA= option identifies the score data set.

   score data=sampsio.dmtbase nodmdb




 

The OUTFIT= option names the output data set that contains goodness-of-fit
statistics for the scored data set. The OUT= option names the output data set
that contains the scored observations, including the predicted and residual
values. The RENAME= data set option renames P_LOGSAL and R_LOGSAL to PREDICT
and RESIDUAL.

    outfit=splfit
    out=splout(rename=(p_logsal=predict r_logsal=residual));
    title 'Decision Tree: Baseball Data';
run;
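To make the CRITERION=PROBF and PADJUST=DEPTH descriptions above more concrete, the following is a minimal sketch of how an F-test p-value for one candidate split could be computed by hand and then adjusted for depth. It is not the procedure's internal code. The data set names (toy, branch_stats, split_test), the toy values, the depth of 2, and the Bonferroni-style multiplier 2**depth for binary splits are all assumptions made for illustration. PROBF here is the Base SAS function that returns the F distribution's cumulative probability.

```sas
/* Hypothetical toy data: target values y in two candidate branches. */
data toy;
   input branch y @@;
   datalines;
1 4.1  1 4.5  1 3.9  1 4.3
2 6.0  2 5.6  2 6.4  2 5.8
;
run;

/* Per-branch count, mean, and within-branch corrected sum of squares. */
proc sql;
   create table branch_stats as
   select branch,
          count(y) as n,
          mean(y)  as ybar,
          css(y)   as ss_within
   from toy
   group by branch;
quit;

/* Accumulate over branches, then form the F statistic on the last row. */
data split_test;
   set branch_stats end=last;
   n_tot   + n;              /* total number of observations        */
   y_tot   + n*ybar;         /* total of the target values          */
   ssw     + ss_within;      /* pooled within-branch sum of squares */
   ssb_raw + n*ybar**2;      /* sum of n_i * mean_i**2              */
   b       + 1;              /* number of branches                  */
   if last then do;
      ssb   = ssb_raw - y_tot**2 / n_tot;       /* between-branch SS     */
      f     = (ssb/(b-1)) / (ssw/(n_tot-b));    /* one-way F statistic   */
      p     = 1 - probf(f, b-1, n_tot-b);       /* upper-tail p-value    */
      depth = 2;                                /* assumed: 2 ancestors  */
      p_adj = min(1, p * 2**depth);             /* assumed adjustment    */
      output;
   end;
   keep f p depth p_adj;
run;

proc print data=split_test noobs;
run;
```

Under this assumed adjustment, a split deeper in the tree needs a smaller raw p-value to look as significant as one near the root, which is the general idea behind adjusting for the number of ancestor splits.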



 

PROC PRINT lists summary tree statistics for the training data set.

proc print data=trtree noobs label;
   title2 'Summary Tree Statistics for the Training Data';
run;



 

PROC PRINT lists summary statistics for each leaf node.

proc print data=leafdata noobs label;
   title2 'Leaf Node Summary Statistics';
run;



 

PROC PRINT lists summary statistics for each subtree in the subtree
sequence.

proc print data=subtree noobs label;
   title2 'Subtree Summary Statistics';
run;



 

PROC PRINT lists fit statistics for the scored test data set.

proc print data=splfit noobs label;
   title2 'Summary Statistics for the Scored Test Data';
run;
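As a rough cross-check on the printed fit statistics, the average squared residual can be recomputed from the scored OUT= data set. The sketch below assumes only that SPLOUT contains the RESIDUAL variable created by the RENAME= option; the names resid_check and sq_resid are hypothetical. The text does not say which OUTFIT= column, if any, this value should match.

```sas
/* Square each residual in the scored data set. */
data resid_check;
   set splout;
   sq_resid = residual**2;   /* squared residual for one scored observation */
run;

/* The mean of the squared residuals is the average squared error. */
proc means data=resid_check n mean;
   var sq_resid;
   title2 'Average Squared Error Recomputed from the Scored Data';
run;
```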



 

PROC GPLOT produces diagnostic plots for the scored test data set. The
first PLOT statement creates a scatter plot of the target values versus the
predicted values of the target. The second PLOT statement creates a scatter
plot of the residual values versus the predicted values of the target.

proc gplot data=splout;
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
   symbol c=black i=none v=dot h=3 pct;
   axis1 minor=none color=black width=2.5;
   axis2 minor=none color=black width=2.5;
   title2 'Log of Salary versus the Predicted Log of Salary';
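The listing stops here, before the second PLOT statement described above. A minimal sketch of the residual plot follows, written as a separate PROC GPLOT step; the new PROC statement also ends the preceding step. It assumes that the SYMBOL, AXIS1, and AXIS2 definitions above are still in effect. The title text is illustrative and is not taken from the original program.

```sas
/* Residuals versus predicted values for the scored test data. */
proc gplot data=splout;
   plot residual*predict / haxis=axis1 vaxis=axis2 frame;
   title2 'Residuals versus the Predicted Log of Salary';  /* illustrative title */
run;
quit;
```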



