The arboretum procedure

Date: 30.04.2018
The PROC SPLIT statement invokes the procedure. The DATA= option identifies the training data set that is used to fit the model. The DMDBCAT= option identifies the training data catalog.

proc split data=sampsio.dmdbase

dmdbcat=sampsio.dmdbase

The CRITERION= option specifies the PROBF method of searching for and evaluating candidate splitting rules. For interval targets, PROBF is the default method (the p-value of the F test associated with node variance).

criterion=probf

The PADJUST= option specifies the DEPTH method for adjusting p-values. DEPTH adjusts for the number of ancestor splits.

padjust=depth

The OUTMATRIX= option names the output data set that contains tree summary statistics for the training data.

outmatrix=trtree

The OUTTREE= option names the data set that contains tree information. You can use the INTREE= option to read the OUTTREE= data set in a subsequent execution of PROC SPLIT.

outtree=treedata

The OUTLEAF= option names the data set that contains statistics for each leaf node.

outleaf=leafdata

The OUTSEQ= option names the data set that contains subtree statistics.

outseq=subtree;
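The CRITERION=PROBF and PADJUST=DEPTH settings in the PROC SPLIT statement above can be illustrated with a small stand-alone program. The sketch below is not the procedure's internal algorithm. It computes an ordinary one-way ANOVA F test for one hypothetical binary split of a toy interval target, using the DATA step function PROBF, and then applies a depth adjustment. The toy data, the data set names toy, stats, and ftest, and in particular the multiplier 2**depth (a Bonferroni-style factor for a binary tree) are assumptions made for illustration. The text here says only that DEPTH adjusts for the number of ancestor splits.

```sas
/* Toy data (hypothetical): interval target y and a candidate binary
   split that assigns each observation to branch 1 or branch 2. */
data toy;
   input branch y @@;
   datalines;
1 4.1  1 4.4  1 3.9  1 4.6
2 5.8  2 6.1  2 5.5  2 6.4
;
run;

/* Per-branch count, mean, and corrected sum of squares of the target. */
proc means data=toy noprint nway;
   class branch;
   var y;
   output out=stats n=n_obs mean=y_mean css=y_css;
run;

/* One-way ANOVA F test for the split, followed by the depth adjustment. */
data ftest;
   set stats end=last;
   k + 1;                          /* number of branches            */
   n_tot + n_obs;                  /* total number of observations  */
   sum_y + n_obs * y_mean;         /* sum of the target values      */
   sum_nm2 + n_obs * y_mean**2;    /* sum of n * mean squared       */
   ss_within + y_css;              /* within-branch sum of squares  */
   if last then do;
      ss_between = sum_nm2 - sum_y**2 / n_tot;
      df_between = k - 1;
      df_within  = n_tot - k;
      f_value    = (ss_between / df_between) / (ss_within / df_within);
      /* Upper-tail probability of the F distribution. */
      p_raw      = 1 - probf(f_value, df_between, df_within);
      /* ASSUMPTION: multiplier 2**depth for a binary tree, capped at 1. */
      do depth = 0 to 3;
         p_adjusted = min(1, p_raw * 2**depth);
         output;
      end;
   end;
   keep f_value df_between df_within p_raw depth p_adjusted;
run;

proc print data=ftest noobs;
run;
```

Under this assumed multiplier, a split at the root (depth 0) keeps its raw p-value, while a candidate split deeper in the tree needs stronger evidence to reach the same adjusted value.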

Each INPUT statement specifies a set of input variables that have the same measurement level. The LEVEL= option identifies the measurement level of each input set.

input league division position / level=nominal;

input no_atbat no_hits no_home no_runs no_rbi no_bb
      yr_major cr_atbat cr_hits cr_home cr_runs cr_rbi cr_bb
      no_outs no_assts no_error / level=interval;

The TARGET statement specifies the target (response) variable.

target logsalar;

The SCORE statement specifies the data set that you want to score in conjunction with training. The DATA= option identifies the score data set.

score data=sampsio.dmtbase nodmdb

The OUTFIT= option names the output data set that contains goodness-of-fit statistics for the scored data set. The OUT= data set contains summary statistics for the scored data set, such as predicted and residual values.

outfit=splfit

out=splout(rename=(p_logsal=predict r_logsal=residual));

title 'Decision Tree: Baseball Data';

run;
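The INTREE= option mentioned with OUTTREE= suggests a second run that reuses the stored tree instead of refitting it. The following is a minimal sketch of such a run, assuming that INTREE= is given on the PROC SPLIT statement and that a SCORE statement is all that such a run needs. The output data set name rescored is hypothetical, and the step has not been checked against the procedure's full syntax.

```sas
/* Hypothetical second run: read the tree saved by OUTTREE=treedata
   and score the test data with it, without refitting. */
proc split intree=treedata;
   score data=sampsio.dmtbase nodmdb
         out=rescored;   /* "rescored" is an illustrative name */
run;
```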

PROC PRINT lists summary tree statistics for the training data set.

proc print data=trtree noobs label;
   title2 'Summary Tree Statistics for the Training Data';
run;

PROC PRINT lists summary statistics for each leaf node.

proc print data=leafdata noobs label;
   title2 'Leaf Node Summary Statistics';
run;

PROC PRINT lists summary statistics for each subtree in the subtree sequence.

proc print data=subtree noobs label;
   title2 'Subtree Summary Statistics';
run;

PROC PRINT lists fit statistics for the scored test data set.

proc print data=splfit noobs label;
   title2 'Summary Statistics for the Scored Test Data';
run;
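As a cross-check on the printed fit statistics, the residuals in the OUT= data set can be summarized directly. The sketch below assumes that splout holds one row per scored observation and that the variable residual exists as a result of the RENAME= option shown earlier. Which statistics the OUTFIT= data set reports is not listed here, so the values below may or may not match a column in splfit.

```sas
/* Distribution of the residuals in the scored data set. */
proc means data=splout n mean std min max;
   var residual;
run;

/* Root mean squared error computed directly from the residuals. */
proc sql;
   select sqrt(mean(residual**2)) as rmse
   from splout;
quit;
```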

PROC GPLOT produces diagnostic plots for the scored test data set. The first PLOT statement creates a scatter plot of the target values versus the predicted values of the target. The second PLOT statement creates a scatter plot of the residual values versus the predicted values of the target.

proc gplot data=splout;
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
   symbol c=black i=none v=dot h=3 pct;
   axis1 minor=none color=black width=2.5;
   axis2 minor=none color=black width=2.5;
   title2 'Log of Salary versus the Predicted Log of Salary';
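The listing stops before the second PLOT statement described above. The block below is a sketch of how that residual plot might be written, not the original code. It assumes that the variables residual and predict come from the RENAME= option on the OUT= data set, and the title text is illustrative. The new PROC statement also ends the preceding step. Because SYMBOL and AXIS are global statements, the definitions given earlier remain in effect.

```sas
/* Sketch of the second diagnostic plot: residuals versus predictions. */
proc gplot data=splout;
   plot residual*predict / haxis=axis1 vaxis=axis2 frame;
   title2 'Residuals versus the Predicted Log of Salary';  /* illustrative title */
run;
quit;
```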
