The arboretum procedure



/* List the selection summary and fit statistics stored in the OUTEST= data set */
proc print data=regest noobs label;
   var _step_ _chosen_  _sbc_  _mse_  _averr_ _tmse_ _taverr_;
   where _type_ = 'PARMS';
   title 'Partial Listing of the OUTEST= Data Set';
run;

/* Plot the target and the residuals against the predicted values */
proc gplot data=regout;
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
   symbol c=black i=none v=dot h=3 pct;
   axis1 c=black width=2.5;
   axis2 c=black width=2.5;
   title 'Diagnostic Plots for the Scored Baseball Data';
   plot residual*predict / haxis=axis1 vaxis=axis2;
run;
quit;




Summary Profile Information

The first section of the output lists the two-level data set name, the response variable, the number of observations, the error distribution,

and the link function.

Design Matrix For Classification Effects

The DMREG procedure uses a deviation from the means method to generate the design matrix for the classification inputs. Each row of

the design matrix is generated by a unique combination of the nominal input values. Each column of the design matrix corresponds to a

model parameter.

If a nominal variable SWING has k levels (3), then its main effect has k-1 (2) degrees of freedom, and the design matrix has k-1 (2)

columns that correspond to the first k-1 levels. The ith column contains a 1 in the ith row, a -1 in the last row, and 0 everywhere else. If $\beta_i$

denotes the parameter that corresponds to the ith level of variable SWING, then the k-1 columns yield estimates of the independent parameters

$\beta_1, \ldots, \beta_{k-1}$. The last parameter is not needed because DMREG constrains the k parameters to sum to 0. Crossed effects, such as

SWING*LEAGUE, are formed by the horizontal direct product of main effects.
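
The deviation-from-means coding described above can be sketched in a few lines of Python. This is an illustration only, not DMREG's implementation: the level names (LEFT, RIGHT, SWITCH, AMERICAN, NATIONAL), the function names, and the column order used for the crossed effect are all assumptions made for the example.

```python
def deviation_coding(levels):
    """Map each level to its design-matrix row (k levels -> k-1 columns).

    The first k-1 levels get an indicator row; the last level gets -1 in
    every column, so each column sums to 0 over the levels.
    """
    k = len(levels)
    coding = {}
    for row, level in enumerate(levels):
        if row < k - 1:
            coding[level] = [1 if col == row else 0 for col in range(k - 1)]
        else:
            coding[level] = [-1] * (k - 1)
    return coding


def crossed_effect(row_a, row_b):
    """Horizontal direct product of two main-effect rows.

    Assumed column order: the second effect varies fastest.
    """
    return [a * b for a in row_a for b in row_b]


# Hypothetical level names, chosen only for illustration.
swing = deviation_coding(["LEFT", "RIGHT", "SWITCH"])
league = deviation_coding(["AMERICAN", "NATIONAL"])

for level, row in swing.items():
    print(level, row)

# Design columns of SWING*LEAGUE for one combination of levels.
print(crossed_effect(swing["LEFT"], league["NATIONAL"]))
```

A three-level input yields two design columns, and the two-level input yields one, so the crossed effect here has 2 x 1 = 2 columns.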

[Output listing: Design Matrix, showing the Classification levels and their Design Columns]

The printing of the design matrix can be suppressed by using the MODEL statement option NODESIGNPRINT.

Model Fitting Information for Each Subset Model of the Stepwise Selection Process

For brevity, only steps number 5 and 8 from the stepwise selection process are listed in the following output. Step number 5 contains the

model that has the smallest SBC statistic. This model is used to score the test data set. Because no other inputs met the condition for

removal from the model and no other variables met the criterion for addition to the model, the stepwise algorithm terminates after step

number 8.
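
Choosing the model with the smallest SBC amounts to taking a minimum over the steps. The sketch below shows only that idea; the SBC values are made up for the example (they are not the values from this analysis), and `choose_step` is a hypothetical helper name.

```python
def choose_step(sbc_by_step):
    """Return the step number whose model has the smallest SBC."""
    return min(sbc_by_step, key=sbc_by_step.get)


# Made-up SBC values, keyed by step number, for illustration only.
sbc_by_step = {5: -30.2, 6: -28.9, 7: -27.5, 8: -26.1}

chosen = choose_step(sbc_by_step)
print(chosen)
```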

For each model subset of the stepwise modeling process, DMREG provides:

An analysis of variance table, which lists degrees of freedom, sums of squares, mean squares, the Model F, and its associated p-value.



Model fitting information which contains the following statistics that enable you to assess the fit of each stepwise model:

R-square - calculated as $R^2 = 1 - \frac{SSE}{SST}$, where SSE is the error sum of squares and SST is the total sum of squares. The $R^2$ statistic ranges from 0 to 1. Models that have large values of $R^2$ are preferred. For step number 8, the regression equation explains 60.17% of the variability in the target.


Adj R-sq - the adjusted $R^2$ is an alternative criterion to the $R^2$ statistic that is adjusted for the number of parameters in the model. This statistic is calculated as $\text{Adj } R^2 = 1 - \frac{(n-i)(1-R^2)}{n-p}$, where n is the number of cases, p is the number of model parameters, and i is an indicator variable that is 1 if the model includes an intercept and 0 otherwise. Large differences between the $R^2$ and the adjusted $R^2$ values for a given model can indicate that you have used too many inputs in the model.


AIC - Akaike's Information Criterion, a goodness-of-fit statistic that you can use to compare one model to another. Lower values indicate a more desirable model. It is calculated as $AIC = n \ln\left(\frac{SSE}{n}\right) + 2p$, where n is the number of cases, SSE is the error sum of squares, and p is the number of model parameters.


BIC - Bayesian Information Criterion, another goodness-of-fit statistic, calculated as $BIC = n \ln\left(\frac{SSE}{n}\right) + 2(p+2)q - 2q^2$, where $q = \frac{n \cdot MSE}{SSE}$ (MSE is obtained from the full model). Smaller BIC values are preferred.


SBC - Schwarz's Bayesian Criterion, another goodness-of-fit statistic, calculated as $SBC = n \ln\left(\frac{SSE}{n}\right) + p \ln(n)$. Models that have small SBC values are preferred. Because the CHOOSE=SBC option was specified, DMREG selects the model that has the smallest SBC value.


C(p) - Mallows' Cp statistic enables you to determine whether your model is underspecified or overspecified. This statistic is calculated as $C(p) = \frac{SSE(p)}{MSE} - (n - 2p)$, where SSE(p) is the error sum of squares for the subset model with p parameters (including the intercept, if any), MSE is the error mean square for the full model, and n is the number of cases. For any subset model with C(p) > p, there is evidence of bias due to an incompletely specified model (your model may not contain enough inputs). However, if there are values of C(p) < p, the full model is said to be overspecified. When the right model is chosen, the parameter estimates are unbiased, and this is reflected in a C(p) value below, or at least near, p.



An analysis of effects and parameter estimates table, which lists the effect, degrees of freedom, parameter estimate, standard error, Type II sum of squares, F-value, and the corresponding p-value.
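
The fit statistics listed above can be reproduced from a handful of sums of squares. The following Python sketch applies the standard least-squares forms of these formulas; it is an illustration under stated assumptions, not DMREG's code. The function name `fit_statistics` and all input numbers are invented for the example.

```python
import math


def fit_statistics(n, p, sse, sst, mse_full, intercept=True):
    """Compute the stepwise fit statistics for one subset model.

    n: number of cases; p: number of model parameters;
    sse: error sum of squares of the subset model;
    sst: total sum of squares; mse_full: error mean square of the full model.
    """
    i = 1 if intercept else 0
    r2 = 1 - sse / sst
    adj_r2 = 1 - (n - i) * (1 - r2) / (n - p)
    log_term = n * math.log(sse / n)
    q = n * mse_full / sse
    return {
        "R-square": r2,
        "Adj R-sq": adj_r2,
        "AIC": log_term + 2 * p,
        "BIC": log_term + 2 * (p + 2) * q - 2 * q ** 2,
        "SBC": log_term + p * math.log(n),
        "C(p)": sse / mse_full - (n - 2 * p),
    }


# Made-up inputs, for illustration only.
stats = fit_statistics(n=100, p=5, sse=40.0, sst=100.0, mse_full=0.4)
for name, value in stats.items():
    print(f"{name:9s} {value:10.4f}")
```

With these invented inputs C(p) is larger than p, which by the rule above would point to an underspecified subset model.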

