/* List selected fit statistics from the OUTEST= data set */
proc print data=regest noobs label;
   var _step_ _chosen_ _sbc_ _mse_ _averr_ _tmse_ _taverr_;
   where _type_ = 'PARMS';
   title 'Partial Listing of the OUTEST= Data Set';
run;
/* Plot LOGSALAR against PREDICT for the scored data */
proc gplot data=regout;
   plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
   symbol c=black i=none v=dot h=3 pct;
   axis1 c=black width=2.5;
   axis2 c=black width=2.5;
   title 'Diagnostic Plots for the Scored Baseball Data';
run;
quit;
The first section of the output lists the two-level data set name, the response variable, the number of observations, the error distribution,
and the link function.
Design Matrix For Classification Effects
The DMREG procedure uses a deviation from the means method to generate the design matrix for the classification inputs. Each row of
the design matrix is generated by a unique combination of the nominal input values. Each column of the design matrix corresponds to a
model parameter.
If a nominal variable such as SWING has k levels (here, k = 3), then its main effect has k-1 (2) degrees of freedom, and the design matrix
has k-1 (2) columns that correspond to the first k-1 levels. The ith column contains a 1 in the ith row, a -1 in the last row, and 0
everywhere else. If $\beta_i$ denotes the parameter that corresponds to the ith level of variable SWING, then the k-1 columns yield
estimates of the independent parameters $\beta_1, \ldots, \beta_{k-1}$. The last parameter, $\beta_k$, is not needed because DMREG
constrains the k parameters to sum to 0. Crossed effects, such as SWING*LEAGUE, are formed by the horizontal direct product of main effects.
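The deviation-from-the-means coding described above can be illustrated with a short Python sketch. This is not DMREG code: the function names are hypothetical, and the assumption that LEAGUE has two levels is made only for the example. It builds the k-1 columns for a k-level input and forms a crossed effect as the row-wise (horizontal) direct product of two main effects.

```python
def deviation_columns(k):
    """Return one coded row per level for a k-level nominal input.

    Column i has a 1 in row i, a -1 in the last row, and 0 elsewhere,
    so each column sums to 0 across the k levels.
    """
    rows = []
    for level in range(k):
        if level < k - 1:
            rows.append([1 if col == level else 0 for col in range(k - 1)])
        else:
            rows.append([-1] * (k - 1))
    return rows


def horizontal_direct_product(a_row, b_row):
    """Multiply every entry of a_row by every entry of b_row."""
    return [a * b for a in a_row for b in b_row]


# SWING has 3 levels (from the text); LEAGUE with 2 levels is an assumption.
swing = deviation_columns(3)
league = deviation_columns(2)

# One coded row of the SWING*LEAGUE effect per combination of levels.
crossed = {
    (s, l): horizontal_direct_product(swing[s], league[l])
    for s in range(len(swing))
    for l in range(len(league))
}

for row in swing:
    print(row)
```

Under these assumptions the crossed effect has (3-1)(2-1) = 2 columns, which matches its degrees of freedom.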
Design Matrix Classification
The printing of the design matrix can be suppressed by using the MODEL statement option NODESIGNPRINT.
For brevity, only steps 5 and 8 from the stepwise selection process are listed in the following output. Step 5 contains the
model that has the smallest SBC statistic. This model is used to score the test data set. Because no other inputs met the condition for
removal from the model and no other variables met the criterion for addition to the model, the stepwise algorithm terminates after step 8.
For each model subset of the stepwise modeling process, DMREG provides:
An analysis of variance table that lists degrees of freedom, sums of squares, mean squares, the Model F, and its associated
p-value.
R-square, which is calculated as $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$, where SSE is the error sum of squares and SST is the total sum of squares.
The $R^2$ statistic ranges from 0 to 1. Models that have large values of $R^2$ are preferred. For step 8, the regression
equation explains 60.17% of the variability in the target.
Adjusted $R^2$ is an alternative criterion to the $R^2$ statistic that is adjusted for the number of parameters in the model.
This statistic is calculated as
$R^2_{\mathrm{adj}} = 1 - \frac{(n - i)(1 - R^2)}{n - p}$, where n is the number of cases, p is the number of model parameters, and i is an
indicator variable that is 1 if the model includes an intercept and 0 otherwise. Large differences between the $R^2$ and Adjusted $R^2$
values for a given model can indicate that you have used too many inputs in the model.
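AIC - Akaike's Information Criterion is a goodness-of-fit statistic. Lower values indicate a more desirable model. It is calculated as
$\mathrm{AIC} = n \ln(\mathrm{SSE}/n) + 2p$, where n is the number of cases,
SSE is the error sum of squares, and p is the number of model parameters.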
BIC - Bayesian Information Criterion is another goodness-of-fit statistic that is calculated as
$\mathrm{BIC} = n \ln(\mathrm{SSE}/n) + 2(p + 2)q - 2q^2$, where $q = n \cdot \mathrm{MSE}/\mathrm{SSE}$ (MSE is obtained from the full
model). Smaller BIC values are preferred.
SBC - Schwarz's Bayesian Criterion is another goodness-of-fit statistic that is calculated as
$\mathrm{SBC} = n \ln(\mathrm{SSE}/n) + p \ln(n)$. Models that have small SBC values are preferred. Because the CHOOSE=SBC option
was specified, DMREG selects the model that has the smallest SBC value.
C(p) - Mallows' C(p) statistic is calculated as
$C(p) = \mathrm{SSE}(p)/\mathrm{MSE} - (n - 2p)$, where SSE(p) is the error sum of squares for the subset model with p parameters including
the intercept if any, MSE is the error mean square for the full model, and n is the number of cases. For any subset model with C(p)
> p, there is evidence of bias due to an incompletely specified model (your model may not contain enough inputs). However,
if there are values of C(p) < p, the full model is said to be overspecified. When the right model is chosen, the parameter
estimates are unbiased, and this is reflected in C(p) < p or at least near p.
sums of squares, F-value and the corresponding p-value.
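As a rough illustration of how the model fitting statistics relate to one another, the following Python sketch computes them from the error and total sums of squares. It assumes the standard least squares forms of these statistics; it is not DMREG code, the function name is hypothetical, and the input values are made up for the example rather than taken from the baseball data.

```python
import math


def fit_statistics(sse, sst, n, p, mse_full, intercept=True):
    """Compute fit statistics for a subset model with p parameters.

    sse      -- error sum of squares of the subset model
    sst      -- total sum of squares
    n        -- number of cases
    mse_full -- error mean square of the full model
    """
    i = 1 if intercept else 0
    r2 = 1 - sse / sst
    adj_r2 = 1 - (n - i) * (1 - r2) / (n - p)
    log_term = n * math.log(sse / n)
    q = n * mse_full / sse
    return {
        "rsquare": r2,
        "adjrsq": adj_r2,
        "aic": log_term + 2 * p,
        "bic": log_term + 2 * (p + 2) * q - 2 * q ** 2,
        "sbc": log_term + p * math.log(n),
        "cp": sse / mse_full - (n - 2 * p),
    }


# Hypothetical inputs, chosen only to exercise the formulas.
stats = fit_statistics(sse=40.0, sst=100.0, n=100, p=5, mse_full=0.4)
for name, value in stats.items():
    print(f"{name:8s} {value:10.4f}")
```

Because the SBC penalty per parameter is ln(n) rather than 2, SBC penalizes additional parameters more heavily than AIC whenever n exceeds about 7 cases, so a CHOOSE=SBC selection tends to favor smaller models.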