proc print data=regest noobs label;
var _step_ _chosen_ _sbc_ _mse_ _averr_ _tmse_ _taverr_;
where _type_ = 'PARMS';
title 'Partial Listing of the OUTEST= Data Set';
run;
proc gplot data=regout;
plot logsalar*predict / haxis=axis1 vaxis=axis2 frame;
symbol c=black i=none v=dot h=3 pct;
axis1 c=black width=2.5;
axis2 c=black width=2.5;
title 'Diagnostic Plots for the Scored Baseball Data';
plot residual*predict / haxis=axis1 vaxis=axis2;
run;
quit;
Output
Summary Profile Information
The first section of the output lists the two-level data set name, the response variable, the number of observations, the error distribution,
and the link function.
Design Matrix For Classification Effects
The DMREG procedure uses a deviation from the means method to generate the design matrix for the classification inputs. Each row of
the design matrix is generated by a unique combination of the nominal input values. Each column of the design matrix corresponds to a
model parameter.
If a nominal variable SWING has k levels (3), then its main effect has k-1 (2) degrees of freedom, and the design matrix has k-1 (2) columns that correspond to the first k-1 levels. The ith column contains a 1 in the ith row, a -1 in the last row, and 0 everywhere else. If α_i denotes the parameter that corresponds to the ith level of variable SWING, then the k-1 columns yield estimates of the independent parameters α_1, ..., α_(k-1). The last parameter, α_k, is not needed because DMREG constrains the k parameters to sum to 0. Crossed effects, such as SWING*LEAGUE, are formed by the horizontal direct product of the main effects.
Design Matrix Table for Classification Data

  Levels for SWING    Design Columns
  Left                 1    0
  Right                0    1
  Switch              -1   -1
The printing of the design matrix can be suppressed by using the MODEL statement option NODESIGNPRINT.
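As an illustration of the deviation-from-means coding described above, the following Python sketch builds the SWING design columns by hand (the helper name is invented for this example; DMREG generates the design matrix internally):

```python
def deviation_columns(levels):
    """Deviation-from-means (effects) coding: k levels -> k-1 columns.
    Column i carries a 1 for level i, a -1 for the last level, and 0
    elsewhere, so the k level parameters are constrained to sum to 0."""
    k = len(levels)
    coded = {}
    for r, level in enumerate(levels):
        if r == k - 1:                      # last level gets all -1s
            coded[level] = [-1] * (k - 1)
        else:
            coded[level] = [1 if c == r else 0 for c in range(k - 1)]
    return coded

print(deviation_columns(["Left", "Right", "Switch"]))
# {'Left': [1, 0], 'Right': [0, 1], 'Switch': [-1, -1]}
```

The rows reproduce the design matrix table above: each of the first k-1 levels gets its own column, and the last level is the negative sum of the others.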
Model Fitting Information for Each Subset Model of the Stepwise Selection Process
For brevity, only steps number 5 and 8 from the stepwise selection process are listed in the following output. Step number 5 contains the
model that has the smallest SBC statistic. This model is used to score the test data set. Because no other inputs met the condition for
removal from the model and no other variables met the criterion for addition to the model, the stepwise algorithm terminates after step
number 8.
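The interaction between the stepwise search and the CHOOSE=SBC option can be sketched as follows. This is illustrative Python on synthetic data, not DMREG's actual algorithm: a forward search adds one input per step, and the scoring model is the subset with the smallest SBC rather than the last subset fit (all data and variable indices here are invented):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the modeling data: the target depends on the first
# two of five candidate inputs.
n = 200
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def sse_for(cols):
    """Error sum of squares of an OLS fit on an intercept plus cols."""
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

# Forward selection: at each step, add the input that most reduces SSE,
# and record the SBC = n*ln(SSE/n) + p*ln(n) of the resulting subset.
chosen, remaining, history = [], list(range(5)), []
for step in range(1, 6):
    best = min(remaining, key=lambda c: sse_for(chosen + [c]))
    chosen.append(best)
    remaining.remove(best)
    p = len(chosen) + 1                     # parameters, incl. intercept
    sbc = n * math.log(sse_for(chosen) / n) + p * math.log(n)
    history.append((step, list(chosen), sbc))

# CHOOSE=SBC-style pick: score with the subset whose SBC is smallest.
best_step, best_subset, best_sbc = min(history, key=lambda h: h[2])
print("scoring model from step", best_step, "with inputs", best_subset)
```

On this toy data the two truly predictive inputs enter first, and the SBC penalty p·ln(n) keeps the later, noise-only steps from being chosen.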
For each model subset of the stepwise modeling process, DMREG provides:

1. An analysis of variance table, which lists the degrees of freedom, sums of squares, mean squares, the model F statistic, and its associated p-value.

2. Model fitting information, which contains the following statistics that enable you to assess the fit of each stepwise model:

   - R-square - calculated as R² = 1 − SSE/SST, where SSE is the error sum of squares and SST is the total sum of squares. The R² statistic ranges from 0 to 1, and models that have large values of R² are preferred. For step 8, the regression equation explains 60.17% of the variability in the target.
   - Adj R-sq - the adjusted R² is an alternative to the R² statistic that adjusts for the number of parameters in the model. It is calculated as Adj R² = 1 − ((n − i)(1 − R²))/(n − p), where n is the number of cases, p is the number of model parameters, and i is an indicator variable that is 1 if the model includes an intercept and 0 otherwise. A large difference between the R² and Adj R² values for a given model can indicate that you have used too many inputs in the model.
   - AIC - Akaike's Information Criterion is a goodness-of-fit statistic that you can use to compare one model to another; lower values indicate a more desirable model. It is calculated as AIC = n·ln(SSE/n) + 2p, where n is the number of cases, SSE is the error sum of squares, and p is the number of model parameters.
   - BIC - the Bayesian Information Criterion is another goodness-of-fit statistic, calculated as BIC = n·ln(SSE/n) + 2(p + 2)q − 2q², where q = n·MSE/SSE (MSE is obtained from the full model). Smaller BIC values are preferred.
   - SBC - Schwarz's Bayesian Criterion is another goodness-of-fit statistic, calculated as SBC = n·ln(SSE/n) + p·ln(n). Models that have small SBC values are preferred. Because the CHOOSE=SBC option was specified, DMREG selects the model that has the smallest SBC value.
   - C(p) - Mallows' Cp statistic enables you to determine whether your model is underspecified or overspecified. It is calculated as C(p) = SSE(p)/MSE − (n − 2p), where SSE(p) is the error sum of squares for the subset model with p parameters (including the intercept, if any), MSE is the error mean square for the full model, and n is the number of cases. If C(p) > p for a subset model, there is evidence of bias due to an incompletely specified model (the model may not contain enough inputs); if C(p) < p, the full model is said to be overspecified. When the right model is chosen, the parameter estimates are unbiased, which is reflected in a C(p) value equal to, or at least near, p.
3. An analysis of effects and parameter estimates, which contains the effect, degrees of freedom, parameter estimate, standard error, Type II sum of squares, F value, and the corresponding p-value.
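The fit statistics above can be computed directly from the sums of squares. The following Python sketch implements the formulas as stated; the function name and the example values are invented for illustration and are not part of any DMREG output or API:

```python
import math

def fit_statistics(sse, sst, n, p, mse_full, intercept=True):
    """Fit statistics from the formulas above.  sse/sst: error and total
    sums of squares of the subset model; n: number of cases; p: number
    of model parameters (including the intercept, if any); mse_full:
    error mean square of the full model."""
    i = 1 if intercept else 0
    r2 = 1.0 - sse / sst
    q = n * mse_full / sse
    return {
        "R2": r2,
        "AdjR2": 1.0 - (n - i) * (1.0 - r2) / (n - p),
        "AIC": n * math.log(sse / n) + 2 * p,
        "BIC": n * math.log(sse / n) + 2 * (p + 2) * q - 2 * q * q,
        "SBC": n * math.log(sse / n) + p * math.log(n),
        "Cp": sse / mse_full - (n - 2 * p),
    }

# Invented example values (not taken from the baseball output):
stats = fit_statistics(sse=40.0, sst=100.0, n=50, p=4, mse_full=0.9)
print({k: round(v, 4) for k, v in stats.items()})
```

Note one built-in check on the Cp formula: for the full model itself, SSE equals MSE times its error degrees of freedom, so C(p) works out to exactly p, consistent with "near p" indicating a well-specified subset.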