Creating the Design Matrix Data Set for Classification Inputs
Comparing the Results of the Procedures
This example provides a comparison of the DMREG and LOGISTIC procedures when using a categorical input to model a binary target.
The example data set SAMPSIO.HMEQ contains fictitious mortgage data where each case represents an applicant for a home equity loan.
All applicants have an existing mortgage.
The binary target BAD represents whether or not an applicant eventually defaulted or was ever seriously delinquent. There are nine
continuous inputs available for modeling. JOB is the only categorical input used to predict the target BAD.
When you compare the output from the DMREG and LOGISTIC procedures, you must take into consideration how each procedure
handles categorical variables. By default, DMREG uses deviations from the mean coding for the classification variables. In the
design matrix for a class effect, each column takes the value 1 for its own level, -1 for the reference level, and 0 otherwise.
This coding is sometimes referred to as "effects", "center-point", or "full-rank" coding. The parameters for these categorical
indicators measure the difference between each level and the average over all levels.
Because the LOGISTIC procedure does not enable you to specify class inputs directly in the MODEL statement, you must first create an
input data set that contains the design matrix for the class variables. To create the design matrix data set for input to the LOGISTIC
procedure, you can use a SAS DATA step, the TRANSREG procedure, or the GENMOD procedure. If you use the deviations from the mean
coding method to code the class variables, then the LOGISTIC output will match the output generated from the default DMREG run.
If you use the GLM non-full rank coding (0, 1) to code the class variables, you must set the DMREG CODE= MODEL statement option to
GLM. In this case, both procedures will generate the same output.
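The difference between the two codings can be illustrated with a small DATA step. This is only a sketch on a hypothetical three-level input: the data set names TOY and TOY_DESIGN, the variable LEVEL, and the choice of C as the reference level are assumptions made for illustration and are not part of the example program.

```sas
/* Hypothetical three-level input; level C is treated as the reference. */
data toy;
   input level $ @@;
   datalines;
A B C
;
run;

data toy_design;
   set toy;
   /* Deviations from the mean: 1 for the level, -1 for the reference, 0 otherwise. */
   dev_a = (level = 'A') - (level = 'C');
   dev_b = (level = 'B') - (level = 'C');
   /* GLM coding: one 0/1 indicator per level, so the columns are not full rank. */
   glm_a = (level = 'A');
   glm_b = (level = 'B');
   glm_c = (level = 'C');
run;

proc print data=toy_design noobs;
run;
```

For level A the deviations columns are (1, 0), for level B they are (0, 1), and for the reference level C they are (-1, -1). The GLM columns contain a single 1 in the position of the observed level. Under the deviations coding, the effect of the reference level is not estimated directly; it is the negative of the sum of the other level parameters.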
Program: Deviations from the Mean Coding
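proc freq data=sampsio.hmeq;
   tables job;
   title 'JOB Classification Table';
run;

/* Impute missing JOB values with the mode, 'Other'. */
data hmeq;
   set sampsio.hmeq;
   if job = ' ' then job='Other';
run;

/* Code JOB by deviations; the output data set name DESIGN is assumed. */
proc transreg data=hmeq design;
   model class (job/deviations);
   id bad loan mortdue value yoj derog clage ninq clno debtinc;
   output out=design;
run;

/* TRANSREG stores the coded column names in the macro variable _TRGIND. */
proc logistic data=design descending;
   model bad = &_trgind loan mortdue value yoj derog clage ninq clno debtinc;
   title 'LOGISTIC Home Equity Data: Deviations from the Mean Coding';
run;

/* The catalog name CATHMEQ and the CLASS list are assumed. */
proc dmdb batch data=hmeq out=dm_data dmdbcat=cathmeq;
   var loan mortdue value yoj derog clage ninq clno debtinc;
   class bad job;
run;
proc dmreg data=dm_data dmdbcat=cathmeq;
   class bad job;
   model bad = job loan mortdue value yoj derog clage ninq clno debtinc;
   title1 'DMREG Home Equity Data: Default Deviations from the Mean Coding';
run;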
Output: Deviations from the Mean Coding
FREQ Classification Table for JOB.
The categorical input JOB contains 7 levels. Notice that 279 cases have missing values. Both the DMREG and LOGISTIC procedures omit
observations that have missing values from the analysis. For this example, the missing values are imputed using the mode of JOB.
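Instead of typing the replacement value by hand, the mode could be looked up from the frequency counts first. The following is a minimal sketch of that idea, not the method used in the example program; the data set name JOB_COUNTS and the macro variable JOB_MODE are hypothetical names introduced here.

```sas
/* Count each JOB level; JOB_COUNTS is a hypothetical data set name. */
proc freq data=sampsio.hmeq noprint;
   tables job / out=job_counts;
run;

/* Put the most frequent level first. */
proc sort data=job_counts;
   by descending count;
run;

/* Store the first nonmissing level in the hypothetical macro variable JOB_MODE. */
data _null_;
   set job_counts;
   where job ne ' ';
   call symputx('job_mode', job);
   stop;
run;

/* Replace missing JOB values with the stored mode. */
data hmeq;
   set sampsio.hmeq;
   if job = ' ' then job = "&job_mode";
run;
```

The WHERE statement excludes the missing level from the lookup, so the stored value is always an observed level of JOB even if missing values were the most frequent entry.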
Notice that the DMREG output matches the output generated from the LOGISTIC run.