The PASSES= option specifies an upper bound for the number of passes
that are made through the data.
passes=20
The OUTVARS= option creates a data set containing splitting information.
outvars=vout;
The VAR statement specifies the numeric and categorical inputs (independent
variables).
var amount income homeval frequent recency age
domestic apparel marital ntitle gender telind origin
job statecod numcars edlevel;
The TARGET statement defines the target (response) variable.
target purchase;
title 'DMSPLIT: Binary Target';
run;
PROC PRINT creates a partial report of the OUTVARS= data set.
proc print data=vout(obs=20);
title2 'OUTVARS= Summary Data';
run;
The PROC SPLIT statement invokes the procedure. The INDMSPLIT option
specifies that the tree created by PROC DMSPLIT is read. The DMSPLIT tree information
is stored in the DMDB catalog.
title 'Import and Save Tree from DMSPLT';
proc split dmdbcat=catexa1 indmsplit
The OUTMATRIX= option names the data set that contains the training statistics for the tree.
outmatrix=trtree
The OUTLEAF= option names the data set that contains statistics for
each leaf node.
outleaf=leafdata
The OUTTREE= option specifies the output data set that describes the
tree. The OUTTREE data set can be used as input in subsequent executions of
PROC SPLIT.
outtree=savetree;
run;
PROC PRINT creates a report of the training statistics.
proc print data=trtree label;
title2 'Training Statistics';
run;
PROC PRINT creates a partial report of the leaf statistics for the
training data.
proc print data=leafdata(obs=10) label;
title2 'Leaf Statistics';
run;
The DATA step creates a fictitious score data set.
data testexa1(drop=ran);
set sampsio.dmexa1;
ran=ranuni(3333);
if ran lt 0.08;
title 'Create Fictitious Score Data Set';
run;
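Because this DATA step samples directly from SAMPSIO.DMEXA1, the score data overlap the training data. As a minimal sketch of a disjoint split, a single DATA step can route each observation to exactly one of two output data sets. The data set names TRAINEXA1 and HOLDEXA1 are hypothetical and are not part of the original example.

```sas
/* Hypothetical names: TRAINEXA1 and HOLDEXA1 do not appear in the original example. */
data trainexa1 holdexa1;
   set sampsio.dmexa1;
   /* RANUNI returns a uniform(0,1) value; the seed 3333 mirrors the step above. */
   if ranuni(3333) lt 0.08 then output holdexa1;   /* roughly 8% held out for scoring */
   else output trainexa1;                          /* remainder kept for training */
run;
```

For the holdout to be a genuine test set, the tree would have to be built from TRAINEXA1 rather than from the full SAMPSIO.DMEXA1 data set.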
The INTREE= option reads the tree that was saved from the previous
PROC SPLIT step.
proc split intree=savetree;
The SCORE statement scores the DATA= data set. The OUTFIT= option names
the output data set containing fit statistics. The OUT= option names the output
data set that contains tree statistics for the scored data set. Typically,
you would score a data set that is disjoint from the training data and that may or may
not contain the target values. Here, the WORK.TESTEXA1 data set is only a random subset
of the SAMPSIO.DMEXA1 training data set.
score data=testexa1 nodmdb
outfit=tfit out=tout;
title 'Input Tree and Score Test Data';
PROC PRINT creates a report of the fit statistics for the scored data
set.
proc print data=tfit label;
title2 'Fit Statistics for the Scored Data Set';
run;
PROC FREQ creates a misclassification table for the scored data set.
The F_PURCHA variable is the actual target value for each customer and the
I_PURCHA variable is the target value into which the customer is classified.
proc freq data=tout;
tables f_purcha*i_purcha;
title2 'Scored Data';
title3 'Misclassification Table';
run;
PROC PRINT creates a partial report of selected variables from the OUT=
score information data set.
proc print data=tout(obs=10) label;
var _node_ a_ a_yes a_no d_purcha f_purcha
i_purcha p_puryes p_purno p_pur r_puryes
r_purno r_pur;
title2 'Score Summary Data';
run;
The EMCLUS Procedure
Overview
Procedure Syntax
PROC EMCLUS Statement
VAR Statement
INITCLUS Statement
Output from PROC EMCLUS
Examples
Example 1: Syntax for PROC FASTCLUS
Example 2: Use of the EMCLUS Procedure
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
Overview
The EMCLUS procedure uses a scalable version of the Expectation-Maximization (EM) algorithm to
cluster a data set. You have the option to run the standard EM algorithm, in which the parameters are
estimated using the entire data set, or the scaled EM algorithm, in which a portion of the data set is used
to estimate the parameters at each iteration of the procedure. The standard EM algorithm is run when
no value is specified for the NOBS option. It requires that the entire data set fit in memory so that
the total number of observations can be determined. The scaled EM
algorithm can be run by specifying a value for the NOBS option that is less than the total number of
observations in the data set. When the scaled EM algorithm is used, it is important that the input data set
is randomized beforehand.
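One way to randomize the order of the observations beforehand is to attach a random sort key and sort by it. The sketch below assumes a hypothetical input data set WORK.MYDATA. The names KEYED, SHUFFLED, and _SORTKEY, and the seed 12345, are illustrative only.

```sas
/* Hypothetical names: MYDATA is the input, SHUFFLED is the randomized copy. */
data keyed;
   set mydata;
   _sortkey = ranuni(12345);   /* uniform(0,1) random key; the seed is arbitrary */
run;

/* Sorting by the random key puts the observations in random order. */
proc sort data=keyed out=shuffled(drop=_sortkey);
   by _sortkey;
run;
```

The shuffled data set would then be supplied as the input data set, so that each portion read by the scaled EM algorithm resembles a random sample.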
The EMCLUS procedure identifies primary clusters, which are the densest regions of data points, and
also identifies secondary clusters, which are typically smaller dense clusters. Each primary cluster is
modeled with a weighted n-dimensional multivariate normal distribution, where n is the number of
variables in the data set. Thus each primary cluster is modeled with the function w*f(x|u,V), where
f(x|u,V) ~ MVN(u,V) and V is a diagonal matrix. There are four major parts in the EMCLUS procedure:
- Obtain and possibly refine the initial parameter estimates.
- Apply the EM algorithm to update parameter values.
- Summarize data in the primary summarization phase.
- Summarize data in the secondary summarization phase.
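Because V is diagonal, the weighted density w*f(x|u,V) used for a primary cluster factors into a product of univariate normal densities, one per variable. Written out, with v_1, ..., v_n denoting the diagonal entries of V (notation introduced here for illustration only):

```latex
w \, f(x \mid u, V)
  = w \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi v_j}}
    \exp\!\left( -\frac{(x_j - u_j)^2}{2 v_j} \right),
\qquad V = \operatorname{diag}(v_1, \ldots, v_n).
```

A practical consequence is that each primary cluster is described by its weight w, n means, and n variances, rather than by a full n-by-n covariance matrix.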
The effectiveness of the EMCLUS procedure depends on the initial parameter estimates. Good initial
parameter estimates generally lead to faster convergence and better final estimates. The initial
parameter estimates can be obtained randomly, from PROC FASTCLUS, or from PROC
EMCLUS. See Example 1 for the PROC FASTCLUS syntax. PROC FASTCLUS sometimes returns
poor results (clusters with low frequency counts). In this case, the poor clusters can be
ignored by specifying appropriate values for the INITCLUS option. PROC FASTCLUS can also return
clusters that are actually groups of clusters. Such clusters can be recognized by a large frequency
count combined with a large root-mean-square standard deviation. These clusters should not be ignored.
Instead, specify a value of INITSTD that is smaller than the root-mean-square standard
deviations of these clusters (see Example 2). When PROC FASTCLUS returns poor results, it may also
help to rerun PROC FASTCLUS with a larger value of MAXCLUSTERS and then choose the best clusters as
the initial values for PROC EMCLUS. Initial estimates obtained from PROC EMCLUS can be used to
refine the primary clusters or to obtain better primary clusters.
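To judge which FASTCLUS clusters are poor and which are likely groups of clusters, it can help to list the frequency count and root-mean-square standard deviation of each cluster. The following is a minimal sketch, not the syntax of Example 1. The data set names MYDATA and SEEDS, the variables X1-X5, and the value MAXCLUSTERS=20 are all placeholders.

```sas
/* Hypothetical names: MYDATA, SEEDS, and X1-X5 are placeholders. */
proc fastclus data=mydata maxclusters=20 outseed=seeds noprint;
   var x1-x5;
run;

/* List the clusters from largest to smallest frequency count. */
proc sort data=seeds;
   by descending _freq_;
run;

/* The OUTSEED= data set holds one row per cluster, including _FREQ_ and _RMSSTD_. */
proc print data=seeds;
   var cluster _freq_ _rmsstd_;
run;
```

Rows with a very small _FREQ_ are candidates to ignore. Rows with both a large _FREQ_ and a large _RMSSTD_ are candidates for the INITSTD adjustment described above.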
The EM algorithm is used to find the primary clusters and to update the model parameters. The EM
algorithm terminates when two successive log-likelihood values differ in relative and absolute
magnitude by a particular amount or when ITER iterations have been reached.
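One possible reading of this stopping rule, offered as an assumption and not as the procedure's documented formula, uses the log-likelihood after iteration t, written here as L_t, together with hypothetical absolute and relative tolerances:

```latex
% Assumed form: \varepsilon_{\mathrm{abs}} and \varepsilon_{\mathrm{rel}} are hypothetical tolerances.
\text{stop at iteration } t \text{ if }
\Bigl( \lvert L_t - L_{t-1} \rvert \le \varepsilon_{\mathrm{abs}}
\ \text{ and } \
\frac{\lvert L_t - L_{t-1} \rvert}{\lvert L_{t-1} \rvert} \le \varepsilon_{\mathrm{rel}} \Bigr)
\ \text{ or } \ t = \mathrm{ITER}.
```

Whether the two tolerance tests are combined with "and" or "or", and how the tolerances are set, is not stated in this overview.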
The primary summarization phase summarizes observations near each of the primary cluster means and