INIT=EMCLUS is specified, then the option SEEDS= must also be specified, and the data
set must be the OUTSEEDS data set from PROC FASTCLUS or the OUTSTAT data set from PROC EMCLUS,
Specifies the minimum number of observations in each primary cluster. At any iteration, if the total number of
observations summarized in a cluster is less than MIN, then the cluster becomes inactive, and the cluster is
reseeded at a more appropriate point, if one exists. The default value is 3.
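The inactivity rule described above can be illustrated with a minimal Python sketch. It is not the procedure's implementation. The function name inactive_clusters and the sample counts are hypothetical, and the sketch covers only the comparison against the minimum, not the reseeding step.

```python
def inactive_clusters(counts, min_obs=3):
    """Return the indices of clusters whose summarized observation
    count is below min_obs (3 mirrors the documented default)."""
    return [h for h, n in enumerate(counts) if n < min_obs]


# Hypothetical per-cluster observation counts at some iteration.
counts = [120, 2, 57, 0]
print(inactive_clusters(counts))
```

A cluster whose count equals the minimum exactly stays active, because the text describes the test as "less than".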
Specifies the number of observations to be read in for each iteration. The default value is the number of
observations in the data set, provided that this number can be determined. If the number of observations in the
data set cannot be determined, the default value is 500.
Specifies the data set that is used for the initial parameter estimates. This option must be used with
INIT=FASTCLUS or with INIT=EMCLUS. With PROC FASTCLUS, this specified data set is the resulting
SAS data set from the OUTSEEDS option. With PROC EMCLUS, this data set is the resulting SAS data set
from the OUTSTAT option.
OUTSTAT = libref.SAS-data-set
Specifies an output data set. This data set has 5+D columns, where D is the number of variables. Column 1
contains the cluster number. Column 2 is the type of cluster (primary or secondary). Column 3 is the cluster
frequency. Column 4 is the estimate for the weight parameter. Column 5 is labeled _TYPE_, where
_TYPE_=MEAN or _TYPE_=VAR. Columns 6 through 5+D contain either the estimates for the mean or the
variance of each variable. Each variable corresponds to two rows of this data set in the following form:
Specifies the minimum distance between initial clusters when INIT=RANDOM is used. It also specifies the
minimum distance between initial cluster means in the secondary data summarization phase. The default value is
the square root of the average of the sample variances obtained from the observations read in during the first
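As a rough illustration of the default described above, the following Python sketch computes the square root of the average sample variance for a batch of observations. It assumes that "the average of the sample variances" means the mean of the per-variable sample variances (n-1 denominator). The function name default_min_distance and the sample rows are hypothetical.

```python
import math
from statistics import variance


def default_min_distance(rows):
    """rows: list of observations, each a list of D numeric values.
    Returns sqrt(mean of the D per-variable sample variances)."""
    columns = list(zip(*rows))                       # one tuple per variable
    variances = [variance(col) for col in columns]   # n-1 denominator
    return math.sqrt(sum(variances) / len(variances))


# Hypothetical batch of three observations on two variables.
rows = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
print(default_min_distance(rows))
```

For these rows the per-variable sample variances are 4 and 16, so the result is the square root of 10.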
PRINT = ALL|LAST|NONE
Specifies how much output is printed. If PRINT=LAST is used, then the initial estimates and the output from
the last iteration are shown. The default is PRINT=LAST.
SECITER = positive integer
Specifies the maximum number of iterations of the k-means algorithm in the secondary data summarization
phase. The default value is 1.
OUT = libref.SAS-data-set
Specifies the name of the data set that contains the probabilities PROB_h = P(x is in cluster h), h=1,2,...,k. This
option must be used with ROLE=SCORE. The resulting data set will also contain the original data.
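The following Python sketch shows one way such membership probabilities could be computed from a weight, a mean vector, and a variance vector per cluster. It assumes a Gaussian mixture with independent variables within each cluster, which this section does not state explicitly. The function name cluster_probabilities and the sample parameters are hypothetical.

```python
import math


def cluster_probabilities(x, weights, means, variances):
    """Posterior P(x is in cluster h) for each cluster h, assuming a
    Gaussian mixture with independent variables in each cluster."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of x under cluster h, summed over the variables.
        log_density = sum(
            -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_density)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_terms)
    unnormalized = [math.exp(t - top) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical two-cluster, two-variable parameters.
probs = cluster_probabilities(
    x=[0.2, 1.1],
    weights=[0.6, 0.4],
    means=[[0.0, 1.0], [4.0, 5.0]],
    variances=[[1.0, 1.0], [2.0, 2.0]],
)
print(probs)
```

The returned values sum to 1 and play the role of PROB_1 through PROB_k for a single observation.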
CLEAR = non-negative integer
memory following the secondary summarization phase after every n iterations. The default value is 0, which
means that no observation will be deleted from the memory.
OUTLIERS = IGNORE|KEEP
Specifies how the outlier observations are weighted when the scaled EM algorithm is implemented. If
OUTLIERS=IGNORE is specified, observations that are not in the 99th percentile of any estimated primary
cluster are weighted less. If OUTLIERS=KEEP is specified, these observations are weighted normally as in the
standard EM algorithm.
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
Specifies which variables are to be used. If this statement is omitted, then all variables from the
input data set will be used.
Specifies which clusters from PROC FASTCLUS or PROC EMCLUS are to be used as initial
estimates for PROC EMCLUS. For example, if the output from PROC FASTCLUS shows that
only clusters 2, 4-8, and 10 have good results, then you could use the option INITCLUS
number of clusters specified in the INITCLUS statement should not exceed the number of clusters
specified with the CLUSTERS= option in the PROC EMCLUS statement. The default for INITCLUS
with INIT=EMCLUS is that all the clusters will be used as initial estimates. With
INIT=FASTCLUS, the default is that the clusters with the highest frequency counts will be used.
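The INIT=FASTCLUS default described above amounts to keeping the clusters with the largest frequency counts. A minimal Python sketch of that selection follows. The function name default_initial_clusters, the sample frequencies, and the tie-breaking rule (lower cluster number first) are assumptions for illustration, not documented behavior.

```python
def default_initial_clusters(frequencies, k):
    """frequencies: dict mapping cluster number -> frequency count.
    Returns the k cluster numbers with the highest counts, in ascending
    order. Ties go to the lower cluster number (an assumption)."""
    ranked = sorted(frequencies, key=lambda c: (-frequencies[c], c))
    return sorted(ranked[:k])


# Hypothetical frequency counts for five seed clusters.
freqs = {1: 40, 2: 310, 3: 12, 4: 150, 5: 95}
print(default_initial_clusters(freqs, k=3))
```

Here k stands for the number of clusters requested with the CLUSTERS= option, so the selection never returns more clusters than that value.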