Specifies how the initial estimates are obtained. The default value is RANDOM. If INIT=FASTCLUS or
INIT=EMCLUS is specified, then the SEED= option must also be specified, and the data set that it
names must be the OUTSEEDS data set from PROC FASTCLUS or the OUTSTAT data set from PROC EMCLUS,
respectively.
MIN = nonnegative integer
Specifies the minimum number of observations in each primary cluster. At any iteration, if the total number of
observations summarized in a cluster is less than MIN, then the cluster becomes inactive, and the cluster is
reseeded at a more appropriate point, if one exists. The default value is 3.
NOBS = positive integer
Specifies the number of observations to be read in for each iteration. The default value is the number of
observations in the data set, provided that this number can be determined. If the number of observations in the
data set cannot be determined, the default value is 500.
SEED = libref.SAS-data-set
Specifies the data set that is used for the initial parameter estimates. This option must be used with
INIT=FASTCLUS or with INIT=EMCLUS. With INIT=FASTCLUS, the specified data set is the data set
produced by the OUTSEEDS option of PROC FASTCLUS. With INIT=EMCLUS, it is the data set produced
by the OUTSTAT option of PROC EMCLUS.
OUTSTAT = libref.SAS-data-set
Specifies an output data set. This data set has 5+D columns, where D is the number of variables. Column 1
contains the cluster number. Column 2 is the type of cluster (primary or secondary). Column 3 is the cluster
frequency. Column 4 is the estimate for the weight parameter. Column 5 is labeled _TYPE_, where
_TYPE_=MEAN or _TYPE_=VAR. Columns 6 through 5+D contain the estimates for either the mean or the
variance of each variable. Each cluster corresponds to two rows of this data set, in the following form:
Output Data Set from the OUTSTAT Option

CLUSTER_1  PRIMARY    FREQ_1  WEIGHT_1  MEAN  MEAN_1  MEAN_2  .....  MEAN_N
CLUSTER_1  PRIMARY    FREQ_1  WEIGHT_1  VAR   VAR_1   VAR_2   .....  VAR_N
CLUSTER_2  SECONDARY  FREQ_2  WEIGHT_2  MEAN  MEAN_1  MEAN_2  .....  MEAN_N
CLUSTER_2  SECONDARY  FREQ_2  WEIGHT_2  VAR   VAR_1   VAR_2   .....  VAR_N
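The two-rows-per-cluster layout can be easier to follow once the MEAN and VAR rows are paired up. The following Python fragment is a minimal sketch of that pairing, not part of SAS: the row values are made up for D = 2 variables, and the function name collect_clusters and the dictionary keys are hypothetical.

```python
# Made-up rows in the OUTSTAT column order described above, with D = 2:
# cluster number, cluster type, frequency, weight, _TYPE_, then D estimates.
rows = [
    (1, "PRIMARY", 120, 0.6, "MEAN", 0.5, 1.5),
    (1, "PRIMARY", 120, 0.6, "VAR", 0.2, 0.3),
    (2, "SECONDARY", 80, 0.4, "MEAN", 3.0, -1.0),
    (2, "SECONDARY", 80, 0.4, "VAR", 1.0, 0.5),
]


def collect_clusters(rows):
    """Merge the MEAN row and the VAR row of each cluster into one record."""
    clusters = {}
    for cluster, kind, freq, weight, row_type, *values in rows:
        entry = clusters.setdefault(
            cluster, {"kind": kind, "freq": freq, "weight": weight}
        )
        # row_type is "MEAN" or "VAR"; store the D estimates under that key.
        entry[row_type.lower()] = list(values)
    return clusters


clusters = collect_clusters(rows)
```

After the call, clusters[1]["mean"] and clusters[1]["var"] hold the D mean and variance estimates for cluster 1.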
DIST = nonnegative number
Specifies the minimum distance between initial clusters when INIT=RANDOM is used. It also specifies the
minimum distance between initial cluster means in the secondary data summarization phase. The default value is
the square root of the average of the sample variances obtained from the observations read in during the first
iteration.
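As an illustration of the default DIST value, the following Python sketch computes the square root of the average per-variable sample variance for a small batch of observations. It is not SAS code. The function name default_dist is hypothetical, and the use of the n-1 divisor for the sample variance is an assumption, since the text above does not state which divisor is used.

```python
import math
import statistics


def default_dist(observations):
    """Square root of the average of the per-variable sample variances.

    observations: a sequence of equal-length numeric tuples, one per row.
    Assumes the n-1 divisor (statistics.variance) for each variable.
    """
    columns = list(zip(*observations))  # one tuple of values per variable
    variances = [statistics.variance(column) for column in columns]
    return math.sqrt(sum(variances) / len(variances))


# Made-up observations with two variables.
batch = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
dist = default_dist(batch)
```

Here the two sample variances are 1 and 100, so the sketch returns the square root of 50.5.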
PRINT = ALL|LAST|NONE
Specifies how much output is printed. If PRINT=LAST is used, then the initial estimates and the output from
the last iteration are shown. The default is PRINT=LAST.
SECITER = positive integer
Specifies the maximum number of iterations of the k-means algorithm in the secondary data summarization
phase. The default value is 1.
OUT = libref.SAS-data-set
Specifies the name of the data set that contains the probabilities PROB_h = P(x is in cluster h), h = 1, 2, ..., k.
This option must be used with ROLE=SCORE. The resulting data set also contains the original data.
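To make the meaning of PROB_h concrete, the following Python sketch computes membership probabilities for one observation from a set of cluster weights, means, and variances. It is not the procedure's implementation. It assumes a Gaussian mixture in which the variables are independent within each cluster (one mean and one variance per variable, as stored in the OUTSTAT data set); that model, the function name membership_probabilities, and all numbers are assumptions made for illustration.

```python
import math


def membership_probabilities(x, weights, means, variances):
    """Return P(x is in cluster h) for each cluster h.

    Assumes independent Gaussian variables within each cluster.
    Works in log space so that very small densities do not underflow.
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_density = sum(
            -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_density)
    top = max(log_terms)
    unnormalized = [math.exp(t - top) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two made-up clusters on one variable, with equal weights and variances.
probs = membership_probabilities(
    x=[0.0], weights=[0.5, 0.5], means=[[-1.0], [1.0]], variances=[[1.0], [1.0]]
)
```

Because the observation lies midway between two otherwise identical clusters, each probability in this example is 0.5.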
CLEAR = nonnegative integer
Specifies the value of n for which, after every n iterations, the EMCLUS procedure deletes the observations
that remain in memory following the secondary summarization phase. The default value is 0, which
means that no observations are deleted from memory.
OUTLIERS = IGNORE|KEEP
Specifies how outlier observations are weighted when the scaled EM algorithm is implemented. If
OUTLIERS=IGNORE is specified, observations that are not in the 99th percentile of any estimated primary
cluster are weighted less. If OUTLIERS=KEEP is specified, these observations are weighted normally, as in the
standard EM algorithm.
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The EMCLUS Procedure
VAR Statement
VAR variable(s);
variable(s)
Specifies which variables are to be used. If this statement is omitted, then all variables from the
input data set will be used.
INITCLUS Statement
INITCLUS integer(s);
integer(s)
Specifies which clusters from PROC FASTCLUS or PROC EMCLUS are to be used as initial
estimates for PROC EMCLUS. For example, if the output from PROC FASTCLUS shows that
only clusters 2, 4-8, and 10 have good results, then you could specify INITCLUS
2, 4 TO 8, 10; to use only those cluster estimates as initial estimates in PROC EMCLUS. The
number of clusters specified in the INITCLUS statement should not exceed the number of clusters
specified with the CLUSTERS= option in the PROC EMCLUS statement. With INIT=EMCLUS, the
default is that all the clusters are used as initial estimates. With INIT=FASTCLUS, the default
is that the clusters with the highest frequency counts are used.
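The list notation in the example above mixes single cluster numbers with TO ranges. The following Python sketch shows which cluster numbers such a list selects. It only illustrates the notation and is not how SAS parses the statement; the function name expand_cluster_list is hypothetical.

```python
def expand_cluster_list(spec):
    """Expand a list such as "2, 4 TO 8, 10" into individual cluster numbers."""
    clusters = []
    for item in spec.split(","):
        tokens = item.split()
        if len(tokens) == 1:
            # A single cluster number, for example "2".
            clusters.append(int(tokens[0]))
        elif len(tokens) == 3 and tokens[1].upper() == "TO":
            # An inclusive range, for example "4 TO 8".
            start, end = int(tokens[0]), int(tokens[2])
            clusters.extend(range(start, end + 1))
        else:
            raise ValueError(f"unrecognized item: {item!r}")
    return clusters


selected = expand_cluster_list("2, 4 TO 8, 10")
print(selected)  # prints [2, 4, 5, 6, 7, 8, 10]
```

The example list selects seven clusters, so it would need CLUSTERS= to be at least 7 to satisfy the limit described above.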