then deletes the summarized observations from memory. For the specified value p=p0, all observations
falling within the region containing 100*p0% of the volume of the MVN(u,V) distribution will be
summarized. At the end of the primary summarization phase, the primary clusters are checked to see if
any of them contain fewer than MIN observations. If a cluster does, then the cluster is declared to be
inactive. An inactive cluster is not used in updating the parameter estimates in the EM algorithm. An
inactive cluster remains inactive until one of the following two conditions occurs:
1. The inactive cluster gets reseeded at a secondary cluster containing at least MIN observations.
2. The inactive cluster gets reseeded at a point determined by an active cluster having at least one
variable with standard deviation greater than INITSTD.
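The inactivity rule and the two reseeding conditions above can be sketched in Python as follows. This is a minimal illustration, not the PROC EMCLUS implementation: the function names, the list layouts, and the sample values used for MIN and INITSTD are all hypothetical.

```python
def inactive_clusters(counts, min_obs):
    """Return indices of primary clusters holding fewer than min_obs observations."""
    return [i for i, n in enumerate(counts) if n < min_obs]


def reseed_candidates(secondary_counts, active_stds, min_obs, initstd):
    """List the places where an inactive cluster could be reseeded.

    secondary_counts: observation count of each secondary cluster
    active_stds: per-variable standard deviations of each active primary cluster
    """
    # Condition 1: a secondary cluster containing at least min_obs observations.
    from_secondary = [i for i, n in enumerate(secondary_counts) if n >= min_obs]
    # Condition 2: an active cluster with at least one variable whose
    # standard deviation exceeds initstd.
    from_active = [i for i, stds in enumerate(active_stds)
                   if any(s > initstd for s in stds)]
    return from_secondary, from_active


# Hypothetical values chosen only for illustration.
counts = [120, 3, 45]
inactive = inactive_clusters(counts, min_obs=5)
candidates = reseed_candidates([2, 9], [[0.4, 0.3], [1.8, 0.2]],
                               min_obs=5, initstd=1.0)
```

Here the second primary cluster (index 1) is inactive because it holds only 3 observations, and it could be reseeded either at the second secondary cluster or at a point determined by the second active cluster.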
The secondary summarization phase first uses the k-means clustering algorithm to identify secondary
clusters, and then uses a hierarchical agglomerative clustering algorithm to combine similar secondary
clusters. At the end of the k-means algorithm, each of the SECCLUS clusters is tested to see whether its
sample standard deviation for each variable is less than or equal to SECSTD. If so, then the cluster
becomes a secondary cluster. Setting SECCLUS=0 will cause PROC EMCLUS not to perform a
secondary summarization phase, which is not recommended. The reason for this is that the secondary
summarization phase acts as a backup method for finding the primary clusters when the initial values are
poor. If the data set contains many outliers, then setting SECCLUS to be larger than the default value
will increase the chances of finding clusters. A secondary cluster is disjoint from all other secondary
clusters and from all primary clusters.
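The SECSTD test described above can be sketched in Python as follows. This is an illustration under assumptions rather than the procedure's own code: the function name is hypothetical, and rejecting subsets with fewer than two observations is a choice made here because a sample standard deviation is undefined for them.

```python
from statistics import stdev


def is_secondary_cluster(points, secstd):
    """True if every variable's sample standard deviation is <= secstd.

    points: list of observations, each a sequence of numeric variables.
    """
    if len(points) < 2:
        # A sample standard deviation needs at least two observations;
        # treating smaller subsets as failing the test is an assumption.
        return False
    columns = zip(*points)  # regroup values by variable
    return all(stdev(col) <= secstd for col in columns)


# Hypothetical k-means clusters with two variables each.
tight = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9)]
loose = [(1.0, 2.0), (5.0, 2.1), (9.0, 1.9)]
```

With SECSTD set to 0.5, the `tight` subset passes the test on both variables, while `loose` fails because its first variable is too spread out.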
Although many of the options in PROC EMCLUS are not required to be specified, it is best to specify
them if the user has some knowledge of the input data set. Among the most important options is NOBS.
NOBS specifies the number of observations that are read in during each iteration, and consequently,
NOBS determines the number of iterations. For example, if the input data set contains 1,000
observations and NOBS is set to 100, then there will be 10 iterations. If NOBS is not specified, then it is
assumed that you want to run the standard EM algorithm. Another important option is INITSTD, the
maximum initial standard deviation of each variable in each initial primary cluster. If INITSTD is chosen
too small, then the EM algorithm may have trouble finding the primary clusters.
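The relationship between NOBS and the number of iterations can be checked with a short Python calculation. The function name is hypothetical, and rounding up when NOBS does not divide the data set size evenly is an assumption, since the text only covers the evenly divisible case.

```python
import math


def iterations_for(total_obs, nobs):
    """Number of passes needed to read total_obs observations, nobs at a time."""
    if nobs <= 0:
        raise ValueError("nobs must be a positive integer")
    # Rounding up for a final partial batch is an assumption; the text only
    # covers the case where nobs divides the data set size evenly.
    return math.ceil(total_obs / nobs)


print(iterations_for(1000, 100))  # 10, matching the example in the text
```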
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The EMCLUS Procedure
Procedure Syntax
PROC EMCLUS <option(s)>;
VAR variable(s);
INITCLUS integer(s);
PROC EMCLUS Statement
Invoke the EMCLUS procedure.
PROC EMCLUS <option(s)>;
Options
DATA = |IN = SAS-data-set
Specifies the data set to be analyzed. All the variables in this data set must be numerical. Observations with
missing data are ignored.
ROLE = TRAIN|SCORE
Specifies the role of the DATA= data set. Setting ROLE=TRAIN will cause the EMCLUS procedure to cluster
the data set. Setting ROLE=SCORE will cause the EMCLUS procedure to compute the probabilities that each
observation in the DATA= data set is in each primary cluster. If ROLE=SCORE is specified, then the SEED=
option must also be used to give the name of the OUTSTAT data set. The default value for ROLE is
TRAIN.
CLUSTERS = positive integer
Specifies the number of primary clusters.
SECCLUS = nonnegative integer
Specifies the number of secondary clusters that the algorithm will search for during the secondary data
summarization phase. If SECCLUS=0, then there will not be a secondary data summarization phase. The default
value of SECCLUS is twice the number of primary clusters.
EPS = positive number
Specifies the stopping tolerance. The default value is 10^-6.
SECSTD = positive number
Specifies the maximum allowable sample standard deviation of any variable in a summarized subset of
observations to be deemed a secondary cluster. The default value is the smallest positive sample standard
deviation obtained from the observations read in during the first iteration.
P = number
Defines a radius around each cluster mean, such that any point which lies inside the radius is summarized in that
cluster. The value of P must be between 0 and 1. A value close to 1 defines a larger radius than a value close to 0.
The default value is 0.5.
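One way to read the effect of P is as a probability contour of a normal distribution centered at the cluster mean. The Python sketch below illustrates that reading for the special case of two uncorrelated variables; it is not the procedure's documented computation. The function name, the diagonal-covariance simplification, and the strict bounds check on p are assumptions made for this example.

```python
import math


def inside_summary_region(x, mean, std, p):
    """Check whether a 2-D point lies in the ellipse holding probability p.

    Assumes two uncorrelated variables, so the squared Mahalanobis distance
    is a sum of squared z-scores and follows a chi-square distribution with
    2 degrees of freedom, whose p-quantile is -2 * ln(1 - p).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    d2 = sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mean, std))
    return d2 <= -2.0 * math.log(1.0 - p)


# Hypothetical cluster: mean at the origin, unit standard deviations.
near = inside_summary_region((1.0, 0.0), (0.0, 0.0), (1.0, 1.0), p=0.5)
far_small_p = inside_summary_region((2.0, 0.0), (0.0, 0.0), (1.0, 1.0), p=0.5)
far_large_p = inside_summary_region((2.0, 0.0), (0.0, 0.0), (1.0, 1.0), p=0.9)
```

Under these assumptions, the point two standard deviations from the mean falls outside the region at p=0.5 but inside it at p=0.9, which matches the statement that a value of P close to 1 defines a larger radius.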
MAXITER = positive integer
Specifies the maximum number of iterations of the EMCLUS procedure. The default value is the largest machine
integer.
INITSTD = positive number
Specifies the maximum standard deviation of the initial clusters. The default value is determined from a sample
of data read in during the first iteration.
ITER = positive integer
Specifies the number of iterations in the EM algorithm to update the model parameters. The default value is 50.
INIT = RANDOM|FASTCLUS|EMCLUS