then deletes the summarized observations from memory. For the specified value p=p₀, all observations falling within the region containing 100*p₀% of the volume of the MNV(u,V) distribution will be summarized. At the end of the primary summarization phase, the primary clusters are checked to see if any of them contain fewer than MIN observations. If a cluster does, then the cluster is declared to be inactive. An inactive cluster is not used in updating the parameter estimates in the EM algorithm. An inactive cluster remains inactive until one of the following two conditions occurs:

1. The inactive cluster gets reseeded at a secondary cluster containing at least MIN observations.
2. The inactive cluster gets reseeded at a point determined by an active cluster having at least one variable with standard deviation greater than INITSTD.
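The inactive-cluster rule and the two reseeding conditions can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the procedure's internal code. The function names and the list-based representation of cluster counts and standard deviations are assumptions made for the example.

```python
def inactive_clusters(counts, min_obs):
    """Indices of primary clusters holding fewer than min_obs (MIN) observations."""
    return [k for k, n in enumerate(counts) if n < min_obs]


def can_reseed(secondary_counts, active_stds, min_obs, initstd):
    """True if either reseeding condition from the text is met.

    secondary_counts: observation count of each secondary cluster (hypothetical input)
    active_stds: per-variable standard deviations of each active cluster (hypothetical input)
    """
    # Condition 1: some secondary cluster contains at least MIN observations.
    at_secondary = any(n >= min_obs for n in secondary_counts)
    # Condition 2: some active cluster has a variable with std greater than INITSTD.
    at_active = any(s > initstd for stds in active_stds for s in stds)
    return at_secondary or at_active


counts = [120, 3, 47]  # made-up observation counts for three primary clusters
print(inactive_clusters(counts, min_obs=5))
```

In this made-up example only the second cluster falls below MIN, so it alone would be left out of the parameter updates until one of the reseeding conditions holds.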

The secondary summarization phase first uses the k-means clustering algorithm to identify secondary clusters, and then uses a hierarchical agglomerative clustering algorithm to combine similar secondary clusters. At the end of the k-means algorithm, each of the SECCLUS clusters is tested to see whether its sample standard deviation for each variable is less than or equal to SECSTD. If so, the cluster becomes a secondary cluster. Setting SECCLUS=0 causes PROC EMCLUS not to perform a secondary summarization phase, which is **not** recommended, because the secondary summarization phase acts as a backup method for finding the primary clusters when the initial values are poor. If the data set contains many outliers, then setting SECCLUS larger than its default value increases the chances of finding clusters. A secondary cluster is disjoint from all other secondary clusters and from all primary clusters.
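The SECSTD test applied to each k-means cluster can be illustrated as follows. This is a minimal sketch, assuming each candidate cluster is a list of equal-length tuples and that a candidate with fewer than two observations is rejected. The source does not specify that edge case, and the function name is hypothetical.

```python
from statistics import stdev


def is_secondary_cluster(points, secstd):
    """True if every variable's sample standard deviation is <= secstd (SECSTD)."""
    if len(points) < 2:
        # Assumption: a sample standard deviation needs at least two observations.
        return False
    columns = zip(*points)  # regroup the observations variable by variable
    return all(stdev(col) <= secstd for col in columns)


tight = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9)]   # made-up, tightly grouped points
spread = [(1.0, 2.0), (3.0, 2.1), (5.0, 1.9)]  # first variable is widely spread
print(is_secondary_cluster(tight, secstd=0.2))
print(is_secondary_cluster(spread, secstd=0.2))
```

A single variable exceeding SECSTD is enough to reject the candidate, which is why a small SECSTD keeps secondary clusters compact.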

Although many PROC EMCLUS options do not have to be specified, it is best to specify them if you have some knowledge of the input data set. Among the most important options is NOBS. NOBS specifies the number of observations that are read in during each iteration and, consequently, determines the number of iterations. For example, if the input data set contains 1,000 observations and NOBS is set to 100, then there will be 10 iterations. If NOBS is not specified, then it is assumed that you want to run the standard EM algorithm. Another important option is INITSTD, the maximum initial standard deviation of each variable in each initial primary cluster. If INITSTD is chosen too small, then the EM algorithm may have trouble finding the primary clusters.
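The relationship between NOBS and the number of iterations is simple arithmetic, sketched below in Python. The helper names are hypothetical. Treating a final partial batch as one more iteration is an assumption, since the text only gives an example where NOBS divides the data set size evenly.

```python
import math


def iterations_needed(total_obs, nobs):
    """Number of passes if nobs observations are read per iteration.

    Assumption: a final, smaller batch still counts as one iteration.
    """
    return math.ceil(total_obs / nobs)


def batches(observations, nobs):
    """Yield successive chunks of at most nobs observations."""
    for start in range(0, len(observations), nobs):
        yield observations[start:start + nobs]


print(iterations_needed(1000, 100))  # 10, matching the example in the text
```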

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.

*The EMCLUS Procedure*

**Procedure Syntax**

**PROC EMCLUS** <*option(s)*>;

**VAR** *variable(s)*;

**INITCLUS** *integer(s)*;

**PROC EMCLUS Statement**

**Invoke the EMCLUS procedure.**

**PROC EMCLUS** <*option(s)*>;

**Options**

**DATA=** | **IN=** *SAS-data-set*

Specifies the data set to be analyzed. All the variables in this data set must be numeric. Observations with missing data are ignored.

**ROLE=** *TRAIN|SCORE*

Specifies the role of the DATA= data set. Setting ROLE=TRAIN causes the EMCLUS procedure to cluster the data set. Setting ROLE=SCORE causes the EMCLUS procedure to compute the probabilities that each observation in the DATA= data set is in each primary cluster. If ROLE=SCORE is specified, then the SEED= option must also be used to name the OUTSTAT data set. The default value for ROLE is TRAIN.
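As a rough illustration of what scoring produces, the sketch below computes membership probabilities for one observation under a Gaussian mixture. It is not the procedure's implementation. It assumes a single variable, known mixing weights, and the standard mixture-model posterior formula, none of which is spelled out in the text above. All names and numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def membership_probabilities(x, weights, means, stds):
    """Posterior probability that x belongs to each cluster (entries sum to 1)."""
    weighted = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [d / total for d in weighted]


# Two made-up clusters centered at -1 and 1 with equal weight and unit spread.
probs = membership_probabilities(-1.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(probs)
```

An observation sitting on the first cluster's mean receives most of its probability from that cluster, while a point midway between the two means is split evenly.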

**CLUSTERS=** *positive integer*

Specifies the number of primary clusters.

**SECCLUS=** *nonnegative integer*

Specifies the number of secondary clusters that the algorithm will search for during the secondary data summarization phase. If SECCLUS=0, then there will not be a secondary data summarization phase. The default value of SECCLUS is twice the number of primary clusters.

**EPS=** *positive number*

Specifies the stopping tolerance. The default value is 10⁻⁶.

**SECSTD=** *positive number*

Specifies the maximum allowable sample standard deviation of any variable in a summarized subset of observations for that subset to be deemed a secondary cluster. The default value is the smallest positive sample standard deviation obtained from the observations read in during the first iteration.

**P=** *number*

Defines a radius around each cluster mean, such that any point that lies inside the radius is summarized in that cluster. The value of P must be between 0 and 1. A value close to 1 defines a larger radius than a value close to 0. The default value is 0.5.
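One way to picture how P sets the radius is the two-variable sketch below. It rests on an assumption: the region containing 100*P% of the distribution is read as the ellipse of constant Mahalanobis distance that holds probability P under a bivariate normal. For two variables the squared-distance threshold has the closed form -2 ln(1 - P). The function names are hypothetical, and the procedure's actual computation may differ.

```python
import math


def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance for a 2x2 covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    det = a * c - b * b
    return (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det


def radius_sq(p):
    """Chi-square quantile with 2 degrees of freedom, valid for 0 < p < 1."""
    return -2.0 * math.log(1.0 - p)


def is_summarized(point, mean, cov, p=0.5):
    """True if the point lies inside the region holding probability p."""
    return mahalanobis_sq_2d(point, mean, cov) <= radius_sq(p)


mean, cov = (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]  # made-up cluster parameters
print(is_summarized((2.0, 0.0), mean, cov, p=0.5))
print(is_summarized((2.0, 0.0), mean, cov, p=0.9))
```

Raising P from 0.5 to 0.9 enlarges the threshold, so a point rejected at the default value can be summarized at the larger one, consistent with the statement that values close to 1 define a larger radius.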

**MAXITER=** *positive integer*

Specifies the maximum number of iterations of the EMCLUS procedure. The default value is the largest machine integer.

**INITSTD=** *positive number*

Specifies the maximum standard deviation of the initial clusters. The default value is determined from a sample of data read in during the first iteration.

**ITER=** *positive integer*

Specifies the number of iterations in the EM algorithm to update the model parameters. The default value is 50.

**INIT=** *RANDOM|FASTCLUS|EMCLUS*