The arboretum procedure

Date: 30.04.2018

Specifies how the initial estimates are obtained. The default value is RANDOM. If INIT=FASTCLUS or INIT=EMCLUS is specified, then the SEED= option must also be specified, and the data set must be the OUTSEEDS data set from PROC FASTCLUS or the OUTSTAT data set from PROC EMCLUS, respectively.

MIN = nonnegative integer

Specifies the minimum number of observations in each primary cluster. At any iteration, if the total number of observations summarized in a cluster is less than MIN, then the cluster becomes inactive and is reseeded at a more appropriate point, if one exists. The default value is 3.

NOBS = positive integer

Specifies the number of observations to be read in for each iteration. The default value is the number of observations in the data set, provided that this number can be determined. If the number of observations in the data set cannot be determined, the default value is 500.

SEED = libref.SAS-data-set

Specifies the data set that is used for the initial parameter estimates. This option must be used with INIT=FASTCLUS or with INIT=EMCLUS. With INIT=FASTCLUS, the specified data set is the SAS data set produced by the OUTSEEDS option of PROC FASTCLUS. With INIT=EMCLUS, it is the SAS data set produced by the OUTSTAT option of PROC EMCLUS.

OUTSTAT = libref.SAS-data-set

Specifies an output data set. This data set has 5+D columns, where D is the number of variables. Column 1 contains the cluster number. Column 2 is the type of cluster (primary or secondary). Column 3 is the cluster frequency. Column 4 is the estimate for the weight parameter. Column 5 is labeled _TYPE_, where _TYPE_=MEAN or _TYPE_=VAR. Columns 6 through 5+D contain either the estimates for the mean or the variance of each variable. Each cluster corresponds to two rows of this data set, in the following form:

Output Data Set from the OUTSTAT Option

CLUSTER_1   PRIMARY     FREQ_1   WEIGHT_1   MEAN   MEAN_1   MEAN_2   .....   MEAN_N
CLUSTER_1   PRIMARY     FREQ_1   WEIGHT_1   VAR    VAR_1    VAR_2    .....   VAR_N
CLUSTER_2   SECONDARY   FREQ_2   WEIGHT_2   MEAN   MEAN_1   MEAN_2   .....   MEAN_N
CLUSTER_2   SECONDARY   FREQ_2   WEIGHT_2   VAR    VAR_1    VAR_2    .....   VAR_N
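
The row layout described for the OUTSTAT data set can be sketched in Python. This is only an illustration of the 5+D column structure, not SAS code and not the procedure's implementation; the function name outstat_rows and the sample numbers are hypothetical.

```python
from typing import List, Sequence, Union

Row = List[Union[str, int, float]]


def outstat_rows(cluster: int, kind: str, freq: int, weight: float,
                 means: Sequence[float], variances: Sequence[float]) -> List[Row]:
    """Return the two rows (MEAN, then VAR) that describe one cluster."""
    if len(means) != len(variances):
        raise ValueError("means and variances need one entry per variable")
    # Columns 1 to 4: cluster number, cluster type, frequency, weight.
    head: Row = [cluster, kind, freq, weight]
    # Column 5 is _TYPE_; columns 6 through 5+D hold the per-variable estimates.
    return [head + ["MEAN", *means], head + ["VAR", *variances]]


# Hypothetical estimates for D = 3 variables and two clusters.
rows = outstat_rows(1, "PRIMARY", 40, 0.4, [1.0, 2.0, 3.0], [0.5, 0.25, 0.1])
rows += outstat_rows(2, "SECONDARY", 60, 0.6, [4.0, 5.0, 6.0], [1.0, 1.5, 2.0])

for row in rows:
    print(row)
```

With D = 3, every row has 8 columns and each cluster contributes exactly two rows.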

DIST = nonnegative number

Specifies the minimum distance between initial clusters when INIT=RANDOM is used. It also specifies the minimum distance between initial cluster means in the secondary data summarization phase. The default value is the square root of the average of the sample variances obtained from the observations read in during the first iteration.
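
The default DIST value described above can be reproduced by hand. The Python sketch below assumes that "sample variance" means the usual n-1 denominator, computed separately for each variable; the source does not state this, and the function name default_dist and the data are hypothetical.

```python
import math
import statistics
from typing import Sequence


def default_dist(observations: Sequence[Sequence[float]]) -> float:
    """Square root of the average per-variable sample variance."""
    # Transpose rows (observations) into columns (variables).
    columns = list(zip(*observations))
    # Assumption: sample variance with the n-1 denominator.
    variances = [statistics.variance(col) for col in columns]
    return math.sqrt(statistics.mean(variances))


# Hypothetical observations read in during the first iteration (2 variables).
first_batch = [(1.0, 10.0), (2.0, 14.0), (3.0, 18.0)]
dist = default_dist(first_batch)
print(dist)
```

Here the two sample variances are 1 and 16, so the result is the square root of 8.5.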

PRINT = ALL|LAST|NONE

Specifies how much output is printed. If PRINT=LAST is used, then the initial estimates and the output from the last iteration are shown. The default is PRINT=LAST.

SECITER = positive integer

Specifies the maximum number of iterations of the k-means algorithm in the secondary data summarization phase. The default value is 1.

OUT = libref.SAS-data-set

Specifies the name of the data set that contains the probabilities PROB_h = P(x is in cluster h), h = 1, 2, ..., k. This option must be used with ROLE=SCORE. The resulting data set also contains the original data.
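
As an illustration of what the PROB_h values represent, the Python sketch below computes cluster membership probabilities for one observation. It assumes a mixture of normal densities with independent variables, one mean and one variance per variable, and one weight per cluster, as the OUTSTAT layout suggests. The source does not give the formula used by PROC EMCLUS, so this is an assumption; the function name cluster_probabilities and the parameter values are hypothetical.

```python
import math
from typing import List, Sequence


def cluster_probabilities(x: Sequence[float],
                          weights: Sequence[float],
                          means: Sequence[Sequence[float]],
                          variances: Sequence[Sequence[float]]) -> List[float]:
    """Return P(x is in cluster h) for each cluster h."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of independent normals, one term per variable.
        log_density = sum(
            -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_density)
    # Subtract the largest term before exponentiating, for numerical stability.
    top = max(log_terms)
    unnormalized = [math.exp(t - top) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical two-cluster model with two variables.
probs = cluster_probabilities(
    x=[0.0, 0.0],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [3.0, 3.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(probs)
```

The probabilities sum to 1 across clusters, and an observation that sits on a cluster mean is assigned mostly to that cluster.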

CLEAR = nonnegative integer

Specifies the value n such that, after every n iterations, the EMCLUS procedure deletes the observations that remain in memory following the secondary data summarization phase. The default value is 0, which means that no observations are deleted from memory.

OUTLIERS = IGNORE|KEEP

Specifies how outlier observations are weighted when the scaled EM algorithm is implemented. If OUTLIERS=IGNORE is specified, observations that are not in the 99th percentile of any estimated primary cluster are weighted less. If OUTLIERS=KEEP is specified, these observations are weighted normally, as in the standard EM algorithm.

The EMCLUS Procedure

VAR Statement

VAR variable(s);

variable(s)

Specifies which variables are to be used. If this statement is omitted, then all variables from the input data set are used.

INITCLUS Statement

INITCLUS integer(s);

integer(s)

Specifies which clusters from PROC FASTCLUS or PROC EMCLUS are to be used as initial estimates for PROC EMCLUS. For example, if the output from PROC FASTCLUS shows that only clusters 2, 4 through 8, and 10 have good results, then you could use the statement INITCLUS 2, 4 TO 8, 10; to use only those cluster estimates as initial estimates in PROC EMCLUS. The number of clusters specified in the INITCLUS statement should not exceed the number of clusters specified with the CLUSTERS= option in the PROC EMCLUS statement. With INIT=EMCLUS, the default is that all the clusters are used as initial estimates. With INIT=FASTCLUS, the default is that the clusters with the highest frequency counts are used.
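
To make the example list concrete, the Python sketch below expands 2, 4 TO 8, 10 into the individual cluster numbers it selects. SAS parses the INITCLUS statement itself; this helper, with the hypothetical name expand_cluster_list, only illustrates which clusters the list covers and how many there are, which is the count to compare against CLUSTERS=.

```python
from typing import List


def expand_cluster_list(spec: str) -> List[int]:
    """Expand a list such as '2, 4 TO 8, 10' into cluster numbers."""
    clusters: List[int] = []
    for item in spec.split(","):
        parts = item.split()
        if len(parts) == 1:
            # A single cluster number.
            clusters.append(int(parts[0]))
        elif len(parts) == 3 and parts[1].upper() == "TO":
            # An inclusive range written as 'start TO stop'.
            start, stop = int(parts[0]), int(parts[2])
            clusters.extend(range(start, stop + 1))
        else:
            raise ValueError(f"unrecognized item: {item!r}")
    return clusters


selected = expand_cluster_list("2, 4 TO 8, 10")
print(selected, len(selected))
```

The example list selects seven clusters, so it is consistent with any CLUSTERS= value of 7 or more.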
