The arboretum procedure




The PASSES= option specifies an upper bound for the number of passes that are made through the data.

passes=20

The OUTVARS= option creates a data set containing splitting information.

outvars=vout;

The VAR statement specifies the numeric and categorical inputs (independent variables).

var amount income homeval frequent recency age
    domestic apparel marital ntitle gender telind origin
    job statecod numcars edlevel;

The TARGET statement defines the target (response) variable.

target purchase;

title 'DMSPLIT: Binary Target';

run;

PROC PRINT creates a partial report of the OUTVARS= data set.

proc print data=vout(obs=20);

title2 'OUTVARS= Summary Data';

run;

The PROC SPLIT statement invokes the procedure. The INDMSPLIT option tells the procedure to read the tree created by PROC DMSPLIT. The DMSPLIT tree information is stored in the DMDB catalog.

title 'Import and Save Tree from DMSPLT';

proc split dmdbcat=catexa1 indmsplit

The OUTMATRIX= option names the data set that contains tree information.

outmatrix=trtree

The OUTLEAF= option names the data set that contains statistics for each leaf node.

outleaf=leafdata

The OUTTREE= option specifies the output data set that describes the tree. The OUTTREE= data set can be used as input in subsequent executions of PROC SPLIT.

outtree=savetree;

run;

PROC PRINT creates a report of the training statistics.

proc print data=trtree label;

title2 'Training Statistics';

run;

PROC PRINT creates a partial report of the leaf statistics for the training data.

proc print data=leafdata(obs=10) label;

title2 'Leaf Statistics';

run;

The DATA step creates a fictitious score data set.

data testexa1(drop=ran);

set sampsio.dmexa1;

ran=ranuni(3333);

if ran lt 0.08;

title 'Create Fictitious Score Data Set';

run;

The INTREE= option reads the tree that was saved from the previous PROC SPLIT step.

proc split intree=savetree;

The SCORE statement scores the DATA= data set. The OUTFIT= option names the output data set that contains fit statistics. The OUT= option names the output data set that contains tree statistics for the scored data set. Typically you would score a data set that is truly mutually exclusive of the training data and that may or may not contain the target values. Here, the WORK.TESTEXA1 data set is only a random subset of the SAMPSIO.DMEXA1 training data set.

score data=testexa1 nodmdb
      outfit=tfit out=tout;

title 'Input Tree and Score Test Data';
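As a hedged illustration of the remark above about scoring a mutually exclusive data set, the following DATA step sketch writes each observation of SAMPSIO.DMEXA1 to exactly one of two data sets, so the score set shares no rows with the training set. The data set names TRAINEXA1 and SCOREEXA1 are hypothetical and do not appear in the original example. For the split to matter, the tree and the DMDB it relies on would have to be rebuilt from TRAINEXA1 rather than from the full SAMPSIO.DMEXA1 data set.

```sas
/* Sketch only: TRAINEXA1 and SCOREEXA1 are hypothetical names. */
data trainexa1 scoreexa1;
   set sampsio.dmexa1;
   /* One uniform draw per observation decides where the row goes, */
   /* so no observation can land in both data sets.                */
   if ranuni(3333) lt 0.08 then output scoreexa1;
   else output trainexa1;
run;
```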

PROC PRINT creates a report of the fit statistics for the scored data set.

proc print data=tfit label;

title2 'Fit Statistics for the Scored Data Set';

run;

PROC FREQ creates a misclassification table for the scored data set. The F_PURCHA variable is the actual target value for each customer, and the I_PURCHA variable is the target value into which the customer is classified.

proc freq data=tout;

tables f_purcha*i_purcha;

title2 'Scored Data';

title3 'Misclassification Table';

run;
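The misclassification table can be reduced to a single overall error rate. The DATA step below is a minimal sketch of that calculation. It assumes that F_PURCHA and I_PURCHA in the TOUT data set hold comparable level values for every observation, and that no scored observation has a missing target. Neither assumption is stated in the original example. The variables NWRONG and RATE are hypothetical names.

```sas
/* Sketch only: NWRONG and RATE are hypothetical names. */
data _null_;
   set tout end=last;
   /* The comparison evaluates to 1 when actual and predicted differ. */
   nwrong + (f_purcha ne i_purcha);
   if last then do;
      /* _N_ equals the number of observations read from TOUT. */
      rate = nwrong / _n_;
      put 'Overall misclassification rate: ' rate percent8.2;
   end;
run;
```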

PROC PRINT creates a partial report of selected variables from the OUT= score information data set.

proc print data=tout(obs=10) label;

var _node_ a_ a_yes a_no d_purcha f_purcha
    i_purcha p_puryes p_purno p_pur r_puryes
    r_purno r_pur;

title2 'Score Summary Data';

run;

The EMCLUS Procedure


Overview

Procedure Syntax

PROC EMCLUS Statement

VAR Statement

INITCLUS Statement

Output from PROC EMCLUS

Examples

Example 1: Syntax for PROC FASTCLUS

Example 2: Use of the EMCLUS Procedure

The EMCLUS Procedure

Overview

The EMCLUS procedure uses a scalable version of the Expectation-Maximization (EM) algorithm to cluster a data set. You can run either the standard EM algorithm, in which the parameters are estimated using the entire data set, or the scaled EM algorithm, in which a portion of the data set is used to estimate the parameters at each iteration of the procedure. The standard EM algorithm is run by not specifying a value for the NOBS option. It can be run provided that the entire data set fits in memory, so that the total number of observations can be determined. The scaled EM algorithm is run by specifying a value for the NOBS option that is less than the total number of observations in the data set. When the scaled EM algorithm is used, it is important that the input data set is randomized beforehand.
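One common way to randomize a data set before running the scaled EM algorithm is to attach a random key to each observation and sort by that key. The sketch below uses only a DATA step and PROC SORT. The data set names RAWDATA and SHUFFLED, the variable _ORDER, and the seed are hypothetical and do not come from the original text.

```sas
/* Sketch only: RAWDATA, SHUFFLED, and _ORDER are hypothetical names. */
data shuffled;
   set rawdata;
   /* Uniform random key, one per observation. */
   _order = ranuni(12345);
run;

/* Sorting by the random key puts the observations in random order; */
/* the key is dropped from the output data set afterwards.          */
proc sort data=shuffled out=shuffled(drop=_order);
   by _order;
run;
```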

The EMCLUS procedure identifies primary clusters, which are the densest regions of data points, and also identifies secondary clusters, which are typically smaller dense clusters. Each primary cluster is modeled with a weighted n-dimensional multivariate normal distribution, where n is the number of variables in the data set. Thus each primary cluster is modeled with the function w*f(x|u,V), where f(x|u,V) ~ MVN(u,V) and V is a diagonal matrix. There are four major parts in the EMCLUS procedure:

1. Obtain and possibly refine the initial parameter estimates.

2. Apply the EM algorithm to update parameter values.

3. Summarize data in the primary summarization phase.

4. Summarize data in the secondary summarization phase.
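For reference, the primary-cluster model w*f(x|u,V) described before the list can be written out explicitly. Because V is diagonal, the multivariate normal density factors into a product of n univariate normal densities. The cluster index k and the symbol v_{kj} for the j-th diagonal entry of V are notation added here for illustration and do not appear in the original text.

```latex
% k indexes the primary clusters (index added for illustration);
% v_{kj} is the j-th diagonal entry of V_k.
w_k \, f(x \mid u_k, V_k)
  = w_k \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi v_{kj}}}
    \exp\!\left( -\frac{(x_j - u_{kj})^2}{2 v_{kj}} \right),
\qquad V_k = \operatorname{diag}(v_{k1}, \dots, v_{kn})
```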

The effectiveness of the EMCLUS procedure depends on the initial parameter estimates. Good initial parameter estimates generally lead to faster convergence and better final estimates. The initial parameter estimates can be obtained randomly, from PROC FASTCLUS, or from PROC EMCLUS. See Example 1 for the PROC FASTCLUS syntax. PROC FASTCLUS sometimes returns poor results (clusters with low frequency counts). In this case, the poor clusters can be ignored by specifying appropriate values for INITCLUS. PROC FASTCLUS can also return clusters that are actually groups of clusters. Such clusters can be recognized by a large frequency count combined with a large root-mean-square standard deviation. These clusters should not be ignored. Instead, specify a value of INITSTD that is smaller than the root-mean-square standard deviations of those clusters (see Example 2). When PROC FASTCLUS returns poor results, it may help to rerun PROC FASTCLUS with a larger value of MAXCLUSTERS and then choose the best clusters as the initial values for PROC EMCLUS. Initial estimates obtained from PROC EMCLUS can be used to refine the primary clusters or to obtain better primary clusters.
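The frequency counts and root-mean-square standard deviations mentioned above can be inspected directly in the PROC FASTCLUS output. The following is a minimal sketch, not the syntax from Example 1. The data set RAWDATA, the variables X1-X3, and the output name FCMEANS are hypothetical. The MEAN= data set written by PROC FASTCLUS includes the variables CLUSTER, _FREQ_, and _RMSSTD_.

```sas
/* Sketch only: RAWDATA, X1-X3, and FCMEANS are hypothetical names. */
proc fastclus data=rawdata maxclusters=10 mean=fcmeans noprint;
   var x1 x2 x3;
run;

/* Order the clusters so that the largest ones appear first. */
proc sort data=fcmeans;
   by descending _freq_;
run;

/* Low _FREQ_ suggests a poor cluster; high _FREQ_ together with */
/* high _RMSSTD_ suggests a group of clusters.                   */
proc print data=fcmeans;
   var cluster _freq_ _rmsstd_;
run;
```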

The EM algorithm is used to find the primary clusters and to update the model parameters. The EM algorithm terminates when two successive log-likelihood values differ in relative and absolute magnitude by less than a specified amount, or when ITER iterations have been reached.
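For orientation, the textbook EM updates for a mixture of K normal components with diagonal covariance matrices are sketched below. This is the standard form of the algorithm, not necessarily the exact computation in PROC EMCLUS, whose scaled variant works on summarized data. The indices i and k, the responsibilities r_{ik}, and the tolerances epsilon_abs and epsilon_rel are notation introduced here. How the procedure combines the relative and absolute tests is an assumption.

```latex
\begin{aligned}
% E-step: responsibility of cluster k for observation x_i
r_{ik} &= \frac{w_k \, f(x_i \mid u_k, V_k)}
               {\sum_{l=1}^{K} w_l \, f(x_i \mid u_l, V_l)} \\[4pt]
% M-step: weights, means, and diagonal variances
w_k &= \frac{1}{N} \sum_{i=1}^{N} r_{ik}, \qquad
u_{kj} = \frac{\sum_{i} r_{ik} \, x_{ij}}{\sum_{i} r_{ik}}, \qquad
v_{kj} = \frac{\sum_{i} r_{ik} \, (x_{ij} - u_{kj})^2}{\sum_{i} r_{ik}} \\[4pt]
% Log-likelihood after iteration t
L^{(t)} &= \sum_{i=1}^{N} \log \sum_{k=1}^{K} w_k \, f(x_i \mid u_k, V_k) \\[4pt]
% Stop when both differences are small, or when t reaches ITER
\bigl| L^{(t)} - L^{(t-1)} \bigr| &< \varepsilon_{\mathrm{abs}}
\quad \text{and} \quad
\frac{\bigl| L^{(t)} - L^{(t-1)} \bigr|}{\bigl| L^{(t-1)} \bigr|} < \varepsilon_{\mathrm{rel}}
\end{aligned}
```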

The primary summarization phase summarizes observations near each of the primary cluster means and
