The arboretum procedure



The PASSES= option specifies an upper bound for the number of passes that are made through the data.



The OUTVARS= option creates a data set containing splitting information.



The VAR statement specifies the numeric and categorical inputs (independent variables).



  var amount income homeval frequent recency age

      domestic apparel marital ntitle gender telind origin

      job statecod numcars edlevel;


The TARGET statement defines the target (response) variable. 

   target purchase;

   title 'DMSPLIT: Binary Target';



PROC PRINT creates a partial report of the OUTVARS= data set.

proc print data=vout(obs=20);

  title2 'OUTVARS= Summary Data';



The PROC SPLIT statement invokes the procedure. The INDMSPLIT option specifies that the tree created by PROC DMSPLIT is to be read. The DMSPLIT tree information is stored in the DMDB catalog.

title 'Import and Save Tree from DMSPLT';

proc split dmdbcat=catexa1 indmsplit


The OUTTREE= option names the data set that contains tree information.



The OUTLEAF= option names the data set that contains statistics for

each leaf node.



The OUTTREE= option specifies the output data set that describes the tree. The OUTTREE= data set can be used as input in subsequent executions of PROC SPLIT.





PROC PRINT creates a report of the training statistics.

proc print data=trtree label;

   title2 'Training Statistics';



PROC PRINT creates a partial report of the leaf statistics for the training data.

proc print data=leafdata(obs=10) label;

   title2 'Leaf Statistics';



The DATA step creates a fictitious score data set.

data testexa1(drop=ran);

  set sampsio.dmexa1;


  if ran lt 0.08;

  title 'Create Fictitious Score Data Set';



The INTREE= option reads the tree that was saved from the previous PROC SPLIT step.


proc split intree=savetree;


The SCORE statement scores the DATA= data set. The OUTFIT= option names the output data set containing fit statistics. The OUT= option names the output data set that contains tree statistics for the scored data set. Typically you would score a data set that is mutually exclusive of the training data and that may or may not contain the target values. In this example, the WORK.TESTEXA1 data set is only a random subset of the SAMPSIO.DMEXA1 training data set.

  score data=testexa1 nodmdb  

       outfit=tfit out=tout;

  title 'Input Tree and Score Test Data';


PROC PRINT creates a report of the fit statistics for the scored data set.


proc print data=tfit label;

   title2 'Fit Statistics for the Scored Data Set';



PROC FREQ creates a misclassification table for the scored data set. The F_PURCHA variable is the actual target value for each customer, and the I_PURCHA variable is the target value into which the customer is classified.

proc freq data=tout;

  tables f_purcha*i_purcha;

  title2 'Scored Data';

  title3 'Misclassification Table';
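
As a language-neutral illustration of what the misclassification table holds, the following Python sketch cross-tabulates actual against predicted target values. It is not SAS code and does not reproduce PROC FREQ output. The function name and the sample values are hypothetical; only the roles of F_PURCHA (actual) and I_PURCHA (classified) come from the text above.

```python
from collections import Counter

def misclassification_table(actual, predicted):
    """Count (actual, predicted) pairs, as a two-way frequency table would."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

# Hypothetical values standing in for F_PURCHA (actual) and I_PURCHA (classified).
f_purcha = ["YES", "NO", "NO", "YES", "NO"]
i_purcha = ["YES", "NO", "YES", "NO", "NO"]

table = misclassification_table(f_purcha, i_purcha)

# Off-diagonal cells (actual != predicted) are the misclassified observations.
errors = sum(n for (a, p), n in table.items() if a != p)
rate = errors / len(f_purcha)
print(dict(table))
print("misclassification rate:", rate)
```

The diagonal cells of the table count correctly classified customers, so the misclassification rate is the off-diagonal total divided by the number of scored observations.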



PROC PRINT creates a partial report of selected variables from the OUT=

score information data set.

proc print data=tout(obs=10) label;

   var _node_ a_ a_yes a_no d_purcha f_purcha

       i_purcha p_puryes p_purno p_pur r_puryes

       r_purno r_pur;

   title2 'Score Summary Data';


The EMCLUS Procedure



Procedure Syntax


VAR Statement

INITCLUS Statement

Output from PROC EMCLUS


Example 1: Syntax for PROC FASTCLUS

Example 2: Use of the EMCLUS Procedure

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.

The EMCLUS Procedure


The EMCLUS procedure uses a scalable version of the Expectation-Maximization (EM) algorithm to cluster a data set. You can run either the standard EM algorithm, in which the parameters are estimated using the entire data set, or the scaled EM algorithm, in which a portion of the data set is used to estimate the parameters at each iteration of the procedure. The standard EM algorithm is run when no value is specified for the NOBS option. It requires that the entire data set fit in memory so that the total number of observations can be determined. The scaled EM algorithm is run by specifying a value for the NOBS option that is less than the total number of observations in the data set. When the scaled EM algorithm is used, it is important that the input data set be randomized beforehand.

The EMCLUS procedure identifies primary clusters, which are the densest regions of data points, and secondary clusters, which are typically smaller dense clusters. Each primary cluster is modeled with a weighted n-dimensional multivariate normal distribution, where n is the number of variables in the data set. That is, each primary cluster is modeled with the function w*f(x|u,V), where w is the cluster weight, f(x|u,V) is the density of MVN(u,V), u is the mean vector, and V is a diagonal covariance matrix. The EMCLUS procedure has four major parts:

Obtain and possibly refine the initial parameter estimates.


Apply the EM algorithm to update parameter values.


Summarize data in the primary summarization phase.


Summarize data in the secondary summarization phase.
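
The weighted density w*f(x|u,V) described above is straightforward to evaluate when V is diagonal, because the density factors into a product of univariate normals. The following Python sketch illustrates that calculation under those assumptions. The function names and the example values are hypothetical, and the code is not taken from PROC EMCLUS.

```python
import math

def diag_mvn_logpdf(x, mean, variances):
    """Log density of MVN(mean, diag(variances)) evaluated at point x."""
    if not (len(x) == len(mean) == len(variances)):
        raise ValueError("x, mean and variances must have the same length")
    log_density = 0.0
    for xi, ui, vi in zip(x, mean, variances):
        # With a diagonal V the density is a product of univariate normals,
        # so the log density is a sum of univariate log densities.
        log_density += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - ui) ** 2 / vi)
    return log_density

def weighted_cluster_density(x, weight, mean, variances):
    """w * f(x | u, V) for a single primary cluster."""
    return weight * math.exp(diag_mvn_logpdf(x, mean, variances))

# Hypothetical cluster: weight 0.5, mean at the origin, unit variances (n = 2).
value = weighted_cluster_density([0.0, 0.0], 0.5, [0.0, 0.0], [1.0, 1.0])
print(value)
```

Working in log space until the final step avoids underflow when n is large or when a point lies far from the cluster mean.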


The effectiveness of the EMCLUS procedure depends on the initial parameter estimates. Good initial parameter estimates generally lead to faster convergence and better final estimates. The initial parameter estimates can be obtained randomly, from PROC FASTCLUS, or from PROC EMCLUS. See Example 1 for the PROC FASTCLUS syntax.

PROC FASTCLUS sometimes returns poor results (clusters with low frequency counts). These clusters can be ignored by specifying appropriate values for the INITCLUS option. PROC FASTCLUS can also return clusters that are actually groups of clusters, which are indicated by a large frequency count together with a large root-mean-square standard deviation. These clusters should not be ignored. Instead, specify a value of INITSTD that is smaller than the root-mean-square standard deviations of those clusters (see Example 2). When PROC FASTCLUS returns poor results, it may help to rerun PROC FASTCLUS with a larger value of MAXCLUSTERS and then choose the best clusters as the initial values for PROC EMCLUS.

Initial estimates obtained from PROC EMCLUS can be used to refine the primary clusters or to obtain better primary clusters.
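
The screening of PROC FASTCLUS results described above can be sketched in a few lines. The Python below is only an illustration and is not SAS code: the record fields, the thresholds `min_freq`, `large_freq`, and `max_rms_std`, and the sample numbers are all hypothetical. It drops low-frequency clusters, flags clusters that look like groups of clusters, and derives an upper bound for an INITSTD value.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    cluster_id: int
    freq: int        # number of observations assigned to the cluster
    rms_std: float   # root-mean-square standard deviation of the cluster

def screen_clusters(clusters, min_freq, large_freq, max_rms_std):
    """Return (kept, merged).

    kept:   clusters with at least min_freq observations
    merged: kept clusters that are both large and spread out, which
            suggests that they are really groups of clusters
    """
    kept = [c for c in clusters if c.freq >= min_freq]
    merged = [c for c in kept
              if c.freq >= large_freq and c.rms_std > max_rms_std]
    return kept, merged

# Hypothetical summary of a preliminary clustering run.
clusters = [
    Cluster(1, 480, 0.9),
    Cluster(2, 3, 0.4),      # low frequency: ignored
    Cluster(3, 950, 3.2),    # large and spread out: likely several clusters
]
kept, merged = screen_clusters(clusters, min_freq=10, large_freq=500,
                               max_rms_std=2.0)

# INITSTD should be smaller than the RMS standard deviation of every
# suspected merged cluster, so the smallest such value is an upper bound.
initstd_upper_bound = min(c.rms_std for c in merged) if merged else None
print([c.cluster_id for c in kept], initstd_upper_bound)
```

The thresholds are judgment calls that depend on the data, so in practice they would be chosen after inspecting the frequency counts and standard deviations of all returned clusters.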

The EM algorithm is used to find the primary clusters and to update the model parameters. The EM algorithm terminates when two successive log-likelihood values differ in relative and absolute magnitude by less than a specified amount, or when ITER iterations have been reached.
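
The stopping rule can be illustrated with a toy version of the standard (not scaled) EM algorithm for a one-dimensional mixture of two normals. This Python sketch reads "differ by a particular amount" as "differ by less than a tolerance", which is an assumption, as are the tolerance names, their default values, the variance floor, and the sample data. The `max_iter` argument plays the role of ITER. It does not reproduce the internals of PROC EMCLUS.

```python
import math

def converged(prev_ll, curr_ll, abs_tol, rel_tol):
    """True when successive log-likelihoods are close in absolute and relative terms."""
    abs_diff = abs(curr_ll - prev_ll)
    rel_diff = abs_diff / max(abs(prev_ll), 1e-300)
    return abs_diff < abs_tol and rel_diff < rel_tol

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_two_clusters(data, init_means, max_iter=100, abs_tol=1e-8, rel_tol=1e-8):
    """Toy EM; returns (weights, means, variances, iterations_used)."""
    weights, variances = [0.5, 0.5], [1.0, 1.0]
    means = list(init_means)
    prev_ll = None
    for iteration in range(1, max_iter + 1):
        # E-step: posterior cluster probabilities and the log-likelihood.
        resp, curr_ll = [], 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            curr_ll += math.log(total)
            resp.append([j / total for j in joint])
        # Stop when two successive log-likelihood values are close enough.
        if prev_ll is not None and converged(prev_ll, curr_ll, abs_tol, rel_tol):
            break
        prev_ll = curr_ll
        # M-step: update weight, mean, and variance of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances, iteration

# Hypothetical data: two well-separated groups centered near 1 and 5.
data = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
weights, means, variances, iterations = em_two_clusters(data, [0.0, 6.0])
print(means, iterations)
```

Requiring both the absolute and the relative difference to be small guards against stopping too early when the log-likelihood is very large or very small in magnitude.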

The primary summarization phase summarizes observations near each of the primary cluster means and
