The arboretum procedure




 

PROC DMDB step to create the DMDB data set and catalog that are required

as input to DMREG.

proc dmdb batch data=hmeq

          out=dm_data dmdbcat=dm_cat;

   var loan mortdue value yoj derog

       clage ninq clno debtinc;

   class bad(desc)

         job(asc);

   target bad;

run;



 

Because the order of the target BAD was set to descending in the DMDB

data set, DMREG also models the probability that BAD=1 (bad applicants). By

default, DMREG uses deviation from the mean coding to create the design

matrix for the class variables.

proc dmreg data=dm_data

           dmdbcat=dm_cat;

   class bad job;

   model bad = job loan mortdue value yoj derog

               clage ninq clno debtinc;

   title1 'DMREG Home Equity Data: Default Deviations from the Mean Coding';

run;
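
As a rough illustration of what deviation from the mean coding does to a class variable such as JOB, the following Python sketch builds the design columns by hand. It assumes that the last level in ascending order is the reference level and is coded -1 in every column. That choice and the function name deviation_coding are assumptions made for illustration; they are not taken from the DMREG documentation.

```python
def deviation_coding(values, levels):
    """Return one row of k-1 design columns per value (deviation coding).

    Assumption: the last entry of `levels` is the reference level and is
    coded -1 in every column.
    """
    reference = levels[-1]
    non_reference = levels[:-1]
    rows = []
    for v in values:
        if v == reference:
            # The reference level gets -1 in every column.
            rows.append([-1] * len(non_reference))
        else:
            # Any other level gets 1 in its own column and 0 elsewhere.
            rows.append([1 if v == lev else 0 for lev in non_reference])
    return rows


job_levels = ["Mgr", "Office", "Other", "ProfExe", "Sales", "Self"]
design = deviation_coding(["Mgr", "Sales", "Self"], job_levels)
for row in design:
    print(row)
```

With six levels this coding produces five columns, so the design matrix stays full rank when an intercept is included.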



 

DATA step program to code the class variable JOB using GLM non-full

rank (0, 1) coding.

data dumyhmeq;

   set hmeq;

   j_mgr=(job='Mgr');

   j_off=(job='Office');

   j_other=(job='Other');

   j_prof=(job='ProfExe');

   j_sales=(job='Sales');

   j_self=(job='Self');

run;
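
The same (0, 1) indicator coding can be sketched in Python to make the structure of the resulting columns explicit. The function name glm_coding and the sample values are hypothetical; the level names are those used in the DATA step above.

```python
def glm_coding(values, levels):
    """Return one 0/1 indicator column per level (non-full rank coding)."""
    return [[1 if v == lev else 0 for lev in levels] for v in values]


job_levels = ["Mgr", "Office", "Other", "ProfExe", "Sales", "Self"]
rows = glm_coding(["Office", "Self"], job_levels)
for row in rows:
    print(row)

# Every row with a known level sums to 1, so the six columns together
# with an intercept are linearly dependent (hence "non-full rank").
row_sums = [sum(row) for row in rows]
```

A value that matches none of the levels produces a row of zeros, which is also what the comparison expressions in the DATA step yield for such an observation.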



 

PROC LOGISTIC step to model the binary target BAD. 

proc logistic data=dumyhmeq descending noprint;

   model bad = j_mgr j_off j_other j_prof j_sales j_self

               loan mortdue value yoj derog

               clage ninq clno debtinc;

   output out=logfit(keep=bad p_bad1) p=p_bad1;

   title 'LOGISTIC Home Equity Data: GLM coding';

run;



 

The NOPRINT option suppresses the printing of the DMREG output. PROC

COMPARE is used to compare the predicted values from the LOGISTIC and DMREG

models. The CODING=GLM option creates the design matrix for the class variables

using GLM non-full rank coding.

proc dmreg data=dm_data

           dmdbcat=dm_cat

           noprint;

   class bad job;

   model bad = job loan mortdue value yoj derog

               clage ninq clno debtinc / coding=glm;

   score out=dmscore;

   title1 'DMREG Home Equity Data: GLM coding';

run;
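
The kind of check that a comparison of predicted values performs can be sketched in Python as follows. This is only an illustration of the idea, not the PROC COMPARE step: the function name, the tolerance, and the probability values are all made up for the example.

```python
def max_abs_difference(base, compare):
    """Return the largest absolute difference between paired predictions."""
    if len(base) != len(compare):
        raise ValueError("both prediction lists must have the same length")
    return max(abs(b - c) for b, c in zip(base, compare))


# Hypothetical predicted probabilities from the two fitted models.
p_logistic = [0.12, 0.87, 0.45]
p_dmreg = [0.12, 0.87, 0.45 + 1e-9]

diff = max_abs_difference(p_logistic, p_dmreg)
agree = diff <= 1e-5  # assumed tolerance for "equal" predictions
print(agree)
```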




Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.




The DMSPLIT Procedure


Overview

Procedure Syntax

PROC DMSPLIT Statement

FREQ Statement

TARGET Statement

VARIABLE Statement

WEIGHT Statement



Details

Examples

Example 1: Creating a Decision Tree for a Binary Target with the DMSPLIT Procedure


Overview

The DMSPLIT procedure performs variable selection using binary variable splits that maximize the

Chi-Square value of a 2 x 2 frequency table. For each split, the cutoff threshold is the value at which the

Chi-Square value of the resulting table is largest.

PROC DMINE and PROC DMSPLIT are underlying procedures for the Variable Selection node.
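
The idea of choosing a cutoff that maximizes the Chi-Square value of a 2 x 2 table can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DMSPLIT algorithm: it uses the plain Pearson statistic without a continuity correction, tries every distinct value as a cutoff, and the function names and data are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association is measurable
    return n * (a * d - b * c) ** 2 / denom


def best_cutoff(x, y):
    """Return (cutoff, chi_square) for the split x <= cutoff versus x > cutoff.

    `y` is a binary target coded 0/1. The largest value of x is skipped
    because it would put every observation on one side of the split.
    """
    best = (None, 0.0)
    for cutoff in sorted(set(x))[:-1]:
        a = sum(1 for xi, yi in zip(x, y) if xi <= cutoff and yi == 1)
        b = sum(1 for xi, yi in zip(x, y) if xi <= cutoff and yi == 0)
        c = sum(1 for xi, yi in zip(x, y) if xi > cutoff and yi == 1)
        d = sum(1 for xi, yi in zip(x, y) if xi > cutoff and yi == 0)
        chi2 = chi_square_2x2(a, b, c, d)
        if chi2 > best[1]:
            best = (cutoff, chi2)
    return best


x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
cutoff, chi2 = best_cutoff(x, y)
print(cutoff, chi2)
```

In this toy data the split at 3 separates the two target values perfectly, so it gives the largest statistic.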


Procedure Syntax

PROC DMSPLIT <option(s)>;

FREQ variable;

TARGET variable;

VARIABLE variable-list;

WEIGHT variable;


PROC DMSPLIT Statement

Invokes the DMSPLIT procedure.

PROC DMSPLIT <option(s)>;

Required Arguments

DATA=SAS-data-set

Specifies an input data set generated by PROC DMDB. The data set is associated with a valid

catalog specified by the DMDBCAT= option. This data set must contain interval scaled variables

and CLASS variables in a specific form written by PROC DMDB.



Default: None.


DMDBCAT=SAS-catalog

Identifies an input metadata catalog generated by PROC DMDB. The metadata catalog is

associated with a valid data set specified by the DATA= option. The catalog contains important

information (for example, the range of variables, number of missing values of each variable,

moments of variables) that is used by many other Enterprise Miner procedures that require a

DMDB data set. The DMDBCAT= catalog and the DATA= data set must be appropriately related

to each other in order to obtain proper results.

Default: None.


Options

BINS=integer

Specifies the number of categories into which the range of a numeric (interval) variable is divided

for splits.

Range: Integer > 0

Default: 100
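
One way to picture the effect of BINS= is as a set of candidate cutoffs spread over the range of an interval variable. The Python sketch below assumes equal-width bins, which the text above does not state; the function name bin_edges and the example range are hypothetical.

```python
def bin_edges(low, high, bins=100):
    """Return the interior edges that divide [low, high] into `bins` parts.

    Assumption: the bins have equal width. The default of 100 mirrors the
    documented default of the BINS= option.
    """
    if bins <= 0:
        raise ValueError("bins must be a positive integer")
    width = (high - low) / bins
    return [low + i * width for i in range(1, bins)]


edges = bin_edges(0.0, 10.0, bins=5)
print(edges)
```

Under this assumption, a larger BINS= value gives more candidate cutoffs to evaluate for each interval variable.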


CHISQ=number

Specifies a lower bound on the Chi-Square value that a variable split must reach to remain eligible. The value of

CHISQ governs the number of splits that are performed: the higher the value of CHISQ, the fewer

splits and passes over the input data are performed.



Range: Real number > 0


