The arboretum procedure




The DMDB Procedure

Details

The data mining database (DMDB) is maintained as a SAS data set. The metadata information associated with the DMDB is maintained in a SAS catalog. Metadata includes overall data set information as well as statistical information for the variables according to their roles. For each CLASS variable, the metadata contains information on each of the following: its class level value, its frequency, and its ordering information. In the DMDB, the CLASS variables are stored as integers 0, 1, 2, ..., which can be mapped into different class level values.
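To get a feel for the per-level information kept for a CLASS variable (level value and frequency), the levels can be tabulated with Base SAS. This is only an illustrative sketch: it assumes the SAMPSIO library is assigned and uses the SAMPSIO.HMEQ data set and the class variables from Example 1 of this chapter. It does not read the DMDB catalog and does not show the integer codes that PROC DMDB assigns.

```sas
/* Sketch only: list each class level and its frequency with Base SAS. */
/* Assumes the SAMPSIO sample library is available in the session.     */
proc freq data=sampsio.hmeq;
   tables bad reason job;   /* the three class variables used in Example 1 */
run;
```

Each one-way table lists the distinct level values and their counts, which are the same kinds of facts the metadata records per level.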

For each VAR variable, the metadata catalog contains the following statistics:

N         The number of observations with nonmissing values of the variable
NMISS     The number of observations with missing values of the variable
MIN       The minimum
MAX       The maximum
SUM       The sum of all the nonmissing values of the variable
SUMWGT    The sum of weights
CSS       The corrected sum of squares
USS       The uncorrected sum of squares
STD       The standard deviation
SKEWNESS  Measure of the tendency for the distribution of values to be more spread out on one side of the mean than on the other
KURTOSIS  Measure of the "heaviness of the tails"

(Refer to the SAS Procedures Guide, Chapter 1 for formulas and other details.)
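The same statistic keywords are available in Base SAS PROC MEANS, so the values can be cross-checked outside the DMDB. The following is a minimal sketch, assuming the SAMPSIO.HMEQ data set from Example 1 is available. It does not create or read a DMDB, and this section does not state whether PROC DMDB uses the same divisor conventions as the PROC MEANS defaults, so small differences in STD, SKEWNESS, or KURTOSIS are possible.

```sas
/* Sketch only: compute the statistics listed above with PROC MEANS.   */
/* This is a Base SAS cross-check; it does not touch the DMDB catalog. */
proc means data=sampsio.hmeq
           n nmiss min max sum sumwgt css uss std skewness kurtosis;
   var loan mortdue value;   /* three of the numeric inputs from Example 1 */
run;
```

Without a WEIGHT statement, every observation has weight 1, so SUMWGT equals N for each variable.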

DMDBs are created only for training data and should not be used for validation or test data during modeling.



Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.



Examples

The following examples were run on the HP-UX version 10.20 operating system with SAS software release 6.12TS045.

Example 1: Getting Started with the DMDB Procedure

Example 2: Specifying a FREQ Variable


Example 1: Getting Started with the DMDB Procedure

Features:

• Specifying the Output DMDB Data Set and Catalog
• Defining the Numeric Variables in a VAR Statement
• Defining the Class Variables in a CLASS Statement
• Setting the Order of the Class Variables
• Defining the Target Variable in a TARGET Statement

This example demonstrates how to create a data mining database (DMDB) data set and catalog. The example uses the fictitious mortgage data set SAMPSIO.HMEQ, which contains 5,960 cases. Each case represents an applicant for a home equity loan. All applicants have an existing mortgage. The binary target BAD indicates whether an applicant eventually defaulted or was ever seriously delinquent. There are ten numeric inputs and two class inputs available for subsequent modeling.

Program

 

proc dmdb batch data=sampsio.hmeq
          out=dmhmeq
          dmdbcat=cathmeq;

   var loan derog mortdue value yoj delinq
       clage ninq clno debtinc;

   class bad(desc)
         reason(ascending)
         job;

   target bad;
run;


Log

1   proc dmdb batch data=sampsio.hmeq
2
3             out=dmhmeq
4             dmdbcat=cathmeq;
5
6      var loan derog mortdue value yoj delinq
7          clage ninq clno debtinc;
8
9      class bad(desc)
10            reason(ascending)
11            job;
12
13      target bad;
14   run;

Records processed=    5960  Mem used = 511K.
NOTE: The PROCEDURE DMDB used 0:00:08.30 real 0:00:02.85 cpu.





 

The PROC DMDB statement invokes the procedure. The BATCH option requests the creation of a new DMDB catalog. The DATA= option specifies the input data set.


proc dmdb batch data=sampsio.hmeq


 

The OUT= option specifies the name of the output DMDB data set. The DMDBCAT= option specifies the name of the output DMDB catalog.

          out=dmhmeq

          dmdbcat=cathmeq;
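Once the step has run, the two outputs can be inspected with Base SAS utilities. The sketch below assumes the one-level names DMHMEQ and CATHMEQ resolved to the WORK library, which is the usual SAS default but is not stated in this example. It only lists what was written and makes no claim about the entry names or types inside the catalog.

```sas
/* Sketch only: inspect what the Example 1 step wrote.               */
/* Assumption: one-level output names were stored in the WORK library. */
proc contents data=work.dmhmeq;     /* variables in the DMDB data set */
run;

proc catalog catalog=work.cathmeq;  /* entries in the DMDB catalog */
   contents;
run;
quit;
```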



 

The VAR statement identifies the numeric analysis variables. If you omit the VAR statement, PROC DMDB analyzes all numeric variables not listed in other statements.

   var loan derog mortdue value yoj delinq

       clage ninq clno debtinc;




 

The CLASS statement specifies the categorical variables to be used in the analysis. The ORDER option specifies the order to use when considering the levels of the classification variables. Valid ORDER options are ASCENDING (ASC), DESCENDING (DESC), ASCFORMATTED (ASCFMT), DESFORMATTED (DESFMT), and DSORDER (DATA). The default is ASCENDING. In this example the order is given in parentheses after each variable name: BAD is ordered descending, REASON ascending, and JOB uses the default.

   class bad(desc)

         reason(ascending)

         job;
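As a variation, the same step could request other orderings from the list of valid ORDER options. The sketch below has not been run. The output names DMHMEQ2 and CATHMEQ2 are hypothetical, and the ASCFMT and DSORDER keywords are taken from the option list in the text rather than verified against a particular release.

```sas
/* Sketch only, not executed: same step as Example 1 with different  */
/* level orderings. Output names dmhmeq2 and cathmeq2 are made up.   */
proc dmdb batch data=sampsio.hmeq
          out=dmhmeq2
          dmdbcat=cathmeq2;

   var loan derog mortdue value yoj delinq
       clage ninq clno debtinc;

   class bad(desc)          /* descending, as in Example 1          */
         reason(ascfmt)     /* ascending by formatted value         */
         job(dsorder);      /* order of appearance in the data set  */

   target bad;
run;
```

Because the class variables are stored as integers 0, 1, 2, ..., changing the order changes which level value each integer maps to, so downstream steps should read the mapping from the catalog rather than assume it.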


