The arboretum procedure



Date: 30.04.2018

The SPLIT Procedure

PROC SPLIT Statement

PROC SPLIT <option(s)>;

Data Set Options

OPTION           DESCRIPTION
DATA=            Specifies the data set containing observations used to create the model. Default: none.
DMDBCAT=         Specifies the DMDB metabase associated with the data. Default: none.
INDMSPLIT        Requests that the tree created by PROC DMSPLIT be input.
INTREE=          Specifies the input data set describing a previously created tree.
OUTAFDS=         Specifies the output data set for the user interface components of SAS/AF. (These components, or widgets, can be scrollbars, pushbuttons, text fields, and so on.)
OUTIMPORTANCE=   Specifies the output data set with variable importance.
OUTLEAF=         Names the output data set that is to contain statistics for each leaf node.
OUTMATRIX=       Names the output data set that is to contain summary statistics.
OUTSEQ=          Specifies the output data set with sub-tree statistics.
OUTTREE=         Specifies the output data set describing the tree.
VALIDATA=        Specifies the validation data set.


Tree Construction Options


OPTION         DESCRIPTION
ASSESS=        Specifies the model assessment measure.
COSTSPLIT      Requests that the split search criterion incorporate the decision matrix.
CRITERION=     Specifies the method of model construction.
EXCLUDEMISS    Specifies that missing values be excluded during a split search.
EXHAUSTIVE=n   Specifies the highest number of candidate splits to find in an exhaustive search.
LEAFSIZE=      Specifies the minimum size of a node.
LIFTDEPTH=     Specifies the proportion of data to use with ASSESS=LIFT.
MAXBRANCH=     Specifies the maximum number of child nodes of a node.
MAXDEPTH=      Specifies the limiting depth of the tree.
NODESAMPLE=    Specifies the size of the within-node sample used for split searches.
NRULES=        Specifies the number of rules saved with each node.
NSURRS=        Specifies the number of surrogates sought in each non-leaf node.
PADJUST=       Specifies the options for adjusting p-values.
PVARS=         Specifies the p-value adjustment for the number of variables.
SPLITSIZE=     Specifies the minimum size of a node required for a split.
SUBTREE=       Specifies the method for selecting the sub-tree.
USEVARONCE     Specifies that no node is split on an input that an ancestor is split on.
WORTH=         Specifies the worth required of a splitting rule.
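
The size and shape limits above (LEAFSIZE=, SPLITSIZE=, MAXBRANCH=, MAXDEPTH=) can be read as a gate that every candidate split has to pass. The Python sketch below is only an illustration of that reading, not the procedure's implementation. The names `SplitLimits` and `split_allowed` are hypothetical, and the depth convention (root at depth 0, children of a node at depth d sit at depth d + 1) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SplitLimits:
    """Hypothetical container mirroring four PROC SPLIT options."""
    leafsize: int   # LEAFSIZE=: minimum number of observations in any node
    splitsize: int  # SPLITSIZE=: minimum node size required before splitting
    maxbranch: int  # MAXBRANCH=: maximum number of child nodes per split
    maxdepth: int   # MAXDEPTH=: limiting depth of the tree (root assumed at 0)


def split_allowed(limits, node_size, node_depth, child_sizes):
    """Return True if a candidate split respects all four limits."""
    if node_size < limits.splitsize:
        return False  # node too small to be considered for splitting
    if node_depth >= limits.maxdepth:
        return False  # children would exceed the limiting depth
    if not 2 <= len(child_sizes) <= limits.maxbranch:
        return False  # too many (or too few) branches
    # every child must still satisfy the minimum node size
    return all(size >= limits.leafsize for size in child_sizes)


limits = SplitLimits(leafsize=5, splitsize=20, maxbranch=2, maxdepth=3)
print(split_allowed(limits, node_size=50, node_depth=0, child_sizes=[30, 20]))
```

How the procedure itself orders these checks, and whether it counts depth from 0 or 1, is not stated in this section, so treat those details as placeholders.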



Required Arguments

DATA=SAS-data-set

Names the input training data set if constructing a tree. Variables named in the FREQ, INPUT, and TARGET statements refer to variables in the DATA= SAS data set.

Default: None


DMDBCAT=SAS-catalog

Names the SAS catalog describing the DMDB metabase. The DMDB metabase contains the formatted values of all NOMINAL variables, and how they are coded in the DATA= SAS data set. Required with the DATA= option.

Default: None


To learn how to create the DMDB encoded data set and catalog, see the PROC DMDB chapter.

Options

ASSESS=

Specifies how to evaluate a tree. The construction of the sequence of sub-trees uses the assessment measure. Possible measures are:

MEASURE     DESCRIPTION
IMPURITY    Total leaf impurity (Gini index or Average Squared Error).
LIFT        Average assessment in highest ranked observations.
PROFIT      Average profit or loss from the decision function.
STATISTIC   Nominal Classification Rate or Average Squared Error.

Default: PROFIT


The default PROFIT measure is set to STATISTIC if no DECISION statement is specified.

LIFT restricts the default PROFIT or STATISTIC measure to those observations predicted to have the best assessment. The LIFTDEPTH= option specifies the proportion of observations to use.
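
As a rough illustration of the LIFT idea, the Python sketch below ranks observations by their predicted assessment and averages the actual assessment over the top proportion only. It is a conceptual sketch, not the procedure's code: the function name `lift_assessment` is hypothetical, `liftdepth` stands in for the LIFTDEPTH= proportion, and the rounding rule (round up, at least one observation) and tie handling are assumptions.

```python
import math


def lift_assessment(predicted, actual, liftdepth):
    """Average `actual` over the top `liftdepth` share ranked by `predicted`."""
    if not 0 < liftdepth <= 1:
        raise ValueError("liftdepth must be in (0, 1]")
    # rank observations from best to worst predicted assessment
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    # number of top-ranked observations to keep (assumed: round up, minimum 1)
    n_top = max(1, math.ceil(liftdepth * len(ranked)))
    top = ranked[:n_top]
    return sum(value for _, value in top) / n_top


predicted = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1, 1, 0, 1, 0, 0, 1, 0]
print(lift_assessment(predicted, actual, liftdepth=0.5))
```

With `liftdepth=1.0` the function reduces to the plain average over all observations, which matches the description of LIFT as a restriction of the ordinary measure.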

If ASSESS=IMPURITY, then the assessment of the tree is measured as the total impurity of all its leaves. For interval targets, this is the same as using Average Squared Error (ASSESS=STATISTIC).

For categorical targets, the impurity of each leaf is evaluated using the Gini index. The impurity measure produces a finer separation of leaves than a classification rate and is, therefore, preferable for lift charts. ASSESS=LIFT generates the sequence of sub-trees using ASSESS=IMPURITY and then prunes using the LIFT measure.

ASSESS=IMPURITY implements class probability trees as described in Breiman et al. (1984), section 4.6.
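
The claim that impurity separates leaves more finely than a classification rate can be checked on a small example. The Python sketch below compares two hypothetical trees that misclassify the same share of observations but differ in total Gini impurity. It assumes that "total impurity" means the size-weighted average of per-leaf impurity; the function names and toy data are illustrative only.

```python
from collections import Counter


def gini(labels):
    """Gini index of one leaf: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def misclassification(labels):
    """Share of a leaf's observations outside its majority class."""
    n = len(labels)
    return 1.0 - max(Counter(labels).values()) / n


def weighted_leaf_average(leaves, measure):
    """Size-weighted average of a per-leaf measure over all leaves (assumed)."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * measure(leaf) for leaf in leaves)


# Two toy trees on the same 8 observations (4 of each class).
tree_a = [[1, 1, 1, 0], [1, 0, 0, 0]]
tree_b = [[1, 1, 0, 0, 0, 0], [1, 1]]

for name, tree in (("A", tree_a), ("B", tree_b)):
    print(name,
          weighted_leaf_average(tree, misclassification),
          weighted_leaf_average(tree, gini))
```

Both trees have the same misclassification rate, yet tree B has lower Gini impurity because one of its leaves is pure. A classification rate cannot tell the two apart; the impurity measure can.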

COSTSPLIT

Requests that the split search criterion incorporate the decision matrix. To use COSTSPLIT, CRITERION must equal ENTROPY or GINI, and the type of the DECDATA data set must be PROFIT or LOSS. For ordinal targets, COSTSPLIT is superfluous because the decision matrix is always incorporated into the criterion.

CRITERION=method

Specifies the method of searching for and evaluating candidate splitting rules. Possible methods depend on the level of measurement appropriate for the target variable, as follows:

BINARY or NOMINAL TARGETS:

METHOD      DESCRIPTION
CHISQ       Pearson Chi-square statistic for target vs. segments.
PROBCHISQ   p-value of Pearson Chi-square statistic for target vs. segments. Default for NOMINAL.
ENTROPY     Reduction in entropy measure of node impurity.
ERATIO      Reduction in entropy of split.
GINI        Reduction in Gini measure of node impurity.
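
To make the GINI, ENTROPY, and CHISQ criteria concrete, the Python sketch below scores one candidate split of a binary target three ways: reduction in Gini impurity, reduction in entropy, and the Pearson chi-square statistic of target versus segments. These are the textbook definitions, written as a sketch; the function names are hypothetical, and the procedure's own handling of missing values, weights, and p-value adjustment is not reproduced here.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_reduction(segments, impurity):
    """Parent impurity minus the size-weighted impurity of the segments."""
    parent = [y for segment in segments for y in segment]
    n = len(parent)
    children = sum(len(s) / n * impurity(s) for s in segments)
    return impurity(parent) - children


def pearson_chisq(segments):
    """Pearson chi-square of the segment-by-class contingency table."""
    class_totals = Counter(y for segment in segments for y in segment)
    n = sum(class_totals.values())
    stat = 0.0
    for segment in segments:
        observed = Counter(segment)
        for cls, cls_total in class_totals.items():
            expected = len(segment) * cls_total / n
            stat += (observed[cls] - expected) ** 2 / expected
    return stat


# Candidate binary split of 8 observations into two segments.
segments = [[1, 1, 1, 0], [1, 0, 0, 0]]
print(impurity_reduction(segments, gini))
print(impurity_reduction(segments, entropy))
print(pearson_chisq(segments))
```

PROBCHISQ would convert the chi-square statistic into a p-value using the appropriate degrees of freedom; that step is left out to keep the sketch dependency-free.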

INTERVAL TARGETS


