The arboretum procedure


PROC ARBORETUM Statement

statistics and predictions in the saved tree.

DECSEARCH

specifies that the split search should incorporate the profit or loss function specified in the DECISION statement. See the “Incorporating Decisions, Profit, and Loss” section on page 38 for more information. The DECSEARCH option works only with a categorical target.

INMODEL= SAS-data-set

names a data set created by the MODEL= option in the SAVE statement, or saved from the Enterprise Miner Tree Desktop Application. When the INMODEL= option is used, the INPUT, TARGET, FREQ, and DECISION statements are prohibited.

Beginning with SAS 9.1, the MODEL= data set contains the names of the training and validation data. The DATA= option is therefore unnecessary to resume training with the same data that was used to create the saved tree (assuming the saved name of the training data is still valid).

MISSING= policy

specifies how a splitting rule handles an observation with missing values. Table 3 lists the available policies.

Table 3. Missing Value Policies

Policy          Description
BIGBRANCH       assign the observation to the largest branch
DISTRIBUTE      assign the observation to each branch with a fractional frequency proportional to the number of training observations in the branch
SMALLRESIDUAL   assign to the branch minimizing SSE among observations with missing values
USEINSEARCH     use missing values during the split search (default)

The default policy is USEINSEARCH. The MISSING= option in the INPUT statement assigns a policy to the variables listed in the statement, and supersedes the MISSING= option in the PROC ARBORETUM statement. See the “INPUT Statement” section on page 25.

If a surrogate rule can assign an observation to a branch, then it does, and the missing value policy is ignored for that observation. Using the CODE statement for a tree containing a rule with MISSING=DISTRIBUTE is an error. See the “Missing Values” section on page 45 for a complete description of the missing value options.
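The BIGBRANCH and DISTRIBUTE policies in Table 3 can be illustrated with a short Python sketch. It is not the procedure's implementation: the function name `assign_missing`, the representation of branches as a list of training counts, the unit frequency of the observation, and the tie-breaking rule are all assumptions made for illustration. SMALLRESIDUAL and USEINSEARCH are not covered.

```python
def assign_missing(policy, branch_counts):
    """Return {branch_index: fractional frequency} for one observation
    (frequency 1) whose splitting value is missing.

    branch_counts: number of training observations in each branch.
    Hypothetical helper; only BIGBRANCH and DISTRIBUTE are sketched.
    """
    total = sum(branch_counts)
    if total <= 0:
        raise ValueError("branch_counts must contain at least one observation")

    if policy == "BIGBRANCH":
        # The whole observation goes to the largest branch.
        # Ties go to the lowest index (an assumption, not from the manual).
        largest = max(range(len(branch_counts)), key=lambda i: branch_counts[i])
        return {largest: 1.0}

    if policy == "DISTRIBUTE":
        # Every branch receives a share proportional to its training count.
        return {i: count / total for i, count in enumerate(branch_counts)}

    raise ValueError(f"policy not covered by this sketch: {policy}")
```

With branch counts of 30 and 70, BIGBRANCH sends the observation entirely to the second branch, while DISTRIBUTE splits it into fractional frequencies of 0.3 and 0.7.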

PADJUST= method1 < method2 method3 >

names one or more methods for adjusting the p-values used with the PROBCHISQ and PROBF criteria. The following methods are available.

CHAIDAFTER      applies a Bonferroni adjustment after the split is chosen.
CHAIDBEFORE     applies a Bonferroni adjustment before the split is chosen.
DEPTH           adjusts for the number of ancestor splits.
NOGABRIEL       suppresses an adjustment that sometimes overrides CHAID.
NONE            suppresses all adjustments.

Specifying both CHAIDAFTER and CHAIDBEFORE is an error. Specifying NONE with any other method is an error. If the PADJUST= option is not specified, the CHAIDBEFORE and DEPTH methods are used. The PADJUST= option is ignored unless CRITERION=PROBCHISQ or CRITERION=PROBF is specified. See the “Adjusting p-Values for the Number of Input Values and Branches” section on page 43 for more information.
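The combination rules for PADJUST= methods can be summarized in a small Python sketch. The names `resolve_padjust` and `bonferroni` are hypothetical, and the sketch only mirrors the rules stated in this section. The `bonferroni` helper shows the textbook form of the adjustment (multiply the p-value by a count and cap the result at 1); how the procedure determines that count is not described here, so it is left as a parameter.

```python
KNOWN_METHODS = frozenset(
    {"CHAIDAFTER", "CHAIDBEFORE", "DEPTH", "NOGABRIEL", "NONE"}
)
DEFAULT_METHODS = frozenset({"CHAIDBEFORE", "DEPTH"})


def resolve_padjust(methods=None):
    """Validate a PADJUST= method list and return the effective set.

    Hypothetical helper that mirrors the rules stated in the text.
    """
    if not methods:
        # PADJUST= not specified: CHAIDBEFORE and DEPTH are used.
        return set(DEFAULT_METHODS)

    chosen = {m.upper() for m in methods}
    unknown = chosen - KNOWN_METHODS
    if unknown:
        raise ValueError(f"unknown method(s): {sorted(unknown)}")
    if {"CHAIDAFTER", "CHAIDBEFORE"} <= chosen:
        raise ValueError("CHAIDAFTER and CHAIDBEFORE cannot be combined")
    if "NONE" in chosen and len(chosen) > 1:
        raise ValueError("NONE cannot be combined with another method")
    return chosen


def bonferroni(p_value, multiplier):
    """Textbook Bonferroni adjustment: scale the p-value, cap it at 1."""
    return min(1.0, p_value * multiplier)
```

For example, `resolve_padjust(["chaidafter", "depth"])` is accepted, while `resolve_padjust(["NONE", "DEPTH"])` raises an error.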

PRIORSSEARCH

requests that the prior probabilities defined in the DECISION statement be incorporated in the split search criterion for a categorical target. See the “Incorporating Prior Probabilities” section on page 37 for more information.

PVARS=n | ALL

specifies the number of input variables n to regard as independent when adjusting p-values for the number of inputs. PVARS=ALL specifies all the input variables as independent. When searching for a split, the ARBORETUM procedure ignores input variables whose values are constant in the node being split, and ignores categorical variables unless at least two values occur in more observations than specified in the MINCATSIZE= option in the TRAIN statement. Consequently, the ARBORETUM procedure may search for rules using only m ≤ N of the original N input variables. The procedure will regard max((n/N)m, 1) of the m variables as independent. See the “Adjusting p-Values for the Number of Input Variables” section on page 44 for more detail. The default number n is 0, requesting no adjustment for the number of inputs.
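The expression max((n/N)m, 1) can be worked through with a short Python sketch. The function name `independent_inputs` and the choice to return `None` for the default n = 0 (no adjustment) are assumptions for illustration. The sketch also assumes that PVARS=ALL corresponds to n = N.

```python
def independent_inputs(n, N, m):
    """Number of the m searchable inputs regarded as independent.

    n: PVARS= value (a non-negative number, or the string "ALL")
    N: number of original input variables
    m: number of inputs actually searched in the node (m <= N)
    Hypothetical helper; returns None when n is 0 (no adjustment).
    """
    if not 0 < m <= N:
        raise ValueError("expected 0 < m <= N")
    if n == "ALL":
        n = N  # assumption: ALL treats every input as independent
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return None  # default: no adjustment for the number of inputs
    return max((n / N) * m, 1)
```

With N = 20 original inputs of which m = 8 are searchable, PVARS=5 gives max((5/20) × 8, 1) = 2, PVARS=1 gives max(0.4, 1) = 1, and PVARS=ALL gives 8.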

SPLITATDATUM

requests that a split on an interval input equal the value of the observation, if the value is an integer, or slightly less than the value if the value is not an integer. The alternative is to split halfway between two data values. The SPLITBETWEEN option requests the alternative.

SPLITBETWEEN

requests that a split on an interval input be halfway between two data values. The SPLITBETWEEN option is the default. The SPLITATDATUM option is an alternative.
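The two ways of placing a split value can be contrasted in a minimal Python sketch. The function names are hypothetical. The text does not say how much less "slightly less" is, so the sketch assumes the next representable floating-point number below the data value. It requires Python 3.9 or later for `math.nextafter`.

```python
import math


def split_between(lower, upper):
    """SPLITBETWEEN-style value: halfway between two data values."""
    return (lower + upper) / 2


def split_at_datum(value):
    """SPLITATDATUM-style value for a single data value.

    Integer values are used as they are. For a non-integer value the
    result is slightly less than the value; using the next float below
    it is an assumption, not something the manual specifies.
    """
    value = float(value)
    if value.is_integer():
        return value
    return math.nextafter(value, -math.inf)
```

For the adjacent data values 2 and 3, `split_between(2, 3)` gives 2.5, whereas `split_at_datum(3)` gives 3.0.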

ASSESS Statement

ASSESS < options > ;

The ASSESS statement specifies a measure for evaluating trees, evaluates all subtrees (with the original root), chooses a best one for each possible number of leaves, and organizes the chosen ones in a sequence, beginning with the subtree consisting of the root only, and ending with the largest tree consisting of all the nodes. (For assessment measures LIFT and LIFTPROFIT, the subtrees are evaluated with measures ASE and PROFIT, respectively. See the section “Tree Assessment and the Subtree Sequence” on page 49.)
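The idea of keeping one best subtree per number of leaves and ordering the results can be sketched in Python. This illustrates the selection logic only, under the assumption that each candidate subtree has already been scored. The function name `subtree_sequence`, the tuple layout, and the `higher_is_better` flag are hypothetical, and the procedure's own algorithm for enumerating subtrees is not shown here.

```python
def subtree_sequence(candidates, higher_is_better=True):
    """Pick the best candidate per leaf count and order by leaf count.

    candidates: iterable of (name, n_leaves, score) tuples, one per
    already-evaluated subtree (hypothetical layout).
    higher_is_better: True for a profit-like measure, False for an
    error-like measure such as ASE.
    """
    best = {}
    for name, n_leaves, score in candidates:
        current = best.get(n_leaves)
        if current is None:
            best[n_leaves] = (name, n_leaves, score)
            continue
        improved = score > current[2] if higher_is_better else score < current[2]
        if improved:
            best[n_leaves] = (name, n_leaves, score)
    # The sequence starts at the smallest subtree and ends at the largest.
    return [best[k] for k in sorted(best)]
```

Given candidates `("root", 1, 0.40)`, `("a", 2, 0.30)`, `("b", 2, 0.25)`, and `("full", 3, 0.20)` scored with an error measure (`higher_is_better=False`), the sequence keeps `root`, `b`, and `full`, in that order.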

The ARBORETUM procedure selects the best subtree in the sequence consistent with the options in the ASSESS statement. A subsequent SUBTREE statement can change
