The arboretum procedure

MAXBRANCH=n

restricts the number of subsets a splitting rule can produce to n or fewer. Setting n to 2 will create a binary tree. Any integer from 2 through 50 is permitted. The default value of n is 2.

MAXDEPTH=n | MAX

specifies the maximum depth of a node that the TRAIN statement will create automatically unless the MAXNEWDEPTH= option equals 1. The depth of a node equals the number of splitting rules needed to define the node. The root node has depth zero. The children of the root have depth one, and so on.

The TRAIN statement will search for a splitting rule in a leaf only when the MAXNEWDEPTH= option equals 1 or the depth of the leaf is less than n. The BRANCH, SEARCH, and SPLIT statements will search for splitting rules and create branches in a leaf at any depth, regardless of the MAXDEPTH= and MAXNEWDEPTH= options.

The MAXDEPTH=MAX option specifies n = 50, the largest possible value of n. The smallest acceptable value of n is 0. Specify MAXDEPTH=0 or MAXNEWDEPTH=0 to avoid searching for any splits while specifying other options. The MAXDEPTH= option remains in effect until explicitly changed. The MAXNEWDEPTH= option reverts to its maximum value after the TRAIN statement finishes. The default value of n is 6.
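The search rule above can be sketched in a few lines of Python. This is only a literal reading of the text, not the procedure's implementation; the function name `train_will_search` and its parameters are hypothetical.

```python
MAX_DEPTH_LIMIT = 50  # the value that MAXDEPTH=MAX and MAXNEWDEPTH=MAX stand for


def train_will_search(leaf_depth, maxdepth=6, maxnewdepth=MAX_DEPTH_LIMIT):
    """Hypothetical helper: would TRAIN search for a split in this leaf?"""
    if maxnewdepth == 0:
        # MAXNEWDEPTH=0 sets other options without searching for splits.
        return False
    if maxnewdepth == 1:
        # MAXNEWDEPTH=1 allows a search regardless of MAXDEPTH=.
        return True
    # Otherwise the leaf must be shallower than MAXDEPTH=.
    return leaf_depth < maxdepth


print(train_will_search(0))                 # root leaf with default options
print(train_will_search(6))                 # leaf already at the default depth limit
print(train_will_search(6, maxnewdepth=1))  # one extra split requested explicitly
print(train_will_search(0, maxdepth=0))     # MAXDEPTH=0 suppresses every search
```

Under this reading, MAXDEPTH=0 needs no special case: no leaf has a depth below zero, so the final comparison already rules out every search.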

MAXNEWDEPTH=n | MAX

specifies the maximum number of new generations of nodes created from a leaf. Specify MAXNEWDEPTH=1 to create at most one split in a leaf. Other options, such as the MAXDEPTH= option, may prevent the creation of all n generations. Specify MAXNEWDEPTH=0 to specify other options to the TRAIN statement without performing a search for a splitting rule. The MAXNEWDEPTH=MAX option specifies n to be 50, which is both the largest acceptable value and the default. The value of n is not retained from one TRAIN statement to the next: it equals its default of 50 unless explicitly set, even if it was changed in a previous TRAIN statement.
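The difference in persistence between the two options can be illustrated with a small Python model. It is a sketch of the behavior described above, not of the procedure itself; the class `TrainOptions` and its `train` method are hypothetical names.

```python
MAX_NEW_DEPTH = 50  # default and largest value of MAXNEWDEPTH=


class TrainOptions:
    """Hypothetical model of option state across successive TRAIN statements."""

    def __init__(self):
        self.maxdepth = 6  # MAXDEPTH= default; persists until explicitly changed

    def train(self, maxdepth=None, maxnewdepth=None):
        """Return the (MAXDEPTH, MAXNEWDEPTH) pair in effect for this statement."""
        if maxdepth is not None:
            self.maxdepth = maxdepth  # remembered for later TRAIN statements
        # MAXNEWDEPTH= is never stored: each statement starts from the default.
        new_depth = MAX_NEW_DEPTH if maxnewdepth is None else maxnewdepth
        return self.maxdepth, new_depth


opts = TrainOptions()
print(opts.train(maxdepth=3, maxnewdepth=1))  # both options set explicitly
print(opts.train())                           # MAXDEPTH kept, MAXNEWDEPTH back to 50
```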

MAXRULES=n | ALL

specifies how many splitting rules on different input variables are saved in each node, including leaves. The primary splitting rule in an internal node is always saved. Up to n − 1 additional competing rules are also saved in an internal node. The MAXRULES=ALL option requests that the ARBORETUM procedure save all the available splitting rules for each node.

Saved rules may be displayed in results and may be output using the RULES option in the SAVE statement. Subsequent BRANCH, SEARCH, SETRULE, SPLIT, and TRAIN statements use candidate rules saved in leaves.

A valid splitting rule might not exist for some input variables in some nodes. A common explanation is that none of the feasible rules meet the threshold of worth specified in the ALPHA= option in the TRAIN statement. Other causes occur less often. For example, the MINCATSIZE= option in the TRAIN statement may prevent creation of a split on a categorical input X if few observations exist for any specific value of X. As another example, the LEAFSIZE= option may prevent any split on a specific input, especially one that is nearly constant and consequently permits few candidate splits. No split exists with an input that is constant in a node. As a consequence of these and other possibilities, a node may contain fewer than n rules.

The amount of memory needed to save splitting rules, especially rules using nominal input variables with many values, may be substantial, possibly several megabytes. (Each rule in each node needs eight bytes for each branch and four more bytes for each categorical value.) The default value of n is 3.
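The memory figures above lend themselves to a quick back-of-the-envelope estimate. The Python sketch below applies the stated per-rule costs; the function names are hypothetical, and the tree size in the example is made up for illustration. It also assumes that every saved rule has the same number of branches and categorical values, which a real tree would not guarantee.

```python
def rule_bytes(branches, categorical_values=0):
    """Bytes for one saved rule: 8 per branch plus 4 per categorical value."""
    return 8 * branches + 4 * categorical_values


def saved_rules_bytes(nodes, rules_per_node, branches, categorical_values=0):
    """Upper-bound estimate when every node saves rules_per_node identical rules."""
    return nodes * rules_per_node * rule_bytes(branches, categorical_values)


# Illustrative figures only: 1000 nodes, MAXRULES=3, binary splits,
# and a nominal input with 500 distinct values.
total = saved_rules_bytes(nodes=1000, rules_per_node=3,
                          branches=2, categorical_values=500)
print(total, "bytes, about", round(total / 1e6, 1), "MB")
```

With these figures the categorical values dominate the estimate: 2000 of the 2016 bytes per rule come from the 500 nominal values.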

MAXSURROGATES | MAXSURRS=n

specifies the number of surrogate rules sought for each primary splitting rule. A surrogate rule is a backup to the primary splitting rule. The primary splitting rule might not apply to some observations because the value of the splitting variable might be missing or be a categorical value the rule does not recognize. Surrogate rules are considered for such observations. The search for surrogate rules requires an extra pass over the data, and therefore no surrogates are sought by default. See the “Missing Values” section on page 45 for more information.

Surrogate rules enhance the importance of the variables they use. See the “IMPORTANCE= Output Data Set” section on page 54 for more detail.

MINCATSIZE=n

specifies the minimum number of observations that a given nominal input value must have in order to be used in a split search. Categorical values that appear in fewer than n observations are treated as if they were missing. If USEINSEARCH is specified in the MISSING= option in the INPUT statement for the splitting variable, the categories occurring in fewer than n observations are merged into the pseudo-category for missing values for the purpose of finding a split. Otherwise, observations with infrequent categories are excluded from the split search. The policy for assigning such observations to a branch is the same as the policy for assigning missing values to a branch. Refer to the “Missing Values” section (page 45) for more detail. The default value of n is 5.
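The treatment of infrequent categories can be sketched in Python as a preprocessing step ahead of the split search. This is an illustration of the rule as described, not the procedure's code. The function `prepare_categories` and the use of `None` for a missing value are assumptions, as is the exclusion of truly missing values from the search when USEINSEARCH is not in effect.

```python
from collections import Counter

MISSING = None  # stand-in for a missing value (assumption of this sketch)


def prepare_categories(values, mincatsize=5, use_in_search=False):
    """Return the category values that would take part in the split search."""
    counts = Counter(v for v in values if v is not MISSING)
    rare = {v for v, c in counts.items() if c < mincatsize}
    prepared = []
    for v in values:
        if v is MISSING or v in rare:
            if use_in_search:
                # Infrequent categories join the pseudo-category for missing values.
                prepared.append(MISSING)
            # Otherwise the observation is left out of the search.
        else:
            prepared.append(v)
    return prepared


values = ["a"] * 6 + ["b"] * 2 + [MISSING]
print(prepare_categories(values))                      # only the six "a" values
print(prepare_categories(values, use_in_search=True))  # "b" merged with missing
```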

MINWORTH=worth

specifies a threshold value for the worth of a candidate splitting rule, unless the CRITERION= option in the PROC ARBORETUM statement is specified as PROBCHISQ or PROBF. A candidate rule whose worth is less than worth is discarded. The default value is 0. When CRITERION=PROBCHISQ or PROBF, the MINWORTH= option is ignored and the ALPHA= option is used instead.
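As a rough illustration, the filter described above might look like the following Python sketch. The function `apply_minworth` and the `(name, worth)` pair representation are hypothetical. The ALPHA= test used with PROBCHISQ and PROBF is not modeled, and `"GINI"` in the usage lines is only a placeholder for any criterion other than those two.

```python
P_VALUE_CRITERIA = {"PROBCHISQ", "PROBF"}


def apply_minworth(candidates, criterion, minworth=0.0):
    """Discard candidate rules whose worth is below MINWORTH=.

    candidates is a list of (name, worth) pairs.
    """
    if criterion.upper() in P_VALUE_CRITERIA:
        # MINWORTH= is ignored for these criteria; ALPHA= applies instead
        # (that test is outside the scope of this sketch).
        return list(candidates)
    return [(name, worth) for name, worth in candidates if worth >= minworth]


rules = [("x1", 0.30), ("x2", 0.05), ("x3", 0.10)]
print(apply_minworth(rules, "GINI", minworth=0.10))       # x2 is discarded
print(apply_minworth(rules, "PROBCHISQ", minworth=0.10))  # MINWORTH= has no effect
```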

NODES=nodelist

specifies that training proceed from all leaves descended from the nodes in the list nodelist. By default, training proceeds from all leaves.

SEARCHBINS=n

specifies the maximum number of consolidated input values to use in an exhaustive or heuristic split search. See the “Split Search Algorithm” section beginning on page 47 for a complete explanation. The default value of n is 15 times the value specified in the MAXBRANCH= option in the TRAIN or PROC ARBORETUM statement. If the MAXBRANCH= option is unspecified, the default value of the SEARCHBINS= option is 30.
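The default for SEARCHBINS= follows directly from MAXBRANCH=, as the short Python sketch below shows. The function name `default_searchbins` is hypothetical; the range check reflects the 2 through 50 limit stated for MAXBRANCH=.

```python
def default_searchbins(maxbranch=None):
    """Default SEARCHBINS= value: 15 times MAXBRANCH=, or 30 if it is unspecified."""
    if maxbranch is None:
        return 30
    if not 2 <= maxbranch <= 50:
        raise ValueError("MAXBRANCH= must be an integer from 2 through 50")
    return 15 * maxbranch


print(default_searchbins())   # MAXBRANCH= unspecified
print(default_searchbins(2))  # binary splits
print(default_searchbins(4))  # up to four branches per split
```

The two defaults agree: an unspecified MAXBRANCH= means 2, and 15 times 2 is the documented fallback of 30.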
