The ARBORETUM Procedure
MAXBRANCH=n
restricts the number of subsets a splitting rule can produce to n or fewer. Setting n to
2 will create a binary tree. Any integer from 2 through 50 is permitted. The default
value of n is 2.
MAXDEPTH=n | MAX
specifies the maximum depth of a node that the TRAIN statement will create
automatically unless the MAXNEWDEPTH= option equals 1. The depth of a node equals
the number of splitting rules needed to define the node. The root node has depth zero.
The children of the root have depth one, and so on.
The TRAIN statement will search for a splitting rule in a leaf only when the
MAXNEWDEPTH= option equals 1 or the depth of the leaf is less than n. The BRANCH,
SEARCH, and SPLIT statements will search for splitting rules and create branches
in a leaf at any depth, regardless of the MAXDEPTH= and the MAXNEWDEPTH=
options.
The MAXDEPTH=MAX option specifies n = 50, the largest possible value of n. The
smallest acceptable value of n is 0. Specify MAXDEPTH=0 or MAXNEWDEPTH=0 to avoid
searching for any splits while specifying other options. The MAXDEPTH= option
remains in effect until explicitly changed. The MAXNEWDEPTH= option reverts to its
maximum value after the TRAIN statement finishes. The default value of n for the
MAXDEPTH= option is 6.
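The depth rule above can be sketched in a few lines of Python. This is only an illustrative model of the documented behavior, not code from the ARBORETUM procedure; the names node_depth, train_may_search, and parent_of are hypothetical, and treating MAXNEWDEPTH=0 as "never search" follows the sentence about avoiding searches.

```python
from typing import Dict, Optional

MAX_DEPTH_LIMIT = 50  # value implied by MAXDEPTH=MAX in the text


def node_depth(parent_of: Dict[str, Optional[str]], node: str) -> int:
    """Depth = number of splitting rules between the root and the node."""
    depth = 0
    while parent_of[node] is not None:
        node = parent_of[node]
        depth += 1
    return depth


def train_may_search(leaf_depth: int, maxdepth: int = 6,
                     maxnewdepth: int = 50) -> bool:
    """Hypothetical model: may TRAIN search for a split in this leaf?"""
    if not 0 <= maxdepth <= MAX_DEPTH_LIMIT:
        raise ValueError("MAXDEPTH must be between 0 and 50")
    if maxnewdepth == 0:
        # MAXNEWDEPTH=0 is documented as a way to avoid any search.
        return False
    # Documented condition: MAXNEWDEPTH equals 1, or leaf depth is less than n.
    return maxnewdepth == 1 or leaf_depth < maxdepth


# A tiny tree: root -> a -> b (node names are made up for illustration).
parent_of = {"root": None, "a": "root", "b": "a"}
```

With the default MAXDEPTH=6, a leaf at depth 5 is eligible for a search and a leaf at depth 6 is not, unless MAXNEWDEPTH=1 is in effect.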
MAXNEWDEPTH=n | MAX
specifies the maximum number of new generations of nodes created from a leaf.
Specify MAXNEWDEPTH=1 to create at most one split in a leaf. Other options such
as the MAXDEPTH= option may prevent the creation of all n generations. Specify
MAXNEWDEPTH=0 to specify other options to the TRAIN statement without performing
a search for a splitting rule. The MAXNEWDEPTH=MAX option specifies n to be 50, the
largest acceptable value, which is also the default. The value of n is not retained
from one TRAIN statement to the next: unless explicitly specified again, it reverts
to the default of 50 even if a previous TRAIN statement changed it.
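The difference in persistence between MAXDEPTH= and MAXNEWDEPTH= can be illustrated with a small Python model. It is a sketch under the assumption that each call to train() stands for one TRAIN statement; the class TrainOptions and its method are hypothetical names, not part of any SAS interface.

```python
from typing import Optional, Tuple

MAXDEPTH_DEFAULT = 6       # default stated for MAXDEPTH=
MAXNEWDEPTH_DEFAULT = 50   # default (and maximum) stated for MAXNEWDEPTH=


class TrainOptions:
    """Hypothetical model of option persistence across TRAIN statements."""

    def __init__(self) -> None:
        self.maxdepth = MAXDEPTH_DEFAULT

    def train(self, maxdepth: Optional[int] = None,
              maxnewdepth: Optional[int] = None) -> Tuple[int, int]:
        """Return the (MAXDEPTH, MAXNEWDEPTH) values in effect for this call."""
        if maxdepth is not None:
            # MAXDEPTH= stays in effect until explicitly changed.
            self.maxdepth = maxdepth
        # MAXNEWDEPTH= is never stored: it applies to this statement only.
        effective_new = (MAXNEWDEPTH_DEFAULT if maxnewdepth is None
                         else maxnewdepth)
        return self.maxdepth, effective_new


opts = TrainOptions()
first = opts.train(maxdepth=4, maxnewdepth=1)   # both options specified
second = opts.train()                           # nothing specified
```

In this model the second statement still uses MAXDEPTH=4 but falls back to MAXNEWDEPTH=50.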
MAXRULES=n | ALL
specifies how many splitting rules on different input variables are saved in each node,
including leaves. The primary splitting rule in an internal node is always saved.
Up to n − 1 additional competing rules are also saved in an internal node. The
MAXRULES=ALL option requests the ARBORETUM procedure to save all the
available splitting rules for each node.
Saved rules may be displayed in results and may be output using the RULES option
to the SAVE statement. Subsequent BRANCH, SEARCH, SETRULE, SPLIT, and
TRAIN statements use candidate rules saved in leaves.
A valid splitting rule might not exist for some input variables in some nodes. A
common explanation is that none of the feasible rules meet the threshold of worth
specified in the ALPHA= option in the TRAIN statement. Other causes occur less
often. For example, the MINCATSIZE= option in the TRAIN statement may prevent
creation of a split on a categorical input X if few observations exist for any specific
value of X. As another example, the LEAFSIZE= option may prevent any split on
a specific input, especially one that is nearly constant and consequently permits few
candidate splits. No split exists with an input that is constant in a node. As a
consequence of these and other possibilities, a node may contain fewer than n rules.
The amount of memory needed to save splitting rules, especially rules using nominal
input variables with many values, may be substantial, possibly several megabytes.
(Each rule in each node needs eight bytes for each branch and four more bytes for
each categorical value.) The default value of n is 3.
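The byte counts quoted above allow a rough estimate of the memory used by saved rules. The following Python sketch only multiplies out those figures; the function names and the example tree size are hypothetical, and any additional per-rule overhead in the procedure is not modeled.

```python
BYTES_PER_BRANCH = 8             # from the text
BYTES_PER_CATEGORICAL_VALUE = 4  # from the text


def rule_bytes(branches: int, categorical_values: int = 0) -> int:
    """Approximate bytes for one saved rule."""
    return (BYTES_PER_BRANCH * branches
            + BYTES_PER_CATEGORICAL_VALUE * categorical_values)


def saved_rules_bytes(nodes: int, rules_per_node: int, branches: int,
                      categorical_values: int = 0) -> int:
    """Approximate bytes for all saved rules, assuming every rule is alike."""
    return nodes * rules_per_node * rule_bytes(branches, categorical_values)


# Hypothetical example: 1000 nodes, MAXRULES=3, binary splits,
# each rule on a nominal input with 500 distinct values.
estimate = saved_rules_bytes(nodes=1000, rules_per_node=3,
                             branches=2, categorical_values=500)
print(estimate)  # 6048000
```

Under these assumed sizes the estimate is about 6 MB, which is in line with the "several megabytes" mentioned above.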
MAXSURROGATES | MAXSURRS=n
specifies the number of surrogate rules sought for each primary splitting rule. A
surrogate rule is a backup to the primary splitting rule. The primary splitting rule
might not apply to some observations because the value of the splitting variable might
be missing or be a categorical value the rule does not recognize. Surrogate rules are
considered for such observations. The search for surrogate rules requires an extra pass
over the data, and therefore no surrogates are sought by default. See the “Missing
Values” section on page 45 for more information.
Surrogate rules enhance the importance of the variables they use. See the
“IMPORTANCE= Output Data Set” section on page 54 for more detail.
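The idea of a surrogate as a backup rule can be sketched as a simple fallback chain. This Python model assumes each rule is a function that returns a branch index, or None when it cannot be applied to the observation; the rule functions and variable names are hypothetical, and what happens when no rule applies is left to the policy described in the "Missing Values" section rather than modeled here.

```python
from typing import Callable, Dict, Optional, Sequence

Observation = Dict[str, object]
Rule = Callable[[Observation], Optional[int]]


def assign_branch(obs: Observation, primary: Rule,
                  surrogates: Sequence[Rule]) -> Optional[int]:
    """Try the primary rule first, then each surrogate in order."""
    for rule in [primary, *surrogates]:
        branch = rule(obs)
        if branch is not None:
            return branch
    return None  # no rule applies; the missing-value policy would take over


def primary_rule(obs: Observation) -> Optional[int]:
    """Hypothetical primary rule on an interval input x."""
    x = obs.get("x")
    if x is None:
        return None  # value missing: rule does not apply
    return 0 if x < 10 else 1


def surrogate_rule(obs: Observation) -> Optional[int]:
    """Hypothetical surrogate rule on a nominal input z."""
    z = obs.get("z")
    if z not in ("a", "b"):
        return None  # missing or unrecognized category
    return 0 if z == "a" else 1
```

An observation with x missing but z equal to "b" would be sent to branch 1 by the surrogate in this model.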
MINCATSIZE=n
specifies the minimum number of observations that a given nominal input value must
have in order to use the value in a split search. Categorical values that appear in
fewer than n observations are regarded as if they were missing. If USEINSEARCH
is specified in the MISSING= option in the input statement for the splitting variable,
the categories occurring in fewer than n observations are merged into the pseudo
category for missing values for the purpose of finding a split. Otherwise observations
with infrequent categories are excluded from the split search. The policy for assigning
such observations to a branch is the same as the policy for assigning missing values
to a branch. Refer to the “Missing Values” section (page 45) for more detail. The
default value of n is 5.
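The treatment of infrequent categories can be illustrated with a short Python sketch. It is a simplified model, not the procedure's implementation: the function name prepare_categories is hypothetical, None stands in for a missing value, and truly missing values are assumed to be excluded from the search when USEINSEARCH is not specified.

```python
from collections import Counter
from typing import List, Optional, Sequence

MISSING = None  # stand-in for a missing value in this sketch


def prepare_categories(values: Sequence[Optional[str]], mincatsize: int = 5,
                       use_in_search: bool = False) -> List[Optional[str]]:
    """Return the values that would take part in the split search."""
    counts = Counter(v for v in values if v is not MISSING)
    rare = {v for v, c in counts.items() if c < mincatsize}
    prepared: List[Optional[str]] = []
    for v in values:
        if v is MISSING or v in rare:
            if use_in_search:
                # Merge into the pseudo category for missing values.
                prepared.append(MISSING)
            # Otherwise the observation is left out of the search.
        else:
            prepared.append(v)
    return prepared


# Hypothetical data: "a" occurs 6 times, "b" twice, plus one missing value.
values = ["a"] * 6 + ["b"] * 2 + [MISSING]
```

With the default MINCATSIZE=5, the two "b" observations are handled like the missing one: merged into the missing pseudo category under USEINSEARCH, dropped from the search otherwise.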
MINWORTH=worth
specifies a threshold value for the worth of a candidate splitting rule, unless
the CRITERION= option in the PROC ARBORETUM statement is specified as
PROBCHISQ or PROBF. A candidate rule whose worth is less than worth is dis-
carded. The default value is 0. When CRITERION=PROBCHISQ or PROBF, the
MINWORTH= option is ignored and the ALPHA= option is used instead.
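As a minimal sketch of the thresholding described above, the following Python function discards candidates whose worth is below MINWORTH, and skips the check for the two criteria named in the text. The function name, the (name, worth) tuple layout, and the placeholder criterion "OTHER" are assumptions made for illustration; the ALPHA= test that applies to PROBCHISQ and PROBF is not modeled.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float]  # (input variable name, worth); hypothetical layout

P_VALUE_CRITERIA = {"PROBCHISQ", "PROBF"}


def keep_candidates(candidates: Sequence[Candidate], minworth: float = 0.0,
                    criterion: str = "OTHER") -> List[Candidate]:
    """Drop candidate rules whose worth is less than minworth."""
    if criterion.upper() in P_VALUE_CRITERIA:
        # MINWORTH= is ignored here; ALPHA= governs instead (not modeled).
        return list(candidates)
    return [c for c in candidates if c[1] >= minworth]


candidates = [("x", 0.20), ("y", 0.05)]
```

A rule whose worth equals the threshold is kept, since only rules with worth less than the threshold are discarded.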
NODES=nodelist
specifies that training proceed from all leaves descended from the nodes in the list
nodelist. By default, training proceeds from all leaves.
SEARCHBINS=n
specifies the maximum number of consolidated input values to use in an exhaustive
or heuristic split search. See the “Split Search Algorithm” section beginning on page
47 for a complete explanation. The default value of n is 15 times the value specified
in the MAXBRANCH= option in the TRAIN or PROC ARBORETUM statement. If
the MAXBRANCH= option is unspecified, the default value of the SEARCHBINS=
option is 30.
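The default for SEARCHBINS= is simple arithmetic, shown here as a small Python sketch. The function name default_searchbins is hypothetical; the range check reuses the 2 through 50 limits stated for MAXBRANCH=.

```python
from typing import Optional


def default_searchbins(maxbranch: Optional[int] = None) -> int:
    """Default SEARCHBINS value implied by the MAXBRANCH= setting."""
    if maxbranch is None:
        return 30  # stated default when MAXBRANCH= is unspecified
    if not 2 <= maxbranch <= 50:
        raise ValueError("MAXBRANCH must be between 2 and 50")
    return 15 * maxbranch


print(default_searchbins())   # 30
print(default_searchbins(4))  # 60
```

The two cases agree: the default MAXBRANCH= value of 2 also gives 15 × 2 = 30.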