The ARBORETUM Procedure
MAXBRANCH=n
restricts the number of subsets that a splitting rule can produce to n or fewer. Setting
n to 2 creates a binary tree. Any integer from 2 through 50 is permitted. The default
value of n is 2.
MAXDEPTH=n
specifies the maximum depth of a node that the TRAIN statement will create
automatically unless the MAXNEWDEPTH= option equals 1. The depth of a node equals
the number of splitting rules needed to define the node. The root node has depth zero.
The children of the root have depth one, and so on.
The TRAIN statement searches for a splitting rule in a leaf only when the
MAXNEWDEPTH= option equals 1 or the depth of the leaf is less than n. The BRANCH,
SEARCH, and SPLIT statements search for splitting rules and create branches
in a leaf at any depth, regardless of the MAXDEPTH= and MAXNEWDEPTH=
options. Specify MAXDEPTH=0 or MAXNEWDEPTH=0 to avoid searching for any splits
while specifying other options. The MAXDEPTH= option remains in effect until
explicitly changed. The MAXNEWDEPTH= option reverts to its maximum value after
the TRAIN statement finishes. The default value of n is six.
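The search rule just described can be restated as a small predicate. This is a minimal Python sketch of the rule as worded here, not the procedure's implementation; the function name and argument names are hypothetical, and the defaults (6 and 50) are the ones stated in this section.

```python
def train_searches_leaf(leaf_depth, maxdepth=6, maxnewdepth=50):
    """Return True if, per the rule in the text, TRAIN would search this leaf.

    Hypothetical helper for illustration only. The text does not say what
    happens when MAXDEPTH=0 and MAXNEWDEPTH=1 are combined; this sketch
    simply applies the stated rule literally in that case.
    """
    # MAXNEWDEPTH=0 means "set other options, but do not search".
    if maxnewdepth == 0:
        return False
    # Search when MAXNEWDEPTH=1, or when the leaf is shallower than MAXDEPTH.
    return maxnewdepth == 1 or leaf_depth < maxdepth


for depth in (0, 5, 6):
    print(depth, train_searches_leaf(depth))
```

With the defaults, leaves at depth 0 through 5 are searched and a leaf at depth 6 is not. The BRANCH, SEARCH, and SPLIT statements are outside this predicate because they ignore both options.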
MAXNEWDEPTH=n | MAX
specifies the maximum number of new generations of nodes created from a leaf.
Specify MAXNEWDEPTH=1 to create at most one split in a leaf. Other options, such
as the MAXDEPTH= option, may prevent the creation of all n generations. Specify
MAXNEWDEPTH=0 to pass other options to the TRAIN statement without searching
for a splitting rule. The MAXNEWDEPTH=MAX option sets n to 50, which is both the
largest acceptable value and the default. The value of n is not retained from one
TRAIN statement to the next: each TRAIN statement uses the default of 50 unless it
sets n explicitly, even if a previous TRAIN statement changed it.
MAXRULES=n | ALL
specifies how many splitting rules on different input variables are saved in each node,
including leaves. The primary splitting rule in an internal node is always saved.
Up to n − 1 additional competing rules are also saved in an internal node. The
MAXRULES=ALL option requests the ARBORETUM procedure to save all the
available splitting rules for each node.
Saved rules may be displayed in results and may be output using the RULES option
to the SAVE statement. Subsequent BRANCH, SEARCH, SETRULE, SPLIT, and
TRAIN statements use candidate rules saved in leaves.
A valid splitting rule might not exist for some input variables in some nodes. A
common explanation is that none of the feasible rules meet the threshold of worth
specified in the ALPHA= option in the TRAIN statement. Other causes occur less
often. For example, the MINCATSIZE= option in the TRAIN statement may prevent
creation of a split on a categorical input X if few observations exist for any specific
value of X. As another example, the LEAFSIZE= option may prevent any split on
a specific input, especially one that is nearly constant and consequently permits few
feasible splits. As a consequence of these and other possibilities, a node may contain
fewer than n rules.
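The bookkeeping described above can be sketched in Python. This is only an illustration: the function name, the tuple layout, and the ranking of candidates by worth are assumptions, since the text says only that one primary rule and up to n − 1 competing rules are saved, and that fewer may exist.

```python
def rules_to_save(candidates, maxrules=3):
    """Pick the rules a node would keep (hypothetical illustration).

    candidates: list of (input_name, worth), at most one rule per input.
    maxrules:   an integer n, or the string "ALL".
    Ranking by worth is an assumption made for this sketch.
    """
    ranked = sorted(candidates, key=lambda rule: rule[1], reverse=True)
    if maxrules == "ALL":
        return ranked
    # Slicing returns fewer than n rules when fewer valid rules exist.
    return ranked[:maxrules]


candidates = [("income", 0.42), ("age", 0.31), ("region", 0.12), ("tenure", 0.05)]
saved = rules_to_save(candidates)
primary, competitors = saved[0], saved[1:]
print(primary, competitors)
```

Here the default of 3 keeps the primary rule on the hypothetical input income plus two competing rules. Passing only two candidates returns two rules, which mirrors the case where a node contains fewer than n rules.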
The amount of memory needed to save splitting rules, especially rules using nominal
input variables with many values, may be substantial, possibly several megabytes.
(Eight bytes for each branch and four more bytes for each categorical value are needed
for each rule in each node.) The default value of n is 3.
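The per-rule cost quoted above can be turned into a rough estimate. The following Python sketch applies only the two figures given in the text (eight bytes per branch, four bytes per categorical value); the function name and the example tree size are hypothetical, and any other overhead is ignored.

```python
def rule_memory_bytes(n_nodes, rules_per_node, branches, categorical_values):
    """Rough memory estimate for saved splitting rules (illustrative only)."""
    # 8 bytes per branch plus 4 bytes per categorical value, for each rule.
    per_rule = 8 * branches + 4 * categorical_values
    return n_nodes * rules_per_node * per_rule


# Hypothetical tree: 1000 nodes, 3 saved rules each, binary splits,
# on a nominal input with 500 distinct values.
total = rule_memory_bytes(1000, 3, 2, 500)
print(total)  # 6048000
```

About six million bytes for this hypothetical tree is in line with the "several megabytes" mentioned in the text, and the categorical-value term dominates the total.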
specifies the number of surrogate rules sought for each primary splitting rule. A
surrogate rule is a backup to the primary splitting rule. The primary splitting rule
might not apply to some observations because the value of the splitting variable might
be missing or be a categorical value the rule does not recognize. Surrogate rules are
considered for such observations. The search for surrogate rules requires an extra pass
over the data, and therefore no surrogates are sought by default. See the
section on page 45 for more information.
Surrogate rules enhance the importance of the variables they use. See the
section on page 54 for more detail.
MINCATSIZE=n
specifies the minimum number of observations that a given nominal input value must
have in order for the value to be used in a split search. Categorical values that appear
in fewer than n observations are regarded as if they were missing. If USEINSEARCH
is specified in the MISSING= option in the INPUT statement for the splitting variable,
the categories occurring in fewer than n observations are merged into the pseudo
category for missing values for the purpose of finding a split. Otherwise, observations
with infrequent categories are excluded from the split search. The policy for assigning
such observations to a branch is the same as the policy for assigning missing values
to a branch. Refer to the section (page 45) for more detail. The
default value of n is 5.
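The treatment of infrequent categories can be illustrated with a short Python sketch. It is a simplified model of the behavior described above, not the procedure's code: the function name, the use of None for missing values, and the assumption that genuinely missing values are also dropped when USEINSEARCH is not specified are all choices made for this example.

```python
from collections import Counter

MISSING = None  # stand-in for a missing value in this sketch


def prepare_categories(values, mincatsize=5, use_in_search=False):
    """Return the values that would take part in the split search.

    Categories seen in fewer than `mincatsize` observations are treated
    like missing values. With use_in_search=True they are merged into the
    missing pseudo category; otherwise they are left out of the search
    (assumed here to apply to truly missing values as well).
    """
    counts = Counter(v for v in values if v is not MISSING)
    rare = {v for v, c in counts.items() if c < mincatsize}
    prepared = []
    for v in values:
        if v is MISSING or v in rare:
            if use_in_search:
                prepared.append(MISSING)
            # else: the observation is excluded from the split search
        else:
            prepared.append(v)
    return prepared


values = ["a"] * 6 + ["b"] * 5 + ["c"] * 2 + [MISSING]
print(Counter(prepare_categories(values, use_in_search=True)))
print(Counter(prepare_categories(values)))
```

With the default of 5, category c (two observations) is folded into the missing pseudo category when USEINSEARCH is modeled, and is dropped from the search otherwise. Assigning those observations to a branch afterwards follows the missing-value policy and is not modeled here.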
MINWORTH=worth
specifies a threshold value for the worth of a candidate splitting rule, unless
the CRITERION= option in the PROC ARBORETUM statement is specified as
PROBCHISQ or PROBF. A candidate rule whose worth is less than the value of worth
is discarded. The default value is 0. When CRITERION=PROBCHISQ or PROBF, the
MINWORTH= option is ignored and the ALPHA= option is used instead.
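The interaction between the threshold and the criterion can be sketched as follows. This Python fragment is a hedged illustration: the function name and data layout are hypothetical, and the ALPHA= test that replaces the threshold under PROBCHISQ or PROBF is deliberately left unmodeled because this section does not describe it.

```python
P_VALUE_CRITERIA = {"PROBCHISQ", "PROBF"}


def apply_minworth(candidates, minworth=0.0, criterion=None):
    """Drop candidate rules whose worth is below the threshold.

    candidates: list of (input_name, worth). Illustrative helper only.
    Under PROBCHISQ or PROBF the threshold is ignored; the ALPHA= test
    used in that case is not modeled, so candidates pass through unchanged.
    """
    if criterion is not None and criterion.upper() in P_VALUE_CRITERIA:
        return list(candidates)
    return [(name, worth) for name, worth in candidates if worth >= minworth]


candidates = [("income", 0.42), ("age", 0.08), ("region", 0.01)]
print(apply_minworth(candidates, minworth=0.05))
print(apply_minworth(candidates, minworth=0.05, criterion="PROBF"))
```

With a threshold of 0.05 the rule on the hypothetical input region is discarded, while the same call under PROBF leaves the list untouched because the threshold does not apply.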
specifies that training proceed from all leaves descended from the nodes in the list.
By default, the training proceeds from all leaves.
SEARCHBINS=n
specifies the maximum number of consolidated input values to use in an exhaustive
or heuristic split search. See the “Split Search Algorithm” section beginning on page
47 for a complete explanation. The default value of n is 15 times the value specified
in the MAXBRANCH= option in the TRAIN or PROC ARBORETUM statement. If
the MAXBRANCH= option is unspecified, the default value of the SEARCHBINS=
option is 30.
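The default just described is simple arithmetic, shown here as a Python sketch. The function name is hypothetical, and the range check reuses the 2 through 50 limit stated for the MAXBRANCH= option in this section.

```python
def default_searchbins(maxbranch=None):
    """Default SEARCHBINS= value as described in the text (illustrative)."""
    if maxbranch is None:
        # MAXBRANCH= not specified: the text gives a default of 30.
        return 30
    if not 2 <= maxbranch <= 50:
        raise ValueError("MAXBRANCH= must be an integer from 2 through 50")
    return 15 * maxbranch


for mb in (None, 2, 4):
    print(mb, default_searchbins(mb))
```

The unspecified case agrees with the formula, because 15 times the MAXBRANCH= default of 2 is also 30.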