Split Search Algorithm
47
categories. J − J
rare
is the number of remaining categories. The sampling algorithm
selects s = (n−n
rare
)/(J −J
rare
) observations from each of the J −J
rare
remaining
categories. If s is not an integer, then the sample will contain one more observation
from some of J − J
rare
categories than others so that the total equals (n − n
rare
).
The sampling algorithm depends only on the order of the observations in the training
data and not on other random factors.
When the node is split into branches, all the observations are passed to the branches,
and new samples are created in each branch as needed.
Split Search Algorithm
The ARBORETUM procedure always selects a splitting rule with the largest worth
among all the splits evaluated. The algorithm is simple. The details, however, seem
complicated because of the many special situations.
The search for splitting and surrogate rules is done on the within-node training
sample.
The search involving a specific input variable eliminates observations
with a missing input value unless the MISSING= option for that input equals
USEINSEARCH. The search involving a specific categorical input variable elimi-
nates observations whose input value occurs less frequently in the within-node sample
than the threshold specified in the MINCATSIZE= option in the TRAIN statement.
The decision to eliminate an observation from the within-node sample is made in-
dependently for each input and for each node. An observation in which input W is
missing and input Z is not missing is eliminated from the search using W but not from
Z, unless MISSING=USEINSEARCH. An observation in which categorical input X
has a value that is common in the root node but occurs infrequently in some branch
is eliminated from searches in that branch but not the root node.
In most situations the algorithm sorts the observations. For interval and ordinal input
variables, the algorithm sorts the input values, and arbitrarily puts observations with
missing input values at the end. For nominal inputs with interval targets, the algo-
rithm sorts the input categories by the average target value among the observations in
the category. For nominal inputs and observations containing two categorical target
values in the search sample, the algorithm sorts the input categories by the proportion
of one of the target values. In the remaining situation, the algorithm does not sort.
The remaining situation consists of a nominal input and a categorical target with at
least three different values in the sample presented to the algorithm.
After the algorithm sorts, if it sorts at all, the algorithm passes over the data evaluating
every permissible binary split preserving the sort. A split between tied input values
or categories with the same average target value is not permitted, nor is a split that
leaves fewer observations in a branch than specified in the LEAFSIZE= option in the
TRAIN statement.
If the MAXBRANCH= option in the TRAIN statement specifies binary splits, then
the search is finished; the best split evaluated is the best binary split. Otherwise
the algorithm consolidates the observations into N bins, where N is less than or
equal to the limit specified in the SEARCHBINS= option in the TRAIN statement.
48
The ARBORETUM Procedure
Observations collected into the same bin remain together during the search and are
assigned to the same branch.
The consolidation algorithm uses the best of the binary split points already found
to create the bins, subject to some constraints. If the input variable is interval or
ordinal and missing values appear in the sample (and are therefore acceptable values),
then one bin is always reserved exclusively for missing values. For an interval input
X, if the number of distinct values of X is greater than the number specified in the
INTERVALBINS=n option in the TRAIN statement, then the split points will be at
least (max(X)−min(X)/(n+1) apart, where max(X) and min(X) are the maximum
and minimum values of the input in the search sample. The algorithm makes as many
bins as possible that satisfy these constraints.
Next the algorithm computes the number of candidate splits, including m-ary splits,
where 2 <= m <= n. If the number does not exceed the threshold specified in the
EXHAUSTIVE= option in the TRAIN statement, then all candidate splits are evalu-
ated, and one with the largest measure of worth is chosen. Otherwise the algorithm
uses a merge-and-shuffle heuristic approach to select which splits to evaluate.
The merge-and-shuffle algorithm first creates a branch for each bin. The algorithm
then merges a pair of branches, and then merges another pair, and so on until only
two branches remain. To choose a pair, the algorithm evaluates the worth of splitting
the merged candidate branch back into the original pair of branches, and selects a
pair that defines the splitting rule with the smallest worth. After each merge, the
algorithm reassigns a bin of observations to a different branch if the worth of the
splitting rule increases. The algorithm continues the reassignment of single bins as
long as the worth increases. The merge-and-shuffle algorithm evaluates many, but
not all, permissible splitting rules, and chooses the one with the largest worth. The
algorithm is heuristic because it does not guarantee that the best possible split is
found.
The previous paragraphs describe the search when the algorithm sorts the observa-
tions. The algorithm does not sort when the input variable is nominal and the target
variable is categorical with at least three categories in the sample being searched.
For this situation, if the number of input categories occurring in the search sample
is greater than the threshold specified in the SEARCHBINS= option, then categories
with the largest entropy of target values are consolidated into one bin, and the re-
maining N − 1 categories are assigned to the remaining N − 1 bins, one bin for each
category. The rest of the search is similar to the one for n-ary splits of sorted ob-
servations. If the number of candidate splits does not exceed the threshold specified
in the EXHAUSTIVE= option, then all candidate splits are evaluated. Otherwise the
algorithm uses the merge-and-shuffle heuristic approach.
Every rule includes an assignment of missing values to one or all branches, as de-
scribed in the
“Missing Values”
section on page 45.