The NSURRS= procedure option determines the number of surrogates sought. A surrogate is discarded if
its agreement is 1/B or less, where B is the number of branches. Consequently, a node might have fewer
surrogates than the number specified in the NSURRS= option.
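For illustration only (this is not PROC SPLIT code), the rule can be written as the following Python sketch; the function name and inputs are hypothetical.

    def keep_surrogate(agreement, n_branches):
        """Keep a candidate surrogate only if it agrees with the primary
        split more often than the chance rate of 1/B."""
        return agreement > 1.0 / n_branches

    print(keep_surrogate(0.40, 3))   # True: 0.40 > 1/3, surrogate kept
    print(keep_surrogate(0.25, 4))   # False: equals 1/4, surrogate discarded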
For a specific node and input, PROC SPLIT seeks the split with maximum worth or -log(p-value)
subject to the limit on the number of branches and the limit on the minimum number of observations
assigned to a branch. The procedure options MAXBRANCH= and LEAFSIZE= set these limits.
The measure of worth depends on the splitting criterion. The ENTROPY, GINI, and VARIANCE
reduction criteria measure worth as I(node) - sum over branches b of P(b)*I(b), where I(node) denotes the
entropy, Gini, or variance measure in the node, and P(b) denotes the proportion of observations in the
node assigned to branch b. If prior probabilities are specified, then the proportions P(b) are modified accordingly.
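For example, the following Python sketch (illustrative only, not PROC SPLIT code) evaluates this measure for the ENTROPY and GINI criteria on a nominal target; the class counts and function names are hypothetical.

    from math import log2

    def impurity(counts, criterion="entropy"):
        """Entropy or Gini impurity of a node given its class counts."""
        total = sum(counts)
        if total == 0:
            return 0.0
        probs = [c / total for c in counts if c > 0]
        if criterion == "entropy":
            return -sum(p * log2(p) for p in probs)
        return 1.0 - sum(p * p for p in probs)   # Gini

    def worth(node_counts, branch_counts, criterion="entropy"):
        """worth = I(node) - sum over branches b of P(b) * I(b)."""
        n = sum(node_counts)
        reduction = impurity(node_counts, criterion)
        for counts in branch_counts:
            p_b = sum(counts) / n
            reduction -= p_b * impurity(counts, criterion)
        return reduction

    # A node with class counts (60, 40) split into two branches.
    print(worth([60, 40], [[50, 10], [10, 30]], "gini"))   # about 0.163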
The PROBCHISQ and PROBF criteria use the -log(p-value) measure of worth. For these criteria, the best split is
the one with the smallest p-value. If the PADJUST=KASSBEFORE procedure option is in effect, as it is
by default, then the p-values are first multiplied by the appropriate Bonferroni factor. Adjusting the
p-values in this way is sometimes more conservative than an alternative adjustment
called Gabriel's adjustment. If so, then Gabriel's p-value is used.
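The exact Bonferroni factors depend on the measurement level of the input and are not given above. Purely as an illustration of the idea, the Python sketch below multiplies a raw p-value by the number of ways to assign c nominal categories to b branches (a Stirling number of the second kind, the classical Kass-style factor) and reports the -log(p-value) worth; the factor and the function names are assumptions, not the procedure's exact formula.

    from math import log10

    def stirling2(c, b):
        """Number of ways to partition c categories into b nonempty groups
        (Stirling number of the second kind), computed by recurrence."""
        if b == 0:
            return 1 if c == 0 else 0
        if b > c:
            return 0
        return b * stirling2(c - 1, b) + stirling2(c - 1, b - 1)

    def adjusted_worth(raw_p, n_categories, n_branches):
        """Bonferroni-style adjustment: multiply the raw p-value by the number
        of candidate splits, cap at 1, and report -log10(p) as the worth."""
        multiplier = stirling2(n_categories, n_branches)
        p_adj = min(1.0, raw_p * multiplier)
        return -log10(p_adj)

    # A raw p-value of 1e-6 for a 2-branch split of a 5-category nominal input.
    print(adjusted_worth(1e-6, 5, 2))   # multiplier is 15, worth about 4.82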
For nodes with many observations, the algorithm uses a sample for the split search, for computing the
worth, and for observing the limit on the minimum size of a branch. The NODESAMPLE= procedure
option specifies the size of the sample and is limited to 32,767 observations (ignoring any frequency
variable) on many computer platforms. The samples in different nodes are taken independently.
For nominal targets, the sample is as balanced as possible. Suppose for example that a node contains 100
observations of one value of a binary target, 1,000 observations of the other value, and
NODESAMPLE=200 or more. Then all 100 observations of the first target value are in the node sample.
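A minimal Python sketch of such a balanced sample follows; the sampling details inside the procedure may differ, and the helper shown is hypothetical.

    import random

    def balanced_node_sample(obs_by_class, node_sample=200, seed=1):
        """Draw a node sample that is as balanced across target values as
        possible: small classes are taken whole, and the leftover quota is
        split evenly among the larger classes."""
        rng = random.Random(seed)
        sample = []
        classes = sorted(obs_by_class, key=lambda c: len(obs_by_class[c]))
        quota = node_sample
        for i, cls in enumerate(classes):
            share = quota // (len(classes) - i)        # even share of what is left
            take = min(share, len(obs_by_class[cls]))
            sample.extend(rng.sample(obs_by_class[cls], take))
            quota -= take
        return sample

    # 100 observations of one value, 1,000 of the other, NODESAMPLE=200:
    data = {"A": list(range(100)), "B": list(range(1000))}
    print(len(balanced_node_sample(data, node_sample=200)))   # 200: all 100 A's plus 100 B's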
For binary splits on binary or interval targets, the optimal split is always found. In other situations, the
data is first consolidated, and then either all possible splits are evaluated or else a heuristic search is
used. Consolidation groups together observations that are likely to be assigned to the same branch in the
best split. The split search regards observations in the same consolidation group as having the same
input value, so the search is faster because fewer candidate splits need to be evaluated.
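The section above does not describe how the consolidation groups are formed. Purely to illustrate the effect of consolidation (the split search sees far fewer distinct values), the Python sketch below bins an interval input into a fixed number of quantile-style groups; this binning rule is an assumption for illustration, not the procedure's actual consolidation method.

    def consolidate_interval_input(values, max_groups=20):
        """Map each value of an interval input to one of at most max_groups
        bins; the split search then treats all values in a bin as equal."""
        ordered = sorted(set(values))
        if len(ordered) <= max_groups:
            return {v: i for i, v in enumerate(ordered)}
        # Cut points at roughly equally spaced positions in the ordered values.
        step = len(ordered) / max_groups
        cuts = [ordered[int(round((i + 1) * step)) - 1] for i in range(max_groups - 1)]
        def group(v):
            return sum(v > c for c in cuts)
        return {v: group(v) for v in set(values)}

    # 1,000 distinct values collapse into 20 groups, so the number of candidate
    # split points drops from 999 to at most 19.
    mapping = consolidate_interval_input(list(range(1000)), max_groups=20)
    print(len(set(mapping.values())))   # 20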
If, after consolidation, the number of possible splits is greater than the number specified in the
EXHAUSTIVE= procedure option, then a heuristic search is used. The heuristic algorithm alternately
merges branches and reassigns consolidated observations to different branches. The process stops when
a binary split is reached. Among all candidate splits considered, the one with the best worth is chosen.
The heuristic algorithm initially assigns each consolidated group of observations to a different branch,
even if the number of such branches exceeds the limit allowed in the final split. At each merge step, the
two branches that degrade the worth of the split the least are merged. After the branches are
merged, the algorithm considers reassigning consolidated groups of observations to different branches.
Each consolidated group is considered in turn, and the reassignment stops when no group is reassigned.
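The following Python sketch conveys the merge-and-reassign idea in a simplified form. The Gini-based worth function, the data structures, and the tie-breaking are placeholders, not the procedure's actual implementation.

    def add_counts(groups):
        """Element-wise sum of a list of class-count vectors."""
        return [sum(col) for col in zip(*groups)]

    def gini(counts):
        n = sum(counts)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def split_worth(branches, groups):
        """worth = I(node) - sum over branches b of P(b) * I(b), with Gini as I."""
        node = add_counts(groups)
        n = sum(node)
        w = gini(node)
        for br in branches:
            counts = add_counts([groups[i] for i in br])
            w -= (sum(counts) / n) * gini(counts)
        return w

    def heuristic_split(groups, max_branches):
        """Start with one branch per consolidated group, then alternately merge
        the two branches that hurt the worth least and greedily reassign groups,
        keeping the best candidate split that respects the branch limit."""
        branches = [[i] for i in range(len(groups))]
        best = None
        while True:
            if len(branches) <= max_branches:
                w = split_worth(branches, groups)
                if best is None or w > best[0]:
                    best = (w, [sorted(br) for br in branches])
            if len(branches) <= 2:            # stop once a binary split is reached
                break
            # Merge the pair of branches whose union degrades the worth the least.
            pairs = [(a, b) for a in range(len(branches)) for b in range(a + 1, len(branches))]
            a, b = max(pairs, key=lambda ab: split_worth(
                [branches[ab[0]] + branches[ab[1]]]
                + [br for i, br in enumerate(branches) if i not in ab], groups))
            merged = branches[a] + branches[b]
            branches = [merged] + [br for i, br in enumerate(branches) if i not in (a, b)]
            # Reassign groups one at a time while doing so improves the worth.
            changed = True
            while changed:
                changed = False
                for g in range(len(groups)):
                    src = next(i for i, br in enumerate(branches) if g in br)
                    if len(branches[src]) == 1:
                        continue              # do not empty a branch
                    for dst in range(len(branches)):
                        if dst == src:
                            continue
                        trial = [list(br) for br in branches]
                        trial[src].remove(g)
                        trial[dst].append(g)
                        if split_worth(trial, groups) > split_worth(branches, groups):
                            branches, changed = trial, True
                            break
        return best

    # Four consolidated groups with (class 0, class 1) counts, at most 2 branches.
    groups = [(30, 5), (20, 10), (5, 25), (2, 40)]
    print(heuristic_split(groups, max_branches=2))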
When the PROBCHISQ or PROBF criterion is used, the p-value of the selected split on an input is
subjected to further adjustments: KASSAFTER, DEPTH, DEVILLE, and INPUTS. If the adjusted p-value
is greater than or equal to the value specified in the WORTH= procedure option, the split is rejected.
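Schematically, in Python (illustrative only; the option handling inside the procedure is not shown):

    def split_is_rejected(adjusted_p, worth_option):
        """A split is rejected when its adjusted p-value reaches the
        significance level given in the WORTH= option."""
        return adjusted_p >= worth_option

    print(split_is_rejected(0.08, 0.05))   # True: split rejected
    print(split_is_rejected(0.01, 0.05))   # False: split accepted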
After a node is split, the newly created nodes are considered for splitting. This recursive process ends
when no node can be split. The reasons a node will not split are:
- The node contains too few observations, as specified in the SPLITSIZE= procedure option.
- The number of nodes on the path between the root node and the given node equals the number specified in the MAXDEPTH= procedure option.
- All observations in the node contain the same target value.
- No input is sufficiently predictive in the node.
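A schematic check of these conditions might look like the following Python sketch; the node summary passed as arguments and the function name are placeholders.

    def node_can_split(n_obs, depth, class_counts, best_adjusted_p,
                       splitsize, maxdepth, worth_level):
        """Return False when any stopping rule applies to the node."""
        if n_obs < splitsize:                  # too few observations (SPLITSIZE=)
            return False
        if depth >= maxdepth:                  # path to the root is too long (MAXDEPTH=)
            return False
        if sum(1 for c in class_counts if c > 0) <= 1:
            return False                       # the node is already pure
        if best_adjusted_p >= worth_level:     # no sufficiently predictive input
            return False
        return True

    print(node_can_split(500, 3, [200, 300], 0.001,
                         splitsize=100, maxdepth=6, worth_level=0.05))   # True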
A tree adapts itself to the training data and generally does not fit new data as well. Trees
that fit the training data too well often predict poorly when applied to new data.
When SPLITSIZE=, MAXDEPTH=, and WORTH= are set to extreme values and, for the PROBCHISQ and
PROBF criteria, PADJUST=NONE is specified, the tree is apt to grow until all observations in each leaf
contain the same target value. Such trees typically overfit the training data and predict new data poorly.
A primary consideration when developing a tree for prediction is deciding how large to grow the tree or,
equivalently, which nodes to prune from the tree.
The CHAID method of tree construction specifies a significance level of a Chi-square test to stop tree
growth. The authors of the C4.5 and Classification and Regression Trees methods argue that the right
thresholds for stopping tree construction are not knowable in advance, so theyrecommend growing a tree
too large and then pruning nodes off.
PROC SPLIT allows for both methods. The WORTH= option accepts the significance level used in
CHAID. After tree construction stops, regardless of why it stops, PROC SPLIT creates a sequence of
sub-trees of the original tree, one sub-tree for each possible number of leaves. For example, the sub-tree
chosen with three leaves has the best assessment value among all candidate sub-trees with three leaves.
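One simple way to picture such a sequence is the bottom-up pruning sketch below; this is only an illustration under a toy assessment measure and hypothetical data structures, not a description of how PROC SPLIT constructs the sequence.

    import copy

    class Node:
        def __init__(self, correct, children=None):
            self.correct = correct           # observations classified correctly if this node were a leaf
            self.children = children or []   # an empty list means the node is a leaf

    def all_nodes(node):
        return [node] + [m for child in node.children for m in all_nodes(child)]

    def leaves(node):
        return [node] if not node.children else [l for child in node.children for l in leaves(child)]

    def assessment(node):
        """Toy assessment value: total correctly classified observations over the leaves."""
        return sum(leaf.correct for leaf in leaves(node))

    def subtree_sequence(root):
        """Repeatedly collapse the internal node whose pruning hurts the assessment
        least, recording the best sub-tree found for each number of leaves."""
        tree = copy.deepcopy(root)
        sequence = {len(leaves(tree)): assessment(tree)}
        while len(leaves(tree)) > 1:
            # Candidates: internal nodes whose children are all leaves.
            candidates = [n for n in all_nodes(tree)
                          if n.children and all(not c.children for c in n.children)]
            prune = max(candidates,
                        key=lambda n: n.correct - sum(c.correct for c in n.children))
            prune.children = []              # the pruned node becomes a leaf
            sequence[len(leaves(tree))] = assessment(tree)
        return sequence

    # Root (55 correct as a leaf) splits into a node that splits again and a leaf.
    tree = Node(55, [Node(40, [Node(30), Node(18)]), Node(25)])
    print(subtree_sequence(tree))            # {3: 73, 2: 65, 1: 55}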
After the sequence of sub-trees is established, PROC SPLIT uses one of four methods to select which
sub-tree to use for prediction. The SUBTREE= procedure option specifies the method. If SUBTREE=n,
where n is a positive integer, then PROC SPLIT uses the largest sub-tree with, at most, n leaves.
If SUBTREE=ASSESSMENT, then PROC SPLIT uses the smallest sub-tree with the best assessment
value. The assessment is based on the validation data when available. (This differs from the construction