The ARBORETUM Procedure
specifies the requisite number of training observations a node must have for the
ARBORETUM procedure to consider splitting it. By default, n is twice the
value of the LEAFSIZE= option. For the LEAFSIZE=, MINCATSIZE=, and
SPLITSIZE= options in the TRAIN statement, and the NODESIZE= option in the
PERFORMANCE statement, the procedure counts the number of observations in a
node without adjusting the number with the values of the variable specified in the
specifies that no splitting rule will be based on an input variable used in a splitting
rule of an ancestor node.
The UNDO statement is an interactive training statement that undoes the most
recent PRUNE, SETRULE, SPLIT, or TRAIN statement issued since the most recent
INTERACT statement. The REDO statement may restore what the UNDO statement
undoes.
Form of a Splitting Rule
A splitting rule uses the value of a single input variable to assign an observation to a
branch. The branches are ordered and numbered consecutively starting with 1. Every
splitting rule (other than a surrogate rule) includes an assignment of missing values
to one or all branches, even if no missing values appear in the data. Rules defining a
branch exclusively for missing values assign the missing values to the last branch.
For interval and ordinal inputs, observations with smaller input values are assigned
to branches with smaller numbers. Consequently, a list of increasing input values
suffices to specify a splitting rule. A surrogate rule may disregard the ordering and
assign smaller values of the input to any branch.
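The way a list of increasing values determines the branch for an interval or
ordinal input can be sketched in a few lines of Python. This is only an
illustration, not the procedure's implementation: the function name
assign_branch and its arguments are hypothetical, and the handling of a value
exactly equal to a split point (sent to the higher-numbered branch here) is an
assumption that the text above does not settle.

```python
from bisect import bisect_right
import math


def assign_branch(value, split_points, missing_branch):
    """Return the 1-based branch number for an interval or ordinal input value.

    split_points   -- strictly increasing list of k-1 values defining k branches
    missing_branch -- branch number that receives missing values (None or NaN)

    Assumption: a value equal to a split point goes to the higher-numbered branch.
    """
    is_nan = isinstance(value, float) and math.isnan(value)
    if value is None or is_nan:
        return missing_branch
    # Count how many split points are <= value; smaller values get smaller branches.
    return bisect_right(split_points, value) + 1


# Two split points define three ordered branches; missing values go to branch 3.
for v in (5, 10, 25, None):
    print(v, "->", assign_branch(v, [10, 20], missing_branch=3))
```

Because the branches are ordered, a binary search over the split points is
enough; no per-branch interval needs to be stored.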
Rules need not assign any training observations to a particular branch.
specify them in the SETRULE or SPLIT interactive training statements.
The predicted proportions of categorical target values for an observation are called
the posterior probabilities of the target values.
For an observation assigned to a node, the posterior probabilities equal the predicted
within node probabilities, which are the proportions of the target values of all the
training observations assigned to the node, adjusted for prior probabilities (if any),
and not adjusted for any profit or loss coefficients.
For an observation assigned to more than one leaf with fractional weights that sum
predicted within node probabilities.
The within node probabilities for a split search are the proportions of the target values
in the within node training sample, adjusted for the bias from stratified sampling,
adjusted for prior probabilities if requested by the SPLITSEARCH option in
the PROC ARBORETUM statement, and adjusted for profit or loss coefficients if
requested by the DECSEARCH option in the PROC statement.
When neither priors, profits, nor losses are specified, and observations are assigned
to a single leaf, and within node sampling is not used, the posterior and within node
probabilities are simply the proportions of the target values in a node τ:
    p_j(τ) = N_j(τ) / N(τ)
where N(τ) is the number of training observations in τ, and
N_j(τ) is the number of training observations in τ with target value j.
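The simple-proportion case described above can be checked with a short Python
sketch. The function name within_node_proportions is hypothetical, and the
sketch covers only the case with no priors, profits, losses, or within node
sampling.

```python
from collections import Counter


def within_node_proportions(targets):
    """Return {target value j: N_j / N} for the training targets in one node."""
    n = len(targets)
    if n == 0:
        raise ValueError("node has no training observations")
    counts = Counter(targets)  # N_j for each target value j present in the node
    return {j: count / n for j, count in counts.items()}


# A node holding four training observations, three of them with target "a".
print(within_node_proportions(["a", "a", "b", "a"]))
```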
N_j(root) is the number of training observations in the root node with target
value j.
depends on whether p_j incorporates priors, profits, or losses. Table
by type of quantity being incorporated.
number of j observations in root
Profit or Loss
The PRIORVAR option in the DECISION statement declares the existence of prior
probabilities. If prior probabilities exist, they are always incorporated in the posterior
probabilities. If the PRIORSEARCH option is specified in the PROC ARBORETUM
statement, the priors will also be incorporated in the search for a splitting rule. If
the PRIORS option in the ASSESS statement is specified, the priors will also be
incorporated in the evaluation of subtrees and consequently influence which nodes
are automatically pruned.
In all cases, the priors are incorporated by defining the within node probabilities
above with ρ_j, the prior probability of target value j, or, when incorporating a
profit or a loss, π_j, the altered prior probability defined below.
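One widely used way to fold priors into within node probabilities is the form
given in Breiman et al. (1984): each class count in the node is weighted by the
class prior divided by the class count in the root node, and the weights are
then normalized. The Python sketch below assumes that textbook form. It is not
taken from the procedure's own definition, and the function and argument names
are hypothetical.

```python
def prior_adjusted_probabilities(node_counts, root_counts, priors):
    """Within node probabilities adjusted for priors (Breiman et al. form).

    node_counts -- {j: number of training observations in the node with target j}
    root_counts -- {j: number of training observations in the root with target j}
    priors      -- {j: prior probability of target value j}
    """
    # Weight each class by prior * (share of that class's root cases in this node).
    weighted = {
        j: priors[j] * node_counts.get(j, 0) / root_counts[j] for j in priors
    }
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("node has no weighted observations")
    return {j: w / total for j, w in weighted.items()}


root = {"yes": 100, "no": 900}
node = {"yes": 50, "no": 50}
# Equal priors up-weight the class that is rare in the training data.
print(prior_adjusted_probabilities(node, root, {"yes": 0.5, "no": 0.5}))
```

When the priors equal the class proportions in the root node, this form reduces
to the simple within node proportions.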
The DECSEARCH option in the PROC ARBORETUM statement requests that the
split search for a nominal target incorporate the profit or loss functions specified in
the DECISION statement. Unequal misclassification costs of Breiman et al. (1984)
are a special case in which the decision alternatives equal the target values and the
DECDATA= data set is type LOSS. The ARBORETUM procedure generalizes the
method of altered priors introduced in Breiman et al.
The search incorporates the decisions, profit, or loss functions by using ρ_j
in the definition of within node probability, p_j. Let A_dj denote the coefficient for
decision d, target value j, in the decision matrix. Define
If the PRIORSSEARCH option is specified in the PROC ARBORETUM statement
defines the altered prior probability for target value j, where π_j denotes the prior
probability of j. Intuitively, the alteration inflates the prior probability for those target
values having large profit or loss coefficients, thereby giving observations with those
target values more weight in the split search. The search incorporates the altered
priors instead of incorporating the original priors.
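The inflation effect can be illustrated with the altered priors of Breiman et
al. (1984) for the special case of a loss matrix whose decisions equal the
target values. In that form, the prior of target value j is multiplied by the
total loss over all decisions for j and the results are renormalized. The
Python sketch below shows only that special case, not the procedure's
generalized definition. The function name altered_priors and the nested
dictionary layout loss[d][j] are assumptions made for illustration.

```python
def altered_priors(loss, priors):
    """Altered priors for a loss matrix (Breiman et al. special case).

    loss   -- {decision d: {target j: loss of choosing d when the target is j}}
    priors -- {target j: prior probability of j}
    """
    # Total loss attached to each target value, summed over all decisions.
    total_loss = {j: sum(row[j] for row in loss.values()) for j in priors}
    weighted = {j: total_loss[j] * priors[j] for j in priors}
    norm = sum(weighted.values())
    if norm == 0:
        raise ValueError("all losses are zero; altered priors are undefined")
    return {j: w / norm for j, w in weighted.items()}


# Deciding "no" for a true "yes" costs 5; deciding "yes" for a true "no" costs 1.
loss = {"yes": {"yes": 0, "no": 1}, "no": {"yes": 5, "no": 0}}
print(altered_priors(loss, {"yes": 0.5, "no": 0.5}))
```

With equal original priors, the target value that is more costly to
misclassify ("yes" in this example) ends up with the larger altered prior.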
If the PRIORSEARCH option is not specified, then the definition of π_j
N_j(τ)/N(τ), the simple proportion of observations having target
The ARBORETUM procedure searches for rules that maximize the measure of worth
associated with the splitting criterion specified in the CRITERION= option in the
PROC ARBORETUM statement. Some measures are based on a node impurity
measure, others on p-values of a statistical test. A p-value may be adjusted for the number