where Λ denotes the set of leaves, χ indicates either training or validation data, ω(τ, χ) is a node weight, λ(τ, χ) is an inclusion function used by the cumulative lift measures, and ψ(τ, χ) is a node statistic.
The node weight, ω(τ, χ), equals the proportion of observations in data set χ that are in τ, unless the assessment measure incorporates prior probabilities, in which case

ω(τ, χ) = Σ_j π(j) N_j(τ, χ) / N_j(root, χ)

where π(j) denotes the prior probability of target value j, and N_j(τ, χ) denotes the number of observations with target value j in data set χ in τ.
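As a concrete illustration, the prior-adjusted node weight can be sketched in a few lines of Python. The function name and dictionary layout are illustrative assumptions, not ARBORETUM internals.

```python
# Sketch of the node weight omega(tau, chi) described above
# (names and data layout are illustrative, not part of PROC ARBORETUM).

def node_weight(n_leaf, n_root, priors=None):
    """n_leaf[j] and n_root[j] hold the counts of target value j in the
    leaf and in the root node for data set chi.  With no priors, the
    weight is the plain proportion of observations in the leaf; with
    priors it is sum_j pi(j) * N_j(leaf) / N_j(root)."""
    if priors is None:
        return sum(n_leaf.values()) / sum(n_root.values())
    return sum(priors[j] * n_leaf[j] / n_root[j] for j in priors)

# Example: a binary target with equal priors.
w = node_weight({"0": 30, "1": 10}, {"0": 300, "1": 50},
                priors={"0": 0.5, "1": 0.5})
```

With equal priors the rare target value contributes as much to the weight as the common one, which is the point of the adjustment.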
The inclusion function, λ(τ, χ), equals 1 unless MEASURE=LIFT or MEASURE=LIFTPROFIT. These measures only use a proportion γ of the data to compute the assessment. The ARBORETUM procedure orders the leaves τ by descending values of ψ(τ, training). The first leaf has the largest value of ψ(τ, training), not the smallest. The cumulative lift measures use observations in leaves with large values of ψ.
Let the relation τ′ < τ stand for ψ(τ′, training) > ψ(τ, training). Define

Ω(τ, χ) = Σ_{τ′ ≤ τ} ω(τ′, χ)

where the sum is over τ and all leaves τ′ such that ψ(τ′, training) > ψ(τ, training).
For fixed χ and 0 < γ < 1, there exists a unique τ∗ such that Ω(τ∗, χ) ≥ γ and Ω(τ∗ − 1, χ) < γ. Define the inclusion function to be

λ(τ, χ) = 1, if τ < τ∗
λ(τ, χ) = (γ − Ω(τ∗ − 1, χ)) / ω(τ∗, χ), if τ = τ∗
λ(τ, χ) = 0, if τ > τ∗

This definition ensures that a proportion γ of the data is used in the cumulative lift measure, and will select a fraction of one particular leaf, τ∗, if the required number of observations, γN(root, χ), does not equal the number of observations in a set of whole leaves.
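The ordering-and-thresholding logic of the inclusion function can be sketched as follows. The list-based layout and the function name are illustrative assumptions, not the procedure's internals.

```python
# Sketch of the cumulative-lift inclusion function lambda(tau, chi).
# Leaves are ordered by descending psi(tau, training); whole leaves are
# included until the cumulative weight Omega reaches gamma, and the
# boundary leaf tau* is included fractionally.

def inclusion(psi_train, weights, gamma):
    """psi_train[t]: training node statistic of leaf t;
    weights[t]: node weight omega(t, chi); 0 < gamma < 1.
    Returns lambda(t, chi) for every leaf."""
    order = sorted(range(len(psi_train)), key=lambda t: -psi_train[t])
    lam = [0.0] * len(psi_train)
    cum = 0.0  # Omega(tau* - 1, chi): weight of whole leaves so far
    for t in order:
        if cum + weights[t] <= gamma:
            lam[t] = 1.0                          # whole leaf fits
            cum += weights[t]
        else:
            lam[t] = (gamma - cum) / weights[t]   # fraction of tau*
            break
    return lam
```

For example, with leaf weights 0.4, 0.3, 0.3 and γ = 0.5, the best leaf is taken whole and one third of the next-best leaf is included, so that exactly half the data contributes.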
In the definition of ψ(τ, χ) below, p_j(τ, χ) denotes the proportion of observations in τ in data set χ with target value j,

p_j(τ, χ) = N_j(τ, χ) / N(τ, χ)

where N(τ, χ) denotes the number of observations in data set χ in τ.
The remaining sections define ψ(τ, χ) for the different assessment measures.
For an interval target with MEASURE=PROFIT,

ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i ∈ τ} profit_i(τ)

where profit_i(τ) is the estimated profit or loss for observation i in τ.
For a categorical target with MEASURE=PROFIT,

ψ(τ, χ) = Σ_j p_j(τ, χ) D(j, d̂)

where D(j, d̂) is the coefficient in the decision matrix for target value j and decision d̂. ψ(τ, χ) represents profit, revenue, or loss according to whether the DECDATA= data set in the DECISION statement has type PROFIT, REVENUE, or LOSS. Note that ψ(τ, χ) does not incorporate decision costs, and therefore does not represent profit if the DECDATA= data set has type REVENUE.
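A minimal sketch of this decision-matrix statistic follows; the names `p`, `D`, and `node_profit`, and the example decisions, are illustrative, not part of the procedure.

```python
# Sketch of psi(tau, chi) = sum_j p_j(tau, chi) * D(j, d_hat), where D
# is the decision matrix and d_hat the decision chosen for the node.

def node_profit(p, D, d_hat):
    """p[j]: class proportions in the node for data set chi;
    D[j][d]: decision-matrix coefficient for target value j under
    decision d."""
    return sum(p[j] * D[j][d_hat] for j in p)

# Example: binary target with hypothetical decisions solicit/ignore.
p = {"buy": 0.3, "no_buy": 0.7}
D = {"buy":    {"solicit": 10.0, "ignore": 0.0},
     "no_buy": {"solicit": -1.0, "ignore": 0.0}}
psi = node_profit(p, D, "solicit")   # 0.3*10.0 + 0.7*(-1.0) = 2.3
```

Note that, as the text warns, this quantity is only profit net of costs if the decision matrix itself encodes net coefficients.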
For a categorical target with MEASURE=MISC, the misclassification rate is

ψ(τ, χ) = 1 − p_ĵ(τ, χ)

where ĵ is the predicted target value in the node.
For an interval target using MEASURE=ASE,

ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i ∈ τ} (y_i − ȳ(τ))²

where y_i denotes the target value of observation i, and ȳ(τ) is the average of the target variable among the training observations in node τ.
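Concretely, the interval-target statistic is the mean squared deviation of the node's observations from the node mean computed on the training data; a small sketch (names illustrative):

```python
# Sketch of the interval-target ASE statistic psi(tau, chi): the mean
# squared deviation from y_bar(tau), which is always the node mean of
# the TRAINING data, even when chi is the validation set.

def interval_ase(y_values, y_bar_train):
    """y_values: target values of the observations in the node for data
    set chi; y_bar_train: node mean from the training data."""
    return sum((y - y_bar_train) ** 2 for y in y_values) / len(y_values)
```

Keeping ȳ(τ) fixed at its training value is what makes the validation statistic an honest assessment of the fitted tree.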
For a categorical target using MEASURE=ASE,

ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i ∈ τ} Σ_j (δ_ij − p_j(τ, training))²

where δ_ij equals 1 if observation i has target value j, and equals 0 otherwise, and p_j(τ, training) is the predicted probability of target value j for observations in τ. The formula simplifies to

ψ(τ, χ) = 1 − 2 Σ_j p_j(τ, χ) p_j(τ, training) + Σ_j p_j(τ, training)²

If the assessment measure incorporates prior probabilities (if any) and χ represents the training data, the formula reduces to

ψ(τ, training) = 1 − Σ_j p_j(τ, training)²
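The simplification can be checked numerically. The sketch below (names and layout illustrative) compares the direct per-observation sum with the closed form.

```python
# Numerical check that averaging sum_j (delta_ij - p_j)^2 over the
# observations equals 1 - 2*sum_j q_j*p_j + sum_j p_j^2, where q_j are
# the observed class proportions in chi and p_j the training
# predictions (q_j = p_j when chi is the training data).

def ase_direct(targets, p):
    classes = list(p)
    total = 0.0
    for y in targets:
        total += sum(((1.0 if y == j else 0.0) - p[j]) ** 2
                     for j in classes)
    return total / len(targets)

def ase_simplified(targets, p):
    q = {j: sum(y == j for y in targets) / len(targets) for j in p}
    return (1 - 2 * sum(q[j] * p[j] for j in p)
            + sum(v * v for v in p.values()))

obs = ["a", "a", "b", "a", "b"]
p = {"a": 0.7, "b": 0.3}
```

For these values both forms give 0.5, while the training-data shortcut 1 − Σ_j p_j² would give 1 − 0.58 = 0.42, since here the observed proportions (0.6, 0.4) differ from the predictions.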
For a categorical target using MEASURE=LIFT, ψ(τ, χ) = p_e(τ, χ), the proportion of observations in τ in data set χ with the event target value e.
When the ARBORETUM procedure begins, it reserves memory in the computer for
the calculations necessary for growing the tree. Later the procedure will read the
entire training data and perform as many tasks as the reserved memory can accom-
modate, postponing other tasks for a subsequent pass of the data. Typically, the pro-
cedure spends most of its time accessing the data, and therefore reducing the number
of passes of the data will also reduce the execution time.
Passes Over the Data
Each of the following tasks for a node requires a pass over the entire training data:
• compute node statistics
• search for a split on an input variable
• determine a rule for missing values for a specified split
• search for a surrogate rule on an input variable
If only one task were done at a time, the number of passes over the training data
would approximately equal the number of nodes times the number of input variables.
Surrogate splits would require more passes. The number of additional passes equals
the number of inputs minus one. The actual number is typically less for three reasons.
First, if no split on an input variable is found in a node, then no search is attempted on
that input in any descendent node. (See the description of the
in the TRAIN statement for some situations in which no split exists on an input.)
Second, the procedure does not search for any splits in nodes at the depth specified in the MAXDEPTH= option in the TRAIN statement. Third, given sufficient memory, the procedure may perform several tasks during the same pass.
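As a rough illustration of why sharing passes matters, the pass counts discussed above can be modeled in a few lines. This is a back-of-the-envelope sketch under the stated approximations (one surrogate search per remaining input per node), not the procedure's actual scheduler.

```python
import math

# Back-of-the-envelope model of passes over the training data
# (illustrative only, not PROC ARBORETUM internals).

def passes_one_task_per_pass(n_nodes, n_inputs, surrogates=False):
    """One pass per (node, input) split search; surrogate searches are
    assumed to add n_inputs - 1 further passes per node."""
    per_node = n_inputs + (n_inputs - 1 if surrogates else 0)
    return n_nodes * per_node

def passes_batched(n_nodes, n_inputs, tasks_per_pass):
    """With memory for tasks_per_pass concurrent tasks, several
    searches share each pass over the data."""
    return math.ceil(n_nodes * n_inputs / tasks_per_pass)
```

For example, 100 nodes and 20 inputs give about 2,000 passes one task at a time, but only about 100 passes if memory allows 20 tasks per pass, which is why reserving memory up front pays off.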
The procedure computes node statistics before beginning a split search in that node.
Consequently, creating a node and finding a split requires at least two passes of the
data. The procedure will search for a split in a node on every input variable in one
pass of the data if enough memory is available. The search for surrogate splits begins