where Λ denotes the set of leaves, χ indicates either training or validation data,
ω(τ, χ) is a weight for the node τ , λ(τ, χ) is an inclusion function for cumulative lift
measures, and ψ(τ, χ) is a node statistic.
The node weight, ω(τ, χ), equals the proportion of observations in data set χ in τ
unless the assessment measure incorporates prior probabilities, in which case,
    ω(τ, χ) = Σ_j π_j N_j(τ, χ) / N_j(root, χ)
where π_j denotes the prior probability of target value j, and N_j denotes the number of observations with target value j in data set χ in τ.
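As a concrete illustration, the following is a minimal Python sketch of this prior-weighted node weight; the priors and counts below are hypothetical values, not taken from the procedure.

```python
# Minimal sketch of the prior-weighted node weight
# omega(tau, chi) = sum_j pi_j * N_j(tau, chi) / N_j(root, chi).
# The priors and counts are hypothetical illustration values.

priors = {"event": 0.1, "nonevent": 0.9}    # pi_j
n_node = {"event": 30, "nonevent": 120}     # N_j(tau, chi)
n_root = {"event": 200, "nonevent": 1800}   # N_j(root, chi)

omega = sum(priors[j] * n_node[j] / n_root[j] for j in priors)
print(omega)   # approximately 0.075
```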
The inclusion function, λ(τ, χ), equals 1 unless MEASURE=LIFT or MEASURE=LIFTPROFIT. These measures use only a proportion γ of the data to compute the assessment. The ARBORETUM procedure orders the leaves τ by descending values of ψ(τ, training): the first leaf has the largest value of ψ(τ, training), not the smallest. The cumulative lift measures use observations in leaves with large values of ψ.
Let the relation τ′ < τ stand for ψ(τ′, training) > ψ(τ, training). Define

    Ω(τ, χ) = Σ_{τ′ < τ} ω(τ′, χ)
Intuitively, Ω(τ, χ) is the proportion of observations in the χ data set that lie in leaves τ′ such that ψ(τ′, training) > ψ(τ, training).
For fixed χ and 0 < γ < 1, there exists a unique τ∗ such that Ω(τ∗, χ) ≥ γ and Ω(τ∗ − 1, χ) < γ, where τ∗ − 1 denotes the leaf immediately preceding τ∗ in the ordering. Define the inclusion function to be
    λ(τ, χ) =
        1                                  if τ < τ∗
        (γ − Ω(τ∗ − 1, χ)) / ω(τ∗, χ)      if τ = τ∗
        0                                  if τ > τ∗
Note that 0 < λ(τ ∗, χ) ≤ 1. Intuitively, λ(τ, χ) selects which leaves to include
in the cumulative lift measure, and will select a fraction of one particular leaf, τ ∗,
if the required number of observations, γN(root, χ), does not equal the number of
observations in a set of whole leaves.
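To make the selection concrete, here is a minimal Python sketch of the inclusion function; the leaf weights and γ below are hypothetical, and the leaves are assumed to be pre-sorted by descending ψ(τ, training).

```python
# Minimal sketch of the inclusion function lambda(tau, chi) for the
# cumulative lift measures. Leaves are assumed already sorted by
# descending psi(tau, training); the weights and gamma are hypothetical.

weights = [0.10, 0.15, 0.20, 0.25, 0.30]   # omega(tau, chi), best leaf first
gamma = 0.40                               # proportion of data to use

def inclusion(weights, gamma):
    """Return lambda(tau, chi) for each leaf in order."""
    lam = []
    cum = 0.0                    # Omega accumulated over the preceding leaves
    for w in weights:
        if cum + w <= gamma:     # tau < tau*: whole leaf is included
            lam.append(1.0)
        elif cum < gamma:        # tau = tau*: include a fraction of the leaf
            lam.append((gamma - cum) / w)
        else:                    # tau > tau*: leaf is excluded
            lam.append(0.0)
        cum += w
    return lam

lam = inclusion(weights, gamma)
print(lam)   # first two leaves fully included, third fractionally, rest excluded
```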
In the definition of ψ(τ, χ) below, p_j(τ, χ) denotes the proportion of observations with target value j in data set χ in τ. If the assessment measure incorporates prior probabilities,

    p_j(τ, χ) = [π_j N_j(τ, χ) / N_j(root, χ)] / [Σ_i π_i N_i(τ, χ) / N_i(root, χ)]
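The renormalization can be sketched in a few lines of Python; as before, the priors and counts are hypothetical illustration values.

```python
# Sketch of the prior-adjusted proportion p_j(tau, chi). The priors and
# counts are hypothetical illustration values.

priors = {"event": 0.1, "nonevent": 0.9}    # pi_j
n_node = {"event": 30, "nonevent": 120}     # N_j(tau, chi)
n_root = {"event": 200, "nonevent": 1800}   # N_j(root, chi)

# Numerator for each j, then renormalize over all target values i.
raw = {j: priors[j] * n_node[j] / n_root[j] for j in priors}
total = sum(raw.values())
p = {j: raw[j] / total for j in raw}
print(p)   # the adjusted proportions sum to 1
```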
The remaining sections define ψ(τ, χ) for the different assessment measures.
Formula for Profit and Loss
For an interval target with MEASURE=PROFIT,
    ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i=1}^{N(τ,χ)} E_i(τ)

where E_i(τ) is the estimated profit or loss for observation i in τ.
For a categorical target,

    ψ(τ, χ) = Σ_j A_{jd̂} p_j(τ, χ)

where d̂ is the node decision, and A_{jd̂} is the coefficient in the decision matrix for target value j, decision d̂. ψ(τ, χ) represents profit, revenue, or loss according to whether the DECDATA= data set in the DECISION statement has type PROFIT, REVENUE, or LOSS. Note that ψ(τ, χ) does not incorporate decision costs, and therefore does not represent profit if the DECDATA= data set has type REVENUE.
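The formula above can be sketched as follows; the decision matrix, decision names, and proportions are hypothetical, and the node decision d̂ is assumed here to maximize expected profit over the available decisions.

```python
# Sketch of psi(tau, chi) = sum_j A_{j,d_hat} * p_j(tau, chi) for a
# categorical target. The decision matrix, decisions, and proportions are
# hypothetical; d_hat is assumed to maximize expected profit.

profit_matrix = {                 # A[j][d]: row = target value, column = decision
    "event":    {"solicit": 10.0, "ignore": 0.0},
    "nonevent": {"solicit": -1.0, "ignore": 0.0},
}
p = {"event": 0.2, "nonevent": 0.8}   # p_j(tau, chi)
decisions = ["solicit", "ignore"]

d_hat = max(decisions, key=lambda d: sum(p[j] * profit_matrix[j][d] for j in p))
psi = sum(p[j] * profit_matrix[j][d_hat] for j in p)
print(d_hat, psi)
```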
Formula for Misclassification Rate
    ψ(τ, χ) = Σ_{j ≠ ĵ} p_j(τ, χ)

where ĵ is the predicted target value in the node.
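Equivalently, the misclassification rate is one minus the proportion for the predicted value. A small Python sketch, with hypothetical proportions and ĵ taken to be the most probable target value:

```python
# Sketch of the misclassification rate psi(tau, chi) = sum over j != j_hat
# of p_j(tau, chi). The proportions are hypothetical, and j_hat is taken
# to be the most probable target value.

p = {"a": 0.5, "b": 0.3, "c": 0.2}    # p_j(tau, chi)
j_hat = max(p, key=p.get)             # predicted target value in the node
psi = sum(v for j, v in p.items() if j != j_hat)
print(psi)   # equals 1 - p[j_hat]
```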
Formula for Average Square Error and Gini
For an interval target using MEASURE=ASE,

    ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i=1}^{N(τ,χ)} (y_i − ŷ(τ))²

where ŷ(τ) is the average of the target variable among the training observations in node τ.
For a categorical target using MEASURE=ASE,

    ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i=1}^{N(τ,χ)} Σ_{j=1}^{J} (δ_ij − p̂_j(τ))²

where δ_ij equals 1 if observation i has target value j, and equals 0 otherwise, and p̂_j(τ) = p_j(τ, training), the predicted probability of target value j for observations in τ. A simpler, equivalent expression is
    ψ(τ, χ) = 1 − 2 Σ_j p_j(τ, χ) p̂_j(τ) + Σ_j p̂_j(τ)²
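As a quick sanity check, the following Python sketch verifies numerically, on made-up data, that the observation-level sum and the simpler expression agree.

```python
# Numeric check, on made-up data, that the observation-level average
# square error equals the simpler expression
# 1 - 2*sum_j p_j(tau, chi)*p_hat_j(tau) + sum_j p_hat_j(tau)^2.

targets = ["yes", "yes", "no", "yes", "no"]   # observed target values in tau
p_hat = {"yes": 0.7, "no": 0.3}               # p_hat_j(tau) = p_j(tau, training)
values = list(p_hat)
n = len(targets)

# Direct form: average over observations of sum_j (delta_ij - p_hat_j)^2.
direct = sum(
    sum(((1.0 if y == j else 0.0) - p_hat[j]) ** 2 for j in values)
    for y in targets
) / n

# Simplified form, with p_j(tau, chi) computed from the same observations.
p = {j: targets.count(j) / n for j in values}
simple = (1 - 2 * sum(p[j] * p_hat[j] for j in values)
            + sum(p_hat[j] ** 2 for j in values))

print(direct, simple)   # the two forms agree
```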
If the assessment measure incorporates prior probabilities (if any) and χ represents
the training data, then the expression reduces to the Gini index,
    ψ(τ, training) = 1 − Σ_j p_j(τ, training)²
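The reduction can also be checked numerically; in this Python sketch the proportions are hypothetical, and p̂_j(τ) is set equal to p_j(τ, training) as in the training-data case.

```python
# Check, with hypothetical proportions, that setting
# p_hat_j(tau) = p_j(tau, training) reduces the ASE expression to the
# Gini index 1 - sum_j p_j(tau, training)^2.

p = {"yes": 0.6, "no": 0.4}                 # p_j(tau, training)
s = sum(v * v for v in p.values())          # sum_j p_j^2

ase_form = 1 - 2 * s + s    # 1 - 2*sum_j p_j*p_hat_j + sum_j p_hat_j^2
gini = 1 - s
print(ase_form, gini)   # both equal the Gini index
```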
Formula for Lift
For MEASURE=LIFT,
    ψ(τ, χ) = p_event(τ, χ)
For MEASURE=LIFTPROFIT, ψ(τ, χ) is the same as defined for MEASURE=PROFIT.
Performance Considerations
When the ARBORETUM procedure begins, it reserves memory in the computer for the calculations necessary for growing the tree. Later, the procedure reads the entire training data and performs as many tasks as the reserved memory can accommodate, postponing other tasks for a subsequent pass over the data. Typically, the procedure spends most of its time accessing the data, so reducing the number of passes over the data also reduces the execution time.
Passes Over the Data
Each of the following tasks for a node requires a pass over the entire training data:
• compute node statistics
• search for a split on an input variable
• determine a rule for missing values for a specified split
• search for a surrogate rule on an input variable
If only one task were done at a time, the number of passes over the training data
would approximately equal the number of nodes times the number of input variables.
Surrogate splits would require more passes. The number of additional passes equals
the number of inputs minus one. The actual number is typically less for three reasons.
First, if no split on an input variable is found in a node, then no search is attempted on that input in any descendant node. (See the description of the MAXRULES= option in the TRAIN statement for some situations in which no split exists on an input.)
Second, the procedure does not search for any splits in nodes at the depth specified in
the MAXDEPTH= option in the TRAIN statement. Third, given sufficient memory,
the procedure may perform several tasks during the same pass.
The procedure computes node statistics before beginning a split search in that node.
Consequently, creating a node and finding a split requires at least two passes of the
data. The procedure will search for a split in a node on every input variable in one
pass of the data if enough memory is available. The search for surrogate splits begins