The ARBORETUM Procedure
SPLITSIZE=n
specifies the minimum number of training observations a node must have for the
ARBORETUM procedure to consider splitting it. By default, n is twice the
value of the LEAFSIZE= option. For the LEAFSIZE=, MINCATSIZE=, and
SPLITSIZE= options in the TRAIN statement, and the NODESIZE= option in the
PERFORMANCE statement, the procedure counts the observations in a node
without weighting the count by the values of the variable specified in the
FREQ statement.
USEVARONCE
specifies that no splitting rule will be based on an input variable used in a splitting
rule of an ancestor node.
UNDO Statement
UNDO ;
The UNDO statement is an interactive training statement that undoes the most
recent PRUNE, SETRULE, SPLIT, or TRAIN statement issued since the most recent
INTERACT statement. The REDO statement may restore what the UNDO statement
undoes.
Details
Form of a Splitting Rule
A splitting rule uses the value of a single input variable to assign an observation to a
branch. The branches are ordered and numbered consecutively starting with 1. Every
splitting rule (other than a surrogate rule) includes an assignment of missing values
to one or all branches, even if no missing values appear in the data. A rule that defines a
branch exclusively for missing values assigns the missing values to the last branch.
For interval and ordinal inputs, observations with smaller input values are assigned
to branches with smaller numbers. Consequently, a list of increasing input values
suffices to specify a splitting rule. A surrogate rule may disregard the ordering and
assign smaller values of the input to any branch.
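The ordering property above can be illustrated with a minimal Python sketch. It is not part of the ARBORETUM procedure: the function name `assign_branch`, its parameters, and the convention that a value equal to a cut point goes to the higher branch are all assumptions made for illustration. The sketch also handles only the case where missing values go to a single branch.

```python
import bisect
import math

def assign_branch(value, cut_points, missing_branch):
    """Return the 1-based branch number for one interval input value.

    cut_points     -- increasing list of k-1 values that define k ordered branches
    missing_branch -- branch number that receives missing values (hypothetical name)
    """
    # Treat None and NaN as missing and send them to the designated branch.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return missing_branch
    # Assumption: values strictly below a cut point fall in the lower branch,
    # so a value equal to a cut point goes to the higher branch.
    return bisect.bisect_right(cut_points, value) + 1

# Two cut points give three ordered branches; branch 4 is reserved for missing values.
print(assign_branch(5.0, [10.0, 20.0], missing_branch=4))
print(assign_branch(15.0, [10.0, 20.0], missing_branch=4))
print(assign_branch(None, [10.0, 20.0], missing_branch=4))
```

Because smaller values always map to smaller branch numbers, the list of cut points alone is enough to describe the rule for non-missing values.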
Rules need not assign any training observations to a particular branch. The
ARBORETUM procedure does not automatically generate such rules, but a user may
specify them in the SETRULE or SPLIT interactive training statements.
Posterior and Within Node Probabilities
The predicted proportions of categorical target values for an observation are called
the posterior probabilities of the target values.
For an observation assigned to a node, the posterior probabilities equal the predicted
within node probabilities, which are the proportions of the target values of all the
training observations assigned to the node, adjusted for prior probabilities (if any),
and not adjusted for any profit or loss coefficients.
For an observation assigned to more than one leaf with fractional weights that sum
to one, the posterior probabilities are the weighted averages over the leaves of the
predicted within node probabilities.
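As a worked illustration of the weighted average, the following Python sketch combines the within node probabilities of several leaves. The function name `posterior_over_leaves` and the dictionary layout are hypothetical choices for this example, not part of the procedure.

```python
def posterior_over_leaves(leaf_probs, weights):
    """Weighted average of within node probabilities over several leaves.

    leaf_probs -- one dict per leaf, mapping target value -> within node probability
    weights    -- fractional weights of the observation in each leaf (must sum to one)
    """
    if len(leaf_probs) != len(weights):
        raise ValueError("need exactly one weight per leaf")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("fractional weights must sum to one")
    posterior = {}
    for probs, weight in zip(leaf_probs, weights):
        for target, p in probs.items():
            # Accumulate each leaf's contribution, scaled by its weight.
            posterior[target] = posterior.get(target, 0.0) + weight * p
    return posterior

# An observation split evenly between two leaves.
leaves = [{"yes": 0.8, "no": 0.2}, {"yes": 0.5, "no": 0.5}]
print(posterior_over_leaves(leaves, [0.5, 0.5]))  # "yes" is about 0.65, "no" about 0.35
```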
The within node probabilities for a split search are the proportions of the target values
in the within node training sample, adjusted for the bias from stratified sampling,
adjusted for prior probabilities if requested by the PRIORSSEARCH option in
the PROC ARBORETUM statement, and adjusted for profit or loss coefficients if
requested by the DECSEARCH option in the same statement.
When neither priors, profits, nor losses are specified, and observations are assigned
to a single leaf, and within node sampling is not used, the posterior and within node
probabilities are simply the proportions of the target values in a node τ :
$$p_j = \mathrm{proportion}_j(\tau) = \frac{N_j(\tau)}{N(\tau)}$$
where $N_j(\tau)$ is the number of training observations in $\tau$ with target value $j$.
When incorporating priors, profits, or losses, this becomes
$$p_j = \frac{\rho_j \, \mathrm{proportion}_j(\tau) / \mathrm{proportion}_j(\mathrm{root})}{\sum_i \rho_i \, \mathrm{proportion}_i(\tau) / \mathrm{proportion}_i(\mathrm{root})}$$
or equivalently,
$$p_j = \frac{\rho_j \, N_j(\tau) / N_j(\mathrm{root})}{\sum_i \rho_i \, N_i(\tau) / N_i(\mathrm{root})}$$
where $N_j(\mathrm{root})$ is the number of training observations in the root node with target
value $j$, and $\rho_j$ depends on whether $p_j$ incorporates priors, profits, or losses.
Table 7 defines $\rho_j$ by type of quantity being incorporated.
Table 7. $\rho_j$ by Type of Incorporated Quantity

Incorporated Quantity    $\rho_j$                Description
Nothing                  $N_j(\mathrm{root})$    number of j observations in root
Prior Probabilities      $\pi_j$                 prior probability
Profit or Loss           $\pi^a_j$               altered prior probability
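The within node probability formula and the choices of ρ_j in Table 7 can be checked numerically with a short Python sketch. The function name `within_node_probs` and its arguments are hypothetical; the sketch assumes every target value has at least one training observation in the root node.

```python
def within_node_probs(n_node, n_root, rho=None):
    """Within node probabilities p_j for one node.

    n_node -- dict: target value j -> N_j(tau), count in the node
    n_root -- dict: target value j -> N_j(root), count in the root (assumed > 0)
    rho    -- dict: target value j -> rho_j; None means "Nothing" in Table 7,
              that is, rho_j = N_j(root)
    """
    if rho is None:
        rho = dict(n_root)
    # Numerator terms rho_j * N_j(tau) / N_j(root), one per target value.
    terms = {j: rho[j] * n_node.get(j, 0) / n_root[j] for j in n_root}
    total = sum(terms.values())
    return {j: term / total for j, term in terms.items()}

root_counts = {"a": 900, "b": 100}
node_counts = {"a": 30, "b": 20}

# Nothing incorporated: reduces to the simple proportions 30/50 and 20/50.
print(within_node_probs(node_counts, root_counts))

# Equal prior probabilities: the rare value "b" gains weight in the node.
print(within_node_probs(node_counts, root_counts, rho={"a": 0.5, "b": 0.5}))
```

With ρ_j = N_j(root) the root counts cancel, which is why the first call returns the plain node proportions.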
Incorporating Prior Probabilities
The PRIORVAR option in the DECISION statement declares the existence of prior
probabilities. If prior probabilities exist, they are always incorporated in the posterior
probabilities. If the PRIORSSEARCH option is specified in the PROC ARBORETUM
statement, the priors will also be incorporated in the search for a splitting rule. If
the PRIORS option in the ASSESS statement is specified, the priors will also be
incorporated in the evaluation of subtrees and consequently influence which nodes
are automatically pruned.
In all cases, the priors are incorporated by defining the within node probabilities
above with $\rho_j = \pi_j$, the prior probability of target value $j$, or, when incorporating a
profit or a loss, $\rho_j = \pi^a_j$, the altered prior probability defined below.
Incorporating Decisions, Profit, and Loss
The DECSEARCH option in the PROC ARBORETUM statement requests that the
split search for a nominal target incorporate the profit or loss functions specified in
the DECISION statement. Unequal misclassification costs of Breiman et al. (1984)
are a special case in which the decision alternatives equal the target values and the
DECDATA= data set is type LOSS. The ARBORETUM procedure generalizes the
method of altered priors introduced in Breiman et al. (1984).
The search incorporates the decisions, profit, or loss functions by using $\rho_j = \pi^a_j$
in the definition of within node probability, $p_j$. Let $A_{jd}$ denote the coefficient for
decision $d$, target value $j$, in the decision matrix. Define
$$a_j = \sum_d |A_{jd}|$$
If the PRIORSSEARCH option is specified in the PROC ARBORETUM statement
requesting the search to incorporate prior probabilities, then
$$\pi^a_j = \frac{a_j \pi_j}{\sum_i a_i \pi_i}$$
defines the altered prior probability for target value $j$, where $\pi_j$ denotes the prior
probability of j. Intuitively, the alteration inflates the prior probability for those target
values having large profit or loss coefficients, thereby giving observations with those
target values more weight in the split search. The search incorporates the altered
priors instead of incorporating the original priors.
If the PRIORSSEARCH option is not specified, then the definition of $\pi^a_j$ changes by
replacing $\pi_j$ with $N_j(\tau)/N(\tau)$, the simple proportion of observations having target
value $j$:
$$\pi^a_j = \frac{a_j \, N_j(\tau)/N(\tau)}{\sum_i a_i \, N_i(\tau)/N(\tau)}$$
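The altered prior computation can be sketched in a few lines of Python. This is an illustration only: the function name `altered_priors`, the dictionary layout of the decision matrix, and the example coefficients are assumptions, and the sketch assumes at least one coefficient is nonzero so that the normalizing sum is positive.

```python
def altered_priors(decision_matrix, base):
    """Altered prior probabilities pi^a_j.

    decision_matrix -- dict: target value j -> list of coefficients A_jd over decisions d
    base            -- dict: target value j -> pi_j (prior probability), or the node
                       proportion N_j(tau)/N(tau) when priors are not used in the search
    """
    # a_j is the sum of absolute coefficients over all decisions for target value j.
    a = {j: sum(abs(coef) for coef in row) for j, row in decision_matrix.items()}
    total = sum(a[j] * base[j] for j in a)
    return {j: a[j] * base[j] / total for j in a}

# Hypothetical loss matrix with two decisions: misclassifying "bad" costs 5,
# misclassifying "good" costs 1.
losses = {"bad": [0.0, 5.0], "good": [1.0, 0.0]}
priors = {"bad": 0.1, "good": 0.9}
print(altered_priors(losses, priors))  # "bad" rises from 0.1 to about 0.357
```

The example shows the inflation described above: the target value with the larger coefficients receives a larger altered prior than its original prior.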
Splitting Criteria
The ARBORETUM procedure searches for rules that maximize the measure of worth
associated with the splitting criterion specified in the CRITERION= option in the
PROC ARBORETUM statement. Some measures are based on a node impurity
measure, others on p-values of a statistical test. A p-value may be adjusted for the number