For a particular node, variable, and number of branches, the procedure can ﬁnd the
statistic or χ-square statistic.
κ(τ, v, b). Otherwise no adjustment is made yet. The procedure repeats this for each
possible number of branches, producing a single candidate split for each number
of branches, and then chooses the one with the best adjusted or unadjusted p-value,
according to whether PADJUST=CHAIDBEFORE is or is not specified.
If PADJUST=CHAIDAFTER is specified, the p-value of the final candidate split in the
node for the variable is multiplied by κ(τ, v, b).
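The difference between the two CHAID options can be sketched in Python. This is only an illustration under stated assumptions: the function name `choose_split`, the tuple layout, and the `mode` argument are hypothetical and are not part of the ARBORETUM procedure, and the values of κ are taken as given rather than computed.

```python
def choose_split(candidates, mode=None):
    """Pick one candidate split for a variable in a node.

    candidates: list of (branches, p_value, kappa) tuples, one per number
                of branches (hypothetical layout; kappa is assumed given).
    mode: "before" mimics PADJUST=CHAIDBEFORE, "after" mimics
          PADJUST=CHAIDAFTER, None means no CHAID adjustment.
    Returns (branches, p_value) for the chosen candidate.
    """
    if mode == "before":
        # Adjust every candidate first, then compare adjusted p-values.
        adjusted = [(b, p * k) for b, p, k in candidates]
        return min(adjusted, key=lambda bp: bp[1])
    # Otherwise compare the unadjusted p-values.
    b, p, k = min(candidates, key=lambda c: c[1])
    if mode == "after":
        # Adjust only the winning candidate.
        p *= k
    return (b, p)
```

With the candidates (2 branches, p = 0.01, κ = 3) and (3 branches, p = 0.008, κ = 10), adjusting before the comparison selects the 2-branch split, whereas adjusting afterward selects the 3-branch split and then inflates its p-value.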
If either the PVARS=n or
PADJUST=DEPTH option is speciﬁed in the PROC ARBORETUM statement, the
p-value is further multiplied by a factor to adjust for the number of variables or the
ARBORETUM statement, the candidate is discarded, and the procedure proposes no
split of τ using the variable.
Adjusting p-Values for the Number of Input Values and Branches
The PADJUST=CHAIDAFTER or CHAIDBEFORE option in the PROC statement
requests the ARBORETUM procedure to multiply the p-value p of the χ² statistic
by a Bonferroni factor κ to adjust for using multiple significance tests. If κp is
larger than the p-value of an alternative conservative significance test called
Gabriel's, then Gabriel's p-value is used instead of κp unless the
PADJUST=NOGABRIEL option is specified.
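The rule for choosing between κp and Gabriel's p-value can be written as a small Python sketch. The function name and arguments are hypothetical, and both p-values and κ are assumed to have been computed elsewhere.

```python
def chaid_adjusted_p(p, kappa, gabriel_p, nogabriel=False):
    """Return the adjusted p-value for a candidate split.

    p:         unadjusted p-value of the split
    kappa:     Bonferroni factor (assumed given)
    gabriel_p: p-value of Gabriel's more conservative test (assumed given)
    nogabriel: True mimics PADJUST=NOGABRIEL
    """
    bonferroni_p = kappa * p
    if nogabriel:
        # Gabriel's bound is ignored; keep the Bonferroni-adjusted value.
        return bonferroni_p
    # Gabriel's p-value replaces kappa * p whenever it is smaller.
    return min(bonferroni_p, gabriel_p)
```

For example, with p = 0.01, κ = 10, and a Gabriel p-value of 0.04, the sketch returns 0.04 by default and κp (about 0.1) when `nogabriel=True`.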
Let B denote the number of branches, and c the number of input variable values
available to the split search. If the MISSING=USEINSEARCH option is speciﬁed in
the INPUT statement, c includes the missing value. For an interval input, c represents
values described in the "Split Search Algorithm" section
whether the MISSING=USEINSEARCH option is speciﬁed.
for non-nominal, without USEINSEARCH
The Bonferroni adjustment is described further in Kass (1980). Hawkins and Kass
(1982) suggested bounding κp with a p-value from a more conservative test. Unless
PADJUST=NOGABRIEL is specified,

    p = \min\left( \kappa \Pr(\chi^2_{(B-1)(J-1)}),\ \Pr(\chi^2_{(c-1)(J-1)}) \right)

where J is the number of target values.
Adjusting p-Values for the Depth of the Node
The PADJUST=DEPTH option in the PROC statement requests the ARBORETUM
procedure to multiply the p-value by a depth factor to account for the probability of
error in creating the current node. The unadjusted p-value estimates the probability
that the observed association between the target values and the split of the data into
subsets could happen by chance, given the existence of the current node. The depth
adjustment attempts to incorporate the probability that the current node being split is
a chance occurrence to begin with.
The depth factor for node τ is the product of the number of branches in each ancestor
of τ.
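As a minimal sketch of the depth factor, the following Python function multiplies the branch counts along the path from the root to the node. The function name and the list representation of the ancestors are hypothetical. Treating a node with no ancestors (the root) as having a factor of 1 is an assumption that follows from the empty product.

```python
from math import prod

def depth_factor(ancestor_branches):
    """Multiply the number of branches of every ancestor of a node.

    ancestor_branches: branch counts of the ancestors, root first
                       (hypothetical representation).
    An empty list (assumed to mean the root node) gives a factor of 1.
    """
    return prod(ancestor_branches)
```

For a node whose ancestors are a 2-way split followed by a 3-way split, `depth_factor([2, 3])` is 6, so the p-value of any split of that node would be multiplied by 6.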
The PVARS=m option in the PROC statement requests the ARBORETUM procedure
to adjust the p-value to account for multiple significance tests with independent
input variables. Let M(root) denote the number of input variables, and let M(τ)
denote the number of input variables for which the ARBORETUM procedure searches
for a splitting rule in a specific node. (M(τ) may be less than M(root) because
the ARBORETUM procedure does not search on variables that are constant in τ,
or on categorical variables that do not satisfy the MINCATSIZE= option in the
TRAIN statement, or on variables that have been excluded in an ancestor node.) The
ARBORETUM procedure multiplies the p-value by max((m/M(root)) M(τ), 1) to
adjust for the multiple tests on different input variables in the node. Specifying m = 0
requests the procedure to make no adjustment for the number of independent input
variables.
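The multiplier max((m/M(root)) M(τ), 1) can be checked with a short Python sketch. The function and argument names are hypothetical. Only the formula itself comes from the text.

```python
def pvars_factor(m, m_root, m_tau):
    """Factor applied to the p-value under PVARS=m.

    m:      value given to PVARS=
    m_root: M(root), the number of input variables
    m_tau:  M(tau), the number of inputs searched in the node
    """
    # The factor never drops below 1, so a p-value is never reduced.
    return max((m / m_root) * m_tau, 1)
```

With M(root) = 20 and M(τ) = 15, PVARS=10 gives a factor of 7.5, while PVARS=0 gives a factor of 1, which leaves the p-value unchanged.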
Splitting Criteria for an Ordinal Target
To evaluate splitting rules for an ordinal target, the ARBORETUM procedure uses
loss coefficients A_{jk} defining the penalty of misclassifying target value j as k. The
coefficients are specified with the DECDATA= option in the DECISION statement. For an
ordinal target, the decision matrix must have type LOSS, the decision alternatives must
equal the target values, and A_{jk} must be ≥ 0. By default, A_{jk} = |k − j|.
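The default loss coefficients can be illustrated with a short Python sketch. It assumes the ordinal target values are coded as consecutive ranks 0, 1, ..., n − 1. That coding and the function name are assumptions made for this example only.

```python
def default_loss_matrix(num_values):
    """Build the default loss matrix A[j][k] = |k - j|.

    num_values: number of ordinal target values, assumed to be coded
                as the ranks 0 .. num_values - 1.
    Row j is the true value; column k is the value it is classified as.
    """
    return [[abs(k - j) for k in range(num_values)]
            for j in range(num_values)]
```

For three target values this gives rows `[0, 1, 2]`, `[1, 0, 1]`, and `[2, 1, 0]`: misclassifying the lowest value as the highest costs twice as much as misclassifying it as the middle value.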
The loss coefficients are incorporated into the node impurity measure. Let
k(τ) denote a target value in node τ minimizing the loss,
i(τ ) = −