For a particular node, variable, and number of branches, the procedure can find the
best candidate without computing a p-value by finding the candidate with the largest
F statistic or chi-square statistic.
If the PADJUST=CHAIDBEFORE option is specified, the p-value is multiplied by
κ(τ, v, b). Otherwise no adjustment is made yet. This procedure repeats for each
possible number of branches, producing a single candidate split for each number
of branches, and then chooses the one with the best adjusted or unadjusted p-value,
according to whether PADJUST=CHAIDBEFORE is or is not specified.
If the PADJUST=CHAIDAFTER option is specified, the p-value of the final candidate
split in the node for the variable is multiplied by κ(τ, v, b).
If either the PVARS=n or PADJUST=DEPTH option is specified in the PROC ARBORETUM
statement, the p-value is further multiplied by a factor to adjust for the number of
variables or the depth of the node τ in the tree, to arrive at a final adjusted p-value
of the candidate split.
If the adjusted p-value is greater than the value of the ALPHA= option in the PROC
ARBORETUM statement, the candidate is discarded, and the procedure proposes no
split of τ using the variable.
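The difference between adjusting before and after the choice of the number of branches can be illustrated with a minimal Python sketch. The function name `choose_split`, the dictionary layout, and the sample numbers are hypothetical, the κ values are passed in rather than computed, and the further PVARS= and PADJUST=DEPTH factors are left out for brevity.

```python
def choose_split(p_by_branches, kappa_by_branches, padjust, alpha):
    """Pick one candidate split for a variable in a node (illustrative only).

    p_by_branches:     {number of branches: unadjusted p-value of the best
                        candidate with that many branches}
    kappa_by_branches: {number of branches: Bonferroni factor kappa}
    padjust:           "CHAIDBEFORE", "CHAIDAFTER", or None
    Returns (branches, p_value), or None if the candidate is discarded.
    """
    if padjust == "CHAIDBEFORE":
        # Adjust every candidate first, then compare adjusted p-values.
        adjusted = {b: p * kappa_by_branches[b] for b, p in p_by_branches.items()}
        best = min(adjusted, key=adjusted.get)
        p_value = adjusted[best]
    else:
        # Compare unadjusted p-values; adjust only the winner if requested.
        best = min(p_by_branches, key=p_by_branches.get)
        p_value = p_by_branches[best]
        if padjust == "CHAIDAFTER":
            p_value *= kappa_by_branches[best]
    # A candidate whose p-value exceeds ALPHA is discarded.
    return (best, p_value) if p_value <= alpha else None


# Hypothetical numbers: the 3-way split has the smaller raw p-value,
# but the larger Bonferroni factor.
p_raw = {2: 0.002, 3: 0.001}
kappa = {2: 6, 3: 15}
before = choose_split(p_raw, kappa, "CHAIDBEFORE", alpha=0.05)
after = choose_split(p_raw, kappa, "CHAIDAFTER", alpha=0.05)
```

With these made-up inputs, adjusting before the comparison favors the 2-way split, while adjusting afterward keeps the 3-way split and only inflates its p-value.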
Adjusting p-Values for the Number of Input Values and Branches
The PADJUST=CHAIDAFTER or CHAIDBEFORE option in the PROC statement
requests the ARBORETUM procedure to multiply the p-value of the χ² statistic
computed for the PROBCHISQ criterion for a nominal target by a Bonferroni factor κ to
adjust for using multiple significance tests. If κp is larger than the p-value of an
alternative conservative significance test called Gabriel’s, then Gabriel’s p-value is used
instead of κp unless the PADJUST=NOGABRIEL option is specified.
Let B denote the number of branches, and c the number of input variable values
available to the split search. If the MISSING=USEINSEARCH option is specified in
the INPUT statement, c includes the missing value. For an interval input, c represents
the number of consolidated values described in the “Split Search Algorithm” section
beginning on page 47.
The Bonferroni factor κ depends on whether the input variable is nominal, and
whether the MISSING=USEINSEARCH option is specified.
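$$
\kappa =
\begin{cases}
\displaystyle \sum_{i=0}^{B-1} (-1)^{i} \, \frac{(B-i)^{c}}{i!\,(B-i)!} & \text{for a nominal input} \\[2ex]
\displaystyle \binom{c-1}{B-1} & \text{for non-nominal, without USEINSEARCH} \\[2ex]
\displaystyle \frac{B-1+B(c-B)}{c-1} \binom{c-1}{B-1} & \text{for non-nominal, with USEINSEARCH}
\end{cases}
$$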
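As a check on the three cases of κ, the following Python sketch evaluates them with exact rational arithmetic. The function name `bonferroni_factor` and its arguments are hypothetical and are not part of the ARBORETUM procedure; the sketch assumes at least two branches and no more branches than input values.

```python
from fractions import Fraction
from math import comb, factorial


def bonferroni_factor(c, B, nominal, use_in_search=False):
    """Evaluate kappa for c input values and B branches (illustrative only)."""
    if not 2 <= B <= c:
        raise ValueError("this sketch assumes 2 <= B <= c")
    if nominal:
        # Sum over i = 0 .. B-1 of (-1)^i (B-i)^c / (i! (B-i)!).
        total = sum(
            Fraction((-1) ** i * (B - i) ** c, factorial(i) * factorial(B - i))
            for i in range(B)
        )
        return int(total)  # the exact sum is a whole number
    ordered = comb(c - 1, B - 1)
    if not use_in_search:
        return ordered
    # Non-nominal input with the missing value taking part in the search.
    return int(Fraction(B - 1 + B * (c - B), c - 1) * ordered)


kappa_nominal = bonferroni_factor(4, 2, nominal=True)                        # 7
kappa_ordered = bonferroni_factor(5, 3, nominal=False)                       # 6
kappa_missing = bonferroni_factor(5, 3, nominal=False, use_in_search=True)   # 12
```

Using `Fraction` avoids floating-point rounding in the alternating sum, which grows quickly with c.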
The Bonferroni adjustment is described further in Kass (1980). Hawkins and Kass
(1982) suggested bounding κp with a p-value from a more conservative test. Unless
the PADJUST=NOGABRIEL option is specified,

$$
p = \min\left( \kappa \, \Pr\left(\chi^{2}_{(B-1,\,J-1)} > \chi^{2}\right),\; \Pr\left(\chi^{2}_{(c-1,\,J-1)} > \chi^{2}\right) \right)
$$

where J is the number of target values.
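The bounding rule can be sketched in Python as follows. The function `chaid_adjusted_p` is hypothetical; it takes the two tail probabilities as inputs instead of computing them, so it makes no assumption about how the χ² probabilities themselves are evaluated.

```python
def chaid_adjusted_p(kappa, p_branches, p_gabriel, use_gabriel=True):
    """Combine the Bonferroni-adjusted and Gabriel p-values (illustrative only).

    kappa:       Bonferroni factor for the candidate split
    p_branches:  tail probability of the chi-square statistic for the B branches
    p_gabriel:   tail probability from the more conservative test on c values
    use_gabriel: False mimics specifying PADJUST=NOGABRIEL
    """
    bonferroni_p = kappa * p_branches
    if not use_gabriel:
        return bonferroni_p
    # Gabriel's p-value replaces kappa * p only when it is the smaller of the two.
    return min(bonferroni_p, p_gabriel)


# Hypothetical values: kappa * p = 0.07 exceeds Gabriel's 0.05, so 0.05 is used.
p_bounded = chaid_adjusted_p(7, 0.01, 0.05)
p_unbounded = chaid_adjusted_p(7, 0.01, 0.05, use_gabriel=False)
```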
Adjusting p-Values for the Depth of the Node
The PADJUST=DEPTH option in the PROC statement requests the ARBORETUM
procedure to multiply the p-value by a depth factor to account for the probability of
error in creating the current node. The unadjusted p-value estimates the probability
that the observed association between the target values and the split of the data into
subsets could happen by chance, given the existence of the current node. The depth
adjustment attempts to incorporate the probability that the current node being split is
a chance occurrence to begin with.
The depth factor for node τ is the product of the number of branches in each ancestor
node:
$$
\mathrm{Depth}(\tau) = \prod_{\tau' \text{ an ancestor of } \tau} B(\tau')
$$
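A short Python sketch of the depth factor follows. The tree representation (a `parent` map and a `branches` map keyed by node name) and the node names are hypothetical and chosen only for illustration.

```python
def depth_factor(node, parent, branches):
    """Multiply the number of branches of every ancestor of `node`."""
    factor = 1
    current = parent.get(node)
    while current is not None:
        factor *= branches[current]
        current = parent.get(current)
    return factor


# Hypothetical tree: the root splits 2 ways, and node "a" splits 3 ways.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a", "a3": "a"}
branches = {"root": 2, "a": 3}

factor_root = depth_factor("root", parent, branches)  # no ancestors: 1
factor_a = depth_factor("a", parent, branches)        # 2
factor_a1 = depth_factor("a1", parent, branches)      # 2 * 3 = 6
```

The root has no ancestors, so its factor is the empty product, 1, and its p-values are left unchanged.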
Adjusting p-Values for the Number of Input Variables
The PVARS=m option in the PROC statement requests the ARBORETUM procedure
to adjust the p-value to account for multiple significance tests with independent
input variables. Let M(root) denote the number of input variables, and M(τ) denote
the number of input variables for which the ARBORETUM procedure searches
for a splitting rule in a specific node. (M(τ) may be less than M(root) because
the ARBORETUM procedure does not search on variables that are constant in τ,
or on categorical variables that do not satisfy the MINCATSIZE= option in the
TRAIN statement, or on variables that have been excluded in an ancestor node.) The
ARBORETUM procedure multiplies the p-value by max((m/M(root)) M(τ), 1) to
adjust for the multiple tests on different input variables in the node. Specifying m = 0
requests the procedure to make no adjustment for the number of independent input
variables.
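The multiplier can be written as a one-line Python function. The name `pvars_factor` and its argument names are hypothetical.

```python
def pvars_factor(m, m_root, m_tau):
    """Return max((m / M(root)) * M(tau), 1), the multiplier for the p-value.

    m:      value of the PVARS= option
    m_root: number of input variables, M(root)
    m_tau:  number of input variables searched in the node, M(tau)
    """
    return max((m / m_root) * m_tau, 1)


factor_half = pvars_factor(5, 10, 8)   # (5/10) * 8 = 4.0
factor_small = pvars_factor(1, 10, 5)  # 0.5 is raised to 1
factor_none = pvars_factor(0, 10, 8)   # m = 0 leaves the p-value unchanged
```

Because of the lower bound of 1, the factor never reduces a p-value, and m = 0 always yields a factor of 1.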
Splitting Criteria for an Ordinal Target
To evaluate splitting rules for an ordinal target, the ARBORETUM procedure uses
loss coefficients $A_{jk}$ defining the penalty of misclassifying target value j as k. The
coefficients are the same as the ones in the decision matrix, if one is specified in
the DECDATA= option in the DECISION statement. For an ordinal target, the decision
matrix must have type LOSS, the decision alternatives must equal the target values,
and $A_{jk}$ must be ≥ 0. By default, $A_{jk} = |k - j|$.
The ARBORETUM procedure always incorporates $A_{jk}$ into the node impurity
measure in the splitting criteria for an ordinal target. Let $\hat{k}(\tau)$ denote a target value in
node τ minimizing the loss, $\sum_{j} A_{jk} p_{j}$. For CRITERION=ENTROPY, define the
impurity measure,

$$
i(\tau) = - \sum_{j=1}^{J} \left( A_{j\hat{k}(\tau)} + 1 \right) p_{j} \log_{2} p_{j}
$$
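The loss-weighted entropy can be sketched in Python as follows. The function name is hypothetical, and the sketch makes three assumptions not stated above: p_j is the proportion of observations in the node with target value j, a term with p_j = 0 contributes nothing, and ties for the minimizing target value go to the lowest index. Target values are indexed from 0 in the code.

```python
from math import log2


def ordinal_entropy_impurity(p, loss=None):
    """Return (k_hat, impurity) for target proportions p (illustrative only)."""
    J = len(p)
    if loss is None:
        # Default loss coefficients A[j][k] = |k - j|.
        loss = [[abs(k - j) for k in range(J)] for j in range(J)]
    # Loss of predicting k: sum over j of A[j][k] * p[j].
    expected_loss = [sum(loss[j][k] * p[j] for j in range(J)) for k in range(J)]
    k_hat = min(range(J), key=expected_loss.__getitem__)
    # Entropy terms weighted by (A[j][k_hat] + 1); zero proportions are skipped.
    impurity = -sum(
        (loss[j][k_hat] + 1) * p[j] * log2(p[j]) for j in range(J) if p[j] > 0
    )
    return k_hat, impurity


k_hat, impurity = ordinal_entropy_impurity([0.25, 0.5, 0.25])  # (1, 2.5)
```

For these proportions the ordinary entropy is 1.5; the weights of 2 on the two outer target values raise the impurity to 2.5.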