The search for a split on a variable does not depend on normality, but the evaluation of the selected split on the variable in terms of a p-value does. The procedure uses the p-value to compare the best split on one variable to that of another, and to compare against the threshold significance level specified in the ALPHA= option in the TRAIN statement. Consequently, one potential risk of nonnormality is that the best splitting variable is rejected in favor of another variable because the p-values are incorrect.
The more important risk is that of mistaking an insignificant split for a significant one, a split whose p-value falls below ALPHA but that disappoints when applied to a new sample drawn from the same distribution. In that case, significance testing would not prevent the tree from overfitting.
No assumption is made about the distribution of the input variable defining the split-
ting rule. The split search only depends on the ranks of the values of an interval
or ordinal input. Consequently, any monotonic transformation of an interval input
variable results in the same splitting rule.
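This rank invariance is easy to check directly. The following sketch is illustrative only (the data and the log transform are hypothetical, not part of the procedure): any strictly monotonic transformation leaves the ranks, and therefore the candidate splits, unchanged.

```python
import math

# Hypothetical interval input; any strictly increasing transform
# (here, the natural log) preserves the ordering that the
# rank-based split search relies on.
x = [2.0, 7.0, 1.5, 30.0, 4.0]

def ranks(values):
    """Return the rank (0 = smallest) of each value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

print(ranks(x) == ranks([math.log(v) for v in x]))  # True
```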
Multiple Testing Assumptions
The application of significance tests to splitting rules began in the early 1970s with
the CHAID methodology of Kass (1980). CHAID and its derivatives assume that the
number of independent significance tests equals the number of candidate splits. Even
though a p-value is only computed for a single candidate split on a variable, and the
split search might not examine every possible split, CHAID regards every possible
split as representing a test.
Let α denote the significance level of a test; α equals the probability of mistakenly
rejecting the hypothesis of no association between the target values and the branches
when there is indeed no association. For m independent tests, the probability of
mistakenly rejecting at least one of them equals one minus the probability of rejecting
none of them, which equals
P(one or more spurious splits) = 1 − (1 − α)^m
For example, this equals 0.401 for α = 0.05 and m = 10.
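As a quick arithmetic check of this figure (plain computation, not part of the procedure):

```python
alpha, m = 0.05, 10

# Probability that at least one of m independent tests at level
# alpha rejects when no association actually exists.
p_spurious = 1 - (1 - alpha) ** m
print(round(p_spurious, 3))  # 0.401
```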
Expanding the polynomial of degree m,

(1 − α)^m = sum_{k=0}^{m} (−1)^k [m! / (k!(m − k)!)] α^k
Multiplying by minus one and then adding one yields

P(one or more spurious splits) = sum_{k=1}^{m} (−1)^(k+1) [m! / (k!(m − k)!)] α^k
The Bonferroni approximation assumes that terms in α^2 and higher are ignorable:

P_Bonferroni(one or more spurious splits) = mα
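The series agrees with the closed form, and the Bonferroni bound overstates the probability. A small numerical check (the values of α and m are chosen only for illustration):

```python
from math import comb

alpha, m = 0.05, 10

# Inclusion-exclusion series from the binomial expansion above.
series = sum((-1) ** (k + 1) * comb(m, k) * alpha ** k
             for k in range(1, m + 1))
closed_form = 1 - (1 - alpha) ** m
bonferroni = m * alpha

print(round(series, 6), round(closed_form, 6), bonferroni)
# 0.401263 0.401263 0.5
```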
The CHAID methodology uses this expression to evaluate the best split on a variable,
using m equal to the number of possible splits on the variable in the node, and α
equal to the p-value of the split. Let κ(τ, v, b) denote the number of candidate splits
of node τ into b branches using input variable v. The CHAID methodology sets m
equal to κ(τ, v, b).
Setting m equal to the number of possible splits on all variables would produce a
much larger value of m than using the number of splits on a single variable. If no
input variable were predictive of the target in node τ , a split of node τ would occur
by chance using
m = sum_v sum_{b=2}^{MAXBRANCH} κ(τ, v, b)

in the above expression for the probability, where MAXBRANCH denotes the value of the MAXBRANCH= option in the TRAIN statement.
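For an ordinal input (or an interval input treated by rank) with L distinct values in the node, κ(τ, v, b) reduces to a standard count of cut-point choices, C(L − 1, b − 1). The procedure's exact bookkeeping is not shown here; the following sketch applies that counting rule to one hypothetical input:

```python
from math import comb

def kappa_ordinal(n_values, branches):
    """Candidate splits of n_values ordered, distinct values into
    `branches` contiguous groups: choose branches - 1 cut points
    among the n_values - 1 gaps between adjacent values."""
    return comb(n_values - 1, branches - 1)

# Hypothetical node: one ordinal input with 10 distinct values and
# MAXBRANCH = 4, so b ranges over 2, 3, 4.
m = sum(kappa_ordinal(10, b) for b in range(2, 5))
print(m)  # 9 + 36 + 84 = 129
```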
This value of m and the CHAID value of m are often unrealistically large for comput-
ing the probability of a spurious split in a node. The main difficulty is that candidate
splits are not independent, but formulating an estimate of the significance probability
without assuming independence seems impossible. Incorporating the correlation be-
tween tests would decrease the estimated probability of a spurious split. Consider an
extreme example for illustration: suppose two variables are identical. The candidate
splits using one of the variables would be identical to those of the other, and the tests
using one would simply repeat those of the other. Incorporating the (perfect) corre-
lation of the two variables would reduce the estimate of the probability of a spurious
split by half.
A common situation exposing the awkwardness of the assumption of independent
tests is that of a search for a binary split on an interval variable with no tied values.
A split at one point assigns most observations to the same branch that a split on a
nearby point does, and consequently all splits on nearby points are highly correlated.
Regarding all candidate splits as independent creates an m so unrealistically large
that an estimate of the probability of a spurious split is near certainty. To avoid this,
some analysts first group the values of an interval input variable into 10 or so ordinal
values. The INTERVALBINS= option in the TRAIN statement sets the number of
groups for this purpose. The groups are created separately in each node. Even after
this grouping, the ARBORETUM procedure may consolidate the remaining values,
thereby reducing the number of candidate splits. See the "Split Search Algorithm" section beginning on page 47 for more information.
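The exact grouping the procedure performs is described in the referenced section. Purely as an illustration of the idea, equal-width binning of an interval input into 10 ordinal groups (the data and the binning rule here are hypothetical, not the procedure's algorithm) might look like:

```python
x = [0.3, 1.7, 2.2, 5.0, 9.9, 10.0, 4.4, 7.5]  # hypothetical interval input

n_bins = 10                    # analogue of INTERVALBINS=10
lo, hi = min(x), max(x)
width = (hi - lo) / n_bins

# Map each value to an ordinal group code 0 .. n_bins - 1; the split
# search then has at most n_bins - 1 binary cut points to consider.
groups = [min(int((v - lo) / width), n_bins - 1) for v in x]
print(sorted(set(groups)))  # [0, 1, 4, 7, 9]
```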
Adjusting p-Values for Multiple Tests
When specifying CRITERION=PROBF or CRITERION=PROBCHISQ, the ARBORETUM procedure may adjust the p-value of the significance test when comparing candidate splits with each other, or when comparing a p-value with the significance threshold specified in the ALPHA= option in the PROC ARBORETUM statement.