The search for a split on a variable does not depend on normality, but the evaluation of the selected split on the variable in terms of a p-value does. The procedure uses the p-value to compare the best split on one variable to that of another, and to compare against the threshold significance level specified in the ALPHA= option in the TRAIN statement. Consequently, one potential risk of nonnormality is that the best splitting variable is rejected in favor of another variable because the p-values are incorrect.
The more important risk is that of mistaking an insignificant split for a significant one, a split whose p-value falls below ALPHA but that disappoints when applied to a new sample drawn from the same distribution. In that case, significance testing would not prevent the tree from overfitting.
No assumption is made about the distribution of the input variable defining the split-
ting rule. The split search only depends on the ranks of the values of an interval
or ordinal input. Consequently, any monotonic transformation of an interval input
variable results in the same splitting rule.
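This rank invariance is easy to check directly. The following sketch is illustrative only (the data and the log transform are hypothetical, not part of the procedure): any strictly monotonic transformation leaves the ranks, and therefore the candidate splits, unchanged.

```python
import math

# Hypothetical interval input; any strictly increasing transform
# (here, the natural log) preserves the ordering that the
# rank-based split search relies on.
x = [2.0, 7.0, 1.5, 30.0, 4.0]

def ranks(values):
    """Return the rank (0 = smallest) of each value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

print(ranks(x) == ranks([math.log(v) for v in x]))  # True
```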
Multiple Testing Assumptions
The application of significance tests to splitting rules began in the early 1970s with
the CHAID methodology of Kass (1980). CHAID and its derivatives assume that the
number of independent significance tests equals the number of candidate splits. Even
though a p-value is only computed for a single candidate split on a variable, and the
split search might not examine every possible split, CHAID regards every possible
split as representing a test.
Let α denote the significance level of a test; α equals the probability of mistakenly
rejecting the hypothesis of no association between the target values and the branches
when there is indeed no association. For m independent tests, the probability of
mistakenly rejecting at least one of them equals one minus the probability of rejecting
none of them, which equals
P(one or more spurious splits) = 1 − (1 − α)^m
For example, this equals 0.401 for α = 0.05 and m = 10.
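As a quick arithmetic check of this figure (plain computation, not part of the procedure):

```python
alpha, m = 0.05, 10

# Probability that at least one of m independent tests at level
# alpha rejects when no association actually exists.
p_spurious = 1 - (1 - alpha) ** m
print(round(p_spurious, 3))  # 0.401
```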
Expanding the polynomial of degree m,

(1 − α)^m = sum_{k=0}^{m} (−1)^k [m! / (k!(m − k)!)] α^k
Multiplying by minus one and then adding one yields

P(one or more spurious splits) = sum_{k=1}^{m} (−1)^(k+1) [m! / (k!(m − k)!)] α^k
The Bonferroni approximation assumes that terms in α^2 and higher are ignorable:

P_Bonferroni(one or more spurious splits) = mα
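The series agrees with the closed form, and the Bonferroni bound overstates the probability. A small numerical check (the values of α and m are chosen only for illustration):

```python
from math import comb

alpha, m = 0.05, 10

# Inclusion-exclusion series from the binomial expansion above.
series = sum((-1) ** (k + 1) * comb(m, k) * alpha ** k
             for k in range(1, m + 1))
closed_form = 1 - (1 - alpha) ** m
bonferroni = m * alpha

print(round(series, 6), round(closed_form, 6), bonferroni)
# 0.401263 0.401263 0.5
```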
The CHAID methodology uses this expression to evaluate the best split on a variable,
using m equal to the number of possible splits on the variable in the node, and α
equal to the p-value of the split. Let κ(τ, v, b) denote the number of candidate splits
of node τ into b branches using input variable v. The CHAID methodology sets m
equal to κ(τ, v, b).
Setting m equal to the number of possible splits on all variables would produce a
much larger value of m than using the number of splits on a single variable. If no
input variable were predictive of the target in node τ , a split of node τ would occur
by chance using
m = sum_v sum_{b=2}^{MAXBRANCH} κ(τ, v, b)

in the above expression for the probability, where MAXBRANCH denotes the value of the MAXBRANCH= option in the TRAIN statement.
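For an ordinal input (or an interval input treated by rank) with L distinct values in the node, κ(τ, v, b) reduces to a standard count of cut-point choices, C(L − 1, b − 1). The procedure's exact bookkeeping is not shown here; the following sketch applies that counting rule to one hypothetical input:

```python
from math import comb

def kappa_ordinal(n_values, branches):
    """Candidate splits of n_values ordered, distinct values into
    `branches` contiguous groups: choose branches - 1 cut points
    among the n_values - 1 gaps between adjacent values."""
    return comb(n_values - 1, branches - 1)

# Hypothetical node: one ordinal input with 10 distinct values and
# MAXBRANCH = 4, so b ranges over 2, 3, 4.
m = sum(kappa_ordinal(10, b) for b in range(2, 5))
print(m)  # 9 + 36 + 84 = 129
```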
This value of m and the CHAID value of m are often unrealistically large for comput-
ing the probability of a spurious split in a node. The main difficulty is that candidate
splits are not independent, but formulating an estimate of the significance probability
without assuming independence seems impossible. Incorporating the correlation be-
tween tests would decrease the estimated probability of a spurious split. Consider an
extreme example for illustration: suppose two variables are identical. The candidate
splits using one of the variables would be identical to those of the other, and the tests
using one would simply repeat those of the other. Incorporating the (perfect) corre-
lation of the two variables would reduce the estimate of the probability of a spurious
split by half.
A common situation exposing the awkwardness of the assumption of independent
tests is that of a search for a binary split on an interval variable with no tied values.
A split at one point assigns most observations to the same branch that a split on a
nearby point does, and consequently all splits on nearby points are highly correlated.
Regarding all candidate splits as independent creates an m so unrealistically large
that an estimate of the probability of a spurious split is near certainty. To avoid this,
some analysts first group the values of an interval input variable into 10 or so ordinal
values. The INTERVALBINS= option in the TRAIN statement sets the number of
groups for this purpose. The groups are created separately in each node. Even after
this grouping, the ARBORETUM procedure may consolidate the remaining values,
thereby reducing the number of candidate splits. See the "Split Search Algorithm" section beginning on page 47 for more information.
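The exact grouping the procedure performs is described in the referenced section. Purely as an illustration of the idea, equal-width binning of an interval input into 10 ordinal groups (the data and the binning rule here are hypothetical, not the procedure's algorithm) might look like:

```python
x = [0.3, 1.7, 2.2, 5.0, 9.9, 10.0, 4.4, 7.5]  # hypothetical interval input

n_bins = 10                    # analogue of INTERVALBINS=10
lo, hi = min(x), max(x)
width = (hi - lo) / n_bins

# Map each value to an ordinal group code 0 .. n_bins - 1; the split
# search then has at most n_bins - 1 binary cut points to consider.
groups = [min(int((v - lo) / width), n_bins - 1) for v in x]
print(sorted(set(groups)))  # [0, 1, 4, 7, 9]
```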
Adjusting p-Values for Multiple Tests
When specifying CRITERION=PROBF or CRITERION=PROBCHISQ, the ARBORETUM procedure may adjust the p-value of the significance test when comparing candidate splits with each other, or when comparing a p-value with the significance threshold specified in the ALPHA= option in the PROC ARBORETUM statement.