The ARBORETUM Procedure


Adjusting p-Values for the Number of Input Values and Branches

For a particular node, variable, and number of branches, the procedure can find the best candidate without computing a p-value by finding the candidate with the largest F statistic or χ² statistic.

If the PADJUST=CHAIDBEFORE option is specified, the p-value is multiplied by κ(τ, v, b). Otherwise no adjustment is made yet. The procedure repeats this search for each possible number of branches, producing a single candidate split for each number of branches, and then chooses the one with the best adjusted or unadjusted p-value, according to whether PADJUST=CHAIDBEFORE is or is not specified.

If the PADJUST=CHAIDAFTER option is specified, the p-value of the final candidate split in the node for the variable is multiplied by κ(τ, v, b).

If either the PVARS=n or PADJUST=DEPTH option is specified in the PROC ARBORETUM statement, the p-value is further multiplied by a factor to adjust for the number of variables or the depth of the node τ in the tree, to arrive at a final adjusted p-value of the candidate split.

If the adjusted p-value is greater than the value of the ALPHA= option in the PROC

ARBORETUM statement, the candidate is discarded, and the procedure proposes no

split of τ using the variable.
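The selection logic described above can be summarized in a short Python sketch. It is illustrative only: the `Candidate` tuple, the function name `choose_split`, and the `extra_factor` argument (standing in for the PVARS= and PADJUST=DEPTH factors) are hypothetical names rather than part of the ARBORETUM procedure, and the sketch assumes that the "best" p-value is the smallest one.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    """Best split found for one number of branches (hypothetical container)."""
    branches: int    # number of branches b
    p_value: float   # unadjusted p-value of the split
    kappa: float     # Bonferroni factor kappa(tau, v, b)


def choose_split(candidates: Sequence[Candidate], alpha: float,
                 padjust: str = "NONE", extra_factor: float = 1.0
                 ) -> Optional[Tuple[Candidate, float]]:
    """Pick one candidate for a variable in a node, or None if it is discarded."""
    if not candidates:
        return None
    if padjust == "CHAIDBEFORE":
        # Adjust every candidate first, then compare adjusted p-values.
        best = min(candidates, key=lambda cand: cand.p_value * cand.kappa)
        p = best.p_value * best.kappa
    else:
        # Compare unadjusted p-values; adjust only the winner if requested.
        best = min(candidates, key=lambda cand: cand.p_value)
        p = best.p_value
        if padjust == "CHAIDAFTER":
            p *= best.kappa
    # Assumed stand-in for the variable-count and depth factors.
    p *= extra_factor
    # Discard the candidate when the adjusted p-value exceeds ALPHA=.
    return None if p > alpha else (best, p)


candidates = [Candidate(2, 0.010, 3), Candidate(3, 0.004, 25)]
before = choose_split(candidates, alpha=0.05, padjust="CHAIDBEFORE")
after = choose_split(candidates, alpha=0.05, padjust="CHAIDAFTER")
```

With these made-up numbers, adjusting before the comparison keeps the two-branch split, while adjusting afterward selects the three-branch split and then discards it because its adjusted p-value exceeds the threshold.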

Adjusting p-Values for the Number of Input Values and Branches

The PADJUST=CHAIDAFTER or CHAIDBEFORE option in the PROC statement requests the ARBORETUM procedure to multiply the p-value of the χ² statistic computed for the PROBCHISQ criterion for a nominal target by a Bonferroni factor κ to adjust for using multiple significance tests. If κp is larger than the p-value of an alternative conservative significance test called Gabriel's, then Gabriel's p-value is used instead of κp unless the PADJUST=NOGABRIEL option is specified.

Let B denote the number of branches, and c the number of input variable values available to the split search. If the MISSING=USEINSEARCH option is specified in the INPUT statement, c includes the missing value. For an interval input, c represents the number of consolidated values described in the “Split Search Algorithm” section beginning on page 47.

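The Bonferroni factor κ depends on whether the input variable is nominal, and whether the MISSING=USEINSEARCH option is specified.

$$
\kappa =
\begin{cases}
\displaystyle\sum_{i=0}^{B-1} (-1)^i \, \frac{(B-i)^c}{i!\,(B-i)!} & \text{for a nominal input} \\[2ex]
\displaystyle\binom{c-1}{B-1} & \text{for non-nominal, without USEINSEARCH} \\[2ex]
\displaystyle\frac{B-1+B(c-B)}{c-1} \binom{c-1}{B-1} & \text{for non-nominal, with USEINSEARCH}
\end{cases}
$$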
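As a check on the three cases, the factor can be evaluated numerically. The following Python sketch is an illustration, not the procedure's own code: the function name `bonferroni_factor` and its arguments are hypothetical, and it assumes the nominal case is the count of ways to partition c values into B nonempty groups, so that the numerator of each term is (B − i) raised to the power c.

```python
from fractions import Fraction
from math import comb, factorial


def bonferroni_factor(c: int, B: int, nominal: bool,
                      use_in_search: bool = False) -> int:
    """Bonferroni factor for c input values split into B branches."""
    if not 1 <= B <= c:
        raise ValueError("need 1 <= B <= c")
    if nominal:
        # Alternating sum, evaluated exactly with rational arithmetic.
        total = sum(
            Fraction((-1) ** i * (B - i) ** c, factorial(i) * factorial(B - i))
            for i in range(B)
        )
        return int(total)
    ordered = comb(c - 1, B - 1)  # ways to cut c ordered values into B runs
    if not use_in_search:
        return ordered
    if c < 2:
        raise ValueError("need c >= 2 when the missing value joins the search")
    # Extra multiplier for the case where the missing value is searched over.
    return int(Fraction(B - 1 + B * (c - B), c - 1) * ordered)


kappa_nominal = bonferroni_factor(4, 2, nominal=True)
kappa_ordered = bonferroni_factor(5, 3, nominal=False)
kappa_missing = bonferroni_factor(5, 3, nominal=False, use_in_search=True)
```

For small inputs the results can be verified by hand: four nominal values can be divided into two nonempty groups in 7 ways, and five ordered values can be cut into three contiguous runs in 6 ways.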

The Bonferroni adjustment is described further in Kass (1980). Hawkins and Kass (1982) suggested bounding κp with a p-value from a more conservative test. Unless the PADJUST=NOGABRIEL option is specified,

$$
p = \min\left( \kappa \, \Pr\left(\chi^2_{(B-1,\,J-1)} > \chi^2\right),\; \Pr\left(\chi^2_{(c-1,\,J-1)} > \chi^2\right) \right)
$$


where J is the number of target values.
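A small Python sketch may help make the bound concrete. It rests on assumptions that are not stated in the text: the subscripts are read as degrees of freedom (B − 1)(J − 1) and (c − 1)(J − 1), the usual values for contingency tables of those sizes, and both probabilities are evaluated at the same observed χ² statistic. The names `chi2_sf` and `gabriel_bounded_p` are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting values, then the recurrence
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(h))
    else:
        a, q = 1.0, math.exp(-h)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def gabriel_bounded_p(stat: float, kappa: float, B: int, c: int, J: int,
                      use_gabriel: bool = True) -> float:
    """Bonferroni-adjusted p-value, optionally bounded by Gabriel's p-value."""
    bonferroni = kappa * chi2_sf(stat, (B - 1) * (J - 1))  # assumed df
    if not use_gabriel:
        return bonferroni
    gabriel = chi2_sf(stat, (c - 1) * (J - 1))  # assumed df
    return min(bonferroni, gabriel)


p_small_kappa = gabriel_bounded_p(10.0, kappa=7, B=2, c=4, J=2)
p_large_kappa = gabriel_bounded_p(10.0, kappa=1000, B=2, c=4, J=2)
```

When κ is small the Bonferroni product is the smaller of the two values; when κ is large the product can exceed 1, and the second term then supplies the reported p-value.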

Adjusting p-Values for the Depth of the Node

The PADJUST=DEPTH option in the PROC statement requests the ARBORETUM

procedure to multiply the p-value by a depth factor to account for the probability of

error in creating the current node. The unadjusted p-value estimates the probability

that the observed association between the target values and the split of the data into

subsets could happen by chance, given the existence of the current node. The depth

adjustment attempts to incorporate the probability that the current node being split is

a chance occurrence to begin with.

The depth factor for node τ is the product of the number of branches in each ancestor node:

$$
\mathrm{Depth}(\tau) = \prod_{\tau' \,\in\, \mathrm{ancestors}(\tau)} B(\tau')
$$
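The product over ancestors can be illustrated with a minimal Python sketch. The tree representation here (a `parent` mapping and a `branches` mapping keyed by node name) is a hypothetical choice made for the example, not the procedure's internal structure.

```python
from typing import Dict, Optional


def depth_factor(node: str, parent: Dict[str, Optional[str]],
                 branches: Dict[str, int]) -> int:
    """Multiply the branch counts of every ancestor of `node`."""
    factor = 1  # the root has no ancestors, so its factor is 1
    ancestor = parent.get(node)
    while ancestor is not None:
        factor *= branches[ancestor]
        ancestor = parent.get(ancestor)
    return factor


# Root splits 3 ways; its child "a" splits 2 ways; "b" is a child of "a".
parent = {"root": None, "a": "root", "b": "a"}
branches = {"root": 3, "a": 2}

factor_root = depth_factor("root", parent, branches)
factor_a = depth_factor("a", parent, branches)
factor_b = depth_factor("b", parent, branches)
```

In this example a p-value computed in node "b" would be multiplied by 3 × 2 = 6, so splits deeper in the tree need stronger evidence to pass the same threshold.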

Adjusting p-Values for the Number of Input Variables

The PVARS=m option in the PROC statement requests the ARBORETUM procedure to adjust the p-value to account for multiple significance tests with independent input variables. Let M(root) denote the number of input variables, and M(τ) denote the number of input variables for which the ARBORETUM procedure searches for a splitting rule in a specific node. (M(τ) may be less than M(root) because the ARBORETUM procedure does not search on variables that are constant in τ, or on categorical variables that do not satisfy the MINCATSIZE= option in the TRAIN statement, or on variables that have been excluded in an ancestor node.) The ARBORETUM procedure multiplies the p-value by max((m/M(root)) M(τ), 1) to adjust for the multiple tests on different input variables in the node. Specifying m = 0 requests the procedure to make no adjustment for the number of independent input variables.
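The multiplier is simple enough to check by hand, and a short Python sketch makes the role of the floor at 1 visible. The function and argument names are hypothetical.

```python
def pvars_factor(m: float, m_root: int, m_node: int) -> float:
    """Multiplier max((m / M(root)) * M(tau), 1) applied to the p-value."""
    if m_root <= 0:
        raise ValueError("M(root) must be positive")
    return max((m / m_root) * m_node, 1.0)


# 20 inputs at the root, 12 still searched in this node.
factor_typical = pvars_factor(5, 20, 12)   # (5 / 20) * 12
factor_floor = pvars_factor(2, 20, 4)      # 0.4 is raised to 1
factor_off = pvars_factor(0, 20, 12)       # m = 0: no adjustment
```

Because of the maximum with 1, the adjustment never makes a p-value smaller, and m = 0 always yields a factor of 1.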

Splitting Criteria for an Ordinal Target

To evaluate splitting rules for an ordinal target, the ARBORETUM procedure uses loss coefficients $A_{jk}$ defining the penalty of misclassifying target value j as k. The coefficients are the same as the ones in the decision matrix, if one is specified in the DECDATA= option in the DECISION statement. For an ordinal target, the decision matrix must have type LOSS, the decision alternatives must equal the target values, and $A_{jk}$ must be ≥ 0. By default, $A_{jk} = |k − j|$.
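For illustration, the default coefficients |k − j| can be tabulated with a few lines of Python. The sketch assumes the ordinal target values are coded as consecutive ranks 0, 1, 2, and so on; the function name is hypothetical.

```python
from typing import List


def default_loss(n_values: int) -> List[List[int]]:
    """Default loss matrix: row j, column k holds |k - j|."""
    return [[abs(k - j) for k in range(n_values)] for j in range(n_values)]


loss = default_loss(3)
for row in loss:
    print(row)
```

The matrix is zero on the diagonal and grows with the distance between the true rank and the predicted rank, so confusing adjacent categories is penalized less than confusing distant ones.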

The ARBORETUM procedure always incorporates $A_{jk}$ into the node impurity measure in the splitting criteria for an ordinal target. Let $\hat{k}(\tau)$ denote a target value in node τ minimizing the loss. For CRITERION=ENTROPY, define the impurity measure,

$$
i(\tau) = -\sum_{j=1}^{J} \left(A_{j\hat{k}(\tau)} + 1\right) p_j \log \cdots
$$
