The arboretum procedure



of branches and input values, the depth of the node in the tree, and the number of independent input variables for which candidate splits exist in the node. A measure for a categorical target may incorporate prior probabilities. A measure for a nominal target may incorporate profit or loss functions, including unequal misclassification costs. A measure for ordinal targets must incorporate distances between target values. The ARBORETUM procedure creates a distance function from a loss function specified with the DECISION statement.

This section defines the formulas for computing the worth of a rule s that splits node τ into B branches, creating nodes $\{\tau_b : b = 1, 2, \ldots, B\}$. N(τ) denotes the number of observations in node τ used in the search for the rule s.

Reduction in Node Impurity

The impurity i(τ) of node τ is a nonnegative number that equals zero if all observations in τ have the same target value, and is large if the target values in τ are very different. The option CRITERION=VARIANCE specifies average square error as the impurity measure for an interval target:

$$ i(\tau) = \frac{1}{N(\tau)} \sum_{i=1}^{N(\tau)} \left( Y_i - \bar{Y} \right)^2 $$

where N(τ) is the number of observations in τ, $Y_i$ is the target value of observation i, and $\bar{Y}$ is the average of $Y_i$ in τ.
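As a minimal sketch of the average-square-error impurity described above, the following Python function computes the mean squared deviation of the target values in a node. The function name `variance_impurity` and the sample data are illustrative assumptions, not part of the ARBORETUM procedure.

```python
def variance_impurity(targets):
    """Average squared deviation of the target values from their mean."""
    n = len(targets)
    if n == 0:
        raise ValueError("node has no observations")
    mean = sum(targets) / n
    return sum((y - mean) ** 2 for y in targets) / n


# Hypothetical nodes: one with identical targets, one with spread-out targets.
pure_node = [3.0, 3.0, 3.0]
mixed_node = [1.0, 2.0, 3.0, 6.0]
print(variance_impurity(pure_node))   # 0.0
print(variance_impurity(mixed_node))  # 3.5
```

The impurity is zero when every observation shares one target value and grows as the values spread out.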

The option CRITERION=ENTROPY specifies entropy as the impurity measure for a categorical target:

$$ i(\tau) = -\sum_{j=1}^{J} p_j \log_2 p_j $$

where the sum is over the J target values and $p_j$ is the proportion of observations with target value j in τ, possibly adjusted by prior probabilities, a profit function, or a loss function.
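The entropy measure can be sketched in Python as follows. This sketch assumes a base-2 logarithm and uses raw proportions only, with no adjustment for prior probabilities, profit, or loss. The function name `entropy_impurity` is hypothetical.

```python
import math
from collections import Counter


def entropy_impurity(targets):
    """Entropy (base 2, an assumption) of the target proportions in a node."""
    n = len(targets)
    counts = Counter(targets)
    # p * log2(1/p) equals -p * log2(p); target values absent from the node
    # contribute nothing, so only observed values are summed.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(entropy_impurity(["a", "a", "a"]))       # 0.0
print(entropy_impurity(["a", "a", "b", "b"]))  # 1.0
```

A node with a single target value has zero entropy, and a two-value node is most impure when the values are evenly divided.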

The option CRITERION=GINI specifies the Gini index as the impurity measure, which is also the average square error for a categorical target:

$$ i(\tau) = 1 - \sum_{j=1}^{J} p_j^2 $$

For a binary target, CRITERION=GINI creates the same binary splits as CRITERION=ENTROPY.
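The following Python sketch illustrates why the Gini index can be read as an average square error for a categorical target. It compares one minus the sum of squared proportions with the summed variance of the 0/1 indicator of each target value. Both function names are hypothetical, and raw proportions are assumed (no priors, profit, or loss adjustments).

```python
from collections import Counter


def gini_impurity(targets):
    """One minus the sum of squared target proportions."""
    n = len(targets)
    counts = Counter(targets)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def indicator_square_error(targets):
    """Sum over target values of the variance p * (1 - p) of its 0/1 indicator."""
    n = len(targets)
    counts = Counter(targets)
    return sum((c / n) * (1.0 - c / n) for c in counts.values())


node = ["a", "a", "b", "c"]
print(gini_impurity(node))           # 0.625
print(indicator_square_error(node))  # 0.625
```

The two quantities agree because the proportions sum to one, so the sum of p(1 - p) equals one minus the sum of p squared.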

The worth of a split s is measured as the reduction in node impurity:

$$ \Delta i(s, \tau) = i(\tau) - \sum_{b=1}^{B} p(\tau_b|\tau)\, i(\tau_b) $$

where the sum is over the B branches the split s defines, and $p(\tau_b|\tau)$ is the proportion of observations in τ assigned to branch b.
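A minimal Python sketch of the impurity reduction: the parent impurity minus the branch impurities weighted by the share of observations each branch receives. The helper names and the use of a simple Gini function as the impurity measure are assumptions chosen for illustration. Any impurity function of a list of targets could be passed instead.

```python
from collections import Counter


def gini(targets):
    """Gini index from raw proportions (illustrative impurity measure)."""
    n = len(targets)
    return 1.0 - sum((c / n) ** 2 for c in Counter(targets).values())


def impurity_reduction(parent, branches, impurity=gini):
    """Parent impurity minus the proportion-weighted branch impurities."""
    n = len(parent)
    weighted = sum(len(b) / n * impurity(b) for b in branches)
    return impurity(parent) - weighted


parent = [0, 0, 0, 1, 1, 1]
perfect_split = [[0, 0, 0], [1, 1, 1]]
useless_split = [[0, 1], [0, 0, 1, 1]]
print(impurity_reduction(parent, perfect_split))  # 0.5
print(impurity_reduction(parent, useless_split))  # close to zero
```

A split that separates the target values completely removes all of the parent's impurity, while a split whose branches mirror the parent's mix removes none.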


Statistical Tests and p-Values

An alternative to using the reduction in node impurity is to test for a significant difference of the target values between the different branches defined by a candidate split. The worth of the split is equal to $-\log_{10}(p)$, where p is the p-value (possibly adjusted) of the test. The minus sign ensures that the worth is nonnegative, with larger values being more significant. The ARBORETUM procedure never computes the raw p-value because it is often smaller than the smallest positive number the computer can represent. Instead, the procedure computes $\log_{10}(p)$ directly.
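The following Python sketch illustrates why working with the logarithm of the p-value avoids underflow. It is not the procedure's algorithm, which the text does not describe. It assumes a base-10 logarithm and a chi-square test with an even number of degrees of freedom, for which the upper-tail probability has a closed form that can be evaluated in log space. The function name is hypothetical.

```python
import math


def log10_p_chisq_even_df(x, df):
    """log10 of P(chi-square with df >= x), for positive even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    if x <= 0:
        return 0.0  # the p-value is 1
    half = x / 2.0
    # Closed form: p = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    # Each term is kept as a logarithm so nothing underflows or overflows.
    log_terms = [i * math.log(half) - math.lgamma(i + 1) for i in range(df // 2)]
    peak = max(log_terms)
    log_sum = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return (-half + log_sum) / math.log(10.0)


statistic = 4000.0
print(math.exp(-statistic / 2.0))            # 0.0, the raw p-value underflows
print(-log10_p_chisq_even_df(statistic, 2))  # worth, roughly 868.59
```

The raw p-value is indistinguishable from zero in double precision, yet its logarithm is an ordinary number that can still rank competing splits.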

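For an interval target, the CRITERION=PROBF option requests using the F-statistic:

$$ F = \frac{SS_{\text{between}} / (B - 1)}{SS_{\text{within}} / (N(\tau) - B)} $$

where

$$ SS_{\text{between}} = \sum_{b=1}^{B} N(\tau_b) \left( \bar{Y}(\tau_b) - \bar{Y}(\tau) \right)^2 $$

$$ SS_{\text{within}} = \sum_{b=1}^{B} \sum_{i=1}^{N(\tau_b)} \left( Y_i(\tau_b) - \bar{Y}(\tau_b) \right)^2 $$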

The p-value equals the probability that z ≥ F, where z is a random variable from an F distribution with B − 1 numerator and N(τ) − B denominator degrees of freedom.
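As a hedged illustration of the F-statistic for a candidate split, the Python sketch below computes the between-branch and within-branch sums of squares from lists of target values, one list per branch. The function name `f_statistic` and the sample data are assumptions. The sketch requires at least two branches, more observations than branches, and some within-branch variation.

```python
def f_statistic(branches):
    """F statistic for a split; `branches` holds one list of targets per branch."""
    b_count = len(branches)
    n = sum(len(b) for b in branches)
    grand_mean = sum(sum(b) for b in branches) / n
    ss_between = 0.0
    ss_within = 0.0
    for b in branches:
        mean_b = sum(b) / len(b)
        # Branch size times squared distance of the branch mean from the node mean.
        ss_between += len(b) * (mean_b - grand_mean) ** 2
        # Squared distance of each target from its own branch mean.
        ss_within += sum((y - mean_b) ** 2 for y in b)
    return (ss_between / (b_count - 1)) / (ss_within / (n - b_count))


split = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(f_statistic(split))  # 13.5
```

Larger values indicate that the branch means differ by more than the spread within the branches would explain.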

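For a nominal target, the CRITERION=PROBCHISQ option requests using the chi-square statistic:

$$ \chi^2 = N(\tau) \sum_{b=1}^{B} \sum_{j=1}^{J} \frac{\left( N_j(\tau_b)/N(\tau) - p(\tau_b|\tau)\, p_j(\tau) \right)^2}{p(\tau_b|\tau)\, p_j(\tau)} $$

where $N_j(\tau_b)$ is the number of observations in branch b with target value j, and $p_j(\tau)$ is the proportion of observations in τ with target value j. The p-value equals the probability that $\chi^2_\nu \ge \chi^2$, where $\chi^2_\nu$ is a random variable from a chi-square distribution with ν = (B − 1)(J − 1) degrees of freedom.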
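The chi-square statistic for a candidate split can be sketched in Python as below. The sketch compares, for every branch and target value, the observed joint proportion with the product of the branch proportion and the target proportion in the parent node. The function name and the sample data are assumptions, and raw counts are used with no prior or p-value adjustments.

```python
from collections import Counter


def chi_square_statistic(branches):
    """Return (chi-square statistic, degrees of freedom) for a split."""
    n = sum(len(b) for b in branches)
    target_counts = Counter(y for b in branches for y in b)
    total = 0.0
    for b in branches:
        branch_counts = Counter(b)
        branch_share = len(b) / n  # proportion of the node sent to this branch
        for j, n_j in target_counts.items():
            observed = branch_counts[j] / n      # joint proportion in the node
            expected = branch_share * (n_j / n)  # proportion under independence
            total += (observed - expected) ** 2 / expected
    df = (len(branches) - 1) * (len(target_counts) - 1)
    return n * total, df


split = [["a"] * 8 + ["b"] * 2, ["a"] * 2 + ["b"] * 8]
chi2, df = chi_square_statistic(split)
print(chi2, df)  # approximately 7.2, with 1 degree of freedom
```

This is algebraically the familiar sum of (observed − expected)² / expected over the cells of the branch-by-target table.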

The ARBORETUM procedure provides no statistical test for an ordinal target.

Distributional Assumptions

The F-test assumes that the interval target values $Y_i(\tau_b)$ are normally distributed around a mean that may depend on the branch, $\tau_b$. The chi-square test assumes that the difference between the actual and predicted number of observations for a given target value j in a given branch $\tau_b$, $N_j(\tau_b) - p(\tau_b|\tau)\, N_j(\tau)$, is normally distributed.

Normality is never checked in practice. Even if the distribution over all training observations were normal, the distribution in a branch need not be. The central limit theorem guarantees approximate normality for large $N(\tau_b)$. However, every split decreases $N(\tau_b)$ and thereby degrades the approximation provided by the theorem.
