of branches and input values, the depth of the node in the tree, and the number of
independent input variables for which candidate splits exist in the node. A measure
for a categorical target may incorporate prior probabilities. A measure for a nominal
target may incorporate profit or loss functions, including unequal misclassification
costs. A measure for ordinal targets must incorporate distances between target
values. The ARBORETUM procedure creates a distance function from a loss function
specified with the DECISION statement.
This section defines the formulas for computing the worth of a rule $s$ that splits node
$\tau$ into $B$ branches, creating nodes $\{\tau_b : b = 1, 2, \ldots, B\}$. $N(\tau)$ denotes
the number of observations in node $\tau$ used in the search for the rule $s$.
Reduction in Node Impurity
The impurity $i(\tau)$ of node $\tau$ is a nonnegative number that equals zero if all
observations in $\tau$ have the same target value, and is large if the target values in
$\tau$ are very different. The option CRITERION=VARIANCE specifies average square error
as the impurity measure for an interval target:
$$i(\tau) = \frac{1}{N(\tau)} \sum_{i=1}^{N(\tau)} (Y_i - \bar{Y})^2$$
where $N(\tau)$ is the number of observations in $\tau$, $Y_i$ is the target value of
observation $i$, and $\bar{Y}$ is the average of $Y_i$ in $\tau$.
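The average square error can be checked by hand on a small node. The following is a minimal Python sketch, not the procedure's implementation; the function name `variance_impurity` and the sample values are hypothetical and chosen only for illustration.

```python
def variance_impurity(targets):
    """Average squared deviation of the target values from their node mean."""
    n = len(targets)
    if n == 0:
        raise ValueError("node has no observations")
    mean = sum(targets) / n
    return sum((y - mean) ** 2 for y in targets) / n


# Hypothetical interval targets in one node.
print(variance_impurity([3.0, 3.0, 3.0]))       # identical targets: impurity is zero
print(variance_impurity([1.0, 2.0, 3.0, 4.0]))  # spread-out targets: impurity is 1.25
```

The divisor is the node size, matching the formula above, rather than the unbiased sample-variance divisor.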
The option CRITERION=ENTROPY specifies entropy as the impurity measure for a
categorical target:
$$i(\tau) = -\sum_{j=1}^{J} p_j \log_2 p_j$$
where $p_j$ is the proportion of observations with target value $j$ in $\tau$, possibly
adjusted by prior probabilities, a profit function, or a loss function.
The option CRITERION=GINI specifies the Gini index as the impurity measure,
which is also the average square error for a categorical target:
$$i(\tau) = 1 - \sum_{j=1}^{J} p_j^2$$
For a binary target, CRITERION=GINI creates the same binary splits as
CRITERION=ENTROPY.
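The entropy and Gini measures can be compared on the same node with a short Python sketch. It assumes raw class proportions with no adjustment for prior probabilities, profit, or loss, and the function names are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of observations for each target value present in the node."""
    n = len(labels)
    if n == 0:
        raise ValueError("node has no observations")
    return [count / n for count in Counter(labels).values()]


def entropy_impurity(labels):
    # p * log2(1 / p) equals -p * log2(p); absent classes contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in class_proportions(labels))


def gini_impurity(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))


node = ["a", "a", "b", "b"]    # hypothetical balanced binary target
print(entropy_impurity(node))  # 1.0
print(gini_impurity(node))     # 0.5
print(gini_impurity(["a", "a", "a"]))  # pure node: 0.0
```

Both measures are zero for a pure node and largest when the target values are evenly mixed.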
The worth of a split s is measured as the reduction in node impurity:
$$\Delta i(s, \tau) = i(\tau) - \sum_{b=1}^{B} p(\tau_b|\tau)\, i(\tau_b)$$
where the sum is over the $B$ branches the split $s$ defines, and $p(\tau_b|\tau)$ is the
proportion of observations in $\tau$ assigned to branch $b$.
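As a rough illustration of the reduction in node impurity, the following Python sketch scores a candidate split using the Gini index with unadjusted proportions. The helper names and the data are hypothetical, and any other impurity function could be passed in its place.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index from raw class proportions (no priors, profit, or loss)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_reduction(parent, branches, impurity=gini_impurity):
    """Parent impurity minus the branch impurities weighted by branch share."""
    n = len(parent)
    if sum(len(branch) for branch in branches) != n:
        raise ValueError("branches must partition the parent node")
    if any(len(branch) == 0 for branch in branches):
        raise ValueError("every branch needs at least one observation")
    weighted = sum(len(branch) / n * impurity(branch) for branch in branches)
    return impurity(parent) - weighted


parent = ["a", "a", "b", "b"]
perfect_split = [["a", "a"], ["b", "b"]]
useless_split = [["a", "b"], ["a", "b"]]
print(impurity_reduction(parent, perfect_split))  # 0.5: both branches are pure
print(impurity_reduction(parent, useless_split))  # 0.0: branches mirror the parent
```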
Statistical Tests and p-Values
An alternative to using the reduction in node impurity is to test for a significant
difference of the target values between the different branches defined by a candidate
split. The worth of the split is equal to $-\log_{10}(p)$, where $p$ is the p-value
(possibly adjusted) of the test. The minus sign ensures that the worth is nonnegative,
with larger values being more significant. The ARBORETUM procedure never computes the
raw p-value because it is often smaller than the precision of the computer. Instead,
the procedure computes $\log_{10}(p)$ directly.
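The following Python sketch shows why working with the logarithm matters. It is an assumption-laden illustration, not the procedure's algorithm, which this section does not describe. It uses a chi-square test with 2 degrees of freedom only because the upper-tail probability then has the closed form $\exp(-x/2)$; the function name `logworth_chisq_2df` is hypothetical.

```python
import math


def logworth_chisq_2df(chi2):
    """Return -log10(p) for a chi-square statistic with 2 degrees of freedom.

    With 2 degrees of freedom the upper-tail probability is exp(-chi2 / 2),
    so log10(p) = -chi2 / (2 * ln 10) and p itself is never formed.
    """
    if chi2 < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return chi2 / (2.0 * math.log(10.0))


statistic = 4000.0                    # hypothetical, very significant split
raw_p = math.exp(-statistic / 2.0)    # underflows to 0.0 in double precision
print(raw_p)                          # 0.0, so log10(raw_p) would fail
print(logworth_chisq_2df(statistic))  # about 868.59, still usable for ranking
```

Two splits whose raw p-values both underflow to zero remain distinguishable by their log-scale worth.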
For an interval target, the CRITERION=PROBF option requests using the F-statistic:
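$$F = \frac{SS_{\text{between}} / (B - 1)}{SS_{\text{within}} / (N(\tau) - B)}$$
where
$$SS_{\text{between}} = \sum_{b=1}^{B} N(\tau_b) \left( \bar{Y}(\tau_b) - \bar{Y}(\tau) \right)^2$$
$$SS_{\text{within}} = \sum_{b=1}^{B} \sum_{i=1}^{N(\tau_b)} \left( Y_{bi} - \bar{Y}(\tau_b) \right)^2$$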
The p-value equals the probability that $z \ge F$, where $z$ is a random variable from an
$F$ distribution with $B - 1$ numerator and $N(\tau) - B$ denominator degrees of freedom.
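The F-statistic can be traced on a toy split with the Python sketch below. It is a minimal illustration under stated assumptions: each branch is given as a list of interval target values, the function name `f_statistic` is hypothetical, and the p-value lookup is left out because it needs an F distribution routine that the standard library does not provide.

```python
def f_statistic(branches):
    """Return (F, numerator df, denominator df) for a candidate split.

    branches: one list of interval target values per branch.
    """
    num_branches = len(branches)
    n = sum(len(branch) for branch in branches)
    if num_branches < 2 or any(len(branch) == 0 for branch in branches):
        raise ValueError("need at least two nonempty branches")
    if n <= num_branches:
        raise ValueError("need more observations than branches")

    grand_mean = sum(sum(branch) for branch in branches) / n
    ss_between = 0.0
    ss_within = 0.0
    for branch in branches:
        branch_mean = sum(branch) / len(branch)
        ss_between += len(branch) * (branch_mean - grand_mean) ** 2
        ss_within += sum((y - branch_mean) ** 2 for y in branch)
    if ss_within == 0.0:
        raise ValueError("no within-branch variation; F is undefined")

    df_between = num_branches - 1
    df_within = n - num_branches
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


# Hypothetical two-branch split of six observations.
print(f_statistic([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))  # (13.5, 1, 4)
```

Here the between-branch sum of squares is 13.5 and the within-branch sum is 4.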
For a nominal target, the CRITERION=PROBCHISQ option requests using the chi-square
statistic:
$$\chi^2 = N(\tau) \sum_{b=1}^{B} \sum_{j=1}^{J} \frac{\left( p_j(\tau_b) - p(\tau_b|\tau)\, p_j(\tau) \right)^2}{p(\tau_b|\tau)\, p_j(\tau)}$$
The p-value equals the probability that $\chi^2_\nu \ge \chi^2$, where $\chi^2_\nu$ is a random
variable from a chi-square distribution with $\nu = (B - 1)(J - 1)$ degrees of freedom.
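A small Python sketch of the chi-square statistic follows. It computes the usual Pearson form from a table of counts, which assumes that the observed proportion in the formula is the count for branch $b$ and target value $j$ divided by $N(\tau)$; that reading is an interpretation, not something this section states. The function name `chi_square_statistic` and the counts are hypothetical.

```python
def chi_square_statistic(counts):
    """Return (chi-square, degrees of freedom) for a B x J table of counts.

    counts[b][j] is the number of observations in branch b with target value j.
    """
    n = sum(sum(row) for row in counts)
    branch_totals = [sum(row) for row in counts]
    target_totals = [sum(column) for column in zip(*counts)]
    if min(branch_totals) == 0 or min(target_totals) == 0:
        raise ValueError("every branch and target value needs observations")

    chi2 = 0.0
    for b, row in enumerate(counts):
        for j, observed in enumerate(row):
            # Count expected if branch membership and target were unrelated.
            expected = branch_totals[b] * target_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(branch_totals) - 1) * (len(target_totals) - 1)
    return chi2, dof


# Hypothetical split: 2 branches, 2 target values, 60 observations.
print(chi_square_statistic([[10, 20], [20, 10]]))  # roughly (6.67, 1)
```

Every expected count in this example is 15, so each of the four cells contributes 25/15 to the statistic.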
The ARBORETUM procedure provides no statistical test for an ordinal target.
Distributional Assumptions
The F-test assumes that the interval target values $Y_i(\tau_b)$ are normally distributed
around a mean that may depend on the branch, $\tau_b$. The chi-square test assumes that
the difference between the actual and predicted number of observations for a given
target value $j$ in a given branch $\tau_b$, $N_j(\tau_b) - p(\tau_b|\tau) N_j(\tau)$, is
normally distributed.
Normality is never checked in practice. Even if the distribution over all training
observations were normal, the distribution in a branch need not be. The central limit
theorem guarantees approximate normality for large $N(\tau_b)$. However, every split
decreases $N(\tau_b)$ and thereby degrades the approximation provided by the theorem.