For CRITERION=GINI, define the impurity measure,
$$i(\tau) = \sum_{j=1}^{J} \left(A_{j\hat{k}(\tau)} + 1\right) p_j \,(1 - p_j)$$
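The impurity measure above can be sketched in Python as follows. This is only an illustration: the function name `gini_impurity` and the argument `a` are hypothetical, and `a[j]` stands in for the coefficient $A_{j\hat{k}(\tau)}$, whose definition lies outside this section. When all coefficients are zero, the expression reduces to the ordinary Gini index.

```python
def gini_impurity(p, a=None):
    """Return sum_j (a[j] + 1) * p[j] * (1 - p[j]).

    p : within-node probabilities of the J target values.
    a : hypothetical list of J coefficients standing in for A_{j, k_hat(tau)};
        defaults to zeros, which gives the ordinary Gini index.
    """
    if a is None:
        a = [0.0] * len(p)
    if len(a) != len(p):
        raise ValueError("a and p must have the same length")
    return sum((a_j + 1.0) * p_j * (1.0 - p_j) for a_j, p_j in zip(a, p))


# Two equally likely target values, without and with a coefficient on the first.
unweighted = gini_impurity([0.5, 0.5])
weighted = gini_impurity([0.5, 0.5], [1.0, 0.0])
```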
Missing Values
If the value of the target variable is missing, the observation is excluded from training
and evaluating the tree.
If the value of an input variable X is missing, the MISSING= option in the INPUT
statement that declares X determines how the ARBORETUM procedure treats the
observation. If the option is omitted from the INPUT statement, then the MISSING=
option in the PROC ARBORETUM statement determines the policy for X. If the
option is omitted from PROC statement also, then MISSING=USEINSEARCH is
assumed for X.
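The order of precedence just described can be summarized with a small Python sketch. The function `resolve_missing_policy` and its arguments are hypothetical names used only for illustration; they are not part of the procedure.

```python
DEFAULT_POLICY = "USEINSEARCH"


def resolve_missing_policy(input_option=None, proc_option=None):
    """Pick the MISSING= policy for one input variable.

    input_option : value of MISSING= in the INPUT statement, or None if omitted.
    proc_option  : value of MISSING= in the PROC ARBORETUM statement, or None.
    """
    if input_option is not None:
        return input_option.upper()   # INPUT statement takes precedence
    if proc_option is not None:
        return proc_option.upper()    # otherwise the PROC statement applies
    return DEFAULT_POLICY             # otherwise USEINSEARCH is assumed


policy_a = resolve_missing_policy("distribute", "bigbranch")
policy_b = resolve_missing_policy(None, "bigbranch")
policy_c = resolve_missing_policy()
```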
Specify MISSING=USEINSEARCH to incorporate missing values in the calculation
of the worth of a splitting rule, and consequently to produce a splitting rule
that associates missing values with a branch that maximizes the worth of the split.
For a nominal input variable, a new nominal category representing missing values
is created for the duration of the split search. For an ordinal or interval input
variable, a rule preserves the ordering of the nonmissing values when assigning
them to branches, but may assign missing values to any single branch. Specifying
MISSING=USEINSEARCH may produce a branch exclusively for missing values.
This is desirable when the existence of a missing value is predictive of a target value.
If the MISSING=BIGBRANCH, DISTRIBUTE, or SMALLRESIDUAL option is
specified for X and X is missing, the observation is excluded from the search for a
split on X.
If MISSING=SMALLRESIDUAL, the rule uses the branch with the smallest residual
sum of squares among observations in the within-node training sample with missing
values of X. For a categorical target, the residual sum of squares is
$$\sum_{i=1}^{N_{\text{missing}}} \sum_{j=1}^{J} \left(\delta_{ij} - p_j(\text{nonmissing})\right)^2$$
where the outer sum is over observations with missing values of X, $\delta_{ij}$ equals 1 if
observation i has target value j, and equals 0 otherwise, and $p_j(\text{nonmissing})$ is the
within-node probability of target value j based on observations with nonmissing X in
the within-node training sample and assigned to the branch. When prior probabilities
are not specified, $p_j(\text{nonmissing})$ is the proportion of such observations with target
value j. Otherwise, $p_j(\text{nonmissing})$ incorporates the prior probabilities (and never
incorporates profit or loss coefficients) using the formula described in the
“Posterior and Within Node Probabilities” section beginning on page 36.
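The following Python sketch illustrates the residual sum of squares for a categorical target and the choice of the branch that minimizes it. It is not the procedure's implementation. The function names are hypothetical, target values are assumed to be coded 0 through J-1, the probabilities $p_j(\text{nonmissing})$ for each branch are taken as given (with or without prior probabilities), and breaking ties in favor of the first branch is an assumption.

```python
def residual_ss(missing_targets, p_nonmissing):
    """Residual sum of squares for one branch.

    missing_targets : target value index (0..J-1) of each observation
                      whose value of X is missing.
    p_nonmissing    : the J within-node probabilities p_j(nonmissing)
                      for the branch.
    """
    total = 0.0
    for target in missing_targets:              # outer sum over observations
        for j, p_j in enumerate(p_nonmissing):  # inner sum over target values
            delta = 1.0 if target == j else 0.0
            total += (delta - p_j) ** 2
    return total


def smallest_residual_branch(missing_targets, branch_probs):
    """Return (index of branch with smallest residual SS, list of all scores)."""
    scores = [residual_ss(missing_targets, p) for p in branch_probs]
    # min() keeps the first branch on ties; the tie rule is an assumption.
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Three observations with missing X: two have target value 0, one has value 1.
best, scores = smallest_residual_branch(
    [0, 0, 1],
    [[0.9, 0.1], [0.2, 0.8]],
)
```

In this example the first branch is chosen, because its nonmissing observations mostly have target value 0, as do the observations with missing X.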
If MISSING=SMALLRESIDUAL or USEINSEARCH and no missing values occur
in the within-node training sample for X, then the splitting rule assigns missing
values to the branch with the most observations in the within-node sample, as if
MISSING=BIGBRANCH were specified. If more than one branch has this same
maximum number of observations, then the missing values are assigned to the first
such branch. Assigning observations to the largest branch does not help create
homogeneous branches, but some branch must be assigned in order for the rule to handle
missing values in the future (when applied to observations not in the training data),
and the MISSING=BIGBRANCH policy is the least harmful one possible without
any information about the association of missing values with the target.
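A minimal Python sketch of the largest-branch rule, including the tie rule that favors the first branch, is shown here. The function name `biggest_branch` is hypothetical, and branches are assumed to be listed in rule order.

```python
def biggest_branch(branch_counts):
    """Return the index of the branch with the most observations.

    branch_counts : number of within-node sample observations per branch,
                    in rule order. Ties go to the first such branch.
    """
    if not branch_counts:
        raise ValueError("at least one branch is required")
    best = 0
    for index, count in enumerate(branch_counts):
        if count > branch_counts[best]:   # strict '>' keeps the first on ties
            best = index
    return best


# Branches 1 and 2 tie for the maximum; the first of them is chosen.
chosen = biggest_branch([10, 25, 25])
```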
When a rule is applied to an observation, and the rule requires an input variable whose
value is missing or an unrecognized category, surrogate rules are considered before
the MISSING= option is. A surrogate rule is a backup to the main splitting rule. For
example, the main rule might use variable CITY and the surrogate might use variable
REGION. If CITY is missing and REGION is not missing, the surrogate is used. If
REGION is also missing, then the next surrogate is considered.
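The order in which the main rule and its surrogates are tried can be sketched in Python as follows. The function name `choose_rule_variable` is hypothetical, an observation is assumed to be a dictionary, and a missing value is assumed to be coded as `None`. For brevity the sketch does not check for unrecognized categories, which the procedure treats the same way.

```python
def choose_rule_variable(observation, main_variable, surrogate_variables):
    """Return the first variable, main rule first and then surrogates in
    order, whose value is present in the observation.

    Returns None when neither the main rule nor any surrogate can be applied.
    """
    for name in [main_variable, *surrogate_variables]:
        if observation.get(name) is not None:
            return name
    return None


used = choose_rule_variable({"CITY": None, "REGION": "West"}, "CITY", ["REGION"])
none_apply = choose_rule_variable({"CITY": None, "REGION": None}, "CITY", ["REGION"])
```

A return value of `None` corresponds to the case where no rule can be applied to the observation.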
If none of the surrogates can be applied to the observation, then the MISSING=
option for the splitting variable governs what happens to the observation. If
MISSING=USEINSEARCH and no surrogates are applicable, the observation is
assigned to the branch for missing values specified in the splitting rule. If
MISSING=DISTRIBUTE, the observation is in effect copied, one copy for each
branch. The copy assigned to a branch is given a fractional frequency proportional
to the number of training observations assigned to the branch. The CODE statement
cannot handle rules with MISSING=DISTRIBUTE.
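The fractional frequencies under MISSING=DISTRIBUTE can be illustrated with a short Python sketch. The function name `distribute_frequency` is hypothetical. The text states only that the fractions are proportional to the branch training counts; scaling them so that the copies sum to the original frequency is an assumption of this sketch.

```python
def distribute_frequency(frequency, branch_training_counts):
    """Split one observation's frequency across branches.

    Each branch receives a share proportional to its number of training
    observations. Normalizing the shares to sum to `frequency` is an
    assumption of this sketch.
    """
    total = sum(branch_training_counts)
    if total <= 0:
        raise ValueError("branches must contain training observations")
    return [frequency * count / total for count in branch_training_counts]


# One observation copied to three branches holding 60, 30, and 10
# training observations.
shares = distribute_frequency(1.0, [60, 30, 10])
```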
Unseen Categorical Values
A splitting rule using a categorical variable might not recognize all possible values of
the variable. Some categories might not have been in the training data. Others might
have been so infrequent in the within-node training sample that the ARBORETUM
procedure excluded them. The MINCATSIZE= option in the TRAIN statement specifies
the minimum number of occurrences required for a categorical value to participate
in the search for a splitting rule. Splitting rules treat unseen categorical values
as they would missing values.
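The effect of a minimum category size can be sketched in Python as follows. The function names are hypothetical, and a missing value is assumed to be coded as `None`. The sketch only shows which categories a rule would recognize, not how the procedure searches for the rule.

```python
from collections import Counter


def recognized_categories(values, min_cat_size):
    """Return the categories that occur at least min_cat_size times
    among the nonmissing values of the within-node training sample."""
    counts = Counter(v for v in values if v is not None)
    return {cat for cat, count in counts.items() if count >= min_cat_size}


def as_seen_by_rule(value, recognized):
    """Map an unseen or excluded category to None, so that it is
    handled in the same way as a missing value."""
    return value if value in recognized else None


sample = ["A"] * 5 + ["B"] * 2 + ["C"]
recognized = recognized_categories(sample, min_cat_size=2)
rare_value = as_seen_by_rule("C", recognized)    # too infrequent in the sample
new_value = as_seen_by_rule("Z", recognized)     # never seen in training
known_value = as_seen_by_rule("A", recognized)
```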
Within Node Training Sample
The search for a splitting rule is based on a sample of the training data assigned to
the node. The NODESIZE=n option in the PERFORMANCE statement specifies the
number of observations to use in the sample. The procedure counts and samples the
observations in a node without adjusting for values of the variable specified in the
FREQ statement, if any. If the count is larger than n, then the split search for that
node is based on a random sample of size n.
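The basic sampling rule can be sketched in Python as follows. The function name `within_node_sample` and the `seed` argument are hypothetical. The sketch draws a simple random sample without replacement and counts each observation once, regardless of any frequency variable. It does not cover the special treatment of categorical targets.

```python
import random


def within_node_sample(node_observations, node_size, seed=None):
    """Return the observations used in the split search for one node.

    All observations are used when the node holds no more than node_size
    of them; otherwise a random sample of size node_size is drawn.
    """
    observations = list(node_observations)
    if len(observations) <= node_size:
        return observations
    rng = random.Random(seed)   # seed is only for reproducible examples
    return rng.sample(observations, node_size)


small_node = within_node_sample(range(50), node_size=100)
large_node = within_node_sample(range(10_000), node_size=100, seed=1)
```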
For a categorical target variable, the sample uses as many observations as possible
in each category. Some categories might occur infrequently enough so that all the
observations are in the sample. Let $J_{\text{rare}}$ denote the number of these categories, and
let $n_{\text{rare}}$ denote the total number of observations in the node with these infrequent