The arboretum procedure

Date: 30.04.2018

For CRITERION=GINI, define the impurity measure,

$$i(\tau) = \sum_{j=1}^{J} \left( [\ldots]_{j\hat{k}(\tau)} + 1 \right) p_j \left( 1 - p_j \right)$$

Missing Values

If the value of the target variable is missing, the observation is excluded from training and evaluating the tree.

If the value of an input variable X is missing, the MISSING= option in the INPUT statement that declares X determines how the ARBORETUM procedure treats the observation. If the option is omitted from the INPUT statement, then the MISSING= option in the PROC ARBORETUM statement determines the policy for X. If the option is also omitted from the PROC ARBORETUM statement, then MISSING=USEINSEARCH is assumed for X.
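
The precedence just described can be sketched in a few lines of Python. This is an illustration only, not SAS code: the function name `resolve_missing_policy` and its arguments are hypothetical, `None` stands for an omitted option, and the set of valid policies is limited to the four named in this section.

```python
from typing import Optional

# Policies named in this section; the procedure itself may accept others.
VALID_POLICIES = {"USEINSEARCH", "BIGBRANCH", "DISTRIBUTE", "SMALLRESIDUAL"}


def resolve_missing_policy(input_option: Optional[str],
                           proc_option: Optional[str]) -> str:
    """Return the effective MISSING= policy for one input variable.

    input_option: value given in the INPUT statement, or None if omitted.
    proc_option:  value given in the PROC ARBORETUM statement, or None.
    """
    # The INPUT statement takes precedence over the PROC statement.
    for option in (input_option, proc_option):
        if option is not None:
            policy = option.upper()
            if policy not in VALID_POLICIES:
                raise ValueError(f"unknown MISSING= policy: {option}")
            return policy
    # Neither statement specifies the option: the documented default applies.
    return "USEINSEARCH"
```

For example, `resolve_missing_policy(None, "BIGBRANCH")` yields the PROC-level policy, while passing `None` for both arguments falls back to USEINSEARCH.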

Specify MISSING=USEINSEARCH to incorporate missing values in the calculation of the worth of a splitting rule, and consequently to produce a splitting rule that associates missing values with a branch that maximizes the worth of the split. For a nominal input variable, a new nominal category representing missing values is created for the duration of the split search. For an ordinal or interval input variable, a rule preserves the ordering of the nonmissing values when assigning them to branches, but may assign missing values to any single branch. Specifying MISSING=USEINSEARCH may produce a branch exclusively for missing values. This is desirable when the existence of a missing value is predictive of a target value.
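
To make the idea concrete, the following Python sketch searches binary splits on an interval input and, for each candidate threshold, tries sending the missing values to either branch. It rests on several assumptions that are not taken from the procedure: worth is measured as the reduction in the plain Gini impurity, only two branches are considered, candidate thresholds are midpoints between adjacent distinct values, and the names `gini` and `best_split_with_missing` are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Plain Gini impurity, 1 - sum_j p_j^2, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_with_missing(x, y):
    """Search binary splits on x; None in x marks a missing value.

    Returns (worth, threshold, missing_branch), where missing_branch is
    "left" or "right". Returns (-inf, None, None) if no split is possible.
    """
    present = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    missing = [yi for xi, yi in zip(x, y) if xi is None]
    n = len(y)
    parent = gini(list(y))
    best = (-math.inf, None, None)
    values = sorted({xi for xi, _ in present})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        # Nonmissing values keep their order: small values left, large right.
        left = [yi for xi, yi in present if xi <= threshold]
        right = [yi for xi, yi in present if xi > threshold]
        # Missing values may join either branch; keep the better choice.
        for branch in ("left", "right"):
            left_all = left + missing if branch == "left" else left
            right_all = right + missing if branch == "right" else right
            worth = (parent
                     - len(left_all) / n * gini(left_all)
                     - len(right_all) / n * gini(right_all))
            if worth > best[0]:
                best = (worth, threshold, branch)
    return best
```

When the observations with missing X share the target value that dominates one side of the split, the search places them on that side, which is the behavior the paragraph describes.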

If MISSING=BIGBRANCH, DISTRIBUTE, or SMALLRESIDUAL is specified for X and X is missing, the observation is excluded from the search for a split on X.

If MISSING=SMALLRESIDUAL, the rule assigns missing values to the branch with the smallest residual sum of squares among observations in the within-node training sample with missing values of X. For a categorical target, the residual sum of squares is

$$\sum_{i=1}^{N_{\text{missing}}} \sum_{j=1}^{J} \left( \delta_{ij} - p_j(\text{nonmissing}) \right)^2$$

where the outer sum is over observations with missing values of X, $\delta_{ij}$ equals 1 if observation i has target value j and equals 0 otherwise, and $p_j(\text{nonmissing})$ is the within-node probability of target value j based on observations with nonmissing X in the within-node training sample and assigned to the branch. When prior probabilities are not specified, $p_j(\text{nonmissing})$ is the proportion of such observations with target value j. Otherwise, $p_j(\text{nonmissing})$ incorporates the prior probabilities (and never incorporates profit or loss coefficients) using the formula described in the “Posterior and Within Node Probabilities” section beginning on page 36.
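
A small Python sketch of this computation follows, assuming that no prior probabilities are specified, so that each within-node probability is a simple proportion. The function names `residual_ss` and `smallresidual_branch` are hypothetical, and every branch is assumed to contain at least one observation with nonmissing X.

```python
from collections import Counter


def residual_ss(missing_targets, branch_targets, categories):
    """Residual sum of squares of the missing-X observations for one branch.

    missing_targets: target values of observations whose X is missing.
    branch_targets:  target values of nonmissing-X observations in the branch
                     (assumed nonempty).
    categories:      all target values j to sum over.
    """
    counts = Counter(branch_targets)
    n = len(branch_targets)
    # p_j(nonmissing) without priors: proportion of the branch with target j.
    p = {j: counts[j] / n for j in categories}
    return sum(((1.0 if t == j else 0.0) - p[j]) ** 2
               for t in missing_targets for j in categories)


def smallresidual_branch(missing_targets, branches):
    """Index of the branch with the smallest residual sum of squares."""
    categories = sorted(set(missing_targets).union(*branches))
    scores = [residual_ss(missing_targets, b, categories) for b in branches]
    return scores.index(min(scores))


# Two of the three missing-X observations have target "yes", so the branch
# dominated by "yes" gives the smaller residual sum of squares.
missing = ["yes", "yes", "no"]
branches = [["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]]
chosen = smallresidual_branch(missing, branches)
```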

If MISSING=SMALLRESIDUAL or USEINSEARCH and no missing values occur in the within-node training sample for X, then the splitting rule assigns missing values to the branch with the most observations in the within-node sample, as if MISSING=BIGBRANCH were specified. If more than one branch has this same maximum number of observations, then the missing values are assigned to the first such branch. Assigning observations to the largest branch does not help create homogeneous branches, but some branch must be assigned in order for the rule to handle missing values in the future (when applied to observations not in the training data), and the MISSING=BIGBRANCH policy is the least harmful one possible without any information about the association of missing values with the target.
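
The largest-branch rule, including its tie-breaking behavior, can be illustrated with a short Python sketch. The function name `bigbranch` is hypothetical, and branches are identified here by their position in a list of within-node observation counts.

```python
def bigbranch(branch_counts):
    """Index of the branch with the most observations; ties go to the first."""
    if not branch_counts:
        raise ValueError("a splitting rule needs at least one branch")
    # list.index returns the first position holding the maximum count.
    return branch_counts.index(max(branch_counts))
```

With counts `[10, 25, 25]`, the second and third branches tie, and the sketch selects the second, which is the first branch that reaches the maximum.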

When a rule is applied to an observation, and the rule requires an input variable whose value is missing or is an unrecognized category, surrogate rules are considered before the MISSING= option is. A surrogate rule is a backup to the main splitting rule. For example, the main rule might use variable CITY and the surrogate might use variable REGION. If CITY is missing and REGION is not missing, the surrogate is used. If REGION is also missing, then the next surrogate is considered.

If none of the surrogates can be applied to the observation, then the MISSING= option for the splitting variable governs what happens to the observation. If MISSING=USEINSEARCH and no surrogates are applicable, the observation is assigned to the branch for missing values specified in the splitting rule.
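
The order in which rules are tried can be sketched in Python as follows. The representation of a rule as a (variable, function) pair, the name `assign_branch`, and the city and region values are all hypothetical and chosen only for illustration; the fallback shown corresponds to MISSING=USEINSEARCH.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A rule names one variable and maps its value to a branch index, or returns
# None when the value is not a category the rule recognizes.
Rule = Tuple[str, Callable[[object], Optional[int]]]


def assign_branch(obs: Dict[str, object], main_rule: Rule,
                  surrogates: List[Rule], missing_branch: int) -> int:
    """Apply the main rule, then each surrogate in order, then the fallback."""
    for variable, rule in [main_rule] + surrogates:
        value = obs.get(variable)
        if value is None:
            continue  # missing value: try the next rule in the chain
        branch = rule(value)
        if branch is not None:
            return branch
    # No rule applied: use the branch the splitting rule reserves for
    # missing values (the MISSING=USEINSEARCH behavior).
    return missing_branch


# Illustrative rules: the main rule uses CITY, the surrogate uses REGION.
city_rule: Rule = ("CITY", lambda v: {"Austin": 0, "Boston": 1}.get(v))
region_rule: Rule = ("REGION", lambda v: {"South": 0, "Northeast": 1}.get(v))

branch = assign_branch({"CITY": None, "REGION": "Northeast"},
                       city_rule, [region_rule], missing_branch=0)
```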

If MISSING=DISTRIBUTE, the observation is in effect copied, one copy for each branch. The copy assigned to a branch is given a fractional frequency proportional to the number of training observations assigned to the branch. The CODE statement cannot handle rules with MISSING=DISTRIBUTE.
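
As a rough illustration of the fractional frequencies, the Python sketch below splits one observation's frequency across branches in proportion to their training counts. The function name `distribute` is hypothetical, and normalizing the shares so that they sum to the original frequency is an assumption; the text states only that the shares are proportional to the training counts.

```python
def distribute(frequency, branch_training_counts):
    """Split one observation's frequency across all branches.

    Each branch receives a share proportional to the number of training
    observations assigned to it. The shares sum to the original frequency
    (an assumption of this sketch).
    """
    total = sum(branch_training_counts)
    if total <= 0:
        raise ValueError("branches must contain training observations")
    return [frequency * count / total for count in branch_training_counts]
```

For branches holding 60, 30, and 10 training observations, an observation with frequency 1 is divided into copies whose frequencies are in the ratio 6 : 3 : 1.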

Unseen Categorical Values

A splitting rule using a categorical variable might not recognize all possible values of the variable. Some categories might not have been in the training data. Others might have been so infrequent in the within-node training sample that the ARBORETUM procedure excluded them. The MINCATSIZE= option in the TRAIN statement specifies the minimum number of occurrences required for a categorical value to participate in the search for a splitting rule. Splitting rules treat unseen categorical values as they would missing values.
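
The effect of a minimum category size can be sketched in Python as follows. The function names are hypothetical, `None` stands for a missing value, and the example colors are illustrative data only.

```python
from collections import Counter


def recognized_categories(values, mincatsize):
    """Categories frequent enough to take part in the split search."""
    counts = Counter(v for v in values if v is not None)
    return {cat for cat, n in counts.items() if n >= mincatsize}


def as_seen_by_rule(value, recognized):
    """Map unseen or excluded categories to None, that is, treat as missing."""
    return value if value in recognized else None


# "green" occurs only once, so with a minimum size of 3 it is excluded.
sample = ["red"] * 5 + ["blue"] * 4 + ["green"]
recognized = recognized_categories(sample, mincatsize=3)
```

A value such as "purple" that never appeared in the sample, and a value such as "green" that was too rare, are both mapped to `None` and would then follow the missing-value policy of the rule.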

Within Node Training Sample

The search for a splitting rule is based on a sample of the training data assigned to the node. The NODESIZE=n option in the PERFORMANCE statement specifies the number of observations to use in the sample. The procedure counts and samples the observations in a node without adjusting for values of the variable specified in the FREQ statement, if any. If the count is larger than n, then the split search for that node is based on a random sample of size n.
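
A minimal Python sketch of this sampling step follows. It assumes simple random sampling without replacement, which the text does not state explicitly, and it does not attempt the per-category treatment that applies to categorical targets. The function name `within_node_sample` and the `seed` parameter are hypothetical.

```python
import random


def within_node_sample(observations, nodesize, seed=None):
    """Return the observations used to search for a split in one node.

    Every observation counts once, regardless of any frequency variable.
    """
    if len(observations) <= nodesize:
        # The node is small enough: use all of its observations.
        return list(observations)
    rng = random.Random(seed)
    return rng.sample(observations, nodesize)
```

A node with 100 observations and a node size of 10 is searched using 10 randomly chosen observations, while a node with 3 observations is searched using all of them.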

For a categorical target variable, the sample uses as many observations as possible in each category. Some categories might occur infrequently enough so that all the observations are in the sample. Let $J_{\text{rare}}$ denote the number of these categories, and let $n_{\text{rare}}$ denote the total number of observations in the node with these infrequent
