The ARBORETUM Procedure


Date: 30.04.2018




SPLITSIZE=n

specifies the minimum number of training observations a node must have for the ARBORETUM procedure to consider splitting it. By default, n is twice the value of the LEAFSIZE= option. For the LEAFSIZE=, MINCATSIZE=, and SPLITSIZE= options in the TRAIN statement, and the NODESIZE= option in the PERFORMANCE statement, the procedure counts the observations in a node without adjusting the count by the values of the variable specified in the FREQ statement.
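The size check described above can be sketched in Python. This is an illustration only, not the procedure's implementation: the function name `can_consider_split`, the `>=` comparison, and the sample FREQ values are assumptions made for the example.

```python
def can_consider_split(n_rows, leafsize, splitsize=None):
    """Return True if a node with n_rows training observations may be split.

    Hypothetical helper: when splitsize is not given, it defaults to
    twice leafsize, mirroring the documented SPLITSIZE= default.
    """
    if splitsize is None:
        splitsize = 2 * leafsize
    # Assumption: "must have n observations" is read as n_rows >= splitsize.
    return n_rows >= splitsize


# Hypothetical node holding three data rows with FREQ values 10, 20, 30.
freq_values = [10, 20, 30]
n_rows = len(freq_values)        # 3: the unadjusted count the size options use
freq_total = sum(freq_values)    # 60: the FREQ-adjusted total, not used here

print(can_consider_split(n_rows, leafsize=5))  # False, since 3 < 10
print(can_consider_split(12, leafsize=5))      # True, since 12 >= 10
```

The point of the example is that the node is judged on its 3 rows, not on the FREQ total of 60.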

USEVARONCE

specifies that no splitting rule will be based on an input variable used in a splitting rule of an ancestor node.
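As a rough illustration of the effect of USEVARONCE, the sketch below filters the inputs available to a node. The function `candidate_inputs` and the variable names are hypothetical and are not part of the procedure.

```python
def candidate_inputs(all_inputs, ancestor_rule_inputs, use_var_once):
    """Return the inputs a node may split on.

    Hypothetical helper: with use_var_once set, any input already used in
    a splitting rule of an ancestor node is excluded.
    """
    if not use_var_once:
        return list(all_inputs)
    used = set(ancestor_rule_inputs)
    return [name for name in all_inputs if name not in used]


inputs = ["age", "income", "region"]
print(candidate_inputs(inputs, ["income"], use_var_once=True))   # ['age', 'region']
print(candidate_inputs(inputs, ["income"], use_var_once=False))  # ['age', 'income', 'region']
```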

UNDO Statement

UNDO ;

The UNDO statement is an interactive training statement that undoes the most recent PRUNE, SETRULE, SPLIT, or TRAIN statement issued since the most recent INTERACT statement. The REDO statement may restore what the UNDO statement undoes.
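The relationship between INTERACT, UNDO, and REDO can be pictured with a pair of stacks. This is a conceptual sketch only: the class `InteractiveHistory` is hypothetical, and two behaviors in it are assumptions that the text does not confirm, namely that several statements can be undone in a row and that issuing a new statement discards anything waiting to be redone.

```python
class InteractiveHistory:
    """Hypothetical model of undoable interactive training statements."""

    def __init__(self):
        self._done = []     # statements issued since the last INTERACT
        self._undone = []   # statements undone and available to REDO

    def interact(self):
        # A new INTERACT statement starts a fresh history, so earlier
        # statements can no longer be undone.
        self._done.clear()
        self._undone.clear()

    def issue(self, statement):
        # Record a PRUNE, SETRULE, SPLIT, or TRAIN statement.
        self._done.append(statement)
        self._undone.clear()  # assumption: a new statement discards redo state

    def undo(self):
        # Undo the most recent statement, if there is one.
        if not self._done:
            return None
        statement = self._done.pop()
        self._undone.append(statement)
        return statement

    def redo(self):
        # Restore what the last undo removed, if anything.
        if not self._undone:
            return None
        statement = self._undone.pop()
        self._done.append(statement)
        return statement


history = InteractiveHistory()
history.interact()
history.issue("SPLIT")
history.issue("PRUNE")
print(history.undo())  # PRUNE
print(history.redo())  # PRUNE
```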

Details

Form of a Splitting Rule

A splitting rule uses the value of a single input variable to assign an observation to a branch. The branches are ordered and numbered consecutively starting with 1. Every splitting rule (other than a surrogate rule) includes an assignment of missing values to one or all branches, even if no missing values appear in the data. Rules defining a branch exclusively for missing values assign the missing values to the last branch.

For interval and ordinal inputs, observations with smaller input values are assigned to branches with smaller numbers. Consequently, a list of increasing input values suffices to specify a splitting rule. A surrogate rule may disregard the ordering and assign smaller values of the input to any branch.
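To make the "list of increasing input values" concrete, the sketch below assigns an interval input to a numbered branch with a binary search. The function `assign_branch` is hypothetical. Two details are assumptions: a value equal to a split point is sent to the higher-numbered branch, and missing values go to a single designated branch (the text also allows assigning them to all branches, which is not modeled here).

```python
import math
from bisect import bisect_right


def assign_branch(value, split_points, missing_branch):
    """Return the 1-based branch number for an interval input value.

    split_points is an increasing list; k split points define k + 1
    branches, with smaller values going to lower-numbered branches.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return missing_branch
    # Assumption: a value equal to a split point goes to the higher branch.
    return bisect_right(split_points, value) + 1


split_points = [10, 20]   # three ordered branches: 1, 2, 3
print(assign_branch(5, split_points, missing_branch=3))     # 1
print(assign_branch(10, split_points, missing_branch=3))    # 2
print(assign_branch(25, split_points, missing_branch=3))    # 3
print(assign_branch(None, split_points, missing_branch=3))  # 3
```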

Rules need not assign any training observations to a particular branch. The ARBORETUM procedure does not automatically generate such rules, but a user may specify them in the SETRULE or SPLIT interactive training statements.

Posterior and Within Node Probabilities

The predicted proportions of categorical target values for an observation are called the posterior probabilities of the target values.

For an observation assigned to a node, the posterior probabilities equal the predicted within node probabilities, which are the proportions of the target values of all the training observations assigned to the node, adjusted for prior probabilities (if any), and not adjusted for any profit or loss coefficients.


For an observation assigned to more than one leaf with fractional weights that sum to one, the posterior probabilities are the weighted averages over the leaves of the predicted within node probabilities.
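The weighted average can be written out directly. In this sketch the function `posterior_from_leaves`, the leaf weights, and the leaf probabilities are all made-up illustrations; only the averaging rule comes from the text.

```python
def posterior_from_leaves(leaf_weights, leaf_probs):
    """Average within node probabilities over leaves, weighted per leaf.

    leaf_weights: fractional weights, one per leaf, summing to one.
    leaf_probs:   one list of within node probabilities per leaf.
    """
    if abs(sum(leaf_weights) - 1.0) > 1e-9:
        raise ValueError("leaf weights must sum to one")
    n_targets = len(leaf_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(leaf_weights, leaf_probs))
        for j in range(n_targets)
    ]


# Hypothetical observation split 25% / 75% between two leaves.
weights = [0.25, 0.75]
probs = [[0.8, 0.2],   # within node probabilities in the first leaf
         [0.4, 0.6]]   # within node probabilities in the second leaf
posterior = posterior_from_leaves(weights, probs)
print(posterior)  # approximately [0.5, 0.5]
```

Because each leaf's probabilities sum to one and the weights sum to one, the resulting posterior probabilities also sum to one.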

The within node probabilities for a split search are the proportions of the target values in the within node training sample, adjusted for the bias from stratified sampling. They are also adjusted for prior probabilities if requested by the PRIORSSEARCH option in the PROC ARBORETUM statement, and for profit or loss coefficients if requested by the DECSEARCH option in the PROC ARBORETUM statement.

When neither priors, profits, nor losses are specified, observations are assigned to a single leaf, and within node sampling is not used, the posterior and within node probabilities are simply the proportions of the target values in a node $\tau$:

$$p_j = \mathrm{proportion}_j(\tau) = N_j(\tau) / N(\tau)$$

where $N_j(\tau)$ is the number of training observations in $\tau$ with target value $j$.

When incorporating priors, profits, or losses, this becomes

$$p_j = \frac{\rho_j \, \mathrm{proportion}_j(\tau) / \mathrm{proportion}_j(\mathrm{root})}{\sum_i \rho_i \, \mathrm{proportion}_i(\tau) / \mathrm{proportion}_i(\mathrm{root})}$$

or equivalently,

$$p_j = \frac{\rho_j \, N_j(\tau) / N_j(\mathrm{root})}{\sum_i \rho_i \, N_i(\tau) / N_i(\mathrm{root})}$$

where $N_j(\mathrm{root})$ is the number of training observations in the root node with target value $j$, and $\rho_j$ depends on whether $p_j$ incorporates priors, profits, or losses. Table 7 defines $\rho_j$ by type of quantity being incorporated.
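The equivalence of the proportion form and the count form follows from a short cancellation. The step below assumes that the root proportion is defined like the node proportion, that is, $\mathrm{proportion}_j(\mathrm{root}) = N_j(\mathrm{root})/N(\mathrm{root})$.

```latex
\begin{align*}
\frac{\mathrm{proportion}_j(\tau)}{\mathrm{proportion}_j(\mathrm{root})}
  &= \frac{N_j(\tau)/N(\tau)}{N_j(\mathrm{root})/N(\mathrm{root})}
   = \frac{N(\mathrm{root})}{N(\tau)} \cdot \frac{N_j(\tau)}{N_j(\mathrm{root})}
\end{align*}
```

The factor $N(\mathrm{root})/N(\tau)$ does not depend on $j$, so it appears in the numerator and in every term of the sum in the denominator, and cancels. What remains is the form written with counts only.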

Table 7. $\rho_j$ by Type of Incorporated Quantity

Incorporated Quantity    $\rho_j$                Description
Nothing                  $N_j(\mathrm{root})$    number of j observations in root
Prior Probabilities      $\pi_j$                 prior probability
Profit or Loss           $\pi_j^{a}$             altered prior probability
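The count form of the within node probability, with $\rho_j$ multiplying $N_j(\tau)/N_j(\mathrm{root})$ and the result normalized over target values, can be checked numerically against the first two rows of the table. The sketch below is illustrative: the function `within_node_probs` and all counts and priors are made up, and it assumes every target value occurs at least once in the root.

```python
def within_node_probs(node_counts, root_counts, rho):
    """Compute p_j proportional to rho_j * N_j(node) / N_j(root).

    Assumes every root count is positive. The terms are normalized so
    that the returned probabilities sum to one.
    """
    terms = [r * n / n_root
             for r, n, n_root in zip(rho, node_counts, root_counts)]
    total = sum(terms)
    return [t / total for t in terms]


root_counts = [900, 100]   # hypothetical N_j(root) for two target values
node_counts = [90, 60]     # hypothetical N_j(node) in one node

# Row "Nothing": rho_j = N_j(root) gives the plain node proportions.
plain = within_node_probs(node_counts, root_counts, rho=root_counts)
print(plain)        # [0.6, 0.4], the same as 90/150 and 60/150

# Row "Prior Probabilities": rho_j = prior_j, here equal priors.
with_priors = within_node_probs(node_counts, root_counts, rho=[0.5, 0.5])
print(with_priors)  # approximately [1/7, 6/7]
```

With equal priors, the second target value, which is rare in the root but common in this node, receives most of the probability, which is the intended effect of the adjustment.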

Incorporating Prior Probabilities

The PRIORVAR option in the DECISION statement declares the existence of prior probabilities. If prior probabilities exist, they are always incorporated in the posterior probabilities. If the PRIORSSEARCH option is specified in the PROC ARBORETUM statement, the priors will also be incorporated in the search for a splitting rule. If the PRIORS option to the ASSESS statement is specified, the priors will also be incorporated in the evaluation of subtrees and consequently influence which nodes are automatically pruned.

In all cases, the priors are incorporated by defining the within node probabilities above with $\rho_j = \pi_j$, the prior probability of target value $j$, or, when incorporating a profit or a loss, $\rho_j = \pi_j^{a}$, the altered prior probability defined below.

Incorporating Decisions, Profit, and Loss

The DECSEARCH option in the PROC ARBORETUM statement requests that the split search for a nominal target incorporate the profit or loss functions specified in the DECISION statement. The unequal misclassification costs of Breiman et al. (1984) are a special case in which the decision alternatives equal the target values and the DECDATA= data set is type LOSS. The ARBORETUM procedure generalizes the method of altered priors introduced in Breiman et al. (1984).

The search incorporates the decisions, profit, or loss functions by using $\rho_j = \pi_j^{a}$ in the definition of within node probability, $p_j$. Let $A_{dj}$ denote the coefficient for decision $d$, target value $j$, in the decision matrix. Define

[equation not recoverable from the source]

If the PRIORSSEARCH option is specified in the PROC ARBORETUM statement, requesting the search to incorporate prior probabilities, then

[equation not recoverable from the source]

defines the altered prior probability for target value $j$, where $\pi_j$ denotes the prior probability of $j$. Intuitively, the alteration inflates the prior probability for those target values having large profit or loss coefficients, thereby giving observations with those target values more weight in the split search. The search incorporates the altered priors instead of the original priors.

If the PRIORSSEARCH option is not specified, then the definition of $\pi_j^{a}$ changes by replacing $\pi_j$ with $N_j(\tau)/N(\tau)$, the simple proportion of observations having target value $j$:

[equation not recoverable from the source; it is the altered prior definition with $N_i(\tau)/N(\tau)$ in place of $\pi_i$]

Splitting Criteria

The ARBORETUM procedure searches for rules that maximize the measure of worth associated with the splitting criterion specified in the CRITERION= option in the PROC ARBORETUM statement. Some measures are based on a node impurity measure, others on p-values of a statistical test. A p-value may be adjusted for the number
