where Λ denotes the set of leaves, χ indicates either training or validation data,
ω(τ, χ) is a weight for the node τ , λ(τ, χ) is an inclusion function for cumulative lift
measures, and ψ(τ, χ) is a node statistic.
The node weight, ω(τ, χ), equals the proportion of observations in data set χ in τ
unless the assessment measure incorporates prior probabilities, in which case,
    ω(τ, χ) = Σ_j π_j N_j(τ, χ) / N_j(root, χ)
where π_j denotes the prior probability of target value j, and N_j denotes the number of observations with target value j in data set χ in τ.
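As a concrete illustration, the following is a minimal Python sketch of this prior-weighted node weight; the priors and counts below are hypothetical values, not taken from the procedure.

```python
# Minimal sketch of the prior-weighted node weight
# omega(tau, chi) = sum_j pi_j * N_j(tau, chi) / N_j(root, chi).
# The priors and counts are hypothetical illustration values.

priors = {"event": 0.1, "nonevent": 0.9}    # pi_j
n_node = {"event": 30, "nonevent": 120}     # N_j(tau, chi)
n_root = {"event": 200, "nonevent": 1800}   # N_j(root, chi)

omega = sum(priors[j] * n_node[j] / n_root[j] for j in priors)
print(omega)   # approximately 0.075
```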
The inclusion function, λ(τ, χ), equals 1 unless MEASURE=LIFT or MEASURE=LIFTPROFIT. These measures use only a proportion γ of the data to compute the assessment. The ARBORETUM procedure orders the leaves τ by descending values of ψ(τ, training): the first leaf has the largest value of ψ(τ, training), not the smallest. The cumulative lift measures use observations in leaves with large values of ψ.
Let the relation τ′ < τ stand for ψ(τ′, training) > ψ(τ, training). Define

    Ω(τ, χ) = Σ_{τ′ < τ} ω(τ′, χ)
Intuitively, Ω(τ, χ) is the proportion of observations in the χ data set that lie in leaves τ′ such that ψ(τ′, training) > ψ(τ, training).
For fixed χ and 0 < γ < 1, there exists a unique τ∗ such that Ω(τ∗, χ) ≥ γ and Ω(τ∗ − 1, χ) < γ, where τ∗ − 1 denotes the leaf immediately preceding τ∗ in the ordering. Define the inclusion function to be
    λ(τ, χ) =
        1                                  if τ < τ∗
        (γ − Ω(τ∗ − 1, χ)) / ω(τ∗, χ)      if τ = τ∗
        0                                  if τ > τ∗
Note that 0 < λ(τ ∗, χ) ≤ 1. Intuitively, λ(τ, χ) selects which leaves to include
in the cumulative lift measure, and will select a fraction of one particular leaf, τ ∗,
if the required number of observations, γN(root, χ), does not equal the number of
observations in a set of whole leaves.
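To make the selection concrete, here is a minimal Python sketch of the inclusion function; the leaf weights and γ below are hypothetical, and the leaves are assumed to be pre-sorted by descending ψ(τ, training).

```python
# Minimal sketch of the inclusion function lambda(tau, chi) for the
# cumulative lift measures. Leaves are assumed already sorted by
# descending psi(tau, training); the weights and gamma are hypothetical.

weights = [0.10, 0.15, 0.20, 0.25, 0.30]   # omega(tau, chi), best leaf first
gamma = 0.40                               # proportion of data to use

def inclusion(weights, gamma):
    """Return lambda(tau, chi) for each leaf in order."""
    lam = []
    cum = 0.0                    # Omega accumulated over the preceding leaves
    for w in weights:
        if cum + w <= gamma:     # tau < tau*: whole leaf is included
            lam.append(1.0)
        elif cum < gamma:        # tau = tau*: include a fraction of the leaf
            lam.append((gamma - cum) / w)
        else:                    # tau > tau*: leaf is excluded
            lam.append(0.0)
        cum += w
    return lam

lam = inclusion(weights, gamma)
print(lam)   # first two leaves fully included, third fractionally, rest excluded
```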
In the definition of ψ(τ, χ) below, p_j(τ, χ) denotes the proportion of observations with target value j in data set χ in τ. If the assessment measure incorporates prior probabilities,

    p_j(τ, χ) = [π_j N_j(τ, χ) / N_j(root, χ)] / [Σ_i π_i N_i(τ, χ) / N_i(root, χ)]
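The renormalization can be sketched in a few lines of Python; as before, the priors and counts are hypothetical illustration values.

```python
# Sketch of the prior-adjusted proportion p_j(tau, chi). The priors and
# counts are hypothetical illustration values.

priors = {"event": 0.1, "nonevent": 0.9}    # pi_j
n_node = {"event": 30, "nonevent": 120}     # N_j(tau, chi)
n_root = {"event": 200, "nonevent": 1800}   # N_j(root, chi)

# Numerator for each j, then renormalize over all target values i.
raw = {j: priors[j] * n_node[j] / n_root[j] for j in priors}
total = sum(raw.values())
p = {j: raw[j] / total for j in raw}
print(p)   # the adjusted proportions sum to 1
```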
The remaining sections define ψ(τ, χ) for the different assessment measures.
Formula for Profit and Loss
For an interval target with MEASURE=PROFIT,
    ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i=1}^{N(τ,χ)} E_i(τ)

where E_i(τ) is the estimated profit or loss for observation i in τ.
For a categorical target,

    ψ(τ, χ) = Σ_j A_{jd̂} p_j(τ, χ)

where d̂ is the node decision, and A_{jd̂} is the coefficient in the decision matrix for target value j, decision d̂. ψ(τ, χ) represents profit, revenue, or loss according to whether the DECDATA= data set in the DECISION statement has type PROFIT, REVENUE, or LOSS. Note that ψ(τ, χ) does not incorporate decision costs, and therefore does not represent profit if the DECDATA= data set has type REVENUE.
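The formula above can be sketched as follows; the decision matrix, decision names, and proportions are hypothetical, and the node decision d̂ is assumed here to maximize expected profit over the available decisions.

```python
# Sketch of psi(tau, chi) = sum_j A_{j,d_hat} * p_j(tau, chi) for a
# categorical target. The decision matrix, decisions, and proportions are
# hypothetical; d_hat is assumed to maximize expected profit.

profit_matrix = {                 # A[j][d]: row = target value, column = decision
    "event":    {"solicit": 10.0, "ignore": 0.0},
    "nonevent": {"solicit": -1.0, "ignore": 0.0},
}
p = {"event": 0.2, "nonevent": 0.8}   # p_j(tau, chi)
decisions = ["solicit", "ignore"]

d_hat = max(decisions, key=lambda d: sum(p[j] * profit_matrix[j][d] for j in p))
psi = sum(p[j] * profit_matrix[j][d_hat] for j in p)
print(d_hat, psi)
```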
Formula for Misclassification Rate
    ψ(τ, χ) = Σ_{j ≠ ĵ} p_j(τ, χ)

where ĵ is the predicted target value in the node.
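Equivalently, the misclassification rate is one minus the proportion for the predicted value. A small Python sketch, with hypothetical proportions and ĵ taken to be the most probable target value:

```python
# Sketch of the misclassification rate psi(tau, chi) = sum over j != j_hat
# of p_j(tau, chi). The proportions are hypothetical, and j_hat is taken
# to be the most probable target value.

p = {"a": 0.5, "b": 0.3, "c": 0.2}    # p_j(tau, chi)
j_hat = max(p, key=p.get)             # predicted target value in the node
psi = sum(v for j, v in p.items() if j != j_hat)
print(psi)   # equals 1 - p[j_hat]
```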
Formula for Average Square Error and Gini
For an interval target using MEASURE=ASE,

    ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i=1}^{N(τ,χ)} (y_i − ŷ(τ))²

where ŷ(τ) is the average of the target variable among the training observations in node τ.
For a categorical target using MEASURE=ASE,

    ψ(τ, χ) = (1 / N(τ, χ)) Σ_{i=1}^{N(τ,χ)} Σ_{j=1}^{J} (δ_ij − p̂_j(τ))²

where δ_ij equals 1 if observation i has target value j, and equals 0 otherwise, and p̂_j(τ) = p_j(τ, training), the predicted probability of target value j for observations in τ. A simpler, equivalent expression is
    ψ(τ, χ) = 1 − 2 Σ_j p_j(τ, χ) p̂_j(τ) + Σ_j p̂_j(τ)²
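As a quick sanity check, the following Python sketch verifies numerically, on made-up data, that the observation-level sum and the simpler expression agree.

```python
# Numeric check, on made-up data, that the observation-level average
# square error equals the simpler expression
# 1 - 2*sum_j p_j(tau, chi)*p_hat_j(tau) + sum_j p_hat_j(tau)^2.

targets = ["yes", "yes", "no", "yes", "no"]   # observed target values in tau
p_hat = {"yes": 0.7, "no": 0.3}               # p_hat_j(tau) = p_j(tau, training)
values = list(p_hat)
n = len(targets)

# Direct form: average over observations of sum_j (delta_ij - p_hat_j)^2.
direct = sum(
    sum(((1.0 if y == j else 0.0) - p_hat[j]) ** 2 for j in values)
    for y in targets
) / n

# Simplified form, with p_j(tau, chi) computed from the same observations.
p = {j: targets.count(j) / n for j in values}
simple = (1 - 2 * sum(p[j] * p_hat[j] for j in values)
            + sum(p_hat[j] ** 2 for j in values))

print(direct, simple)   # the two forms agree
```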
If the assessment measure incorporates prior probabilities (if any) and χ represents
the training data, then the expression reduces to the Gini index,
    ψ(τ, training) = 1 − Σ_j p_j(τ, training)²
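The reduction can also be checked numerically; in this Python sketch the proportions are hypothetical, and p̂_j(τ) is set equal to p_j(τ, training) as in the training-data case.

```python
# Check, with hypothetical proportions, that setting
# p_hat_j(tau) = p_j(tau, training) reduces the ASE expression to the
# Gini index 1 - sum_j p_j(tau, training)^2.

p = {"yes": 0.6, "no": 0.4}                 # p_j(tau, training)
s = sum(v * v for v in p.values())          # sum_j p_j^2

ase_form = 1 - 2 * s + s    # 1 - 2*sum_j p_j*p_hat_j + sum_j p_hat_j^2
gini = 1 - s
print(ase_form, gini)   # both equal the Gini index
```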
Formula for Lift
For MEASURE=LIFT,
    ψ(τ, χ) = p_event(τ, χ)
For MEASURE=LIFTPROFIT, ψ(τ, χ) is the same as defined for MEASURE=PROFIT.
Performance Considerations
When the ARBORETUM procedure begins, it reserves memory in the computer for the calculations necessary for growing the tree. Later, the procedure reads the entire training data and performs as many tasks as the reserved memory can accommodate, postponing other tasks for a subsequent pass over the data. Typically, the procedure spends most of its time accessing the data, so reducing the number of passes over the data also reduces the execution time.
Passes Over the Data
Each of the following tasks for a node requires a pass over the entire training data:
• compute node statistics
• search for a split on an input variable
• determine a rule for missing values for a specified split
• search for a surrogate rule on an input variable
If only one task were done at a time, the number of passes over the training data
would approximately equal the number of nodes times the number of input variables.
Surrogate splits would require more passes. The number of additional passes equals
the number of inputs minus one. The actual number is typically less for three reasons.
First, if no split on an input variable is found in a node, then no search is attempted on that input in any descendant node. (See the description of the MAXRULES= option in the TRAIN statement for some situations in which no split exists on an input.)
Second, the procedure does not search for any splits in nodes at the depth specified in
the MAXDEPTH= option in the TRAIN statement. Third, given sufficient memory,
the procedure may perform several tasks during the same pass.
The procedure computes node statistics before beginning a split search in that node.
Consequently, creating a node and finding a split requires at least two passes of the
data. The procedure will search for a split in a node on every input variable in one
pass of the data if enough memory is available. The search for surrogate splits begins