The DUMMY option in the SCORE statement specifies that the OUT= data set contain
numeric variables _i_ for integers i ranging from 1 to the number of leaves. The
value of _i_ equals the proportion of the observation assigned to the leaf with leaf
identification number i. The sum of these variables equals one for each observation.
Unless the MISSING=DISTRIBUTE option is specified in an INPUT statement
or in the PROC statement, exactly one of the variables _i_ equals one, and the rest
are zero. When the MISSING=DISTRIBUTE option is specified, observations may be
distributed over more than one leaf, and _i_ equals the proportion of the observation
assigned to leaf i.
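As an illustration of this layout (hypothetical values sketched in plain Python, not actual PROC ARBORETUM output):

```python
# Hypothetical sketch of the _i_ dummy variables described above.
# The function name and values are illustrative only.

def leaf_dummies(proportions, n_leaves):
    """Return the _1_ .. _n_ values for one observation.

    `proportions` maps a leaf identification number to the proportion
    of the observation assigned there: a single {leaf: 1.0} entry in
    the default case, or several entries when MISSING=DISTRIBUTE
    spreads the observation over more than one leaf.
    """
    return [proportions.get(i, 0.0) for i in range(1, n_leaves + 1)]

# Default behavior: the whole observation lands in one leaf.
one_leaf = leaf_dummies({3: 1.0}, n_leaves=4)        # [0.0, 0.0, 1.0, 0.0]

# MISSING=DISTRIBUTE: proportions over several leaves, still summing to 1.
distributed = leaf_dummies({1: 0.25, 3: 0.75}, n_leaves=4)

print(one_leaf, sum(one_leaf))
print(distributed, sum(distributed))
```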
SEQUENCE= Output Data Set
The SEQUENCE= option in the SAVE statement specifies a data set to contain fit
statistics for all subtrees in the subtree sequence. See the "Tree Assessment and the
Subtree Sequence" section beginning on page 49 for an explanation of the subtree
sequence. Each observation describes a subtree with a different number of leaves.
The variables are
• _ASSESS_, the assessment value
• _VASSESS_, the assessment value based on validation data
• _SEQUENCE_, the assessment value used for creating the subtree sequence
if different from _ASSESS_
• _VSEQUENCE_, the validation assessment value used for creating the subtree
sequence if different from _VASSESS_
• fit statistics variables output by the OUTFIT= option of the SCORE statement
Examples
Example 1. Prior Probabilities with Biased Samples
This example illustrates the need for prior probabilities when the training data
contains different proportions of categorical target values than does the data to which
the model is intended to apply. A common situation is that of a binary target in which
one value occurs infrequently. Some analysts will remove a portion of the observations
with the more frequent target value from the training data. One reason is to reduce
the volume of data without changing the predictions very much. Another stems from
the belief that the algorithm performs better when starting with equal proportions of
the target values.
This example compares four approaches to analyzing data in which one value of a
binary target dominates:
ORIGINAL
no sampling or use of prior probabilities
BIASED
sampling of observations with the majority target value
PRIORS
using prior probabilities in prediction and assessment
PSEARCH
using prior probabilities in the split search as well
Prior probabilities convert the predictions from the BIASED analysis to predictions
that would obtain from analyzing the original data. Using prior probabilities in the
split search produces splits similar to those found in the original data, undermining
any attempt to train the tree on roughly equal amounts of the two target values.
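The conversion amounts to reweighting each posterior probability by the ratio of its prior to its proportion in the biased training data, and then renormalizing. A minimal sketch of that standard adjustment (the function name and the numbers are illustrative, not part of PROC ARBORETUM, which performs the adjustment internally):

```python
def adjust_posteriors(posteriors, train_props, priors):
    """Reweight class posteriors estimated from a biased training sample.

    posteriors[i]  - model's posterior for class i on the biased data
    train_props[i] - proportion of class i in the biased training data
    priors[i]      - prior (true) probability of class i
    """
    weighted = [p * pr / tp
                for p, tp, pr in zip(posteriors, train_props, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# The biased sample has roughly equal target proportions (1/2, 1/2),
# while the true priors are (3/4 tails, 1/4 heads).
adjusted = adjust_posteriors(posteriors=[0.5, 0.5],
                             train_props=[0.5, 0.5],
                             priors=[0.75, 0.25])
print(adjusted)  # [0.75, 0.25]
```

A 50/50 posterior on the biased data maps back to the priors themselves, exactly as a leaf with no information about the target should.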
Suppose variables COIN1 and COIN2 represent the outcomes of tossing two coins in
which heads and tails are equally likely. Let the variable HEADS be 1 if both COIN1
and COIN2 are heads, and 0 otherwise. Evidently, the probability that HEADS
equals 1 is 1/4.
COIN1 and COIN2 completely determine HEADS, and a decision tree should predict
HEADS perfectly, regardless of what prior probabilities are specified or what
proportion of target values is in the training data, as long as at least one instance
of each of the four possible combinations of COIN1 and COIN2 is in the training
data.
Prior probabilities become necessary when the input variables influence the target
without completely determining it. The SAS DATA step below generates a random
variable X1 equal to COIN1 75 percent of the time, and generates a random variable
X2 similarly. When COIN1 and COIN2 are both heads, the chance that both X1 and
X2 are heads is 0.75 times 0.75, which equals 0.5625.
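The 0.75 times 0.75 calculation can be checked by exact enumeration of the joint distribution (plain Python, illustrative only, not part of the SAS program):

```python
from itertools import product

P_FLIP = 0.25  # chance that Xi differs from its coin

def p_x_given_coin(x, coin):
    return 1 - P_FLIP if x == coin else P_FLIP

# Joint distribution of (coin1, coin2, x1, x2); the coins are fair
# and independent, and each Xi depends only on its own coin.
joint = {
    (c1, c2, x1, x2): 0.25 * p_x_given_coin(x1, c1) * p_x_given_coin(x2, c2)
    for c1, c2, x1, x2 in product((0, 1), repeat=4)
}

# P(X1 = 1 and X2 = 1 | HEADS = 1): HEADS = 1 means both coins are 1,
# an event of probability 1/4.
p_xx_given_heads = sum(p for (c1, c2, x1, x2), p in joint.items()
                       if c1 == c2 == 1 and x1 == x2 == 1) / 0.25
print(p_xx_given_heads)  # 0.5625
```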
data original;
   keep heads x1 x2;
   call streaminit(9754321);
   do i = 1 to 10000;
      coin1 = rand('bernoulli', 0.5);
      coin2 = rand('bernoulli', 0.5);
      heads = coin1 eq 1 and coin2 eq 1;
      x1 = coin1;
      if rand('bernoulli', 0.25) ne 0 then x1 = 1 - x1;
      x2 = coin2;
      if rand('bernoulli', 0.25) ne 0 then x2 = 1 - x2;
      output;
   end;
run;
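As a cross-check of the generating process, the same logic can be mirrored in plain Python (this is not part of the SAS example; the seed is arbitrary):

```python
import random

random.seed(9754321)  # any seed works; this one echoes the SAS example

rows = []
for _ in range(10000):
    coin1 = random.random() < 0.5
    coin2 = random.random() < 0.5
    heads = int(coin1 and coin2)
    # Each Xi copies its coin, but is flipped 25 percent of the time.
    x1 = (not coin1) if random.random() < 0.25 else coin1
    x2 = (not coin2) if random.random() < 0.25 else coin2
    rows.append((heads, int(x1), int(x2)))

p_heads = sum(h for h, _, _ in rows) / len(rows)
print(round(p_heads, 3))  # close to 0.25
```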
The proportion of observations in the data set ORIGINAL for which HEADS equals 1
is about 1/4. By keeping only every third observation in which HEADS is not equal
to 1, the DATA step below creates a data set named BIASED, in which the proportion
of observations with HEADS equal to 1 is about 1/2.
data biased;
   retain n_tails 0;
   drop n_tails;
   set original;
   if heads eq 0 then do;
      n_tails + 1;
      if mod(n_tails, 3) eq 0 then output;
   end;
   else do;
      output;
   end;
run;
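The counts work out as follows (an illustrative plain-Python tally, assuming the nominal 2500/7500 split in ORIGINAL):

```python
# ORIGINAL has about 2500 heads (1/4 of 10000) and 7500 tails.
n_heads, n_tails_orig = 2500, 7500

# The DATA step keeps a tails observation only when its running count
# is a multiple of 3, i.e. one tails observation in every three.
n_tails_kept = sum(1 for i in range(1, n_tails_orig + 1) if i % 3 == 0)

p_heads_biased = n_heads / (n_heads + n_tails_kept)
print(n_tails_kept, round(p_heads_biased, 3))  # 2500 0.5
```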
In the following code, the ARBORETUM procedure creates a tree from the data set
ORIGINAL and saves three data sets containing results:

ORIGINAL_PATH    description of each path to a leaf
ORIGINAL_NODES   counts and predictions in each leaf
ORIGINAL_SEQ     assessment of each subtree

The KEEP= and RENAME= data set options specify which variables to keep and
what to name them. Selecting and renaming variables now will make it easy to merge
and print results from different runs of the procedure later. The procedure is also run
using the BIASED data set. That code is not shown because it is identical to the code
below except that 'biased' replaces 'original' everywhere.
proc arboretum data=original;
   input x1 x2 / level=binary;
   target heads / level=binary;
   assess;
   save path=original_path
        nodestats=original_nodes(keep=leaf n npriors p_heads1
                                 where=(leaf ne .)
                                 rename=(n=n_original
                                         npriors=np_original
                                         p_heads1=p_original))
        sequence=original_seq(keep=_assess_ rename=(_assess_=original));
run;
The following DATA step creates a variable, PRIOR, equal to the prior probability
of the corresponding value of HEADS.
data priors;
   input heads prior;
   datalines;
0 0.75
1 0.25
;
run;
The ARBORETUM procedure is run a second time on the BIASED data set, this time
with a DECISION statement to include the prior probabilities specified in the PRIOR
variable of the PRIORS data set. The procedure uses the priors to adjust the posterior
probabilities, but not to adjust the overall evaluation of a subtree unless explicitly
requested. The "Incorporating Prior Probabilities in the Tree Assessment" section
on page 66 compares the two approaches using the assessment values saved from
the two ASSESS statements here. The first ASSESS statement computes the subtree