The ARBORETUM Procedure



Output 1.4. Subtree Sequences of Misclassification Rates

Obs    original    biased     priors     passess
  1      0.2420    0.48928    0.48928    0.25000
  2      0.2420    0.33279    0.48928    0.25000
  3      0.2162    0.28629    0.28629    0.21579
  4      0.2162    0.28629    0.28629    0.21579

Incorporating Prior Probabilities in the Split Search

The worth of a splitting rule for a categorical target depends on the relative proportions of the different target values within each branch (and within the node being split). When the PRIORSSEARCH option is specified in the PROC ARBORETUM statement, the ARBORETUM procedure uses the prior probabilities to modify these proportions and, consequently, the worth of a splitting rule. This section compares the worth computed for the same split in four situations:

ORIGINAL   4,946 unbiased observations from the ORIGINAL data
BIASED     the BIASED subsample, no priors
PRIORS     the BIASED subsample with prior probabilities
PSEARCH    the BIASED subsample with the PRIORSSEARCH option
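As a rough illustration of how prior probabilities can modify the proportions in a node, the following DATA step sketch weights the count of each target value in a node by its prior probability divided by its count in the training data, and then renormalizes. This is a common adjustment for classification trees and is an assumption here, not a statement of the exact computation that the ARBORETUM procedure performs. The counts, the priors, and the names NODE_COUNTS, N_TRAIN, N_NODE, PRIOR, and W are hypothetical.

```sas
/* Hypothetical counts and priors, for illustration only. */
data node_counts;
   input heads n_train n_node prior;
   /* prior-weighted share of this target value that falls in the node */
   w = prior * n_node / n_train;
   datalines;
1 2473 300 0.25
0 2473 100 0.75
;

proc sql;
   /* raw_prop uses counts only; adj_prop renormalizes the weighted shares */
   select heads,
          n_node / sum(n_node) as raw_prop,
          w / sum(w)           as adj_prop
      from node_counts;
quit;
```

With these hypothetical numbers, the raw proportions in the node are 0.75 and 0.25, whereas the prior-adjusted proportions are both 0.5.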

The worth of a split depends on the number of observations. To make the split worths comparable, the number of observations read from the ORIGINAL training data is reduced here to the number in the BIASED data (4,946). All candidate splitting rules are saved in a data set named ORIGINAL_RULES.

   proc arboretum data=original(obs=4946);
      input x1 x2 / level=binary;
      target heads / level=binary;
      assess;
      save rules=original_rules;
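To see why worth depends on the number of observations, consider a criterion based on a chi-square test (an assumption made for illustration; this section does not state which criterion these runs use). The sketch below builds the same hypothetical 2x2 table of X1 by HEADS at two sample sizes, with identical proportions. The data set name CELLS, the variables N_OBS and COUNT, and all counts are hypothetical.

```sas
/* Hypothetical cell counts: the second table is the first one doubled. */
data cells;
   input n_obs x1 heads count;
   datalines;
200 0 0 60
200 0 1 40
200 1 0 40
200 1 1 60
400 0 0 120
400 0 1 80
400 1 0 80
400 1 1 120
;

proc freq data=cells;
   by n_obs;                 /* data are already in ascending order of n_obs */
   weight count;             /* each record stands for COUNT observations */
   tables x1*heads / chisq;  /* chi-square test of X1 by HEADS per sample size */
run;
```

The proportions in the two tables are identical, but the chi-square statistic doubles from 8 to 16 when the counts double, so a worth derived from its p-value grows as well.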

The splitting rules for the BIASED and PRIORS runs have already been saved in the data sets BIASED_RULES and PRIORS_RULES, respectively.

The following code uses the PRIORSSEARCH option and saves the splitting rules in the data set PSEARCH_RULES.

   proc arboretum data=biased priorssearch;
      input x1 x2 / level=binary;
      target heads / level=binary;
      decision decdata=priors priorvar=prior;
      assess priors;
      save rules=psearch_rules;


The WORTHDS macro extracts the worth of the best splitting rule on X1 and on X2 in the root node from the data set named in the RULES= option of the SAVE statement of the ARBORETUM procedure.

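   %macro worthds(prefix);
      /* Build &prefix._worth from &prefix._rules: one row per input */
      /* (x1, x2) holding the worth saved for the root node.         */
      data &prefix._worth;
         length var $ 2;
         length variable $ 2;
         keep &prefix variable;
         set &prefix._rules end=the_end;
         retain x1 x2 var;
         /* Remember which input the current root-node rule uses. */
         if node eq 1 and stat = 'VARIABLE' then
            var = left(trim(character_value));
         /* Store the worth under the remembered input. */
         if node eq 1 and stat = 'WORTH' then do;
            if var eq 'x1' then x1 = numeric_value;
            else x2 = numeric_value;
         end;
         /* After the last record, write one row per input. */
         if the_end ne 0 then do;
            variable = 'x1';
            &prefix = x1;
            output;
            variable = 'x2';
            &prefix = x2;
            output;
         end;
      run;
   %mend;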

The following code extracts the split worth from each of the four ARBORETUM runs and combines them into one data set, which is then printed.

   %worthds(original);
   %worthds(biased);
   %worthds(priors);
   %worthds(psearch);

   data worth;
      /* One SET statement per data set reads them side by side. */
      set original_worth;
      set biased_worth;
      set priors_worth;
      set psearch_worth;
   run;

   proc print data=worth;
   run;
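The DATA step above uses a separate SET statement for each input data set, which pairs observations one to one rather than stacking them. The following sketch contrasts the two behaviors on small stand-in data sets; the names A_WORTH, B_WORTH, SIDE_BY_SIDE, and STACKED and all values are hypothetical.

```sas
/* Hypothetical inputs with the same layout as the *_WORTH data sets. */
data a_worth;
   length variable $ 2;
   variable = 'x1'; a = 1; output;
   variable = 'x2'; a = 2; output;
run;

data b_worth;
   length variable $ 2;
   variable = 'x1'; b = 3; output;
   variable = 'x2'; b = 4; output;
run;

/* Separate SET statements: observations are paired one to one. */
data side_by_side;
   set a_worth;
   set b_worth;
run;

/* One SET statement listing both: observations are stacked. */
data stacked;
   set a_worth b_worth;
run;
```

SIDE_BY_SIDE has two observations with the variables VARIABLE, A, and B, which is the layout needed for a table with one column per run. STACKED has four observations, each with a missing value for A or B. One-to-one reading pairs rows by position only, so it relies on every input data set listing x1 and x2 in the same order.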


Output 1.5. Split Worth for Four ARBORETUM Runs

Obs    variable    original     biased     priors    psearch
  1       x1        88.7930    127.151    127.151    93.5635
  2       x2        97.8053    127.467    127.467    93.4600

Output 1.5 shows the worth of the same split on X1 in the four situations, and similarly for X2. X1 and X2 produce splits of nearly equal worth, as expected, except in the truncated ORIGINAL training data, where, presumably, chance variation in the sample resulted in a better split for X2. With the BIASED training data, only the ARBORETUM run with the PRIORSSEARCH option produced a split worth comparable to that of the ORIGINAL data set.

