The ARBORETUM Procedure


FORMAT= format

specifies the format to use in the DATA step code for numeric values that do not have a format from the input data set. The default format is BEST20.

LINESIZE | LS= n

specifies the line size for generated code. The default is 72. The permissible range is 64 to 254.

NOLEAFID

suppresses the creation of the variables _NODE_ and _LEAF_, which contain the node and leaf identification numbers of the leaf to which the observation is assigned. The variables are created by default.

NOPREDICTION

suppresses the code for computing predicted variables, such as P_:. The default is PREDICTION, which requests such code.

PMML

requests XML output instead of SAS DATA step code.

RESIDUAL

requests the DATA step code to create variables, such as residuals, that require the target variable. These variables are the ones with a “yes” in the “Target” column of the table in the section “Variable Names and Conditions for Their Creation” on page 59. Using the DATA step code generated by the RESIDUAL option with a data set that does not contain the target variable produces confusing notes and warnings. The default is NORESIDUAL, which suppresses the generation of the DATA step code for these variables.

DECISION Statement

DECISION DECDATA=SAS-data-set < options > ;

The DECISION statement specifies decision functions and prior probabilities for categorical targets. The ARBORETUM procedure uses the term decision in the sense of decision theory: a decision is one of a set of alternatives, each associated with a function of posterior probabilities. For an observation i, a tree determines the decision $d_i$ whose associated function evaluates to the best value, $E_i(d)$. The interpretation of best, as well as the form of the function, depends on whether the type of the DECDATA= data set is profit, revenue, or loss. The SAS DATA step TYPE= option specifies the data set type. If the DECDATA= data set has no type, the ARBORETUM procedure assumes a type of profit.

The following formulas define $E_i(d)$ and $d_i$. The sum is over the J categorical target values, and $p_{ij}$ denotes the posterior probability of target value j for observation i. The coefficient, $A_{jd}$, for target value j, decision d, is specified in the DECDATA= data set.

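PROFIT

$$E_i(d) = \sum_{j=1}^{J} A_{jd}\, p_{ij}, \qquad d_i = \arg\max_{d} E_i(d)$$

REVENUE

$$E_i(d) = \sum_{j=1}^{J} A_{jd}\, p_{ij} - C_{id}, \qquad d_i = \arg\max_{d} E_i(d)$$

where $C_{id}$ is the cost of decision d for observation i, specified in the COST= option.

LOSS

$$E_i(d) = \sum_{j=1}^{J} A_{jd}\, p_{ij}, \qquad d_i = \arg\min_{d} E_i(d)$$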
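The three decision functions can be illustrated with a minimal Python sketch. It is not the procedure's implementation; the function name `best_decision`, the dictionary layout, and the example target values and decision names are assumptions made for illustration. The sketch computes the expected value of each decision for a single observation from its posterior probabilities and the DECDATA= coefficients, subtracts the cost for type REVENUE, and selects the maximum (PROFIT, REVENUE) or minimum (LOSS).

```python
from typing import Dict, Optional, Tuple


def best_decision(
    posteriors: Dict[str, float],               # p_ij for one observation, keyed by target value j
    coefficients: Dict[str, Dict[str, float]],  # A_jd, keyed by target value j, then decision d
    data_type: str = "PROFIT",                  # a data set with no type is treated as profit
    costs: Optional[Dict[str, float]] = None,   # C_id keyed by decision d (REVENUE only)
) -> Tuple[str, float]:
    """Return (d_i, E_i(d_i)) for one observation (illustrative helper)."""
    data_type = data_type.upper()
    if data_type not in ("PROFIT", "REVENUE", "LOSS"):
        raise ValueError(f"unknown decision data type: {data_type}")
    if costs is not None and data_type != "REVENUE":
        raise ValueError("costs only apply to a decision data set of type REVENUE")

    # Decision names are taken from the first row of coefficients.
    decisions = list(next(iter(coefficients.values())))

    expected = {}
    for d in decisions:
        # E_i(d) = sum over target values j of A_jd * p_ij
        value = sum(coefficients[j][d] * posteriors[j] for j in posteriors)
        if data_type == "REVENUE" and costs is not None:
            value -= costs[d]  # subtract C_id
        expected[d] = value

    # LOSS picks the smallest expected value; PROFIT and REVENUE pick the largest.
    pick = min if data_type == "LOSS" else max
    chosen = pick(expected, key=expected.get)
    return chosen, expected[chosen]


# Hypothetical two-class example: profit of approving or rejecting an applicant.
profit_matrix = {
    "good": {"approve": 10.0, "reject": 0.0},
    "bad": {"approve": -50.0, "reject": 0.0},
}
decision, value = best_decision({"good": 0.9, "bad": 0.1}, profit_matrix)
```

With these made-up coefficients, a posterior probability of 0.9 for "good" favors "approve", while a posterior of 0.8 favors "reject", because the expected profit of approving becomes negative.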

The decision functions do not affect the creation of the tree unless the DECSEARCH option is specified in the PROC ARBORETUM statement. However, the decision functions determine a profit or loss measure for assessing trees, and consequently may greatly affect which nodes are pruned and omitted from the final subtree. See the “Tree Assessment and the Subtree Sequence” section on page 49 for more information about retrospective pruning.

FREQ, INPUT, and TARGET statements must appear before the DECISION statement. The DECISION statement is optional. When the DECISION statement is omitted, neither decision alternatives nor prior probabilities are defined. Specifying the DECISION statement and the INMODEL= option in the PROC statement is an error.

COST=costs

specifies a list of cost constants and cost variables associated with the decision alternatives specified in the DECVARS= option. The first cost in the list corresponds to the first alternative in the DECVARS= list, the second cost to the second alternative, and so on. The number of costs must equal the number of alternatives specified in the DECVARS= list.

The costs specify the terms $C_{id}$ in the REVENUE formula for $E_i(d)$, and consequently the COST= option requires a DECDATA= data set of type REVENUE.

A cost constant is a number that assigns the same value to $C_{id}$ for all observations i. A cost variable is the name of a numeric variable in the training data set specified in the DATA= option in the PROC ARBORETUM statement. The value of this variable for observation i is assigned to $C_{id}$. The ARBORETUM procedure does not recognize abbreviations of lists of variables in the COST= option. For example, D1-D3, ABC-XYZ, and PQR: are invalid representations of lists of variables.
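The positional pairing of costs with decision alternatives can be sketched in Python as follows. This is an illustration of the rule described above, not the procedure's code; the function name `resolve_costs`, the variable name `handling_cost`, and the decision names are hypothetical. A number in the list is treated as a cost constant, and a string is treated as the name of a cost variable whose value is read from the current training observation.

```python
from typing import Dict, List, Union


def resolve_costs(
    decvars: List[str],                   # decision alternatives, in DECVARS= order
    cost_spec: List[Union[float, str]],   # constants or variable names, in COST= order
    row: Dict[str, float],                # one training observation
) -> Dict[str, float]:
    """Return the cost of each decision alternative for one observation."""
    if len(cost_spec) != len(decvars):
        raise ValueError("the number of costs must equal the number of alternatives")

    resolved = {}
    for alternative, cost in zip(decvars, cost_spec):
        if isinstance(cost, str):
            # Cost variable: the value comes from the observation itself.
            resolved[alternative] = float(row[cost])
        else:
            # Cost constant: the same value for every observation.
            resolved[alternative] = float(cost)
    return resolved


# Hypothetical usage: a constant cost for "approve", a per-observation cost for "reject".
costs = resolve_costs(["approve", "reject"], [2.5, "handling_cost"], {"handling_cost": 0.75})
```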


DECDATA=SAS-data-set

specifies the input data set containing the decision coefficients $A_{jd}$ and prior probabilities. The DECDATA= data set must contain the target variable. One observation must appear for each target value in the training data set specified in the DATA= option of the PROC ARBORETUM statement.
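The coverage requirement can be checked before running the procedure. The following Python sketch assumes the target values from both data sets have already been read into plain lists; the function name `missing_target_values` and the example values are hypothetical.

```python
from typing import Iterable, List


def missing_target_values(
    training_targets: Iterable[str],   # target column of the training data
    decdata_targets: Iterable[str],    # target column of the decision data
) -> List[str]:
    """Return the training target values that have no row in the decision data."""
    return sorted(set(training_targets) - set(decdata_targets))


# Hypothetical usage: "unknown" occurs in training but has no decision row.
gaps = missing_target_values(["good", "bad", "good", "unknown"], ["good", "bad"])
```

An empty result means that every target value in the training data has a corresponding observation in the decision data.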

DECVARS=decision-alternatives

specifies the variables in the DECDATA= data set defining the coefficients, $A_{jd}$. The labels of the variables define the names of the decision alternatives. For a variable without a label, the name of the decision alternative is the name of the variable.

If the DECVARS= option is omitted, no decision functions are defined.

PRIORVAR=pvar

specifies the variable pvar in the DECDATA= data set that contains the prior probabilities of categorical target values. The “Terminology” section on page 6 defines prior probabilities. Pvar must have nonnegative numeric values. The ARBORETUM procedure rescales the values to sum to one, and ignores training observations with a target value for which pvar equals zero.
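The rescaling and the exclusion of zero-prior target values can be sketched in Python. This is only an illustration of the behavior described above; the function names `rescale_priors` and `usable_targets` and the example values are assumptions.

```python
from typing import Dict, List


def rescale_priors(pvar: Dict[str, float]) -> Dict[str, float]:
    """Rescale nonnegative prior values so that they sum to one."""
    if any(value < 0 for value in pvar.values()):
        raise ValueError("prior values must be nonnegative")
    total = sum(pvar.values())
    if total == 0:
        raise ValueError("at least one prior value must be positive")
    return {target: value / total for target, value in pvar.items()}


def usable_targets(targets: List[str], priors: Dict[str, float]) -> List[str]:
    """Keep only the observations whose target value has a nonzero prior."""
    return [t for t in targets if priors[t] != 0]


# Hypothetical usage: raw values 2, 6, and 0 for three target values.
priors = rescale_priors({"a": 2, "b": 6, "c": 0})
kept = usable_targets(["a", "c", "b", "c", "a"], priors)
```

In this example the rescaled priors are 0.25, 0.75, and 0, and the two observations with target value "c" are dropped.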

Prior probabilities do not affect the creation of the tree unless the PRIORSSPLIT option to the PROC ARBORETUM statement is specified. Prior probabilities affect the posterior probabilities, and consequently affect the model predictions and assessment.

DESCRIBE Statement

DESCRIBE < options > ;

The DESCRIBE statement causes the ARBORETUM procedure to output a simple description of the rules that define each leaf, along with a few statistics. The description is much easier to understand than the equivalent information output using the CODE statement.

The options to the DESCRIBE statement have the same form and function as those in the CODE statement.

CATALOG= catname | FILE= filename

specifies where to output the description. See the “CODE Statement” section on page 21 for more information.

FORMAT= format

specifies the format to use in the description for numeric values that do not have a format from the input data set. The default format is BEST20.

LINESIZE | LS= n

specifies the line size for the description. The default is 72. The permissible range is 64 to 254.
