22
The ARBORETUM Procedure
FORMAT= format
specifies the format to use in the DATA step code for numeric values that do not have
a format from the input data set. The default format is BEST20.
LINESIZE | LS= n
specifies the line size for generated code. The default is 72. The permissible range is
64 to 254.
NOLEAFID
suppresses the creation of the variables _NODE_ and _LEAF_ containing the node
and leaf identification numbers of the leaf to which the observation is assigned. The
variables are created by default.
NOPREDICTION
suppresses the code for computing predicted variables, such as P_:. The default is
PREDICTION, requesting such code.
PMML
requests XML output instead of SAS DATA step code.
RESIDUAL
requests that the DATA step code create variables, such as residuals, that require the
target variable. These variables are the ones with a “yes” in the “Target” column of
table 9 in the section “Variable Names and Conditions for Their Creation” on page 59.
Using the DATA step code generated by the RESIDUAL option with a data set
that does not contain the target variable produces confusing notes and warnings. The
default is NORESIDUAL, which suppresses the generation of the DATA step code for
these variables.
DECISION Statement
DECISION DECDATA=SAS-data-set < options > ;
The DECISION statement specifies decision functions and prior probabilities for
categorical targets. The ARBORETUM procedure uses the term decision in the sense
of decision theory: a decision is one of a set of alternatives, each associated with a
function of posterior probabilities. For an observation $i$, a tree determines the
decision $d_i$ whose associated function evaluates to the best value, $E_i(d)$. The
interpretation of best, as well as the form of the function, depends on whether the type
of the DECDATA= data set is profit, revenue, or loss. The SAS DATA step TYPE= option
specifies the data set type. If the DECDATA= data set has no type, the ARBORETUM
procedure assumes a type of profit.
The following formulas define $E_i(d)$ and $d_i$. The sum is over the $J$ categorical
target values, and $p_{ij}$ denotes the posterior probability of target value $j$ for
observation $i$. The coefficient $A_{jd}$, for target value $j$ and decision $d$, is
specified in the DECDATA= data set.
PROFIT
$$E_i(d) = \sum_{j=1}^{J} A_{jd}\, p_{ij}, \qquad d_i = \arg\max_{d} E_i(d)$$
REVENUE
$$E_i(d) = \sum_{j=1}^{J} A_{jd}\, p_{ij} - C_{id}, \qquad d_i = \arg\max_{d} E_i(d)$$
where $C_{id}$ is the cost of decision $d$ for observation $i$, specified in
the COST= option.
LOSS
$$E_i(d) = \sum_{j=1}^{J} A_{jd}\, p_{ij}, \qquad d_i = \arg\min_{d} E_i(d)$$
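The three formulas differ only in the cost term and in whether the best value is a maximum or a minimum. The following Python sketch illustrates that arithmetic for a single observation; it is not the procedure's implementation. The function name best_decision, the dictionary layout, and the tie-breaking rule (the first alternative listed wins) are assumptions made for the illustration, as is treating an unspecified cost as zero.

```python
from typing import Dict, List, Optional, Tuple


def best_decision(
    posteriors: List[float],                   # p_ij for j = 1..J
    coeffs: Dict[str, List[float]],            # decision d -> [A_1d, ..., A_Jd]
    dec_type: str = "profit",                  # "profit", "revenue", or "loss"
    costs: Optional[Dict[str, float]] = None,  # C_id per decision (revenue only)
) -> Tuple[str, float]:
    """Return (d_i, E_i(d_i)) for one observation i."""
    dec_type = dec_type.lower()
    if dec_type not in ("profit", "revenue", "loss"):
        raise ValueError("dec_type must be profit, revenue, or loss")
    if costs is not None and dec_type != "revenue":
        raise ValueError("costs apply only to a revenue-type decision matrix")

    values: Dict[str, float] = {}
    for d, a in coeffs.items():
        if len(a) != len(posteriors):
            raise ValueError(f"decision {d!r} needs one coefficient per target value")
        # E_i(d) = sum over j of A_jd * p_ij
        e = sum(a_j * p_j for a_j, p_j in zip(a, posteriors))
        if dec_type == "revenue":
            # Assumption: a missing cost list means zero cost for every decision.
            e -= costs[d] if costs is not None else 0.0
        values[d] = e

    # Loss is minimized; profit and revenue are maximized.
    pick = min if dec_type == "loss" else max
    best = pick(values, key=values.get)
    return best, values[best]
```

With posteriors of 0.9 and 0.1 and profit coefficients of (10, -50) for one alternative and (0, 0) for the other, the first alternative has an expected profit of 4 and is selected. Lowering the first posterior to 0.8 makes its expected profit negative, so the second alternative is selected instead.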
The decision functions do not affect the creation of the tree unless the DECSEARCH
option is specified in the PROC ARBORETUM statement. However, the decision
functions determine a profit or loss measure for assessing trees, and consequently
may greatly affect which nodes are pruned and omitted from the final subtree. See the
“Tree Assessment and the Subtree Sequence” section on page 49 for more information
about retrospective pruning.
The FREQ, INPUT, and TARGET statements must appear before the DECISION statement.
The DECISION statement is optional. When the DECISION statement is
omitted, neither decision alternatives nor prior probabilities are defined. Specifying
both the DECISION statement and the INMODEL= option in the PROC statement is an
error.
COST=costs
specifies a list of cost constants and cost variables associated with the decision
alternatives specified in the DECVARS= option. The first cost in the list corresponds to
the first alternative in the DECVARS= list, the second cost to the second alternative,
and so on. The number of costs must equal the number of alternatives specified
in the DECVARS= list.
The costs specify the terms $C_{id}$ in the REVENUE formula for $E_i(d)$, and
consequently the COST= option requires a DECDATA= data set of type REVENUE.
A cost constant is a number that assigns the same value to $C_{id}$ for all observations $i$.
A cost variable is the name of a numeric variable in the training data set specified in the
DATA= option in the PROC ARBORETUM statement. The value of this variable for
observation $i$ is assigned to $C_{id}$. The ARBORETUM procedure does not recognize
abbreviations of lists of variables in the COST= option. For example, D1-D3, ABC-XYZ,
and PQR: are invalid representations of lists of variables.
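The pairing of costs with decision alternatives can be illustrated with a short Python sketch. It shows one way to turn a mixed list of cost constants and cost variables into the per-observation values $C_{id}$; it is not the procedure's implementation. The function name resolve_costs and the representation of an observation as a dictionary are assumptions made for the illustration.

```python
from typing import Dict, List, Union


def resolve_costs(
    cost_spec: List[Union[float, str]],  # constants (numbers) or variable names
    decvars: List[str],                  # decision alternatives, in DECVARS= order
    observation: Dict[str, float],       # one training observation
) -> Dict[str, float]:
    """Return C_id for each decision alternative d for one observation i."""
    if len(cost_spec) != len(decvars):
        raise ValueError("the number of costs must equal the number of alternatives")

    costs: Dict[str, float] = {}
    # Costs are matched to alternatives by position in the two lists.
    for alternative, item in zip(decvars, cost_spec):
        if isinstance(item, (int, float)):
            costs[alternative] = float(item)               # same value for every i
        else:
            costs[alternative] = float(observation[item])  # value varies with i
    return costs
```

For example, a specification of a constant 2.5 followed by a hypothetical variable named mail_cost assigns 2.5 to the first alternative for every observation, and the observation's own mail_cost value to the second.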
DECDATA=SAS-data-set
specifies the input data set containing the decision coefficients $A_{jd}$ and prior
probabilities. The DECDATA= data set must contain the target variable. One observation
must appear for each target value in the training data set specified in the DATA=
option of the PROC ARBORETUM statement.
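The requirement above can be checked before the procedure is run. The following Python sketch compares the target values in a decision data set with those in the training data; it is a pre-flight check written for illustration, not part of the procedure. The function name check_decdata is hypothetical, and reporting duplicated target values rests on the assumption that "one observation" means exactly one.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def check_decdata(
    decdata_rows: List[Dict[str, Hashable]],    # rows of the decision data set
    target: str,                                # name of the target variable
    training_target_values: Iterable[Hashable], # target values seen in training
) -> Tuple[list, list]:
    """Return (missing, duplicated) target values in the decision data set."""
    for row in decdata_rows:
        if target not in row:
            raise KeyError(f"decision data set lacks the target variable {target!r}")

    counts = Counter(row[target] for row in decdata_rows)
    # Training target values that have no row in the decision data set.
    missing = sorted(set(training_target_values) - set(counts))
    # Assumption: more than one row for the same target value is also a problem.
    duplicated = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicated
```

Two empty lists indicate that every training target value has exactly one row in the decision data set.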
DECVARS=decision-alternatives
specifies the variables in the DECDATA= data set defining the coefficients, $A_{jd}$. The
labels of the variables define the names of the decision alternatives. For a variable
without a label, the name of the decision alternative is the name of the variable.
If the DECVARS= option is omitted, no decision functions are defined.
PRIORVAR=pvar
specifies the variable pvar in the DECDATA= data set that contains the prior
probabilities of categorical target values. The “Terminology” section on page 6 defines
prior probabilities. Pvar must have nonnegative numeric values. The ARBORETUM
procedure rescales the values to sum to one, and ignores training observations with a
target value for which pvar equals zero.
Prior probabilities do not affect the creation of the tree unless the PRIORSSPLIT
option in the PROC ARBORETUM statement is specified. Prior probabilities affect
the posterior probabilities, and consequently affect the model predictions and
assessment.
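The rescaling of pvar and the exclusion of observations with a zero prior can be sketched in Python as follows. The sketch covers only those two steps, not how priors enter the posterior probabilities, and it is not the procedure's implementation. The function names rescale_priors and drop_zero_prior are hypothetical, as is the representation of observations as dictionaries.

```python
from typing import Dict, Hashable, List


def rescale_priors(pvar: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Rescale nonnegative prior values so that they sum to one."""
    if any(value < 0 for value in pvar.values()):
        raise ValueError("prior values must be nonnegative")
    total = sum(pvar.values())
    if total <= 0:
        raise ValueError("at least one prior value must be positive")
    return {target_value: value / total for target_value, value in pvar.items()}


def drop_zero_prior(
    rows: List[Dict[str, Hashable]],  # training observations
    target: str,                      # name of the target variable
    priors: Dict[Hashable, float],    # prior value for each target value
) -> List[Dict[str, Hashable]]:
    """Keep only observations whose target value has a positive prior."""
    return [row for row in rows if priors[row[target]] > 0]
```

For example, prior values of 3, 1, and 0 are rescaled to 0.75, 0.25, and 0, and training observations with the third target value are ignored.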
DESCRIBE Statement
DESCRIBE < options > ;
The DESCRIBE statement causes the ARBORETUM procedure to output a simple
description of the rules that define each leaf, along with a few statistics. The
description is much easier to understand than the equivalent information output using the
CODE statement.
The options to the DESCRIBE statement have the same form and function as those
in the CODE statement.
CATALOG= catname | FILE= filename
specifies where to output the description. See the “CODE Statement” section on page
21 for more information.
FORMAT= format
specifies the format to use in the description for numeric values that do not have a
format from the input data set. The default format is BEST20.
LINESIZE | LS= n
specifies the line size for the description. The default is 72. The permissible range is 64
to 254.