Table 8. (continued)

STAT         NUMERIC_VALUE              CHARACTER_VALUE
LABEL                                   Variable label
MISSING      Branch                     ‘MISSING VALUES ONLY’, or blank
WORTH        worth, or −log10(p)        blank
AGREEMENT    agreement
BRANCHES     number of branches
CUTPOINT     split value of interval
BRANCH       branch                     formatted category value
ORDER        branch                     branch in interval surrogate
SCORE Statement OUT= Output Data Set
The OUT= option in the SCORE statement creates a data set by appending new
variables to the data set specified in the DATA= option. Which new variables
appear depends on other options in the SCORE statement, the level of measurement
of the target variable, and whether a profit or loss function is specified in
the DECISION statement.
Variable Names and Conditions for Their Creation
The names of all the possible new variables are listed in Table 9.
Table 9. New Variables in the OUT= Data Set

Variable       Description                                     Target   Other

Variables for Prediction
F_name         actual, formatted category                      yes
I_name         predicted, formatted category                   no
P_namevalue    predicted value                                 no
R_namevalue    residual from the prediction                    yes
U_name         predicted, unformatted category                 no
V_namevalue    predicted value computed with validation data   no
_WARN_         indications of problems with the prediction     no

Variables for Decisions                                                 DECDATA= type
BL_name_       best possible loss from any decision            yes      LOSS
BP_name_       best possible profit from any decision          yes      PROFIT, REVENUE
CL_name_       loss computed from the target value             yes      LOSS
CP_name_       profit computed from the target value           yes      PROFIT, REVENUE
D_name_        label of the chosen decision alternative        no       any
EL_name_       expected loss from the chosen decision          no       LOSS
EP_name_       expected profit from the chosen decision        no       PROFIT, REVENUE
IC_name_       investment cost                                 no       REVENUE
ROI_name_      return on investment                            yes      REVENUE

Variables for Leaf Assignment                                           Option
_i_            proportion of the observation in leaf i         no       DUMMY
_LEAF_         leaf identification number                      no       LEAF
_NODE_         node identification number                      no       LEAF
The ARBORETUM Procedure
The names of most of these variables incorporate the name of the target
variable. For a categorical target variable, namevalue represents the name of
the target concatenated with a formatted target value. For example, a
categorical target variable named Response, with values ‘0’ and ‘1’, generates
the new variables P_Response0 and P_Response1. For an interval target, namevalue
is simply the name of the target. For example, an interval target variable named
Sales generates the variable P_Sales.
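The naming rule above can be illustrated outside SAS. The following Python sketch is not part of the procedure; the function name and the level labels 'categorical' and 'interval' are assumptions made for illustration only.

```python
def prediction_names(target, level, formatted_values=()):
    """Build the P_ variable names described above.

    target: name of the target variable.
    level: 'categorical' or 'interval' (labels invented for this sketch).
    formatted_values: formatted target values, used only for a categorical target.
    """
    if level == "interval":
        # namevalue is just the target name.
        return [f"P_{target}"]
    # namevalue is the target name followed by each formatted value.
    return [f"P_{target}{value}" for value in formatted_values]


print(prediction_names("Response", "categorical", ["0", "1"]))
print(prediction_names("Sales", "interval"))
```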
The NOPREDICTION option in the SCORE statement suppresses the creation of the
prediction and decision variables. Otherwise, the conditions necessary for
creating these variables are as follows. Variables P_namevalue and _WARN_ are
always created. Variables I_name and U_name appear when the target is
categorical. When ROLE=TRAIN, VALID, or TEST, the DATA= data set must contain
the target variable, and the OUT= data set will contain R_namevalue and, for a
categorical target, F_name. The V_namevalue variable is created if validation
data was used during the creation of the tree.
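As a reading aid, the conditions in this paragraph can be written as a small decision function. This is a Python sketch of the stated rules, not the procedure's implementation. The function and parameter names are hypothetical, and role=None stands for any role other than TRAIN, VALID, or TEST.

```python
def prediction_variables(target, categorical, values=(), role=None,
                         used_validation=False, noprediction=False):
    """List the prediction variables expected in the OUT= data set."""
    if noprediction:
        # NOPREDICTION suppresses the prediction (and decision) variables.
        return []
    # namevalue: target name plus formatted value, or just the target name.
    namevalues = [f"{target}{v}" for v in values] if categorical else [target]
    names = [f"P_{nv}" for nv in namevalues] + ["_WARN_"]  # always created
    if categorical:
        names += [f"I_{target}", f"U_{target}"]
    if role in {"TRAIN", "VALID", "TEST"}:
        # These roles require the target variable in the DATA= data set.
        names += [f"R_{nv}" for nv in namevalues]
        if categorical:
            names.append(f"F_{target}")
    if used_validation:
        names += [f"V_{nv}" for nv in namevalues]
    return names


print(prediction_variables("Sales", categorical=False, role="TEST"))
```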
When decision alternatives are specified in the DECVARS= option in the DECISION
statement, the variable D_name_ is created, along with EL_name_ if the type of
the DECDATA= data set is LOSS, or EP_name_ if the type is PROFIT or REVENUE. If
the type is REVENUE, then variables IC_name_ and ROI_name_ are also created.
When ROLE=TRAIN, VALID, or TEST, either the variables BL_name_ and CL_name_
(type LOSS) or the variables BP_name_ and CP_name_ (type PROFIT or REVENUE) are
also created.
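The decision-variable rules can be sketched the same way. The Python function below is illustrative only and assumes that decision alternatives were specified in DECVARS=; its name and parameters are hypothetical. It follows the prose for ROI_name_, although Table 9 marks that variable as requiring the target.

```python
def decision_variables(target, decdata_type, role=None):
    """List the decision variables expected in the OUT= data set."""
    if decdata_type not in {"LOSS", "PROFIT", "REVENUE"}:
        raise ValueError(f"unknown DECDATA= type: {decdata_type}")
    loss = decdata_type == "LOSS"
    names = [f"D_{target}_"]
    # Expected loss for LOSS, expected profit for PROFIT or REVENUE.
    names.append(f"EL_{target}_" if loss else f"EP_{target}_")
    if decdata_type == "REVENUE":
        names += [f"IC_{target}_", f"ROI_{target}_"]
    if role in {"TRAIN", "VALID", "TEST"}:
        # Best-possible and computed values need the actual target value.
        prefix = "L" if loss else "P"
        names += [f"B{prefix}_{target}_", f"C{prefix}_{target}_"]
    return names


print(decision_variables("Response", "REVENUE"))
print(decision_variables("Response", "LOSS", role="VALID"))
```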
Decision Variables
The labels of the variables specified in the DECVARS= option in the DECISION
statement are the names of the decision alternatives. For a variable without a
label, the name of the decision alternative is the name of the variable. The
variable D_name_ in the OUT= data set contains the name of the decision
alternative assigned to the observation.
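The label fallback described above is a one-line rule. In the Python sketch below, the (name, label) pairs and the variable names dec1 and dec2 are made-up examples, not values from the procedure.

```python
def decision_alternative_names(decvars):
    """decvars: list of (variable_name, label) pairs; label may be None or blank."""
    # Use the label when there is one, otherwise fall back to the variable name.
    return [label if label else name for name, label in decvars]


print(decision_alternative_names([("dec1", "Mail offer"), ("dec2", None)]))
```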
Leaf Assignment Variables
Each node is uniquely identified with a positive integer. Once an identification
number is assigned to a node, the number is never reassigned to another node,
even after the node is pruned. Consequently, most subtrees in the subtree
sequence will not have consecutive node identifiers.
Each leaf has a leaf identification number in addition to the node identifier. The leaf
identifiers range from 1 to the number of leaves. The leaf numbers are reassigned
whenever a new subtree is selected from the subtree sequence.
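The difference between the two kinds of identifier can be seen in a small Python sketch. The node numbers are invented, and numbering the leaves in order of node identifier is an assumption of this example; the text specifies only that leaf identifiers run from 1 to the number of leaves.

```python
def number_leaves(leaf_node_ids):
    """Map each leaf's permanent node id to a leaf id in 1..number of leaves."""
    # Ordering by node id is an assumption of this sketch.
    return {node_id: leaf_id
            for leaf_id, node_id in enumerate(sorted(leaf_node_ids), start=1)}


# Hypothetical tree: nodes 4, 5, 6, 7 are the leaves of the largest subtree.
print(number_leaves({4, 5, 6, 7}))
# Selecting a smaller subtree in which node 2 becomes a leaf: node ids are
# unchanged (and no longer consecutive), but the leaf ids are reassigned.
print(number_leaves({2, 6, 7}))
```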
For an observation in the OUT= data set assigned to a single leaf, the variables
_NODE_ and _LEAF_ contain the node and leaf identification numbers,
respectively. For an observation assigned to more than one leaf, the variables
_NODE_ and _LEAF_ contain missing values. An observation is assigned to more
than one leaf when the observation is missing a value required by one of the
splitting rules, and the MISSING=DISTRIBUTE option in the INPUT statement for
the required variable dictates that the observation be distributed among the
branches.
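The rule for _NODE_ and _LEAF_ can be sketched as follows. This is an illustration in Python, not the procedure's code: the function name and inputs are hypothetical, and None stands in for a SAS missing value.

```python
def leaf_assignment(proportions, leaf_to_node):
    """Return the _LEAF_ and _NODE_ values for one observation.

    proportions: {leaf_id: proportion of the observation in that leaf}.
    leaf_to_node: {leaf_id: node_id} for the selected subtree.
    """
    occupied = [leaf for leaf, p in proportions.items() if p > 0]
    if len(occupied) == 1:
        leaf = occupied[0]
        return {"_LEAF_": leaf, "_NODE_": leaf_to_node[leaf]}
    # Distributed over several leaves: both variables are missing.
    return {"_LEAF_": None, "_NODE_": None}


leaf_to_node = {1: 4, 2: 6, 3: 7}                       # made-up identifiers
print(leaf_assignment({1: 1.0}, leaf_to_node))          # single leaf
print(leaf_assignment({2: 0.4, 3: 0.6}, leaf_to_node))  # distributed
```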