the selection. The output statements, CODE, DESCRIBE, SAVE, and SCORE, use
nodes not in the selected subtree.
The ASSESS, SUBTREE, and output statements end initialization and interactive
training. If the SUBTREE statement or an output statement is the first statement after
initialization or interactive training, an ASSESS statement is implied. The procedure
computes a new subtree sequence using the most recently specified ASSESS statement
options, and selects the best subtree before executing the SUBTREE or output
statement.
If the ASSESS, SUBTREE, or output statement immediately follows initialization
so that no interactive training statements appear, and the tree contains no more than
the root node, then these statements will create a tree. Otherwise, if the root node is
already split using information imported using the INMODEL= option in the PROC
statement, then further split searches will not occur unless explicitly requested with
interactive training statements.
The following table summarizes the options available in the ASSESS statement. An option re-

ASSESS Statement Options

Option         Description
EVENT=         specifies categorical target value for LIFT
MEASURE=       specifies the assessment measure
NOPRIORS       ignores prior probabilities in subtree search
PROPORTION=    specifies proportion of observations for LIFT
PRUNEDATA=     specifies training or validation data for choosing subtrees
PRIORS         incorporates prior probabilities in subtree search
VALIDATA=      specifies validation data set
NOVALIDATA     terminates a previous VALIDATA= option

The following list describes these options. See the
“Tree Assessment and the Subtree
section beginning on page 49 for more detail.
The EVENT= option specifies a formatted value of a categorical target to use with the
LIFT assessment measure. If the EVENT= option is absent in one ASSESS statement, the
last value specified in any ASSESS statement is used. If the EVENT= option has never
been specified, the least frequent target value in the training data is used. The EVENT=
option is ignored with an interval target and with other assessment measures.
The MEASURE= option specifies the assessment measure. The following table
summarizes the available measures.
The ARBORETUM Procedure
Measure        Description
ASE            Average square error
LIFT           Average or proportion among the highest ranked observations
LIFTPROFIT     Average profit or loss among the highest ranked observations
PROFIT         Average profit or loss from the decision function

The default measure is PROFIT if the DECISION statement specifies a profit or loss
function or if the target variable is ordinal. Otherwise the default measure for a
nominal target is MISC, and the default for an interval target is ASE. MISC is applicable
to nominal and ordinal targets. ASE is applicable to any kind of target.
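
The default rule above can be restated as a small function. This is only an illustrative Python sketch, not part of the procedure: the function name, the lowercase level labels, and the `has_profit_or_loss` flag are assumptions introduced here.

```python
def default_measure(target_level, has_profit_or_loss=False):
    """Return the default assessment measure described in the text.

    target_level: 'nominal', 'ordinal', or 'interval' (labels assumed here).
    has_profit_or_loss: True when a DECISION statement supplies a profit
    or loss function (hypothetical flag for this sketch).
    """
    if target_level not in ("nominal", "ordinal", "interval"):
        raise ValueError("unknown target level: %r" % (target_level,))
    # PROFIT wins when a profit or loss function exists or the target is ordinal.
    if has_profit_or_loss or target_level == "ordinal":
        return "PROFIT"
    # Otherwise nominal targets default to MISC and interval targets to ASE.
    if target_level == "nominal":
        return "MISC"
    return "ASE"
```

For example, `default_measure("interval")` gives ASE, while `default_measure("interval", has_profit_or_loss=True)` gives PROFIT.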
For an interval target, the LIFT measure is the average target value among
observations predicted to have the highest average. The PROPORTION= option specifies
the proportion of observations to use. For a categorical target, the LIFT measure is
the proportion of observations with the target value specified in the EVENT= option
among observations with the highest posterior probability of the EVENT= target
value.
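
The two forms of the LIFT measure can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's implementation: the function names are hypothetical, the number of top observations is taken as the rounded product of the proportion and the sample size (at least one), and ties are broken by input order. How the procedure itself rounds and breaks ties is not stated in this section.

```python
def top_count(n_obs, proportion):
    """Number of highest ranked observations to keep (rounding is an assumption)."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    return max(1, round(proportion * n_obs))


def top_indices(scores, proportion):
    """Indices of the highest scores; ties keep their input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_count(len(scores), proportion)]


def lift_interval(predicted, actual, proportion):
    """Average actual target among observations with the highest predictions."""
    top = top_indices(predicted, proportion)
    return sum(actual[i] for i in top) / len(top)


def lift_categorical(posterior_event, actual, event, proportion):
    """Share of the event value among observations with the highest
    posterior probability of that event."""
    top = top_indices(posterior_event, proportion)
    return sum(1 for i in top if actual[i] == event) / len(top)
```

With posteriors `[0.9, 0.8, 0.7, 0.2, 0.1]`, targets `["yes", "no", "yes", "no", "no"]`, event `"yes"`, and a proportion of 0.4, the two top-ranked observations contain one event, so the sketch returns 0.5.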
The PRIORS and NOPRIORS options specify whether to ignore prior probabilities when
creating the sequence of subtrees. The default is NOPRIORS, ignoring prior
probabilities. The section on page 50 describes how prior probabilities enter into the
formulae for evaluating subtrees.
The PROPORTION= option specifies the proportion of observations to use with the
LIFT and LIFTPROFIT assessment measures. The PROPORTION= option is ignored
unless LIFT or LIFTPROFIT is specified. The value must be between 0 and
1. If absent, the most recent value specified with LIFT or LIFTPROFIT is used.
Requesting LIFT or LIFTPROFIT without ever specifying the PROPORTION= option
is an error.
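
The way a PROPORTION= value carries over between ASSESS statements can be modeled with a small state object. The Python sketch below is illustrative only: the class and method names are hypothetical, and treating the endpoints 0 and 1 as valid is an assumption, since the text says only that the value must be between 0 and 1.

```python
class AssessState:
    """Remembers the most recent PROPORTION= value given with LIFT or LIFTPROFIT."""

    LIFT_MEASURES = ("LIFT", "LIFTPROFIT")

    def __init__(self):
        self.proportion = None  # no value has been specified yet

    def assess(self, measure, proportion=None):
        """Return the proportion in effect, or None when it is ignored."""
        if measure not in self.LIFT_MEASURES:
            # PROPORTION= is ignored with other measures and is not remembered.
            return None
        if proportion is not None:
            if not 0 <= proportion <= 1:  # endpoint handling is an assumption
                raise ValueError("PROPORTION= must be between 0 and 1")
            self.proportion = proportion
        if self.proportion is None:
            # LIFT or LIFTPROFIT without any PROPORTION= so far is an error.
            raise ValueError("PROPORTION= has never been specified")
        return self.proportion
```

A value given with LIFT is reused by a later LIFTPROFIT request that omits it, while a value given with ASE leaves the remembered value unchanged.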
PRUNEDATA= VALID | TRAIN
specifies whether to use training or validation data when evaluating subtrees for
inclusion in the subtree sequence. The default is VALID. If PRUNEDATA=VALID and
validation data exists, then the subtree chosen for a given number of leaves is one
with the best assessment value using the validation data.
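
The selection rule for a given number of leaves can be sketched as below. This Python fragment is an illustration only: the record layout (`name`, `leaves`, `train`, `valid`), the function name, and the `larger_is_better` flag are hypothetical, and falling back to training data when no validation data exists is an assumption not stated in this section.

```python
def best_subtree_per_size(candidates, prunedata="VALID",
                          has_validation=True, larger_is_better=False):
    """Pick, for each leaf count, the candidate with the best assessment value.

    candidates: list of dicts with keys 'name', 'leaves', 'train', 'valid'.
    larger_is_better: False for error-like measures such as ASE or MISC.
    """
    # Use validation values only when requested and available (assumed fallback).
    use_valid = prunedata == "VALID" and has_validation
    best = {}
    for cand in candidates:
        value = cand["valid"] if use_valid else cand["train"]
        score = value if larger_is_better else -value
        current = best.get(cand["leaves"])
        if current is None or score > current[0]:
            best[cand["leaves"]] = (score, cand["name"])
    return {leaves: name for leaves, (score, name) in best.items()}
```

With two 2-leaf candidates where one has the lower training error and the other the lower validation error, PRUNEDATA=VALID and PRUNEDATA=TRAIN select different subtrees in this sketch.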
NOVALIDATA | VALIDATA= SAS-data-set
specifies the validation data set. The NOVALIDATA option nullifies any
VALIDATA= option appearing in a previous ASSESS statement.
The BRANCH statement is an interactive training statement that splits leaves into
branches using the primary candidate splitting rule defined in the leaves.
The SETRULE and SEARCH statements create candidate rules. The PRUNE statement
leaf. The BRANCH statement will not split a leaf without a candidate rule.
The NODES= option restricts the creation of branches to leaves that descend from
nodes in the specified list of nodes.
The ONE option restricts branching to the one leaf with the best candidate splitting
rule. If a list of nodes is specified in the NODES= option, the ONE option only
considers the leaves that descend from nodes in the list.
The CODE statement generates SAS DATA step code that mimics the computations
done by the SCORE statement. The DATA step code creates the same variables
described in the section “SCORE Statement OUT= Output Data Set” on page 59. Using
the CODE statement for a tree containing a rule with MISSING=DISTRIBUTE is an
error.
CATALOG=catname | FILE=filename
specifies where to output the code. Catname specifies a catalog entry by providing a
compound name with one to four of the levels in the form library.catalog.entry.type.
The default library is determined by the SAS system option USER=, usually WORK.
The default entry is SASCODE, and the default type is SOURCE. Filename specifies
the name of the file to contain the code. Filename can be either:
1. A quoted string, the value of which is the name (including the extension, if
any) of the file to be opened.
2. An unquoted SAS name of no more than eight characters. If this name has
been assigned as a fileref in a FILENAME statement, the file specified in the
FILENAME statement is opened. The special filerefs LOG and PRINT are
always assigned. If the specified name is not an assigned fileref, the specified
value is concatenated with the extension .txt before opening. For example, if
FOO is not an assigned fileref, FILE=FOO would cause FOO.txt to be opened.
If the name has more than eight characters, an error message is printed.
If no catalog or file is specified, then the code is output to the SAS log.
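
The rules for interpreting the file name can be summarized in a short Python sketch. It is illustrative only: the function name, its arguments, and the returned tuples are hypothetical, and comparing fileref names without regard to case is an assumption of the sketch.

```python
def resolve_code_file(name, quoted, assigned_filerefs=()):
    """Classify a FILE= value as a literal file name or a fileref.

    name: the text given after FILE=, without any surrounding quotes.
    quoted: True if the value was written as a quoted string.
    assigned_filerefs: names assigned in FILENAME statements.
    Returns ('file', path) or ('fileref', NAME).
    """
    if quoted:
        # Rule 1: a quoted string is the file name itself.
        return ("file", name)
    if len(name) > 8:
        # Rule 2: an unquoted name longer than eight characters is an error.
        raise ValueError("unquoted name has more than eight characters")
    known = {"LOG", "PRINT"}  # special filerefs that are always assigned
    known.update(ref.upper() for ref in assigned_filerefs)
    if name.upper() in known:
        return ("fileref", name.upper())
    # Not an assigned fileref: append .txt to the name as given.
    return ("file", name + ".txt")
```

As in the text, `resolve_code_file("FOO", quoted=False)` yields the file FOO.txt when FOO has not been assigned as a fileref.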
The DUMMY option requests creation of a dummy variable for each leaf node. The
variables have names _i_, for i = 1, 2, ..., L, where L is the number of leaves. The
value of the dummy variable _i_ is 1 for observations assigned exclusively to leaf i,
and 0 for observations not in leaf i. For observations distributed over more than one
leaf, _i_ equals the proportion of the observation assigned to leaf i. The default is
NODUMMY, suppressing the creation of dummy variables.
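
The values of the leaf dummy variables for a single observation can be illustrated with the Python sketch below. It is not the generated DATA step code: the function name and the input mapping from leaf index to assigned proportion are hypothetical, and the variable names are written here with underscores, as _1_, _2_, and so on.

```python
def leaf_dummies(leaf_proportions, n_leaves):
    """Return the dummy variable values for one observation.

    leaf_proportions: dict mapping a 1-based leaf index to the proportion
    of the observation assigned to that leaf (1.0 for exclusive assignment).
    n_leaves: L, the number of leaves in the tree.
    """
    # Leaves that do not receive the observation get the value 0.
    return {"_%d_" % i: leaf_proportions.get(i, 0.0)
            for i in range(1, n_leaves + 1)}
```

An observation assigned exclusively to leaf 2 of a three-leaf tree gets `_2_` equal to 1 and the other two variables equal to 0.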