The ARBORETUM Procedure
The default value is 10,000. See the “Within Node Training Sample” section on page 46 for more detail.
The PRUNE statement is an interactive training statement that deletes all nodes descended from any node in the list of node identiﬁers, nodeids. The splitting rules in the
nodes remain available as candidate rules unless the KEEPRULES= or DROPVARS=
option deletes them. A subsequent SEARCH, SPLIT, or TRAIN statement will use
the remaining rules instead of performing a new search. The NODES=LEAVES op-
tion deletes no nodes, but deletes rules from the current leaves as the KEEPRULES=
and DROPVARS= options specify.
DROPVARS=names
deletes rules based on any of the named input variables.
KEEPRULES=n
requests to keep only the top n ranked rules in nodes nodeids. By default, all the rules are kept. The PRUNE statement deletes any rules using input variables in the list names before considering the KEEPRULES= option.
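As a sketch of the behavior described above (the exact statement punctuation is an assumption; NODES=LEAVES and DROPVARS= are the options documented here, and the variable name is hypothetical):

```sas
/* Delete no nodes, but drop every candidate rule in the
   current leaves that uses the input variable INCOME.   */
prune nodes=leaves dropvars=income;
```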
The REDO statement is an interactive training statement that reverses a previous
UNDO statement. These statements can work in series in that a series of REDO
statements will reverse a series of UNDO statements. If a previous UNDO statement is followed by any statement other than an UNDO or REDO, then REDO does nothing.
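A minimal sketch of how UNDO and REDO pair up during interactive training (the node identiﬁer and the exact PRUNE form are hypothetical):

```sas
prune nodes=7;   /* delete the subtree below node 7        */
undo;            /* restore the subtree deleted by PRUNE   */
redo;            /* reverse the UNDO, deleting it again    */
```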
The SAVE statement outputs tree information into SAS data sets. Unless otherwise
stated, the information describes a subtree selected in the ASSESS or SUBTREE
statement, which may omit nodes from the largest tree in the sequence.
IMPORTANCE=SAS-data-set
names the output data set to contain the variable importance. See the section “IMPORTANCE= Output Data Set” on page 54 for more information.
names the output data set to encode the information necessary for use with the
INMODEL= option in a subsequent invocation of the ARBORETUM procedure. The
output data set may also be input to the Enterprise Miner Tree Desktop Application
for a visual display. The section “Enterprise Miner Tree Desktop Application” on page 8 describes the Desktop Application.
speciﬁes what nodes to output in the NODESTATS=, PATH=, and RULES= data sets. By default, the NODESTATS= and RULES= data sets contain information for all nodes, and the PATH= data set contains information for all leaves in the current subtree.
NODESTATS=SAS-data-set
names the output data set to contain node information. See the section “NODESTATS= Output Data Set” on page 56 for more information.
PATH=SAS-data-set
names the output data set describing the path to nodes. See the section “PATH= Output Data Set” for more information.
RULES=SAS-data-set
names the output data set describing the splitting rules. See the section “RULES= Output Data Set” for more information.
SEQUENCE=SAS-data-set
names the output data set to contain statistics on each subtree in the sequence of subtrees. See the section “SEQUENCE= Output Data Set” on page 61 for more information.
names the output data set to contain summary statistics. For categorical targets, the summary statistics consist of the counts and proportions of observations correctly classiﬁed. For interval targets, the summary statistics include the average square error and R-squared ( = 1 − average square error / sum of square errors from the mean).
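A sketch collecting several of the SAVE options above in one statement (data set names are hypothetical; MODEL= is an assumed name for the option that writes the data set used with INMODEL=, since the text above does not show it):

```sas
save model=work.treemodel        /* reusable via INMODEL= (assumed option name) */
     importance=work.varimp      /* variable importance                         */
     nodestats=work.nodeinfo     /* per-node information                        */
     sequence=work.subtrees;     /* statistics for each subtree                 */
```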
The SCORE statement reads a data set containing the input variables used by the tree
and outputs a data set containing the original variables plus new variables to contain
predictions, residuals, decisions, and leaf assignments. The SCORE statement may
DATA=SAS-data-set
names the input data set. If the DATA= option is absent, the procedure uses the training data set.
DUMMY | NODUMMY
causes the OUT= data set to contain dummy variables, _i_, for i = 1, 2, ..., L, where L is the number of leaves. The value of the dummy variable _i_ is 1 for observations assigned exclusively to leaf i, and 0 for observations not in leaf i. For observations distributed over more than one leaf, _i_ equals the proportion of the observation assigned to leaf i. The default is NODUMMY, suppressing the creation of dummy variables.
lists the nodes containing the observations to score. If an observation is not assigned
to any node in the list, it does not contribute to the ﬁt statistics and is not output. The
default is to use all the observations.
suppresses the creation of variables _NODE_ and _LEAF_ containing the node and leaf identiﬁcation numbers of the leaf to which the observation is assigned. The
variables are created by default.
suppresses the generation of prediction variables, such as P_*. The default is
PREDICTION, requesting prediction variables.
OUT=SAS-data-set
names the output data set to contain the scored data. If the OUT= option is absent, the ARBORETUM procedure creates a data set name using the DATAn convention. Specify OUT=_NULL_ to avoid creating a scored data set. The section “OUT= Output Data Set” on page 59 describes the variables in the OUT= data set.
names the output data set to contain the ﬁt statistics.
speciﬁes the role of the input data set, and determines the ﬁt statistics to compute. For
ROLE=TRAIN, VALID, or TEST, observations without a target value are ignored.
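A sketch of a SCORE statement using the options documented above (data set names are hypothetical):

```sas
score data=work.newcases    /* observations to score                   */
      out=work.scored       /* scored output; OUT=_NULL_ suppresses it */
      role=test;            /* compute test fit statistics             */
```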
SEARCH < options > ;
The SEARCH statement is an interactive training statement that searches for splitting rules in leaves. It behaves like the TRAIN statement except that no branches are formed. The options for the SEARCH statement are the same as those for the TRAIN statement.
SETRULE NODE=id VAR=variable < missing > < / var-values > ;
The SETRULE statement is an interactive training statement that speciﬁes the pri-
mary candidate splitting rule for leaf node id. If the node already has a candidate rule
for the variable speciﬁed in the VAR= option, and the missing and var-values options
are omitted, then the candidate rule for the variable is set to the primary candidate
rule for the node. Otherwise, variable values must be assigned to branches using the var-values option or the MISSONLY missing option. The SEARCH statement is useful for ﬁnding good variable values for a splitting rule. The options in the SETRULE
statement are the same as in the SPLIT statement. Unlike the SPLIT statement, the
SETRULE statement does not search for a split, does not create branches, and re-
quires the VAR= option.
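A sketch of the simplest SETRULE case described above, in which missing and var-values are omitted and the node’s existing candidate rule for the variable becomes the primary rule (node identiﬁer and variable name are hypothetical):

```sas
setrule node=7 var=income;   /* promote the candidate rule on INCOME
                                to the primary rule for node 7       */
```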