The ARBORETUM Procedure
The default value is 10,000. See the “Within Node Training Sample” section on
page 46 for more detail.
PRUNE Statement
PRUNE NODES=nodeids | LEAVES < KEEPRULES=n > < / DROPVARS=names > ;
The PRUNE statement is an interactive training statement that deletes all nodes
descended from any node in the list of node identifiers, nodeids. The splitting rules
in the nodes remain available as candidate rules unless the KEEPRULES= or DROPVARS=
option deletes them. A subsequent SEARCH, SPLIT, or TRAIN statement will use
the remaining rules instead of performing a new search. The NODES=LEAVES option
deletes no nodes, but deletes rules from the current leaves as the KEEPRULES=
and DROPVARS= options specify.
DROPVARS= names
deletes rules based on any of the named input variables.
KEEPRULES= n
keeps only the top n ranked rules in the nodes nodeids. By default, all the
rules are kept. The PRUNE statement deletes any rules using input variables in the
DROPVARS= list names before considering the KEEPRULES= option.
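The order in which the two options act can be sketched in Python. This is only an illustration of the ordering described above, not the procedure's implementation: the Rule class, the prune_rules function, and the assumption that the rules of a node are already stored in rank order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical stand-in for one candidate splitting rule."""
    variable: str  # name of the input variable the rule splits on


def prune_rules(ranked_rules, dropvars=(), keeprules=None):
    """Return the candidate rules left in one node.

    ranked_rules is assumed to be in rank order already (best rule first).
    """
    # SAS variable names are case-insensitive, so compare in upper case.
    dropped = {name.upper() for name in dropvars}
    # Step 1: delete every rule that uses a DROPVARS= variable.
    remaining = [r for r in ranked_rules if r.variable.upper() not in dropped]
    # Step 2: only then keep the top n of the surviving rules.
    if keeprules is not None:
        remaining = remaining[:keeprules]
    return remaining


rules = [Rule("AGE"), Rule("INCOME"), Rule("DEBT"), Rule("REGION")]
kept = prune_rules(rules, dropvars=["income"], keeprules=2)
print([r.variable for r in kept])
```

Because the DROPVARS= deletion happens first, a rule ranked below the cutoff can move up into the kept set once a higher-ranked rule on a dropped variable is gone.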
REDO Statement
REDO ;
The REDO statement is an interactive training statement that reverses a previous
UNDO statement. These statements work in series: a series of REDO statements
reverses a series of UNDO statements. If a previous UNDO statement is followed
by any statement other than UNDO or REDO, then REDO does nothing.
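The interplay of UNDO and REDO resembles a conventional two-stack history. The Python sketch below is an assumption about the behavior, based only on the description above; the History class and its method names are hypothetical, and the tree state is reduced to a plain label.

```python
class History:
    """Hypothetical two-stack model of UNDO and REDO."""

    def __init__(self, initial_state):
        self.state = initial_state
        self._undo_stack = []
        self._redo_stack = []

    def apply(self, new_state):
        """Model any statement other than UNDO or REDO."""
        self._undo_stack.append(self.state)
        self.state = new_state
        # After such a statement, REDO has nothing left to reverse.
        self._redo_stack.clear()

    def undo(self):
        if self._undo_stack:
            self._redo_stack.append(self.state)
            self.state = self._undo_stack.pop()

    def redo(self):
        # Does nothing unless an UNDO is still pending reversal.
        if self._redo_stack:
            self._undo_stack.append(self.state)
            self.state = self._redo_stack.pop()


h = History("root only")
h.apply("split node 1")
h.apply("split node 2")
h.undo()
h.undo()
h.redo()
h.redo()
print(h.state)
```

In this model, two UNDO calls followed by two REDO calls return to the latest state, while any intervening call to apply makes a later REDO a no-op.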
SAVE Statement
SAVE < options > ;
The SAVE statement outputs tree information into SAS data sets. Unless otherwise
stated, the information describes a subtree selected in the ASSESS or SUBTREE
statement, which may omit nodes from the largest tree in the sequence.
IMPORTANCE= SAS-data-set
names the output data set to contain the variable importance. See the section
“IMPORTANCE= Output Data Set” on page 54 for more information.
MODEL= SAS-data-set
names the output data set to encode the information necessary for use with the
INMODEL= option in a subsequent invocation of the ARBORETUM procedure. The
output data set may also be input to the Enterprise Miner Tree Desktop Application
for a visual display. The section “Enterprise Miner Tree Desktop Application” on
page 8 describes the Desktop Application.
NODES=nodes
specifies which nodes to output in the NODESTAT=, PATH=, and RULES= data
sets. By default, the NODESTAT= and RULES= data sets contain information for
all nodes, and the PATH= data set contains information for all leaves in the current
subtree.
NODESTAT= SAS-data-set
names the output data set to contain node information. See the section
“NODESTATS= Output Data Set” on page 56 for more information.
PATH= SAS-data-set
names the output data set describing the path to nodes. See the section
“PATH= Output Data Set” on page 57 for more information.
RULES= SAS-data-set
names the output data set describing the splitting rules. See the section
“RULES= Output Data Set” on page 58 for more information.
SEQUENCE= SAS-data-set
names the output data set to contain statistics on each subtree in the sequence of
subtrees. See the section “SEQUENCE= Output Data Set” on page 61 for more
information.
SUMMARY= SAS-data-set
names the output data set to contain summary statistics. For categorical targets, the
summary statistics consist of the counts and proportions of observations correctly
classified. For interval targets, the summary statistics include the average squared
error and R-squared (= 1 - average squared error / sum of squared errors from the
prediction).
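As a rough illustration of two of the SUMMARY= quantities named above, the Python sketch below computes the count and proportion of correctly classified observations and the average squared error. The function names and the returned dictionary keys are hypothetical. The actual layout of the SUMMARY= data set is not described here, and the procedure may report these quantities per class or per data role.

```python
def classification_summary(actual, predicted):
    """Count and proportion of correctly classified observations."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return {
        "n": len(actual),
        "correct": correct,
        "proportion_correct": correct / len(actual),
    }


def average_squared_error(actual, predicted):
    """Mean of the squared differences between target and prediction."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


print(classification_summary(["Y", "N", "Y", "Y"], ["Y", "N", "N", "Y"]))
print(average_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```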
SCORE Statement
SCORE < options > ;
The SCORE statement reads a data set containing the input variables used by the tree
and outputs a data set containing the original variables plus new variables to contain
predictions, residuals, decisions, and leaf assignments. The SCORE statement may
be repeated.
DATA=SAS-data-set
names the input data set. If the DATA= option is absent, the procedure uses the
training data.
DUMMY
causes the OUT= data set to contain dummy variables, _i_, for i = 1, 2, ..., L, where
L is the number of leaves. The value of the dummy variable _i_ is 1 for observations
assigned exclusively to leaf i, and 0 for observations not in leaf i. For observations
distributed over more than one leaf, _i_ equals the proportion of the observation
assigned to leaf i. The default is NODUMMY, suppressing the creation of dummy
variables.
NODES=nodes
lists the nodes containing the observations to score. If an observation is not assigned
to any node in the list, it does not contribute to the fit statistics and is not output. The
default is to use all the observations.
NOLEAFID
suppresses the creation of variables _NODE_ and _LEAF_ containing the node
and leaf identification numbers of the leaf to which the observation is assigned. The
variables are created by default.
NOPREDICTION
suppresses the generation of prediction variables, such as P_*. The default is
PREDICTION, requesting prediction variables.
OUT=SAS-data-set
names the output data set to contain the scored data. If the OUT= option is absent,
the ARBORETUM procedure creates a data set name using the DATAn convention.
Specify OUT=_NULL_ to avoid creating a scored data set. The “SCORE Statement
OUT= Output Data Set” section on page 59 describes the variables in the OUT= data
set.
OUTFIT=SAS-data-set
names the output data set to contain the fit statistics.
ROLE=TRAIN | VALID | TEST | SCORE
specifies the role of the input data set, and determines the fit statistics to compute. For
ROLE=TRAIN, VALID, or TEST, observations without a target value are ignored.
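The values that the DUMMY option writes for one observation can be sketched in Python. This is a hedged illustration: the leaf_dummies function and its assignment argument (a mapping from leaf index to the proportion of the observation in that leaf) are hypothetical, and the dummy variables are written here as _1_, _2_, and so on.

```python
def leaf_dummies(assignment, num_leaves):
    """Return the dummy-variable values for a single observation.

    assignment maps a leaf index (1..num_leaves) to the proportion of the
    observation assigned to that leaf; leaves not listed receive 0.
    """
    return {
        f"_{i}_": float(assignment.get(i, 0.0))
        for i in range(1, num_leaves + 1)
    }


# Observation assigned exclusively to leaf 2 of 3.
print(leaf_dummies({2: 1.0}, 3))
# Observation distributed over leaves 1 and 3.
print(leaf_dummies({1: 0.25, 3: 0.75}, 3))
```

An observation assigned exclusively to one leaf has a single value of 1, while a distributed observation carries fractional values in each leaf it reaches.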
SEARCH Statement
SEARCH < options > ;
The SEARCH statement is an interactive training statement that searches for splitting
rules in leaves. It behaves like the TRAIN statement except that no branches are formed.
The options for the SEARCH statement are the same as those for the TRAIN statement.
SETRULE Statement
SETRULE NODE=id VAR=var < missing > < / var-values > ;
The SETRULE statement is an interactive training statement that specifies the primary
candidate splitting rule for leaf node id. If the node already has a candidate rule
for the variable specified in the VAR= option, and the missing and var-values options
are omitted, then the candidate rule for the variable is set to the primary candidate
rule for the node. Otherwise, variable values must be assigned to branches using the
var-values option or the MISSONLY missing option. The SEARCH statement is useful
for finding good variable values for a splitting rule. The options in the SETRULE
statement are the same as in the SPLIT statement. Unlike the SPLIT statement, the
SETRULE statement does not search for a split, does not create branches, and requires
the VAR= option.