The SPLIT Procedure
PRUNE Statement
Deletes all nodes descended from any specified node.
Interaction: The PRUNE statement requires an INTREE= or INDMSPLIT procedure option.
PRUNE list-of-node-identifiers;
Required Argument
list-of-node-identifiers
Specifies the nodes that will have no children.
Range: Integer > 0
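The effect of PRUNE can be pictured with a small sketch outside of SAS. The Python below is illustrative only and is not how the procedure is implemented; the dictionary representation of the tree and the function name prune are assumptions made for this example.

```python
def prune(children, node_ids):
    """Return a copy of the tree in which each node in node_ids has no children.

    children maps a node identifier to the list of its child identifiers
    (a hypothetical representation chosen for this sketch).
    """
    pruned = {node: list(kids) for node, kids in children.items()}
    for node in node_ids:
        if node not in pruned:
            continue  # already deleted as a descendant of another pruned node
        stack = pruned[node]
        pruned[node] = []  # the specified node stays, but becomes a leaf
        while stack:
            current = stack.pop()
            # Delete the descendant and queue its own children for deletion.
            stack.extend(pruned.pop(current, []))
    return pruned

# Hypothetical tree: node 1 is the root.
tree = {1: [2, 3], 2: [4, 5], 3: [], 4: [6, 7], 5: [], 6: [], 7: []}
pruned_tree = prune(tree, [2])
print(pruned_tree)
```

Here node 2 is kept as a leaf, and nodes 4, 5, 6, and 7 are deleted because they descend from it.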
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
SCORE Statement
Specifies that the data be scored.
SCORE DATA=SAS-data-set OUT=SAS-data-set <score-option(s)> <NODES=node-list>;
Required Arguments
DATA=SAS-data-set
Specifies input data that contains inputs and, optionally, targets.
OUT=SAS-data-set
Output data set with outputs.
Options
DUMMY
Includes a dummy variable for each node. For each observation, the value of a dummy variable
is 1 if the observation appears in the corresponding node and 0 if it does not.
NODES=node-list
Specifies a list of nodes used to score the observations. If an observation does not fall into any
node in the list, it does not contribute to the statistics and is not output. If an observation occurs in more
than one node, it contributes multiple times to the statistics and is output once for each node it
occurs in.
Interaction: The NODES= option requires the INTREE= or INDMSPLIT procedure option.
Default: The default is the list of leaf nodes. Omitting the NODES= option results in
the decisions, utilities, and leaf assignment being output for each observation
in the DATA= data set.
NOLEAFID
Does not include leaf identifiers or node numbers.
NOPRED
Does not include predicted values.
OUTFIT=SAS-data-set
Output data set with fit statistics.
ROLE=role-value
Specifies the role of the DATA= data set. The ROLE= option primarily affects what fit statistics
are computed and what their names and labels are. Role-value can be:
TRAIN
The default when the DATA= data set named in the SCORE statement is the same as the
DATA= data set named in the PROC statement.
VALID | VALIDATION
The default when the DATA= data set named in the SCORE statement is the same as the
data set named in the VALIDATA= option in the PROC statement.
TEST
The default when the DATA= data set named in the SCORE statement is not the same as the
data set named in the DATA= or VALIDATA= option in the PROC statement.
SCORE
Residuals, computed profit, and fit statistics are not produced.
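The NODES= behavior described above (an observation is output once for each listed node it occurs in, and not at all if it occurs in none) can be sketched as follows. This Python is illustrative only and is not SAS code; the function name, the node_members mapping, and the in_node_ column names are assumptions made for this example.

```python
def score_with_nodes(observations, node_members, node_list, dummy=False):
    """Emit one output row per (observation, node) pair for nodes in node_list.

    node_members maps a node identifier to the set of observation ids that
    fall into that node (a hypothetical representation for this sketch).
    """
    rows = []
    for obs_id in observations:
        hits = [n for n in node_list if obs_id in node_members[n]]
        # An observation that falls into none of the listed nodes is not output.
        for node in hits:
            row = {"obs": obs_id, "node": node}
            if dummy:
                # One 0/1 indicator per listed node, in the spirit of DUMMY.
                for n in node_list:
                    row[f"in_node_{n}"] = 1 if obs_id in node_members[n] else 0
            rows.append(row)
    return rows

# Hypothetical tree: node 1 is the root, nodes 2 and 3 are its children.
members = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"c"}}
rows_all = score_with_nodes(["a", "b", "c"], members, [1, 2], dummy=True)
rows_leaf = score_with_nodes(["a", "b", "c"], members, [2])
print(len(rows_all), len(rows_leaf))
```

With the list [1, 2], observations a and b are each output twice (once per node) and c once. With the list [2], observation c is dropped entirely.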
TARGET Statement
Specifies an output variable.
TARGET variable </ LEVEL=measurement>;
Required Argument
variable
Specifies the variable that the model-fitting tries to predict.
Options
LEVEL=measurement
Specifies the measurement level, where measurement can be:
BINARY
NOMINAL
ORDINAL
INTERVAL
Default: LEVEL=INTERVAL
Details
Missing Values
Observations in which the target value is missing are ignored when training or validating the tree.
If EXCLUDEMISS is specified, then observations with missing values are excluded during the search
for a splitting rule. A search uses only one variable, and so only the observations missing on the single
candidate input are excluded. An observation missing input x but not missing input y is used in the
search for a split on y but not x. After a split is chosen, the rule is amended to assign missing values to
the largest branch.
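The per-input exclusion under EXCLUDEMISS can be illustrated with a short sketch. This Python is not SAS code and does not reproduce the procedure's search. The helper names are hypothetical, and "largest branch" is assumed here to mean the branch with the most training observations.

```python
def usable_rows(rows, candidate):
    """Rows that take part in the split search on one candidate input."""
    # Only a missing value on the candidate itself excludes a row.
    return [r for r in rows if r[candidate] is not None]

def branch_for_missing(branch_sizes):
    """After a split is chosen, missing values go to the largest branch."""
    # Assumption: "largest" means the most training observations.
    return max(branch_sizes, key=branch_sizes.get)

rows = [
    {"x": 1.0, "y": "a"},
    {"x": None, "y": "b"},   # missing x, so used for y but not for x
    {"x": 3.0, "y": None},   # missing y, so used for x but not for y
]
rows_for_x = usable_rows(rows, "x")
rows_for_y = usable_rows(rows, "y")
missing_goes_to = branch_for_missing({"left": 40, "right": 60})
print(len(rows_for_x), len(rows_for_y), missing_goes_to)
```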
If EXCLUDEMISS is not specified, the search for a split on an input treats missing values as a special,
acceptable value, and includes them in the search. All observations with missing values are assigned to
the same branch.
The branch may or may not contain other observations. The branch chosen is the one that maximizes the
split worth.
For splits on a categorical variable, this amounts to treating a missing value as a separate category. For
numerical variables, it amounts to treating missing values as having the same unknown non-missing
value.
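As a rough illustration of searching with missing values included, the sketch below tries each threshold on a numeric input with all missing values sent together to one branch or the other, and keeps the candidate with the highest worth. It is a simplification under stated assumptions, not the procedure's algorithm: splits are binary, the worth is taken to be the reduction in the sum of squared errors of an interval target, and all function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def worth(left, right):
    # Assumed worth measure: reduction in SSE achieved by the split.
    return sse(left + right) - sse(left) - sse(right)

def best_split_with_missing(x, y):
    """Return (worth, threshold, branch_for_missing) for the best binary split."""
    present = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    missing_y = [yi for xi, yi in zip(x, y) if xi is None]
    best = None
    for t in sorted({xi for xi, _ in present}):
        left = [yi for xi, yi in present if xi <= t]
        right = [yi for xi, yi in present if xi > t]
        # All missing values travel together to the same branch.
        for side in ("left", "right"):
            l = left + missing_y if side == "left" else left
            r = right + missing_y if side == "right" else right
            if not l or not r:
                continue  # not a real split
            w = worth(l, r)
            if best is None or w > best[0]:
                best = (w, t, side)
    return best

x = [1, 2, 3, 4, None, None]
y = [10, 11, 20, 21, 20, 22]
best = best_split_with_missing(x, y)
print(best)
```

In this made-up data the missing observations have targets close to those of the high-x observations, so the best candidate sends them to the right branch. Because the largest threshold is also tried, the search can place the missing values in a branch by themselves.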
One advantage of using missing data during the search is that the worth of a split is computed with the
same number of observations for each input. Another advantage is that an association of the missing
values with the target values can contribute to the predictive ability of the split. One disadvantage is that
missing values could unjustifiably dominate the choice of split.
When a split is applied to an observation in which the required input value is missing, surrogate splitting
rules are considered before assigning the observation to the branch for missing values.
A surrogate splitting rule is a backup to the main splitting rule. For example, the main splitting rule
might use county as input and the surrogate might use region. If the county is unknown and the region is
known, the surrogate is used.
If several surrogate rules exist, each surrogate is considered in sequence until one can be applied to the
observation. If none can be applied, the main rule assigns the observation to the branch designated for
missing values.
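The fallback sequence described above can be sketched as follows. This Python is illustrative only and is not SAS code. The rule representation, the function name assign_branch, and the county and region values are assumptions made for the example, and the surrogates are assumed to be already in the order in which they are considered.

```python
def assign_branch(obs, main_rule, surrogates, missing_branch):
    """Apply the main rule if possible, else the first applicable surrogate."""
    for rule in [main_rule] + surrogates:
        value = obs.get(rule["input"])
        if value is not None:
            return rule["apply"](value)
    # No rule could be applied: use the branch designated for missing values.
    return missing_branch

# Hypothetical rules following the county/region example in the text.
main_rule = {
    "input": "county",
    "apply": lambda v: "left" if v in {"Wake", "Durham"} else "right",
}
surrogates = [
    {"input": "region", "apply": lambda v: "left" if v == "Piedmont" else "right"},
]

by_main = assign_branch({"county": "Wake", "region": "Coastal"},
                        main_rule, surrogates, "right")
by_surrogate = assign_branch({"county": None, "region": "Piedmont"},
                             main_rule, surrogates, "right")
by_default = assign_branch({"county": None, "region": None},
                           main_rule, surrogates, "right")
print(by_main, by_surrogate, by_default)
```

The first observation is assigned by the main rule even though the surrogate would disagree, the second by the surrogate, and the third falls through to the branch designated for missing values.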
The surrogates are considered in the order of their agreement with the main splitting rule. The agreement
of a surrogate is measured as the proportion of training observations that the surrogate and the main rule
assign to the same branch. The measure excludes the observations that the main rule cannot be applied to. Among the remaining
observations, those on which the surrogate rule cannot be applied count as observations not assigned to
the same branch. Thus, an observation with a missing value on the input used in the surrogate rule but