The ARBORETUM Procedure
• RATIO, the ratio of V_IMPORTANCE to IMPORTANCE, or missing if
IMPORTANCE is less than 0.0001
The NSURROGATES variable is omitted unless surrogates are requested in the
MAXSURROGATES= option in the TRAIN statement.
The V_IMPORTANCE and RATIO variables are omitted unless the VALIDATA=
option appears in the ASSESS statement.
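The RATIO rule described above can be illustrated with a short Python sketch. This is not part of the procedure: the function name `importance_ratio` is hypothetical, and `math.nan` is assumed here as a stand-in for a missing value.

```python
import math


def importance_ratio(importance, v_importance, threshold=0.0001):
    """Return V_IMPORTANCE / IMPORTANCE, or NaN when IMPORTANCE is below the threshold.

    NaN stands in for a missing value (an assumption made for this sketch).
    """
    if importance < threshold:
        return math.nan
    return v_importance / importance


# A variable whose validation importance is half its training importance.
ratio = importance_ratio(0.5, 0.25)

# A variable with negligible training importance gets a missing ratio.
tiny = importance_ratio(0.00005, 0.2)
```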
NODESTATS= Output Data Set
The NODESTATS= option in the SAVE statement specifies the output data set to
contain statistics for each node in the selected subtree. The ASSESS and SUBTREE
statements determine the subtree.
Each observation describes one node. The NODESTATS= data set contains the
following variables:
• NODE, the id of the node
• PARENT, the id of the parent node, or missing if the node is the root
• BRANCH, an integer, beginning with 1, indicating which branch of the parent
leads to this node, or missing if the node is the root
• LEAF, an integer, beginning with 1, indicating the left-to-right position of the
leaf in the tree, or missing if the node is not a leaf
• NBRANCHES, the number of branches emanating from this node, or 0 for
leaf nodes
• DEPTH, the number of splits from the root node to this node
• TRAVERSAL, an integer indicating when this node appears in a depth-first,
left-to-right traversal
• LINKWIDTH, a suggested width for displaying the line from the parent to this
node
• LINKCOLOR, a suggested RGB color value for displaying the line from the
parent to this node
• NODETEXT, a character value of a node statistic
• ABOVETEXT, a character value pertaining to the definition of the branch to
this node
• BELOWTEXT, the name or label of the input variable used to split this node,
or blank
• N, the number of training observations
• NPRIORS, the number N adjusted for prior probabilities
• VN, the number of validation observations
• VNPRIORS, the number VN adjusted for prior probabilities
• _RASE_, the root average square error
• _VRASE_, the root average square error based on validation data
• I_:, D_:, EL_:, EP_:, P_:, U_:, V_:, the variables output in the OUT= option in
the SCORE statement
The variables VN, VNPRIORS, and _VRASE_ occur only if validation data are
specified. The variables NPRIORS and VNPRIORS occur only for categorical targets.
The variables _RASE_ and _VRASE_ occur only for interval targets. The colon
in a name expression such as I_: refers to all variables whose names begin with
I_. The section “Variable Names and Conditions for Their Creation” on page 59
describes the variables output by the SCORE statement.
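As an illustration of how the NODE, PARENT, and BRANCH variables relate, the following Python sketch rebuilds the tree from NODESTATS-style rows and walks it depth-first, left to right. The row layout (a list of dictionaries), the use of `None` for a missing value, and the function names are all assumptions made for this example. The data are invented.

```python
def build_children(rows):
    """Map each node id to its child ids, ordered by BRANCH (1 = leftmost)."""
    children = {row["NODE"]: [] for row in rows}
    root = None
    for row in rows:
        if row["PARENT"] is None:  # None stands in for a missing value
            root = row["NODE"]
        else:
            children[row["PARENT"]].append((row["BRANCH"], row["NODE"]))
    ordered = {
        node: [child for _, child in sorted(kids)]
        for node, kids in children.items()
    }
    return root, ordered


def depth_first(root, children):
    """Yield (node, depth) pairs in depth-first, left-to-right order."""
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        yield node, depth
        # Push children right to left so the leftmost child is visited first.
        for child in reversed(children[node]):
            stack.append((child, depth + 1))


# Invented example: a root with two branches, the first of which splits again.
rows = [
    {"NODE": 1, "PARENT": None, "BRANCH": None},
    {"NODE": 2, "PARENT": 1, "BRANCH": 1},
    {"NODE": 3, "PARENT": 1, "BRANCH": 2},
    {"NODE": 4, "PARENT": 2, "BRANCH": 1},
    {"NODE": 5, "PARENT": 2, "BRANCH": 2},
]
root, children = build_children(rows)
order = list(depth_first(root, children))
leaves = [node for node, _ in order if not children[node]]
```

In this sketch the position of a node in `order` plays the role of TRAVERSAL, the second element of each pair plays the role of DEPTH, and the position of a node in `leaves` plays the role of LEAF.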
If no prior probabilities are specified in the DECISION statement, then N and
NPRIORS are equal. NPRIORS times P_namej equals the number of training
observations with categorical target value j, adjusted for prior probabilities.
VNPRIORS times V_namej equals the number of validation observations with
category j, adjusted for prior probabilities.
The number of training observations with target value j, not adjusted for prior
probabilities, is

\[
N_j = N \, \frac{\mathtt{P\_name}j \; N_j(\mathrm{root}) / \pi_j}
               {\sum_i \mathtt{P\_name}i \; N_i(\mathrm{root}) / \pi_i}
\]

where \(N_j(\mathrm{root})\) is the number of observations in the root node with
target value j, and \(\pi_j\) denotes the prior probability for j.
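The formula can be checked numerically with a small Python sketch. The function name `unadjusted_counts` and the example figures are invented for illustration. The posterior values passed in are assumed to be the prior-adjusted P_namej values for the node.

```python
def unadjusted_counts(n, posteriors, root_counts, priors):
    """Recover per-class training counts in a node, not adjusted for priors.

    n           -- N, the number of training observations in the node
    posteriors  -- P_namej for each class j (adjusted for priors)
    root_counts -- N_j(root), the count of class j in the root node
    priors      -- pi_j, the prior probability of class j
    """
    weights = [
        p * r / pi for p, r, pi in zip(posteriors, root_counts, priors)
    ]
    total = sum(weights)
    return [n * w / total for w in weights]


# Invented example: the root holds 900 and 100 observations of two classes,
# the priors are equal, and a node of 50 observations has adjusted
# posteriors of 1/7 and 6/7.
counts = unadjusted_counts(50, [1 / 7, 6 / 7], [900, 100], [0.5, 0.5])
```

In this example the recovered counts are approximately 30 and 20, which sum to N = 50 as the normalization in the denominator requires.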
PATH= Output Data Set
The PATH= option in the SAVE statement specifies the output data set describing
the observations the tree assigns to a node. The description consists of a set of
relationships between variables and values. Observations that satisfy all the
relations are assigned to the node.
The PATH= output data set describes the path to each leaf in the current subtree unless
the NODES= option specifies which nodes to describe.
The PATH= data set contains the following variables:
• NODE, the id of the node
• LEAF, the leaf number, if the node is a leaf
• VARNAME, the name of the variable
• VARIABLE, the variable label, or name if no label
• RELATION, a character variable containing the relation that an observation
value must have to be in the node
• CHARACTER_VALUE, the formatted value of the variable
• NUMERIC_VALUE, the numeric value of a numeric variable
Each observation contains a single variable value, unless the relation is MISSING or
NOT MISSING. The relation MISSING indicates that missing values of the variable
are accepted in the node. The relation NOT MISSING indicates that missing values
of the variable are excluded from the node. If the relation is neither MISSING nor
NOT MISSING, then the contents of the observation depend on the level of
measurement of the variable.
For a nominal variable, CHARACTER_VALUE contains one formatted value of the
variable, RELATION is ‘=’, and NUMERIC_VALUE is missing.
For an interval or ordinal variable, the path determines a range of values in the node.
The upper end of the range may be infinite, or the lower end may be infinitely
negative, but at least one end will be finite (otherwise RELATION would equal NOT
MISSING). The first observation contains the lower end of the range, and the second
contains the upper end. If an end is unbounded, CHARACTER_VALUE is blank and
NUMERIC_VALUE is missing for that observation. Otherwise, for an interval
variable, both CHARACTER_VALUE and NUMERIC_VALUE contain the end value,
and RELATION contains ‘>=’ or ‘<’.
For an ordinal variable, CHARACTER_VALUE contains the formatted value of an
end, and NUMERIC_VALUE is missing. RELATION is ‘>=’ or ‘<=’.
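To make the row semantics concrete, the following Python sketch tests whether an observation satisfies the relations recorded for one node. It covers only nominal and interval variables and rests on several assumptions that the text does not state: PATH rows are held as dictionaries, `None` stands in for a missing value, multiple ‘=’ rows for one nominal variable are read as a set of allowed values, and a missing value is accepted only when a MISSING row is present. The function name `satisfies` and the data are invented.

```python
def satisfies(path_rows, obs):
    """Return True if the observation meets every relation recorded for the node."""
    by_var = {}
    for row in path_rows:
        by_var.setdefault(row["VARNAME"], []).append(row)

    for var, rows in by_var.items():
        value = obs.get(var)
        relations = {row["RELATION"] for row in rows}
        if value is None:
            # Assumption: a missing value passes only if a MISSING row accepts it.
            if "MISSING" not in relations:
                return False
            continue
        # Nominal variable: the value must match one of the '=' rows (assumption).
        allowed = [r["CHARACTER_VALUE"] for r in rows if r["RELATION"] == "="]
        if allowed and str(value) not in allowed:
            return False
        # Interval variable: check each finite end of the range.
        for r in rows:
            bound = r["NUMERIC_VALUE"]
            if bound is None:
                continue  # unbounded end, or a row that carries no number
            if r["RELATION"] == ">=" and not value >= bound:
                return False
            if r["RELATION"] == "<" and not value < bound:
                return False
    return True


# Invented path: AGE of at least 30 with no upper bound, and REGION equal to
# EAST or WEST, with missing REGION values accepted.
path = [
    {"VARNAME": "AGE", "RELATION": ">=", "CHARACTER_VALUE": "30", "NUMERIC_VALUE": 30.0},
    {"VARNAME": "AGE", "RELATION": "<", "CHARACTER_VALUE": "", "NUMERIC_VALUE": None},
    {"VARNAME": "REGION", "RELATION": "=", "CHARACTER_VALUE": "EAST", "NUMERIC_VALUE": None},
    {"VARNAME": "REGION", "RELATION": "=", "CHARACTER_VALUE": "WEST", "NUMERIC_VALUE": None},
    {"VARNAME": "REGION", "RELATION": "MISSING", "CHARACTER_VALUE": "", "NUMERIC_VALUE": None},
]

in_node = satisfies(path, {"AGE": 42, "REGION": "WEST"})
too_young = satisfies(path, {"AGE": 25, "REGION": "EAST"})
missing_region = satisfies(path, {"AGE": 42, "REGION": None})
missing_age = satisfies(path, {"AGE": None, "REGION": "EAST"})
```

Ordinal variables are left out because comparing their formatted values requires the ordering of the levels, which the PATH= rows alone do not carry in this sketch.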
RULES= Output Data Set
The RULES= option in the SAVE statement specifies the output data set describing
the splitting rules in each node, including surrogate rules, unused competing rules,
and candidate rules in leaf nodes. The data set only contains nodes in the subtree
determined by the ASSESS or SUBTREE statement.
The RULES= data set contains the following variables:
• NODE, the id of the node
• ROLE, a character variable with four possible values: ‘PRIMARY’ for the
primary splitting rule, ‘COMPETITOR’ for a competing rule, ‘SURROGATE’
for a surrogate rule, and ‘CANDIDATE’ for a candidate splitting rule in a leaf
• RANK, the rank among other rules with the same role
• STAT, a character variable containing the name of the statistic in the
NUMERIC_VALUE or CHARACTER_VALUE variable
• NUMERIC_VALUE, the numeric value of the statistic, if any
• CHARACTER_VALUE, the character value of the statistic, if any
A single rule is described using several observations. The STAT variable determines
what an observation describes. Table 8 summarizes the possible values of STAT.
Table 8. Statistics

STAT        NUMERIC_VALUE    CHARACTER_VALUE
VARIABLE                     Variable name
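Because one rule spans several observations, a reader of the RULES= data set typically has to regroup the rows. The Python sketch below shows one way to do that, assuming that NODE, ROLE, and RANK together identify a rule and that each STAT value occurs at most once per rule. Neither assumption is stated in the text. The function name `collect_rules` and the data are invented, and only the VARIABLE statistic is used.

```python
from collections import defaultdict


def collect_rules(rows):
    """Group RULES-style rows into one dictionary of statistics per rule."""
    rules = defaultdict(dict)
    for row in rows:
        # Assumption: (NODE, ROLE, RANK) identifies a single rule.
        key = (row["NODE"], row["ROLE"], row["RANK"])
        numeric = row["NUMERIC_VALUE"]
        # Use the numeric value when present, otherwise the character value.
        value = numeric if numeric is not None else row["CHARACTER_VALUE"]
        rules[key][row["STAT"]] = value
    return dict(rules)


# Invented rows: the primary rule and one competing rule in node 1.
rows = [
    {"NODE": 1, "ROLE": "PRIMARY", "RANK": 1, "STAT": "VARIABLE",
     "NUMERIC_VALUE": None, "CHARACTER_VALUE": "AGE"},
    {"NODE": 1, "ROLE": "COMPETITOR", "RANK": 1, "STAT": "VARIABLE",
     "NUMERIC_VALUE": None, "CHARACTER_VALUE": "INCOME"},
]
rules = collect_rules(rows)
primary_variable = rules[(1, "PRIMARY", 1)]["VARIABLE"]
```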