missing> / var-values > >
The SPLIT statement is an interactive training statement that specifies how to split a
leaf node. Only id is required. The VAR= option specifies which input variable to
use in the splitting rule and is required if any other option is specified. The missing
option specifies the branch to which missing values are assigned, and the var-values
option specifies the branches to which nonmissing values are assigned. The missing
option requires the var-values option, and together they determine the number of branches.
VAR=var
specifies which input variable to use in the splitting rule. If the VAR= option is
omitted and the leaf contains a candidate split, then the ARBORETUM procedure
uses the primary candidate rule to create branches. Otherwise, the procedure
searches for a splitting rule and creates the branches if a rule is found.
missing
specifies the branch to which an observation is assigned when var is missing. The
missing option may be one of the following:
speciﬁes branch b for missing values. If b is greater than the
branch speciﬁed in the var-values option is used.
MISSDISTRIBUTE speciﬁes that observations with missing values be distributed
over all branches. Using the CODE statement with a rule using
MISSDISTRIBUTE is an error.
is added to the branches specified in the var-values option. If var-values
is absent, a binary split is created, and observations with a
If the var-values option is omitted, the missing option is ignored.
If the missing option is omitted, then the ARBORETUM procedure will
honor the MISSING= option in the INPUT statement for the variable.
MISSING=USEINSEARCH and the var-values option speciﬁes the branches for
among the observations with missing values in the within-node training sample is
assigned to missing values as if MISSING=SMALLRESIDUAL were speciﬁed.
var-values
specifies a list of values of the variable. The following table
summarizes the form of the list for each level of measurement.

List of Variable Values

Level of Variable    Form of List
nominal              all values, using commas to separate branches
ordinal              minimum branch values in increasing order
interval             splitting values in increasing order
The ARBORETUM Procedure
For nominal and ordinal variables, specify formatted values in quotes. For a nominal
variable, specify a list of all values with a comma (‘,’) inserted to separate categories
assigned to different branches. Categories appearing before the first comma are
assigned to the first branch.
For an ordinal variable, specify the smallest value for each branch except the first
branch. Only one value should appear for a binary split. Commas are prohibited.
For an interval variable, specify an increasing list of values for separating branches.
A single value specifies a binary split. An observation with a value less than the first
specified number is assigned to the first branch. An observation whose value equals
the first number is assigned to the second branch. A list of n numbers specifies n+1
branches.
The missing and var-values options determine the number of branches, overriding the
MAXBRANCHES= option in the TRAIN statement.
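
The branch assignment for an interval variable can be illustrated outside SAS. The
following Python sketch is not part of the ARBORETUM procedure; the function name
interval_branch and its arguments are hypothetical. It relies only on the stated rule
(a value below the first number goes to the first branch, a value equal to it goes to
the second) and assumes that the same rule applies at every later splitting value.

```python
from bisect import bisect_right
from typing import Sequence


def interval_branch(value: float, split_values: Sequence[float]) -> int:
    """Return the 1-based branch number for a nonmissing interval value.

    Hypothetical helper: it mimics the documented rule for an increasing
    list of splitting values, where n values define n + 1 branches.
    """
    splits = list(split_values)
    # The documentation requires the splitting values to be increasing.
    if any(a >= b for a, b in zip(splits, splits[1:])):
        raise ValueError("split values must be in increasing order")
    # bisect_right counts the split values that are <= value, so a value
    # equal to a split value lands in the branch to the right of it.
    return bisect_right(splits, value) + 1
```

With splitting values 10 and 20, a value of 9.9 falls in branch 1, a value of 10 in
branch 2, and a value of 20 in branch 3, for n+1 = 3 branches in total. Missing values
are outside the scope of this sketch.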
If the var-values option is not specified, a split is made on the candidate rule for
var stored in the leaf. If no candidate rule exists, the ARBORETUM procedure
searches for a split using variable var and creates the branches if a split is found.
SUBTREE BEST | LARGEST | NLEAVES=nleaves ;
The SUBTREE statement selects a subtree from the sequence of subtrees. See the
“Tree Assessment and the Subtree Sequence” section beginning on page 49 for more
information.
BEST
selects the smallest subtree with the best assessment value.
LARGEST
selects the largest subtree. The largest subtree is the tree with all the nodes.
NLEAVES=nleaves
selects the largest subtree with no more than nleaves leaves.
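
The three selection rules can be compared with a small Python sketch. It is an
illustration only, not the procedure's implementation: the Subtree record, the
function names, and the convention that a larger assessment value is better are all
assumptions made for this example.

```python
from typing import NamedTuple, Sequence


class Subtree(NamedTuple):
    """Hypothetical summary of one subtree in the subtree sequence."""
    n_leaves: int
    assessment: float  # assumed here: larger is better


def select_best(sequence: Sequence[Subtree]) -> Subtree:
    # BEST: among subtrees with the best assessment, take the smallest.
    best = max(s.assessment for s in sequence)
    ties = [s for s in sequence if s.assessment == best]
    return min(ties, key=lambda s: s.n_leaves)


def select_largest(sequence: Sequence[Subtree]) -> Subtree:
    # LARGEST: the subtree with the most leaves (the full tree).
    return max(sequence, key=lambda s: s.n_leaves)


def select_nleaves(sequence: Sequence[Subtree], nleaves: int) -> Subtree:
    # NLEAVES=: the largest subtree with no more than nleaves leaves.
    eligible = [s for s in sequence if s.n_leaves <= nleaves]
    if not eligible:
        raise ValueError("no subtree has that few leaves")
    return max(eligible, key=lambda s: s.n_leaves)


# Made-up sequence for illustration; the values are not from any real tree.
example = [Subtree(1, 0.50), Subtree(2, 0.70), Subtree(3, 0.80),
           Subtree(4, 0.80), Subtree(5, 0.75)]
```

In this made-up sequence, the subtrees with 3 and 4 leaves tie on assessment, so the
BEST rule picks the one with 3 leaves, while LARGEST picks the one with 5 leaves.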
The TARGET statement names the variable the model tries to predict. The
LEVEL= option specifies the level of measurement of the target variable. The default
is INTERVAL for a numeric variable and NOMINAL for a character variable.
The ORDER= option specifies the ordering of the values of an ordinal target variable.
It is available only when LEVEL=ORDINAL is specified, and it has no effect on a
target variable with only two values. The option is the same as the ORDINAL=
option in the INPUT statement.
TRAIN < / options > ;
The TRAIN statement grows the tree by searching for splitting rules in leaves,
applying the rules to create branches, and repeating the process in the newly formed
leaves. Most options remain in effect for subsequent SEARCH, SPLIT, and TRAIN
statements. The exceptions are MAXNEWDEPTH= and NODES=.
specifies a threshold p-value for the significance level of a candidate splitting
rule, applicable to splitting criteria that depend on p-values, namely
CRITERION=PROBF and PROBCHISQ. The default value of p is 0.20. For
splitting criteria not based on p-values, the ARBORETUM procedure uses the value
specifies the maximum number of splits allowed in a complete enumeration of all
possible splits. The exhaustive method of searching for a split examines all possible
splits. If the number of possible splits is greater than n, then a heuristic search is
done instead of an exhaustive search. The exhaustive and heuristic search methods
apply only to multiway splits, and to binary splits on nominal targets with more than
two values. See the “Split Search Algorithm” section.
The default value of n is 5,000.
INTERVALBINS=n
indirectly specifies the minimum allowable width between two successive candidate
split points on an interval input. The width equals (max(x) − min(x))/(n + 1),
where max(x) and min(x) are the maximum and minimum of the input variable
values in the within-node sample being searched. The width is computed separately
for each input and each node. The INTERVALBINS= option may indirectly modify
p-value adjustments. The search algorithm ignores the INTERVALBINS= option
specified in the SEARCHBINS= option. The default value of n is 100.
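
As a worked illustration of the width formula, the following Python sketch computes
(max(x) − min(x))/(n + 1) for the values of one input in one node. The function name
min_split_width is hypothetical, and the sketch assumes the values passed in are the
nonmissing values of the within-node sample.

```python
from typing import Iterable


def min_split_width(node_values: Iterable[float], n: int = 100) -> float:
    """Minimum width between successive candidate split points.

    Hypothetical helper implementing (max(x) - min(x)) / (n + 1), where n
    plays the role of the INTERVALBINS= value (documented default 100).
    """
    values = list(node_values)
    if not values:
        raise ValueError("the within-node sample has no values")
    if n < 0:
        raise ValueError("n must be nonnegative")
    return (max(values) - min(values)) / (n + 1)
```

For example, an input ranging from 0 to 101 in a node gives a width of 101/101 = 1
with the default n = 100. Because the range is taken within the node, the same input
can have a different width in every node.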
LEAFSIZE=n
specifies the smallest number of training observations a new branch may have. The
default value equals the number of observations in the training data divided by 1,000,
but no less than 5 and no more than 5,000.
The number n applies to the within-node training sample used during the split
search, described in the “Within Node Training Sample” section on page 46. The
LEAFSIZE= option does not use the values of the variable in the FREQ statement to
adjust the count of observations in the leaf.
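
The default for LEAFSIZE= is a clamped ratio, which the following Python sketch makes
explicit. The function name default_leafsize is hypothetical, and the text does not
say how a fractional result is rounded, so the sketch leaves the quotient unrounded.

```python
def default_leafsize(n_train_obs: int) -> float:
    """Default LEAFSIZE= value as described in the text.

    Hypothetical helper: the number of training observations divided by
    1,000, raised to 5 if smaller and capped at 5,000 if larger. Rounding
    of fractional quotients is not specified in the source.
    """
    return min(5000, max(5, n_train_obs / 1000))
```

Under this reading, a training data set of 1,000 observations gets the floor of 5, one
of 100,000 observations gets 100, and one of 10,000,000 observations gets the cap of
5,000.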