SPLIT Statement
SPLIT NODE=id < VAR=var < <missing> / var-values > > ;
The SPLIT statement is an interactive training statement that specifies how to split a
leaf node. Only id is required. The VAR= option specifies which input variable to
use in the splitting rule and is required if any other option is specified. The missing
option specifies which branch to assign missing values, and the var-values option
specifies which branch to assign nonmissing values. The missing option requires the
var-values option, and together they determine the number of branches.
VAR=var
specifies which input variable to use in the splitting rule. If the VAR= option is
omitted and the leaf contains a candidate split, then the ARBORETUM procedure
will use the primary candidate rule to create branches. Otherwise, the procedure will
search for a splitting rule and create the branches if a rule is found.
missing
specifies which branch to assign an observation in which var is missing. The missing
option may be one of the following:
MISSBRANCH=b
specifies branch b for missing values. If b is greater than the number of branches
implied by the var-values option, then the last branch specified in the var-values
option is used.
MISSDISTRIBUTE
specifies that observations with missing values be distributed over all branches.
Using the CODE statement with a rule using MISSDISTRIBUTE is an error.
MISSONLY
reserves the last branch exclusively for missing values. The branch is added to the
branches specified in the var-values option. If var-values is absent, a binary split
is created, and observations with a nonmissing value of the variable are assigned to
the first branch.
If the var-values option is omitted, the missing option is ignored.
If the missing option is omitted, then the ARBORETUM procedure will
honor the MISSING= option in the INPUT statement for the variable.
If MISSING=USEINSEARCH and the var-values option specifies the branches for
nonmissing values, then the branch that creates the smallest residual square error
among the observations with missing values in the within-node training sample is
assigned to missing values as if MISSING=SMALLRESIDUAL were specified.
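The branch rules for the three missing options can be sketched in Python as follows. This is only an illustration of the rules stated above, not the procedure's implementation. The function name, the argument names, and the "ALL" return marker are hypothetical, and b is assumed to be at least 1.

```python
def resolve_missing_branch(n_value_branches, missing=None, b=None):
    """Return (total_branches, branch_for_missing) for a SPLIT-style rule.

    n_value_branches: number of branches implied by the var-values list
        (1 when var-values is absent, so that MISSONLY yields a binary split).
    missing: None, "MISSBRANCH", "MISSDISTRIBUTE", or "MISSONLY".
    b: the MISSBRANCH= value (1-based, assumed >= 1).
    """
    if missing is None:
        # Option omitted: the INPUT statement's MISSING= option applies.
        return n_value_branches, None
    if missing == "MISSBRANCH":
        # A b beyond the last branch falls back to the last branch.
        return n_value_branches, min(b, n_value_branches)
    if missing == "MISSDISTRIBUTE":
        # Missing values are spread over all branches.
        return n_value_branches, "ALL"
    if missing == "MISSONLY":
        # One extra branch is appended and reserved for missing values.
        total = n_value_branches + 1
        return total, total
    raise ValueError(f"unknown missing option: {missing!r}")
```

For example, resolve_missing_branch(3, "MISSBRANCH", 5) gives (3, 3), because branch 5 does not exist and the last branch is used instead.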
var-values
specifies a list of values of the variable. Table 6 summarizes the form of the list
appropriate for the different levels of measurement.
Table 6. List of Variable Values
Measurement Level    Form of List
Nominal              all values, using commas to separate branches
Ordinal              minimum branch values in increasing order
Interval             splitting values in increasing order
For nominal and ordinal variables, specify formatted values in quotes. For a nominal
variable, specify a list of all values with a comma (‘,’) inserted to separate categories
assigned to different branches. Categories appearing before the first comma are
assigned to the first branch.
For an ordinal variable, specify the smallest value for each branch except the first
branch. Only one value should appear for a binary split. Commas are prohibited.
For an interval variable, specify an increasing list of values for separating branches.
A single value specifies a binary split. An observation with a value less than the first
specified number is assigned to the first branch. An observation whose value equals
the first number is assigned to the second branch. A list of n numbers specifies n+1
branches.
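The interval rule above can be illustrated with a short Python sketch. The function name is hypothetical. The text states the tie rule only for the first number (a value equal to it goes to the second branch); the sketch assumes that the same rule holds at every later splitting value.

```python
from bisect import bisect_right


def interval_branch(value, split_values):
    """Return the 1-based branch for a nonmissing interval value.

    split_values: strictly increasing splitting values; n values imply
    n + 1 branches.
    """
    if any(a >= b for a, b in zip(split_values, split_values[1:])):
        raise ValueError("split_values must be strictly increasing")
    # bisect_right counts the splitting values that are <= value, so a
    # value equal to a splitting value lands in the higher branch.
    return bisect_right(split_values, value) + 1
```

With split_values [30, 50], the value 29 falls in branch 1, the value 30 in branch 2, and the value 50 in branch 3.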
The missing and var-values options determine the number of branches, overriding the
MAXBRANCHES= option in the TRAIN statement.
If the var-values option is not specified, a split is made on the candidate rule
stored in the leaf for the variable. If no candidate rule exists, the ARBORETUM
procedure will search for a split using the variable and create the branches if a
split is found.
SUBTREE Statement
SUBTREE BEST | LARGEST | NLEAVES=nleaves ;
The SUBTREE statement selects a subtree from the sequence of subtrees. See the
“Tree Assessment and the Subtree Sequence” section beginning on page 49 for more
information.
BEST
selects the smallest subtree with the best assessment value.
LARGEST
selects the largest subtree. The largest subtree is the tree with all the nodes.
NLEAVES=nleaves
selects the largest subtree with no more than nleaves leaves.
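The three selection rules can be sketched in Python as follows. The sketch represents the subtree sequence as a list of (number of leaves, assessment value) pairs and assumes that a larger assessment value is better. Both the representation and that assumption are hypothetical; the actual assessment measures are described in the section referenced above.

```python
def select_subtree(subtrees, method, nleaves=None):
    """Pick one (n_leaves, assessment) pair from a subtree sequence.

    method: "BEST", "LARGEST", or "NLEAVES" (the latter requires nleaves).
    Assumes a larger assessment value is better.
    """
    if method == "LARGEST":
        # The subtree with the most leaves, that is, the full tree.
        return max(subtrees, key=lambda s: s[0])
    if method == "BEST":
        # Among subtrees tied for the best assessment, take the smallest.
        best = max(a for _, a in subtrees)
        return min((s for s in subtrees if s[1] == best), key=lambda s: s[0])
    if method == "NLEAVES":
        # The largest subtree that has no more than nleaves leaves.
        eligible = [s for s in subtrees if s[0] <= nleaves]
        if not eligible:
            raise ValueError("no subtree has that few leaves")
        return max(eligible, key=lambda s: s[0])
    raise ValueError(f"unknown method: {method!r}")
```

Given the sequence [(1, 0.60), (2, 0.72), (3, 0.80), (5, 0.80), (8, 0.78)], BEST returns (3, 0.80) because the 3-leaf and 5-leaf subtrees tie on assessment and the smaller one wins.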
TARGET Statement
TARGET variable < / options > ;
The TARGET statement names the variable the model tries to predict. The
“INPUT Statement” section (page 25) describes the LEVEL and ORDER options more
completely.
LEVEL= INTERVAL | ORDINAL | NOMINAL | BINARY
specifies the level of measurement of the target variable. The default is INTERVAL
for a numeric variable, NOMINAL for a character variable.
ORDER= ASCENDING
ORDER= ASCFORMATTED
ORDER= DESCENDING
ORDER= DESFORMATTED
ORDER= DSORDER
specifies the ordering of the values of an ordinal target variable. The ORDER= option
is available only when LEVEL=ORDINAL is specified, and has no effect on a target
variable that has only two values. The option is the same as the ORDER= option in
the INPUT statement.
TRAIN Statement
TRAIN < / options > ;
The TRAIN statement grows the tree by searching for splitting rules in leaves,
applying the rules to create branches, and repeating the process in the newly formed
leaves. Most options remain in effect for subsequent SEARCH, SPLIT, and TRAIN
statements. The exceptions are MAXNEWDEPTH= and NODES=.
ALPHA=p
specifies a threshold p-value for the significance level of a candidate splitting
rule. The option applies to splitting criteria that depend on p-values, namely
CRITERION=PROBF and CRITERION=PROBCHISQ. The default value of p is 0.20. For
splitting criteria not based on p-values, the ARBORETUM procedure uses the value
associated with the MINWORTH= option instead of p.
EXHAUSTIVE=n
specifies the maximum number of splits allowed in a complete enumeration of all
possible splits. The exhaustive method of searching for a split examines all possible
splits. If the number of possible splits is greater than n, then a heuristic search
is done instead of an exhaustive search. The exhaustive and heuristic search methods
only apply to multiway splits, and to binary splits on nominal targets with more than
two values. See the “Split Search Algorithm” section on page 47 for a more complete
description. The default value of n is 5,000.
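The threshold test can be sketched in Python as follows. How the procedure counts possible splits is not stated in this section, so the counting helper is an illustrative assumption only: it counts the ways to divide k distinct categories into two nonempty groups. Both function names are hypothetical.

```python
def choose_search_method(n_possible_splits, exhaustive=5000):
    """Return which search applies for a given number of possible splits."""
    # A heuristic search replaces enumeration only when the count exceeds n.
    return "heuristic" if n_possible_splits > exhaustive else "exhaustive"


def count_binary_partitions(k):
    """Ways to divide k distinct categories into two nonempty groups.

    Illustrative only; not necessarily how the procedure counts splits.
    """
    if k < 2:
        return 0
    return 2 ** (k - 1) - 1
```

Under this illustrative count, 13 categories give 4,095 binary partitions, which is within the default limit, while 14 categories give 8,191, which exceeds it.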
INTERVALBINS=n
indirectly specifies the minimum allowable width between two successive candidate
split points on an interval input. The width equals (max(x) − min(x))/(n + 1),
where max(x) and min(x) are the maximum and minimum of the input variable
values in the within-node sample being searched. The width is computed separately
for each input and each node. The INTERVALBINS= option may indirectly modify
p-value adjustments. The search algorithm ignores the INTERVALBINS= option
if the number of distinct input values in the node sample is less than the number
specified in the SEARCHBINS= option. The default value of n is 100.
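The width formula above can be checked with a short Python sketch. The function and argument names are hypothetical; values stands for the input values in the within-node sample being searched.

```python
def min_split_width(values, intervalbins=100):
    """Minimum width between successive candidate split points.

    Implements (max(x) - min(x)) / (n + 1) for one input in one node.
    """
    return (max(values) - min(values)) / (intervalbins + 1)
```

For input values that range from 0 to 101 and the default n of 100, the width is 101 / 101 = 1.0.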
LEAFSIZE=n
specifies the smallest number of training observations a new branch may have. The
default value is the number of observations in the training data divided by 1,000,
but no less than 5 and no more than 5,000.
The number n applies to the within-node training sample used during the split
search, described in the “Within Node Training Sample” section on page 46. The
LEAFSIZE= option does not use the values of the variable in the FREQ statement to
adjust the count of observations in the leaf.
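The default for LEAFSIZE= amounts to clamping a quotient between 5 and 5,000, which the following Python sketch illustrates. The function name is hypothetical, and rounding the quotient down to an integer is an assumption, because the text does not say how a fractional result is handled.

```python
def default_leafsize(n_train_obs):
    """Default LEAFSIZE=: training observations / 1,000, kept within 5..5,000.

    Integer (floor) division is an assumption; the rounding rule is not
    stated in the text.
    """
    return max(5, min(5000, n_train_obs // 1000))
```

For example, 2,000 training observations give the floor value of 5, 100,000 observations give 100, and 10,000,000 observations give the cap of 5,000.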