PROC ARBORETUM Statement
statistics and predictions in the saved tree.
DECSEARCH
specifies that the split search should incorporate the profit or loss function specified
in the DECISION statement. See the “Incorporating Decisions, Profit, and Loss”
section on page 38 for more information. The DECSEARCH option works only with
a categorical target.
INMODEL= SAS-data-set
names a data set created from the SAVE MODEL= option, or saved from the
Enterprise Miner Tree Desktop Application. When using the INMODEL option, the
INPUT, TARGET, FREQ and DECISION statements are prohibited.
Beginning with SAS 9.1, the MODEL= data set contains the names of the training and
validation data sets. The DATA= option is therefore unnecessary when resuming training
with the same data that was used to create the saved tree (assuming the saved name of
the training data is still valid).
MISSING= policy
specifies how a splitting rule handles an observation with missing values. Table 3
lists the available policies.
Table 3. Missing Value Policies

Policy          Description
BIGBRANCH       assign the observation to the largest branch
DISTRIBUTE      assign the observation to each branch with a fractional frequency
                proportional to the number of training observations in the branch
SMALLRESIDUAL   assign to the branch minimizing SSE among observations with missing
                values
USEINSEARCH     use missing values during the split search (default)
The default policy is USEINSEARCH. The MISSING= option in the INPUT statement
assigns a policy to the variables listed in that statement, and supersedes
the MISSING= option in the PROC ARBORETUM statement. See the “INPUT Statement”
section on page 25.
If a surrogate rule can assign an observation to a branch, then it does, and the missing
value policy is ignored for that observation. Using the CODE statement for
a tree containing a rule with MISSING=DISTRIBUTE is an error. See the “Missing Values”
section on page 45 for a complete description of the missing value options.
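
The DISTRIBUTE policy in Table 3 amounts to simple proportional arithmetic. The following Python sketch is only an illustration of that arithmetic, not SAS syntax and not the procedure's implementation; the function name and the branch counts are hypothetical.

```python
def distribute_missing(branch_counts, frequency=1.0):
    """Split one observation's frequency across branches in proportion
    to the number of training observations in each branch."""
    total = sum(branch_counts)
    if total <= 0:
        raise ValueError("branches must contain training observations")
    return [frequency * count / total for count in branch_counts]


# Hypothetical split with 50, 30, and 20 training observations per branch.
shares = distribute_missing([50, 30, 20])
print(shares)
```

The fractional frequencies always sum to the original frequency of the observation, so the total weight of the data is unchanged.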
PADJUST= method1 <method2 <method3>>
names one or more methods for adjusting the p-values used with the PROBCHISQ
and PROBF criteria. The following methods are available.
CHAIDAFTER    applies a Bonferroni adjustment after the split is chosen.
CHAIDBEFORE   applies a Bonferroni adjustment before the split is chosen.
DEPTH         adjusts for the number of ancestor splits.
NOGABRIEL     suppresses an adjustment that sometimes overrides CHAID.
NONE          suppresses all adjustments.
Specifying both CHAIDAFTER and CHAIDBEFORE is an error. Specifying NONE
with any other method is an error. If the PADJUST= option is not specified, the
CHAIDBEFORE and DEPTH methods are used. The PADJUST= option is ignored
unless CRITERION=PROBCHISQ or CRITERION=PROBF is in effect. See the
“Adjusting p-Values for the Number of Input Values and Branches”
section on page 43 for more information.
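
The combination rules for PADJUST= can be summarized as a small validity check. The Python sketch below is an illustration only, not SAS syntax; the function name resolve_padjust is hypothetical, and the limit of three methods is an assumption read from the syntax line rather than a documented rule.

```python
VALID_METHODS = {"CHAIDAFTER", "CHAIDBEFORE", "DEPTH", "NOGABRIEL", "NONE"}
DEFAULT_METHODS = ("CHAIDBEFORE", "DEPTH")


def resolve_padjust(methods=None):
    """Return the set of adjustment methods in effect, or raise ValueError."""
    if not methods:
        # PADJUST= not specified: CHAIDBEFORE and DEPTH are used.
        return set(DEFAULT_METHODS)
    if len(methods) > 3:
        # Assumption: the syntax line shows at most three methods.
        raise ValueError("at most three methods may be listed")
    chosen = {m.upper() for m in methods}
    unknown = chosen - VALID_METHODS
    if unknown:
        raise ValueError(f"unknown method(s): {sorted(unknown)}")
    if {"CHAIDAFTER", "CHAIDBEFORE"} <= chosen:
        raise ValueError("CHAIDAFTER and CHAIDBEFORE cannot be combined")
    if "NONE" in chosen and len(chosen) > 1:
        raise ValueError("NONE cannot be combined with another method")
    return chosen


print(sorted(resolve_padjust()))
print(sorted(resolve_padjust(["chaidafter", "depth"])))
```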
requests that the prior probabilities defined in the DECISION statement be incorporated
in the split search criterion for a categorical target. See the “Incorporating Prior
Probabilities” section on page 37 for more information.
PVARS=n | ALL
specifies the number of input variables n to regard as independent when adjusting
p-values for the number of inputs. PVARS=ALL specifies all the input variables as
independent. When searching for a split, the ARBORETUM procedure ignores input
variables whose values are constant in the node being split, and ignores categorical
variables unless at least two values occur in more observations than specified in the
MINCATSIZE= option in the TRAIN statement. Consequently, the ARBORETUM
procedure may search for rules using only m ≤ N of the original N input variables.
The procedure will regard max((n/N)m, 1) of the m variables as independent. See
the “Adjusting p-Values for the Number of Input Variables” section on page 44 for
more detail. The default number n is 0, requesting no adjustment for the number of
inputs.
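
As a worked illustration of the expression max((n/N)m, 1), the following Python sketch evaluates it for hypothetical values of n, N, and m. It is not SAS syntax, and the function name is an assumption made for the example.

```python
def independent_inputs(n, N, m):
    """Number of the m searched variables regarded as independent,
    given PVARS=n and N original input variables."""
    if N <= 0 or not 0 <= m <= N:
        raise ValueError("require N > 0 and 0 <= m <= N")
    return max((n / N) * m, 1)


# Hypothetical node: 40 original inputs, 12 usable in this node, PVARS=10.
print(independent_inputs(n=10, N=40, m=12))
# A small n is floored at one variable.
print(independent_inputs(n=1, N=40, m=12))
```

With PVARS=ALL, n equals N, so the expression reduces to m whenever m is at least 1: every variable searched in the node is treated as independent.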
SPLITATDATUM
requests that a split on an interval input equal the value of the observation, if the
value is an integer, or slightly less than the value if the value is not an integer. The
alternative is to split halfway between two data values. The SPLITBETWEEN option
requests the alternative.
SPLITBETWEEN
requests that a split on an interval input be halfway between two data values. The
SPLITBETWEEN option is the default. The SPLITATDATUM option is an alternative.
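
The difference between the two options can be sketched in Python as follows. This is an illustration only, not SAS syntax. The text does not say how much less "slightly less" is, so the sketch assumes the next representable floating-point number below the value; that choice, and both function names, are assumptions. The math.nextafter function requires Python 3.9 or later.

```python
import math


def split_between(lower, upper):
    """SPLITBETWEEN: split halfway between two adjacent data values."""
    return (lower + upper) / 2.0


def split_at_datum(value):
    """SPLITATDATUM: split at the data value if it is an integer,
    otherwise slightly below it (assumed: next float toward -infinity)."""
    value = float(value)
    if value.is_integer():
        return value
    return math.nextafter(value, -math.inf)


print(split_between(2.0, 5.0))
print(split_at_datum(4))
print(split_at_datum(2.5) < 2.5)
```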
ASSESS Statement
ASSESS < options > ;
The ASSESS statement specifies a measure for evaluating trees, evaluates all subtrees
(with the original root), chooses a best one for each possible number of leaves, and
organizes the chosen ones in a sequence, beginning with the subtree consisting of the
root only, and ending with the largest tree consisting of all the nodes. (For assessment
measures LIFT and LIFTPROFIT, the subtrees are evaluated with measures ASE and
PROFIT, respectively. See the section “Tree Assessment and the Subtree Sequence”
on page 49.)
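
The construction of the subtree sequence can be sketched in Python as follows. This is an illustration of the selection logic only, not SAS syntax and not the procedure's algorithm for enumerating subtrees; the function name, the (leaves, score) representation, and the scores are hypothetical. Whether larger or smaller scores are better depends on the assessment measure, so the sketch takes that as a parameter.

```python
def subtree_sequence(candidates, larger_is_better=True):
    """Keep the best score for each number of leaves and return the
    results ordered from the root-only subtree to the largest tree.

    candidates: iterable of (n_leaves, score) pairs, one per subtree.
    """
    best = {}
    for n_leaves, score in candidates:
        current = best.get(n_leaves)
        if current is None:
            best[n_leaves] = score
        elif larger_is_better and score > current:
            best[n_leaves] = score
        elif not larger_is_better and score < current:
            best[n_leaves] = score
    return sorted(best.items())


# Hypothetical subtrees scored with a measure where smaller is better.
scores = [(1, 0.25), (2, 0.21), (2, 0.19), (3, 0.18), (3, 0.20)]
sequence = subtree_sequence(scores, larger_is_better=False)
print(sequence)
```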
The ARBORETUM procedure selects the best subtree in the sequence consistent with
the options in the ASSESS statement. A subsequent SUBTREE statement can change