The FREQ statement identifies a variable that contains the frequency of occurrence
of each observation. The ARBORETUM procedure treats each observation as if it
appears n times, where n is the value of the FREQ variable for the observation. The
value of n may be fractional to indicate partial observations. If the value of n is close
to zero, negative, or missing, the observation is ignored. When the FREQ statement
is not specified, each observation is assigned a frequency of 1.
The LEAFSIZE=, MINCATSIZE=, and SPLITSIZE= options in the TRAIN statement,
and the NODESIZE= option in the PERFORMANCE statement, ignore the FREQ
statement in that the options do not use the variable values to adjust the
specified number of observations.
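The frequency rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the procedure's implementation: the function names are hypothetical, and the cutoff `eps` for "close to zero" is an assumption, since the text does not state one.

```python
import math


def effective_frequency(freq, eps=1e-12):
    """Return the frequency to use for an observation, or None if it is ignored.

    eps is a hypothetical cutoff for "close to zero"; the text gives no value.
    """
    if freq is None or math.isnan(freq) or freq <= eps:
        return None  # missing, negative, or near-zero frequency: skip the observation
    return float(freq)  # fractional frequencies are allowed


def weighted_class_counts(targets, freqs=None):
    """Sum frequencies per target value; with no FREQ variable each row counts as 1."""
    if freqs is None:
        freqs = [1.0] * len(targets)
    counts = {}
    for target, freq in zip(targets, freqs):
        n = effective_frequency(freq)
        if n is None:
            continue
        counts[target] = counts.get(target, 0.0) + n
    return counts


# Rows 3 to 5 are ignored: negative, missing, and NaN frequencies.
counts = weighted_class_counts(
    ["a", "b", "a", "b", "a"], [2, 0.5, -1, None, float("nan")]
)
print(counts)
```

Note that size thresholds such as LEAFSIZE= would, per the text, still be compared against plain row counts rather than these frequency-weighted sums.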
INPUT variables < / options > ;
The INPUT statement names input variables with common options. The INPUT
statement may be repeated.
LEVEL= INTERVAL | NOMINAL | ORDINAL
specifies the level of measurement, as defined in the section on page 6. The
default level is INTERVAL for a numeric variable, NOMINAL for a character variable.
MISSING= policy
specifies the missing value policy for the inputs. The option is the same as the
MISSING= option in the PROC ARBORETUM statement, except that it applies only
to the variables in the INPUT statement. If the option is omitted, the policy
specified in the MISSING= option in the PROC ARBORETUM statement applies.
ORDER= ASCENDING | ASCFORMATTED | DESCENDING | DESFORMATTED | DSORDER
specifies the sorting order of the values of an ordinal input variable. The ORDER=
option is only available when LEVEL=ORDINAL is specified. The following table
shows how PROC ARBORETUM interprets values of the ORDER= option.
Value of ORDER=    Variable Values Sorted By
ASCENDING          ascending order of unformatted values (default)
ASCFORMATTED       ascending order of formatted values
DESCENDING         descending order of unformatted values
DESFORMATTED       descending order of formatted values
DSORDER            order of appearance in the input data set
which no explicit format is declared, the ordering often deviates from the numeric
one, and is consequently unexpected and undesired. ORDER=ASCENDING and
ORDER=DESCENDING order the values of numeric variables by their numeric values.
When ORDER=ASCENDING or DESCENDING, and more than one unformatted
value has the same formatted value, the ARBORETUM procedure uses the smallest
unformatted value (with the same formatted value) to determine the ordering of the
formatted values. A splitting rule on an ordinal input assigns a range of formatted
values to a branch. The range will correspond to a range of unformatted values if all
unformatted values with the same formatted value define an interval that contains no
The sorting order of character values, including formatted values, may be machine
dependent. For more information on sorting order, see the chapter on the SORT
procedure in the SAS Procedures Guide. The default sorting order is ASCENDING.
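The rule for ordering formatted values by their smallest unformatted value can be illustrated with a short Python sketch. The function names and the `age_band` format are hypothetical and stand in for a user-defined SAS format; this shows the ordering rule as described in the text, not the procedure's actual code.

```python
def order_formatted_values(unformatted_values, fmt, descending=False):
    """Order formatted values by the smallest unformatted value that maps to each."""
    smallest = {}
    for value in unformatted_values:
        label = fmt(value)
        if label not in smallest or value < smallest[label]:
            smallest[label] = value
    # Sort the formatted values using their smallest unformatted value as the key.
    return sorted(smallest, key=smallest.get, reverse=descending)


def age_band(age):
    """Hypothetical format that groups several ages under one formatted value."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


ages = [72, 18, 45, 25, 61, 33]
ascending = order_formatted_values(ages, age_band)
descending = order_formatted_values(ages, age_band, descending=True)
print(ascending, descending)
```

In this example the resulting order follows the underlying ages rather than the alphabetical order of the labels, which is the kind of difference the formatted orderings would introduce.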
SPLITATDATUM
requests that a split on an interval input equal the value of the observation, if the
value is an integer, or slightly less than the value if the value is not an integer. The
alternative is to split an interval variable halfway between two values.
SPLITBETWEEN
requests that a split on an interval input be halfway between two data values. The
SPLITBETWEEN option is the default, unless the SPLITATDATUM option is specified
in the PROC ARBORETUM statement. The SPLITATDATUM option is the alternative.
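The two ways of placing a split value on an interval input can be compared with a small Python sketch. The function names are hypothetical, and the size of "slightly less" is an assumption (a relative offset of `1e-9`), because the text does not say how far below a non-integer value the split is placed.

```python
# Hypothetical relative offset used to land "slightly less" than a non-integer datum.
SLIGHTLY = 1e-9


def split_at_datum(value):
    """Split on the observed value if it is an integer, else slightly below it."""
    value = float(value)
    if value.is_integer():
        return value
    return value - max(abs(value), 1.0) * SLIGHTLY


def split_between(lower, upper):
    """Split halfway between two adjacent data values."""
    return (lower + upper) / 2.0


print(split_at_datum(3))      # integer datum: the split equals the datum
print(split_at_datum(2.5))    # non-integer datum: the split is just below 2.5
print(split_between(2, 3))    # midpoint of the two data values
```

A split placed at a datum is expressed in terms of a value that occurs in the data, whereas a midpoint split generally falls on a value that does not occur in the data.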
INTERACT PRUNED | LARGEST | NLEAVES=nleaves ;
The INTERACT statement declares the start of interactive training statements. If
more than one node exists in the largest subtree, then one of the options PRUNED,
LARGEST, or NLEAVES= must appear to specify which subtree to use. Nodes not
in the specified subtree are permanently deleted. See the
“Tree Assessment and the
section beginning on page 49 for more information about the
The INTERACT statement is required before using any of the interactive statements:
BRANCH, PRUNE, REDO, SEARCH, SETRULE, SPLIT, TRAIN, or UNDO. If
an INTERACT statement appears before any assessment or output statement, then
no training is performed before the INTERACT statement executes, and no split is
created unless requested with an interactive statement (or unless a tree is already input
using the INMODEL= option to the PROC statement).
PRUNED
permanently deletes all nodes not in the selected subtree.
LARGEST
maintains all the nodes in the largest tree.
NLEAVES=nleaves
selects the subtree with nleaves leaves.
The MAKEMACRO statement specifies the name of a macro variable to contain the
number of leaves in the current subtree.
PERFORMANCE < options > ;
The PERFORMANCE statement specifies options affecting the speed of computations
with little or no impact on the results. See the section on page 53 for more
information.
DISK | MULTIPASS | RAM
specifies where to put the working copy of the training data. The RAM option
requests that the working copy be stored in memory if enough memory is available for
it and still allow for a single split search in one pass of the data. The DISK option
requests that the working copy be stored in a disk utility file. Storing the copy on disk
may free a considerable amount of memory for calculations, possibly speeding up
the program. The MULTIPASS option requests that the training data be read multiple
times instead of copying it to memory or a disk utility file. MULTIPASS is slower
than DISK because the DISK copy is converted to encodings directly usable in the
calculations. The MULTIPASS option is only preferable when the training data will
not fit in RAM or in a disk utility file.
specifies the maximum amount of memory to allocate for the computations and the
working copy of the training data if the data is stored in memory. The default value
depends on the computer and may considerably prolong the execution time if SAS
cannot distinguish physical memory from virtual memory.
The SAS MEMSIZE system option sets an upper limit to bytes.
NODESIZE= n | ALL
specifies the number of training observations to use when searching for a splitting
rule. NODESIZE=ALL requests the use of all the observations. For larger data sets,
using a large within-node sample may require more passes of the data, resulting in a
longer running time. See the section on page 53 for more information.
The procedure counts the number of training observations in a node without adjusting
the number with the values of the variable specified in the FREQ statement. If the
count is larger than n, then the split search for that node is based on a random sample
of size n. For categorical targets, the sample uses as many observations with less
frequent target values as possible. The acceptable range is from two to two billion on