FREQ Statement
FREQ variable ;
The FREQ statement identifies a variable that contains the frequency of occurrence
of each observation. The ARBORETUM procedure treats each observation as if it
appears n times, where n is the value of the FREQ variable for the observation. The
value of n may be fractional to indicate partial observations. If the value of n is close
to zero, negative, or missing, the observation is ignored. When the FREQ statement
is not specified, each observation is assigned a frequency of 1.
The LEAFSIZE=, MINCATSIZE=, and SPLITSIZE= options in the TRAIN statement,
and the NODESIZE= option in the PERFORMANCE statement, ignore the FREQ
statement in that the options do not use the variable values to adjust the
specified number of observations.
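The frequency rule described above can be illustrated with a small sketch. This is not the procedure's implementation: the function name and the `eps` cutoff are hypothetical, since the text does not say how close to zero a frequency must be before the observation is ignored.

```python
import math

def usable_frequencies(freqs, eps=1e-8):
    """Keep only the frequencies that would contribute to the analysis.

    eps is a hypothetical cutoff standing in for "close to zero".
    """
    kept = []
    for f in freqs:
        # A missing frequency (None or NaN here) drops the observation.
        if f is None or math.isnan(f):
            continue
        # Negative or near-zero frequencies drop the observation too.
        if f <= eps:
            continue
        kept.append(f)
    return kept

freqs = [2, 0.5, 0, -1, None, float("nan")]
kept = usable_frequencies(freqs)
# Fractional frequencies count as partial observations in the total.
weighted_total = sum(kept)
```

Here only the first two observations survive, and together they count as 2.5 observations. The size options named above would not apply this weighting to their thresholds.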
INPUT Statement
INPUT variables < / options > ;
The INPUT statement names input variables with common options. The INPUT
statement may be repeated.
LEVEL= INTERVAL | NOMINAL | ORDINAL
specifies the level of measurement, as defined in the “Terminology” section on
page 6. The default level is INTERVAL for a numeric variable and NOMINAL for a
character variable.
MISSING= policy
specifies the missing value policy for the inputs. The option is the same as the
MISSING= option in the PROC ARBORETUM statement, except that it only applies
to the variables in the INPUT statement. If the option is omitted, the policy
specified in the MISSING= option in the PROC ARBORETUM statement applies to
these variables.
ORDER= ASCENDING
ORDER= ASCFORMATTED
ORDER= DESCENDING
ORDER= DESFORMATTED
ORDER= DSORDER
specifies the sorting order of the values of an ordinal input variable. The ORDER=
option is only available when LEVEL=ORDINAL is specified. The following table
shows how PROC ARBORETUM interprets values of the ORDER= option.
Value of ORDER=    Variable Values Sorted By
ASCENDING          ascending order of unformatted values (default)
ASCFORMATTED       ascending order of formatted values
DESCENDING         descending order of unformatted values
DESFORMATTED       descending order of formatted values
DSORDER            order of appearance in the input data set
The “Terminology” section on page 6 discusses formatted values. When
ORDER=ASCFORMATTED or DESFORMATTED is specified for numeric input variables
for which no explicit format is declared, the ordering often deviates from the
numeric one, and is consequently unexpected and undesired. ORDER=ASCENDING and
DESCENDING order the values of numeric variables by their numeric values.
When ORDER=ASCENDING or DESCENDING is specified, and more than one unformatted
value has the same formatted value, the ARBORETUM procedure uses the smallest
unformatted value (with the same formatted value) to determine the ordering of the
formatted values. A splitting rule on an ordinal input assigns a range of formatted
values to a branch. The range will correspond to a range of unformatted values if all
unformatted values with the same formatted value define an interval that contains no
other values.
The sorting order of character values, including formatted values, may be machine
dependent. For more information on sorting order, see the chapter on the SORT
procedure in the SAS Procedures Guide. The default sorting order is ASCENDING.
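The rule for ordering formatted values by their smallest unformatted value can be sketched as follows. This is only an illustration of the rule as stated; the function name and the sample data are hypothetical, and the real procedure works on SAS formats rather than Python tuples.

```python
def order_formatted_values(pairs, descending=False):
    """Order formatted values by the smallest unformatted value behind each.

    pairs is an iterable of (unformatted_value, formatted_value) tuples.
    """
    smallest = {}
    for raw, fmt in pairs:
        # Remember the smallest unformatted value seen for each formatted value.
        if fmt not in smallest or raw < smallest[fmt]:
            smallest[fmt] = raw
    return sorted(smallest, key=smallest.get, reverse=descending)

# Hypothetical data: several unformatted values share one formatted value.
pairs = [(1, "low"), (2, "low"), (3, "mid"), (7, "high"), (5, "mid")]
ascending = order_formatted_values(pairs)
descending = order_formatted_values(pairs, descending=True)
# Sorting the formatted strings themselves gives a different order.
by_formatted = sorted({fmt for _, fmt in pairs})
```

With this data, ordering by unformatted values yields low, mid, high, whereas sorting the formatted strings alphabetically yields high, low, mid. That difference is what separates ASCENDING from ASCFORMATTED.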
SPLITATDATUM
requests that a split on an interval input equal the value of the observation, if the
value is an integer, or slightly less than the value if the value is not an integer. The
alternative is to split an interval variable halfway between two values.
SPLITBETWEEN
requests that a split on an interval input be halfway between two data values. The
SPLITBETWEEN option is the default, unless the SPLITATDATUM option is specified
in the PROC ARBORETUM statement. The SPLITATDATUM option is the alternative.
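The two split placements can be compared in a short sketch. The function names are hypothetical, and "slightly less" is modeled here as the next representable floating-point value below the datum, which is an assumption; the text does not say how much less the procedure actually uses. `math.nextafter` requires Python 3.9 or later.

```python
import math

def split_between(lower, upper):
    """SPLITBETWEEN: place the split halfway between two data values."""
    return (lower + upper) / 2.0

def split_at_datum(value):
    """SPLITATDATUM: split at the value itself if it is an integer,
    otherwise slightly below it (assumed: the next float toward -infinity)."""
    value = float(value)
    if value.is_integer():
        return value
    return math.nextafter(value, -math.inf)

halfway = split_between(2.0, 3.0)       # 2.5
at_integer = split_at_datum(3.0)        # 3.0
below_fraction = split_at_datum(3.5)    # just under 3.5
```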
INTERACT Statement
INTERACT PRUNED | LARGEST | NLEAVES= nleaves ;
The INTERACT statement declares the start of interactive training statements. If
more than one node exists in the largest subtree, then one of the options PRUNED,
LARGEST, or NLEAVES= must appear to specify which subtree to use. Nodes not
in the specified subtree are permanently deleted. See the “Tree Assessment and the
Subtree Sequence” section beginning on page 49 for more information about the
subtree sequence.
The INTERACT statement is required before using any of the interactive statements:
BRANCH, PRUNE, REDO, SEARCH, SETRULE, SPLIT, TRAIN, or UNDO. If
an INTERACT statement appears before any assessment or output statement, then
no training is performed before the INTERACT statement executes, and no split is
created unless requested with an interactive statement (or unless a tree is already input
using the INMODEL= option to the PROC statement).
PRUNED
permanently deletes all nodes not in the selected subtree.
LARGEST
maintains all the nodes in the largest tree.
NLEAVES= nleaves
selects the subtree with nleaves leaves.
MAKEMACRO Statement
MAKEMACRO NLEAVES= macname ;
The MAKEMACRO statement specifies the name of a macro variable to contain the
number of leaves in the current subtree.
PERFORMANCE Statement
PERFORMANCE < options > ;
The PERFORMANCE statement specifies options affecting the speed of computations
with little or no impact on the results. See the “Performance Considerations”
section on page 53 for more information.
DISK | MULTIPASS | RAM
specifies where to put the working copy of the training data. The RAM option
requests that the working copy be stored in memory if enough memory is available for
it and still allow for a single split search in one pass of the data. The DISK option
requests that the working copy be stored in a disk utility file. Storing the copy on disk
may free a considerable amount of memory for calculations, possibly speeding up
the program. The MULTIPASS option requests that the training data be read multiple
times instead of copying it to memory or a disk utility file. MULTIPASS is slower
than DISK because the DISK copy is converted to encodings directly usable in the
calculations. The MULTIPASS option is only preferable when the training data will
not fit in RAM or in a disk utility file.
MEMSIZE= bytes
specifies the maximum amount of memory to allocate for the computations and the
working copy of the training data if the data is stored in memory. The default value
depends on the computer and may considerably prolong the execution time if SAS
cannot distinguish physical memory from virtual memory.
The SAS MEMSIZE system option sets an upper limit to bytes.
NODESIZE=n | ALL
specifies the number of training observations to use when searching for a splitting
rule. NODESIZE=ALL requests that all the observations be used. For larger data sets,
using a large within-node sample may require more passes of the data, resulting in a
longer running time. See the “Performance Considerations” section on page 53 for
more detail.
The procedure counts the number of training observations in a node without adjusting
the number with the values of the variable specified in the FREQ statement. If the
count is larger than n, then the split search for that node is based on a random sample
of size n. For categorical targets, the sample uses as many observations with less
frequent target values as possible. The acceptable range is from two to two billion on
most machines.
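One way to picture the within-node sampling is the sketch below. It is an assumption-laden illustration, not the procedure's algorithm: the function name, the seed parameter, and the greedy "rarest class first" strategy are all hypothetical readings of the statement that the sample uses as many observations with less frequent target values as possible.

```python
import random
from collections import Counter

def node_sample(targets, n, seed=0):
    """Return sorted indices of a within-node sample of at most n observations.

    targets holds the categorical target value of each observation in the
    node. The count is unweighted: no FREQ adjustment is applied.
    """
    if len(targets) <= n:
        # The node is small enough: use every observation.
        return list(range(len(targets)))
    rng = random.Random(seed)
    counts = Counter(targets)
    by_class = {}
    for i, t in enumerate(targets):
        by_class.setdefault(t, []).append(i)
    chosen = []
    # Assumed strategy: fill the sample starting with the rarest target value.
    for cls in sorted(counts, key=counts.get):
        room = n - len(chosen)
        if room <= 0:
            break
        members = by_class[cls]
        if len(members) <= room:
            chosen.extend(members)
        else:
            chosen.extend(rng.sample(members, room))
    return sorted(chosen)

# Hypothetical node: 8 observations of "a" and 2 of the rarer "b".
targets = ["a"] * 8 + ["b"] * 2
sample = node_sample(targets, 5)
```

In this example both observations with the rarer value "b" are kept, and the remaining three slots are filled at random from the "a" observations.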