The arboretum procedure




FREQ Statement

FREQ variable ;

The FREQ statement identifies a variable that contains the frequency of occurrence of each observation. The ARBORETUM procedure treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. The value of n may be fractional to indicate partial observations. If the value of n is close to zero, negative, or missing, the observation is ignored. When the FREQ statement is not specified, each observation is assigned a frequency of 1.
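The frequency rule above can be illustrated with a small Python sketch. It is not the procedure's implementation: the function name `effective_count` and the threshold `eps` are hypothetical, because the text does not say how close to zero a frequency must be before the observation is ignored.

```python
import math

def effective_count(freqs, eps=1e-8):
    """Sum FREQ values, skipping observations that the text says are ignored.

    eps is an assumed cutoff for "close to zero"; the real cutoff is not documented here.
    """
    total = 0.0
    for n in freqs:
        # Missing values (None or NaN), negative values, and near-zero values are skipped.
        if n is None or math.isnan(n) or n < eps:
            continue
        total += n  # fractional frequencies count as partial observations
    return total

# Six observations, of which only the first two contribute.
print(effective_count([1, 2.5, 0, -3, None, float("nan")]))  # 3.5
```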

The LEAFSIZE=, MINCATSIZE=, and SPLITSIZE= options in the TRAIN statement, and the NODESIZE= option in the PERFORMANCE statement, ignore the FREQ statement: these options do not use the FREQ variable values to adjust the specified number of observations.

INPUT Statement

INPUT variables < / options > ;

The INPUT statement names input variables with common options. The INPUT statement may be repeated.

LEVEL= INTERVAL | NOMINAL | ORDINAL

specifies the level of measurement, as defined in the “Terminology” section on page 6. The default level is INTERVAL for a numeric variable and NOMINAL for a character variable.

MISSING= policy

specifies the missing value policy for the inputs. The option is the same as the MISSING= option in the PROC ARBORETUM statement, except that it applies only to the variables in the INPUT statement. If the option is omitted, the policy specified in the MISSING= option in the PROC ARBORETUM statement applies to these variables.

ORDER= ASCENDING

ORDER= ASCFORMATTED

ORDER= DESCENDING

ORDER= DESFORMATTED

ORDER= DSORDER

specifies the sorting order of the values of an ordinal input variable. The ORDER= option is only available when LEVEL=ORDINAL is specified. The following table shows how PROC ARBORETUM interprets values of the ORDER= option.

Value of ORDER=    Variable Values Sorted By
ASCENDING          ascending order of unformatted values (default)
ASCFORMATTED       ascending order of formatted values
DESCENDING         descending order of unformatted values
DESFORMATTED       descending order of formatted values
DSORDER            order of appearance in the input data set


The “Terminology” section on page 6 discusses formatted values. When ORDER=ASCFORMATTED or DESFORMATTED is specified for numeric input variables that have no explicitly declared format, the ordering often deviates from the numeric ordering and is consequently unexpected and undesired. ORDER=ASCENDING and DESCENDING order the values of numeric variables by their numeric values.
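To see why ordering by formatted values can deviate from the numeric ordering, consider this Python sketch. It assumes that formatted values are compared as left-aligned character strings and uses `str()` as a stand-in for a format; the formatting and collation that SAS actually applies may differ, so treat it only as an illustration of the general effect.

```python
def compare_orderings(values):
    """Return (numeric order, order of the same values sorted as text)."""
    numeric = sorted(values)
    # str() is only a stand-in for a SAS format; it yields left-aligned text.
    as_text = sorted(str(v) for v in values)
    return numeric, as_text

numeric, as_text = compare_orderings([2, 10, 100, 25])
print(numeric)   # [2, 10, 25, 100]
print(as_text)   # ['10', '100', '2', '25']
```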

When ORDER=ASCENDING or DESCENDING is specified and more than one unformatted value has the same formatted value, the ARBORETUM procedure uses the smallest unformatted value (with that formatted value) to determine the ordering of the formatted values. A splitting rule on an ordinal input assigns a range of formatted values to a branch. The range will correspond to a range of unformatted values if all unformatted values with the same formatted value define an interval that contains no other values.
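The ordering rule and the interval condition described above can be sketched in Python as follows. The function names `order_formatted` and `maps_to_intervals` and the sample data are hypothetical and are meant only to make the two statements concrete.

```python
def order_formatted(pairs):
    """Order formatted values by the smallest unformatted value that maps to each.

    pairs: iterable of (unformatted_value, formatted_value).
    """
    smallest = {}
    for raw, fmt in pairs:
        if fmt not in smallest or raw < smallest[fmt]:
            smallest[fmt] = raw
    return sorted(smallest, key=smallest.get)

def maps_to_intervals(pairs):
    """True if each formatted value covers an interval containing no other values."""
    seen = set()
    previous = None
    for raw, fmt in sorted(pairs):      # walk the unformatted values in ascending order
        if fmt != previous:
            if fmt in seen:             # a formatted value reappears after a gap
                return False
            seen.add(fmt)
            previous = fmt
    return True

pairs = [(1, "low"), (2, "low"), (3, "high"), (4, "high"), (5, "mid")]
print(order_formatted(pairs))                          # ['low', 'high', 'mid']
print(maps_to_intervals(pairs))                        # True
print(maps_to_intervals([(1, "A"), (2, "B"), (3, "A")]))  # False
```

In the last call, the values formatted as "A" span 1 to 3 but the value 2 is formatted as "B", so a range of formatted values would not correspond to a range of unformatted values.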

The sorting order of character values, including formatted values, may be machine dependent. For more information on sorting order, see the chapter on the SORT procedure in the SAS Procedures Guide. The default sorting order is ASCENDING.

SPLITATDATUM

requests that a split on an interval input equal the value of the observation, if the value is an integer, or slightly less than the value if the value is not an integer. The alternative is to split an interval variable halfway between two values.

SPLITBETWEEN

requests that a split on an interval input be halfway between two data values. The SPLITBETWEEN option is the default, unless the SPLITATDATUM option is specified in the PROC ARBORETUM statement. The SPLITATDATUM option is the alternative.
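The two ways of placing a split point can be compared with a short Python sketch. Several details are assumptions, because the text does not state them: that the "value of the observation" is the larger of two adjacent data values, and that "slightly less" can be modeled with a small relative offset (`rel_eps`, a hypothetical parameter).

```python
def split_between(lower, upper):
    """SPLITBETWEEN: place the split halfway between two adjacent data values."""
    return (lower + upper) / 2.0

def split_at_datum(upper, rel_eps=1e-9):
    """SPLITATDATUM, under the assumptions stated in the lead-in.

    An integer value is used as is; a non-integer value is nudged slightly lower.
    """
    if float(upper).is_integer():
        return float(upper)
    # rel_eps is an illustrative offset, not a documented quantity.
    return upper - rel_eps * max(1.0, abs(upper))

print(split_between(2, 3))         # 2.5
print(split_at_datum(3))           # 3.0
print(split_at_datum(2.5) < 2.5)   # True
```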

INTERACT Statement

INTERACT PRUNED | LARGEST | NLEAVES=nleaves ;

The INTERACT statement declares the start of interactive training statements. If more than one node exists in the largest subtree, then one of the options PRUNED, LARGEST, or NLEAVES= must appear to specify which subtree to use. Nodes not in the specified subtree are permanently deleted. See the “Tree Assessment and the Subtree Sequence” section beginning on page 49 for more information about the subtree sequence.

The INTERACT statement is required before using any of the interactive statements: BRANCH, PRUNE, REDO, SEARCH, SETRULE, SPLIT, TRAIN, or UNDO. If an INTERACT statement appears before any assessment or output statement, then no training is performed before the INTERACT statement executes, and no split is created unless requested with an interactive statement (or unless a tree is already input using the INMODEL= option to the PROC statement).

PRUNED

permanently deletes all nodes not in the selected subtree.

LARGEST

maintains all the nodes in the largest tree.


NLEAVES= nleaves

selects the subtree with nleaves leaves.

MAKEMACRO Statement

MAKEMACRO NLEAVES=macname ;

The MAKEMACRO statement specifies the name of a macro variable to contain the number of leaves in the current subtree.

PERFORMANCE Statement

PERFORMANCE < options > ;

The PERFORMANCE statement specifies options affecting the speed of computations with little or no impact on the results. See the “Performance Considerations” section on page 53 for more information.

DISK | MULTIPASS | RAM

specifies where to put the working copy of the training data. The RAM option requests that the working copy be stored in memory if enough memory is available to hold it and still allow a single split search in one pass of the data. The DISK option requests that the working copy be stored in a disk utility file. Storing the copy on disk may free a considerable amount of memory for calculations, possibly speeding up the program. The MULTIPASS option requests that the training data be read multiple times instead of being copied to memory or a disk utility file. MULTIPASS is slower than DISK because the DISK copy is converted to encodings directly usable in the calculations. The MULTIPASS option is preferable only when the training data will not fit in RAM or in a disk utility file.

MEMSIZE= bytes

specifies the maximum amount of memory to allocate for the computations and for the working copy of the training data if the data is stored in memory. The default value depends on the computer and may considerably prolong the execution time if SAS cannot distinguish physical memory from virtual memory.

The SAS MEMSIZE system option sets an upper limit on bytes.

NODESIZE=n | ALL

specifies the number of training observations to use when searching for a splitting rule. NODESIZE=ALL requests that all the observations be used. For larger data sets, using a large within-node sample may require more passes of the data, resulting in a longer running time. See the “Performance Considerations” section on page 53 for more detail.

The procedure counts the number of training observations in a node without adjusting the count by the values of the variable specified in the FREQ statement. If the count is larger than n, then the split search for that node is based on a random sample of size n. For categorical targets, the sample uses as many observations with less frequent target values as possible. The acceptable range is from two to two billion on most machines.
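One possible reading of the within-node sampling rule is sketched below in Python. The text does not describe the actual sampling scheme, so the greedy strategy here (fill the sample from the least frequent target values first) and the names `within_node_sample` and `seed` are assumptions made for illustration. Note that the count is unweighted, matching the statement that FREQ values are not used.

```python
import random
from collections import Counter

def within_node_sample(targets, n, seed=0):
    """Return sorted indices of a within-node sample of at most n observations.

    If the node holds n or fewer observations, all of them are used. Otherwise the
    sample is filled from the least frequent target values first (an assumed scheme).
    """
    if len(targets) <= n:
        return list(range(len(targets)))
    rng = random.Random(seed)
    counts = Counter(targets)
    by_class = {}
    for i, t in enumerate(targets):
        by_class.setdefault(t, []).append(i)
    chosen = []
    # Visit target values from least to most frequent.
    for t in sorted(counts, key=lambda k: counts[k]):
        room = n - len(chosen)
        if room <= 0:
            break
        idx = by_class[t]
        # Take the whole class if it fits, otherwise a random subset of it.
        chosen.extend(idx if len(idx) <= room else rng.sample(idx, room))
    return sorted(chosen)

targets = ["a"] * 90 + ["b"] * 10
sample = within_node_sample(targets, 20)
print(len(sample))                               # 20
print(sum(targets[i] == "b" for i in sample))    # 10
```

With 90 observations of "a" and 10 of "b", a sample of size 20 keeps all 10 of the rarer "b" observations and draws the remaining 10 at random from "a".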
