The arboretum procedure


The SPLIT Procedure

PRUNE Statement

Deletes all nodes descended from any specified node.

Interaction: The PRUNE statement requires an INTREE= or INDMSPLIT procedure option.

PRUNE list-of-node-identifiers;

Required Argument

list-of-node-identifiers

Specifies the nodes that will have no children.

Range: Integer > 0
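To illustrate what pruning means, here is a minimal Python sketch of removing every descendant of the listed nodes. It does not reproduce the procedure's internals: the dictionary layout (node identifier mapped to a list of child identifiers) and the function name `prune` are assumptions made for this example only.

```python
def prune(children, node_ids):
    """Return a copy of the tree with every descendant of the given nodes removed.

    `children` maps a node id to the list of its child ids (hypothetical layout).
    """
    pruned = {node: list(kids) for node, kids in children.items()}
    for node_id in node_ids:
        if node_id not in pruned:
            continue  # unknown id, or already removed under another pruned node
        stack = list(pruned[node_id])
        pruned[node_id] = []  # the specified node now has no children
        while stack:
            current = stack.pop()
            # Remove the node and queue its own children for removal.
            stack.extend(pruned.pop(current, []))
    return pruned


# Hypothetical seven-node tree; node 2 is pruned back to a leaf.
tree = {1: [2, 3], 2: [4, 5], 3: [], 4: [6, 7], 5: [], 6: [], 7: []}
pruned_tree = prune(tree, [2])
print(pruned_tree)
```

Node 2 itself survives; only the nodes below it are deleted, which matches the description of the argument as the nodes that will have no children.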

SCORE Statement

Specifies that the data be scored.

SCORE DATA=SAS-data-set OUT=SAS-data-set <score-option(s)> <NODES=node-list>;

Required Arguments

DATA=SAS-data-set

Specifies input data that contains inputs and, optionally, targets.

OUT=SAS-data-set

Specifies the output data set that contains the scoring outputs.

Options

DUMMY

Includes a dummy variable for each node. For each observation, the value of a node's dummy variable is 1 if the observation appears in that node and 0 if it does not.

NODES=node-list

Specifies a list of nodes used to score the observations. If an observation does not fall into any node in the list, it does not contribute to the statistics and is not output. If an observation occurs in more than one listed node, it contributes multiple times to the statistics and is output once for each node it occurs in.

Interaction: The NODES= option requires the INTREE= or INDMSPLIT procedure option.

Default: The list of leaf nodes. Omitting the NODES= option results in the decisions, utilities, and leaf assignment being output for each observation in the DATA= data set.

NOLEAFID

Does not include leaf identifiers or node numbers.

NOPRED

Does not include predicted values.

OUTFIT=SAS-data-set

Specifies the output data set that contains fit statistics.

ROLE=role-value

Specifies the role of the DATA= data set. The ROLE= option primarily affects which fit statistics are computed and what their names and labels are. The role-value can be one of the following:

TRAIN

The default when the DATA= data set name in the PROC statement is the same as the DATA= data set name in the SCORE statement.

VALID | VALIDATION

The default when the DATA= data set name in the SCORE statement is the same as the data set name in the VALIDATA= option in the PROC statement.

TEST

The default when the DATA= data set name in the SCORE statement is not the same as the data set name in the DATA= or VALIDATA= option in the PROC statement.

SCORE

Residuals, computed profit, and fit statistics are not produced.
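Two of the rules in the SCORE options, the NODES= output behavior and the ROLE= defaults, can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's implementation: the function names, the `memberships` mapping, and the example data set names are all hypothetical. The case-insensitive comparison reflects how SAS treats data set names.

```python
def default_role(score_data, proc_data, proc_validata=None):
    """Pick the default ROLE= value from the data set names."""
    name = score_data.upper()  # SAS data set names are case-insensitive
    if name == proc_data.upper():
        return "TRAIN"
    if proc_validata is not None and name == proc_validata.upper():
        return "VALID"
    return "TEST"


def score_rows(memberships, nodes):
    """Emit one output row per (observation, listed node) pair.

    `memberships` maps an observation id to the set of node ids the
    observation falls into on its way from the root to a leaf
    (hypothetical representation).
    """
    rows = []
    for obs_id, visited in memberships.items():
        for node in nodes:
            if node in visited:
                rows.append((obs_id, node))
    return rows


# Hypothetical data set names.
print(default_role("work.train", "WORK.TRAIN", "WORK.VALID"))
print(default_role("work.holdout", "WORK.TRAIN", "WORK.VALID"))

# Hypothetical memberships: "a" reaches leaf 4 through node 2, "b" stops at 3.
memberships = {"a": {1, 2, 4}, "b": {1, 3}, "c": {1, 2, 5}}
rows = score_rows(memberships, nodes=[2, 4])
print(rows)
```

With `nodes=[2, 4]`, observation "a" is output twice because it occurs in both listed nodes, "c" is output once, and "b" is not output at all.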

TARGET Statement

Specifies an output variable.

TARGET variable < / LEVEL=measurement>;

Required Argument

variable

Specifies the variable that the model-fitting tries to predict.

Options

LEVEL=measurement

Specifies the measurement level, where measurement can be:

BINARY

NOMINAL

ORDINAL

INTERVAL

Default: LEVEL=INTERVAL

Details

Missing Values

Observations in which the target value is missing are ignored when training or validating the tree.

If EXCLUDEMISS is specified, then observations with missing values are excluded during the search for a splitting rule. Each search uses only one variable, so only the observations that are missing the single candidate input are excluded. An observation that is missing input x but not input y is used in the search for a split on y but not in the search for a split on x. After a split is chosen, the rule is amended to assign missing values to the largest branch.
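The per-input exclusion described above can be sketched in Python. This is a minimal illustration, not the procedure's code: the row layout, the function names, and the reading of "largest branch" as the branch with the most observations are assumptions made for the example.

```python
def rows_for_search(rows, candidate):
    """Keep the observations whose value on the one candidate input is present."""
    return [row for row in rows if row.get(candidate) is not None]


def branch_for_missing(branch_counts):
    """Return the branch holding the most observations (assumed meaning of 'largest')."""
    return max(branch_counts, key=branch_counts.get)


# Hypothetical observations; None marks a missing input value.
rows = [
    {"x": 1.0, "y": 5.0},
    {"x": None, "y": 2.0},
    {"x": 3.0, "y": None},
]
used_for_x = rows_for_search(rows, "x")  # drops only the row missing x
used_for_y = rows_for_search(rows, "y")  # drops only the row missing y

# After a split is chosen, missing values go to the largest branch.
missing_goes_to = branch_for_missing({"left": 40, "right": 60})
print(len(used_for_x), len(used_for_y), missing_goes_to)
```

The second row is excluded from the search on x yet still takes part in the search on y, which is the point of excluding on the candidate input alone.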

If EXCLUDEMISS is not specified, the search for a split on an input treats missing values as a special, acceptable value, and includes them in the search. All observations with missing values are assigned to the same branch.

The branch may or may not contain other observations. The branch chosen is the one that maximizes the split worth.

For splits on a categorical variable, this amounts to treating a missing value as a separate category. For numerical variables, it amounts to treating missing values as having the same unknown non-missing value.
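For a numerical input, the idea of placing all missing values in whichever branch maximizes the split worth can be sketched as follows. This is a simplified illustration that fixes the threshold in advance and uses the reduction in the sum of squared errors as a stand-in for worth. Both choices are assumptions, as are the function names and the data. The source does not state which worth measure applies here.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def place_missing(pairs, threshold):
    """Try the missing group in each branch and keep the better placement.

    `pairs` is a list of (input_value, target) tuples; None marks a missing
    input. Worth is the reduction in SSE, an illustrative stand-in.
    """
    left = [t for x, t in pairs if x is not None and x < threshold]
    right = [t for x, t in pairs if x is not None and x >= threshold]
    missing = [t for x, t in pairs if x is None]
    total = sse(left + right + missing)
    worth = {
        "left": total - sse(left + missing) - sse(right),
        "right": total - sse(left) - sse(right + missing),
    }
    best = max(worth, key=worth.get)
    return best, worth[best]


# Hypothetical data: targets of the missing-input rows resemble the high-x rows.
pairs = [(1.0, 10.0), (2.0, 12.0), (8.0, 30.0), (9.0, 32.0), (None, 31.0), (None, 29.0)]
branch, worth = place_missing(pairs, threshold=5.0)
print(branch)
```

All missing observations move together, so the comparison only needs one candidate placement per branch.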

One advantage of using missing data during the search is that the worth of a split is computed with the same number of observations for each input. Another advantage is that an association of the missing values with the target values can contribute to the predictive ability of the split. One disadvantage is that missing values could unjustifiably dominate the choice of split.

When a split is applied to an observation in which the required input value is missing, surrogate splitting rules are considered before assigning the observation to the branch for missing values.

A surrogate splitting rule is a backup to the main splitting rule. For example, the main splitting rule might use county as the input and the surrogate might use region. If the county is unknown and the region is known, the surrogate is used.

If several surrogate rules exist, each surrogate is considered in sequence until one can be applied to the observation. If none can be applied, the main rule assigns the observation to the branch designated for missing values.
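The fallback sequence just described (main rule, then each surrogate in turn, then the branch for missing values) can be sketched in Python. The rule representation, the function name `assign_branch`, and the county and region values are hypothetical and chosen only to mirror the example in the text.

```python
def assign_branch(obs, main_rule, surrogates, missing_branch):
    """Apply the main rule, then each surrogate in order, then the missing branch.

    Each rule is an (input_name, function) pair; the function maps a
    non-missing input value to a branch label. `surrogates` is assumed to be
    already in the order in which they should be considered.
    """
    for input_name, rule in [main_rule] + list(surrogates):
        value = obs.get(input_name)
        if value is not None:
            return rule(value)  # first applicable rule decides
    return missing_branch


# Hypothetical rules: main rule on county, surrogate on region.
main = ("county", lambda c: "left" if c in {"Wake", "Durham"} else "right")
surrogate = ("region", lambda r: "left" if r == "East" else "right")

by_main = assign_branch({"county": "Wake", "region": "West"}, main, [surrogate], "right")
by_surrogate = assign_branch({"county": None, "region": "East"}, main, [surrogate], "right")
by_default = assign_branch({"county": None, "region": None}, main, [surrogate], "right")
print(by_main, by_surrogate, by_default)
```

The surrogate is consulted only when the main rule's input is missing, and the designated missing-value branch is used only when no rule can be applied.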

The surrogates are considered in the order of their agreement with the main splitting rule. The agreement of a surrogate is measured as the proportion of training observations that the surrogate and the main rule assign to the same branch. The measure excludes the observations to which the main rule cannot be applied. Among the remaining observations, those to which the surrogate rule cannot be applied count as observations not assigned to the same branch. Thus, an observation with a missing value on the input used in the surrogate rule but
