The arboretum procedure

Yüklə 3,07 Mb.

Pdf görüntüsü

səhifə	19/148
tarix	30.04.2018
ölçüsü	3,07 Mb.
	#40673

1 ... 15 16 17 18 19 20 21 22 ... 148

Split Search Algorithm

Split Search Algorithm

categories. J − J

rare

is the number of remaining categories. The sampling algorithm

selects s = (n−n

rare

)/(J −J

rare

) observations from each of the J −J

rare

remaining

categories. If s is not an integer, then the sample will contain one more observation

from some of J − J

rare

categories than others so that the total equals (n − n

rare

The sampling algorithm depends only on the order of the observations in the training

data and not on other random factors.

When the node is split into branches, all the observations are passed to the branches,

and new samples are created in each branch as needed.

Split Search Algorithm

The ARBORETUM procedure always selects a splitting rule with the largest worth

among all the splits evaluated. The algorithm is simple. The details, however, seem

complicated because of the many special situations.

The search for splitting and surrogate rules is done on the within-node training

sample.

The search involving a speciﬁc input variable eliminates observations

with a missing input value unless the MISSING= option for that input equals

USEINSEARCH. The search involving a speciﬁc categorical input variable elimi-

nates observations whose input value occurs less frequently in the within-node sample

than the threshold speciﬁed in the MINCATSIZE= option in the TRAIN statement.

The decision to eliminate an observation from the within-node sample is made in-

dependently for each input and for each node. An observation in which input W is

missing and input Z is not missing is eliminated from the search using W but not from

Z, unless MISSING=USEINSEARCH. An observation in which categorical input X

has a value that is common in the root node but occurs infrequently in some branch

is eliminated from searches in that branch but not the root node.

In most situations the algorithm sorts the observations. For interval and ordinal input

variables, the algorithm sorts the input values, and arbitrarily puts observations with

missing input values at the end. For nominal inputs with interval targets, the algo-

rithm sorts the input categories by the average target value among the observations in

the category. For nominal inputs and observations containing two categorical target

values in the search sample, the algorithm sorts the input categories by the proportion

of one of the target values. In the remaining situation, the algorithm does not sort.

The remaining situation consists of a nominal input and a categorical target with at

least three different values in the sample presented to the algorithm.

After the algorithm sorts, if it sorts at all, the algorithm passes over the data evaluating

every permissible binary split preserving the sort. A split between tied input values

or categories with the same average target value is not permitted, nor is a split that

leaves fewer observations in a branch than speciﬁed in the LEAFSIZE= option in the

TRAIN statement.

If the MAXBRANCH= option in the TRAIN statement speciﬁes binary splits, then

the search is ﬁnished; the best split evaluated is the best binary split. Otherwise

the algorithm consolidates the observations into N bins, where N is less than or

equal to the limit speciﬁed in the SEARCHBINS= option in the TRAIN statement.

The ARBORETUM Procedure

Observations collected into the same bin remain together during the search and are

assigned to the same branch.

The consolidation algorithm uses the best of the binary split points already found

to create the bins, subject to some constraints. If the input variable is interval or

ordinal and missing values appear in the sample (and are therefore acceptable values),

then one bin is always reserved exclusively for missing values. For an interval input

X, if the number of distinct values of X is greater than the number speciﬁed in the

INTERVALBINS=n option in the TRAIN statement, then the split points will be at

least (max(X)−min(X)/(n+1) apart, where max(X) and min(X) are the maximum

and minimum values of the input in the search sample. The algorithm makes as many

bins as possible that satisfy these constraints.

Next the algorithm computes the number of candidate splits, including m-ary splits,

where 2 <= m <= n. If the number does not exceed the threshold speciﬁed in the

EXHAUSTIVE= option in the TRAIN statement, then all candidate splits are evalu-

ated, and one with the largest measure of worth is chosen. Otherwise the algorithm

uses a merge-and-shufﬂe heuristic approach to select which splits to evaluate.

The merge-and-shufﬂe algorithm ﬁrst creates a branch for each bin. The algorithm

then merges a pair of branches, and then merges another pair, and so on until only

two branches remain. To choose a pair, the algorithm evaluates the worth of splitting

the merged candidate branch back into the original pair of branches, and selects a

pair that deﬁnes the splitting rule with the smallest worth. After each merge, the

algorithm reassigns a bin of observations to a different branch if the worth of the

splitting rule increases. The algorithm continues the reassignment of single bins as

long as the worth increases. The merge-and-shufﬂe algorithm evaluates many, but

not all, permissible splitting rules, and chooses the one with the largest worth. The

algorithm is heuristic because it does not guarantee that the best possible split is

found.

The previous paragraphs describe the search when the algorithm sorts the observa-

tions. The algorithm does not sort when the input variable is nominal and the target

variable is categorical with at least three categories in the sample being searched.

For this situation, if the number of input categories occurring in the search sample

is greater than the threshold speciﬁed in the SEARCHBINS= option, then categories

with the largest entropy of target values are consolidated into one bin, and the re-

maining N − 1 categories are assigned to the remaining N − 1 bins, one bin for each

category. The rest of the search is similar to the one for n-ary splits of sorted ob-

servations. If the number of candidate splits does not exceed the threshold speciﬁed

in the EXHAUSTIVE= option, then all candidate splits are evaluated. Otherwise the

algorithm uses the merge-and-shufﬂe heuristic approach.

Every rule includes an assignment of missing values to one or all branches, as de-

scribed in the

“Missing Values”

section on page 45.

Yüklə 3,07 Mb.

Dostları ilə paylaş:

1 ... 15 16 17 18 19 20 21 22 ... 148