The arboretum procedure

Yüklə 3,07 Mb.

Pdf görüntüsü

səhifə	126/148
tarix	30.04.2018
ölçüsü	3,07 Mb.
	#40673

1 ... 122 123 124 125 126 127 128 129 ... 148

METHOD OF SPLIT SEARCH

not the input used in the primary rule counts against the surrogate.

The NSURRS= procedure option determines the number of surrogates sought. A surrogate is discarded if

its agreement is 1/B, where B is the number of branches. As a consequence, a node might have fewer

surrogates than the number specified in the NSURRS= option.

METHOD OF SPLIT SEARCH

For a specific node and input, PROC SPLIT seeks the split with maximum worth or -log(p-value)

subject to the limit on the number of branches and the limit on the minimum number of observations

assigned to a branch. The procedure options MAXBRANCH= and LEAFSIZE= set these limits.

The measure of worth depends on the splitting criterion. The ENTROPY, GINI, and VARIANCE

reduction criteria measure worth as I(node) - sum over branches b of P(b) I(b), where I(node) denotes the

entropy, gini, or variance measure in the node, and P(b) denotes the proportion of observations in the

node assigned to branch b. If prior probabilities are specified, then the proportions P(b) are modified

accordingly.

The PROBCHISQ and PROBF criteria use the -log(p-value) measure. For these criteria, the best split is

the one with the smallest p-value. If the PADJUST=KASSBEFORE procedure option is in effect, as it is

by default, then the p-values are first multiplied using the appropriate Bonferroni factor. Adjusting the

p-value may cause it to become less significant than an alternative method of computing the p-value,

called Gabriel's adjustment. If so, then Gabriel's p-value is used.

For nodes with many observations, the algorithm uses a sample for the split search, for computing the

worth, and for observing the limit on the minimum size of a branch. The NODESAMPLE= procedure

option specifies the size of the sample and is limited to 32,767 observations (ignoring any frequency

variable) on many computer platforms. The samples in different nodes are taken independently.

For nominal targets, the sample is as balanced as possible. Suppose for example that a node contains 100

observations of one value of a binary target, 1,000 observations of the other value, and

NODESAMPLE=200 or more. Then all 100 observations of the first target value are in the node sample.

For binary splits on binary or interval targets, the optimal split is always found. For other situations, the

data is first consolidated, and then either all possible splits are evaluated or else an heuristic search is

used.

The consolidation phase searches for groups of values of the input which seem likely to be assigned to

the same branch in the best split. The split search regards observations in the same consolidation group

as having the same input value. The split search is faster because fewer candidate splits need evaluating.

If, after consolidation, the number of possible splits is greater than the number specified in the

EXHAUSTIVE= procedure option, then a heuristic search is used. The heuristic algorithm alternately

merges branches and reassigns consolidated observations to different branches. The process stops when

a binary split is reached. Among all candidate splits considered, the one with the best worth is chosen.

The heuristic algorithm initially assigns each consolidated group of observations to a different branch,

even if the number of such branches is more than the limit allowed in the final split. At each merge step,

the two branches are merged that degrade the worth of the partition the least. After two branches are

merged, the algorithm considers reassigning consolidated groups of observations to different branches.

Each consolidated group is considered in turn, and the process stops when no group is reassigned.

When using the PROBCHISQ and PROBF criteria, the p-value of the selected split on an input is

subjected to more adjustments: KASSAFTER, DEPTH, DEVILLE, and INPUTS. If the adjusted p-value

is greater than or equal to the WORTH= procedure option, the split is rejected.

AUTOMATIC PRUNING AND SUB-TREE

SELECTION

After a node is split, the newly created nodes are considered for splitting. This recursive process ends

when no node can be split. The reasons a node will not split are

The node contains too few observations, as specified in the SPLITSIZE= procedure option.

The number of nodes in the path between the root node and the given node equals the number

specified in the MAXDEPTH= procedure option.

No split exceeds the threshold worth requirement specified in the WORTH= procedure option.

The last reason is the most informative. Either all the observations in the node contain nearly the same

target value, or no input is sufficiently predictive in the node.

A tree adapts itself to the training data and generally does not fit as well when applied to new data. Trees

that fit the training data too well often predict too poorly to use on new data.

When SPLITSIZE=, MAXDEPTH, and WORTH= are set to extreme values, and, for PROBCHISQ and

PROBF criteria, PADJUST=none, the tree is apt to grow until all observations in a leaf contain the same

target value. Such trees typically overfit the training data and will predict new data poorly.

A primary consideration when developing a tree for prediction is deciding how large to grow the tree, or,

what comes to the same end, what nodes to prune off the tree.

The CHAID method of tree construction specifies a significance level of a Chi-square test to stop tree

growth. The authors of the C4.5 and Classification and Regression Trees methods argue that the right

thresholds for stopping tree construction are not knowable in advance, so theyrecommend growing a tree

too large and then pruning nodes off.

PROC SPLIT allows for both methods. The WORTH= option accepts the significance level used in

CHAID. After tree construction stops, regardless of why it stops, PROC SPLIT creates a sequence of

sub-trees of the original tree, one sub-tree for each possible number of leaves. The sub-tree chosen with

three leaves has the best assessment value of all candidate sub-trees with three leaves.

After the sequence of sub-trees is established, PROC SPLIT uses one of four methods to select which

sub-tree to use for prediction. The SUBTREE= procedure option specifies the method. If SUBTREE=n,

where n is a positive integer, then PROC SPLIT uses the largest sub-tree with, at most, n leaves.

If SUBTREE=ASSESSMENT, then PROC SPLIT uses the smallest sub-tree with the best assessment

value. The assessment is based on the validation data when available. (This differs from the construction

Yüklə 3,07 Mb.

Dostları ilə paylaş:

1 ... 122 123 124 125 126 127 128 129 ... 148