The arboretum procedure

Yüklə 3,07 Mb.

 səhifə 126/148 tarix 30.04.2018 ölçüsü 3,07 Mb.
 not the input used in the primary rule counts against the surrogate. The NSURRS= procedure option determines the number of surrogates sought. A surrogate is discarded if its agreement is   1/B, where B is the number of branches. As a consequence, a node might have fewer surrogates than the number specified in the NSURRS= option. METHOD OF SPLIT SEARCH For a specific node and input, PROC SPLIT seeks the split with maximum worth or -log(p-value) subject to the limit on the number of branches and the limit on the minimum number of observations assigned to a branch. The procedure options MAXBRANCH= and LEAFSIZE= set these limits. The measure of worth depends on the splitting criterion. The ENTROPY, GINI, and VARIANCE reduction criteria measure worth as I(node) - sum over branches b of P(b) I(b), where I(node) denotes the entropy, gini, or variance measure in the node, and P(b) denotes the proportion of observations in the node assigned to branch b. If prior probabilities are specified, then the proportions P(b) are modified accordingly. The PROBCHISQ and PROBF criteria use the -log(p-value) measure. For these criteria, the best split is the one with the smallest p-value. If the PADJUST=KASSBEFORE procedure option is in effect, as it is by default, then the p-values are first multiplied using the appropriate Bonferroni factor. Adjusting the p-value may cause it to become less significant than an alternative method of computing the p-value, called Gabriel's adjustment. If so, then Gabriel's p-value is used. For nodes with many observations, the algorithm uses a sample for the split search, for computing the worth, and for observing the limit on the minimum size of a branch. The NODESAMPLE= procedure option specifies the size of the sample and is limited to 32,767 observations (ignoring any frequency variable) on many computer platforms. The samples in different nodes are taken independently. For nominal targets, the sample is as balanced as possible. Suppose for example that a node contains 100 observations of one value of a binary target, 1,000 observations of the other value, and NODESAMPLE=200 or more. Then all 100 observations of the first target value are in the node sample. For binary splits on binary or interval targets, the optimal split is always found. For other situations, the data is first consolidated, and then either all possible splits are evaluated or else an heuristic search is used. The consolidation phase searches for groups of values of the input which seem likely to be assigned to the same branch in the best split. The split search regards observations in the same consolidation group as having the same input value. The split search is faster because fewer candidate splits need evaluating. If, after consolidation, the number of possible splits is greater than the number specified in the EXHAUSTIVE= procedure option, then a heuristic search is used. The heuristic algorithm alternately merges branches and reassigns consolidated observations to different branches. The process stops when a binary split is reached. Among all candidate splits considered, the one with the best worth is chosen. The heuristic algorithm initially assigns each consolidated group of observations to a different branch, even if the number of such branches is more than the limit allowed in the final split. At each merge step, the two branches are merged that degrade the worth of the partition the least. After two branches are merged, the algorithm considers reassigning consolidated groups of observations to different branches. Each consolidated group is considered in turn, and the process stops when no group is reassigned. When using the PROBCHISQ and PROBF criteria, the p-value of the selected split on an input is subjected to more adjustments: KASSAFTER, DEPTH, DEVILLE, and INPUTS. If the adjusted p-value is greater than or equal to the WORTH= procedure option, the split is rejected. AUTOMATIC PRUNING AND SUB-TREE SELECTION After a node is split, the newly created nodes are considered for splitting. This recursive process ends when no node can be split. The reasons a node will not split are The node contains too few observations, as specified in the SPLITSIZE= procedure option. q    The number of nodes in the path between the root node and the given node equals the number specified in the MAXDEPTH= procedure option. q    No split exceeds the threshold worth requirement specified in the WORTH= procedure option. q    The last reason is the most informative. Either all the observations in the node contain nearly the same target value, or no input is sufficiently predictive in the node. A tree adapts itself to the training data and generally does not fit as well when applied to new data. Trees that fit the training data too well often predict too poorly to use on new data. When SPLITSIZE=, MAXDEPTH, and WORTH= are set to extreme values, and, for PROBCHISQ and PROBF criteria, PADJUST=none, the tree is apt to grow until all observations in a leaf contain the same target value. Such trees typically overfit the training data and will predict new data poorly. A primary consideration when developing a tree for prediction is deciding how large to grow the tree, or, what comes to the same end, what nodes to prune off the tree. The CHAID method of tree construction specifies a significance level of a Chi-square test to stop tree growth. The authors of the C4.5 and Classification and Regression Trees methods argue that the right thresholds for stopping tree construction are not knowable in advance, so theyrecommend growing a tree too large and then pruning nodes off. PROC SPLIT allows for both methods. The WORTH= option accepts the significance level used in CHAID. After tree construction stops, regardless of why it stops, PROC SPLIT creates a sequence of sub-trees of the original tree, one sub-tree for each possible number of leaves. The sub-tree chosen with three leaves has the best assessment value of all candidate sub-trees with three leaves. After the sequence of sub-trees is established, PROC SPLIT uses one of four methods to select which sub-tree to use for prediction. The SUBTREE= procedure option specifies the method. If SUBTREE=n, where n is a positive integer, then PROC SPLIT uses the largest sub-tree with, at most, n leaves. If SUBTREE=ASSESSMENT, then PROC SPLIT uses the smallest sub-tree with the best assessment value. The assessment is based on the validation data when available. (This differs from the construction Dostları ilə paylaş:

Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2017
rəhbərliyinə müraciət

Ana səhifə