The arboretum procedure




of the sequence of sub-trees that only uses the training data.)

If SUBTREE=LARGEST, then PROC SPLIT uses the largest sub-tree after pruning nodes that do not increase the assessment based on the training data. For nominal targets, the largest sub-tree in the sequence might be much smaller than the original unpruned tree, because a splitting rule may have a high worth without increasing the number of observations correctly classified.

CHAID

Description

The inputs are either nominal or ordinal. Many software packages accept interval inputs and automatically group the values into ranges before growing the tree.

The splitting criterion is based on p-values from the F distribution (interval targets) or chi-square distribution (nominal targets). The p-values are adjusted to account for multiple testing.

A missing value may be treated as a separate value. For nominal inputs, a missing value constitutes a new category. For ordinal inputs, a missing value is free of any order restrictions.

The search for a split on an input proceeds stepwise. Initially, a branch is allocated for each value of the input. Branches are alternately merged and split again as seems warranted by the p-values. The original CHAID algorithm by Kass stops when no merge or re-split creates an adequate p-value, and the final split is adopted. A common alternative, sometimes called the exhaustive method, continues merging down to a binary split and then adopts the split with the most favorable p-value among all splits the algorithm considered.
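The merge step above can be sketched with a toy chi-square comparison of two candidate branches against a binary target. This is a minimal illustration, not CHAID's actual implementation; the function names and the 0.05 cutoff are assumptions:

```python
import math

def chisq_stat(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

def pvalue_1df(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Target counts (class 0, class 1) for two candidate branches of one input.
branch_a = [30, 10]
branch_b = [28, 12]
p = pvalue_1df(chisq_stat([branch_a, branch_b]))
# A large p-value means the branches look alike, so they are candidates to merge.
print(p > 0.05)   # prints True
```

A 2 x 2 table has one degree of freedom, so the p-value comes from a closed form; a general implementation would use the chi-square distribution with (r-1)(c-1) degrees of freedom.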

After a split is adopted for an input, its p-value is adjusted, and the input with the best adjusted p-value is selected as the splitting variable. If the adjusted p-value is smaller than a user-specified threshold, then the node is split.

Tree construction ends when all the adjusted p-values of the splitting variables in the unsplit nodes are above the user-specified threshold.
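For a nominal input, the multiple-testing adjustment can be illustrated with a Bonferroni-style multiplier: the number of ways the input's c original categories could have been grouped into b branches, which is a Stirling number of the second kind. This is a sketch in the spirit of Kass's adjustment, not the exact formula any particular package uses, and the function names are made up:

```python
from math import comb, factorial

def stirling2(c, b):
    """Ways to partition c categories into b non-empty branches
    (Stirling number of the second kind)."""
    total = sum((-1) ** i * comb(b, i) * (b - i) ** c for i in range(b + 1))
    return total // factorial(b)

def adjusted_p(raw_p, c, b):
    """Bonferroni-style adjustment: one implicit test per possible grouping."""
    return min(1.0, raw_p * stirling2(c, b))

print(stirling2(4, 2))         # 7 ways to group 4 categories into 2 branches
print(adjusted_p(0.01, 4, 2))  # raw 0.01 becomes roughly 0.07 after adjustment
```

The multiplier grows quickly with the number of categories, which is why an unadjusted p-value would unfairly favor inputs with many values.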



Relation to PROC SPLIT

The CHAID algorithm differs from PROC SPLIT in a number of ways: PROC SPLIT seeks the split minimizing the adjusted p-value, while the original Kass algorithm does not. The CHAID exhaustive method is similar to the PROC SPLIT heuristic method, except that the exhaustive method "re-splits" and PROC SPLIT "reassigns". Also, CHAID software discretizes interval inputs, while PROC SPLIT sometimes consolidates observations into groups. PROC SPLIT searches on a within-node sample, unlike CHAID.




To Approximate CHAID

The interval inputs should be discretized into a few dozen values. Then set the options as follows.

For nominal targets:

CRITERION = PROBCHISQ

SUBTREE = LARGEST (to avoid automatic pruning)

For interval targets:

CRITERION = PROBF

SUBTREE = LARGEST (to avoid automatic pruning)

For any type of target:

EXHAUSTIVE= 0 (forces a heuristic search)

MAXBRANCH = maximum number of categorical values in an input

NODESAMPLE= size of data set, up to 32,000

NSURRS = 0

PADJUST = KASSAFTER

WORTH = 0.05 (or whatever significance level seems appropriate)

Classification and Regression Trees

Description

Classification and Regression Trees is the name of a book by Breiman, Friedman, Olshen, and Stone (BFOS) in which a tree methodology is described.

The inputs are either nominal or interval. Ordinal inputs are treated as interval.

The available splitting criteria are: reduction in variance and reduction in least-absolute-deviation for interval targets; Gini impurity and twoing for nominal targets; and ordered twoing for ordinal targets. For binary targets, Gini, twoing, and ordered twoing create the same splits. Twoing and ordered twoing are used infrequently in practice.

The BFOS method does an exhaustive search for the best binary split.
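An exhaustive binary split search over a nominal input can be sketched as follows, using Gini impurity as the criterion. The names and the tiny data set are hypothetical; this illustrates the enumeration, not the BFOS implementation:

```python
from itertools import combinations

def gini(counts):
    """Gini impurity of a node with the given per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_binary_split(counts_by_value):
    """Enumerate every binary partition of the input's values and return the
    one with the lowest weighted Gini impurity.  counts_by_value maps each
    nominal value to its (class 0, class 1) target counts."""
    values = sorted(counts_by_value)
    best = None
    # Pin the first value to the left branch so mirrored partitions are not repeated.
    for r in range(len(values)):
        for rest in combinations(values[1:], r):
            left = (values[0],) + rest
            right = tuple(v for v in values if v not in left)
            if not right:
                continue
            lc = [sum(x) for x in zip(*(counts_by_value[v] for v in left))]
            rc = [sum(x) for x in zip(*(counts_by_value[v] for v in right))]
            n = sum(lc) + sum(rc)
            score = (sum(lc) * gini(lc) + sum(rc) * gini(rc)) / n
            if best is None or score < best[0]:
                best = (score, left, right)
    return best

score, left, right = best_binary_split({"a": (40, 5), "b": (35, 10), "c": (5, 40)})
print(left, right)   # ('a', 'b') ('c',)
```

A nominal input with k values has 2^(k-1) - 1 distinct binary partitions, which is why the exhaustive search becomes expensive for inputs with many values.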




Linear combination splits are also available. With a linear combination split, an observation is assigned to the "left" branch when a linear combination of interval inputs is less than some constant. The coefficients and the constant define the split. (BFOS excludes missing values during a split search.) The BFOS method for searching for linear combination splits is heuristic and may not find the best linear combination.
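The assignment rule itself is a one-liner; a minimal sketch with illustrative names and made-up coefficients:

```python
def linear_split_left(x, weights, constant):
    """True when the observation goes to the 'left' branch: the linear
    combination of its interval inputs falls below the cutoff."""
    return sum(w * xi for w, xi in zip(weights, x)) < constant

print(linear_split_left((1.0, 2.0), (0.5, 0.5), 2.0))   # True:  1.5 < 2.0
print(linear_split_left((3.0, 2.0), (0.5, 0.5), 2.0))   # False: 2.5 >= 2.0
```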

When creating a split, observations with a missing value in the splitting variable (or variables in the case of linear combination) are omitted. Surrogate splits are also created and used to assign observations to branches when the primary splitting variable is missing. If missing values prevent the use of the primary and surrogate splitting rules, then the observation is assigned to the largest branch (based on the within-node training sample).

When a node has many training observations, a sample is taken and used for the split search. The samples in different nodes are independent.

For nominal targets, prior probabilities and misclassification costs are recognized.

The tree is grown to overfit the data. A sequence of sub-trees is formed by using a cost-complexity measure. The sub-tree with the best fit to validation data is selected. Cross-validation is available when validation data is not.
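The cost-complexity idea can be made concrete: collapsing an internal node to a leaf trades an increase in training error for a reduction in leaf count, and the break-even penalty per leaf determines the pruning order. A minimal sketch with hypothetical names (the error figures would come from the training data):

```python
def weakest_link_alpha(leaf_error, subtree_error, n_leaves):
    """Penalty per leaf at which keeping the subtree stops paying off:
    solve leaf_error + alpha * 1 == subtree_error + alpha * n_leaves."""
    return (leaf_error - subtree_error) / (n_leaves - 1)

# A node that misclassifies 20 cases as a leaf, but whose 5-leaf subtree
# misclassifies only 12, breaks even at alpha = (20 - 12) / (5 - 1) = 2.
print(weakest_link_alpha(20, 12, 5))   # 2.0
```

Repeatedly collapsing the node with the smallest alpha yields the nested sequence of sub-trees from which the best-validated one is chosen.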

For nominal targets, class probability trees are an alternative to classification trees. Trees are grown to produce distinct distributions of class probabilities in the leaves. Trees are evaluated in terms of an overall Gini index. Neither misclassification costs nor rates are used.

To Approximate Classification and Regression Trees

Typically, PROC SPLIT trees should be very similar to ones grown using the BFOS methods. PROC SPLIT does not have linear combination splits or the twoing and ordered twoing splitting criteria. PROC SPLIT incorporates a loss matrix into a split search differently than the BFOS method, so splits in the presence of misclassification costs may differ. PROC SPLIT also handles ordinal targets differently than BFOS.

The BFOS method recommends using validation data unless the data set contains too few observations. PROC SPLIT is intended for large data sets, so first divide the data into training and validation data; then specify the EXCLUDEMISS option.

For nominal targets:

CRITERION = GINI

For interval targets:

CRITERION = VARIANCE

For any type of target:

ASSESS=STATISTIC (or IMPURITY for class probability trees)

EXCLUDEMISS

EXHAUSTIVE= 50000 (or thereabouts, to enumerate all possible splits)


