MAXBRANCH = 2
NODESAMPLE= 1000 (or whatever BFOS recommends)
NSURRS = 5 (or so)
SUBTREE = ASSESSMENT or ASSESS=IMPURITY for CLASS PROBABILITY.
VALIDATA = validation data set
C4.5 and C5.0
Description of C4.5
The book, C4.5: PROGRAMS FOR MACHINE LEARNING, by J. Ross Quinlan, is the main
reference.
The target is nominal. The inputs may be nominal or interval.
The recommended splitting criterion is the Gain Ratio = (reduction in entropy) / (entropy of split).
(Let P(b) denote the proportion of training observations that a split assigns to branch b, b = 1 to B.
The entropy of a split is defined as the entropy function applied to {P(b): b = 1 to B}.)
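As a rough illustration of the Gain Ratio defined above, the following Python sketch computes it for one candidate split, given the class labels that fall into each branch. The function names and the list-of-lists input layout are assumptions made for this example and are not part of C4.5 or PROC SPLIT; base-2 logarithms are also an assumption (the ratio is the same for any base).

```python
import math
from collections import Counter


def entropy(proportions):
    """Entropy function applied to a collection of proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


def class_entropy(labels):
    """Entropy of the class distribution among the given labels."""
    n = len(labels)
    return entropy(count / n for count in Counter(labels).values())


def gain_ratio(branches):
    """Gain ratio of a split.

    branches: one list of class labels per branch (hypothetical layout).
    """
    all_labels = [label for branch in branches for label in branch]
    n = len(all_labels)
    # Reduction in entropy: parent entropy minus weighted child entropy.
    weighted = sum(len(b) / n * class_entropy(b) for b in branches if b)
    reduction = class_entropy(all_labels) - weighted
    # Entropy of the split: entropy of the branch proportions P(b).
    split_entropy = entropy(len(b) / n for b in branches)
    return reduction / split_entropy if split_entropy > 0 else 0.0
```

A split that separates two equally frequent classes perfectly into two branches has a reduction in entropy of 1 bit and a split entropy of 1 bit, so `gain_ratio([["a", "a"], ["b", "b"]])` is 1.0.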
For interval inputs, C4.5 finds the best binary split. For nominal inputs, a branch is created for every
value, and then, optionally, the branches are merged until the splitting measure does not improve.
Merging is performed stepwise. At each step, the pair of branches whose merger most improves the
splitting measure is merged.
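The stepwise merging described above can be sketched as a greedy loop. This is a minimal Python illustration, not C4.5's actual code: the function name `merge_branches`, the representation of a branch as a list of class labels, and the caller-supplied `measure` function (larger is better) are all assumptions of this example. Stopping at two branches is also an assumption, since a single branch is no longer a split.

```python
from itertools import combinations


def merge_branches(branches, measure):
    """Greedily merge pairs of branches while the splitting measure improves.

    branches: one list of class labels per nominal value (hypothetical layout).
    measure:  function mapping a list of branches to a score; larger is better.
    """
    branches = [list(b) for b in branches]
    best = measure(branches)
    while len(branches) > 2:
        candidates = []
        for i, j in combinations(range(len(branches)), 2):
            # Candidate split with branches i and j combined into one.
            merged = [b for k, b in enumerate(branches) if k not in (i, j)]
            merged.append(branches[i] + branches[j])
            candidates.append((measure(merged), merged))
        score, merged = max(candidates, key=lambda c: c[0])
        if score <= best:
            break  # no pair improves the measure, so stop merging
        best, branches = score, merged
    return branches, best
```

Passing the measure in as a function keeps the sketch independent of the particular criterion; in C4.5 the measure would be the splitting criterion in use.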
When creating a split, observations with a missing value in the splitting variable are discarded when
computing the reduction in entropy, and the entropy of a split is computed as if the split makes an
additional branch exclusively for the missing values.
When applying a splitting rule to an observation with a missing value on the splitting variable, the
observation is replaced by B observations, one for each branch, and each new observation is assigned a
weight equal to the proportion of observations used to create the split sent into that branch. The posterior
probabilities of the original observation equal the weighted sum of the probabilities for the split
observations.
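The weighting scheme for missing values can be sketched in a few lines of Python. This is an illustrative example only: the function name and the use of dictionaries for posterior probabilities are assumptions, and the sketch handles a single split whose branches already have known posteriors (in a full tree, each branch's posterior would itself come from applying the same rule further down).

```python
def posterior_with_missing(branch_weights, branch_posteriors):
    """Posterior for an observation whose splitting value is missing.

    branch_weights:    proportion of the observations used to create the
                       split that were sent into each branch (sums to 1).
    branch_posteriors: one dict per branch mapping class -> probability.
    """
    classes = set().union(*branch_posteriors)
    return {
        c: sum(w * p.get(c, 0.0)
               for w, p in zip(branch_weights, branch_posteriors))
        for c in classes
    }
```

For example, if 75% of the observations went to a branch with P(yes) = 0.8 and 25% to a branch with P(yes) = 0.4, the observation with the missing value receives P(yes) = 0.75 × 0.8 + 0.25 × 0.4 = 0.7.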
The tree is first grown to overfit the training data. In each node, an upper confidence limit on the number
of misclassified observations is estimated, assuming a binomial distribution around the observed number
misclassified. A subtree is then sought that minimizes the sum of these upper confidence limits over its leaves.
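To make the pessimistic estimate concrete, the Python sketch below computes an exact binomial upper confidence limit by bisection and sums the resulting error estimates over a set of leaves. It is an approximation of the idea under stated assumptions, not C4.5's implementation: the function names are invented for this example, error counts are assumed to be integers, and the confidence level `cf=0.25` is assumed as the default.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k + 1))


def upper_error_rate(errors, n, cf=0.25):
    """Largest error rate p for which seeing <= errors out of n has probability cf."""
    if errors >= n:
        return 1.0
    lo, hi = errors / n, 1.0
    for _ in range(60):  # bisection; the CDF decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pessimistic_errors(leaves, cf=0.25):
    """Sum of upper limits on misclassified counts over (errors, n) leaves."""
    return sum(n * upper_error_rate(e, n, cf) for e, n in leaves)
```

A leaf with 6 observations and no errors gets the limit 1 - 0.25^(1/6), about 0.206, so small pure leaves are still charged a noticeable error. Pruning would compare `pessimistic_errors` for the leaves of a subtree against the value for the single leaf that would replace it, and keep whichever is smaller.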
C4.5 can convert a tree into a "ruleset", which is a set of rules that assigns most observations to the same
class that the tree does. Generally, the ruleset contains fewer rules than needed to describe all root-leaf
paths and is consequently more understandable than the tree.
C4.5 can create "fuzzy" splits on interval inputs. The tree is constructed the same as with non-fuzzy
splits. If an interval input has a value near the splitting value, then the observation is effectively replaced
by two observations, each with some weight related to the proximity of the input value to the splitting
value. The posterior probabilities of the original observation equal the weighted sum of probabilities for
the two new observations.
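The text does not say exactly how the weights depend on proximity, so the following Python sketch assumes a simple linear ramp within a symmetric band around the splitting value. The band `half_width`, the function names, and the dictionary representation of posteriors are all hypothetical choices for illustration.

```python
def left_weight(value, cut, half_width):
    """Weight sent to the lower branch; 0.5 exactly at the splitting value.

    Assumes a linear ramp from 1 at (cut - half_width) to 0 at
    (cut + half_width); outside the band the split is crisp.
    """
    if half_width <= 0:
        return 1.0 if value <= cut else 0.0
    w = 0.5 - (value - cut) / (2 * half_width)
    return min(1.0, max(0.0, w))


def fuzzy_posterior(value, cut, half_width, left_post, right_post):
    """Weighted sum of the two branch posteriors (dicts of class -> prob)."""
    w = left_weight(value, cut, half_width)
    classes = set(left_post) | set(right_post)
    return {c: w * left_post.get(c, 0.0) + (1 - w) * right_post.get(c, 0.0)
            for c in classes}
```

With a splitting value of 10 and a half-width of 2, an input of 11 sends weight 0.25 to the lower branch and 0.75 to the upper branch, while inputs at or below 8 or at or above 12 behave as in a non-fuzzy split.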
Description of C5.0
The Web page http://www.rulequest.com contains some information about C5.0. C5.0 is C4.5 with the
following differences:
- The branch-merging option for nominal splits is the default.
- The user may specify misclassification costs.
- Boosting and cross-validation are available.
- The algorithm for creating rulesets from trees is much improved.
Relation to PROC SPLIT
Trees created with C4.5 will differ from those created with PROC SPLIT for several reasons:
- C4.5 creates binary splits on interval inputs and multiway splits on nominal inputs. This favors
  nominal inputs. PROC SPLIT treats interval and nominal inputs the same in this respect.
- C4.5 uses a pruning method designed to avoid using validation data. PROC SPLIT expects
  validation data to be available and so does not offer the pessimistic pruning method of C4.5.
The option settings most similar to C4.5 are:
CRITERION = ERATIO
EXHAUSTIVE= 0 (forces a heuristic search)
MAXBRANCH = maximum number of nominal values in an input, up to 100
NODESAMPLE= size of data set, up to 32,000
NSURRS = 0
WORTH = 0
SUBTREE = ASSESSMENT
VALIDATA = validation data set
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The SPLIT Procedure
Examples
The following examples were executed using the HP-UX version 10.20 operating system; the version of
the SAS system was 6.12TS045.
Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)
Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)
Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)
Features
- Specifying the Input Variables and the Target Variable
- Setting the Splitting Criterion
- Setting the Maximum Number of Child Nodes of a Node
- Specifying the smallest number of training observations a node must have to consider splitting it
- Outputting and Printing Fit Statistics
- Creating a Misclassification Table
- Scoring Data with the Score Statement
- Reading the Input Data Set from a Previously Created Decision Tree (the OUTTREE= data set) with the INTREE= option
- Creating Diagnostic Scatter Plots
- Creating Contour Plots of the Posterior Probabilities
- Creating a Scatter Plot of the Leaf Nodes
This example demonstrates how to create a decision tree with a categorical target. The ENTROPY splitting criterion is used to
search for and evaluate candidate splitting rules.
The example DMDB training data set SAMPSIO.DMDRING contains a categorical target with 3 levels (C = 1, 2, or 3) and two
interval inputs (X and Y). There are 180 observations in the training data set. The SAMPSIO.DMSRING data set is scored using
the scoring formula from the trained model. Both data sets and the DMDB training catalog are stored in the sample library.
Program
title 'SPLIT Example: RINGS Data';
title2 'Plot of the Rings Training Data';
goptions gunit=pct ftext=swiss ftitle=swissb htitle=4 htext=3;

/* Scatter plot of the training data, one plotting symbol per level of the target C */
proc gplot data=sampsio.dmdring;
   plot y*x=c / haxis=axis1 vaxis=axis2;
   symbol1 c=black i=none v=dot;
   symbol2 c=red i=none v=square;
   symbol3 c=green i=none v=triangle;
   axis1 c=black width=2.5 order=(0 to 30 by 5);
   axis2 c=black width=2.5 minor=none order=(0 to 20 by 2);
run;

title2 'Entropy Criterion';
proc split data=sampsio.dmdring
           dmdbcat=sampsio.dmdring
           criterion=entropy