The arboretum procedure



NODESAMPLE= 1000 (or whatever BFOS recommends )

NSURRS = 5 (or so )


VALIDATA = validation data set

C4.5 and C5.0

Description of C4.5

The book, C4.5: PROGRAMS FOR MACHINE LEARNING, by J. Ross Quinlan, is the main reference for the algorithm.


The target is nominal. The inputs may be nominal or interval.

The recommended splitting criterion is the Gain Ratio = reduction in entropy / entropy of split.

(Let P(b) denote the proportion of training observations a split assigns to branch b, b = 1 to B. The entropy of a split is defined as the entropy function applied to {P(b): b = 1 to B}.)
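As an illustrative sketch (in Python rather than SAS, and not C4.5's actual code), the gain ratio of a candidate split can be computed from the class labels and the branch each observation is assigned to:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, branch_ids):
    """Gain ratio = (reduction in target entropy) / (entropy of the split).

    labels[i] is the class of observation i; branch_ids[i] is the branch
    the candidate split sends observation i to.
    """
    n = len(labels)
    branches = Counter(branch_ids)
    # Weighted entropy of the target after splitting.
    entropy_after = sum(
        (count / n) * entropy([l for l, b in zip(labels, branch_ids) if b == bid])
        for bid, count in branches.items()
    )
    gain = entropy(labels) - entropy_after
    # Entropy of the split itself: entropy applied to {P(b): b = 1 to B}.
    split_info = -sum((c / n) * math.log2(c / n) for c in branches.values())
    return gain / split_info

labels     = ["a", "a", "b", "b", "b", "b"]
branch_ids = [ 1,   1,   1,   2,   2,   2 ]
print(round(gain_ratio(labels, branch_ids), 4))  # → 0.4591
```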

For interval inputs, C4.5 finds the best binary split. For nominal inputs, a branch is created for every value, and then, optionally, the branches are merged until the splitting measure no longer improves. Merging is performed stepwise: at each step, the pair of branches whose merger most improves the splitting measure is merged.
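The stepwise merging just described can be sketched as a greedy loop. The `measure` callback below is a placeholder for whatever splitting measure is in use (e.g. the gain ratio); the code is an illustration, not C4.5's implementation:

```python
from itertools import combinations

def merge_branches(measure, branches):
    """Greedy stepwise merging of the branches of a nominal split.

    `branches` is a list of disjoint sets of nominal values (initially one
    singleton set per value); `measure(branches)` scores the split, higher
    being better. At each step the pair whose union most improves the
    measure is merged; merging stops when no merge improves it.
    """
    branches = [set(b) for b in branches]
    current = measure(branches)
    while len(branches) > 2:
        best = None
        for i, j in combinations(range(len(branches)), 2):
            trial = [b for k, b in enumerate(branches) if k not in (i, j)]
            trial.append(branches[i] | branches[j])
            score = measure(trial)
            if score > current and (best is None or score > best[0]):
                best = (score, i, j)
        if best is None:
            break  # no merge improves the splitting measure
        current, i, j = best
        merged = branches[i] | branches[j]
        branches = [b for k, b in enumerate(branches) if k not in (i, j)]
        branches.append(merged)
    return branches
```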

When creating a split, observations with a missing value in the splitting variable are discarded when computing the reduction in entropy, and the entropy of the split is computed as if the split made an additional branch exclusively for the missing values.

When applying a splitting rule to an observation with a missing value on the splitting variable, the observation is replaced by B observations, one for each branch, and each new observation is assigned a weight equal to the proportion of observations used to create the split sent into that branch. The posterior probabilities of the original observation equal the weighted sum of the probabilities for the B new observations.
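This scoring-time treatment of missing values amounts to a weighted average of the branch posteriors. A minimal sketch, with hypothetical branch fractions and posterior values chosen only for illustration:

```python
def posterior_with_missing(branch_fracs, branch_posteriors):
    """Posterior for an observation whose splitting variable is missing.

    branch_fracs[b]: fraction of training observations the split sent to
    branch b. branch_posteriors[b]: class-posterior dict returned by
    branch b's subtree. The observation is effectively replaced by B
    weighted copies, one per branch.
    """
    classes = set().union(*branch_posteriors)
    return {c: sum(f * p.get(c, 0.0)
                   for f, p in zip(branch_fracs, branch_posteriors))
            for c in classes}

# Hypothetical two-branch split: 75% of training data went left, 25% right.
post = posterior_with_missing(
    [0.75, 0.25],
    [{"yes": 0.8, "no": 0.2}, {"yes": 0.4, "no": 0.6}],
)
print(post)  # weighted sum: yes = 0.75*0.8 + 0.25*0.4 = 0.7
```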


The tree is grown to overfit the training data. In each node, an upper confidence limit on the number misclassified is estimated, assuming a binomial distribution around the observed number misclassified. A sub-tree is then sought that minimizes the sum of the upper confidence limits over its leaves.
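The upper confidence limit can be obtained by inverting the binomial CDF. The sketch below does so by bisection, using C4.5's customary default confidence factor of 0.25; the function names are illustrative, not C4.5's own:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def pessimistic_errors(n, e, cf=0.25):
    """Upper confidence limit on the error count in a leaf.

    n: observations in the leaf; e: observed misclassifications.
    Finds the error rate p with P(X <= e | n, p) = cf (C4.5's default
    cf is 0.25) and returns n * p, the pessimistic error count used
    when comparing a sub-tree against a collapsed node.
    """
    lo, hi = e / n, 1.0
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return n * lo
```

For a pure leaf (e = 0) this reduces to the closed form n * (1 - cf**(1/n)), so even an error-free leaf is charged a nonzero pessimistic error, which is what drives the pruning.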

C4.5 can convert a tree into a "ruleset", a set of rules that assigns most observations to the same class that the tree does. Generally, the ruleset contains fewer rules than are needed to describe all root-to-leaf paths and is consequently more understandable than the tree.

C4.5 can create "fuzzy" splits on interval inputs. The tree is constructed the same as with non-fuzzy splits. If an interval input has a value near the splitting value, the observation is effectively replaced by two observations, each with a weight related to the proximity of the input value to the splitting value. The posterior probabilities of the original observation equal the weighted sum of the probabilities for the two new observations.
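A sketch of such a soft assignment, using a linear ramp around the threshold; the ramp shape and the `width` parameter are assumptions for illustration only, since the text does not specify C4.5's exact weighting function:

```python
def fuzzy_branch_weights(x, threshold, width):
    """Soft (left, right) weights for a binary split "x <= threshold".

    Within +/- width of the threshold the observation is shared between
    both branches in proportion to its distance from the threshold;
    outside that band it goes entirely to one branch. The linear ramp
    is an illustrative assumption, not C4.5's documented formula.
    """
    if x <= threshold - width:
        return (1.0, 0.0)          # fully down the left branch
    if x >= threshold + width:
        return (0.0, 1.0)          # fully down the right branch
    right = (x - (threshold - width)) / (2 * width)
    return (1.0 - right, right)
```

The two weights then multiply the branch posteriors exactly as in the missing-value case, giving the weighted sum described above.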

Description of C5.0

The Web page contains some information about C5.0. C5.0 is C4.5 with the following differences:

The branch-merging option for nominal splits is the default.

The user may specify misclassification costs.

Boosting and cross-validation are available.

The algorithm for creating rulesets from trees is much improved.

Relation to PROC SPLIT

Trees created with C4.5 will differ from those created with PROC SPLIT for several reasons:

C4.5 creates binary splits on interval inputs and multiway splits on nominal inputs, which favors nominal inputs. PROC SPLIT treats interval and nominal inputs the same in this respect.

C4.5 uses a pruning method designed to avoid using validation data. PROC SPLIT expects validation data to be available and so does not offer the pessimistic pruning method of C4.5.

The option settings most similar to C4.5 are:


EXHAUSTIVE= 0 (forces a heuristic search)

MAXBRANCH = maximum number of nominal values in an input, up to 100

NODESAMPLE= size of data set, up to 32,000




VALIDATA = validation data set

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.

The SPLIT Procedure


The following examples were executed using the HP-UX version 10.20 operating system; the version of the SAS System was 6.12TS045.

Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)

Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)



Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)


Specifying the Input Variables and the Target Variable


Setting the Splitting Criterion


Setting the Maximum Number of Child Nodes of a Node


Specifying the Smallest Number of Training Observations a Node Must Have to Consider Splitting It


Outputting and Printing Fit Statistics


Creating a Misclassification Table


Scoring Data with the Score Statement


Reading the Input Data Set from a Previously Created Decision Tree (the OUTTREE= Data Set) with the INTREE= Option


Creating Diagnostic Scatter Plots


Creating Contour Plots of the Posterior Probabilities


Creating a Scatter Plot of the Leaf Nodes


This example demonstrates how to create a decision tree with a categorical target. The ENTROPY splitting criterion is used to search for and evaluate candidate splitting rules.

The example DMDB training data set SAMPSIO.DMDRING contains a categorical target with 3 levels (C = 1, 2, or 3) and two interval inputs (X and Y). There are 180 observations in the training data set. The SAMPSIO.DMSRING data set is scored using the scoring formula from the trained model. Both data sets and the DMDB training catalog are stored in the sample library.



title  'SPLIT Example: RINGS Data';

title2  'Plot of the Rings Training Data';

goptions gunit=pct ftext=swiss ftitle=swissb htitle=4 htext=3;

proc gplot data=sampsio.dmdring;

   plot y*x=c /haxis=axis1 vaxis=axis2;

   symbol  c=black i=none v=dot;

   symbol2 c=red i=none v=square;

   symbol3 c=green i=none v=triangle;

   axis1 c=black width=2.5 order=(0 to 30 by 5);

   axis2 c=black width=2.5 minor=none order=(0 to 20 by 2);
run;



title2 'Entropy Criterion';

proc split data=sampsio.dmdring



