The ARBORETUM Procedure




MAXBRANCH = 2
NODESAMPLE = 1000 (or whatever BFOS recommends)
NSURRS = 5 (or so)
SUBTREE = ASSESSMENT, or ASSESS=IMPURITY for class probability trees
VALIDATA = validation data set
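Taken together, these settings would appear on the PROC statement. The following is a minimal sketch, not taken from the original text: the data set names train, cattrain, and valid and the variables x1, x2, and c are hypothetical placeholders.

proc split data=train           /* DMDB-encoded training data        */
           dmdbcat=cattrain     /* its DMDB catalog                  */
           maxbranch=2          /* binary splits, as in CART         */
           nodesample=1000      /* node sample size for split search */
           nsurrs=5             /* retain five surrogate rules       */
           subtree=assessment   /* choose sub-tree by assessment     */
           validata=valid;      /* validation data for pruning       */
   input x1 x2;                 /* input variables                   */
   target c;                    /* nominal target                    */
run;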

C4.5 and C5.0

Description of C4.5

The book C4.5: Programs for Machine Learning, by J. Ross Quinlan, is the main reference.

The target is nominal. The inputs may be nominal or interval.

The recommended splitting criterion is the Gain Ratio = reduction in entropy / entropy of split. (Let P(b) denote the proportion of training observations a split assigns to branch b, b = 1 to B. The entropy of a split is defined as the entropy function applied to {P(b): b = 1 to B}.)
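In symbols, with H denoting the entropy function (base-2 logarithms are the usual choice in C4.5), the criterion just described can be written as

\[
\mathrm{GainRatio} \;=\; \frac{H(\text{parent}) \;-\; \sum_{b=1}^{B} P(b)\, H(\text{branch } b)}{-\sum_{b=1}^{B} P(b)\,\log_2 P(b)} ,
\]

where the numerator is the reduction in entropy and the denominator is the entropy of the split.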

For interval inputs, C4.5 finds the best binary split. For nominal inputs, a branch is created for every value, and then, optionally, the branches are merged until the splitting measure no longer improves. Merging is performed stepwise: at each step, the pair of branches whose merger most improves the splitting measure is merged.
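A compact statement of the stepwise merge (my paraphrase of the description above, not Quinlan's own pseudocode): at each step choose

\[
(b^{\ast}, b'^{\ast}) \;=\; \arg\max_{b \neq b'} \, \mathrm{GainRatio}\bigl(\text{split with branches } b \text{ and } b' \text{ merged}\bigr),
\]

repeating until no merge improves the measure.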

When creating a split, observations with a missing value in the splitting variable are discarded when computing the reduction in entropy, and the entropy of a split is computed as if the split makes an additional branch exclusively for the missing values.
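In the notation above, this adds one term to the entropy of the split (a sketch of my reading of the description, with P(miss) the proportion of observations whose splitting variable is missing):

\[
\text{entropy of split} \;=\; -\sum_{b=1}^{B} P(b)\,\log_2 P(b) \;-\; P(\mathrm{miss})\,\log_2 P(\mathrm{miss}),
\]

while the reduction in entropy in the numerator is computed from the non-missing observations only.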

When applying a splitting rule to an observation with a missing value on the splitting variable, the observation is replaced by B observations, one for each branch, and each new observation is assigned a weight equal to the proportion of observations used to create the split sent into that branch. The posterior probabilities of the original observation equal the weighted sum of the probabilities for the split observations.
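Equivalently, if p_b(t) denotes the posterior probability of class t computed in branch b, the original observation receives

\[
p(t \mid x) \;=\; \sum_{b=1}^{B} P(b)\, p_b(t),
\]

with P(b) as defined above.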

The tree is grown to overfit the training data. In each node, an upper confidence limit of the number misclassified is estimated assuming a binomial distribution around the observed number misclassified. A sub-tree is then sought that minimizes the sum, over its leaves, of these upper confidence limits.
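One standard way to make this precise (my gloss following Quinlan's formulation, which uses a one-sided confidence level CF, 25% by default): for a leaf with n observations, e of them misclassified, solve

\[
\sum_{i=0}^{e} \binom{n}{i}\, p^{\,i} (1-p)^{\,n-i} \;=\; \mathit{CF}
\]

for p to obtain the upper limit U(e, n) on the true error rate. The corresponding limit on the number misclassified is n \cdot U(e, n), and the selected sub-tree minimizes the sum of these limits over its leaves.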

C4.5 can convert a tree into a "ruleset", which is a set of rules that assigns most observations to the same class that the tree does. Generally, the ruleset contains fewer rules than needed to describe all root-leaf paths and is consequently more understandable than the tree.

C4.5 can create "fuzzy" splits on interval inputs. The tree is constructed the same as with non-fuzzy splits. If an interval input has a value near the splitting value, then the observation is effectively replaced by two observations, each with some weight related to the proximity of the input value to the splitting value. The posterior probabilities of the original observation equal the weighted sum of probabilities for the two new observations.
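For a binary split at threshold s, this amounts to the following (a sketch only; the exact weighting function near the threshold is implementation-specific):

\[
p(t \mid x) \;=\; w(v)\, p_{\mathrm{left}}(t) \;+\; \bigl(1 - w(v)\bigr)\, p_{\mathrm{right}}(t),
\]

where v is the observation's value of the splitting input and the weight w(v) moves from 1 to 0 as v crosses the region around s.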

Description of C5.0

The Web page http://www.rulequest.com contains some information about C5.0. C5.0 is C4.5 with the following differences:

- The branch-merging option for nominal splits is the default.
- The user may specify misclassification costs.
- Boosting and cross-validation are available.
- The algorithm for creating rulesets from trees is much improved.

Relation to PROC SPLIT

Trees created with C4.5 will differ from those created with PROC SPLIT for several reasons:

- C4.5 creates binary splits on interval inputs and multiway splits on nominal inputs. This favors nominal inputs. PROC SPLIT treats interval and nominal inputs the same in this respect.
- C4.5 uses a pruning method designed to avoid using validation data. PROC SPLIT expects validation data to be available and so does not offer the pessimistic pruning method of C4.5.

The option settings most similar to C4.5 are:

CRITERION = ERATIO
EXHAUSTIVE = 0 (forces a heuristic search)
MAXBRANCH = maximum number of nominal values in an input, up to 100
NODESAMPLE = size of data set, up to 32,000
NSURRS = 0
WORTH = 0
SUBTREE = ASSESSMENT
VALIDATA = validation data set
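As with the CART-like settings earlier, here is a minimal sketch of a PROC SPLIT call with these C4.5-like settings. The names train, cattrain, valid, x1, x2, and c are hypothetical placeholders, and the MAXBRANCH= and NODESAMPLE= values are example numbers standing in for the data-dependent choices described in the list:

proc split data=train            /* DMDB-encoded training data              */
           dmdbcat=cattrain      /* its DMDB catalog                        */
           criterion=eratio      /* entropy-ratio (gain ratio) criterion    */
           exhaustive=0          /* forces a heuristic split search         */
           maxbranch=20          /* e.g., most nominal values in any input  */
           nodesample=5000       /* e.g., the size of the data set          */
           nsurrs=0              /* no surrogate rules                      */
           worth=0               /* no worth threshold on candidate splits  */
           subtree=assessment
           validata=valid;       /* validation data for final sub-tree      */
   input x1 x2;
   target c;
run;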






The SPLIT Procedure

Examples

The following examples were executed under the HP-UX version 10.20 operating system; the version of the SAS System was 6.12TS045.

Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)

Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)






Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)

Features

- Specifying the Input Variables and the Target Variable
- Setting the Splitting Criterion
- Setting the Maximum Number of Child Nodes of a Node
- Specifying the smallest number of training observations a node must have to consider splitting it
- Outputting and Printing Fit Statistics
- Creating a Misclassification Table
- Scoring Data with the Score Statement
- Reading the Input Data Set from a Previously Created Decision Tree (the OUTTREE= data set) with the INTREE= option
- Creating Diagnostic Scatter Plots
- Creating Contour Plots of the Posterior Probabilities
- Creating a Scatter Plot of the Leaf Nodes


This example demonstrates how to create a decision tree with a categorical target. The ENTROPY splitting criterion is used to search for and evaluate candidate splitting rules.

The example DMDB training data set SAMPSIO.DMDRING contains a categorical target with 3 levels (C = 1, 2, or 3) and two interval inputs (X and Y). There are 180 observations in the training data set. The SAMPSIO.DMSRING data set is scored using the scoring formula from the trained model. Both data sets and the DMDB training catalog are stored in the sample library.

Program

 

title  'SPLIT Example: RINGS Data';
title2 'Plot of the Rings Training Data';
goptions gunit=pct ftext=swiss ftitle=swissb htitle=4 htext=3;

/* Scatter plot of the inputs X and Y, with plotting symbols keyed to the target C */
proc gplot data=sampsio.dmdring;
   plot y*x=c /haxis=axis1 vaxis=axis2;
   symbol  c=black i=none v=dot;      /* first level of C  */
   symbol2 c=red   i=none v=square;   /* second level of C */
   symbol3 c=green i=none v=triangle; /* third level of C  */
   axis1 c=black width=2.5 order=(0 to 30 by 5);
   axis2 c=black width=2.5 minor=none order=(0 to 20 by 2);
run;


 

title2 'Entropy Criterion';
proc split data=sampsio.dmdring
           dmdbcat=sampsio.dmdring
           criterion=entropy


