The ARBORETUM Procedure

MAXBRANCH = 2
NODESAMPLE= 1000 (or whatever BFOS recommends)
NSURRS = 5 (or so)
SUBTREE = ASSESSMENT, or ASSESS=IMPURITY for class probability trees
VALIDATA = validation data set

C4.5 and C5.0

Description of C4.5

The book C4.5: Programs for Machine Learning, by J. Ross Quinlan, is the main reference. The target is nominal. The inputs may be nominal or interval.

The recommended splitting criterion is the gain ratio: the reduction in entropy divided by the entropy of the split. (Let P(b) denote the proportion of training observations a split assigns to branch b, b = 1 to B. The entropy of a split is defined as the entropy function applied to {P(b): b = 1 to B}.)

For interval inputs, C4.5 finds the best binary split. For nominal inputs, a branch is created for every value, and then, optionally, the branches are merged until the splitting measure no longer improves. Merging is performed stepwise: at each step, the pair of branches whose merger most improves the splitting measure is merged.

When creating a split, observations with a missing value in the splitting variable are discarded when computing the reduction in entropy, and the entropy of the split is computed as if the split made an additional branch exclusively for the missing values. When applying a splitting rule to an observation with a missing value on the splitting variable, the observation is replaced by B observations, one for each branch, each weighted by the proportion of observations used to create the split that were sent into that branch. The posterior probabilities of the original observation equal the weighted sum of the probabilities for the split observations.
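The gain-ratio criterion described above can be sketched as follows (a minimal Python illustration of the definitions, not C4.5 or SAS source; the function names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, branches):
    """Gain ratio = reduction in entropy / entropy of the split.

    labels   -- class labels of all training observations at the node
    branches -- list of label lists, one per branch of the candidate split
    """
    n = len(labels)
    # reduction in entropy (information gain) achieved by the split
    gain = entropy(labels) - sum(len(b) / n * entropy(b) for b in branches)
    # entropy of the split itself: entropy applied to the proportions P(b)
    split_info = -sum((len(b) / n) * math.log2(len(b) / n) for b in branches)
    return gain / split_info

# A pure binary split of a 50/50 node: gain 1 bit, split entropy 1 bit.
print(gain_ratio([0, 0, 1, 1], [[0, 0], [1, 1]]))  # → 1.0
```

Dividing by the split entropy penalizes high-arity splits, which otherwise look artificially good under plain information gain.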
The tree is grown to overfit the training data. In each node, an upper confidence limit on the number misclassified is estimated, assuming a binomial distribution around the observed number misclassified. A sub-tree is then sought that minimizes the sum of the upper confidence limits over the leaves.

C4.5 can convert a tree into a "ruleset": a set of rules that assigns most observations to the same class that the tree does. The ruleset generally contains fewer rules than would be needed to describe all root-to-leaf paths and is consequently more understandable than the tree.

C4.5 can create "fuzzy" splits on interval inputs. The tree is constructed the same way as with non-fuzzy splits. If an interval input has a value near the splitting value, the observation is effectively replaced by two observations, each with a weight related to the proximity of the input value to the splitting value. The posterior probabilities of the original observation equal the weighted sum of the probabilities for the two new observations.

Description of C5.0

The Web page http://www.rulequest.com contains some information about C5.0. C5.0 is C4.5 with the following differences:

- The branch-merging option for nominal splits is the default.
- The user may specify misclassification costs.
- Boosting and cross-validation are available.
- The algorithm for creating rulesets from trees is much improved.

Relation to PROC SPLIT

The tree created with C4.5 will differ from one created with PROC SPLIT for several reasons:

- C4.5 creates binary splits on interval inputs and multiway splits on nominal inputs, which favors nominal inputs. PROC SPLIT treats interval and nominal inputs the same in this respect.
- C4.5 uses a pruning method designed to avoid using validation data. PROC SPLIT expects validation data to be available and so does not offer the pessimistic pruning method of C4.5.
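The per-leaf pessimistic estimate can be sketched by inverting the binomial CDF with bisection. This is a Python illustration under assumptions: the confidence level 0.25 is C4.5's commonly cited default, not stated in the text above, and the function names are mine:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_error_limit(errors, n, cf=0.25):
    """Upper confidence limit on a leaf's true error rate.

    Finds the largest p such that observing `errors` or fewer mistakes
    out of n still has probability at least `cf`, by bisection on the
    (monotone decreasing in p) binomial CDF.  cf=0.25 is an assumed
    default; the text only says "an upper confidence limit".
    """
    lo, hi = errors / n, 1.0
    for _ in range(60):  # bisect to high precision
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return lo

# A leaf with 1 mistake in 10 observations is charged an error rate
# well above the observed 0.1 -- that pessimism is what drives pruning.
print(upper_error_limit(1, 10))
```

Summing these limits over the leaves of each candidate sub-tree, and keeping the sub-tree with the smallest sum, prunes without touching validation data.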
The option settings most similar to C4.5 are:

CRITERION = ERATIO
EXHAUSTIVE= 0 (forces a heuristic search)
MAXBRANCH = maximum number of nominal values in an input, up to 100
NODESAMPLE= size of data set, up to 32,000
NSURRS = 0
WORTH = 0
SUBTREE = ASSESSMENT
VALIDATA = validation data set

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.

The SPLIT Procedure: Examples

The following examples were executed under the HP-UX version 10.20 operating system; the version of the SAS System was 6.12TS045.

Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)
Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)

Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)

Features

- Specifying the Input Variables and the Target Variable
- Setting the Splitting Criterion
- Setting the Maximum Number of Child Nodes of a Node
- Specifying the Smallest Number of Training Observations a Node Must Have to Consider Splitting It
- Outputting and Printing Fit Statistics
- Creating a Misclassification Table
- Scoring Data with the SCORE Statement
- Reading the Input Data Set from a Previously Created Decision Tree (the OUTTREE= Data Set) with the INTREE= Option
- Creating Diagnostic Scatter Plots
- Creating Contour Plots of the Posterior Probabilities
- Creating a Scatter Plot of the Leaf Nodes

This example demonstrates how to create a decision tree with a categorical target. The ENTROPY splitting criterion is used to search for and evaluate candidate splitting rules. The example DMDB training data set SAMPSIO.DMDRING contains a categorical target with three levels (C = 1, 2, or 3) and two interval inputs (X and Y). There are 180 observations in the training data set. The SAMPSIO.DMSRING data set is scored using the scoring formula from the trained model.
Both data sets and the DMDB training catalog are stored in the sample library.

Program

   title  'SPLIT Example: RINGS Data';
   title2 'Plot of the Rings Training Data';
   goptions gunit=pct ftext=swiss ftitle=swissb htitle=4 htext=3;

   proc gplot data=sampsio.dmdring;
      plot y*x=c / haxis=axis1 vaxis=axis2;
      symbol  c=black i=none v=dot;
      symbol2 c=red   i=none v=square;
      symbol3 c=green i=none v=triangle;
      axis1 c=black width=2.5 order=(0 to 30 by 5);
      axis2 c=black width=2.5 minor=none order=(0 to 20 by 2);
   run;

   title2 'Entropy Criterion';
   proc split data=sampsio.dmdring
              dmdbcat=sampsio.dmdring
              criterion=entropy
