The arboretum procedure


Number >= support             0

Memory allocated megs:        2

NOTE: The PROCEDURE SEQUENCE used 0:00:33.42 real 0:00:16.17 cpu.

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.


The NITEMS= option specifies the maximum number of events for which rules, or chains, are generated.



The SAME= option specifies the lower time limit between the occurrence of two events that you want to associate with each other (default = 0).

   visit time / same=2;



The SORT procedure sorts the observations in descending order by the values of support.

proc sort data=s4out;
   by descending support;
run;



The PRINT procedure lists the first 10 observations in the sorted sequence data set.

proc print data=s4out(obs=10);
   var count support conf rule;
   title 'Partial Listing of the 4-Item Sequences';
   title2 'Lower Timing Limit Set to 2';
run;


The SEQUENCE Procedure

References

Agrawal, R., Imielinski, T., and Swami, A. (1993), "Mining Association Rules between Sets of Items in Large Databases," Proceedings, ACM SIGMOD Conference on Management of Data, 207-216, Washington, D.C.

Berry, M. J. A. and Linoff, G. (1997), Data Mining Techniques for Marketing, Sales, and Customer Support, New York: John Wiley and Sons, Inc.


The SPLIT Procedure



Procedure Syntax

PROC SPLIT Statement

CODE Statement

DECISION Statement

DESCRIBE Statement

FREQ Statement

INPUT Statement

PRIORS Statement

PRUNE Statement

SCORE Statement

TARGET Statement



Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)

Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)



The SPLIT Procedure


An empirical decision tree represents a segmentation of the data created by applying a series of simple rules. Each rule assigns an observation to a segment based on the value of one input. One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node. The original segment contains the entire data set and is called the root node of the tree. A node together with all its successors forms a branch of the node that created it. The final nodes are called leaves. For each leaf, a decision is made and applied to all observations in the leaf. The type of decision depends on the context. In predictive modeling, the decision is simply the predicted value.

Besides modeling, decision trees can also select inputs or create dummy variables representing interaction effects for use in a subsequent model, such as regression.
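The hierarchy of rules described above can be illustrated with a minimal Python sketch. It is not PROC SPLIT code and does not reflect how the procedure stores a tree. The Node class, the assign function, the input names (age, income), and the cut points are all hypothetical, chosen only to show one rule applied after another until a leaf supplies the decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One segment of the data; a leaf carries a decision instead of a rule."""
    input_name: Optional[str] = None    # input tested by the rule (None for a leaf)
    threshold: Optional[float] = None   # hypothetical cut point for the rule
    left: Optional["Node"] = None       # segment for values below the threshold
    right: Optional["Node"] = None      # segment for all other values
    decision: Optional[str] = None      # decision applied to every observation in a leaf

    def is_leaf(self) -> bool:
        return self.input_name is None


def assign(node: Node, observation: dict) -> str:
    """Apply one rule after another until a leaf is reached."""
    while not node.is_leaf():
        value = observation[node.input_name]
        node = node.left if value < node.threshold else node.right
    return node.decision


# The root node holds the entire data set; each rule looks at exactly one input.
root = Node(
    input_name="age", threshold=30.0,
    left=Node(decision="no"),
    right=Node(
        input_name="income", threshold=50000.0,
        left=Node(decision="no"),
        right=Node(decision="yes"),
    ),
)

print(assign(root, {"age": 45, "income": 62000}))
```

In this sketch the right child of the root, together with its two leaves, is a branch of the root node.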

PROC SPLIT creates decision trees to do one of the following:

   classify observations based on values of nominal or binary targets

   predict outcomes for interval targets

   predict the appropriate decision when decision alternatives are specified

PROC SPLIT can save the tree information in a SAS data set, which can be read again into the procedure later.

PROC SPLIT can apply the tree to new data and create an output data set containing the predictions, or the dummy variables for use in subsequent modeling. Alternatively, PROC SPLIT can generate DATA step code for the same purpose.

Tree construction options include the popular features of CHAID (chi-squared automatic interaction detection) and those described in Classification and Regression Trees (Breiman et al. 1984).

For example, using chi-square or F-test p-values as a splitting criterion, tree construction may stop when the adjusted p-value is less significant than a specified threshold level, as in CHAID.

When a tree is created for any splitting criterion, the best sub-tree for each possible number of leaves is automatically found. The sub-tree that works best on validation data may be selected automatically, as in the Classification and Regression Trees method. The notion of "best" is implemented using an assessment function equal to a profit matrix (or function) of target values.
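The selection step can be sketched in Python under some loudly stated assumptions. The profit matrix values, the candidate sub-trees, and the tie-breaking rule (prefer fewer leaves) are invented for illustration and are not taken from PROC SPLIT. The sketch assumes the best sub-tree for each leaf count is already known and shows only how a profit-based assessment on validation data could choose among them.

```python
# Hypothetical profit matrix: PROFIT[actual_target][decision]
PROFIT = {
    "good": {"accept": 10.0, "reject": 0.0},
    "bad": {"accept": -50.0, "reject": 0.0},
}


def assess(decisions, targets, profit=PROFIT):
    """Average profit of the decisions a sub-tree makes on validation data."""
    total = sum(profit[t][d] for d, t in zip(decisions, targets))
    return total / len(targets)


def select_subtree(candidates, targets):
    """Return the leaf count whose sub-tree has the best validation assessment.

    candidates maps a number of leaves to the decisions that sub-tree makes
    on the validation rows. Ties go to the smaller tree, which is an
    assumption of this sketch rather than documented behavior.
    """
    return max(candidates, key=lambda n: (assess(candidates[n], targets), -n))


validation_targets = ["good", "good", "bad", "good", "bad"]

# One candidate per number of leaves (made-up decisions for illustration).
candidates = {
    1: ["accept", "accept", "accept", "accept", "accept"],
    2: ["accept", "accept", "reject", "accept", "accept"],
    3: ["accept", "accept", "reject", "accept", "reject"],
    4: ["accept", "reject", "reject", "accept", "reject"],
}

best = select_subtree(candidates, validation_targets)
print(best, assess(candidates[best], validation_targets))
```

With these made-up numbers the three-leaf sub-tree has the highest average profit, and the four-leaf sub-tree does worse on validation data even though it is larger.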

Decision tree models are often easier to interpret than other models because the leaves are described using simple rules. Another advantage of decision trees is in the treatment of missing data. The search for a splitting rule uses the missing values of an input. Surrogate rules are available as backup when missing data prohibit the application of a splitting rule.
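The backup role of surrogate rules can be pictured with a short Python sketch. It covers only the fallback idea, not the use of missing values during the split search, and it is not PROC SPLIT's implementation. The route function, the (input, threshold) rule format, the input names, and the default branch are all hypothetical.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def route(observation, rules, default="left"):
    """Try the primary rule first, then each surrogate rule in order.

    rules is a list of (input_name, threshold) pairs; the first pair is the
    primary splitting rule and the rest are surrogates. If every input is
    missing, fall back to a default branch (an assumption of this sketch).
    """
    for input_name, threshold in rules:
        value = observation.get(input_name)
        if not is_missing(value):
            return "left" if value < threshold else "right"
    return default


# Primary rule on income, surrogate rule on age (both made up).
rules = [("income", 50000.0), ("age", 30.0)]

print(route({"income": 60000.0, "age": 25}, rules))       # primary rule applies
print(route({"income": float("nan"), "age": 25}, rules))  # surrogate rule applies
```

A real surrogate would be chosen because it sends observations to the same branches as the primary rule most of the time; here the two rules are simply listed by hand.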


The SPLIT Procedure

Procedure Syntax

PROC SPLIT <option(s)>;
   CODE <option(s)>;
   DECISION DECDATA=<libref.>SAS-data-set <DECVARS=decision-variable(s)> <option(s)>;
   DESCRIBE <options>;
   FREQ variable;
   IN | INPUT variable(s) </option(s)>;
   PRIORS probabilities;
   PRUNE node-identifier;
   SCORE <score-option(s)>;
   TARGET variable </ LEVEL=value>;
