The arboretum procedure

Number >= support 0

Memory allocated megs: 2

NOTE: The PROCEDURE SEQUENCE used 0:00:33.42 real 0:00:16.17 cpu.

The NITEMS= option specifies the maximum number of events for which

rules, or chains, are generated.

nitems=4;

The SAME= option specifies the lower time-limit between the occurrence

of two events that you want to associate with each other (default = 0).

visit time / same=2;

run;

The SORT procedure sorts the observations in descending order by the

values of support.

proc sort data=s4out;

by descending support;

run;

The PRINT procedure lists the first 10 observations in the sorted sequence

data set.

proc print data=s4out(obs=10);

var count support conf rule;

title 'Partial Listing of the 4-Item Sequences';

title2 'Lower Timing Limit Set to 2';

run;

The SEQUENCE Procedure

References

Agrawal, R., Imielinski, T., and Swami, A. (1993), "Mining Association Rules between Sets of

Items in Large Databases", Proceedings, ACM SIGMOD Conference on Management of Data,

207-216, Washington, D. C.

Berry, M. J. A. and Linoff, G. (1997), Data Mining Techniques for Marketing, Sales, and

Customer Support, New York: John Wiley and Sons, Inc.

The SPLIT Procedure

Overview

Procedure Syntax

PROC SPLIT Statement

CODE Statement

DECISION Statement

DESCRIBE Statement

FREQ Statement

INPUT Statement

PRIORS Statement

PRUNE Statement

SCORE Statement

TARGET Statement

Details

Examples

Example 1: Creating a Decision Tree with a Categorical Target (Rings Data)

Example 2: Creating a Decision Tree with an Interval Target (Baseball Data)

References

Overview

An empirical decision tree represents a segmentation of the data created by applying a series of simple

rules. Each rule assigns an observation to a segment based on the value of one input. One rule is applied

after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and

each segment is called a node. The original segment contains the entire data set and is called the root

node of the tree. A node with all its successors forms a branch of the node that created it. The final nodes

are called leaves. For each leaf, a decision is made and applied to all observations in the leaf. The type of

decision depends on the context. In predictive modeling, the decision is simply the predicted value.

Besides modeling, decision trees can also select inputs or create dummy variables representing

interaction effects for use in a subsequent model, such as regression.
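
As a minimal sketch of the hierarchy described above, the following Python fragment models a tree as nested rules, each testing one input, and walks an observation from the root node to a leaf. It is an illustration only, not PROC SPLIT itself; the class names, the input names (age, income), the thresholds, and the decisions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    decision: str  # decision applied to every observation that reaches this leaf


@dataclass
class Node:
    input_name: str   # the single input this rule examines
    threshold: float  # rule: value < threshold goes left, otherwise right
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def assign(tree, obs):
    """Apply one rule after another until the observation lands in a leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.left if obs[node.input_name] < node.threshold else node.right
    return node.decision


# Hypothetical tree: the root splits on age, and its right branch splits on income.
tree = Node(
    "age", 30.0,
    Leaf("decline"),
    Node("income", 50000.0, Leaf("review"), Leaf("approve")),
)

print(assign(tree, {"age": 40, "income": 60000}))
```

Every observation that reaches the same leaf receives the same decision, which is what makes the leaves describable by simple rules.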

PROC SPLIT creates decision trees to do one of the following:

classify observations based on values of nominal or binary targets,

predict outcomes for interval targets, or

predict the appropriate decision when decision alternatives are specified.

PROC SPLIT can save the tree information in a SAS data set, which can be read again into the

procedure later.

PROC SPLIT can apply the tree to new data and create an output data set containing the predictions, or

the dummy variables for use in subsequent modeling. Alternatively, PROC SPLIT can generate DATA

step code for the same purpose.

Tree construction options include the popular features of CHAID (Chi-squared automatic interaction

detection) and those described in Classification and Regression Trees (Breiman et al. 1984).

For example, using chi-square or F-test p-values as a splitting criterion, tree construction may stop when

the adjusted p-value is less significant than a specified threshold level, as in CHAID.
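
The stopping rule can be sketched in Python for the simplest case: a binary split of a binary target, summarized as a 2x2 table of counts. The sketch assumes a Pearson chi-square test with one degree of freedom and a Bonferroni-style adjustment (multiplying the p-value by the number of candidate splits examined). The function names, the default threshold of 0.05, and the form of the adjustment are assumptions for illustration and are not taken from PROC SPLIT.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts (branch x target level)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Assumes every row and column total is nonzero.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def should_stop(table, n_candidate_splits, threshold=0.05):
    """Stop splitting when the adjusted p-value is less significant than the threshold."""
    adjusted = min(1.0, p_value_df1(chi_square_2x2(table)) * n_candidate_splits)
    return adjusted > threshold


# Hypothetical counts: rows are the two branches, columns are the two target levels.
print(should_stop([[30, 10], [10, 30]], n_candidate_splits=10))
print(should_stop([[20, 20], [20, 20]], n_candidate_splits=10))
```

A split that separates the target levels well yields a small adjusted p-value and construction continues; a split no better than chance yields a large one and construction stops at that node.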

When a tree is created for any splitting criterion, the best sub-tree for each possible number of leaves is

automatically found. The sub-tree that works best on validation data may be selected automatically, as in

the Classification and Regression Trees method. The notion of "best" is implemented using an

assessment function equal to a profit matrix (or function) of target values.
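
The selection step can be illustrated with a short Python sketch. It assumes that the best sub-tree for each number of leaves has already been found and that each one has been applied to the validation data, so that only the comparison remains. The profit matrix, the target values, and the decisions below are hypothetical, and breaking ties in favor of the smaller sub-tree is an assumption of this sketch.

```python
def average_profit(actuals, decisions, profit):
    """Assessment: mean profit over validation observations, looked up by (actual, decision)."""
    return sum(profit[(a, d)] for a, d in zip(actuals, decisions)) / len(actuals)


def select_subtree(candidates, actuals, profit):
    """Return (n_leaves, score) of the sub-tree with the highest validation profit.

    candidates maps a number of leaves to that sub-tree's decisions on the
    validation data. Ties go to the sub-tree with fewer leaves.
    """
    best = None
    for n_leaves in sorted(candidates):
        score = average_profit(actuals, candidates[n_leaves], profit)
        if best is None or score > best[1]:
            best = (n_leaves, score)
    return best


# Hypothetical profit matrix: approving a bad case costs 2, approving a good case earns 1.
profit = {
    ("good", "approve"): 1.0, ("good", "decline"): 0.0,
    ("bad", "approve"): -2.0, ("bad", "decline"): 0.0,
}
actuals = ["good", "good", "bad", "bad"]
candidates = {
    1: ["approve", "approve", "approve", "approve"],
    2: ["approve", "approve", "approve", "decline"],
    3: ["approve", "approve", "decline", "decline"],
    4: ["approve", "approve", "decline", "decline"],
}

print(select_subtree(candidates, actuals, profit))
```

Here the sub-trees with three and four leaves earn the same average profit on the validation data, so the sketch keeps the one with three leaves.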

Decision tree models are often easier to interpret than other models because the leaves are described

using simple rules. Another advantage of decision trees is in the treatment of missing data. The search

for a splitting rule uses the missing values of an input. Surrogate rules are available as backup when

missing data prohibit the application of a splitting rule.
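
The backup role of surrogate rules can be sketched in Python as an ordered list of rules: the primary rule is tried first, and each surrogate is consulted only when the input required by the rules before it is missing. The sketch covers only this fallback and not the use of missing values during the search for a splitting rule. The function names, the inputs, the thresholds, and the default branch used when every input is missing are assumptions for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def route(obs, rules, default_branch):
    """Send an observation left or right using the first rule whose input is present.

    rules is an ordered list of (input_name, threshold) pairs: the primary
    splitting rule first, followed by its surrogate rules.
    """
    for input_name, threshold in rules:
        value = obs.get(input_name)
        if not is_missing(value):
            return "left" if value < threshold else "right"
    return default_branch  # every input the rules need is missing


# Hypothetical primary rule on income with a surrogate rule on age.
rules = [("income", 50000.0), ("age", 35.0)]

print(route({"income": 60000, "age": 20}, rules, default_branch="left"))
print(route({"income": None, "age": 20}, rules, default_branch="right"))
```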

Procedure Syntax

PROC SPLIT <option(s)>;

CODE <option(s)>;

DECISION DECDATA=<libref.>SAS-data-set <DECVARS=decision-variable(s)> <option(s)>;

DESCRIBE <option(s)>;

FREQ variable;

IN | INPUT variable(s) </option(s)>;

PRIORS probabilities;

PRUNE node-identifier;

SCORE <score-option(s)>;

TARGET variable </LEVEL=value>;
