The ARBORETUM Procedure

Date: 30.04.2018

MEASURE=PROFIT | ASE | MISC | LIFT | LIFTPROFIT
NOPRIORS | PRIORS
PRUNEDATA= VALID | TRAIN
NOVALIDATA | VALIDATA=
NODES=
CODE
CATALOG=
DUMMY | NODUMMY

ASSESS Statement

the selection. The output statements, CODE, DESCRIBE, SAVE, and SCORE, use the subtree selected in the most recent ASSESS or SUBTREE statement, and ignore nodes not in the selected subtree.

The ASSESS, SUBTREE, and output statements end initialization and interactive training. If the SUBTREE statement or an output statement is the first statement after initialization or interactive training, an ASSESS statement is implied. The procedure computes a new subtree sequence using the most recently specified ASSESS statement options, and selects the best subtree before executing the SUBTREE or output statement.

If the ASSESS, SUBTREE, or output statement immediately follows initialization so that no interactive training statements appear, and the tree contains no more than the root node, then these statements will create a tree. Otherwise, if the root node is already split using information imported using the INMODEL= option in the PROC statement, then further split searches will not occur unless explicitly requested with interactive training statements.

Table 4 summarizes the options available in the ASSESS statement. An option remains in effect in subsequent occurrences of the ASSESS statement unless explicitly specified differently.

Table 4. ASSESS Statement Options

Option         Description
EVENT=         specifies categorical target value for LIFT
MEASURE=       specifies the assessment measure
NOPRIORS       ignores prior probabilities in subtree search
PROPORTION=    specifies proportion of observations for LIFT
PRUNEDATA=     specifies training or validation data for choosing subtrees
PRIORS         incorporates prior probabilities in subtree search
VALIDATA=      specifies validation data set
NOVALIDATA     terminates a previous VALIDATA= option

The following list describes these options. See the “Tree Assessment and the Subtree Sequence” section beginning on page 49 for more detail.

EVENT= category

specifies a formatted value of a categorical target to use with the LIFT assessment measure. If the EVENT= option is absent in one ASSESS statement, the last value specified in any ASSESS statement is used. If the EVENT= option has never been specified, the least frequent target value in the training data is used. The EVENT= option is ignored with an interval target and with other assessment measures.
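
The fallback order described for EVENT= can be sketched in a few lines of Python. This is only an illustration, not part of the procedure: the function name `resolve_event` and its arguments are hypothetical, and breaking ties between equally rare target values by first appearance is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter

def resolve_event(current, previous, training_targets):
    """Pick the target category used with the LIFT measure.

    current          -- EVENT= value in this ASSESS statement, or None
    previous         -- last EVENT= value from any earlier ASSESS statement, or None
    training_targets -- formatted target values from the training data
    """
    if current is not None:
        return current
    if previous is not None:
        return previous
    counts = Counter(training_targets)
    # Least frequent value; ties go to the value seen first (assumption).
    return min(counts, key=counts.get)

print(resolve_event(None, None, ["no", "no", "yes"]))
```

With no EVENT= value ever specified, the sketch falls back to the rarest category in the training data.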

MEASURE=PROFIT | ASE | MISC | LIFT | LIFTPROFIT

specifies the assessment measure. Table 5 summarizes the available measures.

Table 5. Assessment Measures

Measure        Description
ASE            Average square error
LIFT           Average or proportion among the highest ranked observations
LIFTPROFIT     Average profit or loss among the highest ranked observations
MISC           Proportion misclassified
PROFIT         Average profit or loss from the decision function

The default measure is PROFIT if the DECISION statement specifies a profit or loss function or if the target variable is ordinal. Otherwise the default measure for a nominal target is MISC, and the default for an interval target is ASE. MISC is applicable to nominal and ordinal targets. ASE is applicable to any kind of target.
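
The default rule above can be read as a small decision function. The Python sketch below is illustrative only; the function name `default_measure` and the string labels for target levels are hypothetical.

```python
def default_measure(target_level, has_profit_or_loss=False):
    """Return the default MEASURE= value as described in the text.

    target_level       -- "nominal", "ordinal", or "interval"
    has_profit_or_loss -- True if a DECISION statement gives a profit or loss function
    """
    if has_profit_or_loss or target_level == "ordinal":
        return "PROFIT"
    if target_level == "nominal":
        return "MISC"
    if target_level == "interval":
        return "ASE"
    raise ValueError(f"unknown target level: {target_level!r}")

print(default_measure("nominal"))
```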

For an interval target, the LIFT measure is the average target value among observations predicted to have the highest average. The PROPORTION= option specifies the proportion of observations to use. For a categorical target, the LIFT measure is the proportion of observations with the target value specified in the EVENT= option among observations with the highest posterior probability of the EVENT= target value.
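
Both forms of the LIFT measure reduce to the same computation: rank observations by a score, keep the top proportion, and average an outcome. The Python sketch below illustrates this under stated assumptions. The function name `lift` is hypothetical, the number of retained observations is rounded down with a minimum of one, the proportion is accepted in (0, 1], and ties in the ranking are left in input order. The text does not specify how the procedure handles any of these details.

```python
def lift(scores, outcomes, proportion):
    """Average outcome among the highest ranked observations.

    Interval target:    scores = predicted averages, outcomes = actual target values.
    Categorical target: scores = posterior probability of the EVENT= value,
                        outcomes = 1 if the observation has that value, else 0.
    """
    if not scores or len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must be non-empty and equal in length")
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")  # range endpoints are an assumption
    n_top = max(1, int(proportion * len(scores)))  # rounding rule is an assumption
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    return sum(outcome for _, outcome in ranked[:n_top]) / n_top

# Categorical target: proportion of events among the top half by posterior probability.
print(lift([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], 0.5))
# Interval target: average actual value among the top half by predicted average.
print(lift([10, 20, 30, 40], [12, 18, 33, 41], 0.5))
```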

NOPRIORS | PRIORS

specifies whether to ignore prior probabilities when creating the sequence of subtrees. The default is NOPRIORS, ignoring prior probabilities. The section “Formulas for Assessment Measures” on page 50 describes how prior probabilities enter into the formulae for evaluating subtrees.

PROPORTION=value

The PROPORTION= option specifies the proportion of observations to use with the LIFT and LIFTPROFIT assessment measures. The PROPORTION= option is ignored unless LIFT or LIFTPROFIT is specified. The value must be between 0 and 1. If absent, the most recent value specified with LIFT or LIFTPROFIT is used. Requesting LIFT or LIFTPROFIT without ever specifying the PROPORTION= option is an error.
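
The way PROPORTION= carries over between ASSESS statements can be sketched as a small stateful object. The Python below is an illustration, not the procedure's implementation. The class name `AssessOptions` is hypothetical, and two details are assumptions: a value given alongside a measure other than LIFT or LIFTPROFIT is discarded rather than remembered, and the accepted range is (0, 1].

```python
class AssessOptions:
    """Tracks the PROPORTION= value across successive ASSESS statements."""

    def __init__(self):
        self.proportion = None  # nothing specified yet

    def assess(self, measure, proportion=None):
        """Return the proportion in effect, or None when the option is ignored."""
        if measure not in ("LIFT", "LIFTPROFIT"):
            return None  # ignored; not remembered either (assumption)
        if proportion is not None:
            if not 0 < proportion <= 1:  # endpoints are an assumption
                raise ValueError("PROPORTION= must be between 0 and 1")
            self.proportion = proportion
        if self.proportion is None:
            raise ValueError("LIFT or LIFTPROFIT requested without PROPORTION=")
        return self.proportion

opts = AssessOptions()
print(opts.assess("LIFT", 0.1))    # explicit value
print(opts.assess("LIFTPROFIT"))   # reuses the most recent value
```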

PRUNEDATA= VALID | TRAIN

specifies whether to use training or validation data when evaluating subtrees for inclusion in the subtree sequence. The default is VALID. If PRUNEDATA=VALID and validation data exists, then the subtree chosen for a given number of leaves is one with the best assessment value using the validation data.
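
As a rough illustration of choosing one subtree per number of leaves, the Python sketch below picks the candidate with the best assessment value for each leaf count. Everything about the data layout is hypothetical (the dictionaries, the keys `leaves`, `train`, and `valid`, and the function name). Falling back to training values when no validation values are available is an assumption, as is passing the direction of "best" explicitly: lower is better for measures such as ASE and MISC, higher for PROFIT and LIFT.

```python
def best_subtree_per_size(subtrees, prunedata="VALID", higher_is_better=False):
    """Map each leaf count to the candidate subtree with the best assessment.

    subtrees -- list of dicts with keys "leaves", "train", and optionally "valid"
    """
    have_valid = all(s.get("valid") is not None for s in subtrees)
    # Use validation values only when requested and available (fallback is an assumption).
    key = "valid" if prunedata == "VALID" and have_valid else "train"
    best = {}
    for s in subtrees:
        current = best.get(s["leaves"])
        if current is None:
            best[s["leaves"]] = s
        elif higher_is_better and s[key] > current[key]:
            best[s["leaves"]] = s
        elif not higher_is_better and s[key] < current[key]:
            best[s["leaves"]] = s
    return best

candidates = [
    {"id": "a", "leaves": 2, "train": 0.20, "valid": 0.30},
    {"id": "b", "leaves": 2, "train": 0.25, "valid": 0.22},
    {"id": "c", "leaves": 3, "train": 0.10, "valid": 0.28},
]
print(best_subtree_per_size(candidates)[2]["id"])
```

In this made-up example, using a misclassification-style measure, the two-leaf candidate that looks better on training data is not the one that looks better on validation data.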

NOVALIDATA | VALIDATA= SAS-data-set

specifies the validation data set. The NOVALIDATA option nullifies any VALIDATA= option appearing in a previous ASSESS statement.

BRANCH Statement

BRANCH < options > ;

The BRANCH statement is an interactive training statement that splits leaves into branches using the primary candidate splitting rule defined in the leaves. The SETRULE and SEARCH statements create candidate rules. The PRUNE statement converts primary and competing rules to candidate rules when converting a node to a leaf. The BRANCH statement will not split a leaf without a candidate rule.

NODES=nodeids

restricts the creation of branches to leaves descended from nodes in the list of node identifiers, nodeids.

ONE

restricts branching to the one leaf with the best candidate splitting rule. If a list of nodes is specified in the NODES= option, the ONE option only considers the leaves descended from nodes in the list.
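
How NODES= and ONE narrow the set of leaves that BRANCH splits can be sketched with a toy tree. The Python below is illustrative only. The representation (a parent map and a worth value per candidate rule) and the function name are hypothetical. Two points are assumptions: a leaf that is itself listed in NODES= counts as in scope, and the "best" candidate rule is the one with the highest worth.

```python
def leaves_to_branch(parent, candidate_worth, nodes=None, one=False):
    """Return the leaves a BRANCH-like step would split.

    parent          -- dict: node id -> parent id (the root maps to None)
    candidate_worth -- dict: leaf id -> worth of its candidate rule;
                       leaves without a candidate rule are simply absent
    nodes           -- optional set of node ids, as in NODES=
    one             -- True to keep only the single best leaf, as in ONE
    """
    def in_scope(leaf):
        if nodes is None:
            return True
        node = leaf
        while node is not None:  # walk up towards the root
            if node in nodes:
                return True
            node = parent[node]
        return False

    eligible = sorted(leaf for leaf in candidate_worth if in_scope(leaf))
    if one and eligible:
        return [max(eligible, key=candidate_worth.get)]
    return eligible

# Node 1 is the root with children 2 and 3; node 2 has children 4 and 5.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}
worth = {3: 0.4, 4: 0.1, 5: 0.3}
print(leaves_to_branch(parent, worth, nodes={2}, one=True))
```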

CODE Statement

CODE < options > ;

The CODE statement generates SAS DATA step code that mimics the computations done by the SCORE statement. The DATA step code creates the same variables described in the section “SCORE Statement OUT= Output Data Set” on page 59. Using the CODE statement for a tree containing a rule with MISSING=DISTRIBUTE is an error.

CATALOG= catname | FILE= filename

specifies where to output the code. Catname specifies a catalog entry by providing a compound name with one to four of the levels in the form library.catalog.entry.type. The default library is determined by the SAS system option USER=, usually WORK. The default entry is SASCODE, and the default type is SOURCE. Filename specifies the name of the file to contain the code. Filename can be either:

1. A quoted string, the value of which is the name (including the extension, if any) of the file to be opened.

2. An unquoted SAS name of no more than eight characters. If this name has been assigned as a fileref in a FILENAME statement, the file specified in the FILENAME statement is opened. The special filerefs LOG and PRINT are always assigned. If the specified name is not an assigned fileref, the specified value is concatenated with the extension .txt before opening. For example, if FOO is not an assigned fileref, FILE=FOO would cause FOO.txt to be opened. If the name has more than eight characters, an error message is printed.

If no catalog or file is specified, then the code is output to the SAS log.
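
The rules for interpreting the FILE= value can be summarized in a short Python sketch. It is only an illustration of the resolution order described above. The function name, the `quoted` flag, the dictionary of assigned filerefs, and the placeholder strings returned for LOG and PRINT are all hypothetical. Names are compared without regard to case, since SAS names are not case sensitive.

```python
def resolve_code_file(value, quoted, assigned_filerefs):
    """Return the destination implied by a FILE= value.

    value             -- the text given to FILE=
    quoted            -- True if the value was written as a quoted string
    assigned_filerefs -- dict: fileref name -> file from a FILENAME statement
    """
    if quoted:
        return value  # used as the file name, extension included
    if len(value) > 8:
        raise ValueError("unquoted name exceeds eight characters")
    refs = {name.upper(): target for name, target in assigned_filerefs.items()}
    # LOG and PRINT are always assigned; these targets are placeholders.
    refs.setdefault("LOG", "<SAS log>")
    refs.setdefault("PRINT", "<SAS print output>")
    if value.upper() in refs:
        return refs[value.upper()]
    return value + ".txt"  # not a fileref, so .txt is appended

print(resolve_code_file("FOO", False, {}))
```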

DUMMY | NODUMMY

requests creation of a dummy variable for each leaf node. The variables have names _i_, for i = 1, 2, ..., L, where L is the number of leaves. The value of the dummy variable _i_ is 1 for observations assigned exclusively to leaf i, and 0 for observations not in leaf i. For observations distributed over more than one leaf, _i_ equals the proportion of the observation assigned to leaf i. The default is NODUMMY, suppressing the creation of dummy variables.
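
The values of the leaf dummy variables can be illustrated with a short Python sketch. The function name and the input format (a mapping from leaf index to the proportion of the observation assigned to that leaf) are hypothetical, and the sketch assumes the variable names are written with underscores, as in `_1_`, `_2_`, and so on.

```python
def leaf_dummies(assignment, n_leaves):
    """Build the dummy variables for one observation.

    assignment -- dict: leaf index (1-based) -> proportion of the observation
                  assigned to that leaf; leaves not listed get 0
    n_leaves   -- L, the number of leaves in the tree
    """
    return {f"_{i}_": assignment.get(i, 0) for i in range(1, n_leaves + 1)}

# An observation assigned exclusively to leaf 2 of 3.
print(leaf_dummies({2: 1}, 3))
# An observation distributed over leaves 1 and 3.
print(leaf_dummies({1: 0.25, 3: 0.75}, 3))
```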
