For an interval target variable, the prediction of an observation is the average of the
target values in the training data in the leaf to which the tree assigns the observation.
For one method of handling missing values, and in some nonstandard models based
on recursive partitioning, an observation may be assigned to more than one leaf. For
a categorical target, the posterior probabilities for the observation are a weighted
average of the posterior probabilities associated with each leaf. For an interval target,
the prediction is a weighted average of the predictions in each leaf.
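
The averaging described above can be sketched in a few lines of Python. This is only an illustration, not the procedure's implementation: the function names are hypothetical, and the source does not say how the per-leaf weights are obtained, so they are simply passed in and normalized here.

```python
from statistics import mean


def leaf_prediction(leaf_targets):
    """Interval target: mean of the training target values in one leaf."""
    return mean(leaf_targets)


def weighted_prediction(leaf_predictions, weights):
    """Weighted average of per-leaf predictions (weights normalized to sum to 1)."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, leaf_predictions)) / total


def weighted_posteriors(leaf_posteriors, weights):
    """Weighted average of per-leaf posterior probability vectors."""
    total = sum(weights)
    n_classes = len(leaf_posteriors[0])
    return [
        sum(w * post[k] for w, post in zip(weights, leaf_posteriors)) / total
        for k in range(n_classes)
    ]


# Hypothetical observation assigned to two leaves with weights 0.75 and 0.25.
leaf_a = leaf_prediction([10.0, 20.0, 30.0])
leaf_b = leaf_prediction([40.0, 60.0])
blended = weighted_prediction([leaf_a, leaf_b], weights=[0.75, 0.25])
blended_post = weighted_posteriors([[0.9, 0.1], [0.5, 0.5]], weights=[0.75, 0.25])
```

When an observation falls in exactly one leaf, the weight vector has a single entry and both functions reduce to that leaf's own prediction.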
Assigning a branch to an observation is often called a decision, and hence the term,
decision tree. Unfortunately, the terms decision and decision tree have different
meanings in the closely related discipline of decision theory. In decision theory, a
decision refers to the decision alternative whose utility or profit function is largest for
a given probability distribution of outcomes. The ARBORETUM procedure adopts
this definition, and will assign a decision to each observation when decision alternatives
and a profit or loss function are specified in the DECISION statement.
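
As a rough illustration of the decision-theoretic definition, the following Python sketch picks the decision alternative with the largest expected profit under an observation's posterior probabilities. The function name, the profit values, and the target and decision labels are all hypothetical; the sketch is not the syntax or internal logic of the DECISION statement.

```python
def best_decision(posteriors, profit):
    """Return (decision, expected_profit) for the alternative with the largest
    expected profit.

    posteriors: dict mapping target value -> probability
    profit:     dict mapping decision -> dict mapping target value -> profit
    """
    expected = {
        decision: sum(posteriors[t] * row[t] for t in posteriors)
        for decision, row in profit.items()
    }
    choice = max(expected, key=expected.get)
    return choice, expected[choice]


# Hypothetical posterior probabilities and profit matrix for one observation.
obs_posteriors = {"buyer": 0.2, "nonbuyer": 0.8}
profit_matrix = {
    "mail": {"buyer": 10.0, "nonbuyer": -1.0},
    "skip": {"buyer": 0.0, "nonbuyer": 0.0},
}
decision, value = best_decision(obs_posteriors, profit_matrix)
```

With a loss function instead of a profit function, the same idea applies with `min` in place of `max`.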
Basic Features
The ARBORETUM procedure provides the ability to mix tree-construction strategies
advocated by Kass (CHAID, 1980) and by Breiman, Friedman, Olshen, and Stone
(1984) to match the needs of the situation. It extends the p-value adjustments of Kass
and the retrospective pruning and misclassification costs of Breiman et al.
The basic features of the ARBORETUM procedure include
• Nominal, ordinal, and interval input and target variables
• Several splitting criteria
– Variance reduction for interval targets
– F-test for interval targets
– Gini or entropy reduction (Information Gain) for categorical targets
– CHAID for nominal targets
• Binary or n-ary splits, for fixed or unspecified n
• Several missing-value policies
– Use missing values in the split search
– Assign missing values to most correlated branch
– Distribute missing observations over all branches
• Surrogate rules for missing values and variable importance
• Cost-complexity pruning and reduced-error pruning with validation data
• Prior probabilities optionally used in training or assessment
• Misclassification cost matrix incorporating new decision alternatives
• Incorporation of nominal decision matrix in the split criterion
• Interactive training mode for specifying splits and nodes to prune
• Variable importance computed separately with training and validation data
• Generation of SAS DATA step code with an indicator variable for each leaf
• Generation of PMML
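
To make the reduction-based splitting criteria in the list above concrete, here is a minimal Python sketch using the textbook definitions of Gini impurity, entropy, and variance. The source does not give the exact formulas the procedure uses, so treat these definitions and all function names as assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity of a list of categorical target values."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (base 2) of a list of categorical target values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Population variance of a list of interval target values."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def impurity_reduction(parent, branches, impurity):
    """Parent impurity minus the size-weighted impurity of the branches."""
    n = len(parent)
    weighted = sum(len(b) / n * impurity(b) for b in branches)
    return impurity(parent) - weighted


# A hypothetical binary split that separates the two classes perfectly.
parent = ["a", "a", "b", "b"]
branches = [["a", "a"], ["b", "b"]]
gini_gain = impurity_reduction(parent, branches, gini)
entropy_gain = impurity_reduction(parent, branches, entropy)

# The same idea for an interval target, using variance as the impurity.
var_gain = impurity_reduction(
    [1.0, 1.0, 3.0, 3.0], [[1.0, 1.0], [3.0, 3.0]], variance
)
```

A split search would evaluate a quantity like this for each candidate split and keep the one with the largest reduction. The F-test and CHAID criteria rank splits by statistical significance rather than raw reduction and are not shown here.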
Enterprise Miner Tree Desktop Application
The SAS Enterprise Miner Tree Desktop Application is a Microsoft Windows application
enabling a user to
• view the results from the ARBORETUM procedure
• modify the tree created with the ARBORETUM procedure
• create a new tree
The Desktop Application is highly interactive, containing many tables and views that
may be independently arranged. Clicking on a variable, node, or subtree in one view
automatically selects the corresponding items in the other views. The tree may be
printed on a single page or across multiple pages.
The Desktop Application for SAS 9.1 runs on Windows NT, 2000, and XP. It may be
executed on its own, or launched from Enterprise Miner 4.3 and 5.1. It is automatically
installed with Enterprise Miner 4.3.
Getting Started
This section presents a simple example in three parts to introduce the syntax of the
ARBORETUM procedure. The first part runs the procedure with a minimum number
of statements. The procedure creates a sequence of increasingly complicated subtrees.
The second part of the example prints an assessment of each subtree, and selects
one that is different from the one the procedure selected. The third part illustrates how
to explicitly change a splitting rule. The section begins with an explanation of how
the procedure statements may alternate between a training phase and an assessment
and output phase.
Running the ARBORETUM Procedure
The ARBORETUM procedure runs in two or three phases:
• initialization
• interactive training (optional)
• model assessment and output
The initialization statements specify the training data, variable roles, and other options
that may not be set more than once. The interactive training statements are
optional. They allow complete control of the creation and modification of splitting
rules and of the deletion of nodes. The model assessment and output statements create
a sequence of subtrees, evaluate and select subtrees, apply the predictions to new
data, and save model estimates in SAS data sets.
The ARBORETUM procedure executes statements as soon as they are submitted.
The initialization statements must appear first. If interactive training is intended,
those statements would typically come next, followed by assessment and output statements.
Interactive training statements may follow model assessment and output statements,
which in turn may be repeated after interactive training.
Interactive training begins with an INTERACT statement. The INTERACT statement
specifies which subtree to begin with, and, implicitly, which nodes to permanently
delete. Interactive training ends with any model assessment or output statement. If
no interactive training statements appear before the first assessment or output statement,
and no nontrivial tree is imported using the INMODEL= option in the PROC
ARBORETUM statement, the ARBORETUM procedure will automatically create a
tree.
A RUN statement clears error conditions. A QUIT statement terminates the procedure.
A Brief Example
The following SAS code creates and saves a decision tree:
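/* Train a tree on SASHELP.SHOES with SALES as the target */
proc arboretum data=sashelp.shoes;
   target sales;
   input region subsidiary product stores;
   /* Save the results in SAS data sets */
   save summary=sum1
        sequence=seq1
        model=tree1
        ;
run;

/* Print the SUMMARY= data set */
proc print data=sum1 label;
run;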
The PROC ARBORETUM statement invokes the procedure. The DATA= option
specifies the training data to be the SHOES data set that exists in the SAS library,
SASHELP. The TARGET statement specifies SALES as the target variable. The
INPUT statement specifies input variables from the SHOES data set. No LEVEL=
option appears in the INPUT statement, and consequently the ARBORETUM
procedure assumes that the character variables, REGION, SUBSIDIARY, and
PRODUCT, have a nominal level of measurement, and the numeric variable,
STORES, has an interval level of measurement.
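
The default rule described in this paragraph (character variables are treated as nominal and numeric variables as interval when no LEVEL= option is given) can be written as a small lookup. The Python names below are hypothetical and only restate that rule; they do not reflect how the procedure is implemented.

```python
# Assumed default mapping, taken from the paragraph above.
DEFAULT_LEVELS = {"character": "nominal", "numeric": "interval"}


def default_level(variable_type):
    """Return the measurement level assumed when no LEVEL= option is given."""
    return DEFAULT_LEVELS[variable_type]


# Variable types of the inputs in the example, as stated in the text.
shoes_inputs = {
    "region": "character",
    "subsidiary": "character",
    "product": "character",
    "stores": "numeric",
}
levels = {name: default_level(vtype) for name, vtype in shoes_inputs.items()}
```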
The ARBORETUM procedure does not use the SAS Output Delivery System.
Instead, it saves results in SAS data sets that may be printed. The MODEL= data set
may also be input into a subsequent PROC ARBORETUM statement, eliminating the
need to respecify the data set and variables, or into the Enterprise Miner Tree
Desktop Application to continue the analysis or just explore the results graphically.