The ARBORETUM Procedure

Overview

A decision tree is a type of predictive model developed independently in the statistics and artificial intelligence communities. A tree partitions large amounts of data into segments called terminal nodes or leaves. The data in a leaf determine estimates of the value of a target variable, the dependent variable to be predicted. These estimates are subsequently applied to predict the target of a new observation assigned to the leaf. The ARBORETUM procedure searches for partitions that fit the training data, the data used to compute the estimates. If these estimates fit new data well, the tree is said to generalize well. Good generalization is the primary goal for predictive tasks. A tree might fit the training data well but generalize poorly.

Decision trees may also help prepare data for other predictive models by suggesting which variables to use, suggesting interactions of variables, and providing segments for stratified modeling.

Trees are popular because they seem easy to use and understand. They seem easy to use because trees accept interval, ordinal, and nominal variables and tolerate missing values well. No understanding of statistical distributions is necessary because tree construction relies on frequencies of values in the training data set. Trees seem easy to understand because small trees clearly depict how a few variables characterize a target variable.

However, trees have shortcomings. Small trees, though easy to understand, are too simplistic to represent complex relationships in data, and large trees have many partitions that, collectively, may be difficult to comprehend. Trees require a lot of data to discover complex relationships, and even to fit a simple linear relationship well. Trees are therefore less efficient and less intuitive than a simple regression when a linear relationship exists. Even when a tree provides a simple and accurate description, other equally simple and accurate descriptions may exist. The tree would then give the false impression that certain variables uniquely explain the variations in the target values, whereas different variables would suggest a different interpretation that might generalize just as well.

Note: this document describes syntax for the ARBORETUM procedure in SAS 9.1. The syntax is likely to change in incompatible ways in future releases.

Terminology

The ARBORETUM procedure uses recursive partitioning to create a decision tree. Recursive partitioning partitions the data into subsets and then partitions each of the subsets, and so on. In the terminology of the tree metaphor, the subsets are nodes, the original data set is the root node, and the final, unpartitioned subsets are terminal nodes or leaves. Nodes that are not terminal nodes are sometimes called internal nodes. The subsets of a single partition are commonly called child nodes, thereby mixing the metaphor with genealogy, which also provides the terms descendent and ancestor nodes. A branch of a node consists of a child node and its descendents.
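The terminology above can be illustrated with a minimal Python sketch of recursive partitioning. It is not the ARBORETUM algorithm: the median split on a single input, the `Node` class, and the `min_size` and `max_depth` parameters are all assumptions made for illustration. It only shows how a root node is split into child nodes until leaves remain, and how a leaf's training data supply the estimate for a new observation.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple

@dataclass
class Node:
    rows: List[Tuple[float, float]]          # (x, y) observations in this node
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    threshold: Optional[float] = None        # hypothetical split point on x

    @property
    def is_leaf(self) -> bool:
        return not self.children

def grow(rows, min_size=2, depth=0, max_depth=2):
    """Recursively partition rows at the median of x (illustrative rule only)."""
    node = Node(rows)
    if depth >= max_depth or len(rows) < 2 * min_size:
        return node                          # too deep or too small: leave as a leaf
    xs = sorted(x for x, _ in rows)
    threshold = xs[len(xs) // 2]
    left = [r for r in rows if r[0] < threshold]
    right = [r for r in rows if r[0] >= threshold]
    if len(left) < min_size or len(right) < min_size:
        return node                          # split would create a tiny child
    node.threshold = threshold
    node.children = [grow(left, min_size, depth + 1, max_depth),
                     grow(right, min_size, depth + 1, max_depth)]
    return node

def leaves(node):
    """Collect the terminal nodes of the branch rooted at node."""
    if node.is_leaf:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def predict(node, x):
    """Send x down to a leaf and return the mean target of that leaf."""
    while not node.is_leaf:
        node = node.children[0] if x < node.threshold else node.children[1]
    return mean(y for _, y in node.rows)

root = grow([(x, 2 * x) for x in range(8)])  # the root node holds all the data
```

With these eight observations the sketch produces four leaves of two observations each, and the estimate for a new observation is the mean target value of the leaf it falls in.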

The ARBORETUM procedure defines a partition in terms of values of a single variable selected from a set of available variables called input variables. A rule assigning the variable values to branches is called a splitting rule. It is sometimes called the primary splitting rule when discussing alternative partitions of the same node. A competing rule refers to a partition considered on an input variable other than the one used in the primary rule. A leaf has no primary or competing rule, but may have a candidate rule ready for splitting the leaf. A surrogate rule is one chosen to emulate a primary rule, and is used when the primary rule cannot be used, most commonly when an observation is missing a value for the primary input variable.
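As a rough illustration of how a surrogate rule stands in for a primary rule, the following Python sketch tries the primary rule first and falls back to surrogates when the needed value is missing. The variable names `income` and `home_value`, the thresholds, and the `default_branch` fallback are hypothetical; the text above does not say what the procedure does when every rule fails.

```python
def assign_branch(obs, primary, surrogates, default_branch=0):
    """Return the branch index for obs, a dict of input variable values.

    Each rule is a (variable_name, function) pair whose function maps a
    value to a branch index. Rules are tried in order, primary first,
    until one finds a non-missing value.
    """
    for variable, rule in [primary, *surrogates]:
        value = obs.get(variable)
        if value is not None:
            return rule(value)
    # Assumption for this sketch: use a fixed branch when no rule applies.
    return default_branch

# Hypothetical rules: the surrogate is meant to emulate the primary split.
primary = ("income", lambda v: 0 if v < 50_000 else 1)
surrogate = ("home_value", lambda v: 0 if v < 200_000 else 1)

complete = {"income": 60_000, "home_value": 100_000}
missing_income = {"income": None, "home_value": 100_000}
```

Here `assign_branch(complete, primary, [surrogate])` uses the primary rule, while `assign_branch(missing_income, primary, [surrogate])` falls back to the surrogate rule.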

The ARBORETUM procedure searches for a splitting rule that maximizes an association between the target variable and the node partitions. The splitting criterion defines the measure of worth of the rule.
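To make the idea of a measure of worth concrete, here is a small Python sketch that scores candidate binary splits by the reduction in Gini impurity and keeps the best one. The Gini criterion is an assumption chosen for illustration; the text above does not say which criteria the procedure offers, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of categorical target values."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def worth(parent, branches):
    """Impurity of the parent minus the size-weighted impurity of its branches."""
    n = len(parent)
    weighted = sum(len(branch) / n * gini(branch) for branch in branches)
    return gini(parent) - weighted

def best_threshold(xs, ys):
    """Search binary splits 'x < t' on one input and return (t, worth)."""
    best = (None, -1.0)
    for t in sorted(set(xs))[1:]:            # skip the minimum: left would be empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        w = worth(ys, [left, right])
        if w > best[1]:
            best = (t, w)
    return best

split = best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"])
```

In this toy data the split `x < 3` separates the two target values perfectly, so it has the largest worth among the candidates.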

A nominal variable is a numeric or character categorical variable in which the categories are unordered. An ordinal variable is a numeric or character categorical variable in which the categories are ordered. An interval variable is a numeric variable for which differences of values are informative. The measurement level of the variable is the property of being nominal, ordinal, or interval.

The ARBORETUM procedure uses normalized, formatted values of categorical variables, and considers two categorical values the same if the normalized values are the same. Normalization removes any leading blank spaces from a value, converts lower case characters to upper case, and truncates the value to 32 characters. The FORMAT procedure in the SAS Procedures Guide explains how to define a format. A FORMAT statement in the current run of PROC ARBORETUM or in the DATA step that created the training data associates a format with a variable. By default, numeric variables use the BEST12 format, and the formatted values of character variables are the same as the unformatted ones.
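The normalization steps described above can be mimicked in a few lines of Python. This is only a sketch of the three stated steps applied in the stated order, operating on a value that is assumed to be already formatted; the function name and the `max_len` parameter are hypothetical.

```python
def normalize(formatted_value: str, max_len: int = 32) -> str:
    """Strip leading blanks, convert to upper case, truncate to max_len characters."""
    return formatted_value.lstrip(" ").upper()[:max_len]

# Two values count as the same category when their normalized forms match.
same_category = normalize("  yes") == normalize("Yes")
```

One consequence is that two long values that differ only after the 32nd character fall into the same category.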

The relative proportions of categorical target values in the training data may differ from the proportions in the data to which the tree will be applied. Estimates of these latter proportions should be specified with prior probabilities when the tree is created. If the prior probabilities are the same as the proportions of the target values in the training data, then the predicted probabilities for an observation, also called the posterior probabilities, equal the proportions of the target values in the training data in the leaf to which the observation is assigned. If the prior probabilities differ from the training data proportions, then the posterior probabilities will differ from the leaf proportions as well.
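The relationship between priors and posteriors can be sketched in Python with a commonly used adjustment: weight each class's count in the leaf by the ratio of its prior probability to its count in the whole training data, then renormalize. This formula is an assumption, since the text above does not state the exact computation, but it reproduces the stated property that priors equal to the training proportions yield the leaf proportions. All names and counts are hypothetical.

```python
def posterior(leaf_counts, train_counts, priors):
    """Prior-adjusted class probabilities for one leaf.

    leaf_counts[j]  : training observations of class j in the leaf
    train_counts[j] : training observations of class j in all the training data
    priors[j]       : prior probability of class j
    """
    weights = {j: priors[j] * leaf_counts.get(j, 0) / train_counts[j]
               for j in priors}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("leaf has no observations with a positive prior")
    return {j: w / total for j, w in weights.items()}

train_counts = {"yes": 500, "no": 500}       # balanced training sample
leaf_counts = {"yes": 30, "no": 10}          # 75% "yes" within the leaf

# Priors equal to the training proportions reproduce the leaf proportions.
same = posterior(leaf_counts, train_counts, {"yes": 0.5, "no": 0.5})

# Priors saying "yes" is rare in the target population pull its posterior down.
rare = posterior(leaf_counts, train_counts, {"yes": 0.1, "no": 0.9})
```

In this example `same["yes"]` is 0.75, matching the leaf proportion, while `rare["yes"]` drops to 0.25 because the priors say that "yes" is much rarer in the data to be scored than in the training sample.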
