The ARBORETUM Procedure
after establishing the primary splitting rule. Consequently, creating a node, finding a
split, and finding surrogate splits require at least three passes over the data. A separate
search for a rule for missing values (and hence a separate pass) is necessary only
for splits that are defined in the SPLIT statement and for which the rule for missing
values is omitted. If the rule for missing values is present in the SPLIT statement, no
pass is needed for a split search in the node for any input.
The number of bytes needed for each search task is approximately equal to the
within-node sample size specified in the NODESIZE= option in the PERFORMANCE
statement, times 3, times the number of bytes in a double word, which is 8 on most
computers.
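As a worked instance of the arithmetic above, the following minimal Python sketch multiplies the three factors. The function name `search_task_bytes` and the sample size of 20,000 are illustrative assumptions, not values taken from the procedure.

```python
BYTES_PER_DOUBLE = 8  # size of a double word on most computers


def search_task_bytes(nodesize: int, bytes_per_double: int = BYTES_PER_DOUBLE) -> int:
    """Approximate bytes for one search task: NODESIZE * 3 * bytes per double."""
    return nodesize * 3 * bytes_per_double


# Hypothetical within-node sample size, chosen only for illustration.
print(search_task_bytes(20_000))  # 480000
```

With these assumed numbers, one search task needs roughly 480,000 bytes, a little under half a megabyte.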
Memory Considerations
Reserving more memory may reduce the number of data passes, but may not reduce
the execution time if a large proportion of the memory is virtual memory swapped
to disk. A computer operating system can allocate more memory to the programs
running on the system than is physically available. When the operating system
detects that no program is using an allocated section of physical memory, the system
copies the contents of the section to disk, an action commonly called swapping out,
and then reassigns the section to satisfy another request for memory. When the
program that created the original contents tries to access them, the operating system
finds another dormant section of physical memory, swaps that section to disk, and swaps
the original contents into the new section of physical memory. The programs appear to
have access to more memory than physically exists. The apparent amount of memory
is called virtual memory.
By default, the ARBORETUM procedure estimates the amount of memory it will
need for tree construction tasks, asks the operating system how much physical
memory is available, and then allocates just enough to perform the tasks, or all of
physical memory, whichever is smaller. The estimate assumes that
all split searches in a node are done in the same pass. The MEMSIZE= option in
the PERFORMANCE statement overrides the default process. The SAS MEMSIZE
system option sets limits on the amount of physical memory available to the tasks.
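The "whichever is smaller" rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `default_allocation` is hypothetical, and the estimate is simplified to one search task per input, each sized as NODESIZE times 3 times 8 bytes. The procedure's actual estimate may include other tasks.

```python
def default_allocation(n_inputs: int, nodesize: int, physical_bytes: int,
                       bytes_per_double: int = 8) -> int:
    """Return the smaller of the estimated need and the available physical memory."""
    per_search = nodesize * 3 * bytes_per_double
    # Assumption: all split searches in a node run in the same pass,
    # modeled here as one search task per input variable.
    estimate = n_inputs * per_search
    return min(estimate, physical_bytes)


# Hypothetical values: 50 inputs, within-node sample of 20,000 observations.
print(default_allocation(50, 20_000, physical_bytes=16 * 1024**2))  # 16777216
print(default_allocation(50, 20_000, physical_bytes=1024**3))       # 24000000
```

In the first call the estimate (24,000,000 bytes) exceeds the assumed 16 MiB of physical memory, so the allocation is capped. In the second call the estimate fits and is used as is.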
IMPORTANCE= Output Data Set
The IMPORTANCE= option in the SAVE statement specifies the output data set to
contain the measure of relative importance of each input variable in the selected
subtree. The ASSESS and SUBTREE statements determine the subtree. Each
observation describes an input variable. The observations are in order of decreasing
importance as computed with the training data.
Variable Importance
The relative importance of input variable v in subtree T is computed as
$$I(v; T) \propto \sum_{\tau \in T} a(s_v, \tau)\, \Delta SSE(\tau)$$

where the sum is over nodes $\tau$ in $T$, and $s_v$ denotes the primary or surrogate splitting
rule using $v$. $a(s_v, \tau)$ is the measure of agreement for the rule using $v$ in node $\tau$:

$$a(s_v, \tau) = \begin{cases}
1 & \text{if } s_v \text{ is the primary splitting rule} \\
\text{agreement} & \text{if } s_v \text{ is a surrogate rule} \\
0 & \text{otherwise}
\end{cases}$$
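The weighted sum above can be illustrated with a minimal Python sketch. The node layout (dictionaries with `primary`, `surrogates`, and `delta_sse` keys), the function name, and the choice to scale so that the largest value is 1 are all assumptions made for the example. The source states only a proportionality, and the per-node reductions are taken as given inputs.

```python
from collections import defaultdict


def relative_importance(nodes):
    """Accumulate a(s_v, tau) * delta_SSE(tau) per variable over the nodes."""
    totals = defaultdict(float)
    for node in nodes:
        # The primary splitting rule has agreement 1.
        totals[node["primary"]] += 1.0 * node["delta_sse"]
        # A surrogate rule is weighted by its agreement with the primary rule.
        for var, agreement in node.get("surrogates", {}).items():
            totals[var] += agreement * node["delta_sse"]
    # Variables used in no rule contribute 0 and simply never appear.
    top = max(totals.values(), default=0.0)
    # Assumed scaling: divide by the largest total so the top variable gets 1.
    return {v: (t / top if top > 0 else 0.0) for v, t in totals.items()}


# Hypothetical two-node subtree with made-up reductions in SSE.
nodes = [
    {"primary": "age", "delta_sse": 40.0, "surrogates": {"income": 0.5}},
    {"primary": "income", "delta_sse": 10.0, "surrogates": {}},
]
print(relative_importance(nodes))  # {'age': 1.0, 'income': 0.75}
```

Here `income` earns 0.5 × 40 as a surrogate in the first node plus 10 as the primary rule in the second, for a total of 30 against 40 for `age`.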
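$\Delta SSE(\tau)$ is the reduction in the sum of squared errors from the predicted values:

$$\Delta SSE(\tau) = SSE(\tau) - \sum_{b \in B(\tau)} SSE(\tau_b)$$

$$SSE(\tau) = \begin{cases}
\sum_{i=1}^{N(\tau)} \left(Y_i - \hat{Y}(\tau)\right)^2 & \text{for interval target } Y \\
\sum_{i=1}^{N(\tau)} \sum_{j=1}^{J} \left(\delta_{ij} - \hat{p}_j(\tau)\right)^2 & \text{for target with } J \text{ categories}
\end{cases}$$

where
$B(\tau)$ = set of branches from $\tau$
$\tau_b$ = child node of $\tau$ in branch $b$
$N(\tau)$ = number of observations in $\tau$
$\hat{Y}(\tau)$ = average $Y$ in training data in $\tau$
$\delta_{ij}$ = 1 if $Y_i = j$, 0 otherwise
$\hat{p}_j(\tau)$ = average $\delta_{ij}$ in training data in $\tau$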
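The definitions above can be checked on small inputs with the following Python sketch. It covers only the training-data case, where the node averages are computed from the same observations being scored, and it assumes each node is nonempty. The function names and the sample values are illustrative.

```python
def sse_interval(y):
    """Sum of squared deviations of an interval target from its node mean."""
    mean = sum(y) / len(y)
    return sum((yi - mean) ** 2 for yi in y)


def sse_categorical(labels, categories):
    """Double sum over observations i and categories j of (delta_ij - p_hat_j)^2."""
    n = len(labels)
    p_hat = {j: sum(1 for y in labels if y == j) / n for j in categories}
    return sum(
        ((1.0 if y == j else 0.0) - p_hat[j]) ** 2
        for y in labels
        for j in categories
    )


def delta_sse(parent_sse, child_sses):
    """Reduction in SSE: parent SSE minus the sum of the child-node SSEs."""
    return parent_sse - sum(child_sses)


# Hypothetical node with four observations split into two branches.
parent = [1.0, 2.0, 3.0, 6.0]
left, right = [1.0, 2.0], [3.0, 6.0]
print(sse_interval(parent))                                       # 14.0
print(delta_sse(sse_interval(parent),
                [sse_interval(left), sse_interval(right)]))       # 9.0
print(sse_categorical(["a", "a", "b", "b"], ["a", "b"]))          # 2.0
```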
For a categorical target, the formula for $SSE(\tau)$ reduces to

$$SSE(\tau) = \begin{cases}
N \left(1 - \sum_{j=1}^{J} \hat{p}_j^2\right) & \text{for training data} \\
N \left(1 - \sum_{j=1}^{J} (2 p_j - \hat{p}_j)\, \hat{p}_j\right) & \text{for validation data}
\end{cases}$$

where $p_j$ is the proportion of the validation data with target value $j$, and $N$, $p_j$, and
$\hat{p}_j$ are evaluated in node $\tau$.
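The reduction can be verified by expanding the square. The sketch below assumes that, for validation data, the sum runs over the N validation observations in the node while the estimates of the category proportions still come from the training data. It uses the facts that each observation belongs to exactly one category and that the indicator equals its own square.

```latex
\begin{align*}
\sum_{i=1}^{N} \sum_{j=1}^{J} (\delta_{ij} - \hat{p}_j)^2
  &= \sum_{i=1}^{N} \sum_{j=1}^{J} \delta_{ij}
     - 2 \sum_{j=1}^{J} \hat{p}_j \sum_{i=1}^{N} \delta_{ij}
     + N \sum_{j=1}^{J} \hat{p}_j^2 \\
  &= N - 2 N \sum_{j=1}^{J} p_j \hat{p}_j + N \sum_{j=1}^{J} \hat{p}_j^2
   = N \Bigl( 1 - \sum_{j=1}^{J} (2 p_j - \hat{p}_j)\, \hat{p}_j \Bigr)
\end{align*}
% Training data: p_j = \hat{p}_j, so the bracket becomes 1 - \sum_j \hat{p}_j^2.
```

On training data the observed and estimated proportions coincide, which gives the first case of the formula.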
Variables in the Data Set
The IMPORTANCE= data set contains the following variables:
• NAME, the input variable name
• LABEL, the input variable label
• NRULES, the number of splitting rules using this variable
• NSURROGATES, the number of surrogate rules using this variable
• IMPORTANCE, the relative importance computed with the training data
• V_IMPORTANCE, the relative importance computed with the validation data