the generalized delta rule. In backpropagation, the difference (delta) between the output
value and the target value is the error.
RPROP
Requests the RPROP algorithm.
QPROP
Requests Quickprop.
See the following table for the defaults for weight-based optimization techniques for a given value
of the OBJECT= option.
Defaults for Weight-based Optimization Techniques
OBJECTIVE FUNCTION                OPTIMIZATION TECHNIQUE    NUMBER OF WEIGHTS
OBJECT=DEV                        LEVMAR                    0 to 100 weights
OBJECT=DEV                        QUANEW                    101 to 500 weights
OBJECT=DEV                        CONGRA                    501 or more weights
(All other objective functions)   QUANEW                    up to 500 weights
(All other objective functions)   CONGRA                    501 or more weights
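The default selection in the table can be sketched as a small lookup. This is an illustrative Python sketch only, not part of PROC NEURAL: the function name default_technique is hypothetical, the string "OTHER" stands in for any objective function other than DEV, and a network with exactly 501 weights is assumed to fall in the CONGRA row.

```python
def default_technique(objective: str, n_weights: int) -> str:
    """Return the default optimization technique for an objective and weight count.

    Hypothetical helper that mirrors the defaults table; it does not call SAS.
    """
    if n_weights < 0:
        raise ValueError("n_weights must be non-negative")
    # LEVMAR is the default only for OBJECT=DEV with at most 100 weights.
    if objective.upper() == "DEV" and n_weights <= 100:
        return "LEVMAR"
    # Every objective uses QUANEW up to 500 weights (assumed boundary).
    if n_weights <= 500:
        return "QUANEW"
    # 501 or more weights: conjugate gradient.
    return "CONGRA"


print(default_technique("DEV", 50))
print(default_technique("DEV", 300))
print(default_technique("OTHER", 50))
print(default_technique("OTHER", 2000))
```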
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The NEURAL Procedure
USE Statement
Sets all weights to values from a data set.
Category Action Statement - affects the network or the data sets. Options set in an action statement
affect only that statement.
USE SAS-data-set;
Required Arguments
SAS-data-set
Specifies an input data set that contains all the weights. Unlike the INITIAL statement, the USE
statement does not generate any random weights, so the data set must contain all of the
network weights and parameters.
Details
For details about neural network architecture and training, see the online Neural Network Node:
Reference documentation. For an introduction to predictive modeling, see the online Predictive
Modeling document. Both of these documents can be accessed by using the Help pull-down menu to
select the "Open the Enterprise Miner Nodes Help" item.
The BPROP, RPROP, and QPROP Algorithms Used in
PROC NEURAL
Introduction
While the standard backprop algorithm has been a very popular algorithm for training feedforward
networks, performance problems have motivated numerous attempts at finding faster algorithms.
The following discussion of the implementation of the backprop (BPROP), RPROP, and QPROP
algorithms in PROC NEURAL relates the details of these algorithms to the printed output produced
by the PDETAIL option. The discussion uses the algorithmic description and notation in
Schiffmann, Joost, and Werner (1994) as well as the Neural Net Frequently Asked Questions (FAQ)
available as a hypertext document readable by any World-Wide Web browser, such as Mosaic, under the
URL: ftp://ftp.sas.com/pub/neural/FAQ.html.
There is an important distinction between "backprop" (or "back propagation of errors") and the
"backpropagation algorithm".
The "back propagation of errors" is an efficient computational technique for computing the derivative of
the error function with respect to the weights and biases of the network. This derivative, more commonly
known as the error gradient, is needed for any first-order nonlinear optimization method. The standard
backpropagation algorithm is a method for updating the weights based on the gradient. It is a variation
of the simple "delta rule". See "What is backprop?" in part 2 of the FAQ for more details and references
on standard backprop, RPROP, and Quickprop.
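The distinction can be illustrated with a minimal Python sketch, assuming a toy network with one input, one tanh hidden unit, one linear output, and a squared-error function. The function names and the numeric values are hypothetical, and the code is not PROC NEURAL's implementation. The function backprop_gradient performs the back propagation of errors (it only computes the gradient), while standard_backprop_step is the separate update rule that uses that gradient.

```python
import math


def forward(x, w, v):
    """Forward pass: hidden activation h = tanh(w*x), output y = v*h."""
    h = math.tanh(w * x)
    return h, v * h


def error(x, t, w, v):
    """Squared error for a single case with target t."""
    _, y = forward(x, w, v)
    return 0.5 * (y - t) ** 2


def backprop_gradient(x, t, w, v):
    """Back propagation of errors: compute dE/dw and dE/dv only."""
    h, y = forward(x, w, v)
    delta_out = y - t                          # dE/dy at the output unit
    grad_v = delta_out * h                     # dE/dv
    delta_hid = delta_out * v * (1.0 - h * h)  # error propagated back through tanh
    grad_w = delta_hid * x                     # dE/dw
    return grad_w, grad_v


def standard_backprop_step(w, v, grads, learning_rate):
    """Standard backprop update: step against the gradient, scaled by the learning rate."""
    grad_w, grad_v = grads
    return w - learning_rate * grad_w, v - learning_rate * grad_v


# Hypothetical values for one training case and the initial weights.
x, t, w, v = 0.7, 0.2, 0.5, -1.3
grads = backprop_gradient(x, t, w, v)
w_new, v_new = standard_backprop_step(w, v, grads, learning_rate=0.1)
print(error(x, t, w, v), error(x, t, w_new, v_new))
```

The same gradient could be handed to any other first-order method, such as RPROP or Quickprop, without changing backprop_gradient.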
With any of the "prop" algorithms, PROC NEURAL allows detailed printing of the iterations. The
PDETAIL option in the TRAIN statement prints, for each iteration, all quantities involved in the
algorithm for each weight in the network. This option should be used with caution as it can result in
voluminous output. However, by restricting the number of iterations and number of non-frozen weights,
a manageable amount of information is produced. The purpose of the PDETAIL option is to allow you to
follow the nonlinear optimization of the error function for each of the network weights. For any
particular propagation method, as described below, all quantities used to compute an updated weight are
printed.
In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning
rate makes the weights and error function diverge, so there is no learning at all. If the error function is
quadratic, as in linear models, good learning rates can be computed from the Hessian matrix. If the error
function has many local and global optima, as in typical feedforward neural networks with hidden units,
the optimal learning rate often changes dramatically during the training process, because the Hessian
also changes dramatically. Trying to train a neural network using a constant learning rate is usually a
tedious process requiring much trial and error.
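The effect of the learning rate can be seen on the simplest quadratic case. The following Python sketch assumes a one-weight error function E(w) = 0.5*h*w**2, whose Hessian is the constant h. The function name and the numeric values are illustrative only. For this function the update multiplies w by (1 - learning_rate*h) at each iteration, so a rate of 1/h reaches the minimum in one step and any rate above 2/h diverges.

```python
def gradient_descent(hessian, learning_rate, w0=1.0, n_iter=20):
    """Run standard backprop-style updates on E(w) = 0.5 * hessian * w**2."""
    w = w0
    for _ in range(n_iter):
        gradient = hessian * w          # dE/dw for the quadratic error
        w -= learning_rate * gradient   # step size is proportional to the gradient
    return w


h = 4.0
too_low = gradient_descent(h, 0.001)    # barely moves from w0 = 1
good = gradient_descent(h, 1.0 / h)     # rate computed from the Hessian
too_high = gradient_descent(h, 0.6)     # above 2/h = 0.5, so the weight diverges
print(too_low, good, too_high)
```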
With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use
standard backprop at all, because vastly more efficient, reliable, and convenient batch training
algorithms exist such as Quickprop and RPROP.
Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as
standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function
of the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need
a large step size; this happens when you initialize a network with small random weights. In other regions
of the weight space, the gradient is small and you need a small step size; this happens when you are close
to a local minimum. Likewise, a large gradient may call for either a small step or a large step. Many
algorithms try to adapt the learning rate, but any algorithm that multiplies the learning rate by the
gradient to compute the change in the weights is likely to produce erratic behavior when the gradient
changes abruptly. The great advantage of Quickprop and RPROP is that they do not have this excessive
dependence on the magnitude of the gradient. Conventional optimization algorithms use not only the
gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a
good step size.
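The independence from gradient magnitude can be illustrated with a simplified sign-based update in the spirit of RPROP. This Python sketch is an assumption-laden illustration, not the algorithm as implemented in PROC NEURAL: it handles a single weight, omits weight backtracking, and the factors 1.2 and 0.5, the step limits, and all names are hypothetical choices. The step size grows while the gradient keeps its sign and shrinks when the sign flips; only the sign of the gradient sets the direction.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_minimize(grad_fn, w0, n_iter=30, step0=0.1,
                   inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
    """Simplified sign-based update for one weight; returns the weight path."""
    w, step, prev_grad = w0, step0, 0.0
    path = [w]
    for _ in range(n_iter):
        grad = grad_fn(w)
        if grad * prev_grad > 0:        # same sign as last time: accelerate
            step = min(step * inc, step_max)
        elif grad * prev_grad < 0:      # sign flipped: a minimum was overshot
            step = max(step * dec, step_min)
        w -= sign(grad) * step          # only the sign of the gradient is used
        prev_grad = grad
        path.append(w)
    return path


def grad(w):
    """Gradient of the illustrative error E(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


def scaled_grad(w):
    """Same gradient, magnified 1000 times."""
    return 1000.0 * grad(w)


path = rprop_minimize(grad, w0=0.0)
scaled_path = rprop_minimize(scaled_grad, w0=0.0)
print(path[-1], path == scaled_path)
```

Scaling the gradient by 1000 leaves the weight path unchanged, whereas a method that multiplies a learning rate by the gradient would take steps 1000 times larger.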
Mathematical Notation
It is helpful to establish notation so that we can relate quantities and describe algorithms.
1. $w_{ij}(n)$ is the weight associated with the connection between the ith unit in the current layer
and the jth unit from the previous layer. The argument n refers to iteration.
2. $\Delta w_{ij}(n)$ is the update or change for weight $w_{ij}(n)$. This update results in the
(n+1)th iteration value for $w_{ij}$.
3. $\frac{\partial E}{\partial w_{ij}}(n)$ is the partial derivative of the error function $E$ with
respect to the weight $w_{ij}$ at the nth iteration.
4. $y_k(x_m, w)$ is the kth component of the output vector for the mth case as a function of the
inputs $x_m$ and network weights $w$.
5. $t_k(x_m)$ is the kth component of the target vector for the mth case as a function of the
inputs $x_m$.
The basic algorithm in all methods is a generic update given by