the generalized delta rule. In backpropagation, the difference (delta) between the output value and the target value is the error.

RPROP

Requests the RPROP algorithm.

QPROP

Requests Quickprop.

The following table lists the default weight-based optimization technique for each value of the OBJECT= option and each range of the number of weights.

**Defaults for Weight-based Optimization Techniques**

| OBJECTIVE FUNCTION | OPTIMIZATION TECHNIQUE | NUMBER OF WEIGHTS |
|---|---|---|
| OBJECT=DEV | LEVMAR | 0 to 100 weights |
| OBJECT=DEV | QUANEW | 101 to 500 weights |
| OBJECT=DEV | CONGRA | 501 or more weights |
| (All other objective functions) | QUANEW | up to 500 weights |
| (All other objective functions) | CONGRA | 501 or more weights |
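The defaults in this table amount to a simple lookup on the objective function and the number of weights. The following Python sketch is only an illustration of that lookup and is not part of PROC NEURAL; the function name `default_technique` is hypothetical, and the sketch assumes that a network with exactly 501 weights falls under CONGRA.

```python
def default_technique(objective: str, n_weights: int) -> str:
    """Return the default optimization technique from the defaults table.

    Hypothetical helper for illustration only; PROC NEURAL selects the
    default internally. Assumes the QUANEW range ends at 500 weights.
    """
    if n_weights < 0:
        raise ValueError("n_weights must be non-negative")
    if objective.upper() == "DEV":
        if n_weights <= 100:
            return "LEVMAR"   # Levenberg-Marquardt for small networks
        if n_weights <= 500:
            return "QUANEW"   # quasi-Newton for medium networks
        return "CONGRA"       # conjugate gradient for large networks
    # All other objective functions
    return "QUANEW" if n_weights <= 500 else "CONGRA"
```

For example, `default_technique("DEV", 250)` returns `"QUANEW"`, while the same weight count with any other objective also returns `"QUANEW"` because LEVMAR is a default only for OBJECT=DEV.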

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.

*The NEURAL Procedure*
**USE Statement**
**Sets all weights to values from a data set.**
**Category** Action Statement - affects the network or the data sets. Options set in an action statement affect only that statement.

**USE** *SAS-data-set*;

**Required Arguments**
**SAS-data-set**
Specifies an input data set that contains all the weights. Unlike the INITIAL statement, the USE statement does not generate any random weights; therefore, the data set must contain all of the network weights and parameters.

**Details**
For details about neural network architecture and training, see the online Neural Network Node: Reference documentation. For an introduction to predictive modeling, see the online Predictive Modeling document. Both of these documents can be accessed by using the Help pull-down menu to select the "Open the Enterprise Miner Nodes Help" item.

**The BPROP, RPROP, and QPROP Algorithms Used in PROC NEURAL**

**Introduction**

While standard backprop has been a very popular algorithm for training feedforward networks, performance problems have motivated numerous attempts at finding faster algorithms. The following discussion of the implementation of the backprop (BPROP), RPROP, and QPROP algorithms in PROC NEURAL relates the details of these algorithms to the printed output that results from the PDETAIL option. The discussion uses the algorithmic description and notation in Schiffmann, Joost, and Werner (1994) as well as the Neural Net Frequently Asked Questions (FAQ), available as a hypertext document readable by any World-Wide Web browser, such as Mosaic, at the URL ftp://ftp.sas.com/pub/neural/FAQ.html.

There is an important distinction between "backprop" (or "back propagation of errors") and the "backpropagation algorithm".

The "back propagation of errors" is an efficient computational technique for computing the derivative of the error function with respect to the weights and biases of the network. This derivative, more commonly known as the error gradient, is needed for any first-order nonlinear optimization method. The standard backpropagation algorithm is a method for updating the weights based on the gradient. It is a variation of the simple "delta rule". See "What is backprop?" in part 2 of the FAQ for more details and references on standard backprop, RPROP, and Quickprop.
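The distinction can be made concrete with a small Python sketch. It is an illustration only, not the PROC NEURAL implementation: it assumes a hypothetical network with one input, one tanh hidden unit, and one linear output, and a squared-error function for a single case. The function `backprop_gradient` is the "back propagation of errors" (it only computes the gradient), while `gradient_step` is the standard backpropagation update that uses the gradient.

```python
import math

def forward(params, x):
    """Hidden activation and output of y = v * tanh(w * x + b) + c."""
    w, b, v, c = params
    h = math.tanh(w * x + b)
    return h, v * h + c

def error(params, x, t):
    """Squared error for one case (an assumed error function)."""
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2

def backprop_gradient(params, x, t):
    """Back propagation of errors: gradient of the error w.r.t. (w, b, v, c)."""
    w, b, v, c = params
    h, y = forward(params, x)
    delta_out = y - t                           # error signal at the output
    delta_hid = delta_out * v * (1.0 - h * h)   # propagated back through tanh
    return [delta_hid * x, delta_hid, delta_out * h, delta_out]

def numeric_gradient(params, x, t, eps=1e-6):
    """Central finite differences, used only to check the backprop gradient."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((error(up, x, t) - error(down, x, t)) / (2 * eps))
    return grads

def gradient_step(params, grad, learning_rate):
    """Standard backprop update: step against the gradient."""
    return [p - learning_rate * g for p, g in zip(params, grad)]

params = [0.5, -0.3, 0.8, 0.1]   # arbitrary illustrative weights
x, t = 1.5, 0.2                  # one training case: input and target
grad = backprop_gradient(params, x, t)
new_params = gradient_step(params, grad, learning_rate=0.1)
```

Any other first-order method, such as RPROP or Quickprop, could consume the same `grad` values; only the update rule in `gradient_step` would change.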

With any of the "prop" algorithms, PROC NEURAL allows detailed printing of the iterations. The PDETAIL option in the TRAIN statement prints, for each iteration, all quantities involved in the algorithm for each weight in the network. This option should be used with caution as it can result in voluminous output. However, by restricting the number of iterations and number of non-frozen weights, a manageable amount of information is produced. The purpose of the PDETAIL option is to allow you to follow the nonlinear optimization of the error function for each of the network weights. For any particular propagation method, as described below, all quantities used to compute an updated weight are printed.

In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate makes the weights and error function diverge, so there is no learning at all. If the error function is quadratic, as in linear models, good learning rates can be computed from the Hessian matrix. If the error function has many local and global optima, as in typical feedforward neural networks with hidden units, the optimal learning rate often changes dramatically during the training process, because the Hessian also changes dramatically. Trying to train a neural network using a constant learning rate is usually a tedious process requiring much trial and error.
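The effect of the learning rate on a quadratic error function can be seen in a short Python sketch. It is an illustration under stated assumptions, not PROC NEURAL output: the error function is a hypothetical quadratic with a diagonal Hessian whose eigenvalues are listed in `curvatures`, and all names are made up for the example. For such a function, gradient descent with a constant learning rate converges only when the rate is below 2 divided by the largest eigenvalue, and the rate 2 divided by the sum of the smallest and largest eigenvalues gives the fastest worst-case convergence.

```python
def run_gradient_descent(curvatures, start, learning_rate, iterations):
    """Minimize E(w) = 0.5 * sum(h_i * w_i**2) and return the final error."""
    w = list(start)
    for _ in range(iterations):
        grad = [h * wi for h, wi in zip(curvatures, w)]          # dE/dw_i
        w = [wi - learning_rate * g for wi, g in zip(w, grad)]   # constant-rate step
    return 0.5 * sum(h * wi * wi for h, wi in zip(curvatures, w))

curvatures = [1.0, 10.0]   # assumed Hessian eigenvalues
start = [1.0, 1.0]
initial_error = run_gradient_descent(curvatures, start, 0.0, 0)

# Rate derived from the Hessian: 2 / (smallest + largest eigenvalue)
hessian_rate = 2.0 / (min(curvatures) + max(curvatures))

too_low = run_gradient_descent(curvatures, start, 0.001, 50)          # learns very slowly
well_chosen = run_gradient_descent(curvatures, start, hessian_rate, 50)  # converges quickly
too_high = run_gradient_descent(curvatures, start, 0.25, 50)          # above 2/10: diverges
```

With hidden units the Hessian changes from iteration to iteration, so no single value of `learning_rate` stays well chosen throughout training.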

With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use standard backprop at all, because vastly more efficient, reliable, and convenient batch training algorithms exist, such as Quickprop and RPROP.

Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a large step size; this happens when you initialize a network with small random weights. In other regions of the weight space, the gradient is small and you need a small step size; this happens when you are close to a local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a good step size.
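A sign-based update shows what independence from the gradient magnitude means in practice. The Python sketch below is a simplified RPROP-style rule without weight backtracking, written for illustration only; it is not the algorithm as implemented in PROC NEURAL. The function names are hypothetical, and the constants (increase factor 1.2, decrease factor 0.5, step bounds 1e-6 and 50) are commonly cited defaults assumed here, not values taken from this document.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def rprop_minimize(gradient, start, iterations, initial_step=0.1,
                   increase=1.2, decrease=0.5, min_step=1e-6, max_step=50.0):
    """Simplified sign-based update: each weight keeps its own step size."""
    w = list(start)
    steps = [initial_step] * len(w)
    previous = [0.0] * len(w)
    for _ in range(iterations):
        grad = gradient(w)
        for i, g in enumerate(grad):
            agreement = sign(g) * sign(previous[i])
            if agreement > 0:      # same direction as last time: lengthen the step
                steps[i] = min(steps[i] * increase, max_step)
            elif agreement < 0:    # sign flipped (overshoot): shorten the step
                steps[i] = max(steps[i] * decrease, min_step)
            w[i] -= sign(g) * steps[i]   # only the sign of the gradient is used
            previous[i] = g
    return w

def quadratic_gradient(w):
    """Gradient of an assumed error E(w) = 0.5 * (w0**2 + 10 * w1**2)."""
    return [1.0 * w[0], 10.0 * w[1]]

def scaled_gradient(w):
    """The same error function multiplied by 1000."""
    return [1000.0 * g for g in quadratic_gradient(w)]

result = rprop_minimize(quadratic_gradient, [1.0, 1.0], iterations=100)
result_scaled = rprop_minimize(scaled_gradient, [1.0, 1.0], iterations=100)
```

Because only the sign of each partial derivative enters the update, `result` and `result_scaled` follow the same weight trajectory, whereas a rule that multiplies a learning rate by the gradient would take steps 1000 times larger on the scaled problem.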

**Mathematical Notation**

It is helpful to establish notation so that we can relate quantities and describe algorithms.

1. $w_{ij}(n)$ is the weight associated with the connection between the **i**th unit in the current layer and the **j**th unit from the previous layer. The argument **n** refers to iteration.
2. $\Delta w_{ij}(n)$ is the update or change for weight $w_{ij}(n)$. This update results in the $(n+1)$th iteration value for $w_{ij}$.
3. $\frac{\partial E}{\partial w_{ij}}(n)$ is the partial derivative of the error function $E$ with respect to the weight $w_{ij}$ at the **n**th iteration.
4. $y_k^m(\mathbf{x}^m, \mathbf{w})$ is the **k**th component of the output vector for the **m**th case as a function of the inputs $\mathbf{x}^m$ and network weights $\mathbf{w}$.
5. $t_k^m(\mathbf{x}^m)$ is the **k**th component of the target vector for the **m**th case as a function of the inputs $\mathbf{x}^m$.

The basic algorithm in all methods is a generic update given by