## Derivative-Based Optimization
## Contents ## Optimization problems ## Mathematical background ## Descent Methods ## The Method of Steepest Descent ## Conjugate Gradient
## Objective function – mathematical function which is optimized by changing the values of the design variables. ## Objective function – mathematical function which is optimized by changing the values of the design variables. ## Design Variables – Those variables which we, as designers, can change. ## Constraints – Functions of the design variables which establish limits in individual variables or combinations of design variables.
## 3 basic ingredients… ## 3 basic ingredients… - an objective function,
- a set of decision variables,
- a set of equality/inequality constraints.
## Design Variables: decision and objective vector - Design Variables: decision and objective vector
- Constraints: equality and inequality
- Bounds: feasible ranges for variables
- Objective Function: maximization can be converted to minimization due to the duality principle
## Identify the quantity or function, *f*, to be optimized. ## Identify the design variables: *x*1, *x*2, *x*3, …,*xn*. ## Identify the constraints if any exist ## a. Equalities ## b. Inequalities ## Adjust the design variables (*x*’s) until *f* is optimized and all of the constraints are satisfied.
## Objective functions may be unimodal or multimodal. ## Objective functions may be unimodal or multimodal. - Unimodal – only one optimum
- Multimodal – more than one optimum
## Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design. ## The global optimum is the best of all local optimum designs.
## Existence of global minimum ## Existence of global minimum ## If *f*(*x*) is continuous on the feasible set *S* which is closed and bounded, then *f*(*x*) has a global minimum in *S* - A set
*S* is **closed** if it contains all its boundary pts. - A set
*S* is **bounded** if it is contained in the interior of some circle
__Derivative-based optimization__ (gradient based) __Derivative-based optimization__ (gradient based)
- Capable of determining “search directions” according to an objective function’s derivative information
- steepest descent method;
- Newton’s method; Newton-Raphson method;
- Conjugate gradient, etc.
__Derivative-free optimization__
- random search method;
- genetic algorithm;
- simulated annealing; etc.
## A square matrix *M *is **positive definite **if ## A square matrix *M *is **positive definite **if ## It is **positive semidefinite **if
## A symmetric matrix *M* = *MT* is positive definite if and only if its eigenvalues *λi* > 0. (semidefinite ↔ *λi* ≥ 0) ## A symmetric matrix *M* = *MT* is positive definite if and only if its eigenvalues *λi* > 0. (semidefinite ↔ *λi* ≥ 0) *Proof *(*→*): Let *vi* the eigenvector for the *i-*th eigenvalue *λi* - Then,
- which implies
*λi* > 0,
__Theorem__: If a matrix *M = UTU* then it is positive definite __Theorem__: If a matrix *M = UTU* then it is positive definite
*f*: *Rn *→ *R* is a **quadratic function** if *f*: *Rn *→ *R* is a **quadratic function** if
* *
## It is no necessary for *Q *be symmetric. ## It is no necessary for *Q *be symmetric. - Suposse matrix
*P *non-symmetric
## Suposse matrix *P *non-symmetric. Example - Suposse matrix
*P *non-symmetric. Example
## Given the quadratic function ## Given the quadratic function
## Two other shapes can result from the quadratic form. ## Two other shapes can result from the quadratic form. - If Q is negative definite, then
*f* is a parabolic “bowl” up side down. - If Q is indefinite then
*f* describes a saddle.
## Quadratics are useful in the study of optimization. ## Quadratics are useful in the study of optimization. - Often, objective functions are “close to” quadratic near the solution.
- It is easier to analyze the behavior of algorithms when applied to quadratics.
- Analysis of algorithms for quadratics gives insight into their behavior in general.
## The derivative of *f*: *R* → *R* is a function *f *′: *R* → *R *given by ## The derivative of *f*: *R* → *R* is a function *f *′: *R* → *R *given by ## if the limit exists.
## Along the Axes…
## In general direction…
__Definition__: A real-valued function *f*: *Rn *→ *R* is said to be **continuously differentiable** if the partial derivatives __Definition__: A real-valued function *f*: *Rn *→ *R* is said to be **continuously differentiable** if the partial derivatives
## exist for each *x *in* Rn *and are continuous functions of *x*. ## In this case, we say *f C*1* *(a **smooth** **function** **C****1**)
__Definition__: The gradient of *f*: in *R*2* *→ *R:* __Definition__: The gradient of *f*: in *R*2* *→ *R:*
## It is a function ∇*f*: *R*2* *→ *R*2* *given by
__Definition__: The gradient of *f*: *Rn *→ *R* is a function ∇*f*: *Rn *→ *Rn *given by __Definition__: The gradient of *f*: *Rn *→ *R* is a function ∇*f*: *Rn *→ *Rn *given by
## The gradient defines (hyper) plane approximating the function infinitesimally
## By the chain rule
**Proposition 1**: **Proposition 1**:
## is maximal choosing
*Proof*:
*Proof*: *Proof*:
- On the other hand for general
*v*:
**Proposition 2**: let *f*: *Rn *→ *R *be a smooth function *C*1 around p, **Proposition 2**: let *f*: *Rn *→ *R *be a smooth function *C*1 around p,
## if *f* has local **minimum** (maximum) at p then,
*Proof*: intuitive
## We found the best INFINITESIMAL DIRECTION at each point, ## We found the best INFINITESIMAL DIRECTION at each point, ## Looking for minimum: “blind man” procedure ## How can we derive the way to the minimum using this knowledge?
## The gradient of **f**: *Rn *→ *Rm *is a function *Df*: *Rn *→ *Rm*×*n *given by ## The gradient of **f**: *Rn *→ *Rm *is a function *Df*: *Rn *→ *Rm*×*n *given by
## If the derivative of ∇*f *exists, we say that *f *is twice differentiable. ## If the derivative of ∇*f *exists, we say that *f *is twice differentiable. - Write the second derivative as
*D*2*f *(or **F**), and call it the *Hessian *of *f*.
## The level set of a function *f*: *Rn *→ *R *at level *c *is the set of points *S *= {**x**: *f*(**x**) = *c*}. ## The level set of a function *f*: *Rn *→ *R *at level *c *is the set of points *S *= {**x**: *f*(**x**) = *c*}.
**Fact**: ∇*f*(**x**0) is orthogonal to the level set at **x**0 **Fact**: ∇*f*(**x**0) is orthogonal to the level set at **x**0
*Proof of fact*: *Proof of fact*:
- Imagine a particle traveling along the level set.
- Let
**g**(*t*) be the position of the particle at time *t*, with **g**(0) = **x**0. - Note that
*f*(**g**(*t*)) = constant for all *t*. - Velocity vector
**g**′(*t*) is tangent to the level set. - Consider
*F*(*t*) = *f*(**g**(*t*)). We have *F*′(0) = 0. By the chain rule, - Hence, ∇
*f*(**x**0) and **g**′(0) are orthogonal.
## Suppose *f*: *R* → *R* is in *C*1. Then, ## Suppose *f*: *R* → *R* is in *C*1. Then,
## Suppose *f*: *R* → *R* is in *C*2. Then, ## Suppose *f*: *R* → *R* is in *C*2. Then,
## Suppose *f*: *Rn *→ *R*. ## Suppose *f*: *Rn *→ *R*. - If
*f *in *C*1, then - If
*f *in *C*2, then
## We already know that ∇*f*(**x**0) is orthogonal to the level set at *x*0. ## We already know that ∇*f*(**x**0) is orthogonal to the level set at *x*0. ## Fact: ∇*f *points in the direction of increasing *f*.
## Consider **x**α = **x**0 + α∇*f*(**x**0), α > 0. ## Consider **x**α = **x**0 + α∇*f*(**x**0), α > 0. ## Therefore, for sufficiently small , * f*(**x**α) > *f*(**x**0)
## This theorem is the link from the previous gradient properties to the constructive algorithm. ## This theorem is the link from the previous gradient properties to the constructive algorithm. __The problem__:
__The Theorem__: __The Theorem__:
- Suppose
*f*: *Rn *→ *R *C1 smooth, and exist continuous function: *k*: *Rn *→ [0,1], and, - And, the search vectors constructed by the model algorithm satisfy:
## And ## Then - if is the sequence constructed by the algorithm model,
- then any accumulation point
*y* of this sequence satisfy:
## The theorem has very intuitive interpretation: ## The theorem has very intuitive interpretation: ## Always go in descent direction.
## We now use what we have learned to implement the most basic minimization technique. ## We now use what we have learned to implement the most basic minimization technique. ## First we introduce the algorithm, which is a version of the model algorithm. __The problem__:
__Theorem__: __Theorem__:
- If is a sequence constructed by the SD algorithm, then every accumulation point y of the sequence satisfy:
*Proof*: from Wolfe theorem
## How long a step to take?
## How long a step to take? ## How long a step to take? ## Therefore the method of steepest descent looks like this:
## We from now on assume we want to minimize the quadratic function: ## We from now on assume we want to minimize the quadratic function: ## This is equivalent to solve linear problem:
## La solucion es la interseccion de las lineas ## La solucion es la interseccion de las lineas
- Cada elipsoide tiene
*f*(*x*) constante
## What is the problem with steepest descent? ## What is the problem with steepest descent? - We can repeat the same directions over and over…
## Wouldn’t it be better if, every time we took a step, we got it right the first time?
## What is the problem with steepest descent? ## What is the problem with steepest descent? - We can repeat the same directions over and over…
## Conjugate gradient requires n gradient evaluations and n line searches.
## First, let’s define de *error* as ## First, let’s define de *error* as
## Let’s pick a set of orthogonal *search directions* ## Let’s pick a set of orthogonal *search directions*
## Using the coordinate axes as search directions… ## Using the coordinate axes as search directions…
## We have
## Given , how do we calculate ? ## Given , how do we calculate ?
## Given , how do we calculate ? ## Given , how do we calculate ?
## How do we find ? ## How do we find ? - Since search vectors form a basis
## We want that after n step the error will be 0: ## We want that after n step the error will be 0:
## So we look for such that ## So we look for such that - Simple calculation shows that if we take
## Sources ## J-Shing Roger Jang, Chuen-Tsai Sun and Eiji Mizutani, Slides for Ch. 5 of “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence”, First Edition, Prentice Hall, 1997. ## Djamel Bouchaffra. Soft Computing. Course materials. Oakland University. Fall 2005 ## Lucidi delle lezioni, Soft Computing. Materiale Didattico. Dipartimento di Elettronica e Informazione. Politecnico di Milano. 2004 ## Jeen-Shing Wang, Course: Introduction to Neural Networks. Lecture notes. Department of Electrical Engineering. National Cheng Kung University. Fall, 2005
## Sources ## Carlo Tomasi, Mathematical Methods for Robotics and Vision. Stanford University. Fall 2000 ## Petros Ioannou, Jing Sun, Robust Adaptive Control. Prentice-Hall, Inc, Upper Saddle River: NJ, 1996 ## Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Edition 11/4. School of Computer Science. Carnegie Mellon University. Pittsburgh. August 4, 1994 ## Gordon C. Everstine, Selected Topics in Linear Algebra. The GeorgeWashington University. 8 June 2004
**Dostları ilə paylaş:** |