Computacion Inteligente Derivative-Based Optimization



Yüklə 445 b.
tarix25.07.2018
ölçüsü445 b.


Computacion Inteligente

  • Derivative-Based Optimization


Contents

  • Optimization problems

  • Mathematical background

  • Descent Methods

  • The Method of Steepest Descent

  • Conjugate Gradient





Objective function – mathematical function which is optimized by changing the values of the design variables.

  • Objective function – mathematical function which is optimized by changing the values of the design variables.

  • Design Variables – Those variables which we, as designers, can change.

  • Constraints – Functions of the design variables which establish limits in individual variables or combinations of design variables.



3 basic ingredients…

  • 3 basic ingredients…

    • an objective function,
    • a set of decision variables,
    • a set of equality/inequality constraints.


Design Variables: decision and objective vector

    • Design Variables: decision and objective vector
    • Constraints: equality and inequality
    • Bounds: feasible ranges for variables
    • Objective Function: maximization can be converted to minimization due to the duality principle


Identify the quantity or function, f, to be optimized.

  • Identify the quantity or function, f, to be optimized.

  • Identify the design variables: x1, x2, x3, …,xn.

  • Identify the constraints if any exist

  • a. Equalities

  • b. Inequalities

  • Adjust the design variables (x’s) until f is optimized and all of the constraints are satisfied.



Objective functions may be unimodal or multimodal.

  • Objective functions may be unimodal or multimodal.

    • Unimodal – only one optimum
    • Multimodal – more than one optimum
  • Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.

  • The global optimum is the best of all local optimum designs.



Existence of global minimum

  • Existence of global minimum

  • If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S

    • A set S is closed if it contains all its boundary pts.
    • A set S is bounded if it is contained in the interior of some circle






Derivative-based optimization (gradient based)

  • Derivative-based optimization (gradient based)

    • Capable of determining “search directions” according to an objective function’s derivative information
      • steepest descent method;
      • Newton’s method; Newton-Raphson method;
      • Conjugate gradient, etc.
  • Derivative-free optimization

      • random search method;
      • genetic algorithm;
      • simulated annealing; etc.




A square matrix M is positive definite if

  • A square matrix M is positive definite if

  • It is positive semidefinite if



A symmetric matrix M = MT is positive definite if and only if its eigenvalues λi > 0. (semidefinite ↔ λi ≥ 0)

  • A symmetric matrix M = MT is positive definite if and only if its eigenvalues λi > 0. (semidefinite ↔ λi ≥ 0)

    • Proof (): Let vi the eigenvector for the i-th eigenvalue λi
    • Then,
    • which implies λi > 0,


Theorem: If a matrix M = UTU then it is positive definite

  • Theorem: If a matrix M = UTU then it is positive definite



f: Rn R is a quadratic function if

  • f: Rn R is a quadratic function if

    • where Q is symmetric.


It is no necessary for Q be symmetric.

  • It is no necessary for Q be symmetric.

    • Suposse matrix P non-symmetric


Suposse matrix P non-symmetric. Example

    • Suposse matrix P non-symmetric. Example


Given the quadratic function

  • Given the quadratic function



Two other shapes can result from the quadratic form.

  • Two other shapes can result from the quadratic form.

    • If Q is negative definite, then f is a parabolic “bowl” up side down.
    • If Q is indefinite then f describes a saddle.


Quadratics are useful in the study of optimization.

  • Quadratics are useful in the study of optimization.

    • Often, objective functions are “close to” quadratic near the solution.
    • It is easier to analyze the behavior of algorithms when applied to quadratics.
    • Analysis of algorithms for quadratics gives insight into their behavior in general.


The derivative of f: RR is a function f ′: RR given by

  • The derivative of f: RR is a function f ′: RR given by

  • if the limit exists.



Along the Axes…

  • Along the Axes…



In general direction…

  • In general direction…





Definition: A real-valued function f: Rn R is said to be continuously differentiable if the partial derivatives

  • Definition: A real-valued function f: Rn R is said to be continuously differentiable if the partial derivatives

  • exist for each x in Rn and are continuous functions of x.

  • In this case, we say f C1 (a smooth function C1)



Definition: The gradient of f: in R2 R:

  • Definition: The gradient of f: in R2 R:

  • It is a function ∇f: R2 R2 given by



Definition: The gradient of f: Rn R is a function ∇f: Rn Rn given by

  • Definition: The gradient of f: Rn R is a function ∇f: Rn Rn given by



The gradient defines (hyper) plane approximating the function infinitesimally

  • The gradient defines (hyper) plane approximating the function infinitesimally



By the chain rule

  • By the chain rule



Proposition 1:

  • Proposition 1:

  • is maximal choosing



Proof:

  • Proof:

    • Assign:
    • by chain rule:


Proof:

  • Proof:

    • On the other hand for general v:


Proposition 2: let f: Rn R be a smooth function C1 around p,

  • Proposition 2: let f: Rn R be a smooth function C1 around p,

  • if f has local minimum (maximum) at p then,



Proof: intuitive

  • Proof: intuitive



We found the best INFINITESIMAL DIRECTION at each point,

  • We found the best INFINITESIMAL DIRECTION at each point,

  • Looking for minimum: “blind man” procedure

  • How can we derive the way to the minimum using this knowledge?



The gradient of f: Rn Rm is a function Df: Rn Rm×n given by

  • The gradient of f: Rn Rm is a function Df: Rn Rm×n given by



If the derivative of ∇f exists, we say that f is twice differentiable.

  • If the derivative of ∇f exists, we say that f is twice differentiable.

    • Write the second derivative as D2f (or F), and call it the Hessian of f.


The level set of a function f: Rn R at level c is the set of points S = {x: f(x) = c}.

  • The level set of a function f: Rn R at level c is the set of points S = {x: f(x) = c}.



Fact: ∇f(x0) is orthogonal to the level set at x0

  • Fact: ∇f(x0) is orthogonal to the level set at x0



Proof of fact:

  • Proof of fact:

    • Imagine a particle traveling along the level set.
    • Let g(t) be the position of the particle at time t, with g(0) = x0.
    • Note that f(g(t)) = constant for all t.
    • Velocity vector g′(t) is tangent to the level set.
    • Consider F(t) = f(g(t)). We have F′(0) = 0. By the chain rule,
    • Hence, ∇f(x0) and g′(0) are orthogonal.


Suppose f: RR is in C1. Then,

  • Suppose f: RR is in C1. Then,



Suppose f: RR is in C2. Then,

  • Suppose f: RR is in C2. Then,



Suppose f: Rn R.

  • Suppose f: Rn R.

    • If f in C1, then
    • If f in C2, then


We already know that ∇f(x0) is orthogonal to the level set at x0.

  • We already know that ∇f(x0) is orthogonal to the level set at x0.

    • Suppose ∇f(x0) ≠ 0.
  • Fact: ∇f points in the direction of increasing f.



Consider xα = x0 + α∇f(x0), α > 0.

  • Consider xα = x0 + α∇f(x0), α > 0.

    • By Taylor's formula,
  • Therefore, for sufficiently small ,

  • f(xα) > f(x0)





This theorem is the link from the previous gradient properties to the constructive algorithm.

  • This theorem is the link from the previous gradient properties to the constructive algorithm.

  • The problem:





The Theorem:

  • The Theorem:

    • Suppose f: Rn R C1 smooth, and exist continuous function: k: Rn → [0,1], and,
    • And, the search vectors constructed by the model algorithm satisfy:


And

    • And
  • Then

    • if is the sequence constructed by the algorithm model,
    • then any accumulation point y of this sequence satisfy:


The theorem has very intuitive interpretation:

  • The theorem has very intuitive interpretation:

  • Always go in descent direction.





We now use what we have learned to implement the most basic minimization technique.

  • We now use what we have learned to implement the most basic minimization technique.

  • First we introduce the algorithm, which is a version of the model algorithm.

  • The problem:





Theorem:

  • Theorem:

    • If is a sequence constructed by the SD algorithm, then every accumulation point y of the sequence satisfy:
    • Proof: from Wolfe theorem


How long a step to take?

  • How long a step to take?



How long a step to take?

  • How long a step to take?

    • From the chain rule:
  • Therefore the method of steepest descent looks like this:











We from now on assume we want to minimize the quadratic function:

  • We from now on assume we want to minimize the quadratic function:

  • This is equivalent to solve linear problem:



La solucion es la interseccion de las lineas

  • La solucion es la interseccion de las lineas



Cada elipsoide tiene f(x) constante

    • Cada elipsoide tiene f(x) constante


What is the problem with steepest descent?

  • What is the problem with steepest descent?

    • We can repeat the same directions over and over…
  • Wouldn’t it be better if, every time we took a step, we got it right the first time?



What is the problem with steepest descent?

  • What is the problem with steepest descent?

    • We can repeat the same directions over and over…
  • Conjugate gradient requires n gradient evaluations and n line searches.



First, let’s define de error as

  • First, let’s define de error as



Let’s pick a set of orthogonal search directions

  • Let’s pick a set of orthogonal search directions



Using the coordinate axes as search directions…

  • Using the coordinate axes as search directions…



We have

  • We have



Given , how do we calculate ?

  • Given , how do we calculate ?



Given , how do we calculate ?

  • Given , how do we calculate ?

    • That is


How do we find ?

  • How do we find ?

    • Since search vectors form a basis


We want that after n step the error will be 0:

  • We want that after n step the error will be 0:

    • Here an idea: if then:


So we look for such that

  • So we look for such that

    • Simple calculation shows that if we take




Sources

  • J-Shing Roger Jang, Chuen-Tsai Sun and Eiji Mizutani, Slides for Ch. 5 of “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence”, First Edition, Prentice Hall, 1997.

  • Djamel Bouchaffra. Soft Computing. Course materials. Oakland University. Fall 2005

  • Lucidi delle lezioni, Soft Computing. Materiale Didattico. Dipartimento di Elettronica e Informazione. Politecnico di Milano. 2004

  • Jeen-Shing Wang, Course: Introduction to Neural Networks. Lecture notes. Department of Electrical Engineering. National Cheng Kung University. Fall, 2005



Sources

  • Carlo Tomasi, Mathematical Methods for Robotics and Vision. Stanford University. Fall 2000

  • Petros Ioannou, Jing Sun, Robust Adaptive Control. Prentice-Hall, Inc, Upper Saddle River: NJ, 1996

  • Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Edition 11/4. School of Computer Science. Carnegie Mellon University. Pittsburgh. August 4, 1994

  • Gordon C. Everstine, Selected Topics in Linear Algebra. The GeorgeWashington University. 8 June 2004





Dostları ilə paylaş:


Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2019
rəhbərliyinə müraciət

    Ana səhifə