Computacion Inteligente Derivative-Based Optimization

Yüklə 2,58 Mb.

tarix	25.07.2018
ölçüsü	2,58 Mb.
	#58870

Computacion Inteligente

Derivative-Based Optimization

Optimization problems
Mathematical background
Descent Methods
The Method of Steepest Descent
Conjugate Gradient

Objective function – mathematical function which is optimized by changing the values of the design variables.

Objective function – mathematical function which is optimized by changing the values of the design variables.
Design Variables – Those variables which we, as designers, can change.
Constraints – Functions of the design variables which establish limits in individual variables or combinations of design variables.

3 basic ingredients…

3 basic ingredients…

an objective function,
a set of decision variables,
a set of equality/inequality constraints.

Design Variables: decision and objective vector

Design Variables: decision and objective vector
Constraints: equality and inequality
Bounds: feasible ranges for variables
Objective Function: maximization can be converted to minimization due to the duality principle

Identify the quantity or function, f, to be optimized.

Identify the quantity or function, f, to be optimized.
Identify the design variables: x1, x2, x3, …,xn.
Identify the constraints if any exist
a. Equalities
b. Inequalities
Adjust the design variables (x’s) until f is optimized and all of the constraints are satisfied.

Objective functions may be unimodal or multimodal.

Objective functions may be unimodal or multimodal.

Unimodal – only one optimum
Multimodal – more than one optimum

Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.
The global optimum is the best of all local optimum designs.

Existence of global minimum

Existence of global minimum
If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S

A set S is closed if it contains all its boundary pts.
A set S is bounded if it is contained in the interior of some circle

Derivative-based optimization (gradient based)

Derivative-based optimization (gradient based)

Capable of determining “search directions” according to an objective function’s derivative information

steepest descent method;
Newton’s method; Newton-Raphson method;
Conjugate gradient, etc.

Derivative-free optimization

random search method;
genetic algorithm;
simulated annealing; etc.

A square matrix M is positive definite if

A square matrix M is positive definite if
It is positive semidefinite if

A symmetric matrix M = MT is positive definite if and only if its eigenvalues λi > 0. (semidefinite ↔ λi ≥ 0)

A symmetric matrix M = MT is positive definite if and only if its eigenvalues λi > 0. (semidefinite ↔ λi ≥ 0)

Proof (→): Let vi the eigenvector for the i-th eigenvalue λi
Then,
which implies λi > 0,

Theorem: If a matrix M = UTU then it is positive definite

Theorem: If a matrix M = UTU then it is positive definite

f: Rn → R is a quadratic function if

f: Rn → R is a quadratic function if

where Q is symmetric.

It is no necessary for Q be symmetric.

It is no necessary for Q be symmetric.

Suposse matrix P non-symmetric

Suposse matrix P non-symmetric. Example

Suposse matrix P non-symmetric. Example

Given the quadratic function

Given the quadratic function

Two other shapes can result from the quadratic form.

Two other shapes can result from the quadratic form.

If Q is negative definite, then f is a parabolic “bowl” up side down.
If Q is indefinite then f describes a saddle.

Quadratics are useful in the study of optimization.

Quadratics are useful in the study of optimization.

Often, objective functions are “close to” quadratic near the solution.
It is easier to analyze the behavior of algorithms when applied to quadratics.
Analysis of algorithms for quadratics gives insight into their behavior in general.

The derivative of f: R → R is a function f ′: R → R given by

The derivative of f: R → R is a function f ′: R → R given by
if the limit exists.

Along the Axes…

Along the Axes…

In general direction…

In general direction…

Definition: A real-valued function f: Rn → R is said to be continuously differentiable if the partial derivatives

Definition: A real-valued function f: Rn → R is said to be continuously differentiable if the partial derivatives
exist for each x in Rn and are continuous functions of x.
In this case, we say f C1 (a smooth function C1)

Definition: The gradient of f: in R2 → R:

Definition: The gradient of f: in R2 → R:
It is a function ∇f: R2 → R2 given by

Definition: The gradient of f: Rn → R is a function ∇f: Rn → Rn given by

Definition: The gradient of f: Rn → R is a function ∇f: Rn → Rn given by

The gradient defines (hyper) plane approximating the function infinitesimally

The gradient defines (hyper) plane approximating the function infinitesimally

By the chain rule

By the chain rule

Proposition 1:

Proposition 1:
is maximal choosing

Proof:

Proof:

Assign:
by chain rule:

Proof:

Proof:

On the other hand for general v:

Proposition 2: let f: Rn → R be a smooth function C1 around p,

Proposition 2: let f: Rn → R be a smooth function C1 around p,
if f has local minimum (maximum) at p then,

Proof: intuitive

Proof: intuitive

We found the best INFINITESIMAL DIRECTION at each point,

We found the best INFINITESIMAL DIRECTION at each point,
Looking for minimum: “blind man” procedure
How can we derive the way to the minimum using this knowledge?

The gradient of f: Rn → Rm is a function Df: Rn → Rm×n given by

The gradient of f: Rn → Rm is a function Df: Rn → Rm×n given by

If the derivative of ∇f exists, we say that f is twice differentiable.

If the derivative of ∇f exists, we say that f is twice differentiable.

Write the second derivative as D2f (or F), and call it the Hessian of f.

The level set of a function f: Rn → R at level c is the set of points S = {x: f(x) = c}.

The level set of a function f: Rn → R at level c is the set of points S = {x: f(x) = c}.

Fact: ∇f(x0) is orthogonal to the level set at x0

Fact: ∇f(x0) is orthogonal to the level set at x0

Proof of fact:

Proof of fact:

Imagine a particle traveling along the level set.
Let g(t) be the position of the particle at time t, with g(0) = x0.
Note that f(g(t)) = constant for all t.
Velocity vector g′(t) is tangent to the level set.
Consider F(t) = f(g(t)). We have F′(0) = 0. By the chain rule,
Hence, ∇f(x0) and g′(0) are orthogonal.

Suppose f: R → R is in C1. Then,

Suppose f: R → R is in C1. Then,

Suppose f: R → R is in C2. Then,

Suppose f: R → R is in C2. Then,

Suppose f: Rn → R.

Suppose f: Rn → R.

If f in C1, then
If f in C2, then

We already know that ∇f(x0) is orthogonal to the level set at x0.

We already know that ∇f(x0) is orthogonal to the level set at x0.

Suppose ∇f(x0) ≠ 0.

Fact: ∇f points in the direction of increasing f.

Consider xα = x0 + α∇f(x0), α > 0.

Consider xα = x0 + α∇f(x0), α > 0.

By Taylor's formula,

Therefore, for sufficiently small ,
f(xα) > f(x0)

This theorem is the link from the previous gradient properties to the constructive algorithm.

This theorem is the link from the previous gradient properties to the constructive algorithm.
The problem:

The Theorem:

The Theorem:

Suppose f: Rn → R C1 smooth, and exist continuous function: k: Rn → [0,1], and,
And, the search vectors constructed by the model algorithm satisfy:

And

Then

if is the sequence constructed by the algorithm model,
then any accumulation point y of this sequence satisfy:

The theorem has very intuitive interpretation:

The theorem has very intuitive interpretation:
Always go in descent direction.

We now use what we have learned to implement the most basic minimization technique.

We now use what we have learned to implement the most basic minimization technique.
First we introduce the algorithm, which is a version of the model algorithm.
The problem:

Theorem:

Theorem:

If is a sequence constructed by the SD algorithm, then every accumulation point y of the sequence satisfy:
Proof: from Wolfe theorem

How long a step to take?

How long a step to take?

How long a step to take?

How long a step to take?

From the chain rule:

Therefore the method of steepest descent looks like this:

We from now on assume we want to minimize the quadratic function:

We from now on assume we want to minimize the quadratic function:
This is equivalent to solve linear problem:

La solucion es la interseccion de las lineas

La solucion es la interseccion de las lineas

Cada elipsoide tiene f(x) constante

Cada elipsoide tiene f(x) constante

What is the problem with steepest descent?

What is the problem with steepest descent?

We can repeat the same directions over and over…

Wouldn’t it be better if, every time we took a step, we got it right the first time?

What is the problem with steepest descent?

What is the problem with steepest descent?

We can repeat the same directions over and over…

Conjugate gradient requires n gradient evaluations and n line searches.

First, let’s define de error as

First, let’s define de error as

Let’s pick a set of orthogonal search directions

Let’s pick a set of orthogonal search directions

Using the coordinate axes as search directions…

Using the coordinate axes as search directions…

We have

We have

Given , how do we calculate ?

Given , how do we calculate ?

Given , how do we calculate ?

Given , how do we calculate ?

That is

How do we find ?

How do we find ?

Since search vectors form a basis

We want that after n step the error will be 0:

We want that after n step the error will be 0:

Here an idea: if then:

So we look for such that

So we look for such that

Simple calculation shows that if we take

Sources

J-Shing Roger Jang, Chuen-Tsai Sun and Eiji Mizutani, Slides for Ch. 5 of “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence”, First Edition, Prentice Hall, 1997.
Djamel Bouchaffra. Soft Computing. Course materials. Oakland University. Fall 2005
Lucidi delle lezioni, Soft Computing. Materiale Didattico. Dipartimento di Elettronica e Informazione. Politecnico di Milano. 2004
Jeen-Shing Wang, Course: Introduction to Neural Networks. Lecture notes. Department of Electrical Engineering. National Cheng Kung University. Fall, 2005

Sources

Carlo Tomasi, Mathematical Methods for Robotics and Vision. Stanford University. Fall 2000
Petros Ioannou, Jing Sun, Robust Adaptive Control. Prentice-Hall, Inc, Upper Saddle River: NJ, 1996
Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Edition 11/4. School of Computer Science. Carnegie Mellon University. Pittsburgh. August 4, 1994
Gordon C. Everstine, Selected Topics in Linear Algebra. The GeorgeWashington University. 8 June 2004

Yüklə 2,58 Mb.

Dostları ilə paylaş:

Computacion Inteligente Derivative-Based Optimization

Computacion Inteligente

Derivative-Based Optimization

Contents

Optimization problems

Mathematical background

Descent Methods

The Method of Steepest Descent

Conjugate Gradient

Objective function – mathematical function which is optimized by changing the values of the design variables.

Objective function – mathematical function which is optimized by changing the values of the design variables.

Design Variables – Those variables which we, as designers, can change.

Constraints – Functions of the design variables which establish limits in individual variables or combinations of design variables.

3 basic ingredients…

3 basic ingredients…

Design Variables: decision and objective vector

Identify the quantity or function, f, to be optimized.

Identify the quantity or function, f, to be optimized.

Identify the design variables: x1, x2, x3, …,xn.

Identify the constraints if any exist

a. Equalities

b. Inequalities

Adjust the design variables (x’s) until f is optimized and all of the constraints are satisfied.

Objective functions may be unimodal or multimodal.

Objective functions may be unimodal or multimodal.

Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.

The global optimum is the best of all local optimum designs.

Existence of global minimum

Existence of global minimum

If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S

Derivative-based optimization (gradient based)

Derivative-based optimization (gradient based)

Derivative-free optimization

A square matrix M is positive definite if

A square matrix M is positive definite if

It is positive semidefinite if

A symmetric matrix M = MT is positive definite if and only if its eigenvalues λi > 0. (semidefinite ↔ λi ≥ 0)

A symmetric matrix M = MT is positive definite if and only if its eigenvalues λi > 0. (semidefinite ↔ λi ≥ 0)

Theorem: If a matrix M = UTU then it is positive definite

Theorem: If a matrix M = UTU then it is positive definite

f: Rn → R is a quadratic function if

f: Rn → R is a quadratic function if

It is no necessary for Q be symmetric.

It is no necessary for Q be symmetric.

Suposse matrix P non-symmetric. Example

Given the quadratic function

Given the quadratic function

Two other shapes can result from the quadratic form.

Two other shapes can result from the quadratic form.

Quadratics are useful in the study of optimization.

Quadratics are useful in the study of optimization.

The derivative of f: R → R is a function f ′: R → R given by

The derivative of f: R → R is a function f ′: R → R given by

if the limit exists.

Along the Axes…

Along the Axes…

In general direction…

In general direction…

Definition: A real-valued function f: Rn → R is said to be continuously differentiable if the partial derivatives

Definition: A real-valued function f: Rn → R is said to be continuously differentiable if the partial derivatives

exist for each x in Rn and are continuous functions of x.

In this case, we say f C1 (a smooth function C1)

Definition: The gradient of f: in R2 → R:

Definition: The gradient of f: in R2 → R:

It is a function ∇f: R2 → R2 given by

Definition: The gradient of f: Rn → R is a function ∇f: Rn → Rn given by

Definition: The gradient of f: Rn → R is a function ∇f: Rn → Rn given by

The gradient defines (hyper) plane approximating the function infinitesimally

The gradient defines (hyper) plane approximating the function infinitesimally

By the chain rule

By the chain rule

Proposition 1:

Proposition 1:

is maximal choosing

Proof:

Proof:

Proof:

Proof:

Proposition 2: let f: Rn → R be a smooth function C1 around p,

Proposition 2: let f: Rn → R be a smooth function C1 around p,