Theories of continuous
optimization
olivier.teytaud@inria.fr
Or the art of preparing coffee,
and a little bit of sorting algorithms.
Sorry for not being here
yesterday :-)
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous
optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
The many flavors of continuous
optimization
● What do we optimize ? (x in R^d, f(x) in R)
– argmin_x f(x)
– Pareto front { ( f_1(x), f_2(x), f_3(x) ) ; x }
– argmin_{x such that Z(x) } f(x)
● How to optimize ?
– Exactly
– Approximately
What are the best temperature/pressure
for preparing espresso ?
What are the best
temperature/pressure
for preparing espresso with <12 bars ?
What are the
coffee
parameters
such that it can
not become
simultaneously
better for Bob,
Alice and
Charles ?
Coffee is never perfect
But in discrete domains
people want perfect coffee,
i.e. exact solutions
The many flavors of continuous
optimization
● From which information ?
– (x,y) → << is f(x) < f(y) ? >> (ternary)
(user feedback ? noise-free & unbiased ?)
– x → f(x)
– x → f(x), ∇f(x) (99% of the continuous opt. world...)
– x → f(x), ∇f(x), Hf(x)
● Which criterion ?
– f(x) after time T
– f(x) after T comparisons
– f(x) after T evaluations
– ...
My coffee is twice-differentiable
and I've met its Hessian
I don't give marks to coffee
but I can choose
A coffee, quickly!
I want to drink T coffees.
The many flavors of continuous
optimization
Comparison information & number of comparisons
(comparison-complexity)
Or comparison information & number of evaluations ?
(strange ?)
(but no need for all λ^2 pairwise comparisons: sorting ranks λ points with λ log(λ) comp., see the sketch below)
Or black-box & number of evaluations ? (black-box complexity)
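As an aside on the λ log(λ) claim above (a minimal Python sketch, not from the slides; the objective and the comparison counter are placeholder choices): sorting λ points by fitness performs roughly λ log(λ) comparisons, far below the λ^2 pairwise comparisons.

import functools, random

comparisons = 0

def f(x):                      # placeholder objective for the example
    return sum(xi * xi for xi in x)

def compare(a, b):             # comparison-based access to f, with a counter
    global comparisons
    comparisons += 1
    return -1 if f(a) < f(b) else 1

lam = 1000
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(lam)]
ranked = sorted(points, key=functools.cmp_to_key(compare))
print(comparisons)             # about lam*log2(lam) ~ 10^4, not lam^2 = 10^6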
Black box vs white box:
BB might be much easier
because we do not count time!
In discrete domains (example by T. Jansen, right ? Maybe someone else
as well ?):
Argmin quadraticFunction ==> NP-complete, but
black-box complexity = O(B(B+1)/2+B+2)
(algo = find the quadratic form!)
==> super expensive but BB easy!
In (0,1): encode x>0 in (0,1/2), in unary
and minimum of f at 1/2 + 1/(2^busy-beaver(x)), in (1/2, 1)
==> takes time busy-beaver(x) (to be proved :-) ) but BB computable in
O(x) !
(BB or comparison-based!)
I love internet, we can
find an image with
a beaver and coffee
Introduction: the (1+1)-ES
(Schumer and Steiglitz)
Thanks Anne Auger for not complaining for all I have stolen from your slides.
Simple, but good coffee.
Why ?
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Obviously parallel
Parameters:
x, σ
Multi-cores,
Clusters, Grids...
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Obviously parallel
Parameters:
x, σ
Really simple.
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Obviously parallel
Parameters:
x, σ
Really simple.
Not a negligible advantage.
When I accessed, for the 1st time,
a crucial industrial
code of an important
company, I believed
that it would be
clean and bug free.
(I was young :-) )
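A minimal Python sketch of the basic scheme above (generate λ points around x with step-size σ, keep the μ best, average them); the sphere objective, the parameter values and the fixed σ are placeholder choices for illustration, not from the slides.

import random

def sphere(x):                                # placeholder objective
    return sum(xi * xi for xi in x)

def mu_lambda_es(f, x, sigma, lam=20, mu=5, iterations=200):
    d = len(x)
    for _ in range(iterations):
        # generate lambda points around x: x + sigma * N, N standard Gaussian
        offspring = [[xi + sigma * random.gauss(0, 1) for xi in x]
                     for _ in range(lam)]
        # compute their lambda fitness values, select the mu best
        offspring.sort(key=f)
        best = offspring[:mu]
        # let x = average of these mu best
        x = [sum(p[i] for p in best) / mu for i in range(d)]
    return x

print(sphere(mu_lambda_es(sphere, [5.0] * 3, sigma=0.3)))

Here σ is kept fixed; how to choose and adapt it is exactly the question raised in the next slides.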
Generate 1 point x' around x
( x + σ N where N is a standard
Gaussian)
Compute its fitness value
Keep the best (x or x').
x = best(x, x')
σ = 2 σ if x' is best
σ = 0.84 σ otherwise
Parameters:
x, σ
The (1+1)-ES with 1/5th
rule
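A minimal Python sketch of this (1+1)-ES with the step-size constants from the slide (σ multiplied by 2 on success, by 0.84 otherwise); the sphere objective and the budget are placeholder choices.

import random

def sphere(x):                                # placeholder objective
    return sum(xi * xi for xi in x)

def one_plus_one_es(f, x, sigma, evaluations=2000):
    fx = f(x)
    for _ in range(evaluations):
        # generate 1 point x' around x: x + sigma * N
        xp = [xi + sigma * random.gauss(0, 1) for xi in x]
        fxp = f(xp)
        # keep the best (x or x'), and adapt sigma
        if fxp < fx:
            x, fx = xp, fxp
            sigma *= 2.0                      # x' best ==> increase sigma
        else:
            sigma *= 0.84                     # otherwise ==> decrease sigma
    return x, sigma

x, sigma = one_plus_one_es(sphere, [5.0] * 3, sigma=1.0)
print(sphere(x), sigma)

With these constants the step-size is roughly stable when about 1/5 of the mutations are accepted, which is the rule discussed below.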
This is x...
I generate λ=6 points
I select the μ=3 best points
x = average of these μ=3 best points
Ok.
Choosing an initial
x is as in any algorithm.
But how do I choose sigma ?
Ok.
Choosing x is as in any algorithm.
But how do I choose sigma ?
Sometimes by human guess.
But for a large number of iterations,
there is something better.
log || x_n – x* || ~ - C n
Usually termed “linear convergence”,
==> but it's in log-scale.
log || x_n – x* || ~ - C n
Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1) N
We want to maximize:
- E log || x(n) - x* ||
Ok, we want to choose σ. How to do that ?
Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1) N
We want to maximize:
- E log || x(n) - x* ||
__________________
- E log || x(n-1) – x* ||
Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1) N
We want to maximize:
- E log || x(n) - x* ||
__________________
- E log || x(n-1) – x* ||
We don't know x*.
How can we optimize this ?
We will observe
the acceptance rate,
and we will deduce if σ
is too large or too small.
- E log || x(n) - x* ||
___________________________________
- E log || x(n-1) – x* ||
[Figure, on the norm function: level set through the current point, the optimum,
and the regions of accepted vs. rejected mutations; the ratio above is the progress rate.]
For each step-size,
evaluate this “expected progress rate”
and evaluate “P(acceptance)”
[Plot, repeated over the next slides: progress rate and acceptance rate as functions of the step-size, with the rejected mutations marked.]
We want to be where the progress rate is maximal!
We observe (approximately) the acceptance rate.
Big step-size ==> small acceptance rate
==> decrease sigma.
Small step-size ==> big acceptance rate
==> increase sigma.
1/5th rule with arbitrary pop-size
(Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud)
Based on maths showing
that good step-size
<==> success rate ≈ 1/5
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noise-free setting, rates
(optimum at 0, for short)
● Newton (f, ∇f, Hf):
||x_n|| = O( ||x_{n-1}||^2 ) “quadratic”
● BFGS (f, ∇f)
superlinear: ||x_n|| = o( ||x_{n-1}|| )
(LBFGS: R-linear)
● NEWUOA (f) or BOBYQA
finite number of bits ==> complexity (dimension)
● ES (comparison)
||x_n|| = O( K^{1/d} || x_{n-1} || ) “linear”
n ~ d x log ( 1 / ε ) (see the computation below)
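The last line follows from the one before it by a one-line computation (standard, not spelled out on the slide): iterating the linear rate, with K < 1,
$$\|x_n\| \lesssim K^{n/d}\,\|x_0\| \le \varepsilon \;\Longrightarrow\; n \ge \frac{d\,\log(\|x_0\|/\varepsilon)}{\log(1/K)} \sim d\,\log(1/\varepsilon).$$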
Noise-free setting, q-rates
q-superlinear convergence
q-linear convergence
Noise-free setting, r-rates
q-rates have a problem:
You might have a super fast convergence, but no
q-superlinear convergence, because a few values
are super-small (the sequence is not decreasing)
r-rate: the r-rate of x_n is the best q-rate
of a sequence y_n with
values y_n ≥ x_n
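For reference, the standard definitions behind these names (the formulas on the slides were only in the pictures); writing $e_n = \|x_n - x^*\|$:
$$\text{q-linear: } \limsup_n \frac{e_{n+1}}{e_n} \le c < 1, \qquad \text{q-superlinear: } \frac{e_{n+1}}{e_n} \to 0, \qquad \text{q-quadratic: } e_{n+1} = O(e_n^2),$$
and the r-rate of $(x_n)$ is the best q-rate of a sequence $(y_n)$ with $y_n \ge e_n$ for all n.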
Noise-free setting, rates by
Markov chain analysis
● A. Auger, A. Chotard, N. Hansen, sphere functions and then linear functions.
● Typical idea in the sphere case: rescaling!
(x/σ) is a homogeneous Markov chain
==> (x/σ) is asymp. stationary (for some ES and functions...)
==> so the progress rate is computed under the asymptotic probability distribution, such
that hopefully
E( log ||x(n+1)/σ(n+1)|| - log ||x(n)/σ(n)|| | ||x(n)/σ(n)|| )
is negative.
Not proved!
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noise-free setting, lower bounds
(Gelly, Fournier, Teytaud, ...)
Remember lower bound for sorting with comparison-based
algorithms ?
● n! possible inputs
● 2^k possible outcomes for k comparisons
● 2^k must be at least n! ==> k at least log(n!) ≈ n log(n)
indeed, we have no proof that better than n log(n) is
impossible for integers.
==> same concept for ES!
Noise-free setting, lower bounds
Number of possible outputs:
● K possible outputs per iteration
● K^N possible outputs after N iterations
==> can be extended to stochastic algorithms
How many possible outputs are needed for
precision ε in domain D for a given norm ?
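A standard covering count answers this (not spelled out in the extracted slide): for the sup norm, $[0,1]^d$ contains about $(1/(2\varepsilon))^d$ disjoint ε-balls, so guaranteeing precision ε requires at least that many distinguishable outputs:
$$\#\text{outputs} \;\ge\; \Big(\tfrac{1}{2\varepsilon}\Big)^{d}.$$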
Noise-free setting, lower bounds
“Branching factor” K = nb of possible outcomes
when comparing the offspring fitness
● 2 for (1+1)-ES (better or worse!) (3 if tie)
● (μ,λ)-ES ==> λ choose μ
(generate λ points, use the list of the μ best)
● (μ+λ)-ES ==> λ+μ choose μ
(generate λ points, use the list of the μ best among all)
● arbitrary ranking ES ==> λ!
● ranking with archive ==> O(n) possibilities for the n-th point
Fournier & Teytaud,
Algorithmica'06.
(Counting argument: compare K^N with the number of ε-balls
needed to cover [0,1]^d; see below.)
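Putting the two counts together gives a back-of-the-envelope version of the Fournier & Teytaud bound (constants dropped):
$$K^{N} \;\ge\; \Big(\tfrac{1}{2\varepsilon}\Big)^{d} \;\Longrightarrow\; N \;\ge\; \frac{d\,\log(1/(2\varepsilon))}{\log K},$$
so the number of iterations is at least linear in the dimension and logarithmic in 1/ε, divided by log K, where K is the branching factor listed above.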
I'm grateful to J. Jaegerskuepper,
I fell in love while watching
his talk at Dagstuhl'06.
(in love with lower bounds, not with Jens)
Noise-free setting, lower bounds,
multiobjective case
d spherical objective functions in dimension N
I've written a paper without assuming
sphere-like objective functions.
Using tricks on Hausdorff metric.
The most unreadable thing I've ever written.
I apologize for this.
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noisy setting, rates
Search point ≠ recommended point
Even in deterministic
optimization this is important.
Noisy setting, rates
Simple regret (SR) & cumulated regret (CR):
SR(n) = E[ fitness( n-th recommended point ) ]
– E[ fitness(x*) ]
I want to
become a
great barista
I want a good
coffee on average
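In symbols (the cumulated-regret formula was not in the extracted text; this is the standard definition, with $\tilde x_n$ the n-th recommended point and $x_n$ the n-th evaluated point):
$$SR(n) = \mathbb{E}\,f(\tilde x_n) - \mathbb{E}\,f(x^*), \qquad CR(N) = \sum_{n=1}^{N} \big(\mathbb{E}\,f(x_n) - \mathbb{E}\,f(x^*)\big).$$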
Simple regret, uniform regret,
cumulative regret
Simple regret, uniform regret,
cumulative regret
Noise model: z in {0,1,2}
0: constant variance noise / additive noise
1: linear variance noise
2: quadratic variance noise / multiplicative noise
Also actuator noise: f(x,w) = f(x+w)
My coffee
machine has
noise
No excuse for bad
coffee, no noise at
the optimum (Markov
chain analysis ok)
E.g. success rates of
parametric policies
with optimum = 100%
Noisy setting (noise Θ(1)),
comparison-based rates if z=0
(M.-L. Cauwet <== looking for a post-doc)
(showing that MLIS (Beyer'98) leads to SR=O(1/N) with constant noise)
1. Comparison-based noisy
optimization: the operator
Don't average and compare:
just compare many times!
Hint: apply a sorting algorithm;
it will be much faster than n^2
comparisons :-) (for
comparison-based criteria
rather than evaluation-based
complexity)
Noisy setting,
comparison-based rates
2. Key points:
A frequency estimated from
N comparisons has precision
1/sqrt(N).
a) Optimum position = exactly
computable from the probability
of “f(-1) > f(1)”.
b) This probability is approx.
the observed frequency, with
precision 1/sqrt(N).
Evaluate f(1) and f(-1),
N times each,
consider the frequency
at which 1 is better.
Some algebra :-)
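A minimal Python sketch of point 2.b on a 1-d toy problem (the quadratic objective, its optimum and the Gaussian noise are placeholder choices): the frequency with which the noisy value at 1 beats the noisy value at -1 estimates P( f(1)+noise < f(-1)+noise ) within about 1/sqrt(N).

import math, random

def noisy_f(x, x_star=0.3):
    # toy noisy objective: quadratic + constant-variance (z=0) Gaussian noise
    return (x - x_star) ** 2 + random.gauss(0, 1)

def frequency_1_beats_minus_1(N):
    wins = sum(noisy_f(1.0) < noisy_f(-1.0) for _ in range(N))
    return wins / N

for N in (100, 10000, 1000000):
    print(N, round(frequency_1_beats_minus_1(N), 4), "+/-", round(1 / math.sqrt(N), 4))

Recovering the optimum position from this probability (point 2.a) is the “some algebra” alluded to on the slide and is not attempted here.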
Noisy setting,
comparison-based rates
3. Noisy optimization
in dimension 1
Mutate large,
inherit small,
with large
population
==>
SR = O(1/N)
Extension to
strongly convex
multidimensional
quadratic forms
ok
Noisy setting,
portfolio methods
(S. Astete-Morales, M.-L. Cauwet, J. Liu, B. Rozières, T.)
Portfolio methods:
● Running several methods
● Picking up the best
Sometimes a priori, often online, sometimes
chaining (switch algorithm in real time).
Remarks:
● “Introsort” (used in STL) is a sorting algorithm with
chaining.
● In combinatorial optimization, chaining is less usual.
A portfolio algorithm: 3 key
parameters r_n, s_n and lag
r_n = time of the
n-th
comparison!
r_n large ==> comparisons are rare
Main index
Budget s_n
Run solvers !
(fair budget!)
s_n, r_n and lag
Many solutions for writing portfolios.
In all cases, you have roughly these parameters:
● How frequently you compare
● How precisely you compare (adaptive ? Maybe,
but no “race” (or you might spend a huge time) )
● More surprising, whom you compare: old outputs,
because old outputs are cheaper to compare!
Lag
Lag principle in noisy optimization with portfolios:
● Compare “old” recommendations (because cheaper)
● Recommend “current” recommendations
● Assume “best old = best current” asymptotically
==> nothing tricky, just algebra and standard bounds
Unfair budget: evaluate
until lag(r_n), not until r_n!
Except for
best!
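A minimal Python sketch of a portfolio with lagged comparisons (the solver, the schedule and the lag rule are illustrative choices, not the NOPA/INOPA algorithms of the papers): each solver keeps its history of recommendations; at comparison times we compare old (lagged, hence cheaper-to-compare) recommendations, but we return the current recommendation of the winner.

import random

def sphere(x):                                # placeholder objective (noise-free here)
    return sum(xi * xi for xi in x)

class OnePlusOneSolver:
    # the (1+1)-ES from earlier, exposed step by step, with a history of recommendations
    def __init__(self, f, x, sigma):
        self.f, self.x, self.fx, self.sigma = f, list(x), f(x), sigma
        self.history = [list(x)]
    def step(self):
        xp = [xi + self.sigma * random.gauss(0, 1) for xi in self.x]
        fxp = self.f(xp)
        if fxp < self.fx:
            self.x, self.fx, self.sigma = xp, fxp, self.sigma * 2.0
        else:
            self.sigma *= 0.84
        self.history.append(list(self.x))

def portfolio(f, solvers, budget=3000, compare_every=200, lag=2):
    best = solvers[0]
    for n in range(1, budget + 1):
        for s in solvers:                     # fair budget: one step per solver
            s.step()
        if n % compare_every == 0:            # r_n: how frequently we compare
            old = n // lag                    # compare lagged recommendations
            best = min(solvers, key=lambda s: f(s.history[old]))
    return best.x                             # recommend the *current* recommendation of the winner

solvers = [OnePlusOneSolver(sphere, [5.0] * 3, s) for s in (0.01, 0.3, 3.0)]
print(sphere(portfolio(sphere, solvers)))

In the noisy setting the comparison of old recommendations would itself use repeated noisy evaluations; here the comparison is exact to keep the sketch short.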
Unreadable theorem
Improved theorem, INOPA, still
unreadable
Almost readable theorem
If solver i has regret (C(i)+o(1)) / n^α(i)
, and
then,
● NOPA has the log(m) shift ;
● and INOPA has the log(m') shift.
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noisy setting, lower bounds:
validating “mutate large inherit small” (H.G. Beyer, 1998)
Simple ES:
Sampling around the
recommendation
Sampling radius
≈ distance to optimum
Search
points
Proof of slope of simple regret at best -1/2
I am a lazy guy, I
do the proof by
reduction to CR.
[Equations on the slide; the annotations read: “if wrong, then ...”,
“by definition of CR”, “it's the sphere”, “it's a Gaussian”,
“sampling scale ≈ distance scale”.]
Ok, CR ≈ sum of SR
(not obvious, SR is based on
recommendations).
If “sampling distance
≈ distance to optimum“
then CR ≈ sum of SR
(and, thanks, O. Shamir has proved
that CR is at best of order sqrt(n), i.e. slope 1/2).
Parallelization in evolution
strategies in the noise-free
setting
[Plot: convergence rate (inverse runtime) as a function of the population size λ.]
Linear speed-up up to λ ≈ dimension
(Beyer's book &
“on the benefit
of sex”),
logarithmic beyond
(Gelly, T., Fournier).
Not always!
But speculative
parallelization can do that.
Special parametrization
required for some algs.
(Teytaud & T.)
You don't want to know
which image I have found
by googling “coffee” and “sex”
Conclusion in noisy continuous
optimization with constant noise
[Plot: log(simple regret) vs log(# evaluations); the slopes:]
Slope -1: Kiefer-Wolfowitz style, using many derivatives, e.g. Fabian,
+ Hessian estimates if the third derivative = 0,
+ ES with MLIS on the sphere.
Slope -2/3: Hessian estimate.
Slope -1/2: evolution strategies and other sampling/selecting algorithms
(when no MLIS).
Conclusion in noise-free
continuous optimization
[Plot: log(simple regret) vs # evaluations (V = d for the sphere).]
Full-ranking (mu,lambda) ==> mu + log(lambda).
Selection (mu+lambda) ==> log(mu+lambda).
(mu,lambda) ==> - log(lambda).
Full-ranking (mu+lambda) ==> mu + log(mu+lambda).
But rate linear in lambda
for lambda < d.
Infinite archive (i.e. mu increasing):
can we have superlinear rates ?
Yes with super strange
functions, but for the sphere ?
Conclusion
Usually lower bounds are harder than upper bounds.
Not really the case for continuous ES :-)
Upper bounds for continuous noise-free ES:
Open problem 1: Markov analysis: we don't have the
sign of the constant in log ||x_n|| ≈ C n
(pattern search: more convenient for proofs,
but not what we use in real life)
Open problem 2: Can comparison-based algorithms
be superlinear ? Yes on
super-artificial functions, but on the sphere ?
Open problem 3: on which functions/noise can we have slope -1
with comparison-based algorithms ?
Not cited, sorry
● Optimization with constraints (D. Arnold & others)
● Parallel BFGS with MPI_allreduce/Hadoop (N. Leroux)
● Drift analysis (beautiful tutorial paper by P. Oliveto & C. Witt)
● Optimization with unbiased operators (P.-K. Lehre, C. Doerr & others)
● Natural gradient (Y. Akimoto, Y. Ollivier, A. Auger & others)
● Dynamic optimization (B. Doerr & others)
● PSO, ants (C. Witt, D. Sudholt, F. Neumann)
● Fixed budget vs rates (T. Jansen & others)
● Noisy optim. (C. Giessen, T. Kötzing, S. Astete-Morales, J. Liu, M.-L. Cauwet)
● Bandit-based noisy optimization (C. Igel, V. Heidrich-Meisner, J. Decock, P. Rolet)
● Optimization of linear functions (A. Auger & others)
Sorry for all people I have forgotten :-)
Maybe my last research talk :-)
Much more an engineer than a mathematician.
Most of you are 10x better than me at maths :-)
All questions welcome!
(the audience will answer, they
know more than me :-) )