Theories of continuous
optimization
olivier.teytaud@inria.fr
Or the art of preparing coffee,
and a little bit of sorting algorithms.
Sorry for not being here
yesterday :-)
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous
optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
The many flavors of continuous
optimization
● What do we optimize ? (x in R^d, f(x) in R)
– argmin_x f(x)
– Pareto front { ( f_1(x), f_2(x), f_3(x) ) ; x }
– argmin_{x such that Z(x) } f(x)
● How to optimize ?
– Exactly
– Approximately
What are the best temperature/pressure
for preparing espresso ?
What are the best
temperature/pressure
for preparing espresso with <12 bars ?
What are the
coffee
parameters
such that it can
not become
simultaneously
better for Bob,
Alice and
Charles ?
Coffee is never perfect
But in discrete domains
people want perfect coffee,
i.e. exact solutions
The many flavors of continuous
optimization
● From which information ?
– (x,y) → << is f(x) < f(y) ? >> (ternary)
(user feedback ? noise-free & unbiased ?)
– x → f(x)
– x → f(x), ∇f(x) (99% of the continuous opt. world...)
– x → f(x), ∇f(x), Hf(x)
● Which criterion ?
– f(x) after time T
– f(x) after T comparisons
– f(x) after T evaluations
– ...
My coffee is twice-differentiable
and I've met its Hessian
I don't give marks to coffee
but I can choose
A coffee, quickly!
I want to drink T coffees.
The many flavors of continuous
optimization
Comparison information & number of comparisons
(comparison-complexity)
Or comparison information & number of evaluations ?
(strange ?)
(but no need for all λ^2 pairwise comparisons: sorting ranks λ points with λ log(λ) comp., see the sketch below)
Or black-box & number of evaluations ? (black-box complexity)
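As an aside on the λ log(λ) claim above (a minimal Python sketch, not from the slides; the objective and the comparison counter are placeholder choices): sorting λ points by fitness performs roughly λ log(λ) comparisons, far below the λ^2 pairwise comparisons.

import functools, random

comparisons = 0

def f(x):                      # placeholder objective for the example
    return sum(xi * xi for xi in x)

def compare(a, b):             # comparison-based access to f, with a counter
    global comparisons
    comparisons += 1
    return -1 if f(a) < f(b) else 1

lam = 1000
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(lam)]
ranked = sorted(points, key=functools.cmp_to_key(compare))
print(comparisons)             # about lam*log2(lam) ~ 10^4, not lam^2 = 10^6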
Black box vs white box:
BB might be much easier
because we do not count time!
In discrete domains (example by T. Jansen, right ? Maybe someone else
as well ?):
Argmin quadraticFunction ==> NP-complete, but
black-box complexity = O(B(B+1)/2+B+2)
(algo = find the quadratic form!)
==> super expensive but BB easy!
In (0,1): encode x>0 in (0,1/2), in unary
and minimum of f at 1/2 + 1/(2^busy-beaver(x)), in (1/2, 1)
==> takes time busy-beaver(x) (to be proved :-) ) but BB computable in
O(x) !
(BB or comparison-based!)
I love internet, we can
find an image with
a beaver and coffee
Introduction: the (1+1)-ES
(Schumer and Steiglitz)
Thanks Anne Auger for not complaining for all I have stolen from your slides.
Simple, but good coffee.
Why ?
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Basic schema of an
Evolution Strategy
Parameters:
x, σ
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Obviously parallel
Parameters:
x, σ
Multi-cores,
Clusters, Grids...
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Obviously parallel
Parameters:
x, σ
Really simple.
Generate λ points around x
( x + σ N where N is a standard
Gaussian)
Compute their λ fitness values
Select the μ best
Let x = average of these μ best
Obviously parallel
Parameters:
x, σ
Really simple.
Not a negligible advantage.
When I accessed, for the 1st time,
a crucial industrial
code of an important
company, I believed
that it would be
clean and bug free.
(I was young :-) )
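A minimal Python sketch of the basic scheme above (generate λ points around x with step-size σ, keep the μ best, average them); the sphere objective, the parameter values and the fixed σ are placeholder choices for illustration, not from the slides.

import random

def sphere(x):                                # placeholder objective
    return sum(xi * xi for xi in x)

def mu_lambda_es(f, x, sigma, lam=20, mu=5, iterations=200):
    d = len(x)
    for _ in range(iterations):
        # generate lambda points around x: x + sigma * N, N standard Gaussian
        offspring = [[xi + sigma * random.gauss(0, 1) for xi in x]
                     for _ in range(lam)]
        # compute their lambda fitness values, select the mu best
        offspring.sort(key=f)
        best = offspring[:mu]
        # let x = average of these mu best
        x = [sum(p[i] for p in best) / mu for i in range(d)]
    return x

print(sphere(mu_lambda_es(sphere, [5.0] * 3, sigma=0.3)))

Here σ is kept fixed; how to choose and adapt it is exactly the question raised in the next slides.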
Generate 1 point x' around x
( x + σ N where N is a standard
Gaussian)
Compute its fitness value
Keep the best (x or x').
x = best(x, x')
σ = 2 σ if x' is best
σ = 0.84 σ otherwise
Parameters:
x, σ
The (1+1)-ES with 1/5th
rule
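A minimal Python sketch of this (1+1)-ES with the step-size constants from the slide (σ multiplied by 2 on success, by 0.84 otherwise); the sphere objective and the budget are placeholder choices.

import random

def sphere(x):                                # placeholder objective
    return sum(xi * xi for xi in x)

def one_plus_one_es(f, x, sigma, evaluations=2000):
    fx = f(x)
    for _ in range(evaluations):
        # generate 1 point x' around x: x + sigma * N
        xp = [xi + sigma * random.gauss(0, 1) for xi in x]
        fxp = f(xp)
        # keep the best (x or x'), and adapt sigma
        if fxp < fx:
            x, fx = xp, fxp
            sigma *= 2.0                      # x' best ==> increase sigma
        else:
            sigma *= 0.84                     # otherwise ==> decrease sigma
    return x, sigma

x, sigma = one_plus_one_es(sphere, [5.0] * 3, sigma=1.0)
print(sphere(x), sigma)

With these constants the step-size is roughly stable when about 1/5 of the mutations are accepted, which is the rule discussed below.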
This is x...
I generate λ=6 points
I select the μ=3 best points
x = average of these μ=3 best points
Ok.
Choosing an initial
x is as in any algorithm.
But how do I choose sigma ?
Ok.
Choosing x is as in any algorithm.
But how do I choose sigma ?
Sometimes by human guess.
But for a large number of iterations,
there is something better.
log || x_n – x* || ~ - C n
Usually termed “linear convergence”,
==> but it's in log-scale.
log || x_n – x* || ~ - C n
Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1) N
We want to maximize:
- E log || x(n) - x* ||
Ok, we want to choose σ. How to do that ?
Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1) N
We want to maximize:
- E log || x(n) - x* ||
__________________
- E log || x(n-1) – x* ||
Consider the (1+1)-ES.
x(n) = x(n-1) or x(n-1) + σ(n-1) N
We want to maximize:
- E log || x(n) - x* ||
__________________
- E log || x(n-1) – x* ||
We don't know x*.
How can we optimize this ?
We will observe
the acceptance rate,
and we will deduce if σ
is too large or too small.
- E log || x(n) - x* ||
___________________________________
- E log || x(n-1) – x* ||
[Figure, on the norm function: level set through the current point, the optimum,
and the regions of accepted vs. rejected mutations; the ratio above is the progress rate.]
For each step-size,
evaluate this “expected progress rate”
and evaluate “P(acceptance)”
[Plot, repeated over the next slides: progress rate and acceptance rate as functions of the step-size, with the rejected mutations marked.]
We want to be where the progress rate is maximal!
We observe (approximately) the acceptance rate.
Big step-size ==> small acceptance rate
==> decrease sigma.
Small step-size ==> big acceptance rate
==> increase sigma.
1/5th rule with arbitrary pop-size
(Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud)
Based on maths showing
that good step-size
<==> success rate ≈ 1/5
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noise-free setting, rates
(optimum at 0, for short)
● Newton (f, ∇f, Hf):
||x_n|| = O( ||x_{n-1}||^2 ) “quadratic”
● BFGS (f, ∇f)
superlinear: ||x_n|| = o( ||x_{n-1}|| )
(LBFGS: R-linear)
● NEWUOA (f) or BOBYQA
finite number of bits ==> complexity (dimension)
● ES (comparison)
||x_n|| = O( K^{1/d} || x_{n-1} || ) “linear”
n ~ d x log ( 1 / ε ) (see the computation below)
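The last line follows from the one before it by a one-line computation (standard, not spelled out on the slide): iterating the linear rate, with K < 1,
$$\|x_n\| \lesssim K^{n/d}\,\|x_0\| \le \varepsilon \;\Longrightarrow\; n \ge \frac{d\,\log(\|x_0\|/\varepsilon)}{\log(1/K)} \sim d\,\log(1/\varepsilon).$$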
Noise-free setting, q-rates
q-superlinear convergence
q-linear convergence
Noise-free setting, r-rates
q-rates have a problem:
You might have a super fast convergence, but no
q-superlinear convergence, because a few values
are super-small (the sequence is not decreasing)
r-rate: the r-rate of x_n is the best q-rate
of a sequence y_n with
values y_n ≥ x_n
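For reference, the standard definitions behind these names (the formulas on the slides were only in the pictures); writing $e_n = \|x_n - x^*\|$:
$$\text{q-linear: } \limsup_n \frac{e_{n+1}}{e_n} \le c < 1, \qquad \text{q-superlinear: } \frac{e_{n+1}}{e_n} \to 0, \qquad \text{q-quadratic: } e_{n+1} = O(e_n^2),$$
and the r-rate of $(x_n)$ is the best q-rate of a sequence $(y_n)$ with $y_n \ge e_n$ for all n.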
Noise-free setting, rates by
Markov chain analysis
● A. Auger, A. Chotard, N. Hansen, sphere functions and then linear functions.
● Typical idea in the sphere case: rescaling!
(x/σ) is a homogeneous Markov chain
==> (x/σ) is asymp. stationary (for some ES and functions...)
==> so the progress rate is computed under the asymptotic probability distribution, such
that hopefully
E( log ||x(n+1)/σ(n+1)|| - log ||x(n)/σ(n)|| | ||x(n)/σ(n)|| )
is negative.
Not proved!
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noise-free setting, lower bounds
(Gelly, Fournier, Teytaud, ...)
Remember lower bound for sorting with comparison-based
algorithms ?
● n! possible inputs
● 2^k possible outcomes for k comparisons
● 2^k must be at least n! ==> k at least log(n!) ≈ n log(n)
indeed, we have no proof that better than n log(n) is
impossible for integers.
==> same concept for ES!
Noise-free setting, lower bounds
Number of possible outputs:
● K possible outputs per iteration
● K^N possible outputs after N iterations
==> can be extended to stochastic algorithms
How many possible outputs are needed for
precision ε in domain D for a given norm ?
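A standard covering count answers this (not spelled out in the extracted slide): for the sup norm, $[0,1]^d$ contains about $(1/(2\varepsilon))^d$ disjoint ε-balls, so guaranteeing precision ε requires at least that many distinguishable outputs:
$$\#\text{outputs} \;\ge\; \Big(\tfrac{1}{2\varepsilon}\Big)^{d}.$$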
Noise-free setting, lower bounds
“Branching factor” K = nb of possible outcomes
when comparing the offspring fitness
● 2 for (1+1)-ES (better or worse!) (3 if tie)
● (μ,λ)-ES ==> λ choose μ
(generate λ points, use the list of the μ best)
● (μ+λ)-ES ==> λ+μ choose μ
(generate λ points, use the list of the μ best among all)
● arbitrary ranking ES ==> λ!
● ranking with archive ==> O(n) possibilities for the n-th point
Fournier & Teytaud,
Algorithmica'06.
(Counting argument: compare K^N with the number of ε-balls
needed to cover [0,1]^d; see below.)
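Putting the two counts together gives a back-of-the-envelope version of the Fournier & Teytaud bound (constants dropped):
$$K^{N} \;\ge\; \Big(\tfrac{1}{2\varepsilon}\Big)^{d} \;\Longrightarrow\; N \;\ge\; \frac{d\,\log(1/(2\varepsilon))}{\log K},$$
so the number of iterations is at least linear in the dimension and logarithmic in 1/ε, divided by log K, where K is the branching factor listed above.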
I'm grateful to J. Jaegerskuepper,
I fell in love while watching
his talk at Dagstuhl'06.
(in love with lower bounds, not with Jens)
Noise-free setting, lower bounds,
multiobjective case
d spherical objective functions in dimension N
I've written a paper without assuming
sphere-like objective functions.
Using tricks on Hausdorff metric.
The most unreadable thing I've ever written.
I apologize for this.
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noisy setting, rates
Search point ≠ recommended point
Even in deterministic
optimization this is important.
Noisy setting, rates
Simple regret (SR) & cumulated regret (CR):
SR(n) = E[ fitness( n-th recommended point ) ]
– E[ fitness(x*) ]
I want to
become a
great barista
I want a good
coffee on average
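In symbols (the cumulated-regret formula was not in the extracted text; this is the standard definition, with $\tilde x_n$ the n-th recommended point and $x_n$ the n-th evaluated point):
$$SR(n) = \mathbb{E}\,f(\tilde x_n) - \mathbb{E}\,f(x^*), \qquad CR(N) = \sum_{n=1}^{N} \big(\mathbb{E}\,f(x_n) - \mathbb{E}\,f(x^*)\big).$$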
Simple regret, uniform regret,
cumulative regret
Simple regret, uniform regret,
cumulative regret
Noise model: z in {0,1,2}
0: constant variance noise / additive noise
1: linear variance noise
2: quadratic variance noise / multiplicative noise
Also actuator noise: f(x,w) = f(x+w)
My coffee
machine has
noise
No excuse for bad
coffee, no noise at
the optimum (Markov
chain analysis ok)
E.g. success rates of
parametric policies
with optimum = 100%
Noisy setting (noise Θ(1)),
comparison-based rates if z=0
(M.-L. Cauwet <== looking for a post-doc)
(showing that MLIS (Beyer'98) leads to SR=O(1/N) with constant noise)
1. Comparison-based noisy
optimization: the operator
Don't average and compare:
just compare many times!
Hint: apply a sorting algorithm;
it will be much faster than n^2
comparisons :-) (for
comparison-based criteria
rather than evaluation-based
complexity)
Noisy setting,
comparison-based rates
2. Key points:
A frequency estimated from
N comparisons has precision
1/sqrt(N).
a) Optimum position = exactly
computable from the probability
of “f(-1) > f(1)”.
b) This probability is approx.
the observed frequency, with
precision 1/sqrt(N).
Evaluate f(1) and f(-1),
N times each,
consider the frequency
at which 1 is better.
Some algebra :-)
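A minimal Python sketch of point 2.b on a 1-d toy problem (the quadratic objective, its optimum and the Gaussian noise are placeholder choices): the frequency with which the noisy value at 1 beats the noisy value at -1 estimates P( f(1)+noise < f(-1)+noise ) within about 1/sqrt(N).

import math, random

def noisy_f(x, x_star=0.3):
    # toy noisy objective: quadratic + constant-variance (z=0) Gaussian noise
    return (x - x_star) ** 2 + random.gauss(0, 1)

def frequency_1_beats_minus_1(N):
    wins = sum(noisy_f(1.0) < noisy_f(-1.0) for _ in range(N))
    return wins / N

for N in (100, 10000, 1000000):
    print(N, round(frequency_1_beats_minus_1(N), 4), "+/-", round(1 / math.sqrt(N), 4))

Recovering the optimum position from this probability (point 2.a) is the “some algebra” alluded to on the slide and is not attempted here.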
Noisy setting,
comparison-based rates
3. Noisy optimization
in dimension 1
Mutate large,
inherit small,
with large
population
==>
SR = O(1/N)
Extension to
strongly convex
multidimensional
quadratic forms
ok
Noisy setting,
portfolio methods
(S. Astete-Morales, M.-L. Cauwet, J. Liu, B. Rozières, T.)
Portfolio methods:
● Running several methods
● Picking up the best
Sometimes a priori, often online, sometimes
chaining (switch algorithm in real time).
Remarks:
● “Introsort” (used in STL) is a sorting algorithm with
chaining.
● In combinatorial optimization, chaining is less usual.
A portfolio algorithm: 3 key
parameters r_n, s_n and lag
r_n = time of the
n-th
comparison!
r_n large ==> comparisons are rare
Main index
Budget s_n
Run solvers !
(fair budget!)
s_n, r_n and lag
Many solutions for writing portfolios.
In all cases, you have roughly these parameters:
● How frequently you compare
● How precisely you compare (adaptive ? Maybe,
but no “race” (or you might spend a huge time) )
● More surprising, whom you compare: old outputs,
because old outputs are cheaper to compare!
Lag
Lag principle in noisy optimization with portfolios:
● Compare “old” recommendations (because cheaper)
● Recommend “current” recommendations
● Assume “best old = best current” asymptotically
==> nothing tricky, just algebra and standard bounds
Unfair budget: evaluate
until lag(r_n), not until r_n!
Except for
best!
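A minimal Python sketch of a portfolio with lagged comparisons (the solver, the schedule and the lag rule are illustrative choices, not the NOPA/INOPA algorithms of the papers): each solver keeps its history of recommendations; at comparison times we compare old (lagged, hence cheaper-to-compare) recommendations, but we return the current recommendation of the winner.

import random

def sphere(x):                                # placeholder objective (noise-free here)
    return sum(xi * xi for xi in x)

class OnePlusOneSolver:
    # the (1+1)-ES from earlier, exposed step by step, with a history of recommendations
    def __init__(self, f, x, sigma):
        self.f, self.x, self.fx, self.sigma = f, list(x), f(x), sigma
        self.history = [list(x)]
    def step(self):
        xp = [xi + self.sigma * random.gauss(0, 1) for xi in self.x]
        fxp = self.f(xp)
        if fxp < self.fx:
            self.x, self.fx, self.sigma = xp, fxp, self.sigma * 2.0
        else:
            self.sigma *= 0.84
        self.history.append(list(self.x))

def portfolio(f, solvers, budget=3000, compare_every=200, lag=2):
    best = solvers[0]
    for n in range(1, budget + 1):
        for s in solvers:                     # fair budget: one step per solver
            s.step()
        if n % compare_every == 0:            # r_n: how frequently we compare
            old = n // lag                    # compare lagged recommendations
            best = min(solvers, key=lambda s: f(s.history[old]))
    return best.x                             # recommend the *current* recommendation of the winner

solvers = [OnePlusOneSolver(sphere, [5.0] * 3, s) for s in (0.01, 0.3, 3.0)]
print(sphere(portfolio(sphere, solvers)))

In the noisy setting the comparison of old recommendations would itself use repeated noisy evaluations; here the comparison is exact to keep the sketch short.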
Unreadable theorem
Improved theorem, INOPA, still
unreadable
Almost readable theorem
If solver i has regret (C(i)+o(1)) / n^α(i)
, and
then,
● NOPA has the log(m) shift ;
● and INOPA has the log(m') shift.
Theories of continuous
optimization
olivier.teytaud@inria.fr
1. The many flavors of continuous optimization
2. Noise-free setting, rates
3. Noise-free setting, lower bounds
4. Noisy setting, rates
5. Noisy setting, lower bounds
Noisy setting, lower bounds:
validating “mutate large inherit small” (H.G. Beyer, 1998)
Simple ES:
Sampling around the
recommendation
Sampling radius
≈ distance to optimum
Search
points
Proof of slope of simple regret at best -1/2
I am a lazy guy, I
do the proof by
reduction to CR.
[Equations on the slide; the annotations read: “if wrong, then ...”,
“by definition of CR”, “it's the sphere”, “it's a Gaussian”,
“sampling scale ≈ distance scale”.]
Ok, CR ≈ sum of SR
(not obvious, SR is based on
recommendations).
If “sampling distance
≈ distance to optimum“
then CR ≈ sum of SR
(and, thanks, O. Shamir has proved
that CR is at best of order sqrt(n), i.e. slope 1/2).
Parallelization in evolution
strategies in the noise-free
setting
[Plot: convergence rate (inverse runtime) as a function of the population size λ.]
Linear speed-up up to λ ≈ dimension
(Beyer's book &
“on the benefit
of sex”),
logarithmic beyond
(Gelly, T., Fournier).
Not always!
But speculative
parallelization can do that.
Special parametrization
required for some algs.
(Teytaud & T.)
You don't want to know
which image I have found
by googling “coffee” and “sex”
Conclusion in noisy continuous
optimization with constant noise
[Plot: log(simple regret) vs log(# evaluations); the slopes:]
Slope -1: Kiefer-Wolfowitz style, using many derivatives, e.g. Fabian,
+ Hessian estimates if the third derivative = 0,
+ ES with MLIS on the sphere.
Slope -2/3: Hessian estimate.
Slope -1/2: evolution strategies and other sampling/selecting algorithms
(when no MLIS).
Conclusion in noise-free
continuous optimization
[Plot: log(simple regret) vs # evaluations (V = d for the sphere).]
Full-ranking (mu,lambda) ==> mu + log(lambda).
Selection (mu+lambda) ==> log(mu+lambda).
(mu,lambda) ==> - log(lambda).
Full-ranking (mu+lambda) ==> mu + log(mu+lambda).
But rate linear in lambda
for lambda < d.
Infinite archive (i.e. mu increasing):
can we have superlinear rates ?
Yes with super strange
functions, but for the sphere ?
Conclusion
Usually lower bounds are harder than upper bounds.
Not really the case for continuous ES :-)
Upper bounds for continuous noise-free ES:
Open problem 1: Markov analysis: we don't have the
sign of the constant in log ||x_n|| ≈ C n
(pattern search: more convenient for proofs,
but not what we use in real life)
Open problem 2: Can comparison-based algorithms
be superlinear ? Yes on
super-artificial functions, but on the sphere ?
Open problem 3: on which functions/noise can we have slope -1
with comparison-based algorithms ?
Not cited, sorry
● Optimization with constraints (D. Arnold & others)
● Parallel BFGS with MPI_allreduce/Hadoop (N. Leroux)
● Drift analysis (beautiful tutorial paper by P. Oliveto & C. Witt)
● Optimization with unbiased operators (P.-K. Lehre, C. Doerr & others)
● Natural gradient (Y. Akimoto, Y. Ollivier, A. Auger & others)
● Dynamic optimization (B. Doerr & others)
● PSO, ants (C. Witt, D. Sudholt, F. Neumann)
● Fixed budget vs rates (T. Jansen & others)
● Noisy optim. (C. Giessen, T. Kötzing, S. Astete-Morales, J. Liu, M.-L. Cauwet)
● Bandit-based noisy optimization (C. Igel, V. Heidrich-Meisner, J. Decock, P. Rolet)
● Optimization of linear functions (A. Auger & others)
Sorry for all people I have forgotten :-)
Maybe my last research talk :-)
Much more an engineer than a mathematician.
Most of you are 10x better than me at maths :-)
All questions welcome!
(the audience will answer, they
know more than me :-) )