Loss Calibrated Variational Inference
Tomasz Kuśmierczyk
Joseph Sakaya
October 17, 2019
Outline of the Talk
Recap of Lecture 11 - Variational Inference
Reparameterization gradients
Bayesian decision theory
Loss calibrated variational inference: framework
Loss calibration: discrete case
Loss calibration: continuous case
Conclusion
Recap: Lecture 11
Motivation
MCMC approximates posteriors by sampling
Computationally expensive
Asymptotic convergence
Diagnostics can be tricky
Variational inference approximates the posterior with a parametric family of distributions q(θ; λ)
Converts inference to an optimization problem
Scales very well
Does not converge to the true posterior
Minimize the KL divergence between a proxy q(θ; λ) and p(θ|D):
KL(q(θ; λ) ‖ p(θ|D)) = E_{q(θ;λ)} [ log ( q(θ; λ) / p(θ|D) ) ]
Variational Inference
Evidence Lower Bound
Consider the decomposition:
log p(D) = KL(q(θ; λ) ‖ p(θ|D)) + E_{q(θ;λ)} [log p(D, θ) − log q(θ; λ)]
where the second term is the ELBO, L(λ).
Minimizing the KL is the same as maximizing L(λ), since log p(D) is constant w.r.t. λ.
Therefore,
λ* = arg max_λ L(λ) ≡ arg min_λ KL(q(θ; λ) ‖ p(θ|D)).
Reparameterization gradients
Objective to maximize:
L(λ) = E_{q(θ;λ)} [log p(D, θ) − log q(θ; λ)],   λ* = arg max_λ L(λ)
Optimization via gradient ascent: the gradient ∇_λ L(λ) involves the distribution q(θ; λ) over which the expectation is taken.
Use the reparameterization trick to transform E_{q(θ;λ)}[ . . . ] into an expectation over a base distribution, E_{q(ε)}[ . . . ].
Reparameterization gradients
Draw S samples from the base distribution: ε_s ∼ q(ε).
Transform θ_s = f(ε_s, λ) and evaluate the Monte Carlo estimate of the ELBO:
L(λ) ≈ (1/S) Σ_{s=1}^{S} [log p(D, θ_s) − log q(θ_s; λ)].
The Monte Carlo estimate of the gradient then becomes:
∇_λ L(λ) ≈ (1/S) Σ_{s=1}^{S} ∇_λ [log p(D, θ_s) − log q(θ_s; λ)].
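A minimal sketch of these estimates, assuming a toy conjugate model (θ ∼ N(0, 1), d_n ∼ N(θ, 1)) and a Gaussian approximation q(θ; λ) = N(µ, σ²) with σ = exp(ρ); the model, data, and step sizes are illustrative only, and the derivatives are written out by hand instead of using autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(2.0, 1.0, size=20)            # observed data for the toy model
mu, rho = 0.0, 0.0                           # variational parameters lambda = (mu, rho)

def elbo_and_grads(mu, rho, S=64):
    sigma = np.exp(rho)
    eps = rng.normal(size=S)                 # eps_s ~ q(eps) = N(0, 1)
    theta = mu + sigma * eps                 # theta_s = f(eps_s, lambda)
    # log p(D, theta_s) and its derivative w.r.t. theta (constants dropped)
    log_joint = -0.5 * ((D[None, :] - theta[:, None]) ** 2).sum(1) - 0.5 * theta ** 2
    dlogp_dtheta = (D[None, :] - theta[:, None]).sum(1) - theta
    # log q(theta_s; lambda) after substituting theta = mu + sigma * eps
    log_q = -rho - 0.5 * eps ** 2
    elbo = np.mean(log_joint - log_q)
    grad_mu = np.mean(dlogp_dtheta)                       # d theta / d mu = 1
    grad_rho = np.mean(dlogp_dtheta * sigma * eps) + 1.0  # + d(-log q) / d rho
    return elbo, grad_mu, grad_rho

for _ in range(2000):                        # plain gradient ascent on the ELBO
    _, g_mu, g_rho = elbo_and_grads(mu, rho)
    mu, rho = mu + 0.01 * g_mu, rho + 0.01 * g_rho

print(mu, np.exp(rho))                       # approximate posterior mean / std
print(D.sum() / (len(D) + 1), np.sqrt(1.0 / (len(D) + 1)))   # exact posterior values
```

In this conjugate example the result can be checked against the exact posterior N(Σ_n d_n / (N+1), 1/(N+1)).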
Bayesian Decision Theory
Decision making under uncertainty characterized by the
posterior p(θ|D)
Make optimal decisions h, given p(θ|D) and utility u(h, θ)
defined over the parameters θ
An optimal decision maximises the posterior gain (expected utility):
G(h) = ∫_Θ u(h, θ) p(θ|D) dθ
Or, alternatively, minimizes the risk (expected loss):
R(h) = ∫_Θ ℓ(h, θ) p(θ|D) dθ
Bayesian Decision Theory - Example
When ℓ(h, θ) = (h − θ)², what is the optimal decision h?
How about when ℓ(h, θ) = |h − θ|?
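A small sketch of these two cases, assuming we only have samples θ_s from p(θ|D) (a stand-in gamma posterior below) and estimate the risk R(h) by Monte Carlo over a grid of candidate decisions: the minimiser is the posterior mean for the squared loss and the posterior median for the absolute loss.

```python
import numpy as np

theta = np.random.default_rng(1).gamma(2.0, 2.0, size=5_000)   # stand-in posterior samples
grid = np.linspace(theta.min(), theta.max(), 400)               # candidate decisions h

risk_sq = ((grid[:, None] - theta[None, :]) ** 2).mean(axis=1)  # R(h) for squared loss
risk_abs = np.abs(grid[:, None] - theta[None, :]).mean(axis=1)  # R(h) for absolute loss

print(grid[risk_sq.argmin()], theta.mean())       # squared loss  -> posterior mean
print(grid[risk_abs.argmin()], np.median(theta))  # absolute loss -> posterior median
```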
Bayesian Decision Theory
The optimal decision maximizes the gain:
G(h) = ∫_Θ u(h, θ) p(θ|D) dθ,   h*_p = arg max_{h∈H} G(h)
However, p(θ|D) is intractable.
Bayesian Decision Theory
Approximate p(θ|D) with q(θ; λ):
G_q(h) = ∫ u(h, θ) q(θ; λ) dθ,   h*_q = arg max_{h∈H} G_q(h)
Million dollar question: Is h*_q = h*_p?
Bayesian Decision Theory
Example: Nuclear power plant
Collect temperature data D from sensor.
Infer a posterior distribution p(θ|D) over θ.
Utility matrix:
         θ < Tcrit   θ ≥ Tcrit
on       10^10       10^0
off      10^5        10^10
Bayesian Decision Theory
Example: Nuclear power plant
In each of the cases what is the optimal decision?
G(h = ‘on’) = ∫_0^Tcrit 10^10 · p(θ|D) dθ + ∫_Tcrit^500 10^0 · p(θ|D) dθ
G(h = ‘off’) = ∫_0^Tcrit 10^5 · p(θ|D) dθ + ∫_Tcrit^500 10^10 · p(θ|D) dθ
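A hedged sketch of the comparison, assuming we can only sample θ from the posterior; the posterior, Tcrit, and the utility exponents (read as 10^10, 10^0, 10^5, 10^10 from the matrix above) are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(85.0, 10.0, size=100_000)   # stand-in samples from p(theta | D)
Tcrit = 100.0

U = {('on',  'safe'): 1e10, ('on',  'hot'): 1e0,
     ('off', 'safe'): 1e5,  ('off', 'hot'): 1e10}

safe = theta < Tcrit                           # indicator of theta < Tcrit
for h in ('on', 'off'):
    gain = U[(h, 'safe')] * safe.mean() + U[(h, 'hot')] * (~safe).mean()
    print(h, gain)                             # Monte Carlo estimate of G(h)
```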
Bayesian Decision Theory
Unimodal posteriors – VI
Bayesian Decision Theory
Multimodal posteriors – VI
Bayesian Decision Theory
Multimodal posteriors – Expectation Propagation
Bayesian Decision Theory
Multimodal posteriors – ideal fit
Bayesian Decision Theory
Two Types of Decisions
Decision over parameters
G(h) = ∫_Θ u(h, θ) p(θ|D) dθ,   h* = arg max_{h∈H} G(h)
Decision over model outputs
G(h|x) = ∫_Θ ∫_Y u(h, y) p(y|θ, x) dy p(θ|D) dθ,   h* = arg max_{h∈H} G(h|x)
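A minimal Monte Carlo sketch of the second case, G(h|x): average the utility over posterior samples θ_s and predictive samples y ∼ p(y|θ_s, x), then maximise over candidate decisions. The Gaussian posterior/predictive and the utility u(h, y) = exp(−|h − y|) below are illustrative assumptions, not the talk's model.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(1.0, 0.3, size=1_000)        # stand-in samples from p(theta | D)
x = 2.0                                         # a single test input

def gain(h, n_y=100):
    # y ~ p(y | theta, x), here an illustrative Gaussian predictive
    y = rng.normal(theta[:, None] * x, 0.5, size=(len(theta), n_y))
    return np.exp(-np.abs(h - y)).mean()        # MC estimate of G(h | x)

candidates = np.linspace(0.0, 4.0, 81)
print(candidates[np.argmax([gain(h) for h in candidates])])   # h* = arg max_h G(h | x)
```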
Lessons learnt
If you have access to the full posterior, you have nothing to worry about: the posterior is necessary and sufficient for making accurate decisions.
If you are approximating a multi-modal posterior with a
unimodal variational distribution, the decision making task
should be part of the inference.
Do not take anything for granted, especially when the inference is a black box.
Loss Calibrated Lower Bound
Lower bound the Gain
log G(h) = log ∫ p(θ|D) u(θ, h) dθ
= log ∫ (q(θ) / q(θ)) p(θ|D) u(θ, h) dθ
≥ ∫ q(θ) log [ p(θ|D) u(θ, h) / q(θ) ] dθ    (via Jensen’s inequality)
= −KL(q ‖ p) + ∫ q(θ) log u(θ, h) dθ
= ELBO(λ) − log p(D) + ∫ q(θ) log u(θ, h) dθ
LCVI: objective
New objective:
L(λ, h) = E_{q(θ;λ)}[log p(D, θ) − log q(θ; λ)] + E_{q(θ;λ)}[log u(h, θ)]
where the first term is ELBO(λ), the evidence lower bound, and the second is U(λ, h), a utility-dependent calibration term.
Optimization using EM:
M-step: h*_q = arg max_{h∈H} G_q(h)
E-step: λ* = arg max_λ L(λ, h*_q)
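A schematic of this alternating scheme, assuming two hypothetical callables supplied by model-specific code: gain(h, lam), a Monte Carlo estimate of G_q(h), and lcvi_grad(lam, h), a reparameterization gradient of L(λ, h); both names are placeholders, not an existing API.

```python
def lcvi_em(lam, decisions, gain, lcvi_grad, n_outer=100, n_inner=50, lr=1e-2):
    """Alternate between picking the best decision and improving q(theta; lam)."""
    for _ in range(n_outer):
        # M-step: decision that maximises the current expected utility (lam fixed)
        h = max(decisions, key=lambda d: gain(d, lam))
        # E-step: a few gradient ascent steps on L(lam, h) with h fixed
        for _ in range(n_inner):
            lam = lam + lr * lcvi_grad(lam, h)
        yield lam, h
```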
Discrete case: Diabetes
D = {(x, y)},  x – patient covariates,  Y = {Healthy, Moderate, Severe}
Utility matrix u(h, y):
           y = He   y = Mod   y = Sev
h = He      2.0      1.0       0.0
h = Mod     1.2      2.0       1.3
h = Sev     1.1      1.4       2.0
It is bad to say ’Healthy’ when the truth is ’Severe’.
Discrete case: model & Automatic VI
Likelihood (softmax): p(y = c_j | θ, x) = e^{x·θ_j} / Σ_k e^{x·θ_k}
Some priors: p(θ_Se), p(θ_Mod), p(θ_He)
Mean-field approximation family:
q(θ_Se, θ_Mod, θ_He) = N(θ_Se | µ_Se, σ²_Se) N(θ_Mod | µ_Mod, σ²_Mod) N(θ_He | µ_He, σ²_He)
Reparametrization:
θ_Se = µ_Se + σ_Se · ε_Se,   θ_Mod = µ_Mod + σ_Mod · ε_Mod,   θ_He = µ_He + σ_He · ε_He
Maximize L_VI(λ) := ELBO(λ) w.r.t. the approximation parameters λ = {µ_Se, ..., σ_He}
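A sketch of this mean-field reparameterization, assuming D-dimensional covariates x; the dimension and initial values are illustrative, and the variable names mirror the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5                                              # number of covariates (illustrative)
lam = {c: {'mu': np.zeros(D), 'rho': np.zeros(D)}  # lambda = {mu_c, rho_c}, sigma_c = exp(rho_c)
       for c in ('He', 'Mod', 'Se')}

def sample_theta(lam):
    # theta_c = mu_c + sigma_c * eps_c,  eps_c ~ N(0, I)
    return {c: p['mu'] + np.exp(p['rho']) * rng.normal(size=D) for c, p in lam.items()}

def class_probs(theta, x):
    # softmax likelihood p(y = c | theta, x)
    logits = np.array([x @ theta[c] for c in ('He', 'Mod', 'Se')])
    e = np.exp(logits - logits.max())              # subtract max for numerical stability
    return e / e.sum()

x = rng.normal(size=D)
print(class_probs(sample_theta(lam), x))
```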
Recap: LCVI objective in predictive setting
L(λ, h) = ELBO(λ) + ∫ q(θ) log [ ∫ u(y, h) p(y|θ, D) dy ] dθ
Discrete case: LCVI objective
Sum over possible outputs:
L(λ, h) = ELBO(λ) + E_{q(θ;λ)} [ log Σ_{y∈Y} u(h, y) p(y|θ, D) ]
Expectation using MC:
≈ ELBO(λ) + (1/M) Σ_{θ∼q(θ;λ)} log Σ_{y∈Y} u(h, y) p(y|θ, D)
Reparameterization:
≈ ELBO(λ) + (1/M) Σ_{ε∼q(ε)} log Σ_{y∈Y} u(h, y) p(y | f_θ(ε, λ), D)
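A sketch of the last estimate for one fixed decision h, assuming the softmax model and utility matrix above; the variational parameters and the covariate vector are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ('He', 'Mod', 'Sev')
U = {'He':  {'He': 2.0, 'Mod': 1.0, 'Sev': 0.0},
     'Mod': {'He': 1.2, 'Mod': 2.0, 'Sev': 1.3},
     'Sev': {'He': 1.1, 'Mod': 1.4, 'Sev': 2.0}}

D = 5
mu  = {c: 0.1 * rng.normal(size=D) for c in classes}   # stand-in variational means
rho = {c: np.zeros(D) for c in classes}                # log standard deviations
x = rng.normal(size=D)                                 # one patient's covariates

def utility_term(h, M=256):
    total = 0.0
    for _ in range(M):
        eps = {c: rng.normal(size=D) for c in classes}                  # eps ~ q(eps)
        theta = {c: mu[c] + np.exp(rho[c]) * eps[c] for c in classes}   # reparameterization
        logits = np.array([x @ theta[c] for c in classes])
        p = np.exp(logits - logits.max()); p /= p.sum()                 # p(y | theta, x)
        total += np.log(sum(U[h][y] * p_y for y, p_y in zip(classes, p)))
    return total / M                                    # MC estimate of the utility term

print({h: utility_term(h) for h in classes})            # the M-step picks the arg max over h
```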
Discrete case: LCVI Optimization
Utility matrix u(h, y):
           y = He   y = Mod   y = Sev
h = He      2.0      1.0       0.0
h = Mod     1.2      2.0       1.3
h = Sev     1.1      1.4       2.0
M-step: choose the h that maximizes L(λ, h) (λ fixed)
E-step: use ∇_λ L(λ, h) to update λ (h fixed)
For example, if h = He:
L(λ, He) ≈ ELBO + (1/M) Σ_{ε∼q_0} log [ 2.0 · e^{x·θ_He} / Σ_k e^{x·θ_k} + 1.0 · e^{x·θ_Mod} / Σ_k e^{x·θ_k} + 0.0 · e^{x·θ_Se} / Σ_k e^{x·θ_k} ]
VI vs. LCVI: Test Data Confusion Matrices
VI (rows: true label, columns: predicted label He / Mod / Sev):
He    0.86  0.11  0.03
Mod   0.00  1.00  0.00
Sev   0.00  0.00  1.00
LCVI (rows: true label, columns: predicted label He / Mod / Sev):
He    0.99  0.01  0.00
Mod   0.00  1.00  0.00
Sev   0.00  0.00  1.00
Continuous case with double reparametrization
MC approximation of both integrals:
L(λ, h) ≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} [ log (1/N) Σ_{y∼p(y|θ,x)} u(h, y) ]
Reparametrization:
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{y∼p(y|f_θ(ε,λ),x)} u(h, y) ]
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{δ∼p_0} u(h, g_y(δ, f_θ(ε, λ))) ]
Gradient-based optimization w.r.t. h and λ.
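A sketch of the doubly-reparameterized utility term, assuming a Gaussian predictive p(y|θ, x) = N(x·θ, s²), which reparameterizes as y = x·θ + s·δ with δ ∼ N(0, 1), and an illustrative positive utility u(h, y) = exp(−(h − y)²); all parameter values are stand-ins. Because everything is a deterministic transform of ε and δ, an autodiff framework could differentiate this term w.r.t. h, µ, and ρ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 5, 64, 32
mu, rho = 0.1 * rng.normal(size=D), np.zeros(D)   # variational parameters lambda
x, s = rng.normal(size=D), 0.5                    # covariates and predictive noise scale

def utility_term(h):
    eps = rng.normal(size=(M, D))                 # eps_m    ~ q0 = N(0, I)
    theta = mu + np.exp(rho) * eps                # theta_m  = f(eps_m, lambda)
    delta = rng.normal(size=(M, N))               # delta_mn ~ p0 = N(0, 1)
    y = (theta @ x)[:, None] + s * delta          # y_mn = g_y(delta_mn, theta_m)
    u = np.exp(-(h - y) ** 2)                     # illustrative positive utility
    return np.log(u.mean(axis=1)).mean()          # (1/M) sum_m log (1/N) sum_n u

print(utility_term(1.0))
```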
Posterior predictive distribution shift
Figure: posterior predictive densities under VI and LCVI for one test case (user no. 791, artist: Muse), with the observed data and the optimal decisions h_VI and h_LCVI marked.
LCVI (blue) vs. VI (red/green)
Figure: empirical risk of LCVI (blue) vs. VI (red/green) for the squared loss and for the tilted loss with q = 0.2, 0.5, 0.8.
Conclusion
Bad posterior approximations result in sub-optimal
decisions / predictions
Learn better approximations (better for the concrete decision task)
Learn how to make better decisions from bad posteriors
References
Adam D Cobb, Stephen J Roberts, and Yarin Gal.
Loss-Calibrated Approximate Inference in Bayesian Neural
Networks.
In Theory of Deep Learning workshop, ICML, 2018.
Tomasz Kuśmierczyk, Joseph Sakaya, and Arto Klami.
Variational Bayesian Decision-making for Continuous
Utilities.
In Thirty-third Conference on Neural Information
Processing Systems, NeurIPS, 2019.
Simon Lacoste-Julien, Ferenc Huszár, and Zoubin
Ghahramani.
Approximate inference for the loss-calibrated Bayesian.
In Proceedings of the 14th International Conference on
Artificial Intelligence and Statistics, AISTATS, 2011.
Continuous case (detailed): Monte Carlo
L = ELBO + E_{q_λ(θ)} [ log ∫ u(h, y) p(y|θ, x) dy ]
Approximate the expectation using MC:
≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} log ∫ u(h, y) p(y|θ, x) dy
Approximate the inner integral using MC:
≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} [ log (1/N) Σ_{y∼p(y|θ,x)} u(h, y) ]
Continuous case (detailed): double reparametrization
L ≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} [ log (1/N) Σ_{y∼p(y|θ,x)} u(h, y) ]
The Monte Carlo estimate used for ∇_λ U(λ, h) follows by reparameterizing first θ and then y:
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{y∼p(y|f_θ(ε,λ),x)} u(h, y) ]
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{δ∼p_0} u(h, g_y(δ, f_θ(ε, λ))) ]
Continuous case (detailed): double reparametrization
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{δ∼p_0} u(h, g_y(δ, f_θ(ε, λ))) ]
p(y|·) needs to be reparameterizable:
until recently this was possible only for Gaussians, but see:
Michael Figurnov, Shakir Mohamed, Andriy Mnih. Implicit Reparameterization Gradients, arXiv, May 2018.
we need M × N samples
the computation graph is O(M × N)