Loss Calibrated Variational Inference
Tomasz Kuśmierczyk
Joseph Sakaya
October 17, 2019
Outline of the Talk
Recap of Lecture 11 - Variational Inference
Reparameterization gradients
Bayesian decision theory
Loss calibrated variational inference: framework
Loss calibration: discrete case
Loss calibration: continuous case
Conclusion
Recap: Lecture 11
Motivation
MCMC approximates posteriors by sampling
Computationally expensive
Asymptotic convergence
Diagnostics can be tricky
Variational inference approximates the posterior with a parametric family of distributions q(θ; λ)
Converts inference to an optimization problem
Scales very well
Does not converge to the true posterior
Minimize the KL divergence between a proxy q(θ; λ) and p(θ|D):
KL(q(θ; λ) ‖ p(θ|D)) = E_{q(θ;λ)} [ log ( q(θ; λ) / p(θ|D) ) ]
Variational Inference
Evidence Lower Bound
Consider the decomposition:
log p(D) = KL(q(θ; λ) ‖ p(θ|D)) + E_{q(θ;λ)} [log p(D, θ) − log q(θ; λ)]
where the second term is the ELBO, L(λ).
Minimizing the KL is the same as maximizing L(λ), since log p(D) is constant w.r.t. λ.
Therefore,
λ* = arg max_λ L(λ) ≡ arg min_λ KL(q(θ; λ) ‖ p(θ|D)).
Reparameterization gradients
Objective to maximize:
L(λ) = E_{q(θ;λ)} [log p(D, θ) − log q(θ; λ)],   λ* = arg max_λ L(λ)
Optimization via gradient ascent: the gradient ∇_λ L(λ) involves the distribution q(θ; λ) over which the expectation is taken.
Use the reparameterization trick to transform E_{q(θ;λ)}[ . . . ] into an expectation over a base distribution, E_{q(ε)}[ . . . ].
Reparameterization gradients
Draw S samples from the base distribution: ε_s ∼ q(ε).
Transform θ_s = f(ε_s, λ) and evaluate the Monte Carlo estimate of the ELBO:
L(λ) ≈ (1/S) Σ_{s=1}^{S} [log p(D, θ_s) − log q(θ_s; λ)].
The Monte Carlo estimate of the gradient then becomes:
∇_λ L(λ) ≈ (1/S) Σ_{s=1}^{S} ∇_λ [log p(D, θ_s) − log q(θ_s; λ)].
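A minimal sketch of these estimates, assuming a toy conjugate model (θ ∼ N(0, 1), d_n ∼ N(θ, 1)) and a Gaussian approximation q(θ; λ) = N(µ, σ²) with σ = exp(ρ); the model, data, and step sizes are illustrative only, and the derivatives are written out by hand instead of using autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(2.0, 1.0, size=20)            # observed data for the toy model
mu, rho = 0.0, 0.0                           # variational parameters lambda = (mu, rho)

def elbo_and_grads(mu, rho, S=64):
    sigma = np.exp(rho)
    eps = rng.normal(size=S)                 # eps_s ~ q(eps) = N(0, 1)
    theta = mu + sigma * eps                 # theta_s = f(eps_s, lambda)
    # log p(D, theta_s) and its derivative w.r.t. theta (constants dropped)
    log_joint = -0.5 * ((D[None, :] - theta[:, None]) ** 2).sum(1) - 0.5 * theta ** 2
    dlogp_dtheta = (D[None, :] - theta[:, None]).sum(1) - theta
    # log q(theta_s; lambda) after substituting theta = mu + sigma * eps
    log_q = -rho - 0.5 * eps ** 2
    elbo = np.mean(log_joint - log_q)
    grad_mu = np.mean(dlogp_dtheta)                       # d theta / d mu = 1
    grad_rho = np.mean(dlogp_dtheta * sigma * eps) + 1.0  # + d(-log q) / d rho
    return elbo, grad_mu, grad_rho

for _ in range(2000):                        # plain gradient ascent on the ELBO
    _, g_mu, g_rho = elbo_and_grads(mu, rho)
    mu, rho = mu + 0.01 * g_mu, rho + 0.01 * g_rho

print(mu, np.exp(rho))                       # approximate posterior mean / std
print(D.sum() / (len(D) + 1), np.sqrt(1.0 / (len(D) + 1)))   # exact posterior values
```

In this conjugate example the result can be checked against the exact posterior N(Σ_n d_n / (N+1), 1/(N+1)).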
Bayesian Decision Theory
Decision making under uncertainty characterized by the
posterior p(θ|D)
Make optimal decisions h, given p(θ|D) and utility u(h, θ)
defined over the parameters θ
An optimal decision maximises the posterior gain (expected utility):
G(h) = ∫_Θ u(h, θ) p(θ|D) dθ
Or, alternatively, minimizes the risk (expected loss):
R(h) = ∫_Θ ℓ(h, θ) p(θ|D) dθ
Bayesian Decision Theory - Example
When ℓ(h, θ) = (h − θ)², what is the optimal decision h?
How about when ℓ(h, θ) = |h − θ|?
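A small sketch of these two cases, assuming we only have samples θ_s from p(θ|D) (a stand-in gamma posterior below) and estimate the risk R(h) by Monte Carlo over a grid of candidate decisions: the minimiser is the posterior mean for the squared loss and the posterior median for the absolute loss.

```python
import numpy as np

theta = np.random.default_rng(1).gamma(2.0, 2.0, size=5_000)   # stand-in posterior samples
grid = np.linspace(theta.min(), theta.max(), 400)               # candidate decisions h

risk_sq = ((grid[:, None] - theta[None, :]) ** 2).mean(axis=1)  # R(h) for squared loss
risk_abs = np.abs(grid[:, None] - theta[None, :]).mean(axis=1)  # R(h) for absolute loss

print(grid[risk_sq.argmin()], theta.mean())       # squared loss  -> posterior mean
print(grid[risk_abs.argmin()], np.median(theta))  # absolute loss -> posterior median
```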
Bayesian Decision Theory
The optimal decision maximizes the gain:
G(h) = ∫_Θ u(h, θ) p(θ|D) dθ,   h*_p = arg max_{h∈H} G(h)
However, p(θ|D) is intractable.
Bayesian Decision Theory
Approximate p(θ|D) with q(θ; λ):
G_q(h) = ∫ u(h, θ) q(θ; λ) dθ,   h*_q = arg max_{h∈H} G_q(h)
Million dollar question: Is h*_q = h*_p?
Bayesian Decision Theory
Example: Nuclear power plant
Collect temperature data D from sensor.
Infer a posterior distribution p(θ|D) over θ.
Utility matrix:
         θ < Tcrit   θ ≥ Tcrit
on       10^10       10^0
off      10^5        10^10
Bayesian Decision Theory
Example: Nuclear power plant
In each of the cases what is the optimal decision?
G(h = ‘on’) = ∫_0^Tcrit 10^10 · p(θ|D) dθ + ∫_Tcrit^500 10^0 · p(θ|D) dθ
G(h = ‘off’) = ∫_0^Tcrit 10^5 · p(θ|D) dθ + ∫_Tcrit^500 10^10 · p(θ|D) dθ
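A hedged sketch of the comparison, assuming we can only sample θ from the posterior; the posterior, Tcrit, and the utility exponents (read as 10^10, 10^0, 10^5, 10^10 from the matrix above) are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(85.0, 10.0, size=100_000)   # stand-in samples from p(theta | D)
Tcrit = 100.0

U = {('on',  'safe'): 1e10, ('on',  'hot'): 1e0,
     ('off', 'safe'): 1e5,  ('off', 'hot'): 1e10}

safe = theta < Tcrit                           # indicator of theta < Tcrit
for h in ('on', 'off'):
    gain = U[(h, 'safe')] * safe.mean() + U[(h, 'hot')] * (~safe).mean()
    print(h, gain)                             # Monte Carlo estimate of G(h)
```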
Bayesian Decision Theory
Unimodal posteriors – VI
Bayesian Decision Theory
Multimodal posteriors – VI
Bayesian Decision Theory
Multimodal posteriors – Expectation Propagation
Bayesian Decision Theory
Multimodal posteriors – ideal fit
Bayesian Decision Theory
Two Types of Decisions
Decision over parameters
G(h) = ∫_Θ u(h, θ) p(θ|D) dθ,   h* = arg max_{h∈H} G(h)
Decision over model outputs
G(h|x) = ∫_Θ ∫_Y u(h, y) p(y|θ, x) dy p(θ|D) dθ,   h* = arg max_{h∈H} G(h|x)
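A minimal Monte Carlo sketch of the second case, G(h|x): average the utility over posterior samples θ_s and predictive samples y ∼ p(y|θ_s, x), then maximise over candidate decisions. The Gaussian posterior/predictive and the utility u(h, y) = exp(−|h − y|) below are illustrative assumptions, not the talk's model.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(1.0, 0.3, size=1_000)        # stand-in samples from p(theta | D)
x = 2.0                                         # a single test input

def gain(h, n_y=100):
    # y ~ p(y | theta, x), here an illustrative Gaussian predictive
    y = rng.normal(theta[:, None] * x, 0.5, size=(len(theta), n_y))
    return np.exp(-np.abs(h - y)).mean()        # MC estimate of G(h | x)

candidates = np.linspace(0.0, 4.0, 81)
print(candidates[np.argmax([gain(h) for h in candidates])])   # h* = arg max_h G(h | x)
```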
Lessons learnt
If you have access to the full posterior, you have nothing to worry about: the posterior is necessary and sufficient for making accurate decisions.
If you are approximating a multi-modal posterior with a
unimodal variational distribution, the decision making task
should be part of the inference.
Do not take anything for granted, especially when the inference is a black box.
Loss Calibrated Lower Bound
Lower bound the Gain
log G(h) = log ∫ p(θ|D) u(θ, h) dθ
= log ∫ (q(θ) / q(θ)) p(θ|D) u(θ, h) dθ
≥ ∫ q(θ) log [ p(θ|D) u(θ, h) / q(θ) ] dθ    (via Jensen’s inequality)
= −KL(q ‖ p) + ∫ q(θ) log u(θ, h) dθ
= ELBO(λ) − log p(D) + ∫ q(θ) log u(θ, h) dθ
LCVI: objective
New objective:
L(λ, h) = E_{q(θ;λ)}[log p(D, θ) − log q(θ; λ)] + E_{q(θ;λ)}[log u(h, θ)]
where the first term is ELBO(λ), the evidence lower bound, and the second is U(λ, h), a utility-dependent calibration term.
Optimization using EM:
M-step: h*_q = arg max_{h∈H} G_q(h)
E-step: λ* = arg max_λ L(λ, h*_q)
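A schematic of this alternating scheme, assuming two hypothetical callables supplied by model-specific code: gain(h, lam), a Monte Carlo estimate of G_q(h), and lcvi_grad(lam, h), a reparameterization gradient of L(λ, h); both names are placeholders, not an existing API.

```python
def lcvi_em(lam, decisions, gain, lcvi_grad, n_outer=100, n_inner=50, lr=1e-2):
    """Alternate between picking the best decision and improving q(theta; lam)."""
    for _ in range(n_outer):
        # M-step: decision that maximises the current expected utility (lam fixed)
        h = max(decisions, key=lambda d: gain(d, lam))
        # E-step: a few gradient ascent steps on L(lam, h) with h fixed
        for _ in range(n_inner):
            lam = lam + lr * lcvi_grad(lam, h)
        yield lam, h
```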
Discrete case: Diabetes
D = {(x, y)},  x – patient covariates,  Y = {Healthy, Moderate, Severe}
Utility matrix u(h, y):
           y = He   y = Mod   y = Sev
h = He      2.0      1.0       0.0
h = Mod     1.2      2.0       1.3
h = Sev     1.1      1.4       2.0
It is bad to say ’Healthy’ when the truth is ’Severe’.
Discrete case: model & Automatic VI
Likelihood (softmax): p(y = c_j | θ, x) = e^{x·θ_j} / Σ_k e^{x·θ_k}
Some priors: p(θ_Se), p(θ_Mod), p(θ_He)
Mean-field approximation family:
q(θ_Se, θ_Mod, θ_He) = N(θ_Se | µ_Se, σ²_Se) N(θ_Mod | µ_Mod, σ²_Mod) N(θ_He | µ_He, σ²_He)
Reparametrization:
θ_Se = µ_Se + σ_Se · ε_Se,   θ_Mod = µ_Mod + σ_Mod · ε_Mod,   θ_He = µ_He + σ_He · ε_He
Maximize L_VI(λ) := ELBO(λ) w.r.t. the approximation parameters λ = {µ_Se, ..., σ_He}
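A sketch of this mean-field reparameterization, assuming D-dimensional covariates x; the dimension and initial values are illustrative, and the variable names mirror the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5                                              # number of covariates (illustrative)
lam = {c: {'mu': np.zeros(D), 'rho': np.zeros(D)}  # lambda = {mu_c, rho_c}, sigma_c = exp(rho_c)
       for c in ('He', 'Mod', 'Se')}

def sample_theta(lam):
    # theta_c = mu_c + sigma_c * eps_c,  eps_c ~ N(0, I)
    return {c: p['mu'] + np.exp(p['rho']) * rng.normal(size=D) for c, p in lam.items()}

def class_probs(theta, x):
    # softmax likelihood p(y = c | theta, x)
    logits = np.array([x @ theta[c] for c in ('He', 'Mod', 'Se')])
    e = np.exp(logits - logits.max())              # subtract max for numerical stability
    return e / e.sum()

x = rng.normal(size=D)
print(class_probs(sample_theta(lam), x))
```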
Recap: LCVI objective in predictive setting
L(λ, h) = ELBO(λ) + ∫ q(θ) log [ ∫ u(y, h) p(y|θ, D) dy ] dθ
Discrete case: LCVI objective
Sum over possible outputs:
L(λ, h) = ELBO(λ) + E_{q(θ;λ)} [ log Σ_{y∈Y} u(h, y) p(y|θ, D) ]
Expectation using MC:
≈ ELBO(λ) + (1/M) Σ_{θ∼q(θ;λ)} log Σ_{y∈Y} u(h, y) p(y|θ, D)
Reparameterization:
≈ ELBO(λ) + (1/M) Σ_{ε∼q(ε)} log Σ_{y∈Y} u(h, y) p(y | f_θ(ε, λ), D)
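A sketch of the last estimate for one fixed decision h, assuming the softmax model and utility matrix above; the variational parameters and the covariate vector are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ('He', 'Mod', 'Sev')
U = {'He':  {'He': 2.0, 'Mod': 1.0, 'Sev': 0.0},
     'Mod': {'He': 1.2, 'Mod': 2.0, 'Sev': 1.3},
     'Sev': {'He': 1.1, 'Mod': 1.4, 'Sev': 2.0}}

D = 5
mu  = {c: 0.1 * rng.normal(size=D) for c in classes}   # stand-in variational means
rho = {c: np.zeros(D) for c in classes}                # log standard deviations
x = rng.normal(size=D)                                 # one patient's covariates

def utility_term(h, M=256):
    total = 0.0
    for _ in range(M):
        eps = {c: rng.normal(size=D) for c in classes}                  # eps ~ q(eps)
        theta = {c: mu[c] + np.exp(rho[c]) * eps[c] for c in classes}   # reparameterization
        logits = np.array([x @ theta[c] for c in classes])
        p = np.exp(logits - logits.max()); p /= p.sum()                 # p(y | theta, x)
        total += np.log(sum(U[h][y] * p_y for y, p_y in zip(classes, p)))
    return total / M                                    # MC estimate of the utility term

print({h: utility_term(h) for h in classes})            # the M-step picks the arg max over h
```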
Discrete case: LCVI Optimization
Utility matrix u(h, y):
           y = He   y = Mod   y = Sev
h = He      2.0      1.0       0.0
h = Mod     1.2      2.0       1.3
h = Sev     1.1      1.4       2.0
M-step: choose the h that maximizes L(λ, h) (λ fixed)
E-step: use ∇_λ L(λ, h) to update λ (h fixed)
For example, if h = He:
L(λ, He) ≈ ELBO + (1/M) Σ_{ε∼q_0} log [ 2.0 · e^{x·θ_He} / Σ_k e^{x·θ_k} + 1.0 · e^{x·θ_Mod} / Σ_k e^{x·θ_k} + 0.0 · e^{x·θ_Se} / Σ_k e^{x·θ_k} ]
VI vs. LCVI: Test Data Confusion Matrices
VI (rows: true label, columns: predicted label He / Mod / Sev):
He    0.86  0.11  0.03
Mod   0.00  1.00  0.00
Sev   0.00  0.00  1.00
LCVI (rows: true label, columns: predicted label He / Mod / Sev):
He    0.99  0.01  0.00
Mod   0.00  1.00  0.00
Sev   0.00  0.00  1.00
Continuous case with double reparametrization
MC approximation of both integrals:
L(λ, h) ≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} [ log (1/N) Σ_{y∼p(y|θ,x)} u(h, y) ]
Reparametrization:
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{y∼p(y|f_θ(ε,λ),x)} u(h, y) ]
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{δ∼p_0} u(h, g_y(δ, f_θ(ε, λ))) ]
Gradient-based optimization w.r.t. h and λ.
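A sketch of the doubly-reparameterized utility term, assuming a Gaussian predictive p(y|θ, x) = N(x·θ, s²), which reparameterizes as y = x·θ + s·δ with δ ∼ N(0, 1), and an illustrative positive utility u(h, y) = exp(−(h − y)²); all parameter values are stand-ins. Because everything is a deterministic transform of ε and δ, an autodiff framework could differentiate this term w.r.t. h, µ, and ρ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 5, 64, 32
mu, rho = 0.1 * rng.normal(size=D), np.zeros(D)   # variational parameters lambda
x, s = rng.normal(size=D), 0.5                    # covariates and predictive noise scale

def utility_term(h):
    eps = rng.normal(size=(M, D))                 # eps_m    ~ q0 = N(0, I)
    theta = mu + np.exp(rho) * eps                # theta_m  = f(eps_m, lambda)
    delta = rng.normal(size=(M, N))               # delta_mn ~ p0 = N(0, 1)
    y = (theta @ x)[:, None] + s * delta          # y_mn = g_y(delta_mn, theta_m)
    u = np.exp(-(h - y) ** 2)                     # illustrative positive utility
    return np.log(u.mean(axis=1)).mean()          # (1/M) sum_m log (1/N) sum_n u

print(utility_term(1.0))
```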
Posterior predictive distribution shift
Figure: posterior predictive densities under VI and LCVI for one test case (user no. 791, artist: Muse), with the observed data and the optimal decisions h_VI and h_LCVI marked.
LCVI (blue) vs. VI (red/green)
Figure: empirical risk of LCVI (blue) vs. VI (red/green) for the squared loss and for the tilted loss with q = 0.2, 0.5, 0.8.
Conclusion
Bad posterior approximations result in sub-optimal
decisions / predictions
Learn better approximations (better for the concrete decision task)
Learn how to make better decisions from bad posteriors
References
Adam D Cobb, Stephen J Roberts, and Yarin Gal.
Loss-Calibrated Approximate Inference in Bayesian Neural
Networks.
In Theory of Deep Learning workshop, ICML, 2018.
Tomasz Kuśmierczyk, Joseph Sakaya, and Arto Klami.
Variational Bayesian Decision-making for Continuous
Utilities.
In Thirty-third Conference on Neural Information
Processing Systems, NeurIPS, 2019.
Simon Lacoste-Julien, Ferenc Huszár, and Zoubin
Ghahramani.
Approximate inference for the loss-calibrated Bayesian.
In Proceedings of the 14th International Conference on
Artificial Intelligence and Statistics, AISTATS, 2011.
Continuous case (detailed): Monte Carlo
L = ELBO + E_{q_λ(θ)} [ log ∫ u(h, y) p(y|θ, x) dy ]
Approximate the expectation using MC:
≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} log ∫ u(h, y) p(y|θ, x) dy
Approximate the inner integral using MC:
≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} [ log (1/N) Σ_{y∼p(y|θ,x)} u(h, y) ]
Continuous case (detailed): double reparametrization
L ≈ ELBO + (1/M) Σ_{θ∼q_λ(θ)} [ log (1/N) Σ_{y∼p(y|θ,x)} u(h, y) ]
The Monte Carlo estimate used for ∇_λ U(λ, h) follows by reparameterizing first θ and then y:
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{y∼p(y|f_θ(ε,λ),x)} u(h, y) ]
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{δ∼p_0} u(h, g_y(δ, f_θ(ε, λ))) ]
Continuous case (detailed): double reparametrization
≈ ELBO + (1/M) Σ_{ε∼q_0} [ log (1/N) Σ_{δ∼p_0} u(h, g_y(δ, f_θ(ε, λ))) ]
p(y|·) needs to be reparameterizable:
until recently this was possible only for Gaussians, but see:
Michael Figurnov, Shakir Mohamed, Andriy Mnih. Implicit Reparameterization Gradients, arXiv, May 2018.
we need M × N samples
the computation graph is O(M × N)