Umar Jamil - https://github.com/hkproj/vae-from-scratch-notes
Variational Autoencoder
from scratch
Umar Jamil
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0):
https://creativecommons.org/licenses/by-nc/4.0/legalcode
Video: https://youtu.be/iwEzwTTalbg
Not for commercial use
Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and
Trends® in Machine Learning, 12(4), pp.307-392.
What is an Autoencoder?
X (Input) → Encoder → Z (Code) → Decoder → X′ (Reconstructed Input)

Example codes: [1.2, 3.65, …], [1.6, 6.00, …], [10.1, 9.0, …], [2.5, 7.0, …]
* The values are random and have no meaning.
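As a minimal sketch (not from the slides; layer sizes are illustrative), a plain autoencoder in PyTorch is just two MLPs trained to minimize the reconstruction error:

```python
import torch.nn as nn

# Illustrative autoencoder: the encoder compresses the input X into a small code Z,
# the decoder reconstructs X' from Z. Dimensions are arbitrary examples.
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # the code
        return self.decoder(z), z    # the reconstructed input and the code
```

Training would then minimize a reconstruction loss such as the mean squared error between X and X′.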
Analogy with file compression
zebra.jpg (Input) → ZIP → zebra.zip → UNZIP → zebra.jpg (Reconstructed Input)
What makes a good Autoencoder?
• The code should be as small as possible, that is, the dimension of the Z vector should be as small as possible.
• The reconstructed input should be as close as possible to the original input.
What’s the problem with Autoencoders?
The code learned by the model has no structure: the model can assign any vector to each input, without the numbers in the vector following any pattern. The model doesn't capture any semantic relationship between the data points.
X (Input) → Encoder → Z (Code) → Decoder → X′ (Reconstructed Input)
Introducing the Variational Autoencoder
The variational autoencoder, instead of learning a code, learns a “latent space”. The latent space represents the parameters of a (multivariate) distribution.
X (Input) → Encoder → Z (Latent Space) → Decoder → X′ (Reconstructed Input)
Sampling the latent space
X (Input) → Encoder → Latent Space → Z → Decoder → X′ (Reconstructed Input)
Just like generating a random number between 1 and 100 in Python samples from a uniform (pseudo)random distribution over 1–100, we can sample from the latent space to obtain a random vector, feed it to the decoder, and generate new data (see the short sketch below).

Sampled Z: [8.67, 12.8564, 0.44875, 874.22, …]
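A tiny illustrative sketch of this idea — the decoder and latent dimension here are placeholders, not the actual model:

```python
import random
import torch

random.randint(1, 100)                      # sampling from a uniform distribution over 1..100

latent_dim = 16                             # placeholder size
decoder = torch.nn.Linear(latent_dim, 784)  # placeholder, untrained decoder

z = torch.randn(1, latent_dim)              # sample a random vector from the latent space
x_new = decoder(z)                          # feed it to the decoder to generate new data
print(x_new.shape)                          # torch.Size([1, 784])
```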
Why is it called latent space?
[Figure: the graphical model relating Z, the latent (hidden) variable with parameters μ and σ², to X, the observable variable — e.g. vectors like [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …].]
Plato’s allegory of the cave
[Figure: Plato's allegory of the cave. The shadows we can observe correspond to the observable variable (vectors like [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]); what casts them corresponds to the latent (hidden) variable.]
Pep talk
1. The VAE is the most important component of Stable Diffusion models; concepts like the ELBO also come up in Stable Diffusion.
2. In 2023 you shouldn't be memorizing things without understanding them: ChatGPT can do that faster and better than any human being. You need to be human to compete with a machine; you can't compete with a machine by acting like one.
3. You should try to learn how things work not only out of curiosity, but because that's the true engine of innovation and creativity.
4. Math is fun.
Math Concepts
Expectation of a random variable: $E[x] = \int x\, f(x)\, dx$, where $f$ is the density of $x$.

Chain rule of probability: $P(x, y) = P(x \mid y)\, P(y)$

Bayes' Theorem: $P(x \mid y) = \dfrac{P(y \mid x)\, P(x)}{P(y)}$
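A quick numeric sanity check of the chain rule and Bayes' theorem on a made-up joint distribution (purely illustrative, not from the slides):

```python
# Hypothetical joint distribution over two binary variables x and y.
P_joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.40, (1, 1): 0.20,
}

P_y = {y: sum(p for (xx, yy), p in P_joint.items() if yy == y) for y in (0, 1)}
P_x = {x: sum(p for (xx, yy), p in P_joint.items() if xx == x) for x in (0, 1)}

# Chain rule: P(x, y) = P(x | y) P(y)
P_x_given_y = {(x, y): P_joint[(x, y)] / P_y[y] for (x, y) in P_joint}
assert abs(P_x_given_y[(1, 0)] * P_y[0] - P_joint[(1, 0)]) < 1e-12

# Bayes' theorem: P(x | y) = P(y | x) P(x) / P(y)
P_y_given_x = {(x, y): P_joint[(x, y)] / P_x[x] for (x, y) in P_joint}
bayes = P_y_given_x[(1, 0)] * P_x[1] / P_y[0]
print(P_x_given_y[(1, 0)], bayes)  # both 0.8
```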
Kullback-Leibler Divergence
$$
D_{KL}\!\left(P \,\|\, Q\right) = \int p(x) \log \frac{p(x)}{q(x)}\, dx
$$
Properties:
• Not symmetric.
• Always ≥ 0
• It is equal to 0 if and only if 𝑃 = 𝑄
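An illustrative check of these properties in PyTorch, comparing a Monte Carlo estimate of the KL divergence between two Gaussians with the closed-form value (the distribution parameters are arbitrary):

```python
import torch

# D_KL(P || Q) for two 1-D Gaussians, estimated by sampling and compared to the exact value.
p = torch.distributions.Normal(loc=0.0, scale=1.0)
q = torch.distributions.Normal(loc=1.0, scale=2.0)

x = p.sample((100_000,))                        # x ~ p
mc_kl = (p.log_prob(x) - q.log_prob(x)).mean()  # E_p[log p(x) - log q(x)]

exact_kl = torch.distributions.kl_divergence(p, q)
print(mc_kl.item(), exact_kl.item())            # the two values should be close

# Not symmetric: D_KL(P || Q) != D_KL(Q || P) in general.
print(torch.distributions.kl_divergence(q, p).item())
```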
Let’s define our model
We can define the likelihood of our data as the marginalization over the joint probability with respect to the latent variable
… or we can use the Chain rule of probability
$p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z})\, d\mathbf{z}$ — intractable, because we would need to evaluate this integral over all latent variables $\mathbf{z}$.

$p(\mathbf{x}) = \dfrac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}$ — but we don't have a ground truth $p(\mathbf{z} \mid \mathbf{x})$ … which is also what we're trying to find!
Intractable problem: a problem that can be solved in theory (e.g. given large but finite resources, especially time), but for which in practice any solution takes too many resources to be useful.
[Figure: the graphical model relating Z, the latent (hidden) variable with parameters μ and σ², to X, the observable variable.]
A chicken and egg problem
$$
p(\mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}
\qquad\qquad
p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})}
$$

In order to have a tractable $p(\mathbf{x})$ we need a tractable $p(\mathbf{z} \mid \mathbf{x})$.
In order to have a tractable $p(\mathbf{z} \mid \mathbf{x})$ we need a tractable $p(\mathbf{x})$.
Can we find a surrogate?
$$
p_\theta(\mathbf{z} \mid \mathbf{x}) \approx q_\varphi(\mathbf{z} \mid \mathbf{x})
$$

Left: our true posterior (which we can't evaluate due to its intractability), parametrized by $\theta$.
Right: an approximate posterior, parametrized by $\varphi$.
Let’s do some maths…
$$
\begin{aligned}
\log p_\theta(\mathbf{x}) &= \log p_\theta(\mathbf{x}) \\
&= \log p_\theta(\mathbf{x}) \int q_\varphi(\mathbf{z} \mid \mathbf{x})\, d\mathbf{z} && \text{(multiply by 1)} \\
&= \int \log p_\theta(\mathbf{x})\, q_\varphi(\mathbf{z} \mid \mathbf{x})\, d\mathbf{z} && \text{(bring inside the integral)} \\
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x})\right] && \text{(definition of expectation)} \\
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\right] && \text{(apply } p_\theta(\mathbf{x}) = \tfrac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\text{)} \\
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})\, q_\varphi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})\, q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] && \text{(multiply by 1)} \\
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] + E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{q_\varphi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\right] && \text{(split the expectation)} \\
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] + \underbrace{D_{KL}\!\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right)}_{\geq\, 0} && \text{(definition of KL divergence)}
\end{aligned}
$$
What can we infer?
$$
\log p_\theta(\mathbf{x}) = \underbrace{E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]}_{\text{ELBO}} + \underbrace{D_{KL}\!\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\right)}_{\geq\, 0}
$$

Analogy: Total Compensation = Base Salary + Bonus, with Bonus ≥ 0, therefore Total Compensation ≥ Base Salary.

We can for sure deduce the following:

$$
\log p_\theta(\mathbf{x}) \geq E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]
$$

ELBO = Evidence Lower Bound.
ELBO in detail
$$
\begin{aligned}
\log p_\theta(\mathbf{x}) \;\geq\; E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] && \text{(chain rule of probability)} \\
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] + E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p(\mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right] && \text{(split the expectation)} \\
&= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{KL}\!\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right) && \text{(definition of KL divergence)}
\end{aligned}
$$
Maximizing the ELBO means:
1. Maximizing the first term: maximizing the reconstruction likelihood of the decoder
2. Minimizing the second term: minimizing the distance between the learned distribution and the
prior belief we have over the latent variable.
Kingma, D.P. and Welling, M., 2019. An introduction to variational
autoencoders. Foundations and Trends® in Machine Learning, 12(4),
pp.307-392.
Analogy: Profit = Revenue − Costs — maximize profit by maximizing revenue and minimizing costs.
Maximizing the ELBO: A little introduction to estimators

• When we have a function we want to maximize, we usually take the gradient and adjust the weights of the model so that they move along the gradient direction.
• When we have a function we want to minimize, we usually take the gradient and adjust the weights of the model so that they move against the gradient direction.

Stochastic Gradient Descent
SGD is stochastic because we choose the minibatch randomly from our dataset and we then average the loss over the minibatch.

$$
L(\theta, \varphi, \mathbf{x}) = E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]
= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{KL}\!\left(q_\varphi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right)
$$
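A toy sketch of the two update rules described above (illustrative values, not part of the slides):

```python
import torch

# Move *along* the gradient to maximize, *against* the gradient to minimize.
w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1

f = (w ** 2).sum()    # some differentiable function of the weights
f.backward()          # compute the gradient df/dw

with torch.no_grad():
    w_ascent = w + lr * w.grad    # one step of gradient ascent (maximization)
    w_descent = w - lr * w.grad   # one step of gradient descent (minimization)
print(w_ascent, w_descent)
```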
How to maximize the ELBO?
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[Slide figure: the SCORE estimator applied to the ELBO.]
This estimator is unbiased: although any single sample may differ from the true expectation, on average it converges to it. However, being stochastic, it also has a variance, and that variance turns out to be too high for practical use. Plus, we can't run backpropagation through it!
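For reference, the generic score-function (log-derivative) identity that this kind of estimator is built on — stated here in its standard form for a generic integrand $f(\mathbf{z})$, not copied from the slide figure — is:

$$
\nabla_\varphi\, E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[f(\mathbf{z})\right]
= E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[f(\mathbf{z})\, \nabla_\varphi \log q_\varphi(\mathbf{z} \mid \mathbf{x})\right]
\approx \frac{1}{L} \sum_{l=1}^{L} f\!\left(\mathbf{z}^{(l)}\right) \nabla_\varphi \log q_\varphi\!\left(\mathbf{z}^{(l)} \mid \mathbf{x}\right),
\qquad \mathbf{z}^{(l)} \sim q_\varphi(\mathbf{z} \mid \mathbf{x})
$$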
We need a new estimator!
The reparameterization trick
[Figure: the graphical model relating Z, the latent (hidden) variable with parameters μ and σ², to X, the observable variable (example vectors [8.67, 12.8564, 0.44875, 874.22, …], [4.59, 13.2548, 1.14569, 148.25, …], [1.74, 32.3476, 5.18469, 358.14, …]); the randomness is moved into a separate stochastic node ε.]
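A minimal sketch of how the trick is usually implemented in PyTorch (variable names are illustrative): sampling ε stays outside the gradient path, while z becomes a deterministic, differentiable function of μ and log(σ²).

```python
import torch

# Instead of sampling z ~ N(mu, sigma^2) directly (a stochastic node that blocks
# backpropagation), sample eps ~ N(0, I) and compute z deterministically.
mu = torch.zeros(4, requires_grad=True)
log_var = torch.zeros(4, requires_grad=True)

eps = torch.randn_like(mu)               # stochastic node, needs no gradient
z = mu + torch.exp(0.5 * log_var) * eps  # z = mu + sigma * eps, differentiable w.r.t. mu, log_var

loss = (z ** 2).sum()                    # any downstream loss
loss.backward()                          # gradients now flow back to mu and log_var
print(mu.grad, log_var.grad)
```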
Running backpropagation on the reparametrized model
[Figure: in the reparametrized model, μ and σ² are deterministic nodes, so the gradient of the loss function can flow back through them during backpropagation, while the noise ε stays outside the gradient path.]
Kingma, D.P. and Welling, M., 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4), pp.307-392.
A new estimator!
$$
L(\theta, \varphi, \mathbf{x}) = E_{q_\varphi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]
= E_{p(\epsilon)}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]
\qquad \text{where } \epsilon \sim p(\epsilon),\ \mathbf{z} = g(\varphi, \mathbf{x}, \epsilon)
$$

$$
E_{p(\epsilon)}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]
\cong \tilde{L}(\theta, \varphi, \mathbf{x}) = \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}
\qquad \text{(single-sample estimate of the ELBO)}
$$

Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Is the new estimator unbiased?
Given the single-sample estimator and the original objective

$$
\tilde{L}(\theta, \varphi, \mathbf{x}) = \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})},
\qquad
L(\theta, \varphi, \mathbf{x}) = E_{p(\epsilon)}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right],
$$

we have

$$
E_{p(\epsilon)}\!\left[\nabla_{\theta, \varphi}\, \tilde{L}(\theta, \varphi, \mathbf{x})\right]
= E_{p(\epsilon)}\!\left[\nabla_{\theta, \varphi} \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]
= \nabla_{\theta, \varphi}\!\left( E_{p(\epsilon)}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z} \mid \mathbf{x})}\right]\right)
= \nabla_{\theta, \varphi}\, L(\theta, \varphi, \mathbf{x}),
$$

so the gradient of the single-sample estimator is, in expectation, the gradient of the ELBO: the estimator is unbiased.
Example network
X (Input) → Encoder → μ, log(σ²) (Latent Space) → Z → Decoder → X′ (Reconstructed Input), where ε ∼ N(0, I)
We prefer learning log(σ²) because it can be negative, so the model isn't forced to produce only positive values for the variance.
ε is sampled with the torch.randn_like function, which returns standard-normal noise with the same shape as the tensor passed to it.
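A minimal sketch of such a network in PyTorch; layer sizes are illustrative and assume flattened inputs (e.g. 28×28 images → 784):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_log_var = nn.Linear(hidden_dim, latent_dim)   # log(sigma^2), can be negative
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        eps = torch.randn_like(mu)                  # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * eps     # reparameterization trick
        return self.decoder(z), mu, log_var
```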
Show me the loss already!
X (Input) → Encoder → Latent Space → Z → Decoder → X′ (Reconstructed Input), where ε ∼ N(0, I)
MLP = Multi Layer Perceptron
Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
How to derive the loss function?
https://stats.stackexchange.com/questions/318748/deriving-the-kl-divergence-loss-for-vaes
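For reference, a common PyTorch formulation of the resulting loss — the reconstruction term plus the closed-form KL between the learned diagonal Gaussian and a standard-normal prior, as derived in the link above. The binary cross-entropy term assumes inputs normalized to [0, 1]:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    # Reconstruction term: -E_q[log p(x|z)], here with Bernoulli outputs (BCE).
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian q.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Minimizing (recon + kl) is equivalent to maximizing the ELBO.
    return recon + kl
```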
Thanks for watching!
Don’t forget to subscribe for
more amazing content on AI
and Machine Learning!