Robustness Metrics for ML Models based on Deep Learning Methods
Davide Posillipo

Alkemy - Data Science Milan Meetup - AI Guild

21/07/2022, Milan
Who am I?
Hi! My name is Davide Posillipo.

I studied Statistics. 

For 8+ years I’ve been working on
projects that require using data to
answer questions. 

Freelancer Data Scientist and Teacher.

Currently building the Alkemy’s Deep
Learning & Big Data Department. 

Happy to connect and work together on
new ideas and projects. Get in touch!
This event, tonight

Let's get started
MNIST: a simple case
• MNIST: solved problem 

• LeNet: Deep Convolutional Neural Network, 99% accuracy
Predicted digit
What happens if we pass a chicken to LeNet?
The chicken comes from a different distribution. What should the model do?
What happens if we pass a chicken to LeNet?
We expect this output:
?
What happens if we pass a chicken to LeNet?
… but we obtain this result:
6, with probability 0.9
A less obvious example: Fashion MNIST
63.4% of examples are classified with a probability greater than 0.99

74.3% of examples are classified with a probability greater than 0.95

88.9% of examples are classified with a probability greater than 0.75

LeNet (trained on MNIST) produces "confident" predictions for Fashion MNIST
Can we trust deep learning predictions?
• There are scenarios where it’s crucial to avoid meaningless
predictions (e.g. healthcare, high precision manufacturing,
autonomous driving…): safety in AI

• Making no prediction at all is better than random guessing

• We need robustness metrics
Robustness metrics: what should they look like?
A robustness metric should tell us whether a prediction is reliable or just some sort of random guessing. It should provide a measure of confidence in the prediction.

• Computable without labels

• Computable in "real time"

• Easy to plug into a working ML pipeline

• Low false positive rate, but also low false negative rate

• High effectiveness in detecting anomalous points (low false negative rate)

• Low rate of discarded "good predictions" (low false positive rate)
A few words about Robustness
Robustness: the quality of an entity that doesn't break too easily when stressed in some way.

Canonical approaches in Statistics involve some modification of the predictive model/estimator ("If you get into a stressful situation, handle it better"), often with a loss of performance as a side effect.

Tonight we focus on a different approach: we don't modify our models, we protect them from threats ("Don't get into stressful situations"). We look for the chickens and keep them away from our model.

Fight-or-flight(-or-freeze) response: we choose to fly!
Lack of robustness is a risk
There is increasing attention to the risks associated with AI applications.

The EU has proposed an AI regulation that could become applicable to industry players starting from 2024 (https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai)

The Commission is proposing the first-ever legal framework on AI, which addresses the risks of AI and positions Europe to play a leading role globally.

At some point, robustness will be a mandatory requirement for ML models in production, and the lack of it will be considered a risk.
Robustness metrics: two approaches
• GAN-based approach

• Decide if a prediction is worth the risk, checking the "stability" of the classifier on the new input data

• VAE-based approach

• Decide if a prediction is worth the risk, checking the true "origin" of the new input data
First approach:
WGAN + Inverter
GAN in a nutshell
• Generative Adversarial
Networks

• Composed of two networks:

• Generator: generates
inputs from random noise

• Discriminator: decides if
the inputs from the
Generator are authentic
or artificial

• After many iterations, the
model learns to create
inputs really close to the
authentic ones
A fancier GAN: WGAN
• Wasserstein Generative Adversarial Networks

• It uses the Wasserstein distance as the GAN loss function

• The Wasserstein distance is a measure of the distance between two probability distributions

• For more info: https://lilianweng.github.io/posts/2017-08-20-gan/
Adding the Inverter
• Train a Generator (from random noise to the input space)

• Train a Discriminator (it tries to understand whether a point is real or generated by the Generator)

• The Generator learns how to fool the Discriminator

• Train an Inverter (from the input space to the latent representation)

• The Inverter learns how the Generator maps from the random noise to the input space
Approach 1: overview and loss functions
(The deck shows the overall architecture, the WGAN loss function, and the Inverter loss function as figures.)
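The formulas themselves do not survive the export. As a hedged reconstruction: assuming the standard WGAN objective and the reconstruction-based inverter loss of Zhao et al., "Generating Natural Adversarial Examples" (2018), whose WGAN-plus-inverter setup this approach appears to follow, they read:

```latex
% WGAN objective: the critic D (constrained to be 1-Lipschitz) estimates
% the Wasserstein distance, which the generator G then minimizes.
\min_G \;\max_{\|D\|_L \le 1}\;
  \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] \;-\;
  \mathbb{E}_{z \sim p_z}[D(G(z))]

% Inverter loss: reconstruct inputs through G(I(x)) and latent codes
% through I(G(z)); \lambda weights the latent-space term.
\mathcal{L}_I \;=\;
  \mathbb{E}_{x \sim p_{\text{data}}}\,\|G(I(x)) - x\|^2 \;+\;
  \lambda\, \mathbb{E}_{z \sim p_z}\,\|I(G(z)) - z\|^2
```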
WGAN + Inverter: robustness metrics
• Given a new data point x,
fi
nd the closest point to x in the latent space
that is, once translated back to the original space, able to confound the
classi
fi
er f

• Robustness metrics: the distance ( delta_z = z* - I(x) ) between these two
points, in the latent representation
• If delta_z is “small”, the classi
fi
er is not con
fi
dent about its prediction
(the classi
fi
er “changes its mind” too easily)

• If delta_z is “big”, the classi
fi
er is con
fi
dent about the prediction (the
classi
fi
er “knows what it wants”)
Rule: if delta_z < threshold then do not predict
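A minimal sketch of how this metric could be computed, assuming trained `generator`, `inverter`, and `classifier` PyTorch modules; the names and the simple expanding-sphere search are illustrative assumptions, not necessarily the deck's exact search procedure:

```python
import torch

def robustness_delta_z(x, classifier, generator, inverter,
                       n_samples=64, radii=None):
    """Estimate delta_z for one input x: the distance from z = I(x) to the
    closest latent point z* whose decoded image G(z*) flips the classifier's
    prediction. A small delta_z means a fragile, low-confidence prediction."""
    if radii is None:
        radii = torch.linspace(0.01, 2.0, steps=50)   # smallest radius first
    with torch.no_grad():
        z = inverter(x)                               # latent code of the input
        y_hat = classifier(x).argmax(dim=-1)          # original prediction
        for r in radii:
            # candidate latent points on a sphere of radius r around z
            noise = torch.randn(n_samples, z.shape[-1])
            noise = r * noise / noise.norm(dim=-1, keepdim=True)
            candidates = z + noise
            preds = classifier(generator(candidates)).argmax(dim=-1)
            flipped = preds != y_hat
            if flipped.any():                         # first radius that confounds f
                return (candidates[flipped] - z).norm(dim=-1).min().item()
    return float("inf")                               # nothing flips: very robust

# Rule from the slide: abstain when the metric falls below the threshold.
# if robustness_delta_z(x, f, G, I) < THRESHOLD: do not predict
```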
WGAN + Inverter: Architecture
• PyTorch implementation

• Generator: 1 Fully Connected + 4 Transposed Conv. Layers (strides = 2), ReLU activation func.

• Discriminator ("Critic"): 4 Conv. Layers (strides = 2) + 1 Fully Connected, ReLU activation func.

• Inverter: 4 Conv. Layers (strides = 2) + 2 Fully Connected, ReLU activation func. (sketched below)

• Adam optimizers for the three nets
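As one concrete example, a sketch of the Inverter under these specs, assuming 28x28 grayscale MNIST inputs; the channel widths and latent size are illustrative assumptions:

```python
import torch.nn as nn

LATENT_DIM = 64  # assumed latent size

# Inverter: 4 conv layers (stride 2) + 2 fully connected, ReLU activations
inverter = nn.Sequential(
    nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),    # 28 -> 14
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 14 -> 7
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 7 -> 3
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(), # 3 -> 2
    nn.Flatten(),
    nn.Linear(256 * 2 * 2, 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM),
)
```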
WGAN + Inverter: Training and distribution of delta_z
Distribution of delta_z computed on 500 test set input points
WGAN + Inverter: Results
Distribution of delta_z computed on 500 test set input points
Delta_z for the chicken: 1.8e-06 (< 5.3e-02)
5th percentile: 5.3e-02
Using the 5th percentile of the test delta_z as threshold, we could get a 5% false positive rate and discard the meaningless prediction for the chicken picture.
WGAN + Inverter: Results with Fashion MNIST
Distribution of delta_z computed on 500 test set input points
5th percentile: 5.3e-02
Using the 5th percentile of the test delta_z as threshold, we could get a 5% false positive rate but lose 34.4% of good predictions (false negative rate)!
We need to reduce the false negative rate!
Second approach: Variational AutoEncoder
AutoEncoder in a nutshell
• VAE: Variational AutoEncoder

• Composed of two neural networks:

• Encoder: takes the inputs and "compresses" them into a deep latent representation

• Decoder: takes the deep representation and recreates the original input as faithfully as it can

• After many iterations, the model learns the best "latent representation" (a kind of compression) of the training data
From AE to Variational AutoEncoder
Instead of mapping the input to a fixed vector, we map it to a distribution.

In this way we learn the likelihood p(x|z), in this context called the probabilistic decoder, which we can use to generate from the latent space.

The encoding works likewise: we learn the posterior p(z|x), in this context called the probabilistic encoder.

For more info: https://lilianweng.github.io/posts/2018-08-12-vae/
Variational Autoencoder: robustness metrics
The loss function of a variational autoencoder is an indirect measure of the probability that an observation comes from the same distribution that generated the training set.

An "unlikely" input will have trouble in the encoding-decoding process = high loss value.

Robustness metric: the loss function

• "Big" loss -> Not a robust prediction: the new input data doesn't come from the training set's underlying distribution

• "Small" loss -> Robust prediction: the new input data comes from the training set's underlying distribution

Notice that with this approach we don't need the classifier f for the robustness metric computation!
Rule: if VAE_loss > threshold then do not predict
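The deck does not spell the loss out, but under standard assumptions the quantity being thresholded is the usual VAE training objective (the negative ELBO): a reconstruction term plus a KL term pulling the approximate posterior toward the prior:

```latex
\mathcal{L}_{\text{VAE}}(x)
  = \underbrace{\mathbb{E}_{q(z \mid x)}\bigl[-\log p(x \mid z)\bigr]}_{\text{reconstruction error}}
  + \underbrace{D_{\mathrm{KL}}\bigl(q(z \mid x)\,\|\,p(z)\bigr)}_{\text{distance from the prior}}
```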
Variational Autoencoder: Architecture
• PyTorch implementation

• Encoder: 2-layer fully connected neural network

• Decoder: 2-layer fully connected neural network

• Adam optimizer
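A minimal PyTorch sketch consistent with this description; the layer sizes, latent dimension, and Bernoulli reconstruction loss are illustrative assumptions, not the deck's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """2-layer fully connected encoder and decoder, per the slide;
    the sizes below (784 -> 400 -> 20) are illustrative assumptions."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)        # posterior mean head
        self.logvar = nn.Linear(h_dim, z_dim)    # posterior log-variance head
        self.dec1 = nn.Linear(z_dim, h_dim)
        self.dec2 = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        x_rec = torch.sigmoid(self.dec2(F.relu(self.dec1(z))))
        return x_rec, mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    """Negative ELBO: reconstruction error + KL to the standard normal prior."""
    rec = F.binary_cross_entropy(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# Training uses Adam, as on the slide:
# opt = torch.optim.Adam(VAE().parameters(), lr=1e-3)
```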
Variational Autoencoder: Training and VAE loss distribution
Distribution of VAE losses computed on 10,000 test set input points
Variational Autoencoder: Results
Distribution of VAE losses computed on 10,000 test set input points
VAE loss for the chicken: 309.4 (> 212)
Using the maximum test loss as threshold, we could get a 0% false positive rate and discard the meaningless prediction for the chicken picture.
Variational Autoencoder: Results with Fashion MNIST
Distribution of VAE losses computed on 10,000 test set input points
Using the maximum test loss as threshold, we could get a 0% false positive rate and lose 3.55% of good predictions (false negative rate)

Using the 95th percentile of the test loss as threshold, we could get a 5% false positive rate and lose only 0.25% of good predictions (false negative rate)
Variational Autoencoder: robustness metrics in production
• Train your classifier

• Train a VAE on your training set

• Get the distribution of the VAE losses on your test set

• Define a more or less "conservative" threshold

• Implement a "conditional classifier" (sketched below):
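The original slide ends this step with a code snippet; a minimal sketch of what such a gate could look like, reusing the hypothetical `VAE` and `vae_loss` from the architecture sketch above (all names are illustrative):

```python
import numpy as np

def pick_threshold(test_losses, percentile=95):
    """Derive the abstention threshold from the VAE losses on the test set."""
    return np.percentile(test_losses, percentile)

def conditional_predict(x, classifier, vae, threshold):
    """Only predict when the VAE loss says x looks in-distribution."""
    x_rec, mu, logvar = vae(x)
    loss = vae_loss(x, x_rec, mu, logvar).item()  # vae_loss from the sketch above
    if loss > threshold:
        return None                               # a "chicken": abstain
    return classifier(x).argmax(dim=-1)
```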
Conclusions
Wrapping up: pros and cons of this approach
Pros
• No need to modify your predictive models

• The same "monitoring" system can be used for different ML models (for a given dataset)

• Applicable to any kind of data (tabular, images, …)

• "Easy" to explain

• Easy to plug into existing pipelines

Cons
• Arbitrary thresholds must be set by the data scientist

• It introduces a further model that needs to be maintained
Thank you for your attention!
https://www.linkedin.com/in/davide-posillipo/
