Production machine learning:
Managing models, workflows &
risk at scale
Alex Housley
Founder & CEO, Seldon
@ahousley @seldon_io
#CogX2021
The unbundling of ML platforms
1. Tech giants build DIY ML platforms from scratch to gain competitive advantage, e.g. Michelangelo, FBLearner, TFX.
2. Specialised tools emerge to solve MLOps challenges, e.g. version control, feature stores, CI/CD, monitoring.
3. Cloud-native is driving hybrid/multi-cloud adoption: more control, reduced vendor lock-in.
Hidden Technical Debt in Machine Learning Systems. D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison (Google), NIPS 2015.
[Figure: only a small fraction of a real-world ML system is the ML code itself; it sits surrounded by much larger supporting components: configuration, data collection, data verification, feature extraction, analysis tools, machine resource management, process management tools, serving infrastructure and monitoring.]
AI adoption is accelerating in the enterprise
AI Adoption in the Enterprise 2021 – O’Reilly (oreilly.com)
5,154 global respondents
How Tech Stacks Up in B2B - Andreessen Horowitz (a16z.com)
Survey of technology leaders at Fortune 500, Global 2000, and
SaaS 50 companies.
Data scientists and DevOps must collaborate to
productionise models
Siloed teams in offices and working remotely. Deploying a model takes anywhere from 1 week to 6 months.
ML Engineer
$141k average base salary
Production ML has a larger surface area than data
prep and training
The nine stages of the machine learning workflow (Amershi, IEEE 2019)
[Diagram: the nine workflow stages span "Day 0" through "Day 2", involving data scientists, data engineers, ML engineers, DevOps and product/management, with more metadata required at each stage.]
Scaling MLOps across the organisation
Team
– 10 users
– < 50 models
– One training system
– Minimal or team-level restrictions
Business Unit
– 50 users
– < 200 models
– 3-4 training systems, multiple frameworks
– Large DevOps team
– Department-level constraints
– Role-based access
Organisation
– > 100 users
– Hundreds or thousands of models
– Multiple platforms and clouds
– Org blueprints
– Compliance
– Higher-level principles and AI ethics
Production ML challenges
Orchestration
Monitoring
Explainability
Governance
…at Scale
Orchestration at Scale
Case study: Capital One 'Model as a Service'
Capital One created a 'Model as a Service' platform powered by Seldon.
Objectives
– Improve the speed-to-market for ML models
– Lower the barrier to entry for developers to get their models into production
– Implement operational efficiencies and economies of scale
"With our MaaS platform running on Seldon, we've gone from it taking months to minutes to deploy or update models."
Steve Evangelista, Director of Project Management, Capital One
Results
– MVP in less than 90 days
– Deployment process now takes minutes instead of months
– Versioning, vulnerability scanning, containerizing, deployment, testing and promotion to production are all taken care of
– Use cases across the business, including fraud, marketing, finance and customer service
– Rigorous compliance through model management and monitoring
– Developers can work in any language/framework
Why not just wrap my models with Flask?
Flask works well in R&D until you need:
– Multiple optimized model servers
– Metrics and tracing
– Lineage and auditability
– Ingress configuration
– Complex inference graphs (ensembles, A/B tests, MABs, etc.)
– A scalable solution that is battle-tested by a wide community of open-source and commercial users
Model serving: to achieve scale, you need to abstract
complex ML concepts into standardised infra components
[Diagram: standardised serving components include an adversarial detector, answering "is the model being attacked?"]
Leverage pre-packaged servers for framework-agnostic model serving
– Leverage out-of-the-box optimized servers that wrap your model artifacts
– Enable data scientists to deploy models from their preferred framework
– Model servers are tuned to each framework for optimal performance
– Extend existing pre-packaged servers with simple SDKs
[Diagram: a reusable server image from the image registry is combined at deploy time with model artifacts pulled from a central repository (S3, ModelDB, ...).]
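As a sketch of the "simple SDKs" point: Seldon's Python language wrapper expects a class exposing a predict method, which the reusable server image then serves (the class name and artifact path here are illustrative):

import joblib

class MyModel:
    """Served by Seldon's Python wrapper; name and artifact path are illustrative."""

    def __init__(self):
        self.model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        # Called per inference request with a numpy array X.
        return self.model.predict(X)

When the artifact is already in a standard format, the pre-packaged servers (e.g. the sklearn or XGBoost servers) remove even this step.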
The anatomy of e2e enterprise MLOps architectures
Canary Tests, Shadows & Rolling Updates
Why does this matter?
Robust and safer testing in production with zero downtime to minimise risk.

[Diagram: rollout lifecycle: create a model serving 100% of traffic; introduce a canary on a 90%/10% traffic split; then either promote the canary to 100% or revert and remove it. Resource requests/limits and autoscaling specs are declared per deployment.]
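The traffic-splitting idea behind a canary can be sketched in a few lines of Python. This illustrates the concept only; in Seldon Core the split is expressed declaratively as traffic weights on the deployment spec, which is what enables zero-downtime promote/revert:

import random

def route(payload, stable_model, canary_model, canary_weight=0.10):
    # Send roughly 10% of live traffic to the canary and compare its
    # metrics against the stable model before promoting or reverting.
    chosen = canary_model if random.random() < canary_weight else stable_model
    return chosen.predict(payload)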
Tempo: open-source MLOps SDK for data scientists
https://github.com/SeldonIO/tempo
Powerful Inference Orchestration Logic / Custom Python Components
● Create custom business logic for models.
● Use any Python expressions/libraries to orchestrate component requests.

Data Science Friendly
● Allow any data science library to be used easily, e.g. custom models, Alibi explainers, Alibi Detect outlier models, multi-armed bandits.
● Local testing before hand-off to production.
● Python-first, with output to YAML.

Pluggable Runtimes
● Extendable runtimes: Seldon Deploy, Seldon Core, Docker with Seldon containers, KFServing.
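A sketch in the spirit of the Tempo README: two framework models combined with custom Python routing logic. The import paths, signatures and model URIs below are assumptions from memory of the project, so check the repository for the current API:

from tempo import Model, ModelFramework, pipeline  # assumed top-level exports

sklearn_model = Model(name="iris-sklearn", platform=ModelFramework.SKLearn,
                      uri="gs://my-bucket/sklearn/iris")   # hypothetical URI
xgboost_model = Model(name="iris-xgboost", platform=ModelFramework.XGBoost,
                      uri="gs://my-bucket/xgboost/iris")   # hypothetical URI

@pipeline(name="iris-router", models=[sklearn_model, xgboost_model])
def classifier(payload):
    # Custom business logic: fall back to the xgboost model when the
    # sklearn model predicts the first class.
    res = sklearn_model(payload)
    return res if res[0] != 1 else xgboost_model(payload)

The same pipeline can then be tested locally and deployed unchanged to any of the pluggable runtimes listed above.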
Monitoring at Scale
Case Study: Microsoft & Philips Clinical Drift
Monitoring During Covid-19
• ICUs having to make difficult decisions to
optimize patient health outcomes.
• Built models to predict outcomes such as
patient mortality, length of ventilation,
length of stay.
• Challenges: catching changes to model
performance; time-intensive and
computationally expensive training pipeline.
• Solution needed to be scalable, repeatable
and secure: Azure Databricks, Azure
DevOps and Alibi Detect.
Making your organisation proactive rather than
reactive
Four monitoring pillars: service metrics; statistical performance; drift and outliers; explainability.
Service Metrics
– Microservice metrics such as requests
per second, latency, CPU usage,
memory usage, etc
– Performance monitoring leveraging
Prometheus and ELK
– Seldon Deploy configures metrics with Prometheus
[Diagram: Model A runs as a production microservice behind an API (REST, gRPC, Kafka), emitting request logs, traces, and model metrics derived from the model weights.]
Why does this matter?
Manage compute costs
and response times
associated with SLAs.
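Beyond the built-in request metrics, custom metrics can be emitted from the model itself. A sketch extending the wrapper class from earlier with Seldon's metrics hook (the metric keys are hypothetical):

class MyModel:
    def predict(self, X, features_names=None):
        return self.model.predict(X)

    def metrics(self):
        # Returned per request and scraped by Prometheus alongside
        # the built-in service metrics.
        return [
            {"type": "COUNTER", "key": "my_requests_total", "value": 1},
            {"type": "GAUGE", "key": "my_input_size", "value": 42},
        ]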
Statistical Monitoring
– Monitor the impact on business KPIs
– Advanced metrics exposed directly by model
servers
– Metrics can be calculated using “feedback”
– Custom metrics can be added by extending
metrics servers
[Diagram: Model A sends inference data, which is stored; a metrics server reads it, receives feedback containing the correct label, and exposes statistical metrics. Request routing is via CloudEvents on Knative infrastructure.]
Why does this matter?
Understand and monitor
the impact on your
business KPIs.
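To make the "feedback" mechanism concrete, a sketch of posting a ground-truth label back to a deployment via the Seldon v1.0 feedback endpoint. The host, namespace and deployment names are placeholders, and the payload shape is from memory of the Seldon protocol, so verify against the docs:

import requests

feedback = {
    "request": {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}},  # original inference input
    "truth":   {"data": {"ndarray": [0]}},                      # correct label, once known
}
url = "http://<ingress-host>/seldon/<namespace>/<deployment>/api/v1.0/feedback"
requests.post(url, json=feedback)  # metrics servers derive accuracy-style metrics from this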
Outlier Monitoring
– Detecting anomalies in data instances
and flagging/alerting
– Identifying potential metadata that could
help diagnose outliers
– Do outliers indicate there’s an issue with
the model or data?
– Outlier detection runs as a separate component and can receive input and prediction data from the model
[Diagram: Model A sends input and inference data to a separate outlier detector server via CloudEvents/Knative request routing; the detector stores outlier data, making the request plus outlier data available for inspection.]
Why does this matter?
Outliers are more likely to have a negative impact if acted upon automatically.
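As a sketch of what such a detector component runs internally, using Alibi Detect's isolation forest (X_train and X_live stand for a reference batch and live traffic):

import numpy as np
from alibi_detect.od import IForest

od = IForest(n_estimators=100)
od.fit(X_train)                                  # fit on reference data
od.infer_threshold(X_train, threshold_perc=95)   # flag the most extreme ~5%

preds = od.predict(X_live, return_instance_score=True)
outlier_idx = np.where(preds["data"]["is_outlier"] == 1)[0]  # instances to alert on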
Drift
– Over time, live data in production
environments differs from the process
that generated the training data.
– Model performance during
deployment no longer matches that
observed on held out training data.
– The goal is to identify drift in the data distribution and in the relationships between inputs and outputs.
Why does this matter?
Model performance has a
direct correlation with
business value or safety in
some use cases.
Challenges of online drift detection
– In production, data points arrive in sequence, and we need to detect drift as soon as possible
– How do we decide whether fluctuations are due to drift or to natural variation?
– Statistical hypothesis testing
– Windowing strategies
Why does this matter?
Detecting drift at the right time enables you to improve performance and reduce costs.
[Diagram: Model A sends input and inference data to a drift detector server via CloudEvents/Knative request routing; the detector emits drift metrics.]
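A sketch of batch drift detection with Alibi Detect's Kolmogorov-Smirnov detector, where X_ref is a sample of training data and X_window a recent window of production traffic:

from alibi_detect.cd import KSDrift

cd = KSDrift(X_ref, p_val=0.05)  # feature-wise K-S tests with multiple-testing correction
preds = cd.predict(X_window, return_p_val=True)

if preds["data"]["is_drift"]:
    # e.g. raise an alert, trigger retraining, or route traffic conservatively
    print("Drift detected, p-values:", preds["data"]["p_val"])

For the online setting discussed above, Alibi Detect also ships online detectors built around sliding windows and calibrated false-alarm rates.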
Explainability at Scale
Case Study: Explainability for Insurance
Context
Explainability is a critical requirement for all production models.
Operations staff require models to be interpretable to justify algorithmic decisions.
Before Seldon Deploy
Advanced algorithms cannot be deployed to production due to a lack of interpretability.
After Seldon Deploy
Improvements to claims automation and payments processing can be realised as these
models can now be made interpretable.
ML models are a black box
[Diagram: inputs such as credit applicant data or a medical image go into the model; decisions such as a lending decision (yes/no) or a medical diagnosis come out, with no visibility into why.]
Why explain machine learning models?
– Build trust in machine learning outputs
– Increase transparency
– Improve the customer experience
– Check for bias
– Gain insights for data scientists to
understand how models are working
– Avoid damage to business reputation
– Meet regulatory requirements
Why does this matter?
Lack of explainability is one of the biggest blockers to production ML and one of the biggest sources of risk in organisations.
Explaining model predictions
Types of explanations
– By scope (local vs global)
– By model type (black-box vs white-box)
– By task (classification, regression, structured prediction)
– By data type (tabular, images, text…)
– By insight (feature attributions, counterfactuals, influential training instances…)
Image credit: Scott Lundberg (https://github.com/slundberg/shap)
Image credit: Barshan et al., RelatIF: Identifying Explanatory
Training Examples via Relative Influence (2020)
How can we explain the black-box?
Anchors
Feature Attribution: what input subsets are necessary for a prediction to hold? [1]
[1] Ribeiro et al., Anchors: High-Precision Model-Agnostic Explanations (2018)
Image source: Alibi Explain repository home page
How can we explain the black-box?
Counterfactuals
How can you (minimally) change input to obtain a desired prediction? [2, 3]
[2] Wachter et al., Counterfactual Explanations without Opening the Black Box (2017)
[3] Van Looveren A., Klaise J. Interpretable Counterfactual Explanations Guided by Prototypes (2018)
a) Images of digits minimally altered to change a classifier's prediction.
b) A person's attributes minimally altered to change a classifier's prediction (low income to high income).
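A rough sketch of the prototype-guided counterfactual method [3] via Alibi; the constructor takes more arguments in practice and exact signatures may differ, so treat this as illustrative:

from alibi.explainers import CounterfactualProto

cf = CounterfactualProto(predict_fn, shape=(1,) + X_train.shape[1:])
cf.fit(X_train)                        # learn class prototypes that guide the search
explanation = cf.explain(X_test[0:1])  # minimally perturb one instance
print(explanation.cf["X"])             # the counterfactual that flips the prediction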
Which explainer should I use?
Familiar API in the style of scikit-learn
from alibi.explainers import AnchorTabular
explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X_train)
explanation = explainer.explain(x)
>>> explanation.meta
{'name': 'AnchorTabular', 'type': ['blackbox'], 'explanations': ['local'],
'params': {'seed': None, 'disc_perc': (25, 50, 75), 'threshold': 0.95}}
>>> explanation.data
{'anchor': ['petal width (cm) > 1.80', 'sepal width (cm) <= 2.80'],
'precision': 0.98, 'coverage': 0.32}
Deploying Alibi Explanations
Integration with Seldon Core, Seldon Deploy and KFServing
Explainability Monitoring
– Explanations are useful when paired with a monitoring system, for example to explain why an outlier may have occurred.
– View model explanations on UI
– Trigger explanations for specific
requests on-demand
– Close integration with auditing
Governance at Scale
Critical infrastructure increasingly depends on ML systems
The impact of a bad solution can be worse than no solution at all
Cybersecurity Attacks
Misuse of personal data
Software Outages
Algorithmic Bias
Range of varying strategies at a national level
Mapping Global AI Ethics
Harvard. 2020. Principled Artificial Intelligence. [ONLINE] Available at:
https://cyber.harvard.edu/publication/2020/principled-ai. [Accessed 21 October 2020].
EU AI Regulation
What does it mean?
– Emphasis on “trustworthy AI”
– Categorising risk. Regulating “high risk” AI
(e.g. autonomous driving) and prohibiting
uses (e.g. mass social scoring).
– Currently focuses more on end-to-end systems, which would apply to the platforms on which applied AI projects are built within organisations.
– Post-market monitoring of AI systems to evaluate continued compliance with the regulation.
Timescale: expect around 2 years, given EU leaders want it fast-tracked.
Principles for Trusted AI
The 8 LFAI Principles for Trusted AI (R)REPEATS
Robustness, Privacy, Reproducibility, Equitability, Accountability, Explainability, Transparency, Security
Adopted by Open Source Projects
Alignment between capabilities
and governance, compliance & AI ethics
[Matrix: each of the eight principles above is mapped to the platform capabilities that support it, drawn from: model metadata, request logging, language wrappers, OpenAPI schema APIs, pre-packaged servers, out-of-the-box Prometheus metrics, explainer components, metrics monitoring, RBAC via service accounts, namespaced access, auth via ingress, historical feedback labelling, and GitOps integration.]
Programmatic governance with open & closed source
as policy
Three layers, each mapping into the one below:
1. Ethics frameworks, principles, guidelines (e.g. the LF AI principles)
2. Regulation, compliance, organisational policy (e.g. GDPR, ISO)
3. Open & closed source tools & frameworks
Ensuring principles by design, which map into higher-level organisational principles and policies.
Model Metadata Store
[Diagram: a Deploy metadata store, synced via GitOps and linked to an external customer metadata store, with automated metadata extraction from artifacts in the artifact store (model, explainer, drift detector, outlier detector). Core functions: Discover ("find available models"), Enrich ("add metadata to models"), Lineage/Audit ("check model history").]
Why does this matter?
Ensure proper governance,
auditing and discoverability of
models for better compliance
and risk management
Reproducibility with GitOps
You can restore state
to previous versions
Final thoughts
– As practitioners, we have a growing
professional responsibility to our craft
– Democratisation through COSS and
cloud-native tools
– Engage your peers in discussions
about Responsible AI
– Map Trusted AI principles to your
roadmap requirements
Get access to production machine learning at scale
– Connect with us at #CogX2021
– Product demos at our virtual booth
– Free trials for delegates
Thank you!
Questions? Please use Q&A feature.
Contact hello@seldon.io
@seldon_io