Deploying Models
Eng Teong Cheah
Microsoft MVP
Inferencing?
In machine learning, inferencing refers to the use of a trained model to predict labels for
new data on which the model has not been trained. Often, the model is deployed as part
of a service that enables applications to request immediate, or real-time, predictions for
individual data observations or small batches of them.
In Azure Machine Learning, you can create real-time inferencing solutions by deploying a
model as a service, hosted on a containerized platform such as Azure Kubernetes Service
(AKS).
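As a rough illustration, below is a minimal sketch of the entry (scoring) script that such a real-time service typically wraps. The model file name ("model.pkl"), the scikit-learn model type, and the input format are assumptions for the example, not details from the original deck.

# score.py - minimal entry script sketch for an Azure ML real-time service.
# Assumes a scikit-learn model saved as "model.pkl" was registered with the workspace.
import json
import os

import joblib
import numpy as np

def init():
    # Called once when the service container starts: load the registered model.
    global model
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model.pkl")
    model = joblib.load(model_path)

def run(raw_data):
    # Called on every scoring request: parse the JSON payload, predict, return results.
    data = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    return predictions.tolist()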
Deploying a Real-Time Inferencing Service
Machine learning inference during deployment
When deploying your AI model into production, you need to consider how it will make
predictions. The two main processes for AI models are:
•Batch inference: An asynchronous process that bases its predictions on a batch of
observations. The predictions are stored as files or in a database for end users or business
applications.
•Real-time (or interactive) inference: Frees the model to make predictions at any time
and trigger an immediate response. This pattern can be used to analyze streaming and
interactive application data.
Machine learning inference during deployment
Consider the following questions to evaluate your model, compare the two processes,
and select the one that suits your model:
•How often should predictions be generated?
•How soon are the results needed?
•Should predictions be generated individually, in small batches, or in large batches?
•How much latency is expected when executing the model?
•How much compute power is needed to execute the model?
•Are there operational implications and costs to maintain the model?
Batch inference
Batch inference, sometimes called offline inference, is a simpler inference process that
runs models at timed intervals and stores predictions for business applications.
Consider the following best practices for batch inference:
•Trigger batch scoring: Use Azure Machine Learning pipelines and the ParallelRunStep
feature to set up scheduled or event-based automation (see the pipeline sketch after
this list).
•Compute options for batch inference: Since batch inference processes don't run
continuously, it's recommended to automatically start, stop, and scale reusable clusters
that can handle a range of workloads. Different models require different environments,
so your solution needs to be able to deploy a specific environment and remove it when
inference is over, leaving the compute available for the next model.
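As a hedged sketch only, the following shows how such a batch scoring pipeline might be wired up with ParallelRunStep in the Azure Machine Learning Python SDK (v1). The dataset, environment, cluster, and script names are placeholders, not values from the original deck.

# Sketch of a batch scoring pipeline using ParallelRunStep (Azure ML SDK v1).
# Dataset, environment, cluster, and script names are placeholders.
from azureml.core import Dataset, Environment, Experiment, Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()
input_ds = Dataset.get_by_name(ws, "scoring-input")        # assumed registered dataset
env = Environment.get(ws, "batch-scoring-env")             # assumed environment
output = OutputFileDatasetConfig(name="scoring_results")

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",        # folder holding the batch scoring script
    entry_script="batch_score.py",     # script exposing init() and run(mini_batch)
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=env,
    compute_target="cpu-cluster",      # reusable cluster that starts, stops, and scales
    node_count=2,
)

batch_step = ParallelRunStep(
    name="batch-scoring",
    parallel_run_config=parallel_run_config,
    inputs=[input_ds.as_named_input("scoring_input")],
    output=output,
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[batch_step])
run = Experiment(ws, "batch-inference").submit(pipeline)

The published pipeline can then be invoked on a schedule or from an event, which is what the "trigger batch scoring" practice above refers to.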
Real-time inference
Real-time, or interactive, inference is an architecture in which model inference can be
triggered at any time and an immediate response is expected. This pattern can be used to
analyze streaming data, interactive application data, and more. It lets you take advantage
of your machine learning model in real time and avoids the cold-start delay associated
with batch inference.
Consider the following challenges and best practices when deciding whether real-time
inference is right for your model:
•The challenges of real-time inference: Latency and performance requirements make
real-time inference architecture more complex. A system might need to respond in 100
milliseconds or less, during which it must retrieve the data, perform inference, validate
and store the model results, run any required business logic, and return the results to
the calling system or application (a simple call-and-time sketch follows below).
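For illustration only, here is a minimal sketch of calling a deployed real-time scoring endpoint and timing the round trip. The scoring URI, service key, and payload shape are placeholders, not values from the original deck.

# Minimal sketch: call a real-time scoring endpoint and measure round-trip latency.
# The scoring URI, service key, and payload shape are placeholders.
import json
import time

import requests

scoring_uri = "https://<your-service-endpoint>/score"   # placeholder endpoint
api_key = "<service-key>"                               # placeholder key

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
payload = json.dumps({"data": [[0.1, 2.3, 4.5]]})

start = time.perf_counter()
response = requests.post(scoring_uri, data=payload, headers=headers, timeout=1.0)
elapsed_ms = (time.perf_counter() - start) * 1000

response.raise_for_status()
print(f"Predictions: {response.json()} ({elapsed_ms:.1f} ms round trip)")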
Real-time inference
•Compute options for real-time inference: The best way to implement real-time
inference is to deploy the model as a container to a Docker host or an Azure Kubernetes
Service (AKS) cluster and expose it as a web service with a REST API. This way, the model
runs in its own isolated environment and can be managed like any other web service.
Docker and AKS capabilities can then be used for management, monitoring, scaling, and
more. The model can be deployed on-premises, in the cloud, or on the edge (a deployment
sketch follows below).
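As a minimal sketch under assumed names (the model, environment, cluster, and entry script are placeholders), deploying a registered model to an existing AKS cluster with the Azure Machine Learning Python SDK (v1) might look like this:

# Sketch: deploy a registered model to an existing AKS cluster as a web service (SDK v1).
# Model, environment, cluster, and entry script names are placeholders.
from azureml.core import Environment, Workspace
from azureml.core.compute import AksCompute
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()
model = Model(ws, name="my-model")            # assumed registered model
env = Environment.get(ws, "inference-env")    # assumed environment

inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1, memory_gb=2, autoscale_enabled=True
)
aks_target = AksCompute(ws, "aks-cluster")    # AKS cluster already attached to the workspace

service = Model.deploy(
    workspace=ws,
    name="realtime-scoring",
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=aks_target,
)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)

Once deployed, the service exposes a REST scoring URI that can be called as in the latency sketch above.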
Real-time inference
•Multiregional deployment and high availability: Regional deployment and high-availability
architectures need to be considered in real-time inference scenarios, because latency and
the model's performance are critical. To reduce latency in multiregional deployments, it's
recommended to locate the model as close as possible to the consumption point. The model
and its supporting infrastructure should follow the business's high-availability and
disaster recovery (DR) principles and strategy.
Create a real-time
inference service
https://ceteongvanness.wordpress.com/2022/11/01/create-a-real-time-inference-service/
References
Microsoft Docs
