© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine Learning
using Kubeflow
Arun Gupta, @arungupta
Principal Open Source Technologist
https://dilbert.com/strip/2013-02-02
Machine Learning 101
THE AWS ML STACK: Broadest and deepest set of capabilities

ML Frameworks + Infrastructure (frameworks, interfaces, infrastructure):
Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Inferentia, FPGA

Containerized ML:
Amazon EKS, Auto Scaling, Optimized GPU AMI, Deep Learning Container, FSx CSI Plugin
ML Services – Amazon SageMaker:
Ground Truth data labelling, ML Marketplace, SageMaker Neo, built-in algorithms, SageMaker Notebooks, SageMaker Experiments, model tuning, SageMaker Autopilot, model hosting, SageMaker Model Monitor, SageMaker Studio IDE
AI Services (vision, speech, text, search, chatbots, personalization, forecasting, fraud, development, contact centers):
Amazon Rekognition + Custom Labels, Amazon Polly, Amazon Transcribe + Medical, Amazon Comprehend + Medical, Amazon Translate, Amazon Lex, Amazon Personalize, Amazon Forecast, Amazon Fraud Detector, Amazon CodeGuru, Amazon Textract, Amazon Kendra, Amazon Connect with Contact Lens
Storage and Analytics for Machine Learning

Storage: Amazon S3 Standard, S3 Standard-IA, S3 One Zone-IA, S3 Intelligent-Tiering (new), Amazon Glacier, S3 Glacier Deep Archive (new), Amazon EBS

Analytics: Amazon Redshift + Redshift Spectrum, Amazon QuickSight, Amazon EMR (Hadoop, Spark, Presto, Pig, Hive…19 total), Amazon Athena, Amazon Kinesis, Amazon Elasticsearch Service, AWS Glue
Why Machine Learning on Kubernetes?
Composability | Portability | Scalability
On-premises and cloud
http://www.shutterstock.com/gallery-635827p1.html
Machine Learning on K8s: Without Kubeflow
@aronchik
Machine Learning on K8s: With Kubeflow
@aronchik
What is Kubeflow?
Containerized machine learning platform
Makes it easy to develop, deploy, and manage portable,
scalable end-to-end ML workflows on k8s
“Toolkit” – loosely coupled tools and blueprints for ML
End-to-end ML workflow – ML code is only a small component
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
What’s in Kubeflow?
Amazon EKS: run Kubernetes in the cloud
Managed Kubernetes control plane, attach data plane
Native upstream Kubernetes experience
Platform for enterprises to run production-grade workloads
Integrates with additional AWS services
Getting started with Amazon EKS
eksctl CLI—create Amazon EKS clusters (eksctl.io)
Creates all resources needed for the cluster
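As a minimal sketch, a cluster with a GPU node group can be declared in an eksctl ClusterConfig file (cluster name, region, and sizes here are illustrative, not from the talk):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: kubeflow-demo        # illustrative name
  region: us-west-2          # illustrative region
nodeGroups:
  - name: gpu-nodes
    instanceType: p3.2xlarge # GPU instance type
    desiredCapacity: 2

Running eksctl create cluster -f cluster.yaml then creates the VPC, control plane, and worker nodes in one step.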
Amazon EKS-Optimized GPU AMI
Built on top of the standard Amazon EKS-Optimized AMI
Includes packages to support Amazon EC2 P2/P3/G3/G4 instances:
• NVIDIA drivers
• nvidia-docker2 package
• nvidia-container-runtime (as default runtime)
GPU Clock Optimization
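With the NVIDIA device plugin running on these nodes, containers request GPUs through the standard nvidia.com/gpu resource. A minimal sketch (pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:10.1-base  # illustrative CUDA base image
      command: ["nvidia-smi"]       # prints visible GPUs, a quick sanity check
      resources:
        limits:
          nvidia.com/gpu: 1         # schedules the pod onto a GPU node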
Cluster Autoscaler Improvements
Add GPU support
autoscaler#1584 GPU autoscaling supported for AWS
autoscaler#1589 GPU scale down performance optimization
Prevent CA from removing a node with an ML training job running
Annotate the training pod with "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" (sketched below)
Recommended: create a GPU node group per AZ
Improve network communication performance
Prevent ASG rebalancing
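A minimal sketch of that annotation on a training pod (name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: mnist-training                # illustrative name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # CA will not scale down this pod's node
spec:
  containers:
    - name: train
      image: my-registry/mnist-train:latest   # illustrative training image
      resources:
        limits:
          nvidia.com/gpu: 1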
AWS Deep Learning Containers
KEY FEATURES
• Pre-packaged Docker container images, fully configured and validated
• Support for TensorFlow, Apache MXNet
• Single- and multi-node training and inference
• Best performance and scalability without tuning
• Customizable container images
• Works with Amazon EKS, Amazon ECS, and Amazon EC2
Kubeflow on Desktop
MiniKF: local Kubeflow deployment using VirtualBox and Vagrant
• Minikube -> Kubernetes
• MiniKF -> Kubeflow (includes minikube)
Runs on macOS, Linux, and Windows
Does not require k8s-specific knowledge
Kubeflow on Cloud
Major cloud providers supported
Choices on Amazon Web Services
• Self-managed k8s on EC2: Kops, CloudFormation, Terraform
• Amazon EKS
Getting Started with Kubeflow on Amazon EKS
Jupyter Notebook
Create and share documents that contain live code,
equations, visualizations, and narrative text
• UI to manage notebooks
• Integrate with RBAC/IAM
• Ingress / Service Mesh
Jupyter Notebook
Fairing
Python SDK to build, train and deploy ML models
• Easily package ML training jobs
• Train ML models from notebook to k8s
• Streamline the model development process
Set up Kubeflow Fairing for training and prediction
https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/02_Fairing/02_06_fairing_e2e.ipynb
Train an XGBoost model remotely on Kubeflow
Deploy the trained model to Kubeflow for prediction
Katib – Hyperparameter Tuning
Hyperparameters are parameters external to the model that control training, e.g. learning rate, batch size, epochs
Tuning finds a set of hyperparameters that optimizes an objective function, e.g. find the optimal batch size and learning rate to maximize prediction accuracy
Hyperparameter Tuning is Hard
More hyperparameters -> exponential space growth
Tuning by hand is inefficient and error-prone
Need to track metrics across multiple jobs
Managing resources and infrastructure for many jobs is hard
Variety of frameworks and algorithms to support
Katib – Hyperparameter Tuning
trialName Validation-accuracy accuracy --lr --num-layers --optimizer
random-experiment-rfwwbnsd 0.974920 0.984844 0.013831565266960293 4 sgd
random-experiment-vxgwlgqq 0.113854 0.116646 0.024225789898529138 4 ftrl
random-experiment-wclrwlcq 0.979697 0.998437 0.021916171239020756 4 sgd
random-experiment-7lsc4pwb 0.113854 0.115312 0.024163810384272653 5 ftrl
random-experiment-86vv9vgv 0.963475 0.971562 0.02943228249244735 3 adam
random-experiment-jh884cxz 0.981091 0.999219 0.022372025623908262 2 sgd
random-experiment-sgtwhrgz 0.980693 0.997969 0.016641686851083654 4 sgd
random-experiment-c6vvz6dv 0.980792 0.998906 0.0264125850165842 3 sgd
random-experiment-vqs2xmfj 0.113854 0.105313 0.026629394628228185 4 ftrl
random-experiment-bv8lsh2m 0.980195 0.999375 0.021769570793012488 2 sgd
random-experiment-7vbnqc7z 0.113854 0.102188 0.025079750575740783 4 ftrl
random-experiment-kwj9drmg 0.979498 0.995469 0.014985919312945063 4 sgd
Hyperparameters
Trial template
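The slide's screenshot shows a trial template. As a rough sketch of how the pieces fit together (API version and field names from the Katib v1alpha3 era of this talk; the training image and values are illustrative), an Experiment combines the objective, the search algorithm, the parameter ranges from the table above, and a trial template that stamps out one Job per trial:

apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  name: random-experiment            # illustrative name
spec:
  objective:
    type: maximize
    objectiveMetricName: Validation-accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --optimizer
      parameterType: categorical
      feasibleSpace:
        list: ["sgd", "adam", "ftrl"]
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: {{.Trial}}
                  image: my-registry/mnist-train:latest   # illustrative image
                  command:
                    - "python"
                    - "train.py"                          # illustrative entrypoint
                    {{- with .HyperParameters}}
                    {{- range .}}
                    - "{{.Name}}={{.Value}}"
                    {{- end}}
                    {{- end}}

Katib substitutes each trial's suggested values into the template, runs the resulting Jobs, and collects the reported metrics.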
KFServing: Model serving and management
Provides a Kubernetes CRD for serving ML models on arbitrary frameworks.
Encapsulates the complexity of autoscaling, networking and server configuration to bring features
like scale to zero, transformations, and canary rollouts to your deployments
Enables a simple, pluggable, and complete story for your production ML inference server by providing
prediction, pre-processing, post-processing and explainability.
KFServing Custom Resource
(Image callouts: S3 secret attached to a Service Account; trained model loaded from S3)
https://github.com/kubeflow/kfserving/blob/master/docs/samples/s3/tensorflow_s3.yaml
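The linked sample wires S3 credentials to the inference service through a Secret and a Service Account, roughly like this (annotation keys as used by the KFServing samples of this era; names and credentials are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: mysecret                                  # placeholder name
  annotations:
    serving.kubeflow.org/s3-endpoint: s3.amazonaws.com
    serving.kubeflow.org/s3-usehttps: "1"
type: Opaque
data:
  awsAccessKeyID: <base64-encoded-access-key-id>
  awsSecretAccessKey: <base64-encoded-secret-key>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa                                        # referenced by the inference service
secrets:
  - name: mysecret

The inference service then names this service account (as in the serviceAccount field of the sklearn example below) and points storageUri at an s3:// path instead of gs://.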
Pluggable Interface
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
spec:
default:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris"
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
name: "flowers-sample"
spec:
default:
tensorflow:
storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
name: "pytorch-cifar10"
spec:
default:
pytorch:
storageUri: "gs://kfserving-samples/models/pytorch/cifar10"
modelClassName: "Net"
KFServing Interface – Scikit Learn
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
name: "sklearn-iris"
spec:
default:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris"
serviceAccount: inferencing-robot
minReplicas: 3
maxReplicas: 10
resources:
requests:
cpu: 2
gpu: 1
memory: 10Gi
canaryTrafficPercent: 25
canary:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris-v2"
serviceAccount: inferencing-robot
minReplicas: 3
maxReplicas: 10
resources:
requests:
cpu: 2
gpu: 1
memory: 10Gi
Distributed Training
Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS
https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Pipelines – Machine Learning Job Orchestrator
Compose, deploy, and manage end-to-end ML workflows
• End-to-end orchestration
• Easy, rapid, and reliable experimentation
• Easy re-use
Built using the Pipelines SDK: kfp.compiler, kfp.components, kfp.Client
Uses Argo under the hood to orchestrate resources
Creating Kubeflow Pipeline Components
@dsl.pipeline(
    name='Sample Trainer',
    description=''
)
def sample_train_pipeline(... ):
    create_cluster_op = CreateClusterOp('create-cluster', ...)
    analyze_op = AnalyzeOp('analyze', ...)
    transform_op = TransformOp('transform', ...)
    train_op = TrainerOp('train', ...)
    predict_op = PredictOp('predict', ...)
    confusion_matrix_op = ConfusionMatrixOp('confusion-matrix', ...)
    roc_op = RocOp('roc', ...)

kfp.compiler.Compiler().compile(sample_train_pipeline, 'my-pipeline.zip')

(Callouts: pipeline decorator, pipeline function, pipeline components, compile pipeline)
Creating Kubeflow Pipeline Components
Metadata – Model Tracking
• Metadata schema to track artifacts related to
execution contexts
• Metadata API for storing and retrieving
metadata
• Client libraries for end-users to interact with
the Metadata service from their Notebooks or
Pipelines code.
Making Kubeflow a first class citizen on AWS
• Centralized and unified Kubernetes cluster logs in Amazon CloudWatch
• External traffic and authentication management with ALB Ingress Controller
• TLS and authentication with AWS Certificate Manager and AWS Cognito
• In-built FSx CSI driver w/S3 data repository integration to optimize training performance (see the StorageClass sketch below)
• Elastic File System integration for common data sharing in JupyterHub
• Easier and customizable Kubeflow installation with kfctl and Kustomize support
• Kubeflow Pipeline integration with AWS Services – Amazon EMR, Athena, SageMaker
• Add ECR integration to Kubeflow Fairing
• Jupyter Notebook images with AWS CLI installed and ECR support
• Auto detect GPU worker nodes and install NVIDIA device plugin
https://www.kubeflow.org/docs/aws/
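As a sketch of the FSx integration above (StorageClass parameters as exposed by the aws-fsx-csi-driver; subnet, security group, and bucket are placeholders), a StorageClass can provision a Lustre filesystem backed by an S3 data repository:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0        # placeholder subnet
  securityGroupIds: sg-0123456789abcdef0    # placeholder security group
  s3ImportPath: s3://my-training-data       # placeholder S3 data repository

A PersistentVolumeClaim against this class gives training pods shared, low-latency access to the S3-backed dataset.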
AWS Kubeflow Roadmap
Kubeflow v1.0 - Theme: Enterprise Readiness
E2E examples and increased docs on Kubeflow site
Upstream testing for Kubeflow on AWS
Support DIY K8S on AWS
IAM Roles for Service Accounts integration with Jupyter notebooks
Support for managed contributors
Feature store - Feast
• Discoverability and reuse of features
• Standardization of features
• Access to features for training and serving
• Consistency between training and serving
Fully managed infrastructure in Amazon SageMaker
Introducing Amazon SageMaker Operators for Kubernetes
Kubernetes customers can now train, tune, and deploy models in Amazon SageMaker
Under the hood – Amazon SageMaker and Kubernetes
kubectl apply YAML
Key Features
• Amazon SageMaker Operators for training, tuning, inference
• Natively interact with Amazon SageMaker jobs using K8s tools (e.g., get pods, describe)
• Stream and view logs from Amazon SageMaker in K8s
• Helm Charts to assist with setup and spec creation
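A rough sketch of such a resource (API group and field names follow the operator's samples; the role ARN, image, and bucket are placeholders to replace with your own; see the repo linked below):

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: xgboost-mnist                        # illustrative name
spec:
  region: us-west-2
  roleArn: arn:aws:iam::123456789012:role/sagemaker-role                    # placeholder role
  algorithmSpecification:
    trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/xgboost:1   # placeholder image
    trainingInputMode: File
  hyperParameters:
    - name: num_round
      value: "10"
  outputDataConfig:
    s3OutputPath: s3://my-bucket/xgboost-output   # placeholder bucket
  resourceConfig:
    instanceType: ml.m4.xlarge
    instanceCount: 1
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 86400

kubectl apply -f on this YAML submits a managed training job, and kubectl get trainingjob reports its SageMaker status.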
https://github.com/aws/amazon-sagemaker-operator-for-k8s
References
Workshop: eksworkshop.com/advanced/420_kubeflow/
Jupyter notebooks: github.com/aws-samples/eks-kubeflow-workshop/
Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Please fill out the session rating in the feedback form and collect your goodie at the end of the day
THANK YOU

Editor's Notes

  • #2: Kubernetes provides isolation, auto-scaling, load balancing, flexibility and GPU support. These features are critical for running computationally and data-intensive, hard-to-parallelize machine learning models. The declarative syntax of Kubernetes deployment descriptors makes it easy for non-operationally focused engineers to train machine learning models on Kubernetes. This talk will explain why and how Amazon EKS, our managed service, is the right Kubernetes platform for building your Machine Learning solutions.
  • #3: When we learn how to ride a bike, we’re using the data provided by a friend, or a sibling, or a parent to train our mind to ride the bike. Machine Learning is applying those concepts, but machine-to-machine.
  • #4: Machine Learning can be realized in a variety of ways. Let's see how it looks in a DIY fashion. There is a training phase. You start with training data that will be used to create a model. The blue cloud in the middle is your code that reads the training data and creates a model. Once the model is generated, test data is fed to the model to find out the accuracy of the model. The algorithm chosen in your application also defines how long it takes to generate the model. You keep repeating this cycle until a model with reasonable accuracy is obtained. After training is done, there is an inference phase. In this phase, input data, typically real-world data, is fed to the generated model and predictions are made. The ultimate question in ML is: how good are the predictions? So if your ML model is to identify a hand-written number and it is presented a hand-written number, can it be accurately identified? If not, then you go back to training and then infer again.
  • #5: Within AWS we see the stack as having three layers:   The bottom layer of the stack is for expert machine learning practitioners who work at the framework level and are comfortable building, training, tuning, and deploying machine learning models. This is the foundation for all of the innovation we drive at every other layer of the stack. There are GPU instances like P3 and P3dn where the vast majority of deep learning and machine learning is done in the cloud. All the common frameworks such as TensorFlow, PyTorch, Caffe2, and Apache MXNet are supported. We will always make sure that all the frameworks you care about are supported equally well, so you have the right tool for the right job.
  • #6: While we’re seeing a lot of activity at that bottom layer (infrastructure and frameworks), the reality is that there just aren't that many expert machine learning practitioners in the world. That’s why we built and launched Amazon SageMaker, a managed ML service in the middle tier, which makes it much easier for every day developers and data scientists to get up and running with machine learning…
  • #7: Moving on, the top level of the stack is what people often call artificial intelligence (AI), because it closely mimics human cognition. And our services here are for customers that don’t want to deal with models and training. Customers can easily build these capabilities into new and existing applications to reduce costs, increase speed, and improve customer satisfaction and insight. We offer multiple pre-trained AI services covering vision, speech, language, chatbots, forecasting and recommendations. The key here is that developers with no prior machine learning experience can easily build sophisticated AI driven applications, like an AI driven contact center or live media subtitling.
  • #9: To summarize, the AWS AI and ML stack has three layers. Each layer addressing different audiences: ML Frameworks & Infrastructure: For expert machine learning practitioners who work at the framework level. ML Services: For every day developers and data scientists we built and launched Amazon SageMaker. AI Services: Developers with no prior machine learning experience can easily build sophisticated AI driven applications
  • #10: Deep storage and analytics capabilities are needed for a comprehensive ML solution. Storage systems should be able to support high throughput and low latency, with best security around that data. You need a deep collection of real-time analytics. AWS offers all of that. We’ll cover one part of this later in this talk as well.
  • #11: Why is Kubernetes well suited for Machine Learning? There are three reasons: Composability, Portability, Scalability. ML is about data ingestion, data analysis, data transformation, data validation, building a model, model validation, training at scale, inference and much more. Each of these phases ends up being a microservice, and it turns out Kubernetes provides a great platform for composing these microservices together. It allows multiple Data Scientists to choose a solution that works for them. It also enables separation of duties between Ops and Data Scientists. Using containers and Kubernetes as a base layer allows you to use open source frameworks (e.g. Kubeflow) and use them to train and develop models on k8s. This allows you to easily migrate your solution from laptop, to on-premises, to the cloud. We'll talk about Kubeflow later in this presentation. Kubernetes allows you to scale the applications. It not only provides support for more nodes, but more GPUs, and we'll talk about performance optimizations on Amazon EKS for near-linear scalability later in this talk. More disk/network, low-latency filesystems, which again we'll talk about later. Also, you need to run the experiments multiple times, tweaking parameters a little bit every time, so you need to be able to scale your infrastructure to support these needs. The AWS cloud meets those needs well.
  • #12: Let's see how we can leverage these containers on K8s. As a Data Scientist, you just want to do Data Science and run your ML models on EKS. But you need to become an expert in containers, packaging, persistent volumes, scaling, GPUs, drivers, DevOps and much more. Every Data Scientist has a slightly different view on what the right tools are for modeling, UX, frameworks, storage and the multiple other items that are needed for ML.
  • #13: Kubeflow makes it easy for everyone to develop, deploy and manage portable, distributed Machine Learning on k8s. Anywhere you are running K8s, you should be able to run Kubeflow. Once an Amazon EKS cluster is up and running, Kubeflow can be deployed on top of it.
  • #14: Kubeflow was introduced about 2 years ago at KubeCon. It provides a containerized machine learning platform. The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Kubeflow has evolved to become a toolkit of loosely coupled tools for machine learning. Its goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Kubeflow provides a unified layer on top of k8s to run ML workloads, unifying the competing interests of data scientists and DevOps: data scientists want to write algorithms, experiment fast, and get access to data; DevOps care about security, reliability, and cost.
  • #15: Containerized Machine Learning platform: JupyterHub for collaborative & interactive training; a TensorFlow training controller; a TensorFlow Serving deployment; SeldonCore for complex inference and non-TF models; Pipelines, powered by Argo, for workflows; Experiments to run different configurations of pipelines; Metadata, the information about executions (runs), models, datasets, and other artifacts; and wiring to make it work on any k8s anywhere.
  • #18: We released an EKS-optimized GPU AMI. This is the basic building block that allows you to create GPU-powered Amazon EKS cluster. It is built on top of standard Amazon EKS-optimized AMI. Includes the usual NVIDIA drivers, package, and runtime to provide support for GPU instances in AWS cloud. NVIDIA driver uses an autoboost feature, which varies the GPU clock speeds. By disabling the autoboost feature and setting the GPU clock speeds to their maximum frequency, you can consistently achieve the maximum performance with your GPU instances. 
  • #19: Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster for burstable workloads. This means that there will be times when pods fail to run due to insufficient resources. At other times, nodes are underutilized for an extended period of time, so the pods there can be placed on other nodes and that node can be reclaimed. Most of the work for GPUs is now being put into Cluster Autoscaler; let's talk about some of that. A couple of pull requests that we made are highlighted here. The first PR enabled GPU support for multiple cloud providers, which allowed the problem to be solved in an upstream-compliant way; it basically added a label and type to the GPU nodes. GPU nodes are not sensitive to CPU and memory but use a different set of metrics for CA to scale down the nodes; the second PR adds support for that. safe-to-evict is a standard k8s annotation. It can be specified on a pod, and then CA will not remove that node during scale down. This is particularly relevant for ML workloads, as you would not want a pod terminated that has completed hours of training but still has work to do. The recommendation is to create GPU node groups per AZ. This has a couple of benefits: first, it helps with network communication and low data transfer costs in a distributed training job; secondly, it avoids ASG rebalancing across multiple AZs. --- Escalator is designed for large batch or job-based workloads that cannot be force-drained and moved when the cluster needs to scale down - Escalator will ensure pods have completed on nodes before terminating them. It is also optimized for scaling up the cluster as fast as possible to ensure pods are not left in a pending state.
  • #20: 1/ AWS Deep Learning Containers provides pre-packaged Docker container images that are fully configured and validated, so customers no longer have to spend time building and testing the images. 2/ Since we’ve already optimized these docker images for AWS, customers can get the best performance and scalability right away – no tuning required. 3/ AWS Deep Learning Containers are built to work with Amazon EKS, Amazon ECS, and Amazon EC2 to give developers the flexibility and choice. 4/ Customers can also customize these container images to include their own tools and packages for a high degree of control over features of their environment such as monitoring, compliance, and scaling.
  • #24: Jupyter notebook provides an easy on-ramp to build, deploy and train ML models. You can create notebooks that contain live code and provide interactive output in a wide variety of formats such as HTML, images, video, and custom MIME types. Jupyter supports over 40 programming languages, including Python, R, Julia, and Scala. The Jupyter notebook in Kubeflow doesn't have that many language kernels yet; however, users can definitely customize on their own. Kubeflow is 0.7 today. One of the new features introduced in 0.6 was multi-user isolation of user-created resources. This feature allows multiple users to operate on a shared Kubeflow deployment without stepping on each others' jobs and resources. The isolation mechanisms also prevent accidental deletion/modification of resources of other users in the deployment. An administrator needs to deploy Kubeflow and configure the authentication service for the deployment. A user can log into the system and will by default be accessing their primary profile. A profile is a collection of Kubernetes resources along with a Kubernetes namespace of the same name.
  • #26: By using Kubeflow Fairing and adding a few lines of code, you can run your ML training job locally or in the cloud, directly from Python code or a Jupyter notebook. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint. Kubeflow Fairing packages your Jupyter notebook, Python function, or Python file as a Docker image, then deploys and runs the training job on Kubeflow. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint. - Easily package ML training jobs: Enable ML practitioners to easily package their ML model training code, and their code’s dependencies, as a Docker image. - Easily train ML models in a hybrid cloud environment: Provide a high-level API for training ML models to make it easy to run training jobs in the cloud, without needing to understand the underlying infrastructure. - Streamline the process of deploying a trained model: Make it easy for ML practitioners to deploy trained ML models to a hybrid cloud environment.
  • #31: Hyperparameters are the parameters that are specified for an algorithm before the learning/training begins. So, let's say a data scientist has chosen an algorithm; that will then define what kind of hyperparameters can be specified. For example: learning rate, batch size, number of epochs, maximum depth allowed for the decision tree, number of trees in a random forest, number of neurons in a neural network layer, how many layers in my neural network? Once the hyperparameters are chosen, multiple training runs are conducted with different values of hyperparameters. A model is generated after each run and evaluated for optimality, such as time taken to complete the training, error rate, and accuracy. There are methods like grid search, random search, and Bayesian optimization to define the spectrum of values of hyperparameters. Katib means secretary or scribe in Arabic. As Vizier stands for a high official or a prime minister in Arabic, the project Katib is named in honor of Vizier. Extensible: framework agnostic (TensorFlow, PyTorch, MXNet, …), customizable algorithm backend. Experiment: "optimization loop" for some specific problem. Suggestion: a proposed solution to the problem. Trial: one iteration of the loop. Job: evaluate a trial and calculate objective value.
  • #32: HPO is hard because more hyperparameters means exponential space growth. Also, tuning by hand is not efficient. Maybe you want to anchor on one particular value and then vary others; for example, fix the batch size and then vary the learning rate to find optimal values. You want to be able to track metrics across different jobs. You can choose different frameworks like TensorFlow, PyTorch or MXNet. The algorithm could be random search, grid search, or Bayesian optimization.
  • #37: Each line runs from right to left, tracing the three parameter values used for one run of the experiment. The left side shows validation-accuracy and accuracy for the training output; the two use different datasets. The leftmost columns match the required objective specified in the Experiment. From here, we can choose the best/optimized values of validation accuracy and accuracy, identify the corresponding parameters, and use them for training.
  • #38: One of the most important aspects of building Cloud Native is knowing your responsibilities and, wherever possible, leveraging the awesome landscape of Cloud Native technologies. KFServing has chosen Istio and Knative to solve the core serverless, networking, and revision-management problems, and KFServing decorates those layers with ML-specific opinions. It builds on top of Kubernetes, so you have full control up and down the stack for whatever you need. Because the whole stack is open, there is a privilege and a responsibility to upstream functionality: if ML customers have networking or serverless requirements, the KFServing community will deliver them, but not in KFServing's own code. We move down the stack as far as we can and contribute the code where it serves its widest purpose. This makes the entire ecosystem stronger and avoids reinventing the wheel over and over. KFServing's mission is to build an ML serving platform that is simple yet powerful. It focuses on data scientists, which means it builds concepts that make sense in their domain. It aims to solve production model-serving use cases by providing performant, high-abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX. Low bar, high ceiling. More recent versions are more extensible with the concept of transformers: essentially a wrapper function you can call before inferencing to get more information, for example taking a user id and fetching location and past purchase history.
  • #40: We started with our interface. This was one of the most important things to get right: we needed to find the lowest floor possible for data scientists. Unlike other serving platforms, one of our key decisions was to choose a non-container interface. Most ML frameworks support model serialization to a file, so we picked the most common frameworks and built or found a set of out-of-the-box servers to support them. These servers simply load the serialized model and start an HTTP endpoint. That meant no custom servers, no containers, no readiness checks, no bespoke code; all the complexity around server management was simply handled, and it just worked. In about 8 lines, you could describe all of the infrastructure you needed to get your model up and running.
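A sketch of that roughly-8-line spec, expressed as a dict and applied with the Kubernetes Python client; the v1alpha2 API group and the sample storageUri follow the KFServing examples of this era and should be treated as assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "flowers-sample", "namespace": "default"},
    "spec": {
        "default": {
            "predictor": {
                # Point at a serialized model; KFServing supplies the server.
                "tensorflow": {
                    "storageUri": "gs://kfserving-samples/models/tensorflow/flowers"
                }
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org", version="v1alpha2",
    namespace="default", plural="inferenceservices",
    body=inference_service,
)
```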
  • #41: Keep in mind that principle of a low floor with a high ceiling. We'd found a pretty low floor, but where can our users go from there? They might want to manage replica limits, service accounts, or even specify resource requests like GPUs. Everything extends cleanly and consistently using Cloud Native terminology that users might be familiar with or might encounter in the future. In the first half of this spec, the user's workload scales between 3 and 10 replicas, with 2 cores and a GPU per replica. They've also specified a service account that grants the needed permissions. All of these features should feel very familiar if you've ever deployed with Kubernetes, and in fact that's what we're passing through to Kubernetes under the hood. One of our principles was a single-resource semantic: we wanted to see if we could take all of the infrastructure related to a single model and contain it within a single resource. A good example is the second chunk of this spec, the canary specification. It allows users to specify a second serving configuration and experiment on it with a percentage of their traffic. In other systems, this would mean a ton of extra resources to wire up routing configurations and two full stacks, but in KFServing you can simply copy your default spec, tweak it, and rename it canary. In our example, the only difference between default and canary is the pointer to the storageUri. Without a single-resource semantic, a data scientist would need to verify that all of their configurations were applied together and at the same time; if one of the resources failed to apply, they might get stuck in limbo. Because we've simplified this structure into a single resource, those risks don't even need to enter their mind.
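A sketch of that extended spec as the same kind of dict; the replica, resource, and canary field names follow the v1alpha2 layout as I understand it, and the service account name and model URIs are placeholders:

```python
spec = {
    "default": {
        "predictor": {
            "minReplicas": 3,
            "maxReplicas": 10,
            "serviceAccountName": "models-sa",           # placeholder
            "tensorflow": {
                "storageUri": "s3://my-bucket/model/v1",  # placeholder
                "resources": {
                    "limits": {"cpu": "2", "nvidia.com/gpu": "1"},
                },
            },
        }
    },
    # Canary: copy the default, tweak it, and route a slice of traffic to it.
    "canaryTrafficPercent": 10,
    "canary": {
        "predictor": {
            "tensorflow": {
                "storageUri": "s3://my-bucket/model/v2",  # the only difference
            }
        }
    },
}
```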
  • #43: Training large models on vast amounts of data can drastically improve model performance. But consider a deep network with millions of parameters: how do we train it without waiting for days, or even multiple weeks? That's where distributed training comes in. It allows us to train and serve a model across multiple physical machines, which can be achieved using model parallelism and data parallelism. When a big model cannot fit into a single node's memory, model-parallel training can be employed to handle it. Data parallelism distributes the data between different tasks. Data parallelism is the most common training configuration: it involves multiple tasks in a worker job training the same model on different mini-batches of data, updating shared parameters hosted in one or more tasks in a ps (parameter server) job. All tasks typically run on different machines or containers. Distributed training in Kubeflow is provided using TFJob. If you have a fast link with NCCL, distributed training performance is better with a synchronous parameter server instead of the asynchronous method.
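A sketch of a TFJob with worker and parameter-server replicas, again as a dict applied through the Kubernetes Python client; the kubeflow.org/v1 API version and the training image are assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

def replica(count, image):
    """Build one tfReplicaSpecs entry (Worker or PS)."""
    return {
        "replicas": count,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [{
                    "name": "tensorflow",  # TFJob expects this container name
                    "image": image,        # placeholder training image
                }]
            }
        },
    }

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "dist-mnist", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "PS": replica(1, "<your-registry>/dist-mnist:latest"),
            "Worker": replica(3, "<your-registry>/dist-mnist:latest"),
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob,
)
```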
  • #44: We wrote a blog post on best practices for optimizing machine learning training performance on Amazon EKS, to improve throughput and minimize training times. We used Kubeflow and the FSx for Lustre CSI driver, training ResNet-50 (a standard benchmark network) on the ImageNet dataset. We trained using mixed precision on 20 p3.16xlarge instances (160 V100 GPUs) with a batch size of 256 per GPU (an aggregate batch size of ~41k). To achieve better scaling efficiency, we used Horovod with TensorFlow. We observed near-linear scaling, between 90-100% scaling efficiency up to 160 GPUs, and 98k images per second.
  • #45: Kubeflow Pipelines enables and simplifies the orchestration of end-to-end ML pipelines. An ML workflow includes all of the components that make up the steps in the workflow and how the components interact with each other. Pipelines make it easy to try numerous ideas and techniques and to manage your various trials/experiments, and they enable you to reuse components and pipelines to quickly assemble end-to-end solutions without having to rebuild each time. In the SDK, kfp.compiler includes classes and methods for building Docker container images for your pipeline components; kfp.components includes classes and methods for interacting with pipeline components; and kfp.Client contains the Python client library that lets you run a pipeline and create an experiment.
  • #46: You can create a pipeline directly from a YAML file, or create it using the SDK. There are several different ways to run a pipeline: directly from the UI, by invoking it from the SDK, or by setting up a schedule. Pipeline run metadata is stored in the Kubeflow DB.
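A sketch of the SDK path, assuming a compiled pipeline package (my-pipeline.tar.gz, produced as in the next note) and a reachable Pipelines endpoint; the host URL is a placeholder, and in-cluster notebooks can usually omit it:

```python
import kfp

# Connect to the Kubeflow Pipelines API.
client = kfp.Client(host='http://localhost:8080/pipeline')  # placeholder host

experiment = client.create_experiment('demo')
run = client.run_pipeline(
    experiment.id,         # group runs under an experiment
    'my-pipeline-run-1',   # run name
    'my-pipeline.tar.gz',  # compiled pipeline package
)
```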
  • #48: Write your application code, my-app-code.py; for example, write code to transform data or train a model. Create a Docker container image that packages your program (my-app-code.py) and upload the container image to a registry. To build a container image from a given Dockerfile, you can use the Docker command-line interface or the kfp.compiler.build_docker_image method from the Kubeflow Pipelines SDK. Write a component function using the Kubeflow Pipelines DSL to define your pipeline's interactions with the component's Docker container; your component function must return a kfp.dsl.ContainerOp. Write a pipeline function using the DSL to define the pipeline and include all the pipeline components, using the kfp.dsl.pipeline decorator to build a pipeline from your pipeline function. Compile the pipeline to generate a compressed YAML definition of the pipeline; the Kubeflow Pipelines service converts this static configuration into a set of Kubernetes resources for execution. Finally, use the Kubeflow Pipelines SDK to run the pipeline.
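A minimal end-to-end sketch of those steps, assuming the kfp SDK of this era; the component just echoes text, so the image and command are placeholders rather than a real training step, and the compiled package can then be submitted as shown after #46:

```python
import kfp
from kfp import compiler, dsl

# Component function: defines the pipeline's interaction with one container.
def echo_op(text):
    return dsl.ContainerOp(
        name='echo',
        image='alpine:3.10',          # placeholder component image
        command=['sh', '-c'],
        arguments=['echo %s' % text],
    )

# Pipeline function: wires the components together.
@dsl.pipeline(name='echo-pipeline', description='A one-step demo pipeline.')
def echo_pipeline(text='hello kubeflow'):
    echo_op(text)

# Compile to a compressed definition the Pipelines service can execute.
compiler.Compiler().compile(echo_pipeline, 'my-pipeline.tar.gz')
```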
  • #49: Your ML workflows generate a lot of metadata: information such as executions (runs), models, datasets, and other artifacts. Artifacts are the files and objects that form the inputs and outputs of the components in your ML workflow. As you start trying out multiple combinations, you need a way to manage all of this metadata together. That's exactly the purpose of the Metadata project: it tracks and manages the metadata that the workflows produce. The Metadata component comes pre-installed in Kubeflow 0.7. It is currently an Alpha version, and the development team is interested in your feedback.
  • #50: Kubeflow on AWS: Manage EKS cluster provisioning with eksctl, with the flexibility to start different flavors of GPU nodes. Manage external traffic with the AWS ALB Ingress Controller: traffic goes through the ALB Ingress Controller to the Istio gateway and is then forwarded to Ambassador inside the cluster. Leverage the Amazon FSx CSI driver to manage a Lustre file system, which is optimized for compute-intensive workloads such as high-performance computing and machine learning; it can scale to hundreds of GB/s of throughput and millions of IOPS. Centralize and unify Kubernetes cluster logs in CloudWatch, which helps debugging and troubleshooting. Enable TLS and authentication with AWS Certificate Manager and Amazon Cognito. Enable private access for your Kubernetes cluster's API server endpoint. Automatically detect GPU instances and install the NVIDIA device plugin.
  • #52: As Kubeflow continues to evolve, you can be assured it will continue to work well on AWS.
  • #53: If you are modeling a taxi service, then Driver might be an entity and daily trip count might be a feature. Other interesting features might be the distance between the driver and a destination, or the time of day. A combination of multiple features is used as input for a machine learning model. As your ML workloads scale, features play an important role in both training and serving. Typical challenges: Features not being reused: features representing the same business concepts are redeveloped many times, when existing work from other teams could have been reused. Feature definitions vary: teams define features differently, and there is no easy access to the documentation of a feature. Hard to serve up-to-date features: combining streaming- and batch-derived features, and making them available for serving, requires expertise that not all teams have; ingesting and serving features derived from streaming data often requires specialized infrastructure, so teams are deterred from making use of real-time data. Inconsistency between training and serving: training requires access to historical data, whereas models that serve predictions need the latest values; inconsistencies arise when data is siloed into many independent systems requiring separate tooling. Feast is an open source feature store that can be integrated with Kubeflow to address these feature storage needs. Feast's solutions: Discoverability and reuse of features: a centralized feature store allows organizations to build up a foundation of features that can be reused across projects; teams can utilize features developed by other teams, and as more features are added to the store it becomes easier and cheaper to build models. Access to features for training: Feast allows users to easily access historical feature data to produce datasets of features for training models, so ML practitioners can focus more on modelling and less on feature engineering. Access to features in serving: feature data is also available to models in production through a feature serving API, designed to provide low-latency access to the latest feature values. Consistency between training and serving: Feast provides consistency by managing and unifying the ingestion of data from batch and streaming sources, using Apache Beam, into both the feature warehouse and the feature serving stores; users can query features in the warehouse and the serving API using the same set of feature identifiers. Standardization of features: teams are able to capture documentation, metadata, and metrics about features, which allows them to communicate clearly about features, test feature data, and determine if a feature is useful for a particular model.
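A heavily hedged sketch of the serving-side idea, based on the early Feast Python SDK as I recall it; the endpoint URLs, feature names, and entity are all placeholders, and method signatures may differ across Feast versions:

```python
from feast import Client

# Connect to a Feast deployment (URLs are placeholders).
client = Client(core_url='feast-core:6565', serving_url='feast-serving:6566')

# Fetch the latest values of two features for one driver, using the same
# feature identifiers that a training-time historical query would use.
features = client.get_online_features(
    feature_refs=['driver:daily_trip_count', 'driver:avg_rating'],
    entity_rows=[{'driver_id': 42}],
)
print(features.to_dict())
```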
  • #54: Amazon SageMaker is investing in delivering the best experience for machine learning (ML) on Kubernetes by creating Kubernetes operators and Kubeflow Pipelines components for SageMaker services. Kubernetes users are able to use managed ML services for training, model tuning, and inference without leaving their Kubernetes environments and pipelines, and without learning SageMaker APIs. This makes DevOps easier, giving customers the benefits of a managed service for ML in Kubernetes without migrating their workloads from Kubernetes. Customers lower training costs by not paying for idle GPU instances: with SageMaker operators and pipeline components for training and model tuning, GPU resources are fully managed by SageMaker and utilized only for the duration of a job. Customers with 30% or higher idle GPU resources on their local self-managed training resources would see a reduction in total cost by using SageMaker operators and pipeline components. Customers can also create hybrid pipelines with SageMaker pipeline components for Kubeflow that seamlessly execute jobs on AWS, on-premises resources, and other cloud providers.
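A sketch of what launching a SageMaker training job through the operator might look like; the field names follow the operator's TrainingJob CRD as I recall it, and the role ARN, image, and S3 paths are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()

training_job = {
    "apiVersion": "sagemaker.aws.amazon.com/v1",
    "kind": "TrainingJob",
    "metadata": {"name": "xgboost-mnist", "namespace": "default"},
    "spec": {
        "region": "us-west-2",
        "roleArn": "arn:aws:iam::123456789012:role/sagemaker-role",  # placeholder
        "algorithmSpecification": {
            "trainingImage": "<algorithm-image>",  # placeholder
            "trainingInputMode": "File",
        },
        # SageMaker manages these GPUs only for the duration of the job.
        "resourceConfig": {
            "instanceCount": 1,
            "instanceType": "ml.p3.2xlarge",
            "volumeSizeInGB": 50,
        },
        "outputDataConfig": {"s3OutputPath": "s3://my-bucket/output"},  # placeholder
        "stoppingCondition": {"maxRuntimeInSeconds": 3600},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sagemaker.aws.amazon.com", version="v1", namespace="default",
    plural="trainingjobs", body=training_job,
)
```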