© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine Learning
using Kubeflow
Arun Gupta, @arungupta
Principal Open Source Technologist
https://dilbert.com/strip/2013-02-02
Machine Learning 101
THE AWS ML STACK: Broadest and deepest set of capabilities

ML Frameworks + Infrastructure (frameworks, interfaces, infrastructure):
Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Inferentia, FPGA

Containerized ML:
Amazon EKS, Auto Scaling, Optimized GPU AMI, Deep Learning Container, FSx CSI Plugin
ML Services – Amazon SageMaker:
Ground Truth data labelling, ML Marketplace, SageMaker Neo, built-in algorithms, SageMaker Notebooks, SageMaker Experiments, model tuning, SageMaker Autopilot, model hosting, SageMaker Model Monitor, SageMaker Studio IDE
AI Services (vision, speech, text, search, chatbots, personalization, forecasting, fraud, development, contact centers):
Amazon Rekognition + Custom Labels, Amazon Polly, Amazon Transcribe + Medical, Amazon Comprehend + Medical, Amazon Translate, Amazon Lex, Amazon Personalize, Amazon Forecast, Amazon Fraud Detector, Amazon CodeGuru, Amazon Textract, Amazon Kendra, Amazon Connect with Contact Lens
Storage and Analytics for Machine Learning

Storage: Amazon S3 Standard, S3 Standard-IA, S3 One Zone-IA, S3 Intelligent-Tiering (new), Amazon Glacier, S3 Glacier Deep Archive (new), Amazon EBS

Analytics: Amazon Redshift + Redshift Spectrum, Amazon QuickSight, Amazon EMR (Hadoop, Spark, Presto, Pig, Hive…19 total), Amazon Athena, Amazon Kinesis, Amazon Elasticsearch Service, AWS Glue
Why Machine Learning on Kubernetes?
Composability | Portability | Scalability
On-premises and cloud
http://www.shutterstock.com/gallery-635827p1.html
Machine Learning on K8s: Without Kubeflow
@aronchik
Machine Learning on K8s: With Kubeflow
@aronchik
What is Kubeflow?
Containerized machine learning platform
Makes it easy to develop, deploy, and manage portable,
scalable end-to-end ML workflows on k8s
“Toolkit” – loosely coupled tools and blueprints for ML
End-to-end ML workflow – ML code is only a small component
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
What’s in Kubeflow?
Amazon EKS: run Kubernetes in the cloud
Managed Kubernetes control plane, attach data plane
Native upstream Kubernetes experience
Platform for enterprises to run production-grade workloads
Integrates with additional AWS services
Getting started with Amazon EKS
eksctl CLI—create Amazon EKS clusters (eksctl.io)
Creates all resources needed for the cluster
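As a minimal sketch, a cluster with a GPU node group can be declared in an eksctl ClusterConfig file (cluster name, region, and sizes here are illustrative, not from the talk):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: kubeflow-demo        # illustrative name
  region: us-west-2          # illustrative region
nodeGroups:
  - name: gpu-nodes
    instanceType: p3.2xlarge # GPU instance type
    desiredCapacity: 2

Running eksctl create cluster -f cluster.yaml then creates the VPC, control plane, and worker nodes in one step.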
Amazon EKS-Optimized GPU AMI
Built on top of the standard Amazon EKS-Optimized AMI
Includes packages to support Amazon EC2 P2/P3/G3/G4 instances:
• NVIDIA drivers
• nvidia-docker2 package
• nvidia-container-runtime (as default runtime)
GPU Clock Optimization
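With the NVIDIA device plugin running on these nodes, containers request GPUs through the standard nvidia.com/gpu resource. A minimal sketch (pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:10.1-base  # illustrative CUDA base image
      command: ["nvidia-smi"]       # prints visible GPUs, a quick sanity check
      resources:
        limits:
          nvidia.com/gpu: 1         # schedules the pod onto a GPU node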
Cluster Autoscaler Improvements
Add GPU support
autoscaler#1584 GPU autoscaling supported for AWS
autoscaler#1589 GPU scale down performance optimization
Prevent CA from removing a node with an ML training job running
Annotate the training pod with "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" (sketched below)
Recommended: create a GPU node group per AZ
Improve network communication performance
Prevent ASG rebalancing
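A minimal sketch of that annotation on a training pod (name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: mnist-training                # illustrative name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # CA will not scale down this pod's node
spec:
  containers:
    - name: train
      image: my-registry/mnist-train:latest   # illustrative training image
      resources:
        limits:
          nvidia.com/gpu: 1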
AWS Deep Learning Containers
KEY FEATURES
• Pre-packaged Docker container images, fully configured and validated
• Support for TensorFlow, Apache MXNet
• Single- and multi-node training and inference
• Best performance and scalability without tuning
• Customizable container images
• Works with Amazon EKS, Amazon ECS, and Amazon EC2
Kubeflow on Desktop
MiniKF: local Kubeflow deployment using VirtualBox and Vagrant
• Minikube -> Kubernetes
• MiniKF -> Kubeflow (includes minikube)
Runs on macOS, Linux, and Windows
Does not require k8s-specific knowledge
Kubeflow on Cloud
Major cloud providers supported
Choices on Amazon Web Services
• Self-managed k8s on EC2: Kops, CloudFormation, Terraform
• Amazon EKS
Getting Started with Kubeflow on Amazon EKS
Jupyter Notebook
Create and share documents that contain live code,
equations, visualizations, and narrative text
• UI to manage notebooks
• Integrate with RBAC/IAM
• Ingress / Service Mesh
Jupyter Notebook
Fairing
Python SDK to build, train and deploy ML models
• Easily package ML training jobs
• Train ML models from notebook to k8s
• Streamline the model development process
Set up Kubeflow Fairing for training and prediction
https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/02_Fairing/02_06_fairing_e2e.ipynb
Train an XGBoost model remotely on Kubeflow
Deploy the trained model to Kubeflow for prediction
Katib – Hyperparameter Tuning
Hyperparameters are parameters external to the model that control training, e.g. learning rate, batch size, epochs
Tuning finds a set of hyperparameters that optimizes an objective function, e.g. find the optimal batch size and learning rate to maximize prediction accuracy
Hyperparameter Tuning is Hard
More hyperparameters -> exponential space growth
Tuning by hand is inefficient and error-prone
Need to track metrics across multiple jobs
Managing resources and infrastructure for many jobs is hard
Variety of frameworks and algorithms to support
Katib – Hyperparameter Tuning
trialName Validation-accuracy accuracy --lr --num-layers --optimizer
random-experiment-rfwwbnsd 0.974920 0.984844 0.013831565266960293 4 sgd
random-experiment-vxgwlgqq 0.113854 0.116646 0.024225789898529138 4 ftrl
random-experiment-wclrwlcq 0.979697 0.998437 0.021916171239020756 4 sgd
random-experiment-7lsc4pwb 0.113854 0.115312 0.024163810384272653 5 ftrl
random-experiment-86vv9vgv 0.963475 0.971562 0.02943228249244735 3 adam
random-experiment-jh884cxz 0.981091 0.999219 0.022372025623908262 2 sgd
random-experiment-sgtwhrgz 0.980693 0.997969 0.016641686851083654 4 sgd
random-experiment-c6vvz6dv 0.980792 0.998906 0.0264125850165842 3 sgd
random-experiment-vqs2xmfj 0.113854 0.105313 0.026629394628228185 4 ftrl
random-experiment-bv8lsh2m 0.980195 0.999375 0.021769570793012488 2 sgd
random-experiment-7vbnqc7z 0.113854 0.102188 0.025079750575740783 4 ftrl
random-experiment-kwj9drmg 0.979498 0.995469 0.014985919312945063 4 sgd
Hyperparameters
Trial template
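The slide's screenshot shows a trial template. As a rough sketch of how the pieces fit together (API version and field names from the Katib v1alpha3 era of this talk; the training image and values are illustrative), an Experiment combines the objective, the search algorithm, the parameter ranges from the table above, and a trial template that stamps out one Job per trial:

apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  name: random-experiment            # illustrative name
spec:
  objective:
    type: maximize
    objectiveMetricName: Validation-accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --optimizer
      parameterType: categorical
      feasibleSpace:
        list: ["sgd", "adam", "ftrl"]
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: {{.Trial}}
                  image: my-registry/mnist-train:latest   # illustrative image
                  command:
                    - "python"
                    - "train.py"                          # illustrative entrypoint
                    {{- with .HyperParameters}}
                    {{- range .}}
                    - "{{.Name}}={{.Value}}"
                    {{- end}}
                    {{- end}}

Katib substitutes each trial's suggested values into the template, runs the resulting Jobs, and collects the reported metrics.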
KFServing: Model serving and management
Provides a Kubernetes CRD for serving ML models on arbitrary frameworks.
Encapsulates the complexity of autoscaling, networking and server configuration to bring features
like scale to zero, transformations, and canary rollouts to your deployments
Enables a simple, pluggable, and complete story for your production ML inference server by providing
prediction, pre-processing, post-processing and explainability.
KFServing Custom Resource
(Image callouts: S3 secret attached to a Service Account; trained model loaded from S3)
https://github.com/kubeflow/kfserving/blob/master/docs/samples/s3/tensorflow_s3.yaml
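The linked sample wires S3 credentials to the inference service through a Secret and a Service Account, roughly like this (annotation keys as used by the KFServing samples of this era; names and credentials are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: mysecret                                  # placeholder name
  annotations:
    serving.kubeflow.org/s3-endpoint: s3.amazonaws.com
    serving.kubeflow.org/s3-usehttps: "1"
type: Opaque
data:
  awsAccessKeyID: <base64-encoded-access-key-id>
  awsSecretAccessKey: <base64-encoded-secret-key>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa                                        # referenced by the inference service
secrets:
  - name: mysecret

The inference service then names this service account (as in the serviceAccount field of the sklearn example below) and points storageUri at an s3:// path instead of gs://.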
Pluggable Interface
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
spec:
default:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris"
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
name: "flowers-sample"
spec:
default:
tensorflow:
storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
name: "pytorch-cifar10"
spec:
default:
pytorch:
storageUri: "gs://kfserving-samples/models/pytorch/cifar10"
modelClassName: "Net"
KFServing Interface – Scikit Learn
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
name: "sklearn-iris"
spec:
default:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris"
serviceAccount: inferencing-robot
minReplicas: 3
maxReplicas: 10
resources:
requests:
cpu: 2
gpu: 1
memory: 10Gi
canaryTrafficPercent: 25
canary:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris-v2"
serviceAccount: inferencing-robot
minReplicas: 3
maxReplicas: 10
resources:
requests:
cpu: 2
gpu: 1
memory: 10Gi
Distributed Training
Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS
https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Pipelines – Machine Learning Job Orchestrator
Compose, deploy, and manage end-to-end ML workflows
• End-to-end orchestration
• Easy, rapid, and reliable experimentation
• Easy re-use
Built using the Pipelines SDK: kfp.compiler, kfp.components, kfp.Client
Uses Argo under the hood to orchestrate resources
Creating Kubeflow Pipeline Components
@dsl.pipeline(
    name='Sample Trainer',
    description=''
)
def sample_train_pipeline(... ):
    create_cluster_op = CreateClusterOp('create-cluster', ...)
    analyze_op = AnalyzeOp('analyze', ...)
    transform_op = TransformOp('transform', ...)
    train_op = TrainerOp('train', ...)
    predict_op = PredictOp('predict', ...)
    confusion_matrix_op = ConfusionMatrixOp('confusion-matrix', ...)
    roc_op = RocOp('roc', ...)

kfp.compiler.Compiler().compile(sample_train_pipeline, 'my-pipeline.zip')

(Callouts: pipeline decorator, pipeline function, pipeline components, compile pipeline)
Creating Kubeflow Pipeline Components
Metadata – Model Tracking
• Metadata schema to track artifacts related to
execution contexts
• Metadata API for storing and retrieving
metadata
• Client libraries for end-users to interact with
the Metadata service from their Notebooks or
Pipelines code.
Making Kubeflow a first class citizen on AWS
• Centralized and unified Kubernetes cluster logs in Amazon CloudWatch
• External traffic and authentication management with ALB Ingress Controller
• TLS and authentication with AWS Certificate Manager and AWS Cognito
• In-built FSx CSI driver w/S3 data repository integration to optimize training performance (see the StorageClass sketch below)
• Elastic File System integration for common data sharing in JupyterHub
• Easier and customizable Kubeflow installation with kfctl and Kustomize support
• Kubeflow Pipeline integration with AWS Services – Amazon EMR, Athena, SageMaker
• Add ECR integration to Kubeflow Fairing
• Jupyter Notebook images with AWS CLI installed and ECR support
• Auto detect GPU worker nodes and install NVIDIA device plugin
https://www.kubeflow.org/docs/aws/
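As a sketch of the FSx integration above (StorageClass parameters as exposed by the aws-fsx-csi-driver; subnet, security group, and bucket are placeholders), a StorageClass can provision a Lustre filesystem backed by an S3 data repository:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0        # placeholder subnet
  securityGroupIds: sg-0123456789abcdef0    # placeholder security group
  s3ImportPath: s3://my-training-data       # placeholder S3 data repository

A PersistentVolumeClaim against this class gives training pods shared, low-latency access to the S3-backed dataset.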
AWS Kubeflow Roadmap
Kubeflow v1.0 - Theme: Enterprise Readiness
E2E examples and increased docs on Kubeflow site
Upstream testing for Kubeflow on AWS
Support DIY K8S on AWS
IAM Roles for Service Accounts integration with Jupyter notebooks
Support for managed contributors
Feature store - Feast
• Discoverability and reuse of features
• Standardization of features
• Access to features for training and serving
• Consistency between training and serving
Fully managed infrastructure in Amazon SageMaker
Introducing Amazon SageMaker Operators for Kubernetes
Kubernetes customers can now train, tune, and deploy models in Amazon SageMaker
Under the hood – Amazon SageMaker and Kubernetes
kubectl apply YAML
Key Features
• Amazon SageMaker Operators for training, tuning, inference
• Natively interact with Amazon SageMaker jobs using K8s tools (e.g., get pods, describe)
• Stream and view logs from Amazon SageMaker in K8s
• Helm Charts to assist with setup and spec creation
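A rough sketch of such a resource (API group and field names follow the operator's samples; the role ARN, image, and bucket are placeholders to replace with your own; see the repo linked below):

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: xgboost-mnist                        # illustrative name
spec:
  region: us-west-2
  roleArn: arn:aws:iam::123456789012:role/sagemaker-role                    # placeholder role
  algorithmSpecification:
    trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/xgboost:1   # placeholder image
    trainingInputMode: File
  hyperParameters:
    - name: num_round
      value: "10"
  outputDataConfig:
    s3OutputPath: s3://my-bucket/xgboost-output   # placeholder bucket
  resourceConfig:
    instanceType: ml.m4.xlarge
    instanceCount: 1
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 86400

kubectl apply -f on this YAML submits a managed training job, and kubectl get trainingjob reports its SageMaker status.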
https://github.com/aws/amazon-sagemaker-operator-for-k8s
References
Workshop: eksworkshop.com/advanced/420_kubeflow/
Jupyter notebooks: github.com/aws-samples/eks-kubeflow-workshop/
Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Please fill out the session rating in the feedback form and collect your goodie at the end of the day
THANK YOU

Editor's Notes

  • #2: Kubernetes provides isolation, auto-scaling, load balancing, flexibility and GPU support. These features are critical for running computationally and data-intensive, hard-to-parallelize machine learning models. The declarative syntax of Kubernetes deployment descriptors makes it easy for non-operationally focused engineers to train machine learning models on Kubernetes. This talk will explain why and how Amazon EKS, our managed service, is the right Kubernetes platform for building your Machine Learning solutions.
  • #3: When we learn how to ride a bike, we’re using the data provided by a friend, or a sibling, or a parent to train our mind to ride the bike. Machine Learning is applying those concepts, but machine-to-machine.
  • #4: Machine Learning can be realized in a variety of ways. Let's see how it looks in a DIY fashion. There is a training phase. You start with training data that will be used to create a model. The blue cloud in the middle is your code that reads the training data and creates a model. Once the model is generated, test data is fed to the model to find out the accuracy of the model. The algorithm chosen in your application also defines how long it takes to generate the model. You keep repeating this cycle until a model with reasonable accuracy is obtained. After training is done, there is an inference phase. In this phase, input data, typically real-world data, is fed to the generated model and predictions are made. The ultimate question in ML is: how good are the predictions? So if your ML model is to identify a hand-written number and it is presented a hand-written number, can it be accurately identified? If not, then you go back to training and then infer again.
  • #5: Within AWS we see the stack as having three layers:   The bottom layer of the stack is for expert machine learning practitioners who work at the framework level and are comfortable building, training, tuning, and deploying machine learning models. This is the foundation for all of the innovation we drive at every other layer of the stack. There are GPU instances like P3 and P3dn where the vast majority of deep learning and machine learning is done in the cloud. All the common frameworks such as TensorFlow, PyTorch, Caffe2, and Apache MXNet are supported. We will always make sure that all the frameworks you care about are supported equally well, so you have the right tool for the right job.
  • #6: While we’re seeing a lot of activity at that bottom layer (infrastructure and frameworks), the reality is that there just aren't that many expert machine learning practitioners in the world. That’s why we built and launched Amazon SageMaker, a managed ML service in the middle tier, which makes it much easier for every day developers and data scientists to get up and running with machine learning…
  • #7: Moving on, the top level of the stack is what people often call artificial intelligence (AI), because it closely mimics human cognition. And our services here are for customers that don’t want to deal with models and training. Customers can easily build these capabilities into new and existing applications to reduce costs, increase speed, and improve customer satisfaction and insight. We offer multiple pre-trained AI services covering vision, speech, language, chatbots, forecasting and recommendations. The key here is that developers with no prior machine learning experience can easily build sophisticated AI driven applications, like an AI driven contact center or live media subtitling.
  • #9: To summarize, the AWS AI and ML stack has three layers. Each layer addressing different audiences: ML Frameworks & Infrastructure: For expert machine learning practitioners who work at the framework level. ML Services: For every day developers and data scientists we built and launched Amazon SageMaker. AI Services: Developers with no prior machine learning experience can easily build sophisticated AI driven applications
  • #10: Deep storage and analytics capabilities are needed for a comprehensive ML solution. Storage systems should be able to support high throughput and low latency, with best security around that data. You need a deep collection of real-time analytics. AWS offers all of that. We’ll cover one part of this later in this talk as well.
  • #11: Why is Kubernetes well suited for Machine Learning? There are three reasons: Composability, Portability, Scalability. ML is about data ingestion, data analysis, data transformation, data validation, building a model, model validation, training at scale, inference and much more. Each of these phases ends up being a microservice, and it turns out Kubernetes provides a great platform for composing these microservices together. It allows multiple Data Scientists to choose a solution that works for them. It also enables separation of duties between Ops and Data Scientists. Using containers and Kubernetes as a base layer allows you to use open source frameworks (e.g. Kubeflow) and use them to train and develop models on k8s. This allows you to easily migrate your solution from laptop, to on-premises, to the cloud. We'll talk about Kubeflow later in this presentation. Kubernetes allows you to scale the applications. It not only provides support for more nodes, but more GPUs, and we'll talk about performance optimizations on Amazon EKS for near-linear scalability later in this talk. More disk/network, low-latency filesystems, which again we'll talk about later. Also, you need to run the experiments multiple times, tweaking parameters a little bit every time, so you need to be able to scale your infrastructure to support these needs. The AWS cloud meets those needs well.
  • #12: Let's see how we can leverage these containers on K8s. As a Data Scientist, you just want to do Data Science and run your ML models on EKS. But you need to become an expert in containers, packaging, persistent volumes, scaling, GPUs, drivers, DevOps and much more. Every Data Scientist has a slightly different view on what the right tools are for modeling, UX, frameworks, storage and the multiple other items that are needed for ML.
  • #13: Kubeflow makes it easy for everyone to develop, deploy and manage portable, distributed Machine Learning on k8s. Anywhere you are running K8s, you should be able to run Kubeflow. Once an Amazon EKS cluster is up and running, Kubeflow can be deployed on top of it.
  • #14: Kubeflow was introduced about 2 years ago at KubeCon. It provides a containerized machine learning platform. The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Kubeflow has evolved to become a toolkit of loosely coupled tools for machine learning. Its goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Kubeflow provides a unified layer on top of k8s to run ML workloads, unifying the competing interests of data scientists and DevOps: data scientists want to write algorithms, experiment fast, and get access to data; DevOps care about security, reliability, and cost.
  • #15: Containerized Machine Learning platform: JupyterHub for collaborative & interactive training; a TensorFlow training controller; a TensorFlow Serving deployment; SeldonCore for complex inference and non-TF models; Pipelines, powered by Argo, for workflows; Experiments to run different configurations of pipelines; Metadata, the information about executions (runs), models, datasets, and other artifacts; and wiring to make it work on any k8s anywhere.
  • #18: We released an EKS-optimized GPU AMI. This is the basic building block that allows you to create GPU-powered Amazon EKS cluster. It is built on top of standard Amazon EKS-optimized AMI. Includes the usual NVIDIA drivers, package, and runtime to provide support for GPU instances in AWS cloud. NVIDIA driver uses an autoboost feature, which varies the GPU clock speeds. By disabling the autoboost feature and setting the GPU clock speeds to their maximum frequency, you can consistently achieve the maximum performance with your GPU instances. 
  • #19: Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster for burstable workloads. This means that there will be times when pods fail to run due to insufficient resources. At other times, nodes are underutilized for an extended period of time, so the pods there can be placed on other nodes and that node can be reclaimed. Most of the work for GPUs is now being put into Cluster Autoscaler; let's talk about some of that. A couple of pull requests that we made are highlighted here. The first PR enabled GPU support for multiple cloud providers, which allowed the problem to be solved in an upstream-compliant way; it basically added a label and type to the GPU nodes. GPU nodes are not sensitive to CPU and memory but use a different set of metrics for CA to scale down the nodes; the second PR adds support for that. safe-to-evict is a standard k8s annotation. It can be specified on a pod, and then CA will not remove that node during scale down. This is particularly relevant for ML workloads, as you would not want a pod terminated that has completed hours of training but still has work to do. The recommendation is to create GPU node groups per AZ. This has a couple of benefits: first, it helps with network communication and low data transfer costs in a distributed training job; secondly, it avoids ASG rebalancing across multiple AZs. --- Escalator is designed for large batch or job-based workloads that cannot be force-drained and moved when the cluster needs to scale down - Escalator will ensure pods have completed on nodes before terminating them. It is also optimized for scaling up the cluster as fast as possible to ensure pods are not left in a pending state.
  • #20: 1/ AWS Deep Learning Containers provides pre-packaged Docker container images that are fully configured and validated, so customers no longer have to spend time building and testing the images. 2/ Since we’ve already optimized these docker images for AWS, customers can get the best performance and scalability right away – no tuning required. 3/ AWS Deep Learning Containers are built to work with Amazon EKS, Amazon ECS, and Amazon EC2 to give developers the flexibility and choice. 4/ Customers can also customize these container images to include their own tools and packages for a high degree of control over features of their environment such as monitoring, compliance, and scaling.
  • #24: Jupyter notebook provides an easy on-ramp to build, deploy and train ML models. You can create notebooks that contain live code and provide interactive output in a wide variety of formats such as HTML, images, video, and custom MIME types. Jupyter supports over 40 programming languages, including Python, R, Julia, and Scala. The Jupyter notebook in Kubeflow doesn't have that many language kernels yet; however, users can definitely customize on their own. Kubeflow is 0.7 today. One of the new features introduced in 0.6 was multi-user isolation of user-created resources. This feature allows multiple users to operate on a shared Kubeflow deployment without stepping on each others' jobs and resources. The isolation mechanisms also prevent accidental deletion/modification of resources of other users in the deployment. An administrator needs to deploy Kubeflow and configure the authentication service for the deployment. A user can log into the system and will by default be accessing their primary profile. A profile is a collection of Kubernetes resources along with a Kubernetes namespace of the same name.
  • #26: By using Kubeflow Fairing and adding a few lines of code, you can run your ML training job locally or in the cloud, directly from Python code or a Jupyter notebook. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint. Kubeflow Fairing packages your Jupyter notebook, Python function, or Python file as a Docker image, then deploys and runs the training job on Kubeflow. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint. - Easily package ML training jobs: Enable ML practitioners to easily package their ML model training code, and their code’s dependencies, as a Docker image. - Easily train ML models in a hybrid cloud environment: Provide a high-level API for training ML models to make it easy to run training jobs in the cloud, without needing to understand the underlying infrastructure. - Streamline the process of deploying a trained model: Make it easy for ML practitioners to deploy trained ML models to a hybrid cloud environment.
  • #31: Hyperparameters are the parameters that are specified for an algorithm before the learning/training begins. So, let's say a data scientist has chosen an algorithm; that will then define what kind of hyperparameters can be specified. For example: learning rate, batch size, number of epochs, maximum depth allowed for the decision tree, number of trees in a random forest, number of neurons in a neural network layer, how many layers in my neural network? Once the hyperparameters are chosen, multiple training runs are conducted with different values of hyperparameters. A model is generated after each run and evaluated for optimality, such as time taken to complete the training, error rate, and accuracy. There are methods like grid search, random search, and Bayesian optimization to define the spectrum of values of hyperparameters. Katib means secretary or scribe in Arabic. As Vizier stands for a high official or a prime minister in Arabic, the project Katib is named in honor of Vizier. Extensible: framework agnostic (TensorFlow, PyTorch, MXNet, …), customizable algorithm backend. Experiment: "optimization loop" for some specific problem. Suggestion: a proposed solution to the problem. Trial: one iteration of the loop. Job: evaluate a trial and calculate objective value.
  • #32: HPO is hard because more hyperparameters means exponential space growth. Also, tuning by hand is not efficient. Maybe you want to anchor on one particular value and then vary others; for example, fix the batch size and then vary the learning rate to find optimal values. You want to be able to track metrics across different jobs. You can choose different frameworks like TensorFlow, PyTorch or MXNet. The algorithm could be random search, grid search, or Bayesian optimization.
  • #37: Each line runs from right to left, tracing the three parameter values used for one run of the experiment. The left side shows validation-accuracy and accuracy for the training output; the two use different datasets. The leftmost columns match the required objective specified in the Experiment. From here, we can choose the best/optimized values of validation accuracy and accuracy, identify the corresponding parameters, and use them for training.
  • #38: One of the most important aspects of building Cloud Native is knowing your responsibilities and, wherever possible, leveraging the awesome landscape of Cloud Native technologies. KFServing has chosen Istio and Knative to solve the core serverless, networking, and revision-management problems, and KFServing decorates those layers with ML-specific opinions. It builds on top of Kubernetes, so you have full control up and down the stack for whatever you need. Because the whole stack is open, there is a privilege and a responsibility to upstream functionality: if ML customers have networking or serverless requirements, the KFServing community will deliver them, but not in KFServing's own code. We move down the stack as far as we can and contribute the code where it serves its widest purpose. This makes the entire ecosystem stronger and avoids reinventing the wheel over and over. KFServing's mission is to build an ML serving platform that is simple yet powerful. It focuses on data scientists, which means it builds concepts that make sense in their domain. It aims to solve production model-serving use cases by providing performant, high-abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX. Low bar, high ceiling. More recent versions are more extensible with the concept of transformers: essentially a wrapper function you can call before inferencing to get more information, for example taking a user id and fetching location and past purchase history.
  • #40: We started with our interface. This was one of the most important things to get right: we needed to find the lowest floor possible for data scientists. Unlike other serving platforms, one of our key decisions was to choose a non-container interface. Most ML frameworks support model serialization to a file, so we picked the most common frameworks and built or found a set of out-of-the-box servers to support them. These servers simply load the serialized model and start an HTTP endpoint. That meant no custom servers, no containers, no readiness checks, no bespoke code; all the complexity around server management was simply handled, and it just worked. In about 8 lines, you could describe all of the infrastructure you needed to get your model up and running.
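A sketch of that roughly-8-line spec, expressed as a dict and applied with the Kubernetes Python client; the v1alpha2 API group and the sample storageUri follow the KFServing examples of this era and should be treated as assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "flowers-sample", "namespace": "default"},
    "spec": {
        "default": {
            "predictor": {
                # Point at a serialized model; KFServing supplies the server.
                "tensorflow": {
                    "storageUri": "gs://kfserving-samples/models/tensorflow/flowers"
                }
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org", version="v1alpha2",
    namespace="default", plural="inferenceservices",
    body=inference_service,
)
```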
  • #41: Keep in mind that principle of a low floor with a high ceiling. We'd found a pretty low floor, but where can our users go from there? They might want to manage replica limits, service accounts, or even specify resource requests like GPUs. Everything extends cleanly and consistently using Cloud Native terminology that users might be familiar with or might encounter in the future. In the first half of this spec, the user's workload scales between 3 and 10 replicas, with 2 cores and a GPU per replica. They've also specified a service account that grants the needed permissions. All of these features should feel very familiar if you've ever deployed with Kubernetes, and in fact that's what we're passing through to Kubernetes under the hood. One of our principles was a single-resource semantic: we wanted to see if we could take all of the infrastructure related to a single model and contain it within a single resource. A good example is the second chunk of this spec, the canary specification. It allows users to specify a second serving configuration and experiment on it with a percentage of their traffic. In other systems, this would mean a ton of extra resources to wire up routing configurations and two full stacks, but in KFServing you can simply copy your default spec, tweak it, and rename it canary. In our example, the only difference between default and canary is the pointer to the storageUri. Without a single-resource semantic, a data scientist would need to verify that all of their configurations were applied together and at the same time; if one of the resources failed to apply, they might get stuck in limbo. Because we've simplified this structure into a single resource, those risks don't even need to enter their mind.
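A sketch of that extended spec as the same kind of dict; the replica, resource, and canary field names follow the v1alpha2 layout as I understand it, and the service account name and model URIs are placeholders:

```python
spec = {
    "default": {
        "predictor": {
            "minReplicas": 3,
            "maxReplicas": 10,
            "serviceAccountName": "models-sa",           # placeholder
            "tensorflow": {
                "storageUri": "s3://my-bucket/model/v1",  # placeholder
                "resources": {
                    "limits": {"cpu": "2", "nvidia.com/gpu": "1"},
                },
            },
        }
    },
    # Canary: copy the default, tweak it, and route a slice of traffic to it.
    "canaryTrafficPercent": 10,
    "canary": {
        "predictor": {
            "tensorflow": {
                "storageUri": "s3://my-bucket/model/v2",  # the only difference
            }
        }
    },
}
```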
  • #43: Training large models on vast amounts of data can drastically improve model performance. But consider a deep network with millions of parameters: how do we train it without waiting for days, or even multiple weeks? That's where distributed training comes in. It allows us to train and serve a model across multiple physical machines, which can be achieved using model parallelism and data parallelism. When a big model cannot fit into a single node's memory, model-parallel training can be employed to handle it. Data parallelism distributes the data between different tasks. Data parallelism is the most common training configuration: it involves multiple tasks in a worker job training the same model on different mini-batches of data, updating shared parameters hosted in one or more tasks in a ps (parameter server) job. All tasks typically run on different machines or containers. Distributed training in Kubeflow is provided using TFJob. If you have a fast link with NCCL, distributed training performance is better with a synchronous parameter server instead of the asynchronous method.
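A sketch of a TFJob with worker and parameter-server replicas, again as a dict applied through the Kubernetes Python client; the kubeflow.org/v1 API version and the training image are assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

def replica(count, image):
    """Build one tfReplicaSpecs entry (Worker or PS)."""
    return {
        "replicas": count,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [{
                    "name": "tensorflow",  # TFJob expects this container name
                    "image": image,        # placeholder training image
                }]
            }
        },
    }

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "dist-mnist", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "PS": replica(1, "<your-registry>/dist-mnist:latest"),
            "Worker": replica(3, "<your-registry>/dist-mnist:latest"),
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob,
)
```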
  • #44: We wrote a blog post on best practices for optimizing machine learning training performance on Amazon EKS, to improve throughput and minimize training times. We used Kubeflow and the FSx for Lustre CSI driver, training ResNet-50 (a standard benchmark network) on the ImageNet dataset. We trained using mixed precision on 20 p3.16xlarge instances (160 V100 GPUs) with a batch size of 256 per GPU (an aggregate batch size of ~41k). To achieve better scaling efficiency, we used Horovod with TensorFlow. We observed near-linear scaling, between 90-100% scaling efficiency up to 160 GPUs, and 98k images per second.
  • #45: Kubeflow Pipelines enables and simplifies the orchestration of end-to-end ML pipelines. An ML workflow includes all of the components that make up the steps in the workflow and how the components interact with each other. Pipelines make it easy to try numerous ideas and techniques and to manage your various trials/experiments, and they enable you to reuse components and pipelines to quickly assemble end-to-end solutions without having to rebuild each time. In the SDK, kfp.compiler includes classes and methods for building Docker container images for your pipeline components; kfp.components includes classes and methods for interacting with pipeline components; and kfp.Client contains the Python client library that lets you run a pipeline and create an experiment.
  • #46: You can create a pipeline directly from a YAML file, or create it using the SDK. There are several different ways to run a pipeline: directly from the UI, by invoking it from the SDK, or by setting up a schedule. Pipeline run metadata is stored in the Kubeflow DB.
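A sketch of the SDK path, assuming a compiled pipeline package (my-pipeline.tar.gz, produced as in the next note) and a reachable Pipelines endpoint; the host URL is a placeholder, and in-cluster notebooks can usually omit it:

```python
import kfp

# Connect to the Kubeflow Pipelines API.
client = kfp.Client(host='http://localhost:8080/pipeline')  # placeholder host

experiment = client.create_experiment('demo')
run = client.run_pipeline(
    experiment.id,         # group runs under an experiment
    'my-pipeline-run-1',   # run name
    'my-pipeline.tar.gz',  # compiled pipeline package
)
```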
  • #48: Write your application code, my-app-code.py; for example, write code to transform data or train a model. Create a Docker container image that packages your program (my-app-code.py) and upload the container image to a registry. To build a container image from a given Dockerfile, you can use the Docker command-line interface or the kfp.compiler.build_docker_image method from the Kubeflow Pipelines SDK. Write a component function using the Kubeflow Pipelines DSL to define your pipeline's interactions with the component's Docker container; your component function must return a kfp.dsl.ContainerOp. Write a pipeline function using the DSL to define the pipeline and include all the pipeline components, using the kfp.dsl.pipeline decorator to build a pipeline from your pipeline function. Compile the pipeline to generate a compressed YAML definition of the pipeline; the Kubeflow Pipelines service converts this static configuration into a set of Kubernetes resources for execution. Finally, use the Kubeflow Pipelines SDK to run the pipeline.
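A minimal end-to-end sketch of those steps, assuming the kfp SDK of this era; the component just echoes text, so the image and command are placeholders rather than a real training step, and the compiled package can then be submitted as shown after #46:

```python
import kfp
from kfp import compiler, dsl

# Component function: defines the pipeline's interaction with one container.
def echo_op(text):
    return dsl.ContainerOp(
        name='echo',
        image='alpine:3.10',          # placeholder component image
        command=['sh', '-c'],
        arguments=['echo %s' % text],
    )

# Pipeline function: wires the components together.
@dsl.pipeline(name='echo-pipeline', description='A one-step demo pipeline.')
def echo_pipeline(text='hello kubeflow'):
    echo_op(text)

# Compile to a compressed definition the Pipelines service can execute.
compiler.Compiler().compile(echo_pipeline, 'my-pipeline.tar.gz')
```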
  • #49: Your ML workflows generate a lot of metadata: information such as executions (runs), models, datasets, and other artifacts. Artifacts are the files and objects that form the inputs and outputs of the components in your ML workflow. As you start trying out multiple combinations, you need a way to manage all of this metadata together. That's exactly the purpose of the Metadata project: it tracks and manages the metadata that the workflows produce. The Metadata component comes pre-installed in Kubeflow 0.7. It is currently an Alpha version, and the development team is interested in your feedback.
  • #50: Kubeflow on AWS: Manage EKS cluster provisioning with eksctl, with the flexibility to start different flavors of GPU nodes. Manage external traffic with the AWS ALB Ingress Controller: traffic goes through the ALB Ingress Controller to the Istio gateway and is then forwarded to Ambassador inside the cluster. Leverage the Amazon FSx CSI driver to manage a Lustre file system, which is optimized for compute-intensive workloads such as high-performance computing and machine learning; it can scale to hundreds of GB/s of throughput and millions of IOPS. Centralize and unify Kubernetes cluster logs in CloudWatch, which helps debugging and troubleshooting. Enable TLS and authentication with AWS Certificate Manager and Amazon Cognito. Enable private access for your Kubernetes cluster's API server endpoint. Automatically detect GPU instances and install the NVIDIA device plugin.
  • #52: As Kubeflow continues to evolve, you can be assured it will continue to work well on AWS.
  • #53: If you are modeling a taxi service, then Driver might be an entity and daily trip count might be a feature. Other interesting features might be the distance between the driver and a destination, or the time of day. A combination of multiple features is used as input for a machine learning model. As your ML workloads scale, features play an important role in both training and serving. Typical challenges: Features not being reused: features representing the same business concepts are redeveloped many times, when existing work from other teams could have been reused. Feature definitions vary: teams define features differently, and there is no easy access to the documentation of a feature. Hard to serve up-to-date features: combining streaming- and batch-derived features, and making them available for serving, requires expertise that not all teams have; ingesting and serving features derived from streaming data often requires specialized infrastructure, so teams are deterred from making use of real-time data. Inconsistency between training and serving: training requires access to historical data, whereas models that serve predictions need the latest values; inconsistencies arise when data is siloed into many independent systems requiring separate tooling. Feast is an open source feature store that can be integrated with Kubeflow to address these feature storage needs. Feast's solutions: Discoverability and reuse of features: a centralized feature store allows organizations to build up a foundation of features that can be reused across projects; teams can utilize features developed by other teams, and as more features are added to the store it becomes easier and cheaper to build models. Access to features for training: Feast allows users to easily access historical feature data to produce datasets of features for training models, so ML practitioners can focus more on modelling and less on feature engineering. Access to features in serving: feature data is also available to models in production through a feature serving API, designed to provide low-latency access to the latest feature values. Consistency between training and serving: Feast provides consistency by managing and unifying the ingestion of data from batch and streaming sources, using Apache Beam, into both the feature warehouse and the feature serving stores; users can query features in the warehouse and the serving API using the same set of feature identifiers. Standardization of features: teams are able to capture documentation, metadata, and metrics about features, which allows them to communicate clearly about features, test feature data, and determine if a feature is useful for a particular model.
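A heavily hedged sketch of the serving-side idea, based on the early Feast Python SDK as I recall it; the endpoint URLs, feature names, and entity are all placeholders, and method signatures may differ across Feast versions:

```python
from feast import Client

# Connect to a Feast deployment (URLs are placeholders).
client = Client(core_url='feast-core:6565', serving_url='feast-serving:6566')

# Fetch the latest values of two features for one driver, using the same
# feature identifiers that a training-time historical query would use.
features = client.get_online_features(
    feature_refs=['driver:daily_trip_count', 'driver:avg_rating'],
    entity_rows=[{'driver_id': 42}],
)
print(features.to_dict())
```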
  • #54: Amazon SageMaker is investing in delivering the best experience for machine learning (ML) on Kubernetes by creating Kubernetes operators and Kubeflow Pipelines components for SageMaker services. Kubernetes users are able to use managed ML services for training, model tuning, and inference without leaving their Kubernetes environments and pipelines, and without learning SageMaker APIs. This makes DevOps easier, giving customers the benefits of a managed service for ML in Kubernetes without migrating their workloads from Kubernetes. Customers lower training costs by not paying for idle GPU instances: with SageMaker operators and pipeline components for training and model tuning, GPU resources are fully managed by SageMaker and utilized only for the duration of a job. Customers with 30% or higher idle GPU resources on their local self-managed training resources would see a reduction in total cost by using SageMaker operators and pipeline components. Customers can also create hybrid pipelines with SageMaker pipeline components for Kubeflow that seamlessly execute jobs on AWS, on-premises resources, and other cloud providers.
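A sketch of what launching a SageMaker training job through the operator might look like; the field names follow the operator's TrainingJob CRD as I recall it, and the role ARN, image, and S3 paths are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()

training_job = {
    "apiVersion": "sagemaker.aws.amazon.com/v1",
    "kind": "TrainingJob",
    "metadata": {"name": "xgboost-mnist", "namespace": "default"},
    "spec": {
        "region": "us-west-2",
        "roleArn": "arn:aws:iam::123456789012:role/sagemaker-role",  # placeholder
        "algorithmSpecification": {
            "trainingImage": "<algorithm-image>",  # placeholder
            "trainingInputMode": "File",
        },
        # SageMaker manages these GPUs only for the duration of the job.
        "resourceConfig": {
            "instanceCount": 1,
            "instanceType": "ml.p3.2xlarge",
            "volumeSizeInGB": 50,
        },
        "outputDataConfig": {"s3OutputPath": "s3://my-bucket/output"},  # placeholder
        "stoppingCondition": {"maxRuntimeInSeconds": 3600},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sagemaker.aws.amazon.com", version="v1", namespace="default",
    plural="trainingjobs", body=training_job,
)
```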