Kubeflow: End to End ML Platform
Animesh Singh
[Diagram: Kubeflow components. Jupyter Notebooks; Workflow Building (Kale, Fairing); Pipelines (TFX, KF Pipelines); Tools (HP Tuning, Tensorboard); Serving (KFServing, Seldon Core, TFServing); Metadata; Training Operators (Pytorch, XGBoost, Tensorflow, MPI, MXNet); Prometheus]
© 2019 IBM Corporation
Animesh Singh
STSM and Chief Architect - Data and AI Open Source Platform
o  CTO, IBM RedHat Data and AI Open Source Alignment
o  IBM Kubeflow Engagement Lead, Kubeflow Committer
o  Chair, Linux Foundation AI - Trusted AI
o  Chair, CD Foundation MLOps SIG
o  Ambassador, CNCF
o  Member of IBM Academy of Technology (IBM AoT)
Kubeflow
github.com/kubeflow
Your Speaker Today: CODAIT
[Diagram: ML lifecycle stages. Prepared Data; Prepared and Analyzed Data; Untrained Model; Trained Model; Deployed Model]
Kubeflow: Current IBM Contributors
Christian Kadner, Weiqiang Zhuang, Tommy Li, Andrew Butler, Jin Chi He, Feng Li, Ke Zhu, Kevin Yu
IBM is the 2nd Largest Contributor
IBMers contributing across projects in Kubeflow
Kubeflow Services
[Diagram: Kubeflow services on top of the Kubernetes API Server and the Istio Mesh and Gateway. Layers: High Level Services; Low Level APIs / Services. Components shown: Katib, Pipelines, Notebooks, TFJob, PyTorchJob, Jupyter CR, Seldon CR, Kubebench, Pipelines CR, Argo, Study Job, MPIJob, Spark Job, KFServing, TFX. Legend: Developed By Kubeflow; Developed Outside Kubeflow.]
Adapted from Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape (Not all components are shown)
kubectl apply -f tfjob
Community is growing!
Multi-User Isolation
ML Lifecycle: Build: Development, Training and HPO
Develop (Kubeflow Jupyter Notebooks)
•  Data Scientist
    –  Self-service Jupyter Notebooks provide faster model experimentation
    –  Simplified configuration of CPU/GPU, RAM, Persistent Volumes
    –  Faster model creation with training operators, TFX, magics, workflow automation (Kale, Fairing)
    –  Simplify access to external data sources (using stored secrets)
    –  Easier protection, faster restoration & sharing of “complete” notebooks
•  IT Operator
    –  Profile Controller, Istio, Dex enable secure RBAC to notebooks, data & resources
    –  Smaller base container images for notebooks, fewer crashes, faster to recover
Distributed Training Operators
Distributed Tensorflow Operator
•  A distributed TensorFlow job is a collection of the following processes:
o  Chief: responsible for orchestrating training and performing tasks such as checkpointing the model
o  PS: the parameter servers, which provide a distributed data store for the model parameters
o  Worker: the workers do the actual work of training the model. In some cases, worker 0 might also act as the chief
o  Evaluator: the evaluators can be used to compute evaluation metrics as the model is trained
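As a rough illustration of how these roles map onto a TFJob resource, the sketch below builds a manifest as a plain Python dictionary. The job name, image, and replica counts are placeholders chosen for illustration; the `kubeflow.org/v1` API version and the `tfReplicaSpecs` layout are assumed from the TFJob CRD as commonly documented and should be checked against the installed operator version.

```python
import json


def replica_spec(replicas, image):
    """Return one tfReplicaSpecs entry with a single training container."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    # Assumption: the operator looks for a container named "tensorflow".
                    {"name": "tensorflow", "image": image}
                ]
            }
        },
    }


def build_tfjob(name, image, workers=2, ps=1, evaluators=1):
    """Assemble a TFJob manifest with Chief, PS, Worker and Evaluator replicas."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Chief": replica_spec(1, image),
                "PS": replica_spec(ps, image),
                "Worker": replica_spec(workers, image),
                "Evaluator": replica_spec(evaluators, image),
            }
        },
    }


if __name__ == "__main__":
    # "example.com/mnist-train:latest" is a hypothetical image name.
    manifest = build_tfjob("mnist-train", "example.com/mnist-train:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML.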
Distributed MPI Operator - AllReduce
•  AllReduce is an operation that reduces many arrays spread across multiple processes into a single array, which is then returned to all the processes
•  This ensures consistency between distributed processes while allowing all of them to take on different workloads
•  The operation used to reduce the multiple arrays into a single array can vary (for example sum or max), and that choice is what distinguishes the different AllReduce options
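The AllReduce idea can be sketched in a few lines of plain Python. This is a single-process simulation for illustration only, not the MPI Operator or any MPI library: each inner list stands in for the array held by one process, and the reduction operator is a parameter.

```python
from functools import reduce
import operator


def all_reduce(arrays, op=operator.add):
    """Simulate AllReduce: combine per-process arrays element-wise with `op`
    and give every process its own copy of the reduced array."""
    if not arrays:
        return []
    length = len(arrays[0])
    if any(len(a) != length for a in arrays):
        raise ValueError("all arrays must have the same length")
    # zip(*arrays) groups the i-th element of every process into one column.
    reduced = [reduce(op, column) for column in zip(*arrays)]
    return [list(reduced) for _ in arrays]


if __name__ == "__main__":
    grads = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # one array per simulated process
    print(all_reduce(grads))                   # sum variant
    print(all_reduce(grads, op=max))           # max variant
```

Swapping `op` is what changes the flavour of the reduction; real implementations also differ in how they move data between processes (for example ring-based exchange), which this sketch does not model.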
Hyper Parameter Optimization and
Neural Architecture Search - Katib
•  Katib: Kubernetes-native system for automated tuning of machine learning models' hyperparameters (Hyperparameter Tuning) and for Neural Architecture Search.
•  GitHub repository: https://github.com/kubeflow/katib

•  Hyperparameter Tuning
❑  Random Search
❑  Tree of Parzen Estimators (TPE)
❑  Grid Search
❑  Hyperband
❑  Bayesian Optimization
❑  CMA Evolution Strategy
•  Neural Architecture Search
❑  Efficient Neural Architecture Search (ENAS)
❑  Differentiable Architecture Search (DARTS)
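To make the difference between two of the listed search strategies concrete, here is a minimal sketch of grid search and random search over a toy objective. It does not use the Katib API; the objective function, parameter names, and ranges are invented for illustration, and in a real experiment each evaluation would be a training trial.

```python
import itertools
import random


def objective(lr, hidden):
    """Toy stand-in for a validation loss; lower is better."""
    return (lr - 0.1) ** 2 + (hidden - 64) ** 2 / 10_000


def grid_search(lrs, hiddens):
    """Evaluate every combination and return (best_loss, lr, hidden)."""
    return min((objective(lr, h), lr, h) for lr, h in itertools.product(lrs, hiddens))


def random_search(lr_range, hidden_choices, trials, seed=0):
    """Sample `trials` random configurations and return (best_loss, lr, hidden)."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        lr = rng.uniform(*lr_range)
        hidden = rng.choice(hidden_choices)
        results.append((objective(lr, hidden), lr, hidden))
    return min(results)


if __name__ == "__main__":
    print(grid_search([0.01, 0.1, 1.0], [32, 64, 128]))
    print(random_search((0.001, 1.0), [32, 64, 128], trials=20))
```

Grid search cost grows with the product of the candidate lists, while random search spends a fixed trial budget; the other listed algorithms (TPE, Hyperband, Bayesian Optimization) use earlier trial results to choose later ones.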
Katib
Think	2020	/	DOC	ID	/	Month	XX,	2020	/	©	2020	IBM	
Corporation
❑  Rollouts:
Is this rollout safe? How do I roll
back? Can I test a change
without swapping traffic?
❑  Protocol Standards:
How do I make a prediction?
GRPC? HTTP? Kafka?
❑  Cost:
Is the model over or under scaled?
Are resources being used efficiently?
❑  Monitoring:
Are the endpoints healthy? What is
the performance profile and request
trace?
❑  Frameworks:
How do I serve on Tensorflow?
XGBoost? Scikit Learn? Pytorch?
Custom Code?
❑  Features:
How do I explain the predictions?
What about detecting outliers and
skew? Bias detection? Adversarial
Detection?	
❑  How do I wire up custom pre and
post processing	
ML Lifecycle: Production Model Serving
❑  How do I handle batch
predictions?
❑  How do I leverage standardized
Data Plane protocol so that I can
move my model across MLServing
platforms?
●  Seldon Core was pioneering graph inferencing.
●  IBM and Bloomberg were exploring serverless ML lambdas. IBM gave a talk on ML serving with Knative at the last KubeCon in Seattle.
●  Google had built a common TensorFlow HTTP API for models.
●  Microsoft was Kubernetizing its Azure ML stack.
Experts fragmented across industry
●  Kubeflow	created	the	conditions	for	collaboration.	
●  A	promise	of	open	code	and	open	community.	
●  Shared	responsibilities	and	expertise	across	multiple	companies.	
●  Diverse	requirements	from	different	customer	segments	
Putting the pieces together
●  Founded by Google, Seldon,
IBM, Bloomberg and Microsoft	
●  Part of the Kubeflow project
●  Focus on 80% use cases -
single model rollout and update
●  KFServing 1.0 goals:
○  Serverless ML Inference
○  Canary rollouts
○  Model Explanations
○  Optional Pre/Post
processing
Model Serving - KFServing
Manages the hosting aspects of your models
•  InferenceService	-	manages the lifecycle of
models
	
•  Configuration	-	manages history of model
deployments. Two configurations for default and
canary.
	
•  Revision	-	A snapshot of your model version
•  Route	-	Endpoint and network traffic management
[Diagram: a KFService Route splits traffic between the Default Configuration (Revision 1 ... Revision M, 90%) and the Canary Configuration (Revision 1 ... Revision N, 10%)]
KFServing: Default and
Canary Configurations
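A default/canary split like the one described above can be sketched as a manifest built in Python. This assumes the `serving.kubeflow.org/v1alpha2` layout (`default`, `canary`, and `canaryTrafficPercent` under `spec`) used elsewhere in this deck; the storage URIs passed in are placeholders, and the helper names are hypothetical.

```python
def inference_service(name, default_uri, canary_uri=None, canary_percent=0):
    """Build a v1alpha2-style InferenceService dict with an optional canary."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    spec = {"default": {"predictor": {"tensorflow": {"storageUri": default_uri}}}}
    if canary_uri is not None:
        # The canary receives canary_percent of traffic; the default gets the rest.
        spec["canaryTrafficPercent"] = canary_percent
        spec["canary"] = {"predictor": {"tensorflow": {"storageUri": canary_uri}}}
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": spec,
    }


def traffic_split(manifest):
    """Return the (default, canary) traffic percentages implied by the manifest."""
    canary = manifest["spec"].get("canaryTrafficPercent", 0)
    return 100 - canary, canary


if __name__ == "__main__":
    # Both URIs are placeholders for two versions of the same model.
    svc = inference_service("flowers-sample", "gs://bucket/model-v1",
                            "gs://bucket/model-v2", canary_percent=10)
    print(traffic_split(svc))
```

Raising `canary_percent` step by step and re-applying the manifest is one way to promote a canary gradually; setting it back to 0 sends all traffic to the default again.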
Model Servers
-  TensorFlow
-  Nvidia TRTIS
-  PyTorch
-  XGBoost
-  SKLearn
-  ONNX

Components
-  Predictor, Explainer, Transformer (pre-processor, post-processor)

Storage
-  AWS/S3
-  GCS
-  Azure Blob
-  PVC
Supported Frameworks, Components and
Storage Subsystems
GPU Autoscaling - KNative solution
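[Diagram: API requests enter through the Ingress. When scale == 0, or when handling burst capacity, requests go to the Activator, which buffers requests and reports metrics to the Autoscaler. When scale > 0, requests go to the Queue Proxy in front of the Model server, which also reports metrics. The Autoscaler scales the deployment between 0...N replicas.]
●  Scale based on # in-flight requests against expected concurrency
●  Simple solution for heterogeneous ML inference autoscaling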
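The scaling rule "in-flight requests against expected concurrency" can be sketched as a one-line calculation. This is a deliberately simplified model, not the Knative autoscaler itself, which (among other things) averages metrics over time windows; the function name and the `max_replicas` cap are assumptions for illustration.

```python
import math


def desired_replicas(in_flight, target_concurrency, max_replicas=10):
    """Replica count so that each replica handles at most
    `target_concurrency` in-flight requests; zero when idle."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    if in_flight <= 0:
        return 0  # scale to zero; the activator would buffer the next request
    return min(max_replicas, math.ceil(in_flight / target_concurrency))


if __name__ == "__main__":
    for load in (0, 1, 9, 1000):
        print(load, desired_replicas(load, target_concurrency=4))
```

Because the rule only counts requests, it works the same whether a replica runs on CPU or GPU, which is the sense in which it suits heterogeneous inference workloads.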
But the Data Scientist Sees...
●  A pointer to a Serialized Model File
●  9 lines of YAML
●  A live model at an HTTP endpoint
●  Scale to Zero
●  GPU Autoscaling
●  Safe Rollouts
●  Optimized Serving Containers
●  Network Policy and Auth
●  HTTP APIs (gRPC soon)
●  Tracing
●  Metrics
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
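Once such a service is live, a prediction is an HTTP POST. The sketch below only constructs the request, assuming the TensorFlow-style V1 data plane (`/v1/models/<name>:predict` with an `instances` list); the host name and the input values are hypothetical, and a real cluster may also require a `Host` header or authentication depending on the ingress setup.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances):
    """Build (but do not send) a V1-protocol predict request."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "flowers-sample.example.com" is a placeholder host.
    req = build_predict_request("flowers-sample.example.com", "flowers-sample", [[1.0, 2.0]])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a cluster.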
Production	users	include:	
Bloomberg
KFServing: Default, Canary and Autoscaler
KFServing – Existing Features
❑  Crowd-sourced capabilities: contributions by AWS, Bloomberg, Google, Seldon, IBM, NVIDIA and others.
❑  Support for multiple pre-integrated runtimes: TFServing, NVIDIA Triton (GPU optimization), ONNX Runtime, SKLearn, PyTorch, XGBoost, and custom models.
❑  Serverless ML inference and autoscaling: scale to zero (with no incoming traffic) and request-queue-based autoscaling
❑  Canary and pinned rollouts: control traffic percentage and direction
❑  Pluggable pre-processor/post-processor via Transformer: plug in pre-processing/post-processing implementations, and control routing and placement (e.g. pre-processor on CPU, predictor on GPU)
❑  Pluggable analysis algorithms: Explainability, Drift Detection, Anomaly Detection, Adversarial Detection (contributed by Seldon), enabled by Payload Logging (built using the CloudEvents standardized eventing protocol)
❑  Batch predictions: batch prediction support for ML frameworks (TensorFlow, PyTorch, ...)
❑  Integration with the existing monitoring stack around the Knative/Istio ecosystem: Kiali (service placements, traffic and graphs), Jaeger (request tracing), Grafana/Prometheus plug-ins for Knative
❑  Multiple clients: kubectl, Python SDK, Kubeflow Pipelines SDK
❑  Standardized Data Plane V2 protocol for prediction, explainability et al.: already implemented by NVIDIA Triton
❑  MMS: Multi-Model-Serving for serving multiple models per custom KFService instance
❑  More Data Plane V2 API compliant servers: SKLearn, XGBoost, PyTorch…
❑  Multi-model graphs and pipelines: support chaining multiple models together in a pipeline
❑  PyTorch support via AWS TorchServe
❑  gRPC support for all model servers
❑  Support for multi-armed bandits
❑  Integration with IBM AIX360 for explainability, AIF360 for bias detection and ART for adversarial detection
KFServing – Upcoming Features
Prepared
and
Analyzed
Data
Trained
Model
Deployed
Model
Prepared
Data
Untrained
Model
ML Lifecycle: Orchestrate Build, Train, Validate and Deploy
Kubeflow Pipelines
§  Containerized implementations of ML Tasks
§  Pre-built components: Just provide params or code snippets
(e.g. training code)
§  Create	your	own	components	from	code	or	libraries	
§  Use	any	runtime,	framework,	data	types	
§  Attach	k8s	objects	-	volumes,	secrets
§  Specification of the sequence of steps
§  Specified via Python DSL
§  Inferred from data dependencies on input/output
§  Input Parameters
§  A “Run” = Pipeline invoked w/ specific parameters
§  Can be cloned with different parameters
§  Schedules	
§  Invoke a single run or create a recurring scheduled pipeline
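The point that the sequence of steps is inferred from data dependencies can be illustrated without the Pipelines SDK. In this sketch each step declares the artifacts it reads and writes, and the execution order falls out of a topological sort; the step and artifact names are made up, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def execution_order(steps):
    """`steps` maps step name -> (inputs, outputs), each a set of artifact names.
    A step depends on whichever step produces one of its inputs."""
    producer = {}
    for name, (_, outputs) in steps.items():
        for artifact in outputs:
            producer[artifact] = name
    # Inputs with no producer (e.g. raw data) are external and add no edge.
    graph = {
        name: {producer[a] for a in inputs if a in producer}
        for name, (inputs, _) in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    steps = {
        "deploy": ({"model"}, set()),
        "train": ({"transformed"}, {"model"}),
        "preprocess": ({"raw", "schema"}, {"transformed"}),
        "validate": ({"raw"}, {"schema"}),
    }
    print(execution_order(steps))
```

The steps are declared in an arbitrary order, yet the result runs validation before preprocessing, training, and deployment, which mirrors how passing one step's output to another step's input is enough to order them.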
Define Pipeline with Python SDK
import kfp.dsl as dsl

# TfdvOp, PreprocessOp, DnnTrainerOp, TfmaOp and TfServingDeployerOp are
# component ops assumed to be defined or loaded elsewhere.
@dsl.pipeline(name='Taxi Cab Classification Pipeline Example')
def taxi_cab_classification(
    output_dir,
    project,
    train_data='gs://bucket/train.csv',
    evaluation_data='gs://bucket/eval.csv',
    target='tips',
    learning_rate=0.1, hidden_layer_size='100,50', steps=3000):

    # Each op consumes outputs of earlier ops, which defines the execution order.
    tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
    preprocess = PreprocessOp(train_data, evaluation_data, tfdv.outputs['schema'], project, output_dir)
    training = DnnTrainerOp(preprocess.output, tfdv.outputs['schema'], learning_rate, hidden_layer_size, steps,
                            target, output_dir)
    tfma = TfmaOp(training.output, evaluation_data, tfdv.outputs['schema'], project, output_dir)
    deploy = TfServingDeployerOp(training.output)
	
Compile and Submit Pipeline Run
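import kfp.compiler  # also binds the top-level kfp package
kfp.compiler.Compiler().compile(taxi_cab_classification, 'tfx.tar.gz')
client = kfp.Client()
experiment = client.create_experiment('tfx')  # run_pipeline needs an experiment ID; the name is a placeholder
run = client.run_pipeline(experiment.id, 'tfx_run', 'tfx.tar.gz',
                          params={'output_dir': 'gs://dpa22', 'project': 'my-project-33'})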
Visualize the state of various components
Pipelines versioning
Pipelines	lets	you	group	and	manage	multiple	versions	of	a	pipeline.
Artifact Tracking
Artifacts for a run of the “TFX Taxi Trip” example pipeline. For each artifact, you can view details and get the artifact URL (in this case, for the model).
Lineage Tracking
For	a	given	run,	the	Pipelines	Lineage	Explorer	lets	you	view	the	history	
and	versions	of	your	models,	data,	and	more.
Kubeflow Pipeline Architecture
Kubeflow Pipelines can train, deploy and serve
Kubernetes
Ready
ML and AI Platform
Operator Hub - operatorhub.io
Watson Productization of Kubeflow Pipelines
Watson AI Pipelines
•  Demonstrate that Watson can be used for the end-to-end AI lifecycle: data prep, model training, model risk validation, model deployment, monitoring, and updating models
•  Demonstrate that the full lifecycle can be operated programmatically, with Tekton as a backend instead of Argo
Pipeline: Train the model and monitor with OpenScale
Tekton
❑  A PipelineResource defines an object that is an input (such as a git repository) or an output (such as a docker image) of the pipeline.
❑  A PipelineRun defines an execution of a pipeline. It references the Pipeline to run and the PipelineResources to use as inputs and outputs.
❑  A Pipeline defines the set of Tasks that compose a pipeline.
❑  A Task defines a set of build Steps such as compiling code, running tests, and building and deploying images.
[Diagram: Tekton. Each TASK, made of STEPs, runs as a POD; each STEP runs as a Container.]
❑  The Tekton Pipelines project provides Kubernetes-style resources for declaring CI/CD-style pipelines.
❑  Tekton introduces several new CRDs including Task, Pipeline, TaskRun, and PipelineRun.
❑  A PipelineRun represents a single running instance of a Pipeline and is responsible for creating a Pod for each of its Tasks and as many containers within each Pod as it has Steps.
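The Task-to-Pod and Step-to-Container mapping described above can be modelled in a few lines. This is a conceptual sketch, not Tekton's controller logic: the `Task` and `Pipeline` classes, the pod naming scheme, and the example pipeline are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    steps: List[str] = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    tasks: List[Task] = field(default_factory=list)


def plan_pipeline_run(pipeline, run_name):
    """Return one pod description per Task, with one container per Step."""
    return [
        {
            # Simplified naming; real pod names include generated suffixes.
            "pod": f"{run_name}-{task.name}",
            "containers": [f"step-{step}" for step in task.steps],
        }
        for task in pipeline.tasks
    ]


if __name__ == "__main__":
    pipeline = Pipeline("train-and-deploy", [
        Task("build", ["compile", "test"]),
        Task("deploy", ["push-image", "apply-manifest"]),
    ])
    for pod in plan_pipeline_run(pipeline, "run-1"):
        print(pod)
```

Because Steps of one Task share a Pod, they can share a workspace on local disk, while separate Tasks need explicit inputs and outputs to pass data between Pods.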
KFP – Tekton Phase One
[Diagram: the KFP SDK compiles a pipeline to Argo Pipeline YAML or Tekton Pipeline YAML. Elements shown: KFP API Server, KFP UI, Components, Pipelines, Object Store, Relational DB, and the ARGO and TEKTON engines, where TASKs made of STEPs run as PODs made of Containers. Pluggable Components: Watson Studio, WML, OpenScale, Spark, Kubeflow Training, Seldon, AIF360, ART, KATIB, KFSERVING.]
KFP – Tekton Phase Two
[Diagram: the KFP SDK compiles a pipeline to Argo Pipeline YAML or Tekton Pipeline YAML. Elements shown: KFP API Server, KFP UI, Components, Pipelines, Object Store, Relational DB, and the ARGO and TEKTON engines, where TASKs made of STEPs run as PODs made of Containers. Pluggable Components: Watson Studio, WML, OpenScale, Spark, Kubeflow Training, Seldon, AIF360, ART, KATIB, KFSERVING.]
KFP – Tekton Challenges
Multiple moving parts, with different stakeholders

Tekton Community: Argo, at version 2.6, was much more mature than Tekton v0.11 (alpha) when the work started around 5 months ago.
•  Multiple features and capabilities were lacking in Tekton when we kick-started.
•  The team had to default to a spreadsheet to track and map KFP DSL features, and areas where Tekton needed to bring features and functions. Overall, 50 DSL capabilities were identified and corresponding Tekton features started getting mapped.
•  Multiple features either did not exist or existed only partially: Kubernetes resource support (create/patch/update/delete), image pull secrets, loops, conditionals, and support for system params.
•  Tekton started moving from alpha to beta as the work progressed, and a few features were left behind in alpha mode.
•  Multiple issues were opened on Tekton. This required ramping up a team of Tekton contributors to help drive these issues. A virtual team was formed of IBM Open Tech developers (Andrea Frittoli, Priti Desai), the IBM Systems team (Vincent Pli), the DevOps team (Simon Kaegi), and RedHat (Vincent Demeester etc.) to drive Tekton requirements.

Kubeflow Pipeline and TFX Community: An open source team needed to be formed and trained for the specific mission. Additionally, Google needed to be brought onto the same page and convinced of the validity of the integration.
•  Multiple design reviews were established with Google, and a direction was jointly agreed on after they were convinced why we were doing it and why it makes sense.
•  Making the case to accelerate the IR (Intermediate Representation) strategy with TFX, so as to be able to drive this the right way.
•  Huge dependency in Kubeflow Pipeline code on Argo: the API backend and UI are all written with an Argo dependency.
•  The internal IBM team divided to attack different areas: Compiler (Christian Kadner), API (Tommy Li), UI (Andrew), Feng Li (IBM Systems, China).
•  The Kubeflow Pipeline backend could not take multiple CRDs, which is the default model Tekton follows, so everything needed to be bundled in one Pipeline Spec.
•  Type check, workflow utils, and parameter replacement are heavily tied to the Argo API. In addition, the persistent agent watches resources using the Argo API type.
•  The MLOps SIG in the CD Foundation was leveraged to bring the Kubeflow Pipelines and Tekton teams together.
KFP – Tekton: Delivered
[Diagram: the KFP SDK compiles a pipeline to Tekton Pipeline YAML. Elements shown: KFP API Server, KFP UI, Components, Pipelines, Object Store, Relational DB, and the TEKTON engine, where TASKs made of STEPs run as PODs made of Containers. Pluggable Components: Watson Studio, WML, OpenScale, Spark, Kubeflow Training, Seldon, AIF360, ART, KATIB, KFSERVING.]
Same KFP Experience: DAG, backed by Tekton YAML
Same KFP Exp: Logs, Lineage Tracking and Artifact Tracking
End to end Kubeflow Components : With KFP-Tekton
Kubeflow Adoption: External and Internal
Telstra AI Lab - (TAIL) - Configuration	
•  Kubernetes 1.15
•  Spectrum Scale CSI Driver
•  MetalLB for Load Balancing
•  Istio 1.3.1 for ingress
•  Kubeflow 1.0.1
•  Jupyter Notebook images are IBM’s multiarchitecture powerai images (https://hub.docker.com/r/ibmcom/powerai/tags)
Telstra: Collaborating with IBM to build an Open Source based
OneAnalytics Platform leveraging Kubeflow
THINK 2020 Session: End-to-End Data Science and Machine Learning for Telcos: Telstra's Use Case
https://www.ibm.com/events/think/watch/replay/126561688
Telstra AI Lab - (TAIL) – Future state
•  RedHat OpenShift 4.3
•  GPU Operator
•  Kubeflow Operator
•  Extending the compute
•  Integrate feature stores and streaming technologies
•  Integrate with CI/CD tools (Tekton Pipelines)
Yara – Working with IBM to build a Data Science Platform for Digital Farming
ML use cases based on Kubeflow
THINK 2020 Session: Enable Smart Farming using Kubeflow
https://www.ibm.com/events/think/watch/replay/126494864
Watson STT: Kubeflow Pipelines running Operations
Watson SpeechToText training Kubeflow pipeline
OpenDataHub
'Upstream' is about extracting oil and natural gas from the ground; 'midstream' is about safely moving them thousands of miles;
and 'downstream' is converting these resources into the fuels and finished products we all depend on.
Upstream, Midstream and Downstream
Data Platform
Operator Hub - operatorhub.io
OpenShift
Ready
OPEN DATA HUB - Ecosystem
Red Hat OpenShift Container Platform
OPEN DATA HUB REFERENCE ARCHITECTURE
[Diagram: Open Data Hub layers on the Red Hat OpenShift Container Platform.
Layers: Storage; Metadata Management; Data Analysis; AI and ML; Security and Governance; Monitoring and Orchestration; Data in Motion.
Capabilities: Data Lake; In Memory; Relational Databases; Streaming Data; Object Storage Data; Log Data; Big Data Processing; Streaming; Data Exploration; Interactive Notebooks; Model Lifecycle; ML Applications; Business Applications; Metastore.]
Red Hat OpenShift Container Platform
OPEN DATA HUB REFERENCE IMPLEMENTATION
Layers: Storage; Metadata Management; Data Analysis; AI and ML; Security and Governance; Monitoring and Orchestration; Data in Motion
•  Security and Governance: OpenShift OAuth, OpenShift Single Sign-On (Keycloak), RedHat Ceph Object Gateway, RedHat 3scale
•  Monitoring and Orchestration: Prometheus, Grafana, Kubeflow Pipelines, Jenkins CI/CD
•  Data Lake: RedHat Ceph Storage
•  In Memory: RedHat Data Grid (Infinispan)
•  Relational Databases: PostgreSQL, MySQL
•  Streaming Data: RedHat AMQ Streams, Kafka Connect
•  Object Storage Data: RedHat Ceph S3 API
•  Log Data: FluentD, Logstash
•  Big Data Processing: Spark, SparkSQL, Thrift
•  Streaming: Kafka Streams, Elastic Search
•  Data Exploration: Hue, Kibana
•  Interactive Notebooks: JupyterHub, Hue
•  Model Lifecycle: Kubeflow, Seldon, MLFlow
•  ML Applications: OpenDataHub AI Library
•  Business Applications: Superset
•  Metastore: Hive
OpenDataHub	and	Kubeflow:	Relationship
Initial Goals: OpenDataHub and Kubeflow
•  Kubeflow has great traction. Make it available for OpenShift users
   Done in https://github.com/opendatahub-io/manifests
•  Offer ODH users components installed by KF
•  And offer components from ODH (Kafka, Apache Superset, Hive…) to the KF community
•  Decide if we can leverage the KF project and community as upstream for ODH
•  Think Kubernetes -> OpenShift
•  Frees up ODH maintainers' time to make sure KF keeps running well on OpenShift
Kubeflow Operator – Contributed by IBM to Kubeflow community
to help enable OpenDataHub
•  https://operatorhub.io/operator/kubeflow
•  Deploy, manage and monitor Kubeflow
•  On various environments
❑  IBM Cloud
❑  GCP
❑  AWS
❑  Azure
❑  OpenShift
❑  Other K8S
Outcome: Kubeflow an Upstream for OpenDataHub
●  A version of the Operator based on Kubeflow Architecture released: https://developers.redhat.com/blog/2020/05/07/open-data-hub-0-6-brings-component-updates-and-kubeflow-architecture/?sc_cid=7013a000002DTqEAAW
●  Most of the components converted: https://github.com/opendatahub-io/odh-manifests
●  Still a separate deployment: needs to do both ODH and Kubeflow in one go.
Future
•  KF 1.0 on OpenShift
•  Disconnected deployment
•  Open Data Hub CI/CD
•  Kubeflow on OpenShift CI
•  UBI based ODH & KF
•  Multitenancy model
•  Mixing KF & ODH
OPEN DATA HUB 0.6.x
Open Data Hub in OpenShift
Apache Superset
Spark with Open Data Hub
•  Open Data Hub will also deploy the Spark Operator to manage Spark as an application.
•  Two versions of Spark: Spark in dedicated mode and Spark on K8s
•  Currently moving towards the Spark on K8s Operator from Google for serverless Spark. The IBM Hummingbird team is investigating this
Airflow integration with Open Data Hub
•  Open Data Hub will also deploy the Airflow Operator to manage Airflow as an application.
•  Uses the Airflow Operator originally developed in the GoogleCloudPlatform repository and later donated to Apache.
•  The Operator creates a controller-manager pod as part of the Open Data Hub deployment.
•  Users can then install the Airflow components they need from the available options (e.g. CeleryExecutor or KubernetesExecutor, Postgres deployment or MySQL deployment).
Apache Hive with OpenDataHub
•  Hive was one of the first abstraction engines to be built on top of MapReduce.
•  Started at Facebook to enable data analysts to analyse data in Hadoop using familiar SQL syntax, without having to learn how to write MapReduce.
•  Hive is an essential tool in the Hadoop ecosystem that provides an SQL dialect for querying data stored in HDFS, in other file systems that integrate with Hadoop such as MapR-FS and Amazon’s S3, and in databases like HBase (the Hadoop database) and Cassandra.
•  Hive is a Hadoop-based system for querying and analysing large volumes of structured data stored on HDFS.
•  Hive is a query engine built to work on top of Hadoop that can compile queries into MapReduce jobs and run them on the cluster.
Upstream Kubeflow Midstream OpenDataHub
OpenShift
Ready
Operator Hub - operatorhub.io
Kubeflow
OpenDataHub
Open Source End To End
Data and AI Platform
RedHat MarketPlace https://marketplace.redhat.com/en-us
Coming Next: Kubeflow Dojo
https://github.com/kubeflow
https://github.com/opendatahub-io
https://github.com/IBM/KubeflowDojo
Kubeflow Dojo: Prerequisites
•  Knowledge of Kubernetes, watch the dojo for Kubernetes project with the IBM internal link or external link
•  Access to a Kubernetes cluster, either minikube or remote hosted
•  Source code control and development with git and github, watch the presentation with the
IBM internal link or external link for git and external link for pull requests
•  Get familiar with the Go language (golang), watch the introduction dojo with the IBM internal link or external link
•  (optional) Knowledge of Istio and knative
•  If you have more time,
o  Read Kubeflow document to learn more about Kubeflow project
o  Browse through Kubeflow community github
Kubeflow Dojo: Tips for success
•  Access to a Kubernetes cluster
•  minimal spec: 8 vCPUs, 16 GB RAM and at least 50 GB disk for docker registry
•  On IBM Kubernetes Service, provision the cluster with machine type b2c.4x16 and 2 worker
nodes
•  Follow Kubeflow document to have your cluster prepared
•  On IKS cluster, follow this link to install the IBM Cloud CLI and helm followed by setting up
IBM Cloud Block Storage as the default storage class
Kubeflow	Dojo:	Live	
Dates:	15th	and	16th	July	
	
	
Kubeflow Dojo: Virtual
github.com/ibm/KubeflowDojo
Reach Out!
Animesh Singh
singhan@us.ibm.com
twitter.com/AnimeshSingh
github.com/AnimeshSingh
https://ec.yourlearning.ibm.com/w3/event/10082348

More Related Content

PDF
Kubeflow
PDF
MLOps with Kubeflow
PDF
Databricks Overview for MLOps
PDF
Kubernetes a comprehensive overview
PDF
The A-Z of Data: Introduction to MLOps
PDF
generative-ai-fundamentals and Large language models
PDF
Vertex AI: Pipelines for your MLOps workflows
PDF
Machine learning Algorithms
Kubeflow
MLOps with Kubeflow
Databricks Overview for MLOps
Kubernetes a comprehensive overview
The A-Z of Data: Introduction to MLOps
generative-ai-fundamentals and Large language models
Vertex AI: Pipelines for your MLOps workflows
Machine learning Algorithms

What's hot (20)

PDF
Machine Learning using Kubeflow and Kubernetes
PDF
What is MLOps
PDF
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
PDF
Kubeflow Pipelines (with Tekton)
PDF
Using MLOps to Bring ML to Production/The Promise of MLOps
PDF
MLOps for production-level machine learning
PPTX
From Data Science to MLOps
PDF
“Houston, we have a model...” Introduction to MLOps
PDF
MLOps Using MLflow
PDF
Kubeflow Distributed Training and HPO
PPTX
MLOps in action
PPTX
Steering the Course with Helm
PDF
KFServing and Kubeflow Pipelines
PPTX
MLOps - The Assembly Line of ML
PDF
Ml ops past_present_future
PDF
Apply MLOps at Scale by H&M
PDF
Ml ops intro session
PDF
MLOps Bridging the gap between Data Scientists and Ops.
PDF
Productionzing ML Model Using MLflow Model Serving
PDF
Service Mesh on Kubernetes with Istio
Machine Learning using Kubeflow and Kubernetes
What is MLOps
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)
Kubeflow Pipelines (with Tekton)
Using MLOps to Bring ML to Production/The Promise of MLOps
MLOps for production-level machine learning
From Data Science to MLOps
“Houston, we have a model...” Introduction to MLOps
MLOps Using MLflow
Kubeflow Distributed Training and HPO
MLOps in action
Steering the Course with Helm
KFServing and Kubeflow Pipelines
MLOps - The Assembly Line of ML
Ml ops past_present_future
Apply MLOps at Scale by H&M
Ml ops intro session
MLOps Bridging the gap between Data Scientists and Ops.
Productionzing ML Model Using MLflow Model Serving
Service Mesh on Kubernetes with Istio
Ad

Similar to End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage (20)

PDF
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
PDF
KFServing - Serverless Model Inferencing
PDF
Scaling AI/ML with Containers and Kubernetes
PDF
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
PDF
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
PPTX
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
PDF
KubeCon & CloudNative Con 2024 Artificial Intelligent
PDF
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
PDF
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
PDF
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
PDF
Containerized architectures for deep learning
PDF
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
PPTX
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
PDF
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...
PDF
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
PPTX
Episode 1: Building Kubernetes-as-a-Service
PDF
PDF
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow
PDF
Serving models using KFServing
PDF
Continuous Lifecycle London 2018 Event Keynote
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
KFServing - Serverless Model Inferencing
Scaling AI/ML with Containers and Kubernetes
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
KubeCon & CloudNative Con 2024 Artificial Intelligent
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
Containerized architectures for deep learning
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Episode 1: Building Kubernetes-as-a-Service
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow
Serving models using KFServing
Continuous Lifecycle London 2018 Event Keynote
Ad

End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage

  • 1. Jupyter Notebooks Workflow Building Pipelines Tools Serving Metadata Kale Fairing TFX KF Pipelines HP Tuning Tensorboard KFServing Seldon Core TFServing, + Training Operators Pytorch XGBoost, + Tensorflow Prometheus Kubeflow: End to End ML Platform Animesh Singh MPI MXNet
  • 2. © 2019 IBM Corporation Animesh Singh STSM and Chief Architect - Data and AI Open Source Platform o  CTO, IBM RedHat Data and AI Open Source Alignment o  IBM Kubeflow Engagement Lead, Kubeflow Committer o  Chair, Linux Foundation AI - Trusted AI o  Chair, CD Foundation MLOps Sig o  Ambassador, CNCF o  Member of IBM Academy of Technology (IBM AoT) Kubeflow github.com/kubeflow Your Speaker Today: CODAIT
  • 3. Prepared and Analyzed Data Trained Model Deployed Model Prepared Data Untrained Model Kubeflow: Current IBM Contributors Christian Kadner Weiqiang Zhuang Tommy Li Andrew Butler Jin Chi He Feng Li Ke Zhu Kevin Yu
  • 4. IBM is the 2nd Largest Contributor
  • 5. IBM is the 2nd Largest Contributor
  • 6. IBMers contributing across projects in Kubeflow
  • 7. Kubeflow Services High Level Services Low Level APIs / Services Katib Pipelines Notebooks TFJob PyTorchJob Jupyter CR Seldon CR Kubebench Pipelines CR Argo Study Job MPIJob Spark Job KFServing TFX Developed By Kubeflow Developed Outside Kubeflow Adapted from Kubeflow Contributor Summit 2019 talk: Kubeflow and ML Landscape (Not all components are shown) Kubernetes API Server Istio Mesh and Gateway kubectl apply -f tfjob
  • 11. Develop (Kubeflow Jupyter Notebooks) –  Data Scientist –  Self-service Jupyter Notebooks provide faster model experimentation –  Simplified configuration of CPU/GPU, RAM, Persistent Volumes –  Faster model creation with training operators, TFX, magics, workflow automation (Kale, Fairing) –  Simplify access to external data sources (using stored secrets) –  Easier protection, faster restoration & sharing of “complete” notebooks –  IT Operator –  Profile Controller, Istio, Dex enable secure RBAC to notebooks, data & resources –  Smaller base container images for notebooks, fewer crashes, faster to recover
  • 12. Develop (Kubeflow Jupyter Notebooks)
  • 15. Distributed Tensorflow Operator •  A distributed Tensorflow Job is a collection of the following processes o  Chief – The chief is responsible for orchestrating training and performing tasks like checkpointing the model o  Ps – The ps are parameter servers; they provide a distributed data store for the model parameters o  Worker – The workers do the actual work of training the model. In some cases, worker 0 might also act as the chief o  Evaluator – The evaluators can be used to compute evaluation metrics as the model is trained
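The roles above are typically communicated to each TensorFlow process through the `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The following is a minimal sketch of how such a document could be assembled; the job name `mnist`, the `<job>-<role>-<index>` host naming and port 2222 are assumptions made for illustration, not something stated on the slide.

```python
import json


def make_cluster(job_name, replicas, port=2222):
    """Build a cluster map: role -> list of "host:port" addresses.

    The "<job>-<role>-<index>" host naming is an illustrative assumption.
    """
    return {
        role: [f"{job_name}-{role}-{i}:{port}" for i in range(count)]
        for role, count in replicas.items()
        if count > 0
    }


def tf_config_for(cluster, task_type, task_index):
    """Return the TF_CONFIG JSON string for one replica of the job."""
    if task_index >= len(cluster[task_type]):
        raise ValueError(f"no {task_type} replica with index {task_index}")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}},
        sort_keys=True,
    )


# One chief, two parameter servers and three workers.
cluster = make_cluster("mnist", {"chief": 1, "ps": 2, "worker": 3})
print(tf_config_for(cluster, "worker", 1))
```

Every replica receives the same `cluster` map and differs only in its `task` entry, which is how a process learns whether it is the chief, a parameter server or a worker.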
  • 16. Distributed MPI Operator - AllReduce •  AllReduce is an operation that reduces many arrays spread across multiple processes into a single array which can be returned to all the processes •  This ensures consistency between distributed processes while allowing all of them to take on different workloads •  The operation used to reduce the multiple arrays back into a single array can vary and that is what makes the different options for AllReduce
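The AllReduce semantics described above can be illustrated in a single process. This is only a sketch of the result each participant ends up with; a real MPI implementation exchanges data over the network (for example with a ring algorithm), which is not modelled here.

```python
import operator
from functools import reduce


def all_reduce(arrays, op=operator.add):
    """Reduce equally sized arrays element-wise and hand the result to everyone.

    `arrays[i]` stands for the local array held by process i. The return value
    has one entry per process, each holding the same reduced array.
    """
    if not arrays:
        raise ValueError("need at least one array")
    length = len(arrays[0])
    if any(len(a) != length for a in arrays):
        raise ValueError("all arrays must have the same length")
    reduced = [reduce(op, column) for column in zip(*arrays)]
    return [list(reduced) for _ in arrays]


# Three "processes", each holding a local gradient of length 2.
local_gradients = [[1, 2], [3, 4], [5, 6]]
print(all_reduce(local_gradients))        # sum variant
print(all_reduce(local_gradients, max))   # max variant
```

Swapping the `op` argument (sum, max, min, product) is what distinguishes the different AllReduce variants mentioned on the slide.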
  • 17. Hyper Parameter Optimization and Neural Architecture Search - Katib •  Katib: Kubernetes-native system for automated hyperparameter tuning and neural architecture search of machine learning models. •  Github Repository: https://github.com/kubeflow/katib •  Hyperparameter Tuning ❑  Random Search ❑  Tree of Parzen Estimators (TPE) ❑  Grid Search ❑  Hyperband ❑  Bayesian Optimization ❑  CMA Evolution Strategy •  Neural Architecture Search ❑  Efficient Neural Architecture Search (ENAS) ❑  Differentiable Architecture Search (DARTS)
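As an illustration of the simplest algorithm in the list, random search, here is a small self-contained sketch. It is not Katib's implementation: Katib runs each trial as a Kubernetes job, whereas this example calls a hypothetical `objective` function in-process, and the search space and trial count are made up for the example.

```python
import random


def random_search(objective, space, trials, seed=0):
    """Sample `trials` random points from `space` and keep the best one.

    `space` maps a parameter name to a (low, high) range; lower scores win.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def objective(params):
    """Hypothetical validation loss with its minimum at learning_rate = 0.1."""
    return (params["learning_rate"] - 0.1) ** 2


best, loss = random_search(objective, {"learning_rate": (0.0, 1.0)}, trials=200)
print(best, loss)
```

Grid search would replace the random sampling with an exhaustive sweep, while TPE and Bayesian optimization use earlier trial results to choose the next point.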
  • 19. ML Lifecycle: Production Model Serving ❑  Rollouts: Is this rollout safe? How do I roll back? Can I test a change without swapping traffic? ❑  Protocol Standards: How do I make a prediction? GRPC? HTTP? Kafka? ❑  Cost: Is the model over or under scaled? Are resources being used efficiently? ❑  Monitoring: Are the endpoints healthy? What is the performance profile and request trace? ❑  Frameworks: How do I serve on Tensorflow? XGBoost? Scikit Learn? Pytorch? Custom Code? ❑  Features: How do I explain the predictions? What about detecting outliers and skew? Bias detection? Adversarial Detection? ❑  How do I wire up custom pre and post processing? ❑  How do I handle batch predictions? ❑  How do I leverage a standardized Data Plane protocol so that I can move my model across ML serving platforms?
  • 20. Experts fragmented across industry ●  Seldon Core was pioneering Graph Inferencing. ●  IBM and Bloomberg were exploring serverless ML lambdas. IBM gave a talk on ML Serving with Knative at the last KubeCon in Seattle. ●  Google had built a common Tensorflow HTTP API for models. ●  Microsoft was Kubernetizing its Azure ML Stack.
  • 21. Putting the pieces together ●  Kubeflow created the conditions for collaboration. ●  A promise of open code and open community. ●  Shared responsibilities and expertise across multiple companies. ●  Diverse requirements from different customer segments
  • 22. Model Serving - KFServing ●  Founded by Google, Seldon, IBM, Bloomberg and Microsoft ●  Part of the Kubeflow project ●  Focus on 80% use cases - single model rollout and update ●  KFServing 1.0 goals: ○  Serverless ML Inference ○  Canary rollouts ○  Model Explanations ○  Optional Pre/Post processing
  • 23. KFServing: Default and Canary Configurations. Manages the hosting aspects of your models •  InferenceService - manages the lifecycle of models •  Configuration - manages history of model deployments. Two configurations for default and canary. •  Revision - A snapshot of your model version •  Route - Endpoint and network traffic management [Diagram: the KFService Route sends 90% of traffic to the Default Configuration (Revision 1 ... Revision M) and 10% to the Canary Configuration (Revision 1 ... Revision N)]
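The 90% / 10% split between the default and canary configurations can be pictured as weighted random routing. The sketch below only simulates the effect of such a split; in KFServing the actual routing is performed by the underlying Knative and Istio layers, and the function and variable names here are invented for the example.

```python
import random
from collections import Counter


def route(canary_percent, rng):
    """Pick the configuration that serves one request."""
    return "canary" if rng.random() * 100 < canary_percent else "default"


# Simulate 10,000 requests with 10% of the traffic sent to the canary.
rng = random.Random(42)
counts = Counter(route(10, rng) for _ in range(10_000))
print(counts)
```

Raising the canary percentage step by step, and setting it back to zero when the new revision misbehaves, is what makes the rollout safe to reverse.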
  • 24. Model Servers - TensorFlow - Nvidia TRTIS - PyTorch - XGBoost - SKLearn - ONNX Components: •  - Predictor, Explainer, Transformer (pre-processor, post-processor) Storage - AWS/S3 - GCS - Azure Blob - PVC Supported Frameworks, Components and Storage Subsystems
  • 25. GPU Autoscaling - KNative solution ●  Scale based on # in-flight requests against expected concurrency ●  Simple solution for heterogeneous ML inference autoscaling [Diagram: API Requests enter through the Ingress; the Activator buffers requests when scale == 0 or when handling burst capacity; when scale > 0, requests go to the Queue Proxy in front of the Model server (0...N Replicas); the Autoscaler receives metrics and sets the scale]
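The rule "scale based on in-flight requests against expected concurrency" boils down to a ceiling division. The sketch below is a deliberate simplification: the real Knative autoscaler averages metrics over time windows and has additional modes that are not modelled, and the parameter names are chosen for this example.

```python
import math


def desired_replicas(in_flight, target_concurrency, max_replicas):
    """Replicas needed so each one handles at most `target_concurrency` requests."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    if in_flight <= 0:
        return 0  # scale to zero when there is no traffic
    return min(max_replicas, math.ceil(in_flight / target_concurrency))


for load in (0, 1, 25, 1000):
    print(load, "in-flight ->", desired_replicas(load, 10, 5), "replicas")
```

Because the signal is request concurrency rather than CPU utilisation, the same rule applies to CPU and GPU model servers alike.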
  • 26. But the Data Scientist Sees... ●  A pointer to a Serialized Model File ●  9 lines of YAML ●  A live model at an HTTP endpoint ●  Scale to Zero ●  GPU Autoscaling ●  Safe Rollouts ●  Optimized Serving Containers ●  Network Policy and Auth ●  HTTP APIs (gRPC soon) ●  Tracing ●  Metrics Production users include: Bloomberg
    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "flowers-sample"
    spec:
      default:
        predictor:
          tensorflow:
            storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
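Once the InferenceService is live, a client can call it over HTTP. The sketch below builds (but does not send) a prediction request in the TensorFlow Serving style `:predict` format; the host name `flowers-sample.default.example.com` and the input values are placeholders, and the real host depends on how the cluster's ingress is configured.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances):
    """Build a POST request for the v1 `:predict` endpoint of a served model."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and input; a real call would pass the request to
# urllib.request.urlopen(request) and parse the JSON response.
request = build_predict_request(
    "flowers-sample.default.example.com", "flowers-sample", [[1.0, 2.0]]
)
print(request.get_method(), request.full_url)
```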
  • 28. KFServing – Existing Features
    ❑  Crowd sourced capabilities: Contributions by AWS, Bloomberg, Google, Seldon, IBM, NVIDIA and others.
    ❑  Support for multiple pre-integrated runtimes: TFServing, NVIDIA Triton (GPU optimization), ONNX Runtime, SKLearn, PyTorch, XGBoost, custom models.
    ❑  Serverless ML Inference and Autoscaling: Scale to zero (with no incoming traffic) and request queue based autoscaling
    ❑  Canary and Pinned rollouts: Control traffic percentage and direction, pinned rollouts
    ❑  Pluggable pre-processor/post-processor via Transformer: Gives capabilities to plug in pre-processing/post-processing implementation, control routing and placement (e.g. pre-processor on CPU, predictor on GPU)
    ❑  Pluggable analysis algorithms: Explainability, Drift Detection, Anomaly Detection, Adversarial Detection (contributed by Seldon) enabled by Payload Logging (built using the CloudEvents standardized eventing protocol)
    ❑  Batch Predictions: Batch prediction support for ML frameworks (TensorFlow, PyTorch, ...)
    ❑  Integration with existing monitoring stack around the Knative/Istio ecosystem: Kiali (service placements, traffic and graphs), Jaeger (request tracing), Grafana/Prometheus plug-ins for Knative
    ❑  Multiple clients: kubectl, Python SDK, Kubeflow Pipelines SDK
    ❑  Standardized Data Plane V2 protocol for prediction, explainability et al.: Already implemented by NVIDIA Triton
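For the standardized Data Plane V2 protocol mentioned in the last item, an inference request is a JSON body posted to `/v2/models/<model_name>/infer`, with each input tensor described by its name, shape, datatype and flattened data. The sketch below assembles such a body; the tensor name `input-0` and the values are placeholders for illustration.

```python
import json
import math


def build_v2_infer_body(name, shape, datatype, data):
    """Build a V2 inference request body with a single input tensor."""
    if math.prod(shape) != len(data):
        raise ValueError("data length does not match the tensor shape")
    return {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }


# A 2x2 FP32 tensor, flattened in row-major order.
body = build_v2_infer_body("input-0", [2, 2], "FP32", [1.0, 2.0, 3.0, 4.0])
print(json.dumps(body))
```

Because the body does not depend on the serving framework, the same client code can target any model server that implements the protocol.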
  • 29. KFServing – Upcoming Features ❑  MMS: Multi-Model-Serving for serving multiple models per custom KFService instance ❑  More Data Plane v2 API Compliant Servers: SKLearn, XGBoost, PyTorch… ❑  Multi-Model-Graphs and Pipelines: Support chaining multiple models together in a pipeline ❑  PyTorch support via AWS TorchServe ❑  gRPC Support for all Model Servers ❑  Support for multi-armed-bandits ❑  Integration with IBM AIX360 for Explainability, AIF360 for Bias detection and ART for Adversarial detection
  • 31. Kubeflow Pipelines §  Containerized implementations of ML Tasks §  Pre-built components: Just provide params or code snippets (e.g. training code) §  Create your own components from code or libraries §  Use any runtime, framework, data types §  Attach k8s objects - volumes, secrets §  Specification of the sequence of steps §  Specified via Python DSL §  Inferred from data dependencies on input/output §  Input Parameters §  A “Run” = Pipeline invoked w/ specific parameters §  Can be cloned with different parameters §  Schedules §  Invoke a single run or create a recurring scheduled pipeline
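The idea that the step sequence is "inferred from data dependencies on input/output" can be sketched with a topological sort. This is an illustration of the principle only, not the Kubeflow Pipelines compiler; the step and artifact names are invented, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each step lists the artifacts it consumes and produces (names are illustrative).
steps = {
    "validate": {"inputs": [], "outputs": ["schema"]},
    "preprocess": {"inputs": ["schema"], "outputs": ["transformed"]},
    "train": {"inputs": ["transformed", "schema"], "outputs": ["model"]},
    "analyze": {"inputs": ["model", "schema"], "outputs": []},
    "deploy": {"inputs": ["model"], "outputs": []},
}


def execution_order(steps):
    """Derive a valid run order from which step produces each consumed artifact."""
    producer = {out: name for name, s in steps.items() for out in s["outputs"]}
    graph = {name: {producer[i] for i in s["inputs"]} for name, s in steps.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

Steps with no dependency between them, such as `analyze` and `deploy` here, may appear in either order and could run in parallel.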
  • 32. Define Pipeline with Python SDK
    # TfdvOp, PreprocessOp, DnnTrainerOp, TfmaOp and TfServingDeployerOp are
    # component factories assumed to be defined elsewhere.
    import kfp
    from kfp import compiler, dsl

    @dsl.pipeline(name='Taxi Cab Classification Pipeline Example')
    def taxi_cab_classification(
            output_dir,
            project,
            train_data='gs://bucket/train.csv',
            evaluation_data='gs://bucket/eval.csv',
            target='tips',
            learning_rate=0.1,
            hidden_layer_size='100,50',
            steps=3000):
        tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
        preprocess = PreprocessOp(train_data, evaluation_data, tfdv.outputs['schema'], project, output_dir)
        training = DnnTrainerOp(preprocess.output, tfdv.outputs['schema'], learning_rate, hidden_layer_size, steps, target, output_dir)
        tfma = TfmaOp(training.output, evaluation_data, tfdv.outputs['schema'], project, output_dir)
        deploy = TfServingDeployerOp(training.output)

    # Compile and Submit Pipeline Run
    compiler.Compiler().compile(taxi_cab_classification, 'tfx.tar.gz')
    client = kfp.Client()
    experiment = client.create_experiment('tfx')  # experiment name is illustrative
    run = client.run_pipeline(experiment.id, 'tfx_run', 'tfx.tar.gz',
                              params={'output_dir': 'gs://dpa22', 'project': 'my-project-33'})
  • 33. Visualize the state of various components
  • 38. Kubeflow Pipelines can train, deploy and serve Open Source Dojo 38
  • 39. Kubernetes Ready ML and AI Platform Operator Hub - operatorhub.io
  • 41. Watson AI Pipelines •  Demonstrate that Watson can be used for the end-to-end AI lifecycle: data prep, model training, model risk validation, model deployment, monitoring and updating models •  Demonstrate that the full lifecycle can be operated programmatically, and have Tekton as a backend instead of Argo
  • 42. Pipeline: Train the model and monitor with OpenScale
  • 43. Tekton ❑  The Tekton Pipelines project provides Kubernetes-style resources for declaring CI/CD-style pipelines. ❑  Tekton introduces several new CRDs including Task, Pipeline, TaskRun, and PipelineRun. ❑  A Task defines a set of build Steps such as compiling code, running tests, and building and deploying images. ❑  A Pipeline defines the set of Tasks that compose a pipeline. ❑  A PipelineResource defines an object that is an input (such as a git repository) or an output (such as a docker image) of the pipeline. ❑  A PipelineRun defines an execution of a pipeline. It references the Pipeline to run and the PipelineResources to use as inputs and outputs. ❑  A PipelineRun represents a single running instance of a Pipeline and is responsible for creating a Pod for each of its Tasks and as many containers within each Pod as it has Steps. [Diagram: each Tekton Task maps to one Pod; each Step maps to one Container in that Pod]
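The Task to Pod and Step to container mapping described above can be modelled in a few lines. This is a toy data model for illustration, not the Tekton API; the class names mirror the Tekton concepts, while the pod naming scheme and the example pipeline are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    name: str
    steps: List[str]  # one container per step


@dataclass
class Pipeline:
    name: str
    tasks: List[Task] = field(default_factory=list)


def plan_pods(pipeline: Pipeline) -> Dict[str, List[str]]:
    """Map one pipeline run to pods: one pod per Task, one container per Step."""
    return {
        f"{pipeline.name}-{task.name}-pod": list(task.steps)
        for task in pipeline.tasks
    }


pipeline = Pipeline(
    "train-and-deploy",
    [Task("build", ["compile", "unit-test"]), Task("release", ["push-image"])],
)
for pod, containers in plan_pods(pipeline).items():
    print(pod, containers)
```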
  • 44. KFP – Tekton Phase One [Diagram labels: KFP SDK, COMPILE, Argo Pipeline Yaml, Tekton Pipeline Yaml; KFP API Server, KFP UI, Components, Pipelines, Object Store, Relational DB; Pluggable Components: Watson Studio, WML, Open Scale, Spark, Kubeflow Training, Seldon, AIF360, ART, KATIB, KFSERVING, …; backends ARGO and TEKTON, each with Tasks, Steps, Pods and Containers]
  • 45. KFP – Tekton Phase Two [Diagram labels: KFP SDK, COMPILE, Argo Pipeline Yaml, Tekton Pipeline Yaml; KFP API Server, KFP UI, Components, Pipelines, Object Store, Relational DB; Pluggable Components: Watson Studio, WML, Open Scale, Spark, Kubeflow Training, Seldon, AIF360, ART, KATIB, KFSERVING, …; backends ARGO and TEKTON, each with Tasks, Steps, Pods and Containers]
  • 46. KFP – Tekton Challenges: Multiple moving parts, with different stakeholders
    Tekton Community: Argo at version 2.6 was much more mature than Tekton v0.11 (alpha) when the work started around 5 months ago
    • Multiple features and capabilities were lacking in Tekton when we kick-started
    • The team had to default to a spreadsheet to start tracking and mapping KFP DSL features, and areas where Tekton needed to bring features and functions. Overall 50 DSL capabilities were identified and mapped to corresponding Tekton features.
    • Multiple features, such as support for creating/patching/updating/deleting Kubernetes resources, image pull secrets, loops, conditionals and support for system params, didn’t exist or existed only partially
    • Tekton started moving from alpha to beta as the work progressed, and a few features were left behind in alpha mode
    • Multiple issues were opened on Tekton. This required ramping up the team of Tekton contributors to help drive these issues. Formed a virtual team of IBM Open Tech developers (Andrea Frittoli, Priti Desai), IBM Systems team (Vincent Pli), DevOps team (Simon Kaegi) and RedHat (Vincent Demeester etc.) to drive Tekton requirements
    Kubeflow Pipeline and TFX Community: An open source team needed to be formed and trained for the specific mission. Additionally, Google needed to be brought onto the same page and convinced of the validity of the integration.
    • Multiple design reviews were established with Google, and a direction was jointly agreed after they were convinced why we were doing it and why it makes sense.
    • Convincing them to accelerate the IR (Intermediate Representation) strategy with TFX, so as to be able to drive this the right way
    • Huge dependency in Kubeflow Pipeline code on Argo, including the API backend and UI, all written with an Argo dependency
    • Internal IBM team divided to attack different areas: Compiler (Christian Kadner), API (Tommy Li), UI (Andrew), Feng Li (IBM Systems, China)
    • Inability of the Kubeflow Pipeline backend to take multiple CRDs, which is the default model Tekton follows. So everything needed to be bundled in one Pipeline Spec
    • Type check, workflow utils, and parameter replacement are heavily tied to the Argo API. In addition, the persistent agent watches the resources using the Argo API type.
    • MLOps Sig in CD Foundation leveraged to bring the Kubeflow Pipelines and Tekton teams together
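One way to picture the "everything bundled in one Pipeline Spec" constraint: Tekton allows a pipeline task to embed its definition inline under `taskSpec` instead of pointing to a separate Task resource through `taskRef`. The sketch below performs that rewrite on plain dictionaries. It is an assumption-laden illustration, not the KFP-Tekton compiler's actual code, and the task and image names are invented.

```python
import copy


def inline_tasks(pipeline, task_specs):
    """Return a copy of `pipeline` with every `taskRef` replaced by an inline `taskSpec`.

    `task_specs` maps a Task name to the `spec` section of that Task resource.
    """
    bundled = copy.deepcopy(pipeline)
    for task in bundled["spec"]["tasks"]:
        ref = task.pop("taskRef", None)
        if ref is not None:
            task["taskSpec"] = copy.deepcopy(task_specs[ref["name"]])
    return bundled


# Hypothetical separate Task definition and a Pipeline that references it.
task_specs = {
    "train": {"steps": [{"name": "main", "image": "example/train:latest"}]},
}
pipeline = {
    "apiVersion": "tekton.dev/v1beta1",
    "kind": "Pipeline",
    "metadata": {"name": "demo"},
    "spec": {"tasks": [{"name": "train", "taskRef": {"name": "train"}}]},
}

bundled = inline_tasks(pipeline, task_specs)
print(bundled["spec"]["tasks"][0])
```

The result is a single resource that a backend expecting one workflow document can store and submit.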
  • 47. KFP – Tekton: Delivered [Diagram labels: KFP SDK, COMPILE, Tekton Pipeline Yaml; KFP API Server, KFP UI, Components, Pipelines, Object Store, Relational DB; Pluggable Components: Watson Studio, WML, Open Scale, Spark, Kubeflow Training, Seldon, AIF360, ART, KATIB, KFSERVING, …; backend TEKTON with Tasks, Steps, Pods and Containers]
  • 48. Same KFP Experience: DAG, backed by Tekton YAML
  • 49. Same KFP Exp: Logs, Lineage Tracking and Artifact Tracking
  • 50. End to end Kubeflow Components : With KFP-Tekton
  • 52. Telstra AI Lab - (TAIL) - Configuration •  Kubernetes – 1.15 •  Spectrum Scale CSI Driver •  MetalLB for Load Balancing •  Istio 1.3.1 for ingress •  Kubeflow – 1.0.1 •  Jupyter Notebook images are IBM’s multiarchitecture powerai images ( https://hub.docker.com/r/ibmcom/powerai/tags) Telstra: Collaborating with IBM to build an Open Source based OneAnalytics Platform leveraging Kubeflow THINK 2020 Session: End-to-End Data Science and Machine Learning for Telcos: Telstra's Use Case https://www.ibm.com/events/think/watch/replay/126561688
  • 53. Telstra AI Lab - (TAIL) – Future state •  RedHat Openshift – 4.3 •  GPU Operator •  Kubeflow Operator •  Extending the compute •  Integrate feature stores and streaming technologies •  Integrate with CI/CD tools (Tekton Pipelines)
  • 54. Yara – Working with IBM to build a Data Science Platform for Digital Farming ML use cases based on Kubeflow THINK 2020 Session: Enable Smart Farming using Kubeflow https://www.ibm.com/events/think/watch/replay/126494864
  • 55. Watson STT: Kubeflow Pipelines running Operations
  • 56. Watson SpeechToText training Kubeflow pipeline
  • 58. 'Upstream' is about extracting oil and natural gas from the ground; 'midstream' is about safely moving them thousands of miles; and 'downstream' is converting these resources into the fuels and finished products we all depend on. Upstream, Midstream and Downstream
  • 59. Upstream, Midstream and Downstream 'Upstream' is about extracting oil and natural gas from the ground; 'midstream' is about safely moving them thousands of miles; and 'downstream' is converting these resources into the fuels and finished products we all depend on.
  • 60. Data Platform Operator Hub - operatorhub.io OpenShift Ready
  • 61. OPEN DATA HUB - Ecosystem
  • 62. Red Hat OpenShift Container Platform OPEN DATA HUB REFERENCE ARCHITECTURE Storage Metadata Management Data Analysis AI and ML Security and Governance Monitoring and Orchestration Data in Motion Data Lake In Memory Relational Databases Streaming Data Object Storage Data Log Data Big Data Processing Streaming Data Exploration Interactive Notebooks Model Lifecycle ML Applications Business Applications Metastore
  • 63. Red Hat OpenShift Container Platform OPEN DATA HUB REFERENCE IMPLEMENTATION Storage Metadata Management Data Analysis AI and ML Security and Governance OpenShift Oauth OpenShift Single SignOn (Keycloak) RedHat Ceph Object Gateway RedHat 3scale Monitoring and Orchestration Prometheus Grafana Kubeflow Pipelines Jenkins CI/CD Data in Motion Data Lake RedHat Ceph Storage In Memory RedHat Data Grid (Infinispan) Relational Databases PostgreSQL MySQL Streaming Data RedHat AMQ Streams Kafka Connect Object Storage Data RedHat Ceph S3 API Log Data FluentD Logstash Big Data Processing Spark SparkSQL Thrift Streaming Kafka Streams Elastic Search Data Exploration Hue Kibana Interactive Notebooks JupyterHub Hue Model Lifecycle Kubeflow Seldon MLFlow ML Applications OpenDataHub AI Library Business Applications Superset Metastore Hive
  • 65. Initial Goals: OpenDataHub and Kubeflow •  Kubeflow has great traction; make it available for OpenShift users. Done in https://github.com/opendatahub-io/manifests •  Offer ODH users components installed by KF •  And offer components from ODH (Kafka, Apache SuperSet, Hive…) to the KF community •  Decide if we can leverage the KF project and community as upstream for ODH •  Think Kubernetes -> OpenShift •  Frees up ODH maintainers' time to make sure KF keeps running well on OpenShift
  • 66. Kubeflow Operator – Contributed by IBM to Kubeflow community to help enable OpenDataHub •  https://operatorhub.io/operator/kubeflow •  Deploy, manage and monitor Kubeflow •  On various environments ❑  IBM Cloud ❑  GCP ❑  AWS ❑  Azure ❑  OpenShift ❑  Other K8S
  • 67. Outcome: Kubeflow an Upstream for OpenDataHub ●  A version of the Operator based on Kubeflow Architecture released: https://developers.redhat.com/blog/2020/05/07/open-data-hub-0-6-brings-component-updates-and-kubeflow-architecture/?sc_cid=7013a000002DTqEAAW ●  Most of the components converted: https://github.com/opendatahub-io/odh-manifests ●  Still a separate deployment – needs to do both ODH and Kubeflow in one go. Future •  KF 1.0 on OpenShift •  Disconnected deployment •  Open Data Hub CI/CD •  Kubeflow on OpenShift CI •  UBI based ODH & KF •  Multitenancy model •  Mixing KF & ODH
  • 68. OPEN DATA HUB 0.6.x
  • 69. Open Data Hub in OpenShift
  • 70. Apache Superset
  • 71. Spark with Open Data Hub •  Open Data Hub will also deploy the Spark Operator to manage Spark as an application. •  Two versions of Spark – Spark in dedicated mode and Spark on K8s •  Currently moving towards Spark on K8s Operator from Google for serverless Spark. IBM Hummingbird team investigating this
  • 72. Airflow integration with Open Data Hub •  Open Data Hub will also deploy the Airflow Operator to manage Airflow as an application. •  Using the Airflow Operator originally developed in the GoogleCloudPlatform repository and later donated to Apache. •  The Operator creates a controller-manager pod which will be created as a part of the Open Data Hub deployment. •  Users can then install the Airflow components they need from the available options (e.g. CeleryExecutor or KubernetesExecutor, Postgres deployment or MySQL deployment etc.)
  • 73. Apache Hive with OpenDataHub •  Hive was one of the first abstraction engines to be built on top of MapReduce. •  Started at Facebook to enable data analysts to analyse data in Hadoop by using familiar SQL syntax without having to learn how to write MapReduce. •  Hive is an essential tool in the Hadoop ecosystem that provides an SQL dialect for querying data stored in HDFS, in other file systems that integrate with Hadoop such as MapR-FS and Amazon’s S3, and in databases like HBase (the Hadoop database) and Cassandra. •  Hive is a Hadoop based system for querying and analysing large volumes of structured data which is stored on HDFS. •  Hive is a query engine built to work on top of Hadoop that can compile queries into MapReduce jobs and run them on the cluster.
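To make "compile queries into MapReduce jobs" concrete, the sketch below hand-writes the map, shuffle and reduce phases that a query such as `SELECT country, COUNT(*) FROM visits GROUP BY country` corresponds to. It runs in a single process and is only an analogy for what Hive generates; the table name, column name and rows are invented for the example.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["country"], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; summing the ones implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


visits = [{"country": "NO"}, {"country": "AU"}, {"country": "NO"}]
print(reduce_phase(shuffle(map_phase(visits))))
```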
  • 74. Data Platform Operator Hub - operatorhub.io OpenShift Ready
  • 75. Kubernetes Ready ML and AI Platform Operator Hub - operatorhub.io
  • 76. Kubernetes Ready Upstream Kubeflow Midstream OpenDataHub OpenShift Ready Operator Hub - operatorhub.io Kubeflow OpenDataHub Open Source End To End Data and AI Platform RedHat MarketPlace https://marketplace.redhat.com/en-us
  • 77. Coming Next: Kubeflow Dojo https://github.com/kubeflow https://github.com/opendatahub-io https://github.com/IBM/ KubeflowDojo
  • 78. Kubeflow Dojo: Prerequisites •  Knowledge of Kubernetes, watch the dojo for Kubernetes project with the IBM internal link or external link •  Access to a Kubernetes cluster, either minikube or remote hosted •  Source code control and development with git and github, watch the presentation with the IBM internal link or external link for git and external link for pull requests •  Get familiar with golang language, watch the introduction dojo with the IBM internal link or external link •  (optional) Knowledge of Istio and knative •  If you have more time, o  Read Kubeflow document to learn more about Kubeflow project o  Browse through Kubeflow community github
  • 79. Kubeflow Dojo: Tips for success •  Access to a Kubernetes cluster •  Minimal spec: 8 vCPU, 16 GB RAM and at least 50 GB disk for the Docker registry •  On IBM Kubernetes Service, provision the cluster with machine type b2c.4x16 and 2 worker nodes •  Follow the Kubeflow documentation to prepare your cluster •  On an IKS cluster, follow this link to install the IBM Cloud CLI and helm, then set up IBM Cloud Block Storage as the default storage class