Managing Flink on
Kubernetes
Anand Swaminathan (@anand12100)
Ketan Umare (@ketanumare)
April 2, 2019
Agenda
1. Kubernetes Primer: Quick introduction to concepts in Kubernetes
2. Background: Summary of Lyft’s legacy Flink deployment
3. Solution: Flink Kubernetes Operator
4. Demo
5. Ecosystem
6. Roadmap
About us
Kubernetes Primer
History
● Google’s internal infrastructure is containerized and runs on Borg/Omega
● Kubernetes (K8s) was open sourced in 2014 as a reincarnation of that internal infrastructure
● Kubernetes automates deployment, scaling, and management of
containerized apps
● Containers are scheduled based on resource requirements such as CPU, GPU, memory, and disk
Kubernetes Primer
Pods
● A Pod is a group of one or more
containers managed as one unit
● Pods have no durability guarantees
● Each Pod has a unique IP address
● Containers in a Pod can communicate
using localhost
● Multiple Pods can be located on the
same node (machine)
Kubernetes Primer
Other Concepts
● Deployment: an abstraction that enables rolling out
changes to a set of pods
● Service: an abstraction to access sets of pods, like a load
balancer within a k8s cluster
● Ingress: an abstraction to expose a service to the outside
world (HTTP/HTTPS)
● Controller: a reconciliation loop that drives current state
towards desired state
Kubernetes Primer
Architecture
Kubernetes Primer
Control Loops
● Control loops are a fundamental building block of
industrial control systems
● Desired State refers to the intended state as
requested
● Current/Observed State is the state of the
system as observed by the controller
● Controller runs control loops
● Drive Current State -> Desired State
● This is the cornerstone of Kubernetes
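The control-loop idea on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not Kubernetes code: the `Cluster` class, `reconcile`, and `run_until_converged` are hypothetical names invented here, and the "system" is just a pod counter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """Toy stand-in for the system under control (hypothetical, not a Kubernetes API)."""
    running_pods: int = 0


def reconcile(cluster: Cluster, desired_pods: int) -> str:
    """Run one pass of the loop: observe, compare, take one corrective step."""
    observed = cluster.running_pods      # Current/Observed State
    if observed < desired_pods:
        cluster.running_pods += 1        # act: start one pod
        return "scale-up"
    if observed > desired_pods:
        cluster.running_pods -= 1        # act: stop one pod
        return "scale-down"
    return "no-op"                       # Current State == Desired State


def run_until_converged(cluster: Cluster, desired_pods: int, max_passes: int = 100) -> List[str]:
    """Repeat reconcile passes until no action is needed; return the actions taken."""
    actions = []
    for _ in range(max_passes):
        action = reconcile(cluster, desired_pods)
        if action == "no-op":
            break
        actions.append(action)
    return actions


cluster = Cluster(running_pods=0)
print(run_until_converged(cluster, desired_pods=3))

cluster.running_pods = 1                 # simulate two pods crashing
print(run_until_converged(cluster, desired_pods=3))
```

Each pass takes a single small step rather than trying to fix everything at once, so the loop copes with the state changing underneath it between passes.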
Kubernetes Primer
Custom Resources
● Custom Resource Definitions (CRDs) allow extending the Kubernetes API
● Custom resources are optional extensions
● Custom resources can be added/removed dynamically
● They can be manipulated using known tools - kubectl & kube clients
● State stored in etcd
● Custom control loops (controllers) are used to manage the state of the
resource.
● A custom resource (an instance of a CRD) is essentially the desired state.
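To make the split between desired and observed state concrete, here is a minimal Python sketch that models a custom resource as a plain dictionary. The `Widget` kind, the `example.com/v1` group, and every field name are made up for illustration; only the general `spec`/`status` convention comes from Kubernetes.

```python
import copy

# Hypothetical custom resource; group, kind, and fields are invented for this sketch.
widget = {
    "apiVersion": "example.com/v1",
    "kind": "Widget",
    "metadata": {"name": "demo"},
    "spec": {"replicas": 2, "image": "widget:1.0"},  # desired state, written by the user
    "status": {"replicas": 0, "image": None},        # observed state, written by the controller
}


def diff_spec_status(resource):
    """Return {field: (observed, desired)} for spec fields that have not converged."""
    spec = resource["spec"]
    status = resource.get("status", {})
    return {k: (status.get(k), v) for k, v in spec.items() if status.get(k) != v}


def controller_pass(resource):
    """Pretend to act on every difference, then record the new observed state."""
    updated = copy.deepcopy(resource)
    status = updated.setdefault("status", {})
    for field, (_, desired) in diff_spec_status(resource).items():
        status[field] = desired          # a real controller would create or update objects here
    return updated


print(diff_spec_status(widget))
print(diff_spec_status(controller_pass(widget)))
```

The user only ever edits `spec`; the custom controller is the only writer of `status`, which is what lets standard tools such as kubectl work on the resource without knowing anything about the application.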
Kubernetes Primer
Operators
● Controller + CRD = Kubernetes Operator
● Term coined by CoreOS in 2016
● Manages a complex application’s lifecycle on
Kubernetes
● Core library for authoring operators:
kubernetes-sigs/controller-runtime
Background
OK, how does this relate?
● At Lyft we started working on Flyte, a modern take on pipelines/workflows
● Orchestration is pervasive throughout various sectors of our industry
○ Machine learning
○ Data engineering and processing
○ ETL
● Kubernetes has a solution to many of our problems
○ Deployment, versioning, cluster management, etc.
● In parallel, the Streaming Platform team started working on Flink for streaming
applications
Background
Legacy deployment of Flink @Lyft
● Hosted on AWS
● Separate AutoScalingGroups for Task Managers and Job Managers
● Machines provisioned and bootstrapped by SaltStack
● Every deployment needs provisioning of machines
● Users started running multiple jobs in the same Flink Cluster
● Multi-tenancy hell!
Introducing Flink-k8s-operator
Goals
● Abstract out the complexity from application developers
○ Hosting
○ Configuration
○ Management
● Separate Flink cluster for each Flink application
● Deploy and rollback support
● Support Flink application updates, such as scaling
● Simplified interface for instituting best practices
● Scale to 100s of Flink applications
Solution
Flink Operator - CRD
● Each custom resource corresponds
to a Flink application
● Each Flink application runs a single
Flink job
● Docker image should be runnable
Solution
Architecture
Solution
Operator Walkthrough
● New: Creates a new Flink cluster in K8s
● Starting: Waits for all the pods to come up
● Ready: Polls the Flink JobManager REST API and submits a new job
● Running: Monitors the running job and checks whether the application has changed
Solution
Operator Walkthrough
● Running: Operator detects the update to the CRD
● Updating: If needed, updates the cluster and cancels the job with a savepoint
● Savepointing: Waits for the savepoint to succeed, then updates the savepoint location in the CRD
● New: Brings up a new cluster and tries to transition to Running
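The two walkthrough slides describe a small state machine, which can be sketched in Python as below. This is an assumption-laden illustration rather than the operator's actual implementation: the phase names come from the slides, while `Observed`, its fields, `next_phase`, and `walk` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    NEW = "New"
    STARTING = "Starting"
    READY = "Ready"
    RUNNING = "Running"
    UPDATING = "Updating"
    SAVEPOINTING = "Savepointing"


@dataclass
class Observed:
    """Facts gathered on each pass; all field names are hypothetical."""
    pods_ready: bool = False
    job_submitted: bool = False
    spec_changed: bool = False
    savepoint_done: bool = False


def next_phase(phase: Phase, obs: Observed) -> Phase:
    """Return the phase to move to, or the same phase if we must keep waiting."""
    if phase is Phase.NEW:
        return Phase.STARTING                       # cluster creation requested
    if phase is Phase.STARTING:
        return Phase.READY if obs.pods_ready else Phase.STARTING
    if phase is Phase.READY:
        return Phase.RUNNING if obs.job_submitted else Phase.READY
    if phase is Phase.RUNNING:
        return Phase.UPDATING if obs.spec_changed else Phase.RUNNING
    if phase is Phase.UPDATING:
        return Phase.SAVEPOINTING                   # cancel-with-savepoint requested
    if phase is Phase.SAVEPOINTING:
        return Phase.NEW if obs.savepoint_done else Phase.SAVEPOINTING
    raise ValueError(f"unknown phase: {phase}")


def walk(phase: Phase, observations):
    """Apply next_phase once per observation and return every phase visited."""
    visited = [phase]
    for obs in observations:
        phase = next_phase(phase, obs)
        visited.append(phase)
    return visited


history = walk(Phase.NEW, [
    Observed(),                     # New -> Starting
    Observed(pods_ready=True),      # Starting -> Ready
    Observed(job_submitted=True),   # Ready -> Running
    Observed(spec_changed=True),    # Running -> Updating
    Observed(),                     # Updating -> Savepointing
    Observed(savepoint_done=True),  # Savepointing -> New (fresh cluster)
])
print(" -> ".join(p.value for p in history))
```

Keeping the transition function free of side effects makes each phase easy to test in isolation; in a real operator the side effects (creating the cluster, calling the REST API) would happen around it.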
Demo
Ecosystem
Deployment @Lyft
● Jenkins based deployment
● Each stage creates or updates the resource in Kubernetes
Ecosystem
Future Extensions
Roadmap
Open Source
● Last week of April*
● Project status: Alpha
● @Lyft:
○ Active development and testing in staging.
● Future
○ Flink Job failure handling
○ Tooling to manage CRD
Coming soon: https://github.com/lyft/flinkk8soperator
We’re Hiring! Apply at www.lyft.com/careers
● Data Engineering
○ Engineering Manager: San Francisco
○ Software Engineer: San Francisco, Seattle, & New York City
● Data Infrastructure
○ Engineering Manager: San Francisco
○ Software Engineer: San Francisco & Seattle
● Experimentation
○ Software Engineer: San Francisco
● Streaming
○ Software Engineer: San Francisco
● Observability
○ Software Engineer: San Francisco
Thank you
Questions please!
Background
Example of Deployment
● User requests a Deployment from the master
● Master accepts the request
● Desired State: 1 Pod running
● Current State: 0 Pods running
Background
Kubernetes 101
1. Master requests Pod creation
○ Current State: Deployment unhealthy
2. Master receives the pod-created event
○ Current State: Deployment healthy
3. If the pod later crashes or dies
○ Current State: Deployment unhealthy
4. Go to 1

Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOperator - Anand Swaminathan & Ketan Umare
