How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How Zalando runs Kubernetes
clusters at scale on AWS
Henning Jacobs
OPN211
Senior Principal
Zalando SE
3
THE EUROPEAN
ONLINE PLATFORM
FOR FASHION
4
~ 5.4 billion EUR
revenue 2018
> 300
million
visits
per
month
~ 14,000
employees in
Europe
> 80%
of visits via
mobile devices
> 28
million
active customers
> 400,000
product choices
> 2,000
brands
17
countries
as of June 2019
ZALANDO AT A GLANCE
5
2015: JOURNEY INTO THE CLOUD
AWS
STUPS
DOCKER
DEPLOY
SSH
ACCESS
AUDIT
REPORTS
FULL AWS
ACCESS
Teams have
admin access
& full
responsibility
6
2015: ISOLATED AWS ACCOUNTS
Internet
*.abc.example.org *.xyz.example.org
Team ABC Team XYZ
EC2 EC2
ELB ELB
EC2
7
INFRASTRUCTURE @ ZALANDO
STUPS (toolset around AWS)
• AWS accounts per team.
• All instances must run the same AMI.
• PowerUser access to Production.
• You build it, you run EVERYTHING.
Kubernetes
• Clusters per product (multiple teams).
• Instances are not managed by teams.
• Hands off approach.
• A lot of stuff out of the box.
8
2019: SCALE
140 Clusters
396 Accounts
9
2019: DEVELOPERS USING KUBERNETES
10
Platform
> 1100
developers
> 200
development teams
11
YOU BUILD IT, YOU RUN IT
The traditional model is that you take your software to the
wall that separates development and operations, and
throw it over and then forget about it. Not at Amazon.
You build it, you run it. This brings developers into
contact with the day-to-day operation of their software. It
also brings them into day-to-day contact with the
customer.
- A Conversation with Werner Vogels, ACM Queue, 2006
12
ON-CALL: YOU OWN IT, YOU RUN IT
When things are broken,
we want people with the best
context trying to fix things.
- Blake Scrivener, Netflix SRE Manager
13
GOALS
• No manual operations
• No pet clusters
• Reliability
• Autoscaling
• Latest Kubernetes
• Cost efficient
14
ARCHITECTURE
Pairs of clusters, each cluster in isolated account
AWS Acc. foobar-test
Cluster
foobar-test
AWS Acc. foobar
Cluster
foobar
15
CloudFormation stacks, node pools w/ self-baked Ubuntu AMI
ARCHITECTURE
etcd
Master
Nodes
Worker Nodes
16
ARCHITECTURE
Master
Nodes
Worker
Nodes
https://cluster-id.example.org
AWS ELB
AZ a AZ b AZ c
17
CLUSTER METADATA (CLUSTER-REGISTRY)
clusters:
- id: "cluster-id"
  api_server_url: "https://cluster-id.example.org"
  config_items:
    key: "value"
  environment: "test"
  region: "eu-central-1"
  lifecycle_status: "ready"
  node_pools:
  - name: "worker-pool"
    instance_type: "m5.large"
    min_size: 3
    max_size: 20
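A minimal sketch of how tooling might query such registry entries. The field names follow the slide; the in-memory list and both helper functions are hypothetical and not part of the Cluster Registry API.

```python
# Hypothetical in-memory copy of cluster registry entries (field names from the slide).
CLUSTERS = [
    {
        "id": "cluster-id",
        "environment": "test",
        "region": "eu-central-1",
        "lifecycle_status": "ready",
        "node_pools": [
            {"name": "worker-pool", "instance_type": "m5.large",
             "min_size": 3, "max_size": 20},
        ],
    },
]


def ready_clusters(clusters, environment):
    """Return IDs of clusters in the given environment that are marked 'ready'."""
    return [
        c["id"]
        for c in clusters
        if c["environment"] == environment and c["lifecycle_status"] == "ready"
    ]


def invalid_node_pools(cluster):
    """Return names of node pools whose size bounds are inconsistent."""
    return [
        p["name"]
        for p in cluster["node_pools"]
        if p["min_size"] < 0 or p["min_size"] > p["max_size"]
    ]
```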
18
CLUSTER CONFIGURATION
github.com/zalando-incubator/kubernetes-on-aws
cluster
├── cluster.yaml       # Kubernetes cluster stack
├── etcd-cluster.yaml  # etcd cluster stack
├── manifests
│   ├── ...
└── node-pools         # master/worker nodes
    ├── ...
19
KUBERNETES CLUSTER MANIFESTS
github.com/zalando-incubator/kubernetes-on-aws
20
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
21
CLUSTER UPGRADE
FLOW
22
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel  Description                                     Clusters
dev      Development and playground clusters             3
alpha    Main infrastructure cluster (important to us)   1
beta     Non-prod clusters for the rest of the org       65+
stable   Production clusters                             65+
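A small sketch of the channel ordering, assuming changes are promoted through the channels in the order the table lists them (dev, alpha, beta, stable). The function name is hypothetical and the promotion order is an assumption drawn from the table, not a documented API.

```python
# Assumed promotion order, taken from the order of rows in the channel table.
CHANNELS = ["dev", "alpha", "beta", "stable"]


def next_channel(channel):
    """Return the channel a change would reach next, or None after 'stable'."""
    index = CHANNELS.index(channel)  # raises ValueError for unknown channels
    return CHANNELS[index + 1] if index + 1 < len(CHANNELS) else None
```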
23
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
24
E2E TESTS
Conformance Tests: 159 ✓
Upstream Kubernetes e2e conformance tests
Zalando Tests (custom): 17 ✓
Custom tests for ingress, external-dns, PSP etc.
StatefulSet Tests: 2 ✓
Rolling update of stateful sets including volume mounting
25
RUNNING E2E TESTS
Testing dev to alpha upgrade
branch: alpha (base) → branch: dev (head)
Create Cluster → Update Cluster → Run e2e tests → Delete Cluster
(Diagram: control plane and two nodes, before and after the update)
26
UPGRADING NODES
27
NAÏVE NODE UPGRADE STRATEGY
Auto Scaling Group
Min: 3
Max: 9
Current: 5
Desired: 5
28
NAÏVE NODE UPGRADE STRATEGY
Auto Scaling Group
Min: 6
Max: 6
Current: 5
Desired: 6
Set ASG size to current + 1
29
NAÏVE NODE UPGRADE STRATEGY
Auto Scaling Group
Min: 6
Max: 6
Current: 6
Desired: 6
drain
Get a new instance
drain
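The naïve strategy above can be simulated in a few lines. This is only an illustration: node names are made up, and removing a node from the list stands in for draining it and letting the Auto Scaling Group replace it.

```python
def naive_upgrade(old_nodes):
    """Simulate the naive strategy: add one new node, then drain one old node.

    Returns the ordered list of steps and the final node list.
    """
    nodes = list(old_nodes)
    steps = []
    for i, old in enumerate(old_nodes):
        new = f"new-{i}"           # hypothetical name for the fresh instance
        nodes.append(new)          # ASG size set to current + 1
        steps.append(("add", new))
        nodes.remove(old)          # drain the old node (assumed to leave the pool)
        steps.append(("drain", old))
    return steps, nodes
```

Nothing in this loop looks at what runs on the node being drained.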
30
PROBLEMS WITH THE NAÏVE STRATEGY
What about stateful applications like Postgres?
Node
master Node
Node
replica
replica
drain
Postgres cluster unavailable :(
31
STATEFUL WORKLOADS
(POSTGRES)
32
POSTGRES OPERATOR
github.com/zalando-incubator/postgres-operator
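(Diagram: three nodes each run a pg pod with role=master, role=replica and role=replica; a fourth node runs the postgres operator. Evicting the master is rejected (✘). The replicas are evicted first, then a replica is promoted to role=master and the old master becomes role=replica, after which the drain succeeds (✓).)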
33
POSTGRES OPERATOR
github.com/zalando-incubator/postgres-operator
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: "postgres-cluster"
spec:
  minAvailable: 1
  selector:
    matchLabels:
      application: "postgres-cluster"
      role: "master"
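The effect of `minAvailable: 1` on a selector that matches only the master pod can be sketched as a one-line rule. The function name is hypothetical; it mirrors the general PodDisruptionBudget idea that an eviction is allowed only if enough healthy pods remain afterwards.

```python
def eviction_allowed(healthy_matching_pods, min_available):
    """Allow evicting one pod only if enough healthy matching pods would remain."""
    return healthy_matching_pods - 1 >= min_available
```

With exactly one pod labelled `role: master`, `eviction_allowed(1, 1)` is false, so a drain cannot evict the master until the label has moved to another pod.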
34
ROLLING UPGRADE OF NODES
Node Pool
az-1a
PVs
PreferNoSchedule
drain
az-1b
PVs
az-1c
PVs
PreferNoSchedule
PreferNoSchedule PreferNoSchedule
35
POSTGRES OPERATOR
Application to manage
PostgreSQL clusters on
Kubernetes
>500
clusters running
on Kubernetes
github.com/zalando/postgres-operator
Elasticsearch in Kubernetes
Elasticsearch
2,500 vCPUs
1 TB RAM
github.com/zalando-incubator/es-operator/
37
SLAS FOR CLUSTER UPDATES
• Respect PodDisruptionBudgets
• Force-terminate Pods after 3 days (or 8h on test)
• Cluster updates can be blocked anytime!
zkubectl cluster-update block [+ REASON]
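A sketch of the force-termination rule, under stated assumptions: the environment names, the idea that the clock starts when the drain begins, and the `blocked` flag are all illustrative and not taken from the Cluster Lifecycle Manager code.

```python
from datetime import datetime, timedelta

# Grace periods from the slide: 3 days in production, 8 hours on test clusters.
FORCE_TERMINATE_AFTER = {
    "production": timedelta(days=3),
    "test": timedelta(hours=8),
}


def should_force_terminate(environment, drain_started, now, blocked=False):
    """Return True once the grace period has passed and updates are not blocked."""
    if blocked:  # assumption: a blocked cluster update never force-terminates pods
        return False
    return now - drain_started >= FORCE_TERMINATE_AFTER[environment]
```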
38
DEPLOY & USER
INTERFACE
39
APP DEPLOYMENT CONFIGURATION
├── deploy/apply
│   ├── deployment.yaml
│   ├── credentials.yaml  # Zalando IAM
│   ├── ingress.yaml
│   └── service.yaml
└── delivery.yaml         # Zalando CI/CD
40
APP INGRESS.YAML
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
  # DNS name your application should be exposed on
  - host: "myapp.foo.example.org"
    http:
      paths:
      - backend:
          serviceName: "myapp"
          servicePort: 80
41
CONTINUOUS DELIVERY PLATFORM
42
CDP: DEPLOY
"glorified kubectl apply"
43
EMERGENCY ACCESS SERVICE
Emergency access by referencing Incident
zkubectl cluster-access request --emergency -i INC REASON
Privileged production access via 4-eyes
zkubectl cluster-access request REASON
zkubectl cluster-access approve USERNAME
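The two access paths can be sketched as a single decision function. This is an illustration only: the function and its parameters are hypothetical, and it assumes that emergency access needs an incident reference but no second person, while regular privileged access needs an approver other than the requester.

```python
def access_granted(requester, reason, incident=None, approver=None):
    """Decide whether privileged access is granted under the two rules above."""
    if not reason:
        return False  # both paths require a stated reason
    if incident is not None:
        return True   # emergency path: justified by the referenced incident
    # 4-eyes path: someone other than the requester must approve
    return approver is not None and approver != requester
```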
44
KUBERNETES WEB VIEW
kubectl get
pods,stacks,deploys,..
45
SEARCHING ACROSS 140+ CLUSTERS
codeberg.org/hjacobs/kube-web-view
47
UPGRADE TO KUBERNETES 1.14
"Found 1223 rows for 1 resource type in 148 clusters in 3.301 seconds."
48
SOME USE CASES
All Pending Pods across all clusters
49
AVOIDING
CONFIGURATION DRIFT
50
CLUSTER CONFIGURATION
Clusters look mostly the same, except:
• secrets, e.g. credentials for external logging provider
• node pools and their instance sizes
Cluster-specific config items are stored in Cluster Registry
51
CLUSTER AUTOSCALER
52
VERTICAL POD AUTOSCALER
• Prometheus
• External DNS
• Heapster / Metrics Server
• our ALB Ingress Controller
CPU/memory
53
VERTICAL POD AUTOSCALER
54
MONITORING &
COST EFFICIENCY
55
MONITORING SYSTEM - ZMON
• Dynamic entity registration
(clusters, pods, ..)
• Generic checks on entity attributes,
e.g. for all production clusters
"Less than 60% of worker nodes are ready"
• OpsGenie alerts
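The example check "Less than 60% of worker nodes are ready" could look roughly like this. It is a plain Python sketch, not ZMON's check syntax, and treating a cluster with no worker nodes as alerting is an assumption.

```python
def worker_nodes_alert(ready, total, threshold=0.6):
    """Return True when fewer than `threshold` of the worker nodes are ready."""
    if total == 0:
        return True  # assumption: no worker nodes at all should also alert
    return ready / total < threshold
```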
56
OPENTRACING
57
KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
58
RESOURCE REPORT: TEAMS
Sorting teams by
Slack Costs
github.com/hjacobs/kube-resource-report
59
KUBERNETES APPLICATION DASHBOARD
60
VERTICAL POD AUTOSCALER
limit/requests adapted by VPA
61
DOWNSCALING DURING OFF-HOURS
github.com/hjacobs/kube-downscaler
Weekend
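A sketch of the off-hours idea: keep the original replica count during weekday working hours and scale to zero otherwise. The default uptime window and the function name are assumptions for illustration; this is not how kube-downscaler itself is configured.

```python
from datetime import datetime, time


def desired_replicas(now, original_replicas, start=time(7, 30), end=time(20, 30)):
    """Return the original replica count during weekday uptime, else 0."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_uptime = is_weekday and start <= now.time() < end
    return original_replicas if in_uptime else 0
```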
62
KUBERNETES JANITOR
● TTL and expiry date annotations, e.g.
○ set time-to-live for your test deployment
● Custom rules, e.g.
○ delete everything without "app" label after 7 days
github.com/hjacobs/kube-janitor
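Both kinds of rule can be sketched as follows. The TTL format (a number followed by s, m, h, d or w) and all function names are assumptions made for this example; consult the kube-janitor documentation for its real annotation and rule syntax.

```python
import re
from datetime import datetime, timedelta

# Assumed TTL units for this sketch.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_ttl(ttl):
    """Parse a TTL such as '24h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhdw])", ttl)
    if not match:
        raise ValueError(f"invalid TTL: {ttl!r}")
    value, unit = match.groups()
    return timedelta(**{UNITS[unit]: int(value)})


def is_expired(created, ttl, now):
    """Return True if the resource has outlived its time-to-live."""
    return now - created > parse_ttl(ttl)


def matches_no_app_label_rule(labels, created, now):
    """Custom rule from the slide: no 'app' label and older than 7 days."""
    return "app" not in labels and now - created > timedelta(days=7)
```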
63
EC2 SPOT NODES
72% savings
64
OUR SETUP VS
VANILLA KUBERNETES
65
HOW MUCH DO WE DIVERGE?
• API access via Zalando OAuth
• CPU throttling disabled via Kubelet flag
• No memory overcommit (requests == limits)
• Ingress: External DNS, Skipper, AWS ALB
• Custom CRDs: Zalando OAuth, Postgres, StackSet
• Kubernetes Downscaler
• DNS setup (CoreDNS DaemonSet, ndots: 2)
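The "requests == limits" rule for memory can be checked per container. A minimal sketch, assuming a container spec as a plain dict; it compares the quantity strings literally, so equivalent values written differently (for example "1Gi" and "1024Mi") would be flagged.

```python
def memory_overcommitted(container):
    """Return True if the memory request is missing or differs from the limit."""
    resources = container.get("resources", {})
    request = resources.get("requests", {}).get("memory")
    limit = resources.get("limits", {}).get("memory")
    return request is None or request != limit
```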
66
INGRESS: ALB + SKIPPER
NODE Skipper
:9999
MyApp
10.2.1.2:8080
NODE Skipper
:9999
MyApp
10.2.0.2:8080
Service
(list of pod IPs -
endpoints)
MyApp
10.2.0.3:8080
ALB
:443
:80 - redirect
K8S network
EC2 network
TLS
HTTP
github.com/zalando/skipper
github.com/zalando-incubator/kube-ingress-aws-controller
67
DNS: COREDNS AS DAEMONSET
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
68
NON-PROD VS PROD
• Non-production similar to plain hosted Kubernetes
• Production:
• No write access (only via CI/CD)
• Compliance webhooks
• Require production-ready Docker images
69
COMPLIANCE FOR PRODUCTION
• Pods require application label pointing to application registry
⇒ establishes link to owning team
• Docker images must be built from master via CDP
NOTE: teams can freely choose their namespace(s)
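The application-label requirement could be enforced by a check like the one below. This is a hedged sketch of the rule only: the function name, the set of known applications, and the error messages are hypothetical, and the real compliance webhooks are not shown in the slides.

```python
def admission_violations(pod, known_applications):
    """Return a list of compliance violations for a pod manifest (as a dict)."""
    labels = pod.get("metadata", {}).get("labels", {})
    application = labels.get("application")
    if application is None:
        return ["missing 'application' label"]
    if application not in known_applications:
        # the label must point to an entry in the application registry
        return [f"unknown application {application!r}"]
    return []
```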
71
MONTHLY DEVELOPER NEWSLETTER
72
SUMMARY
• Seamless updates
• Avoid pet clusters
• Small disruptions are normal
• Automated cluster e2e tests
• Documentation & communication
73
FUTURE
• API version updates (1.16+)
• Improved Autoscaling
• Improved StackSet, Gradual Rollout
• Migrations
• Cost efficiency
• Looking at VPC CNI, AWS IAM, EKS, ...
74
KUBERNETES FAILURE STORIES
• Zalando's Failure Stories - KubeCon EU 2019
• Build Errors of Continuous Delivery Platform
• Total DNS outage in Kubernetes cluster
https://k8s.af
75
COMMON PITFALLS
• Insufficient e2e tests
• Readiness & Liveness Probes
• Resource Requests & Limits
• DNS
76
OPEN SOURCE & MORE
Cluster Config
github.com/zalando-incubator/kubernetes-on-aws
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
Ingress Controller for AWS
github.com/zalando-incubator/kube-ingress-aws-controller
Kubernetes Web View
codeberg.org/hjacobs/kube-web-view
More Zalando Tech Talks
github.com/zalando/public-presentations
Thank you!
Henning Jacobs
@try_except_
