Patroni: Kubernetes-native PostgreSQL companion

PGConf APAC 2018
Singapore
ALEXANDER KUKUSHKIN
23-03-2018

2
ABOUT ME
Alexander Kukushkin
Database Engineer @ZalandoTech
Email: alexander.kukushkin@zalando.de
Twitter: @cyberdemn

3
ZALANDO
15 markets
6 fulfillment centers
20 million active customers
3.6 billion € net sales 2016
165 million visits per month
12,000 employees in Europe

4
FACTS & FIGURES
> 300 databases
on premise
> 150
on AWS EC2
> 200
on K8S

5
AGENDA
● Bot pattern and Patroni
● Postgres-operator
● Patroni on Kubernetes, first attempt
● Kubernetes-native Patroni
● Live-demo

6
Bot pattern and Patroni
● small python daemon
● implements “bot” pattern
● runs next to PostgreSQL
● decides on promotion/demotion
● uses DCS to run leader election and keep cluster state

7
DCS
● Distributed Consensus/Configuration Store (Key-Value)
● Uses Raft (Etcd, Consul) or ZAB (ZooKeeper)
● A write succeeds only if a majority of nodes acknowledge it (quorum)
● Supports atomic operations (CompareAndSet)
● Can expire objects after a TTL
http://thesecretlivesofdata.com/raft/
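The two DCS features the bot pattern relies on, compare-and-set and TTL expiry, can be illustrated with a toy in-memory store. This is a minimal sketch, not how Etcd, Consul or ZooKeeper work internally: the class name `TinyDCS` and its methods are hypothetical, and a real DCS replicates every write to a quorum of nodes.

```python
import time


class TinyDCS:
    """Toy in-memory key-value store with TTL expiry and compare-and-set.

    Illustration only: there is no replication and no quorum here.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily drop the key once its TTL has passed
            return None
        return value

    def compare_and_set(self, key, value, prev_value, ttl=None):
        """Write only if the current value equals prev_value (None = key absent)."""
        if self.get(key) != prev_value:
            return False
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)
        return True
```

Calling `compare_and_set("/leader", "A", prev_value="A", ttl=30)` repeatedly keeps the key alive; if the holder stops calling it, the key disappears after the TTL and another node can create it with `prev_value=None`.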

8
Bot pattern: leader alive
Primary
NODE A
Standby
NODE B
Standby
NODE C
UPDATE(“/leader”, “A”, ttl=30, prevValue=“A”) -> Success
WATCH (/leader)
WATCH (/leader)
/leader: “A”, ttl: 30

9
Bot pattern: master dies, leader key holds
Primary
Standby
Standby
WATCH (/leader)
WATCH (/leader)
NODE A
NODE B
NODE C

10
Bot pattern: leader key expires
Standby
Standby
Notify (/leader, expired=true)
Notify (/leader, expired=true)
NODE B
NODE C

11
Bot pattern: who will be the next master?
Standby
Standby
Node B:
GET A:8008/patroni -> failed/timeout
GET C:8008/patroni -> wal_position: 100
Node C:
GET A:8008/patroni -> failed/timeout
GET B:8008/patroni -> wal_position: 100
NODE B
NODE C
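The check each standby performs here can be sketched as a pure function. This is an assumption-laden illustration, not Patroni's actual code: the function name `should_race_for_leader` is hypothetical, and the positions are assumed to have been fetched already from each peer's `:8008/patroni` endpoint, with `None` standing for a failed or timed-out request.

```python
def should_race_for_leader(my_wal_position, peer_wal_positions):
    """Return True if no reachable peer is ahead of this node.

    peer_wal_positions maps a peer name to its reported WAL position, or to
    None when the request failed or timed out (like node A on the slide).
    """
    reachable = [pos for pos in peer_wal_positions.values() if pos is not None]
    return all(my_wal_position >= pos for pos in reachable)
```

With the values from the slide, both B and C are at position 100 and A is unreachable, so both nodes would decide to join the leader race.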

12
Bot pattern: leader race among equals
Standby
Standby
Node B: CREATE (“/leader”, “B”, ttl=30, prevExists=False) -> FAIL
Node C: CREATE (“/leader”, “C”, ttl=30, prevExists=False) -> SUCCESS
/leader: “C”, ttl: 30
NODE B
NODE C
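Why only one of two equal candidates can win is easiest to see in a small simulation. The sketch below is a toy, assuming a hypothetical `LeaderKey` class that mimics `CREATE(..., prevExists=False)` with a lock; the TTL is left out to keep the example focused on the atomic create.

```python
import threading


class LeaderKey:
    """Toy stand-in for the /leader key with create-if-absent semantics."""

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def create(self, name):
        """Mimic CREATE(..., prevExists=False): succeed only if the key is absent."""
        with self._lock:
            if self.holder is not None:
                return False
            self.holder = name
            return True


def run_race(candidates):
    """Let every candidate try to create the key at once; return (holder, results)."""
    key = LeaderKey()
    results = {}

    def attempt(name):
        results[name] = key.create(name)

    threads = [threading.Thread(target=attempt, args=(n,)) for n in candidates]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return key.holder, results
```

`run_race(["B", "C"])` always reports exactly one successful create; which node wins depends on scheduling, just as it depends on request timing in a real DCS.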

13
Bot pattern: promote and continue replication
Standby
Primary
/leader: “C”, ttl: 30
WATCH (/leader)
promote
NODE B
NODE C

14
DCS STRUCTURE
● /service/cluster-name/
○ config {"postgresql":{"parameters":{"max_connections":300}}}
○ initialize “6303731710761975832” (database system identifier)
○ members/
■ dbnode1 {"role":"replica","state":"running","conn_url":"postgres://172.17.0.2:5432/postgres"}
■ dbnode2 {"role":"master","state":"running","conn_url":"postgres://172.17.0.3:5432/postgres"}
○ leader dbnode2
○ optime/
■ leader “67393608” # ← absolute WAL position
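To show how this layout can be consumed, here is a small sketch that resolves the current primary's connection URL. The flat dict is an assumption made for illustration (a real DCS client returns keys under `/service/cluster-name/`), and `leader_conn_url` is a hypothetical helper; the member values are copied from the slide.

```python
import json

# Keys and values mirror the slide; the flat dict layout is an assumption.
cluster = {
    "leader": "dbnode2",
    "members": {
        "dbnode1": '{"role":"replica","state":"running",'
                   '"conn_url":"postgres://172.17.0.2:5432/postgres"}',
        "dbnode2": '{"role":"master","state":"running",'
                   '"conn_url":"postgres://172.17.0.3:5432/postgres"}',
    },
}


def leader_conn_url(cluster):
    """Return the connection URL of the member named in the leader key."""
    leader = cluster.get("leader")
    raw = cluster["members"].get(leader)
    return None if raw is None else json.loads(raw)["conn_url"]
```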

16
KUBERNETES
“Kubernetes is an open-source system for automating deployment, scaling,
and management of containerized applications.
It groups containers that make up an application into logical units (Pods) for
easy management and discovery. Kubernetes builds upon 15 years of
experience of running production workloads at Google, combined with
best-of-breed ideas and practices from the community.”
kubernetes.io

17
Spilo & Patroni on K8S v1
Node
Pod: demo-0
role: replica
PersistentVolume
PersistentVolume
Node
Pod: demo-1
role: master
StatefulSet: demo
Secret: demo
UPDATE()
WATCH()
Service: demo-replica
labelSelector: role=replica
Service: demo
labelSelector: role=master

18
Spilo & Patroni on K8S v1
● We will deploy Etcd on Kubernetes
● Deploy Spilo with PetSet (old name for StatefulSet)
● And quickly hack a callback script for Patroni, which will
label the Pod we are running in with the current role
(master, replica)
● And use Services with labelSelectors for traffic routing
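A callback of this kind might look roughly like the sketch below. It assumes Patroni invokes callbacks as `<script> <action> <role> <cluster-name>`, and it only builds and prints the label patch: actually sending it to the Kubernetes API for the Pod is left out, and the function names are hypothetical.

```python
import json


def role_label_patch(role):
    """Build a merge-patch body that sets the Pod's `role` label."""
    return {"metadata": {"labels": {"role": role}}}


def main(argv):
    # Assumption: argv is [script, action, role, cluster_name].
    if len(argv) != 4:
        return 1
    _action, role, _cluster = argv[1:4]
    # A real script would send this patch to the Kubernetes API for its own Pod;
    # here it is only printed.
    print(json.dumps(role_label_patch(role)))
    return 0
```

Once the Pod carries `role=master` or `role=replica`, the Services' labelSelectors route traffic to it without further work.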

19
Can we get rid of Etcd?
● Use labelSelector to find all Kubernetes objects
associated with the given cluster
○ Pods - cluster members
○ ConfigMaps or Endpoints to keep configuration
● On every iteration of the HA loop we will update labels and
metadata on the objects (the same way we update
keys in Etcd)
● It is even possible to do CAS operations using the K8S API
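The CAS behaviour comes from the `resourceVersion` field that every Kubernetes object carries: an update that sends a stale version is rejected with a conflict. The sketch below simulates that rule in memory; `FakeKubeObjectStore` and `Conflict` are hypothetical names, and no real Kubernetes client is involved.

```python
class Conflict(Exception):
    """Raised when the caller's resourceVersion is stale (HTTP 409 in the real API)."""


class FakeKubeObjectStore:
    """In-memory stand-in for one Kubernetes object with optimistic concurrency."""

    def __init__(self, annotations=None):
        self._version = 1
        self._annotations = dict(annotations or {})

    def read(self):
        return {"resourceVersion": str(self._version),
                "annotations": dict(self._annotations)}

    def replace(self, annotations, resource_version):
        # The write is applied only if nobody changed the object since it was read.
        if resource_version != str(self._version):
            raise Conflict("object was modified; re-read and retry")
        self._annotations = dict(annotations)
        self._version += 1
        return self.read()
```

If two members read the same version and both try to write, the second `replace` raises `Conflict`, which plays the role of a failed CompareAndSet in Etcd.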

20
No K8S API for expiring objects
How to do leader election?

21
Do it on the client side!
● Leader should periodically update ConfigMap or Endpoint
○ Update must happen as CAS operation
○ Demote to read-only in case of failure
● All other members should check that leader ConfigMap (or
Endpoint) is being updated
○ If there are no updates during TTL => do leader election
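The member-side check can be sketched as follows. This is one possible approach under stated assumptions, not necessarily what Patroni does: the hypothetical `renew_marker` is whatever value the leader rewrites on each update (for example a timestamp annotation), and the observer measures elapsed time only on its own clock, so absolute clock offsets between Pods do not matter while differing clock rates still do.

```python
import time


class LeaderObserver:
    """Tracks whether the leader object keeps changing, using only the local clock."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._last_seen = None    # last renew marker read from the leader object
        self._last_change = None  # local time when that marker last changed

    def observe(self, renew_marker):
        """Record the marker just read; return True if the leader looks expired."""
        now = self._clock()
        if self._last_change is None or renew_marker != self._last_seen:
            self._last_seen = renew_marker
            self._last_change = now
            return False
        return now - self._last_change >= self._ttl
```

A member would call `observe()` on every HA loop iteration and start a leader election, using a CAS update on the same object, once it returns `True`.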

22
Kubernetes-native Patroni
Node
Pod: demo-0
role: replica
PersistentVolume
PersistentVolume
Node
Pod: demo-1
role: master
StatefulSet: demo
Endpoint: demo
Service: demo
Secret: demo
UPDATE()
WATCH()
Endpoint: demo-config
Service: demo-replica
labelSelector: role=replica

24
Kubernetes API as DCS
PROS
● No dependency on Etcd
● When using an Endpoint for leader election we can also maintain subsets with the IP of the leader Pod
● 100% Kubernetes-native solution
CONS
● Can’t tolerate an arbitrary clock skew rate
● OpenShift doesn’t allow putting an IP from the Pods range into the Endpoint
● The SLA for the K8S API on GCE promises only 99.5% availability

26
How to deploy it
● kubectl create -f your-cluster.yaml
● Use Patroni Helm Chart + Spilo
● Use postgres-operator

27
POSTGRES-OPERATOR
● Creates CustomResourceDefinition Postgresql and watches it
● When new Postgresql object is created - deploys a new cluster
○ Creates Secrets, Endpoints, Services and StatefulSet
● When Postgresql object is updated - updates StatefulSet
○ and does a rolling upgrade
● Periodically syncs running clusters with the manifests
● When Postgresql object is deleted - cleans everything up

30
PostgreSQL
manifest
Stateful set
Spilo pod
Kubernetes cluster
PATRONI
Postgres
operator
pod
Endpoint
Service
Client
application
Postgres
operator
config map
Cluster
secrets
Database
deployer
create
create
create
watch
deploy
Update with
actual master
role

31
Monitoring & Backups
● Things to monitor:
○ Pods status (via K8S API)
○ Patroni & PostgreSQL state
○ Replication state and lag
● Always do Backups!
○ And always test them!
GET http://$POD_IP:8008/patroni
for every Pod in the cluster, check
that state=running and compare
xlog_position with the master
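The check described above could be sketched like this. It assumes the status documents have already been fetched from each Pod's `:8008/patroni` endpoint and flattened into dicts with the hypothetical keys `role`, `state` and `xlog_position`; the real response layout may differ, so treat the field names as placeholders.

```python
def check_cluster(statuses):
    """Summarise per-Pod status documents already fetched from :8008/patroni.

    statuses maps Pod name -> dict with the hypothetical keys `role`,
    `state` and `xlog_position` (bytes). Returns (not_running, lag_by_pod).
    """
    not_running = sorted(name for name, s in statuses.items()
                         if s.get("state") != "running")
    masters = [s for s in statuses.values() if s.get("role") == "master"]
    if len(masters) != 1:
        return not_running, {}  # no single master to compare against
    master_pos = masters[0]["xlog_position"]
    lag = {name: master_pos - s["xlog_position"]
           for name, s in statuses.items()
           if s.get("role") != "master" and "xlog_position" in s}
    return not_running, lag
```

An alerting rule could then fire when `not_running` is non-empty or when any value in the lag mapping exceeds a chosen threshold.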

32
Our learnings
● We run Kubernetes on top of AWS infrastructure
○ Availability of K8S API in our case is very close to 100%
○ PersistentVolume (EBS) attach/detach is sometimes buggy and slow
● Kubernetes cluster upgrade
○ Requires rotating all nodes and can cause multiple switchovers
■ Thanks to postgres-operator this is solved; now we need only one
● Kubernetes node autoscaler
○ Sometimes terminates the nodes where Spilo/Patroni/PostgreSQL runs
■ Patroni handles it gracefully, by doing a switchover

33
LINKS
● Patroni: https://github.com/zalando/patroni
● Patroni Documentation: https://patroni.readthedocs.io
● Spilo: https://github.com/zalando/spilo
● Helm chart: https://github.com/unguiculus/charts/tree/feature/patroni/incubator/patroni
● Postgres-operator: https://github.com/zalando-incubator/postgres-operator
