GitLab PostgresMortem:
Lessons Learned
Alexey Lesovsky
alexey.lesovsky@dataegret.com
dataegret.com
01 31 January events
02 Failure's key points
03 Preventative measures
https://goo.gl/GO5rYJ

01 31 January events
17:20 - an LVM snapshot of the production db was taken.
19:00 - database load increased due to spam.
23:00 - secondary's replication process started to lag behind.
23:30 - PostgreSQL database directory was wiped.

02 Failure's key points
1. LVM snapshots and staging provisioning.
2. When a replica starts to lag.
3. Do pg_basebackup properly, part 1.
4. max_wal_senders was exceeded, but how?
5. max_connections = 8000.
6. pg_basebackup «stuck»: do pg_basebackup properly, part 2.
7. strace: a good thing in the wrong place.
8. rm or not rm?
9. A bit about backup.
10. Different PG versions in production.
11. Broken mail.

Staging based on LVM snapshots
LVM snapshots put extra load on the underlying storage.
Provision staging from a backup instead.

When a replica starts to lag
Re-initialize the standby.
Monitor replication with pg_stat_replication.
Use wal_keep_segments while troubleshooting.
Use a WAL archive.
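
A minimal sketch of how lag can be derived from pg_stat_replication, assuming you already have two LSN strings in hand: the primary's current WAL position and a standby's replay position (in 9.6 the view columns are sent_location and replay_location; from version 10 they are sent_lsn and replay_lsn). The function names are made up for this example; the server can do the same subtraction itself with pg_xlog_location_diff() (pg_wal_lsn_diff() from version 10).

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    # An LSN is printed as two hexadecimal numbers: the high and low 32 bits.
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica is behind the primary."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)
```

A value that keeps growing over several samples is the signal to act on; a single large reading during a load spike may clear by itself.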

Do pg_basebackup properly. Part 1
Run pg_basebackup into a clean (empty) directory.
Clear out the «unnecessary» old directory first.
Use mv instead of rm to do so.

max_wal_senders was exceeded
There was only one standby, and it had failed.
Increase max_wal_senders.
Check what is holding the connections.
The limit was exceeded by concurrent pg_basebackup runs.

max_connections = 8000
More than 500 is a bad idea.
Use pgbouncer to reduce the number of server connections.

Do pg_basebackup properly. Part 2
Don't run more than one pg_basebackup at a time.
It wasn't stuck; it was waiting for a checkpoint.
Use the «-c fast» option to request a fast checkpoint.
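
Both pg_basebackup slides can be folded into one small helper. This is only a sketch: the function name, host, and user are placeholders, and the emptiness check duplicates what pg_basebackup does anyway (it refuses a non-empty target), just earlier and with a hint about what to do next. The flags themselves (-h, -U, -D, -X stream, -c fast, -P) are standard pg_basebackup options.

```python
import os
from typing import List


def basebackup_command(host: str, user: str, target_dir: str) -> List[str]:
    """Build a pg_basebackup command line, refusing a non-empty target."""
    if os.path.isdir(target_dir) and os.listdir(target_dir):
        raise ValueError(f"{target_dir} is not empty; move it aside first")
    return [
        "pg_basebackup",
        "-h", host,
        "-U", user,
        "-D", target_dir,
        "-X", "stream",  # stream WAL alongside the base backup
        "-c", "fast",    # request an immediate checkpoint instead of waiting
        "-P",            # show progress, so a slow start is not mistaken for a hang
    ]
```

Note that -X stream opens a second replication connection, so each run counts twice against max_wal_senders; that is one more reason not to start several runs at once.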

strace: a good thing in the wrong place
strace isn't a good tool in that case.
Use strace to trace system calls and their errors.
Check the stack trace from /proc/<pid>/stack or GDB.

rm or not rm
The data directory was cleaned with rm.
Use mv instead of rm.
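
The «mv instead of rm» habit could look like the sketch below. The helper name and the «.old-<timestamp>» suffix are arbitrary choices for this example, not a convention from the talk.

```python
import os
import time


def move_aside(path: str) -> str:
    """Rename `path` to a timestamped sibling instead of deleting it."""
    path = os.path.normpath(path)  # drop any trailing slash
    backup = f"{path}.old-{time.strftime('%Y%m%d-%H%M%S')}"
    # A rename within one filesystem is quick and copies no data.
    os.rename(path, backup)
    return backup
```

The old directory stays around until the new replica is verified, and running this on the wrong host costs a rename back rather than a restore.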

A bit about backup
Daily pg_dump.
Daily LVM snapshot.
Daily Azure snapshot.
PostgreSQL streaming replication.
Basebackup with WAL archive.

Different PG versions in production
Clean out old packages after a major upgrade.

Broken mail
Cron was set up, but the notifications were forgotten.
Use reliable notification systems.
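
One way to avoid depending on cron's own mail is to wrap the job and report failures explicitly. A minimal sketch, assuming the caller supplies a notify callable (a pager, a chat webhook, a monitoring push); run_and_notify and its message format are invented for this example.

```python
import subprocess
from typing import Callable, Sequence


def run_and_notify(cmd: Sequence[str], notify: Callable[[str], None]) -> int:
    """Run a job and call `notify` with details if it fails."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # The command could not be started at all (missing binary, permissions).
        notify(f"job {' '.join(cmd)} could not be started: {exc}")
        return 127
    if result.returncode != 0:
        notify(
            f"job {' '.join(cmd)} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.returncode
```

This only covers jobs that run and fail. A job that never runs still needs a check on the monitoring side, such as an alert on backup age.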

03 Preventative measures
1. Update PS1 across all hosts to more clearly differentiate between hosts and environments.
2. Prometheus monitoring for backups.
3. Set PostgreSQL's max_connections to a sane value.
4. Investigate Point in time recovery & continuous archiving for PostgreSQL.
5. Hourly LVM snapshots of the production databases.
6. Azure disk snapshots of production databases.
7. Move staging to the ARM environment.
8. Recover production replica(s).
9. Automated testing of recovering PostgreSQL database backups.
10. Improve PostgreSQL replication documentation/runbooks.
11. Investigate pgbarman for creating PostgreSQL backups.
12. Investigate using WAL-E as a means of Database Backup and Realtime Replication.
13. Build Streaming Database Restore.
14. Assign an owner for data durability.

Preventative measures, with comments
1. Update PS1 across all hosts.
    Looks OK.
2. Prometheus monitoring for backups.
    Size, number, age and recovery status.
3. Set PostgreSQL's max_connections to a sane value.
    Better to use pgbouncer.
4. Investigate PITR & continuous archiving for PostgreSQL.
    Yes, as part of the backup.
5. Hourly LVM snapshots of the production databases.
    Looks unnecessary.
6. Azure disk snapshots of production databases.
    Looks unnecessary.
7. Move staging to the ARM environment.
    Very, very suspicious.
8. Recover production replica(s).
    Do that ASAP.
9. Automated testing of recovering database backups.
    YES!
10. Improve documentation/runbooks.
    You need a bureaucrat.
11. Investigate pgbarman.
    Looks OK; Barman is stable and reliable.
12. Investigate using WAL-E.
    Looks OK; WAL-E is «set up and forget».
13. Build Streaming Database Restore.
    Corresponds to item 9.
14. Assign an owner for data durability.
    Hire a DBA.

Lessons learned
Check and monitor backups.
Create emergency instructions.
Learn to use tools properly.
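
«Check and monitor backups», in the sense of the «size, number, age» comment on item 2, might start from something like this. It is a sketch only: backup_stats and its result keys are hypothetical, it looks only at regular files directly inside one directory, and it says nothing about recovery status, which still needs a real restore test.

```python
import os
import time
from typing import Dict, Optional


def backup_stats(backup_dir: str, now: Optional[float] = None) -> Dict[str, float]:
    """Return file count, total size, and age of the newest backup file."""
    now = time.time() if now is None else now
    with os.scandir(backup_dir) as entries:
        files = [e for e in entries if e.is_file()]
    newest = max((e.stat().st_mtime for e in files), default=None)
    return {
        "count": len(files),
        "total_bytes": sum(e.stat().st_size for e in files),
        # No files at all is reported as an infinitely old backup.
        "newest_age_seconds": (now - newest) if newest is not None else float("inf"),
    }
```

Exported as metrics, these three numbers are enough to alert on «no backup in the last day» and «backup suspiciously small», since an empty dump is as bad as a missing one.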

Links
Postmortem of database outage of January 31
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
PostgreSQL Statistics Collector: pg_stat_replication view
https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
pg_basebackup utility
https://www.postgresql.org/docs/current/static/app-pgbasebackup.html
PostgreSQL Replication
https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html
PgBouncer
https://pgbouncer.github.io/
https://wiki.postgresql.org/wiki/PgBouncer
Barman
http://www.pgbarman.org/
Thanks for watching!
dataegret.com alexey.lesovsky@dataegret.com
