Huy Nguyen
CTO, Cofounder - Holistics Software
Cofounder, Grokking Vietnam
PostgreSQL Internals 101
/post:gres:Q:L/
About Me
Education:
● Pho Thong Nang Khieu, Tin 04-07
● National University of Singapore (NUS), Computer Science Major.
Work:
● Software Engineer Intern, SenseGraphics (Stockholm, Sweden)
● Software Engineer Intern, Facebook (California, US)
● Data Infrastructure Engineer, Viki (Singapore)
Now:
● Co-founder & CTO, Holistics Software
● Co-founder, Grokking Vietnam
huy@holistics.io facebook.com/huy bit.ly/huy-linkedin
● This talk covers a very small part of
PostgreSQL concepts/internals
● As with any RDBMS, PostgreSQL is a
complex system, and it’s still evolving.
● Mainly revolves around explaining
“Uber’s MySQL vs PostgreSQL”
article.
● Not Covered: Memory Management,
Query Planning, Replication, etc...
Agenda
● Uber’s Article
● Table Heap
● B-Tree Index
● MVCC
● MySQL Structure
● PostgreSQL vs MySQL
(Uber Use-case)
● Index-only Scan
● Heap-only Tuple (HOT)
Uber migrating from PostgreSQL to MySQL
Uber’s Use Case
● Table with lots of indexes (covering almost all columns)
● Lots of UPDATEs
⇒ MySQL handles this better than PostgreSQL
● Read more: https://eng.uber.com/mysql-migration/
Physical Structure
● Everything lives under the data directory ($PGDATA), e.g. /var/lib/postgresql/9.x/main
● Each database is a folder named after its oid (under $PGDATA/base/)
http://www.interdb.jp/pg/pgsql01.html
demodb=# select oid, relname, relfilenode
from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416854
(1 row)
Physical Structure
Each table’s data is stored in one or more files (at most 1 GB each), named after the table’s relfilenode.
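TRUNCATE table;
vs
DELETE FROM table;
TRUNCATE gives the table a new, empty data file (the relfilenode changes, as the session below shows); DELETE keeps the same file and only marks tuples as deleted.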
demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416854
(1 row)
demodb=# truncate test;
TRUNCATE TABLE
INSERT 0 1
demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
oid | relname | relfilenode
--------+---------+-------------
416854 | test | 416857
(1 row)
Tuple Address (ctid)
 ctid   | id | name
--------+----+---------
 (0, 2) |  1 | Alice
 (0, 5) |  2 | Bob
 (1, 3) |  3 | Charlie
ctid (tuple ID): a pair (block number, line pointer index) that locates the tuple in the data file.
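As a rough illustration of how a ctid could be turned into a position on disk, the sketch below assumes the default 8 KB page size and the 1 GB file limit mentioned in this talk. The function name `locate` is made up for this example and is not a PostgreSQL API.

```python
PAGE_SIZE = 8 * 1024          # default page (block) size: 8 KB
SEGMENT_SIZE = 1024 ** 3      # each data file holds at most 1 GB
PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE


def locate(ctid):
    """Map (block, line pointer index) to (file segment, byte offset of the page, line pointer index)."""
    block, line_pointer = ctid
    segment = block // PAGES_PER_SEGMENT
    page_offset = (block % PAGES_PER_SEGMENT) * PAGE_SIZE
    return segment, page_offset, line_pointer


# The three ctids from the slide.
for ctid in [(0, 2), (0, 5), (1, 3)]:
    print(ctid, "->", locate(ctid))
```

The line pointer index is kept as-is: it selects an entry in the page's line pointer array, which in turn gives the tuple's offset inside the page.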
Heap Table Structure
Page: a fixed-size block of the data file, 8 KB by default.
Line pointers: 4-byte entries at the start of the page, each holding the offset and length of one tuple.
For a tuple larger than about 2 KB, a special storage method called TOAST is used.
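To get a feel for how many rows fit in one page, here is a back-of-the-envelope sketch. The 8 KB page and 4-byte line pointer come from the slide; the 24-byte page header, the tuple header padded to 24 bytes, and the 8-byte alignment are assumptions added for this example, and null bitmaps and TOAST are ignored.

```python
PAGE_SIZE = 8 * 1024       # from the slide: 8 KB pages
LINE_POINTER_SIZE = 4      # from the slide: 4-byte line pointers
PAGE_HEADER_SIZE = 24      # assumption: fixed page header size
TUPLE_HEADER_SIZE = 24     # assumption: tuple header padded to 24 bytes
ALIGNMENT = 8              # assumption: tuples are aligned to 8 bytes


def align(size, alignment=ALIGNMENT):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def max_rows_per_page(row_data_bytes):
    """Estimate how many tuples carrying row_data_bytes of user data fit in one page."""
    tuple_size = align(TUPLE_HEADER_SIZE + row_data_bytes)
    usable = PAGE_SIZE - PAGE_HEADER_SIZE
    return usable // (LINE_POINTER_SIZE + tuple_size)


for data_bytes in (0, 100, 1000):
    print(data_bytes, "bytes of data ->", max_rows_per_page(data_bytes), "rows per page")
```

Every row pays the line pointer and the tuple header on top of its own data, so narrow rows carry proportionally more overhead than wide ones.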
MVCC - Multi-version Concurrency Control
● Problem: someone is reading data while someone else is writing to it
● The reader might see an inconsistent piece of data
● MVCC: allows reads and writes to happen concurrently by keeping multiple versions of each row
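MVCC - Table
 xmin | xmax | id | name
------+------+----+---------
    1 |    5 |  1 | Alice
    2 |    3 |  2 | Bob
    3 |      |  2 | Robert
    4 |      |  3 | Charlie
Each statement below runs as its own transaction, numbered 1 to 5:
1. INSERT Alice
2. INSERT Bob
3. UPDATE Bob → Robert
4. INSERT Charlie
5. DELETE Alice
● xmin: ID of the transaction that inserted this tuple
● xmax: ID of the transaction that deleted or replaced this tuple (empty while the tuple is live)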
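The xmin/xmax rule can be sketched as a small visibility check over the table from the slide. This is a simplified model, not PostgreSQL's actual visibility logic: it assumes every transaction commits, and that a reader sees exactly the transactions with an ID up to some value. The names `HeapTuple` and `visible_as_of` are made up for this example.

```python
from collections import namedtuple

HeapTuple = namedtuple("HeapTuple", "xmin xmax id name")

# The table from the slide; xmax is None while a tuple has not been removed.
heap = [
    HeapTuple(1, 5, 1, "Alice"),
    HeapTuple(2, 3, 2, "Bob"),
    HeapTuple(3, None, 2, "Robert"),
    HeapTuple(4, None, 3, "Charlie"),
]


def visible_as_of(heap, txid):
    """Names visible once every transaction up to and including txid has committed."""
    return [
        t.name
        for t in heap
        if t.xmin <= txid and (t.xmax is None or t.xmax > txid)
    ]


for txid in range(1, 6):
    print("after transaction", txid, "->", visible_as_of(heap, txid))
```

Both versions of row 2 (Bob and Robert) stay in the heap; which one a reader sees depends only on where its snapshot falls relative to transaction 3.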
INSERT: the new tuple gets xmin = the inserting transaction's ID; xmax is empty.
DELETE: the tuple stays in place; its xmax is set to the deleting transaction's ID.
UPDATE: the old tuple is marked deleted (xmax set) and a new tuple is inserted.
Figures: http://www.interdb.jp/pg/pgsql05.html
Table Bloat
Because each UPDATE creates a new tuple (and marks the old tuple as deleted), a table with lots of UPDATEs quickly grows in physical size.
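A toy append-only heap makes the effect visible. This is only a sketch of the idea described here; the class `ToyHeap` and its methods are hypothetical and do not model pages, cleanup, or real transaction handling.

```python
class ToyHeap:
    """Append-only heap: an UPDATE never overwrites a tuple in place."""

    def __init__(self):
        self.tuples = []       # each entry is [xmin, xmax, value]
        self.next_txid = 1

    def _new_txid(self):
        txid = self.next_txid
        self.next_txid += 1
        return txid

    def insert(self, value):
        self.tuples.append([self._new_txid(), None, value])
        return len(self.tuples) - 1           # position of the new tuple

    def update(self, position, value):
        txid = self._new_txid()
        self.tuples[position][1] = txid       # mark the old version as removed
        self.tuples.append([txid, None, value])
        return len(self.tuples) - 1           # position of the new version

    def live_count(self):
        return sum(1 for t in self.tuples if t[1] is None)

    def dead_count(self):
        return len(self.tuples) - self.live_count()


heap = ToyHeap()
position = heap.insert(0)
for i in range(1, 1001):
    position = heap.update(position, i)

print("live tuples:", heap.live_count())
print("dead tuples:", heap.dead_count())
```

One logical row updated 1000 times leaves 1001 physical tuples in this model, of which only one is live.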
Index (B-tree)
Balanced search tree.
The root node and inner nodes contain keys and pointers to lower-level nodes.
Leaf nodes contain keys and pointers to the heap (ctid).
When a new tuple is added to the table, an entry pointing to it is added to the index tree.
[Diagram: B-tree whose leaf entries hold ctids pointing to tuples in the heap]
Write Amplification
● Each UPDATE inserts a new tuple.
● Every index needs a new index tuple pointing to it
● ⇒ multiple writes for one logical change
● Extra overhead in the Write-ahead Log (WAL)
● The WAL is carried over the network
● and applied on the slave (replica)
MySQL / InnoDB
● MVCC: in-place update of rows
● Table layout: B+ tree on the primary key
● Index: points to the primary key
MySQL Table (B+ tree)
MySQL table data is stored as a B+ tree on the primary key.
Leaf nodes contain the actual row data.
[Diagram: B+ tree keyed on the primary key, with row data in the leaf nodes]
MySQL Index
● MySQL: leaf entries of a secondary index store the primary key of the row
● A lookup on a secondary index requires 2 index traversals: secondary index + primary index.
[Diagram: secondary index whose leaf entries hold primary keys pointing into the table]
https://blog.jcole.us/2013/01/10/btree-index-structures-in-innodb/
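The difference between the two lookup paths can be sketched with plain Python dicts standing in for the B-tree and B+ tree structures. All names and sample rows here are made up for illustration; a real tree traversal costs log(n) steps, which a dict lookup does not model.

```python
# PostgreSQL-style: rows live in a heap addressed by ctid; an index maps key -> ctid.
pg_heap = {
    (0, 1): {"id": 1, "name": "Alice"},
    (0, 2): {"id": 2, "name": "Bob"},
}
pg_name_index = {"Alice": (0, 1), "Bob": (0, 2)}

# InnoDB-style: rows live in the primary-key tree; a secondary index maps key -> primary key.
innodb_primary = {
    1: {"id": 1, "name": "Alice"},
    2: {"id": 2, "name": "Bob"},
}
innodb_name_index = {"Alice": 1, "Bob": 2}


def pg_lookup_by_name(name):
    """Return (row, number of index traversals)."""
    ctid = pg_name_index[name]       # traversal 1: the index on name
    return pg_heap[ctid], 1          # then a direct heap read by physical address


def innodb_lookup_by_name(name):
    """Return (row, number of index traversals)."""
    primary_key = innodb_name_index[name]    # traversal 1: the secondary index
    return innodb_primary[primary_key], 2    # traversal 2: the primary-key tree


print(pg_lookup_by_name("Bob"))
print(innodb_lookup_by_name("Bob"))
```

The trade-off shows up on writes: when a row moves to a new physical location, the ctid stored in `pg_name_index` has to change, while the primary key stored in `innodb_name_index` stays valid.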
PostgreSQL vs MySQL (Uber case)
             | PostgreSQL                     | MySQL
-------------+--------------------------------+---------------------------------------------------
MVCC         | New tuple per UPDATE           | In-place update of tuple (with rollback segments)
Index lookup | Stores physical address (ctid) | By primary key
Table layout | Heap-table structure           | Primary-key table structure
PostgreSQL vs MySQL (Uber case)
                      | PostgreSQL                       | MySQL
----------------------+----------------------------------+------------------------------------
select on primary key | log(n) + heap read               | log(n) + direct read
update                | Update all indexes; 1 data write | Do not update indexes; 2 data writes
select on index key   | log(n) + O(1) heap read          | log(n) + log(n) primary index read
sequential scan       | Page sequential scan             | Index-order scan
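The "update" row of the comparison above can be turned into a rough write count. This is a simplified sketch under stated assumptions: PostgreSQL writes one new heap tuple plus one entry per index, and InnoDB writes the row in place plus one rollback-segment record. The `changed_indexed_columns` parameter is an addition for this example, covering the case where the UPDATE does change an indexed column.

```python
def postgres_writes(n_indexes):
    """One new heap tuple, plus a new entry in every index on the table."""
    return 1 + n_indexes


def innodb_writes(changed_indexed_columns=0):
    """In-place row update plus a rollback-segment record; indexes only if their columns changed."""
    return 2 + changed_indexed_columns


for n_indexes in (1, 5, 20):
    print(
        n_indexes, "indexes:",
        "PostgreSQL", postgres_writes(n_indexes),
        "vs InnoDB", innodb_writes(),
    )
```

With Uber's pattern of many indexes and frequent UPDATEs on non-indexed columns, the PostgreSQL count grows with the number of indexes while the InnoDB count stays flat.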
Index-only Scan (Covering Index)
Index on (product_id, revenue)
SELECT SUM(revenue) FROM table WHERE product_id = 123
If the index itself has all the data needed,
no Heap Table lookup is required.
Visibility Map
Tracked per table page.
VM[i] is set: all tuples in page i are visible to all current transactions.
An index-only scan can skip the heap only for pages whose VM bit is set.
VM bits are set by VACUUM and cleared when the page is modified.
https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
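How the visibility map decides whether the heap must be touched can be sketched as follows. This is a toy model of the SUM(revenue) query, not PostgreSQL's executor; the function name, the data shapes, and the sample values are all assumptions made for this example.

```python
def index_only_scan(index_entries, visibility_map, heap_visibility):
    """Sum revenue from index entries, reading the heap only for pages not marked all-visible.

    index_entries: list of (revenue, ctid) already filtered to the wanted product_id
    visibility_map: dict mapping page number -> True if all tuples on the page are visible
    heap_visibility: dict mapping ctid -> True if that tuple is visible to this transaction
    """
    total = 0
    heap_fetches = 0
    for revenue, ctid in index_entries:
        page, _ = ctid
        if visibility_map.get(page, False):
            total += revenue                 # answered from the index alone
        else:
            heap_fetches += 1                # must check the tuple in the heap
            if heap_visibility[ctid]:
                total += revenue
    return total, heap_fetches


entries = [(10, (0, 1)), (20, (0, 2)), (30, (1, 1)), (40, (1, 2))]
vm = {0: True, 1: False}                     # page 1 was modified since the last VACUUM
heap = {(1, 1): True, (1, 2): False}         # (1, 2) is a dead tuple

print(index_only_scan(entries, vm, heap))
```

If the table has not been vacuumed recently, few VM bits are set and an index-only scan degrades toward a regular index scan.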
Heap-only Tuple (HOT)
● No index needs to be updated for the new tuple version
Conditions:
● The UPDATE must not change any indexed column
● The new tuple must fit in the same page as the old one
http://slideplayer.com/slide/9883483/
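The two conditions can be written down as a small predicate. This is a sketch of the rule as stated on the slide, with a made-up function name and parameters; it does not reflect how PostgreSQL tracks free space internally.

```python
def can_use_hot(changed_columns, indexed_columns, page_free_bytes, new_tuple_bytes):
    """Return True if an UPDATE could be a heap-only tuple update under the two slide conditions."""
    touches_index = bool(set(changed_columns) & set(indexed_columns))
    fits_in_page = new_tuple_bytes <= page_free_bytes
    return not touches_index and fits_in_page


indexed = {"id", "email"}

print(can_use_hot({"last_seen"}, indexed, page_free_bytes=500, new_tuple_bytes=120))
print(can_use_hot({"email"}, indexed, page_free_bytes=500, new_tuple_bytes=120))
print(can_use_hot({"last_seen"}, indexed, page_free_bytes=50, new_tuple_bytes=120))
```

This is why a table where almost every column is indexed, as in Uber's use case, rarely benefits from HOT.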
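VACUUM
● Cleans up dead tuples
● Freezes old tuples (prevents transaction ID wraparound)
● Plain VACUUM only marks the space of dead tuples as reusable; the table file generally does not shrink
● VACUUM FULL rewrites the table and returns disk space to the operating system, but blocks reads and writes while it runs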
Safe & Unsafe Operations In PostgreSQL
● Add a new column (safe)
● Add a column with a default (unsafe: rewrites the whole table in versions before PostgreSQL 11)
● Add a column that is non-nullable (unsafe)
● Drop a column (safe)
● Add a default value to an existing column (safe)
● Add an index (unsafe: blocks writes unless built with CREATE INDEX CONCURRENTLY)
http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
References
● Why Uber Engineering switched from PostgreSQL to MySQL -
https://eng.uber.com/mysql-migration/
● PostgreSQL Documentation -
https://www.postgresql.org/docs/current/static/
● The Internals of PostgreSQL -
http://www.interdb.jp/pg/
● http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
● http://slideplayer.com/slide/9883483/
● https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
Huy Nguyen
Physical Structure
https://www.postgresql.org/docs/current/static/storage-file-layout.html
Transaction Isolation
BEGIN TRANSACTION;
SELECT * FROM table;
SELECT pg_sleep(10);
SELECT * FROM table;
COMMIT;
Under READ COMMITTED, the second SELECT may return different data. A concurrent transaction may update a record, delete it, or insert new records, and once it commits, the second SELECT sees those changes.
Under REPEATABLE READ, the second SELECT is guaranteed to see the rows it saw in the first SELECT unchanged. The SQL standard still allows rows inserted by a concurrent transaction during those 10 seconds to appear; PostgreSQL's snapshot-based implementation does not show them either.
Under SERIALIZABLE, the second SELECT is guaranteed to see exactly the same rows as the first. As far as this transaction can tell, no row is changed, deleted, or inserted by a concurrent transaction.
https://stackoverflow.com/questions/4034976/difference-between-read-commit-and-repeatable-read
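The difference between the first two levels comes down to when the snapshot is taken. The sketch below is a toy model with hypothetical classes: READ COMMITTED takes a fresh snapshot for every statement, while REPEATABLE READ reuses the snapshot taken by the transaction's first statement. Locking, conflicts, and serialization failures are not modeled.

```python
class ToyDatabase:
    """Holds only the latest committed state of one table."""

    def __init__(self, rows):
        self.committed = list(rows)

    def commit(self, rows):
        self.committed = list(rows)


class ToyTransaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation
        self.snapshot = None

    def select_all(self):
        if self.isolation == "READ COMMITTED":
            return list(self.db.committed)            # fresh snapshot per statement
        if self.snapshot is None:
            self.snapshot = list(self.db.committed)   # snapshot fixed at the first statement
        return list(self.snapshot)


def run(isolation):
    db = ToyDatabase(["Alice"])
    tx = ToyTransaction(db, isolation)
    first = tx.select_all()
    db.commit(["Alice", "Bob"])      # a concurrent transaction commits during pg_sleep
    second = tx.select_all()
    return first, second


print("READ COMMITTED: ", run("READ COMMITTED"))
print("REPEATABLE READ:", run("REPEATABLE READ"))
```

In this model the second SELECT under READ COMMITTED sees the newly committed row, while under REPEATABLE READ it returns the same result as the first.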
PostgreSQL Processes
There are multiple processes handling different
use cases.
● postmaster process: handles database
cluster management.
● Many backend processes (one for each
connection)
● Background processes: stats collector,
autovacuum, checkpoint, WAL writer, etc.
http://www.interdb.jp/pg/pgsql02.html
Database Cluster
● Database cluster: a collection of databases managed by one PostgreSQL server instance on a single machine.
● A database contains many database objects (schema, table, index, view, function, etc.)
● Each object is identified by an oid
[Diagram: Database Cluster containing Database 1, Database 2, ... Database n; objects shown: tables, indexes, views, materialized views, functions, schema, sequences, ..., role (user/group)]
Grokking TechTalk #20: PostgreSQL Internals 101

  • 1. Huy Nguyen CTO, Cofounder - Holistics Software Cofounder, Grokking Vietnam PostgreSQL Internals 101 /post:gres:Q:L/
  • 2. About Me Education: ● Pho Thong Nang Khieu, Tin 04-07 ● National University of Singapore (NUS), Computer Science Major. Work: ● Software Engineer Intern, SenseGraphics (Stockholm, Sweden) ● Software Engineer Intern, Facebook (California, US) ● Data Infrastructure Engineer, Viki (Singapore) Now: ● Co-founder & CTO, Holistics Software ● Co-founder, Grokking Vietnam [email protected] facebook.com/huy bit.ly/huy-linkedin
  • 3. ● This talk covers a very small part of PostgreSQL concepts/internals ● As with any RDBMS, PostgreSQL is a complex system, and it’s still evolving. ● Mainly revolve around explaining “Uber’s MySQL vs PostgreSQL” article. ● Not Covered: Memory Management, Query Planning, Replication, etc... Agenda ● Uber’s Article ● Table Heap ● B-Tree Index ● MVCC ● MySQL Structure ● PostgreSQL vs MySQL (Uber Use-case) ● Index-only Scan ● Heap-only Tuple (HOT)
  • 4. Uber migrating from PostgreSQL to MySQL
  • 5. Uber’s Use Case ● Table with lots of indexes (cover almost/all columns) ● Lots of UPDATEs ⇒ MySQL handles this better than PostgreSQL ● Read more here
  • 6. ● Everything is under base directory ($PGDATA). /var/lib/postgresql/ 9.x/main ● Each database is a folder name after its oid Physical Structure http://www.interdb.jp/pg/pgsql01.html
  • 7. demodb=# select oid, relname, relfilenode from pg_class where relname = 'test'; oid | relname | relfilenode --------+---------+------------- 416854 | test | 416854 (1 row) Physical Structure Each table’s data is in 1 or multiple files (max 1GB each)
  • 9. demodb=# select oid, relname, relfilenode from pg_class where relname = 'test'; oid | relname | relfilenode --------+---------+------------- 416854 | test | 416854 (1 row) demodb=# truncate test; TRUNCATE TABLE INSERT 0 1 demodb=# select oid, relname, relfilenode from pg_class where relname = 'test'; oid | relname | relfilenode --------+---------+------------- 416854 | test | 416857 (1 row)
  • 10. Tuple Address (ctid) ctid id name (0, 2) 1 Alice (0, 5) 2 Bob (1, 3) 3 Charlie ctid (tuple ID): a pair of (block, location) to position the tuple in the data file.
  • 11. Heap Table Structure Page: a block of content, default to 8KB each. Line pointers: 4-byte number address, holds pointer to each tuple. For tuple with size > 2KB, a special storage method called TOAST is used.
  • 12. MVCC - Multi-version Concurrency Control ● Problem: someone is reading data while someone else is writing to it ● The reader might see an inconsistent piece of data ● MVCC: allows reads and writes to happen concurrently
  • 13. MVCC - Table
      1. INSERT Alice 2. INSERT Bob 3. UPDATE Bob → Robert 4. INSERT Charlie 5. DELETE Alice
       xmin | xmax | id | name
      ------+------+----+---------
          1 |    5 |  1 | Alice
          2 |    3 |  2 | Bob
          3 |      |  2 | Robert
          4 |      |  3 | Charlie
      ● xmin: ID of the transaction that inserted this tuple ● xmax: ID of the transaction that removed this tuple
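The xmin/xmax bookkeeping on this slide can be replayed with a small Python model. It is a sketch under simplifying assumptions: every operation is its own transaction, every transaction commits immediately, and a snapshot simply sees all transactions up to a given ID. `HeapTuple`, `ToyTable` and `visible` are hypothetical names, not PostgreSQL structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeapTuple:
    xmin: int            # transaction that inserted this version
    xmax: Optional[int]  # transaction that removed it (None = still live)
    id: int
    name: str


class ToyTable:
    def __init__(self):
        self.tuples = []
        self.next_xid = 1

    def _new_xid(self):
        xid = self.next_xid
        self.next_xid += 1
        return xid

    def _live(self, row_id):
        return next(t for t in self.tuples if t.id == row_id and t.xmax is None)

    def insert(self, row_id, name):
        self.tuples.append(HeapTuple(self._new_xid(), None, row_id, name))

    def update(self, row_id, name):
        xid = self._new_xid()
        self._live(row_id).xmax = xid  # old version is marked, not overwritten
        self.tuples.append(HeapTuple(xid, None, row_id, name))

    def delete(self, row_id):
        self._live(row_id).xmax = self._new_xid()

    def visible(self, snapshot_xid):
        """Rows seen by a snapshot that includes transactions <= snapshot_xid."""
        return [(t.id, t.name) for t in self.tuples
                if t.xmin <= snapshot_xid
                and (t.xmax is None or t.xmax > snapshot_xid)]


table = ToyTable()
table.insert(1, "Alice")    # transaction 1
table.insert(2, "Bob")      # transaction 2
table.update(2, "Robert")   # transaction 3
table.insert(3, "Charlie")  # transaction 4
table.delete(1)             # transaction 5

for t in table.tuples:
    print(t.xmin, t.xmax, t.id, t.name)
print(table.visible(2))  # what a reader saw after transaction 2
print(table.visible(5))  # what a reader sees after transaction 5
```

All four tuple versions stay in the table; readers just filter them by xmin and xmax, which is why readers and writers do not block each other.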
  • 17. Table Bloat: because each UPDATE creates a new tuple (and only marks the old tuple as deleted), a workload with many UPDATEs quickly increases the table’s physical size.
  • 18. Index (B-tree) ● Balanced search tree. ● Root node and inner nodes contain keys and pointers to lower-level nodes. ● Leaf nodes contain keys and pointers to the heap (ctid). ● When the table gets a new tuple, a matching entry is added to the index tree. [Diagram: B-tree whose leaf entries point to tuples in the heap by ctid]
  • 19. Write Amplification ● Each UPDATE inserts a new tuple. ● Each index gets a new index tuple pointing to it ⇒ multiple writes per UPDATE ● Extra overhead in the Write-ahead Log (WAL), which is ● carried over the network ● and applied on the slave
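The effect is easy to count in a toy model. The Python sketch below is an illustration only: `ToyPgTable` is a hypothetical class, a list position stands in for the ctid, and it ignores the HOT optimization as well as WAL and replication, which multiply these writes further.

```python
class ToyPgTable:
    """Heap plus secondary indexes that point at heap positions (ctids)."""

    def __init__(self, indexed_columns):
        self.heap = []  # list of row dicts; list position plays the ctid role
        self.indexes = {col: [] for col in indexed_columns}  # (key, ctid) pairs
        self.writes = 0

    def insert(self, row):
        self.heap.append(dict(row))
        ctid = len(self.heap) - 1
        self.writes += 1                      # one heap write
        for col, entries in self.indexes.items():
            entries.append((row[col], ctid))  # one write per index
            self.writes += 1
        return ctid

    def update(self, ctid, changes):
        # An UPDATE stores a whole new tuple version at a new ctid, so every
        # index needs a new entry, even for columns that did not change.
        new_row = {**self.heap[ctid], **changes}
        return self.insert(new_row)


t = ToyPgTable(["a", "b", "c"])
first = t.insert({"a": 1, "b": 2, "c": 3, "note": "x"})
print(t.writes)  # writes after the INSERT
t.update(first, {"note": "y"})
print(t.writes)  # writes after one UPDATE of a non-indexed column
```

With three indexes, a one-column UPDATE costs four writes in this model, and the cost grows with every extra index on the table.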
  • 20. MySQL / InnoDB ● MVCC: inline (in-place) update of tuples ● Table layout: B+ tree on the primary key ● Secondary index: points to the primary key
  • 21. MySQL Table (B+ tree) ● MySQL table data is itself a B+ tree (on the primary key) ● Leaf nodes contain the actual row data [Diagram: B+ tree keyed by primary key, with row data in the leaf nodes]
  • 22. MySQL Index ● In a MySQL secondary index, each entry stores the row’s primary key as its value ● A lookup on a secondary index requires 2 index traversals: secondary index + primary index. [Diagram: secondary index entries pointing to the table by primary key]
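A minimal Python sketch of this layout, with plain dictionaries standing in for the two B+ trees. The names `primary_index`, `secondary_index`, `lookup_by_name` and `update_city`, and the sample rows, are made up for illustration.

```python
# Clustered (primary) index: primary key -> full row, as in the B+ tree leaves.
primary_index = {
    1: {"id": 1, "name": "Alice", "city": "Hanoi"},
    2: {"id": 2, "name": "Bob", "city": "Singapore"},
}

# Secondary index on name: key -> primary key (not a physical address).
secondary_index = {"Alice": 1, "Bob": 2}


def lookup_by_name(name):
    pk = secondary_index[name]  # traversal 1: secondary index
    return primary_index[pk]    # traversal 2: primary index


def update_city(pk, city):
    # The row changes in place; its primary key stays the same, so the
    # secondary index on name does not need to be touched.
    primary_index[pk]["city"] = city


update_city(2, "Saigon")
print(lookup_by_name("Bob"))
```

The extra traversal on reads is the price paid for not having to rewrite secondary indexes when a non-indexed column changes.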
  • 24. PostgreSQL vs MySQL (Uber case)
                    | PostgreSQL                    | MySQL
      --------------+-------------------------------+------------------------------------------------
       MVCC         | New tuple per UPDATE          | Inline update of tuple (with rollback segments)
       Index Lookup | Store physical address (ctid) | By primary key
       Table Layout | Heap-table structure          | Primary-key table structure
  • 25. PostgreSQL vs MySQL (Uber case)
                              | PostgreSQL                       | MySQL
      ------------------------+----------------------------------+------------------------------------
       select on primary key  | log(n) + heap read               | log(n) + direct read
       update                 | Update all indexes; 1 data write | Do not update indexes; 2 data writes
       select on index key    | log(n) + O(1) heap read          | log(n) + log(n) primary index read
       sequential scan        | Page sequential scan             | Index-order scan
  • 26. Index-only Scan (Covering Index) ● Index on (product_id, revenue) ● Query: SELECT SUM(revenue) FROM table WHERE product_id = 123 ● If the index itself has all the data needed, no Heap Table lookup is required.
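The idea can be sketched in Python with a sorted list standing in for the B-tree on (product_id, revenue). The data and the `sum_revenue` helper are invented for illustration, and the sketch ignores the visibility check that PostgreSQL still has to do.

```python
import bisect

# Sorted (product_id, revenue) pairs play the role of the index leaf entries.
index = sorted([(123, 10), (123, 5), (7, 1), (200, 3), (123, 2)])


def sum_revenue(index, product_id):
    """Answer SUM(revenue) for one product from the index alone."""
    # (product_id,) sorts before every (product_id, revenue) pair, so this
    # finds the first entry for the product in O(log n).
    start = bisect.bisect_left(index, (product_id,))
    total = 0
    for pid, revenue in index[start:]:
        if pid != product_id:
            break  # entries are sorted, so the product's range has ended
        total += revenue
    return total


print(sum_revenue(index, 123))
```

The heap is never consulted: both the filter column and the aggregated column come straight from the index entries.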
  • 27. Visibility Map ● Kept per table page ● VM[i] is set: all tuples in page i are visible to all current transactions ● VM bits are set only by VACUUM (and cleared when the page is modified) https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
  • 28. Heap-only Tuple (HOT) ● No index needs to be updated for the new tuple. Conditions: ● The UPDATE must not change a column that’s indexed ● The new tuple must be in the same page as the old one http://slideplayer.com/slide/9883483/
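The two conditions can be written down as a small predicate. This Python sketch only restates the slide: `can_use_hot` is a hypothetical function, and "fits in the same page" is reduced to comparing the new tuple's size with the page's free space, which simplifies what PostgreSQL really checks.

```python
def can_use_hot(changed_columns, indexed_columns, free_bytes_in_page, new_tuple_size):
    """Return True if an UPDATE could be a heap-only tuple update."""
    # Condition 1: no indexed column is modified.
    touches_index = bool(set(changed_columns) & set(indexed_columns))
    # Condition 2: the new tuple version fits in the same page as the old one.
    fits_in_page = new_tuple_size <= free_bytes_in_page
    return not touches_index and fits_in_page


indexed = ["id", "email"]
print(can_use_hot(["last_login"], indexed, free_bytes_in_page=500, new_tuple_size=120))
print(can_use_hot(["email"], indexed, free_bytes_in_page=500, new_tuple_size=120))
```

When the predicate holds, the UPDATE writes only the new heap tuple, which avoids the per-index writes of a regular UPDATE.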
  • 29. VACUUM ● Cleans up dead tuples ● Freezes old tuples (prevents transaction ID wraparound) ● Plain VACUUM only frees the space of old tuples for reuse inside the table ● VACUUM FULL reclaims the old disk space, but takes an exclusive lock that blocks reads and writes
  • 30. Safe & Unsafe Operations In PostgreSQL ● Add a new column (safe) ● Add a column with a default (unsafe) ● Add a column that is non-nullable (unsafe) ● Drop a column (safe) ● Add a default value to an existing column (safe) ● Add an index (unsafe) http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
  • 31. References ● Why Uber Engineering switched from PostgreSQL to MySQL - https://eng.uber.com/mysql-migration/ ● PostgreSQL Documentation - https://www.postgresql.org/docs/current/static/ ● The Internals of PostgreSQL - http://www.interdb.jp/pg/ ● http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql ● http://slideplayer.com/slide/9883483/ ● https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
  • 34. Transaction Isolation
      BEGIN TRANSACTION; SELECT * FROM table; SELECT pg_sleep(10); SELECT * FROM table; COMMIT;
      ● Under READ COMMITTED, the second SELECT may return different data: a concurrent transaction may update a record, delete it, or insert new records, and once it commits, the second SELECT sees those changes.
      ● Under REPEATABLE READ, the second SELECT is guaranteed to see the rows returned by the first SELECT unchanged. By the SQL standard, new rows may still be added by a concurrent transaction during the sleep, but the existing rows can be neither deleted nor changed. (PostgreSQL’s REPEATABLE READ is snapshot-based and does not show such new rows either.)
      ● Under SERIALIZABLE, the second SELECT is guaranteed to see exactly the same rows as the first: no row it sees is changed, deleted, or inserted by a concurrent transaction.
      https://stackoverflow.com/questions/4034976/difference-between-read-commit-and-repeatable-read
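The difference between the first two levels comes down to when the snapshot is taken. The Python sketch below models that under strong simplifications: the database is a list of committed states, a snapshot is an index into that list, and write conflicts and SERIALIZABLE are not modelled. `ToyDB` and `Txn` are hypothetical names.

```python
class ToyDB:
    def __init__(self, rows):
        self.versions = [dict(rows)]  # committed states, oldest first

    def commit(self, rows):
        self.versions.append(dict(rows))

    def snapshot(self):
        return len(self.versions) - 1  # "everything committed so far"


class Txn:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation
        self.snap = db.snapshot()  # snapshot at transaction start

    def select_all(self):
        if self.isolation == "READ COMMITTED":
            # A fresh snapshot for every statement.
            self.snap = self.db.snapshot()
        # REPEATABLE READ keeps using the snapshot from the start.
        return dict(self.db.versions[self.snap])


db = ToyDB({1: "Alice"})
rc = Txn(db, "READ COMMITTED")
rr = Txn(db, "REPEATABLE READ")
print(rc.select_all(), rr.select_all())  # first SELECT in each transaction

db.commit({1: "Alice", 2: "Bob"})        # a concurrent transaction commits

print(rc.select_all())  # second SELECT under READ COMMITTED
print(rr.select_all())  # second SELECT under REPEATABLE READ
```

In this model the READ COMMITTED transaction picks up the concurrently committed row on its second SELECT, while the REPEATABLE READ transaction keeps seeing its original snapshot.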
  • 35. PostgreSQL Processes ● Multiple processes handle different tasks. ● postmaster process: handles database cluster management. ● Many backend processes (one for each connection) ● Background processes: stats collector, autovacuum, checkpointer, WAL writer, etc. http://www.interdb.jp/pg/pgsql02.html
  • 36. Database Cluster ● Database cluster: a database instance on a single machine. ● A database contains many database objects (schema, table, index, view, function, etc.) ● Each object is identified by an OID Database Cluster Database 1 Database 2 Database n... tables indexes views, materialized views functions schema sequences ... role (user/group