Distributed postgres.
XL, XTM, MultiMaster
Stas Kelvich
Started about a year ago.
Konstantin Knizhnik, Constantin Pan, Stas Kelvich
Cluster group in PgPro
2
Started by playing with Postgres-XC. 2ndQuadrant also had a project
(now finished) to port XC to 9.5.
Fork is painful;
How can we bring the functionality of XC into core?
Cluster group in PgPro
3
Distributed transactions - nothing in-core;
Distributed planner - fdw, pg_shard, greenplum planner (?);
HA/Autofailover - can be built on top of logical decoding.
Distributed postgres
4
Goal: achieve proper isolation between multi-node transactions.
Currently in postgres, at the start of a write tx:
Acquire XID;
Get list of running tx’s;
Use that info in visibility checks.
Distributed transactions
5
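The three steps listed on this slide can be sketched as a toy model. This is a simplified illustration, not PostgreSQL's actual code: the names `LocalTM`, `Snapshot`, `begin_write`, and `is_visible` are hypothetical, and subtransactions, XID wraparound, and aborted-transaction bookkeeping are ignored.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    xmax: int             # first XID not yet assigned when the snapshot was taken
    running: frozenset    # XIDs that were in progress at that moment


@dataclass
class LocalTM:
    """Toy single-node transaction manager (hypothetical, for illustration)."""
    next_xid: int = 1
    running: set = field(default_factory=set)
    committed: set = field(default_factory=set)

    def begin_write(self):
        # Step 1: acquire an XID.
        xid = self.next_xid
        self.next_xid += 1
        # Step 2: get the list of running transactions (taken before we join it).
        snap = Snapshot(xmax=self.next_xid, running=frozenset(self.running))
        self.running.add(xid)
        return xid, snap

    def commit(self, xid):
        self.running.discard(xid)
        self.committed.add(xid)

    def is_visible(self, writer_xid, my_xid, snap):
        # Step 3: use the snapshot in a visibility check.
        if writer_xid == my_xid:
            return True                       # own changes are visible
        if writer_xid >= snap.xmax:
            return False                      # started after the snapshot
        if writer_xid in snap.running:
            return False                      # still running at snapshot time
        return writer_xid in self.committed   # visible only if it committed
```

On a single node all of this state lives in shared memory; the difficulty for multi-node transactions is that no one node holds the XID counter and the running list for the whole cluster.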
transam/clog.c:
GetTransactionStatus
SetTransactionStatus
transam/varsup.c:
GetNewTransactionId
ipc/procarray.c:
TransactionIdIsInProgress
GetOldestXmin
GetSnapshotData
time/tqual.c:
XidInMVCCSnapshot
XTM API:
vanilla
6
transam/clog.c:
GetTransactionStatus
SetTransactionStatus
transam/varsup.c:
GetNewTransactionId
ipc/procarray.c:
TransactionIdIsInProgress
GetOldestXmin
GetSnapshotData
time/tqual.c:
XidInMVCCSnapshot
Transaction Manager
XTM API:
after patch
7
transam/clog.c:
GetTransactionStatus
SetTransactionStatus
transam/varsup.c:
GetNewTransactionId
ipc/procarray.c:
TransactionIdIsInProgress
GetOldestXmin
GetSnapshotData
time/tqual.c:
XidInMVCCSnapshot
Transaction Manager
pg_dtm.so
XTM API:
after tm load
8
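The progression across the three XTM API slides (direct calls in vanilla, calls routed through a transaction manager after the patch, then a loaded module such as pg_dtm.so supplying its own implementations) can be illustrated with a table of hooks. The sketch is in Python for brevity and is not the real C API: `TransactionManager`, `BUILTIN_TM`, `load_tm`, and the stub bodies are hypothetical; only the hook names are taken from the slides.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class TransactionManager:
    """Table of hooks; field names mirror functions listed on the slides."""
    get_transaction_status: Callable[[int], str]
    get_new_transaction_id: Callable[[], int]


_local_counter = iter(range(1, 1_000_000))

# "After patch": core calls go through this table, which defaults to local logic.
BUILTIN_TM = TransactionManager(
    get_transaction_status=lambda xid: "in_progress",   # stub for illustration
    get_new_transaction_id=lambda: next(_local_counter),
)

current_tm = BUILTIN_TM


def load_tm(overrides: dict) -> None:
    """'After tm load': an extension swaps in its own implementations."""
    global current_tm
    current_tm = replace(BUILTIN_TM, **overrides)


def get_new_transaction_id() -> int:
    # Core code never calls an implementation directly, only via the table.
    return current_tm.get_new_transaction_id()


def get_transaction_status(xid: int) -> str:
    return current_tm.get_transaction_status(xid)
```

Hooks that the extension does not override keep their built-in behaviour, which is what lets different XTM implementations replace only the parts they need.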
Acquire XID centrally (DTMd, arbiter);
No local tx possible;
DTMd is a bottleneck.
XTM implementations
GTM or snapshot sharing
9
Paper from SAP HANA team;
Central daemon is needed, but only for multi-node tx;
Snapshots -> Commit Sequence Number;
DTMd is still a bottleneck.
XTM implementations
Incremental SI
10
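One way to read the "Snapshots -> Commit Sequence Number" bullet: instead of carrying a list of running transactions, a snapshot becomes a single number, and a row version is visible if its writer committed at or before that number. A minimal sketch under that reading follows; `CsnLog`, `commit`, `snapshot`, and `is_visible` are hypothetical names, and the central daemon used for multi-node tx is not modelled.

```python
class CsnLog:
    """Toy single-node CSN bookkeeping (hypothetical, for illustration)."""

    def __init__(self):
        self.last_csn = 0
        self.commit_csn = {}          # xid -> CSN assigned at commit

    def commit(self, xid):
        self.last_csn += 1
        self.commit_csn[xid] = self.last_csn
        return self.last_csn

    def snapshot(self):
        # A snapshot is just the latest CSN; no list of running XIDs is needed.
        return self.last_csn

    def is_visible(self, writer_xid, snapshot_csn):
        csn = self.commit_csn.get(writer_xid)   # None: not committed (yet)
        return csn is not None and csn <= snapshot_csn
```

A single integer is much cheaper to ship between nodes than a list of in-progress XIDs, which is part of the appeal for distributed use.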
XID/CSN are gathered from all nodes that participate in the tx;
No central service;
Local tx are possible;
Possible to reduce communication by using clock time (Spanner,
CockroachDB).
XTM implementations
Clock-SI or tsDTM
11
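A rough sketch of the first bullet on this slide, following the Clock-SI scheme in which each participating node proposes a timestamp at prepare and the coordinator commits with the maximum of them. This is an illustration, not the tsDTM code: `Node`, `prepare`, `commit_distributed`, and the integer clocks are hypothetical, and the handling of clock skew between nodes is omitted.

```python
class Node:
    """Toy participant with its own local clock (hypothetical, for illustration)."""

    def __init__(self, name, clock):
        self.name = name
        self.clock = clock
        self.commit_ts = {}           # tx id -> commit timestamp

    def prepare(self, tx):
        # Propose a timestamp from the local clock; no central service is asked.
        self.clock += 1
        return self.clock

    def commit(self, tx, ts):
        # Never let the local clock fall behind a timestamp it has committed.
        self.clock = max(self.clock, ts)
        self.commit_ts[tx] = ts


def commit_distributed(tx, participants):
    # Gather a proposed timestamp from every node that took part in the tx...
    proposals = [node.prepare(tx) for node in participants]
    # ...and commit everywhere with the largest one.
    ts = max(proposals)
    for node in participants:
        node.commit(tx, ts)
    return ts
```

Only the nodes touched by the transaction exchange messages, so a node that is not a participant is not contacted at all and purely local transactions stay local.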
XTM implementations
tsDTM scalability
12
More nodes mean a higher probability of failure in the system.
Possible problems with nodes:
Node stopped (and will not be back);
Node was down for a short time (and we should bring it
back into operation);
Network partitions (avoid split-brain).
If we want to survive network partitions, then we can have no more
than ⌈N/2⌉ - 1 failures (a majority of the N nodes must stay connected).
HA/autofailover
13
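The failure bound on this slide follows from requiring a strict majority of nodes to remain connected, reading the bracket as rounding up. A small worked check under that majority-quorum assumption (the helper names `quorum` and `max_failures` are illustrative, not from the talk):

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def max_failures(n: int) -> int:
    """Failures tolerated while a majority survives: ceil(n/2) - 1."""
    return n - quorum(n)


for n in (3, 4, 5, 7):
    print(n, quorum(n), max_failures(n))
```

Going from 3 to 4 nodes does not raise the number of tolerated failures, which is why odd cluster sizes are the usual choice.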
Possible usage of such system:
Multimaster replication;
Tables with metainformation in sharded databases;
Sharding with redundancy.
HA/autofailover
14
By multimaster we mean a strongly coupled one that acts as a single
database, with proper isolation and no merge conflicts.
Ways to build:
Global ordering of XLOG (Postgres-R, MySQL Galera);
Wrap each tx as distributed – allows parallelism while applying
tx.
Multimaster
15
Our implementation:
Built on top of pglogical;
Makes use of tsDTM;
Pool of workers for tx replay;
Raft-based storage for dealing with failures and distributed
deadlock detection.
Multimaster
16
Our implementation:
Approximately half the speed of standalone postgres;
Same speed for reads;
Handles node autorecovery;
Handles network partitions (debugging right now);
Can work as an extension (if the community accepts the XTM API into
core).
Multimaster
17
