Architecture of a Geo-Distributed SQL Database
CockroachDB
Peter Mattis (@petermattis), Co-founder & CTO
CockroachDB: Geo-distributed SQL Database
Make Data Easy
• Distributed
○ Horizontally scalable to grow with your application
• Geo-distributed
○ Handle datacenter failures
○ Place data near usage
○ Push computation near data
• SQL
○ Lingua-franca for rich data storage
○ Schemas, indexes, and transactions make app development easier
AGENDA
● Introduction
● Ranges and Replicas
● Transactions
● SQL Data in a KV World
● SQL Execution
● SQL Optimization
Distributed, Replicated, Transactional KV*
• Keys and values are strings
○ Lexicographically ordered by key
• Multi-version concurrency control (MVCC)
○ Values are never updated “in place”, newer versions shadow older versions
○ Tombstones are used to delete values
○ Provides snapshot to each transaction
• Monolithic key-space
* Not exposed for external usage
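Although the KV layer itself is internal, the MVCC versions it retains do surface at the SQL layer as time-travel queries. A small sketch, using the dogs table from the following slides:
SELECT * FROM dogs;                           -- reads the latest versions
SELECT * FROM dogs AS OF SYSTEM TIME '-10s';  -- reads the snapshot as of 10 seconds ago,
                                              -- served from older MVCC versions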
Monolithic Key Space
[Diagram: the “dogs” keys (carl, dagne, figment, jack, lady, lula, muddy, peetey, pinetop, sooshi, stella, zee) laid out as one ordered key space]
Monolithic logical key space
● Ordered lexicographically by key
Ranges
[Diagram: the same keys divided into three contiguous ranges: carl-jack, lady-peetey, pinetop-zee]
Key space divided into contiguous ~64MB ranges
Ranges are small enough to
be moved/split quickly
Ranges are large enough to
amortize indexing overhead
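Range boundaries are observable and tunable from SQL. A sketch, assuming the dogs table from the diagrams; range_max_bytes is a zone-config variable and ~64 MiB is the default target size:
SHOW RANGES FROM TABLE dogs;                              -- lists each range's start/end key and replicas
ALTER TABLE dogs CONFIGURE ZONE USING range_max_bytes = 67108864;  -- ~64 MiB split threshold for this table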
Range Indexing
Index structure used to locate ranges
(very much like a B-tree)
[Diagram: range index entries 1: carl-jack, 2: lady-peetey, 3: pinetop-zee, each pointing to its range of keys]
Ordered Range Scans
Ordered keys enable efficient range scans
dogs >= “muddy” AND <= “stella”
[Diagram: the scan uses the range index to start in range 2 at “muddy” and continue into range 3 through “stella”]
Transactional Updates
Transactions used to insert records into ranges
INSERT[sunny]
Space available in range? - YES
[Diagram: the range index routes the insert to range 3 (pinetop-zee), which still has room]
Transactional Updates
Transactions used to insert records into ranges
INSERT[sunny] ✓
[Diagram: “sunny” is written into range 3 in key order: pinetop, sooshi, stella, sunny, zee]
Range Splits
BUT… what happens when a range is full?
INSERT[rudy]
Space available in range? - NO
[Diagram: the insert is routed to range 3 (pinetop-zee), which has no space left]
Range Splits
INSERT[rudy] ✓ - split range and insert
Ranges are automatically split, a new range index entry is created & order maintained
[Diagram: range 3 splits; the index now reads 1: carl-jack, 2: lady-peetey, 3: pinetop-sooshi, 4: stella-zee, and “rudy” lands in the new range 3]
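Splits normally happen automatically when a range reaches its size threshold, but the same mechanism can be driven manually. A sketch, assuming the dogs table is keyed on the dog's name:
ALTER TABLE dogs SPLIT AT VALUES ('stella');  -- force a range boundary immediately before 'stella'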
Raft and Replication
Ranges (~64MB) are the unit of replication
Each range is a Raft group
(Raft is a consensus replication protocol)
Default to 3 replicas, though this is configurable
• Important system ranges default to 5 replicas
• Note: 2 replicas doesn’t make sense in consensus replication
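Replication factors are controlled through zone configurations; for example:
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;  -- cluster-wide default
ALTER TABLE dogs CONFIGURE ZONE USING num_replicas = 3;     -- per-table override (dogs is the running example)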
Raft and Replication
Raft provides “atomic replication” of commands
Commands are proposed by the leaseholder replica
and distributed to the follower replicas, but only
accepted when a quorum of replicas have
acknowledged receipt
* Leaseholder == Raft leader
[Diagram: a Raft group spanning node1-node4; one replica is the LEASEHOLDER, the others are followers]
Range Leases
Reads with consensus
Reads must talk to a quorum of replicas
[Diagram: READ[carl] fans out to a quorum of the range's replicas across node1-node4]
Range Leases
Reads without consensus
One replica is chosen as the leaseholder
● Coordinates writes (proposal, key locking)
● Performs reads
[Diagram: READ[carl] is served by the leaseholder replica alone]
Replica Placement
Each Range is a Raft state machine
A Range has 1 or more Replicas
Replica placement considers:
● Space
● Diversity
● Load
● Latency
[Diagram: one range's replicas placed on three of four nodes]
Replica Placement: Diversity
Diversity optimizes placement of replicas across “failure domains”
● Disk
● Single machine
● Rack
● Datacenter
● Region
[Diagram: each range's three replicas spread across different nodes/failure domains]
Replica Placement: Load
Load
Balances placement using heuristics that consider real-time usage metrics of the data itself
[Diagram: one range is high load because it is accessed more than the others, so its replicas are moved to less loaded nodes]
While we show this for ranges within a single table, it also applies across all ranges in ALL tables, which is the more typical situation
Replica Placement: Latency & Geo-partitioning
[Diagram: each key is prefixed with a region (USE/dagne, USE/figment, USE/muddy, USE/stella; USW/jack, USW/lady, USW/peetey, USW/pinetop; EU/carl, EU/lula, EU/sooshi, EU/zee) so each region's keys form their own ranges]
We apply a constraint that indicates regional
placement so we can ensure low latency
access or jurisdictional control of data
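In SQL this is expressed with table partitioning plus zone-config constraints. A sketch, assuming the region prefix (USE/USW/EU) is a leading column of the primary key; partition and region names here are illustrative:
ALTER TABLE dogs PARTITION BY LIST (region) (
  PARTITION us_east VALUES IN ('USE'),
  PARTITION us_west VALUES IN ('USW'),
  PARTITION eu      VALUES IN ('EU')
);
ALTER PARTITION us_east OF TABLE dogs CONFIGURE ZONE USING constraints = '[+region=us-east]';
ALTER PARTITION us_west OF TABLE dogs CONFIGURE ZONE USING constraints = '[+region=us-west]';
ALTER PARTITION eu      OF TABLE dogs CONFIGURE ZONE USING constraints = '[+region=eu]';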
Rebalancing Replicas
Scale: Add a node
If we add a node to the cluster, CockroachDB automatically redistributes replicas to even out load across the cluster
Uses the replica placement heuristics from the previous slides to decide which node to add a replica to and which to remove one from
[Diagram: a NEW node joins a five-node cluster]
Rebalancing Replicas
Scale: Add a node
Movement is decomposed into adding a replica on the new node followed by removing a replica from an existing node
[Diagram: a replica is copied to the NEW node, then the now-redundant replica is removed]
Rebalancing Replicas
Loss of a node: Permanent Failure
If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica on an active node
Uses the replica placement heuristics from the previous slides
The failed replica is removed from the Raft group and a new replica created. The leaseholder sends a snapshot of the Range’s state to bring the new replica up to date.
[Diagram: a node fails; its replica is re-created on a surviving node]
Rebalancing Replicas
Loss of a node: Temporary Failure
If a node goes down for a moment, the leaseholder can “catch up” any replica that is behind
The leaseholder can send commands to be replayed OR it can send a snapshot of the current Range data. We apply heuristics to decide which is most efficient for a given failure.
[Diagram: the lagging replica rejoins and is caught up by the leaseholder]
AGENDA
● Introduction
● Ranges and Replicas
● Transactions
● SQL Data in a KV World
● SQL Execution
● SQL Optimization
Transactions
Atomicity, Consistency, Isolation, Durability
Serializable Isolation
• As if the transactions are run in a serial order
• Gold standard isolation level
• Make Data Easy - weaker isolation levels are too great a burden
Transactions can span arbitrary ranges
Conversational
• The full set of operations is not required up front
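Conversational means the client can issue statements one at a time inside an open transaction, and those statements may touch keys in arbitrary ranges. A sketch, assuming the dogs table has a single name column:
BEGIN;
INSERT INTO dogs VALUES ('ozzie');   -- lands in one range
INSERT INTO dogs VALUES ('sunny');   -- may land in a different range
COMMIT;                              -- both writes become visible atomically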
Transactions
Raft provides atomic writes to individual ranges
Bootstrap transaction atomicity using Raft atomic writes
Transaction record atomically flipped from PENDING to COMMIT
Distributed Transactions
INSERT INTO dogs
VALUES (sunny, ozzie)
The statement arrives at a GATEWAY node, which coordinates the transaction across the ranges it touches:
BEGIN TXN1
● A transaction record for TXN1 is written with status PENDING
WRITE[sunny]
● The provisional value “sunny” is sent to the leaseholder of its range and replicated to the follower replicas via Raft; an ACK returns to the gateway once a quorum has accepted it
WRITE[ozzie]
● The provisional value “ozzie” is written the same way to the leaseholder of the range that contains it; ACK
COMMIT
● The transaction record is atomically flipped from TXN1: PENDING to TXN1: COMMIT
ACK
● The gateway acknowledges the result to the client
[Diagrams: “sunny” and “ozzie” land in different ranges whose replicas are spread across node1-node4; every replica of each range applies its write before the commit is acknowledged]
Transactions: Pipelining
Serial vs Pipelined execution of:
BEGIN
WRITE[sunny]
WRITE[ozzie]
COMMIT
Serial: the transaction record txn:sunny (pending) is written, each write completes before the next begins, and COMMIT rewrites the record as txn:sunny (commit)[keys: sunny, ozzie]
Pipelined: the writes are issued concurrently; COMMIT writes the record as txn:sunny (staged)[keys: sunny, ozzie], and the transaction is committed once all operations complete
We replaced the centralized commit marker with a distributed one
* “Proved” with TLA+
AGENDA
● Introduction
● Ranges and Replicas
● Transactions
● SQL Data in a KV World
● SQL Execution
● SQL Optimization
SQL
Structured Query Language
Declarative, not imperative
• These are the results I want vs perform these operations in this sequence
Relational data model
• Typed: INT, FLOAT, STRING, ...
• Schemas: tables, rows, columns, foreign keys
SQL: Tabular Data in a KV World
SQL data has columns and types?!?
How do we store typed and columnar data in a distributed, replicated,
transactional key-value store?
• The SQL data model needs to be mapped to KV data
• Reminder: keys and values are lexicographically sorted
CREATE TABLE inventory (
id INT PRIMARY KEY,
name STRING,
price FLOAT
)
SQL Data Mapping: Inventory Table
ID Name Price
1 Bat 1.11
2 Ball 2.22
3 Glove 3.33
Key Value
/1 “Bat”,1.11
/2 “Ball”,2.22
/3 “Glove”,3.33
CREATE TABLE inventory (
id INT PRIMARY KEY,
name STRING,
price FLOAT
)
SQL Data Mapping: Inventory Table
ID Name Price
1 Bat 1.11
2 Ball 2.22
3 Glove 3.33
Key Value
/<Table>/<Index>/1 “Bat”,1.11
/<Table>/<Index>/2 “Ball”,2.22
/<Table>/<Index>/3 “Glove”,3.33
CREATE TABLE inventory (
id INT PRIMARY KEY,
name STRING,
price FLOAT
)
SQL Data Mapping: Inventory Table
ID Name Price
1 Bat 1.11
2 Ball 2.22
3 Glove 3.33
Key Value
/inventory/primary/1 “Bat”,1.11
/inventory/primary/2 “Ball”,2.22
/inventory/primary/3 “Glove”,3.33
CREATE TABLE inventory (
id INT PRIMARY KEY,
name STRING,
price FLOAT,
INDEX name_idx (name)
)
SQL Data Mapping: Inventory Table
ID Name Price
1 Bat 1.11
2 Ball 2.22
3 Glove 3.33
Key Value
/inventory/name_idx/”Bat”/1 ∅
/inventory/name_idx/”Ball”/2 ∅
/inventory/name_idx/”Glove”/3 ∅
CREATE TABLE inventory (
id INT PRIMARY KEY,
name STRING,
price FLOAT,
INDEX name_idx (name)
)
SQL Data Mapping: Inventory Table
ID Name Price
1 Bat 1.11
2 Ball 2.22
3 Glove 3.33
4 Bat 4.44
Key Value
/inventory/name_idx/”Bat”/1 ∅
/inventory/name_idx/”Ball”/2 ∅
/inventory/name_idx/”Glove”/3 ∅
CREATE TABLE inventory (
id INT PRIMARY KEY,
name STRING,
price FLOAT,
INDEX name_idx (name)
)
SQL Data Mapping: Inventory Table
ID Name Price
1 Bat 1.11
2 Ball 2.22
3 Glove 3.33
4 Bat 4.44
Key Value
/inventory/name_idx/”Bat”/1 ∅
/inventory/name_idx/”Ball”/2 ∅
/inventory/name_idx/”Glove”/3 ∅
/inventory/name_idx/”Bat”/4 ∅
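The secondary-index keys above already contain both the indexed column and the primary key, so a lookup by name can be answered from name_idx alone. A sketch; the second statement uses CockroachDB's table@index hint to force the choice:
SELECT id FROM inventory WHERE name = 'Bat';           -- optimizer can serve this from name_idx
SELECT id FROM inventory@name_idx WHERE name = 'Bat';  -- index hint: scan name_idx explicitly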
AGENDA
● Introduction
● Ranges and Replicas
● Transactions
● SQL Data in a KV World
● SQL Execution
● SQL Optimization
SQL Execution
Relational operators
• Projection (SELECT <columns>)
• Selection (WHERE <filter>)
• Aggregation (GROUP BY <columns>)
• Join (JOIN), union (UNION), intersect (INTERSECT)
• Scan (FROM <table>)
• Sort (ORDER BY)
○ Technically, not a relational operator
SQL Execution
• Relational expressions have input expressions and scalar expressions
○ For example, a “filter” expression has 1 input expression and a scalar expression that
filters the rows from the child
○ The scan expression has zero inputs
• Query plan is a tree of relational expressions
• SQL execution takes a query plan and runs the operations to completion
SQL Execution: Example
SELECT name
FROM inventory
WHERE name >= “b” AND name < “c”
SQL Execution: Scan
SELECT name
FROM inventory
WHERE name >= “b” AND name < “c”
Scan
inventory
SQL Execution: Filter
SELECT name
FROM inventory
WHERE name >= “b” AND name < “c”
Scan
inventory
Filter
name >= “b” AND name < “c”
SQL Execution: Project
SELECT name
FROM inventory
WHERE name >= “b” AND name < “c”
Scan
inventory
Filter
name >= “b” AND name < “c”
Project
name
SQL Execution: Project
SELECT name
FROM inventory
WHERE name >= “b” AND name < “c”
Scan
inventory
Filter
name >= “b” AND name < “c”
Project
name
Results
SQL Execution: Index Scans
SELECT name
FROM inventory
WHERE name >= “b” AND name < “c”
Scan
inventory@name [“b” - “c”)
The filter gets pushed into the scan
SQL Execution: Index Scans
SELECT name
FROM inventory
WHERE name >= “b” AND name < “c”
Scan
inventory@name [“b” - “c”)
Project
name
Results
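The chosen plan can be inspected with EXPLAIN; exact output varies by version, but it should show a scan of the name index constrained to the [“b” - “c”) span rather than a full-table scan plus filter:
EXPLAIN SELECT name FROM inventory WHERE name >= 'b' AND name < 'c';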
SQL Execution: Correctness
Correct SQL execution involves lots of bookkeeping
• User-defined tables and indexes
• Queries refer to table and column names
• Execution uses table and column IDs
• NULL handling
SQL Execution: Performance
Performant SQL execution
• Tight, well written code
• Operator specialization
○ hash group by, stream group by
○ hash join, merge join, lookup join, zig-zag join
• Distributed execution
SQL Execution: Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country

Name      Country
Bob       United States
Hans      Germany
Jacques   France
Marie     France
Susan     United States
SQL Execution: Hash Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country

Name      Country
Bob       United States
Hans      Germany
Jacques   France
Marie     France
Susan     United States

Rows are consumed in table order and a hash table keyed on country accumulates the counts:
United States 1
→ United States 1, Germany 1
→ United States 1, Germany 1, France 1
→ United States 1, Germany 1, France 2
→ United States 2, Germany 1, France 2
SQL Execution: Group By Revisited
SELECT COUNT(*), country
FROM customers
GROUP BY country

Name      Country
Bob       United States
Hans      Germany
Jacques   France
Marie     France
Susan     United States
SQL Execution: Sort on Grouping Column(s)
SELECT COUNT(*), country
FROM customers
GROUP BY country

Name      Country
Jacques   France
Marie     France
Hans      Germany
Bob       United States
Susan     United States
SQL Execution: Streaming Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country

Name      Country
Jacques   France
Marie     France
Hans      Germany
Bob       United States
Susan     United States

With the input sorted on country, each group arrives contiguously, so the count can be finalized as soon as its group ends:
France 1
→ France 2
→ France 2, Germany 1
→ France 2, Germany 1, United States 1
→ France 2, Germany 1, United States 2
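Whether the streaming variant is available depends on the input ordering; an index on the grouping column lets the optimizer read rows already sorted by country and pick the streaming aggregator. A sketch (index name assumed):
CREATE INDEX country_idx ON customers (country);
SELECT count(*), country FROM customers GROUP BY country;  -- can now be evaluated as a streaming group-by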
Distributed SQL Execution
Network latencies and
throughput are important
considerations in
geo-distributed setups
Push fragments of computation
as close to the data as possible
Distributed SQL Execution: Streaming Group By
SELECT COUNT(*), country
FROM customers
GROUP BY country
● Each node scans its local ranges of customers
● A Group-By on “country” runs next to each scan, producing partial counts per node
● A final Group-By on “country” merges the partial counts into the overall result
AGENDA
● Introduction
● Ranges and Replicas
● Transactions
● SQL Data in a KV World
● SQL Execution
● SQL Optimization
SQL Optimization
An optimizer explores many plans that are logically equivalent to a given
query and chooses the best one
Pipeline: Parse SQL → AST → Prep → Memo → Search → Plan → Execute
● Prep: fold constants, check types, resolve names, report semantic errors
● Memo: compute properties, retrieve and attach stats, cost-independent transformations
● Search: cost-based transformations
SQL Optimization: Cost-Independent Transformations
• Some transformations always make sense
○ Constant folding
○ Filter push-down
○ Decorrelating subqueries*
○ ...
• These transformations are cost-independent
○ If the transformation can be applied to the query, it is applied
• Domain Specific Language for transformations
○ Compiled down to code which efficiently matches query fragments in the memo
○ ~200 transformations currently defined
* Actually cost-based, but we’re treating it as cost-independent right now
SQL Optimization: Filter Push-Down
SELECT * FROM a JOIN b WHERE x > 10
Initial plan: Scan a@primary and Scan b@primary feed a Join; the Filter x > 10 sits above the Join; Results
After filter push-down: the Filter x > 10 is pushed below the Join so it applies directly to the scan that produces x; Results
SQL Optimization: Cost-Based Transformations
• Some transformations are not universally good
○ Index selection
○ Join reordering
○ ...
• These transformations are cost-based
○ When should the transformation be applied?
○ Need to try both paths and maintain both the original and transformed query
○ State explosion: thousands of possible query plans
■ Memo data structure maintains a forest of query plans
○ Estimate cost of each query, select query with lowest cost
• Costing
○ Based on table statistics and estimating cardinality of inputs to relational expressions
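The statistics the cost model relies on can be created and inspected from SQL; for example (statistics name assumed):
CREATE STATISTICS a_stats FROM a;  -- collect row counts and per-column distinct/null counts
SHOW STATISTICS FOR TABLE a;       -- inspect what the optimizer will use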
SQL Optimization: Cost-based Index Selection
The index to use for a query is affected by multiple factors
• Filters and join conditions
• Required ordering (ORDER BY)
• Implicit ordering (GROUP BY)
• Covering vs non-covering (i.e. is an index-join required)
• Locality
SQL Optimization: Cost-based Index Selection
SELECT *
FROM a
WHERE x > 10
ORDER BY y

Required orderings affect index selection
Sorting is expensive if there are a lot of rows
Sorting can be the better option if there are few rows

Candidate plans:
1. Scan a@primary → Filter x > 10 → Sort y
2. Scan a@x [10 - ) → Sort y
3. Scan a@y → Filter x > 10 (already ordered on y, no sort needed)

The optimizer estimates a cost for each plan from table statistics. In the slides, when only a few rows satisfy x > 10 the estimates are roughly 10 vs 100,000 and the cheap index-scan-plus-small-sort plan wins; when many rows match, the estimates become roughly 50,000 vs 100,000 and a different plan has the lowest cost. The lowest-cost plan is selected in each case.
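The plans above assume table a has secondary indexes on x and y. A hypothetical schema matching the example (the index names are assumptions; the slides simply call them a@x and a@y):
CREATE TABLE a (
  id INT PRIMARY KEY,
  x  INT,
  y  INT,
  INDEX x_idx (x),
  INDEX y_idx (y)
);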
Locality-Aware SQL Optimization
Network latencies and
throughput are important
considerations in
geo-distributed setups
Duplicate read-mostly data in
each locality
Plan queries to use data from
the same locality
Locality-Aware SQL Optimization
Three copies of the
postal_codes table data
Use replication constraints to
pin the copies to different
geographic regions (US-East,
US-West, EU)
CREATE TABLE postal_codes (
id INT PRIMARY KEY,
code STRING,
INDEX idx_eu (id) STORING (code),
INDEX idx_usw (id) STORING (code)
)
Locality-Aware SQL Optimization
Optimizer includes locality in
cost model
Automatically selects index
from same locality: primary,
idx_eu, or idx_usw
CREATE TABLE postal_codes (
id INT PRIMARY KEY,
code STRING,
INDEX idx_eu (id) STORING (code),
INDEX idx_usw (id) STORING (code)
)
SELECT * FROM postal_codes
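Pinning each duplicate index to its region is again done with zone configurations. A sketch (region labels are illustrative; lease_preferences keeps the leaseholder local as well):
ALTER INDEX postal_codes@idx_eu CONFIGURE ZONE USING
  constraints = '[+region=eu]', lease_preferences = '[[+region=eu]]';
ALTER INDEX postal_codes@idx_usw CONFIGURE ZONE USING
  constraints = '[+region=us-west]', lease_preferences = '[[+region=us-west]]';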
Conclusion
● Distributed, replicated, transactional key-value store
● Monolithic key space
● Raft replication of ranges (~64MB)
● Replica placement signals: space, diversity, load, latency
● Pipelined transaction operations
● Mapping SQL data to KV storage
● Distributed SQL execution
● Distributed SQL optimization
www.cockroachlabs.com
github.com/cockroachdb/cockroach
Thank You
A Simple Transaction: One Range
INSERT INTO DOGS (sunny);
The statement arrives at a GATEWAY node
NOTE: a gateway can be ANY CockroachDB instance. It can find the leaseholder for any range and execute a transaction
BEGIN → WRITE[sunny] → COMMIT is sent to the leaseholder of the single range containing “sunny”; the write is replicated to the range's other replicas via Raft, and an ACK flows back to the gateway and then to the client
[Diagrams: “sunny” is added to the pinetop-zee range; all of that range's replicas apply the write before the ACK]
Ranges
CockroachDB implements order-preserving data distribution
• Automates sharding of key/value data into “ranges”
• Supports efficient range scans
• Requires an indexing structure
Foundational capability that enables efficient distribution
of data across nodes within a CockroachDB cluster
* This approach is also used by Bigtable (tablets), HBase (regions) & Spanner (ranges)