© 2018 All rights reserved.
Distributed PostgreSQL
with YugaByte DB
Karthik Ranganathan
PostgresConf Silicon Valley
Oct 16, 2018
CHECK OUT THIS REPO:
github.com/YugaByte/yb-sql-workshop
About Us
Founders
Kannan Muthukkaruppan, CEO
Nutanix ♦ Facebook ♦ Oracle
IIT-Madras, University of California-Berkeley
Karthik Ranganathan, CTO
Nutanix ♦ Facebook ♦ Microsoft
IIT-Madras, University of Texas-Austin
Mikhail Bautin, Software Architect
ClearStory Data ♦ Facebook ♦ D.E.Shaw
Nizhny Novgorod State University, Stony Brook
• Founded Feb 2016
• Apache HBase committers and early engineers on Apache Cassandra
• Built Facebook’s NoSQL platform powered by Apache HBase
• Scaled the platform to serve many mission-critical use cases
o Facebook Messages (Messenger)
o Operational Data Store (Time series Data)
• Reassembled the same Facebook team at YugaByte along with engineers from Oracle, Google, Nutanix and LinkedIn
WORKSHOP AGENDA
• What is YugaByte DB? Why Another DB?
• Exercise 1: BI Tools on YugaByte PostgreSQL
• Exercise 2: Distributed PostgreSQL Architecture
• Exercise 3: Sharding and Scale Out in Action
• Exercise 4: Fault Tolerance in Action
WHAT IS
YUGABYTE DB?
A transactional, planet-scale database
for building high-performance cloud services.
NoSQL + SQL Cloud Native
WHY ANOTHER DB?
Typical Stack Today
Fragile infra with several moving parts
[Diagram: Application Tier (Stateless Microservices) over a SQL Master and SQL Slave, spanning Datacenter 1 and Datacenter 2]
• SQL for OLTP data: manual sharding (cost: dev team)
• Manual replication, manual failover (cost: ops team)
• NoSQL for other data: app aware of data silo (cost: dev team)
• Cache for low latency: app does caching (cost: dev team)
• Data inconsistency/loss, fragile infra, hours of debugging (cost: dev + ops team)
Does AWS change this?
Still complex: it’s the same architecture
[Diagram: Application Tier (Stateless Microservices) over SQL Master and SQL Slave across Datacenter 1 and Datacenter 2, with ElastiCache, Aurora and DynamoDB as the managed services]
System-of-Record DBs for Global Apps
[Figure: database offerings, each labeled "Not Portable" or "Open Source" and "High Performance, Transactional, Planet-Scale"]
Design Principles
• TRANSACTIONAL: Single Shard & Distributed ACID Txns; Document-Based, Strongly Consistent Storage
• HIGH PERFORMANCE: Low Latency, Tunable Reads; High Throughput
• PLANET-SCALE: Auto Sharding & Rebalancing; Global Data Distribution
• OPEN SOURCE: Apache 2.0; Popular APIs Extended: Apache Cassandra, Redis and PostgreSQL (BETA)
• CLOUD NATIVE: Built For The Container Era; Self-Healing, Fault-Tolerant
EXERCISE #1
BUSINESS INTELLIGENCE
EXERCISE #2
DISTRIBUTED POSTGRES:
ARCHITECTURE
ARCHITECTURE
Overview
YugaByte DB Process Overview
• Universe = cluster of nodes
• Two sets of processes: YB-Master & YB-TServer
• Example universe: 4 nodes, RF=3
Sharding data
• User table split into tablets
One tablet for every key
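A minimal sketch of how a key can be mapped to exactly one tablet. It assumes hash sharding over a 16-bit hash space split evenly across tablets; the `HASH_SPACE` constant, the MD5-based hash and the function name `tablet_for_key` are illustrative choices, not YugaByte DB's actual implementation.

```python
import hashlib

# Assumption for illustration: a 16-bit hash space divided evenly among tablets.
HASH_SPACE = 0x10000


def tablet_for_key(key: str, num_tablets: int) -> int:
    """Map a primary key to exactly one tablet index by hash range."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    hash_code = int.from_bytes(digest[:2], "big")  # value in [0, 0xFFFF]
    # Each tablet owns one contiguous slice of the hash space.
    return hash_code * num_tablets // HASH_SPACE


if __name__ == "__main__":
    for k in ("user-1", "user-2", "user-3"):
        print(k, "-> tablet", tablet_for_key(k, 16))
```

Because the mapping depends only on the key and the tablet count, every node can route a request for a key to the same tablet without coordination.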
Tablets and replication
• Tablet = set of tablet-peers in a Raft group
• Number of tablet-peers in a tablet = replication factor (RF)
Tolerate 1 failure : RF=3
Tolerate 2 failures: RF=5
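The two RF figures above follow from majority quorums: a Raft group stays available as long as a majority of its tablet-peers is alive. A small sketch of that arithmetic (the function names are illustrative):

```python
def majority(rf: int) -> int:
    """Smallest number of tablet-peers that forms a majority of rf peers."""
    return rf // 2 + 1


def tolerated_failures(rf: int) -> int:
    """Peers that can fail while a majority is still reachable."""
    return rf - majority(rf)


if __name__ == "__main__":
    for rf in (3, 5):
        print(f"RF={rf}: majority={majority(rf)}, tolerates {tolerated_failures(rf)} failure(s)")
```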
YB-TServer
• Process that does IO
• Hosts tablets for tables
• Hosts transaction manager
• Auto memory sizing
Block cache
Memstores
YB-Master
• Not in critical path
• System metadata store
Keyspaces, tables, tablets
Users/roles, permissions
• Admin operations
Create/alter/drop of tables
Backups
Load balancing (leader and data balancing)
Enforces data placement policy
HANDLING DDL STATEMENTS
DDL Statements in PostgreSQL
DDL → Postman (authentication, authorization) → Rewriter → Planner/Optimizer → Executor → DISK
On disk: Create Table Data File, Update System Tables
DDL Statements in YugaByte DB PostgreSQL
DDL → Postman (authentication, authorization) → Rewriter → Planner/Optimizer → Executor
Executor: Create sharded, replicated table as data source
Executor: Store Table Metadata in YB-Master (in the works), on YugaByte master1, master2, master3, …
YugaByte Query Layer (YQL)
• Stateless, runs in each YB-TServer process
GA goal: distributed, stateless PostgreSQL layer
Current beta uses a single stateless PostgreSQL layer
HANDLING DML QUERIES
DML Queries in PostgreSQL
QUERY → Postman (authentication, authorization) → Rewriter → Planner/Optimizer → Executor
Executor → Local Table Code Path (WAL Writer, BG Writer, …) → DISK
Executor → FDW → EXTERNAL DATABASE
DML Queries in YugaByte DB PostgreSQL
DML → Postman (authentication, authorization) → Rewriter → Planner/Optimizer → Executor
Executor → FDW (YugaByte DB Code Path) → YB Gateway → EXTERNAL DATABASE (YugaByte node1, node2, node3, node4, …)
Using FDW as a Table Storage API
ARCHITECTURE
Data Persistence
Data Persistence in DocDB
• DocDB is YugaByte DB’s LSM storage engine
• Persistent key to document store
• Extends and enhances RocksDB
• Designed to support high data-densities per node
DocDB: Key-to-Document Store
• Document key
CQL/SQL/Redis primary key
• Document value
a CQL or SQL row
Redis data structure
• Fine-grained reads and writes
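One way to picture a key-to-document store with fine-grained reads and writes is to flatten each document into one entry per column, tagged with a write timestamp. The sketch below is only an illustration under that assumption; the tuple layout `(document key, subkey, hybrid time)` and the helper names are hypothetical and do not reproduce DocDB's actual on-disk encoding.

```python
from typing import Any, Dict, Optional, Tuple

# Illustrative layout only: (document key, subkey, hybrid time) -> value.
Store = Dict[Tuple[str, str, int], Any]


def write_row(store: Store, doc_key: str, columns: Dict[str, Any], hybrid_time: int) -> None:
    """Write only the given columns; untouched columns keep their old entries."""
    for subkey, value in columns.items():
        store[(doc_key, subkey, hybrid_time)] = value


def read_column(store: Store, doc_key: str, subkey: str, read_time: int) -> Optional[Any]:
    """Return the newest value of one column written at or before read_time."""
    versions = [
        (ht, value)
        for (d, s, ht), value in store.items()
        if d == doc_key and s == subkey and ht <= read_time
    ]
    return max(versions, key=lambda pair: pair[0])[1] if versions else None


if __name__ == "__main__":
    store: Store = {}
    write_row(store, "k1", {"name": "alice", "age": 30}, hybrid_time=100)
    write_row(store, "k1", {"age": 31}, hybrid_time=200)  # updates a single column
    print(read_column(store, "k1", "age", 150), read_column(store, "k1", "age", 250))
```

Updating one column adds one small entry rather than rewriting the whole document, which is the sense in which reads and writes are fine-grained here.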
DocDB Data Format
Example Insert
Encoding
Some of the RocksDB enhancements
• WAL and MVCC enhancements
o Removed RocksDB WAL, re-uses Raft log
o MVCC at a higher layer
o Coordinate RocksDB memstore flushing and Raft log garbage collection
• File format changes
o Sharded (multi-level) indexes and Bloom filters
• Splitting data blocks & metadata into separate files for tiering support
• Separate queues for large and small compactions
More Enhancements to RocksDB
• Data model aware Bloom filters
• Per-SSTable key range metadata to optimize range queries
• Server-global block caches & memstore limits
• Scan-resistant block cache (single-touch and multi-touch)
ARCHITECTURE
Data Replication
Raft Replication for Consistency
How Raft Replication Works
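The diagrams for these slides did not survive extraction. As a rough sketch of the core rule in Raft, a log entry counts as committed once it is stored on a majority of peers. The function below computes that commit point from each peer's highest replicated log index; the name `commit_index` and its input format are illustrative, and real Raft additionally requires the entry to belong to the leader's current term.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of peers (leader included).

    match_index[i] is the highest log index known to be replicated on peer i.
    """
    ordered = sorted(match_index, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is present on at least `majority` peers.
    return ordered[majority - 1]


if __name__ == "__main__":
    # Leader at index 5, one follower at 3, one lagging at 2: index 3 is committed.
    print(commit_index([5, 3, 2]))
```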
Raft Related Enhancements
• Leader Leases
• Multiple Raft groups (1 per tablet)
• Leader Balancing
• Group Commits
• Observer Nodes / Read Replicas
ARCHITECTURE
Transactions
Single Shard Transactions
INSERT INTO table (k, v) VALUES ('k1', 'v1')
1. Lock Manager (in memory, on leader only): acquire a lock on x
2. DocDB / RocksDB: read current value of x
3. Submit a Raft operation for replication: Insert (k1, v1) at hybrid_time 100 (Raft log)
4. Raft Consensus Protocol: replicate to majority of tablet peers (tablet followers)
5. Apply to RocksDB and release lock: k1,v1 @ht=100
MVCC for Lockless Reads
• Achieved through HybridTime (HT)
Monotonically increasing timestamp
• Allows reads at a particular HT without locking
• Multiple versions may exist temporarily
Reclaim older values during compactions
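A toy, single-process sketch of these ideas, assuming a simple integer counter stands in for HybridTime and a `threading.Lock` stands in for the leader's in-memory lock manager. The class `Tablet`, its methods and the `history_cutoff_ht` parameter are hypothetical names; Raft replication is reduced to a comment.

```python
import threading
from typing import Dict, List, Optional, Tuple


class Tablet:
    """Toy tablet leader: locked writes, lock-free reads at a hybrid time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stand-in for the in-memory lock manager
        self._hybrid_time = 0          # stand-in for a monotonically increasing HT
        self._versions: Dict[str, List[Tuple[int, str]]] = {}

    def write(self, key: str, value: str) -> int:
        with self._lock:
            self._hybrid_time += 1
            ht = self._hybrid_time
            # A real system would replicate this operation to a majority here.
            self._versions.setdefault(key, []).append((ht, value))
            return ht

    def read(self, key: str, read_ht: int) -> Optional[str]:
        """Read as of read_ht without taking the write lock."""
        for ht, value in reversed(self._versions.get(key, [])):
            if ht <= read_ht:
                return value
        return None

    def compact(self, history_cutoff_ht: int) -> None:
        """Reclaim versions that no read at or after the cutoff can observe."""
        for key, versions in list(self._versions.items()):
            older = [v for v in versions if v[0] <= history_cutoff_ht]
            newer = [v for v in versions if v[0] > history_cutoff_ht]
            self._versions[key] = older[-1:] + newer


if __name__ == "__main__":
    t = Tablet()
    h1 = t.write("k1", "v1")
    h2 = t.write("k1", "v2")
    print(t.read("k1", h1), t.read("k1", h2))
```

Multiple versions of `k1` coexist until `compact` runs, after which only the newest version at or below the cutoff is kept.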
Single Shard Transactions
• Each tablet maintains a “safe time” for reads
o Highest timestamp such that the view as of that timestamp is fixed
o In the common case it is just before the hybrid time of the next
uncommitted record in the tablet
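A minimal sketch of the common case described above, with integer hybrid times. The function name and the fallback to the leader's current time when nothing is pending are assumptions made for illustration.

```python
from typing import Iterable


def safe_time(pending_hybrid_times: Iterable[int], leader_now: int) -> int:
    """Highest hybrid time at which the tablet's view can no longer change.

    pending_hybrid_times: hybrid times of records not yet committed.
    leader_now: the leader's current hybrid time (assumed fallback when idle).
    """
    pending = list(pending_hybrid_times)
    if pending:
        # Just before the next uncommitted record.
        return min(pending) - 1
    return leader_now


if __name__ == "__main__":
    print(safe_time([105, 110], leader_now=120))
    print(safe_time([], leader_now=120))
```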
Distributed Transactions
• Fully decentralized architecture
• Every tablet server can act as a Transaction Manager
• A distributed Transaction Status table
Tracks state of active transactions
• Transactions can have 3 states:
pending, committed, aborted
Distributed Transactions – Write Path
Distributed Transactions – Write Path Step 1: Client request
Distributed Transactions – Write Path Step 2: Create status record
Distributed Transactions – Write Path Step 3: Write provisional records
Distributed Transactions – Write Path Step 4: Atomic commit
Distributed Transactions – Write Path Step 5: Respond to client
Distributed Transactions – Write Path Step 6: Apply provisional records
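The write-path steps above can be sketched in a single process. This is a simplified model, not the actual implementation: the `Cluster` class and its methods are hypothetical, and the tablets and the transaction status table are collapsed into plain dictionaries. The point it illustrates is that the commit is one update to the status record, and provisional records become visible through that status before they are applied.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class TxnStatus(Enum):
    PENDING = "pending"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Cluster:
    """Toy model: one status table plus provisional and committed records."""

    def __init__(self) -> None:
        self.status_table: Dict[str, TxnStatus] = {}
        self.provisional: Dict[str, Tuple[str, str]] = {}  # key -> (txn id, value)
        self.committed: Dict[str, str] = {}

    def begin(self, txn_id: str) -> None:
        # Step 2: create the status record.
        self.status_table[txn_id] = TxnStatus.PENDING

    def write(self, txn_id: str, key: str, value: str) -> None:
        # Step 3: write a provisional record on the tablet owning the key.
        self.provisional[key] = (txn_id, value)

    def commit(self, txn_id: str) -> None:
        # Step 4: a single status-record update is the atomic commit point.
        self.status_table[txn_id] = TxnStatus.COMMITTED

    def apply(self, txn_id: str) -> None:
        # Step 6: turn provisional records into regular records.
        for key, (owner, value) in list(self.provisional.items()):
            if owner == txn_id and self.status_table[owner] is TxnStatus.COMMITTED:
                self.committed[key] = value
                del self.provisional[key]

    def read(self, key: str) -> Optional[str]:
        # A provisional record is visible only if its transaction committed.
        if key in self.provisional:
            owner, value = self.provisional[key]
            if self.status_table.get(owner) is TxnStatus.COMMITTED:
                return value
        return self.committed.get(key)


if __name__ == "__main__":
    c = Cluster()
    c.begin("t1")
    c.write("t1", "a", "1")
    c.write("t1", "b", "2")
    print(c.read("a"))  # not visible while the transaction is pending
    c.commit("t1")
    print(c.read("a"), c.read("b"))  # both visible together after commit
    c.apply("t1")
```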
Isolation Levels
• Currently Snapshot Isolation is supported
o Write-write conflicts detected when writing provisional records
• Serializable isolation (roadmap)
o Reads in RW txns also need provisional records
• Read-only transactions are always lock-free
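A hedged sketch of write-write conflict detection at the moment a provisional record is written, under snapshot isolation. The class `SnapshotTablet`, the exception `WriteConflict` and the integer timestamps are illustrative; this sketch simply rejects the second writer, whereas the deck's conflict resolution uses transaction priorities.

```python
from typing import Dict


class WriteConflict(Exception):
    """Raised when a provisional write collides with another transaction."""


class SnapshotTablet:
    def __init__(self) -> None:
        self.committed_ht: Dict[str, int] = {}  # key -> hybrid time of last commit
        self.provisional: Dict[str, str] = {}   # key -> owning transaction id

    def write_provisional(self, txn_id: str, start_ht: int, key: str) -> None:
        owner = self.provisional.get(key)
        if owner is not None and owner != txn_id:
            # Another pending transaction already intends to write this key.
            raise WriteConflict(f"{key} is being written by {owner}")
        if self.committed_ht.get(key, 0) > start_ht:
            # The key changed after this transaction took its snapshot.
            raise WriteConflict(f"{key} was committed after snapshot {start_ht}")
        self.provisional[key] = txn_id

    def commit(self, txn_id: str, commit_ht: int) -> None:
        for key in [k for k, o in self.provisional.items() if o == txn_id]:
            self.committed_ht[key] = commit_ht
            del self.provisional[key]


if __name__ == "__main__":
    tablet = SnapshotTablet()
    tablet.write_provisional("t1", start_ht=10, key="k")
    try:
        tablet.write_provisional("t2", start_ht=12, key="k")
    except WriteConflict as exc:
        print("conflict:", exc)
```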
Clock Skew and Read Restarts
• Need to ensure the read timestamp is high enough
o Committed records the client might have seen must be visible
• Optimistically use current Hybrid Time, re-read if necessary
o Reads are restarted if a record with a higher timestamp that the client
could have seen is encountered
o Read restart happens at most once per tablet
o Relying on bounded clock skew (NTP, AWS Time Sync)
• Only affects multi-row reads of frequently updated records
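A simplified sketch of the read-restart rule for one tablet, assuming integer hybrid times and a known bound `max_clock_skew`. The function name and record format are hypothetical. A record whose timestamp falls in the window just above the chosen read time may have been committed before the read began on a node with a faster clock, so the read is retried at a higher timestamp.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, str]  # (key, commit hybrid time, value)


def read_with_restart(
    records: List[Record], ht_read: int, max_clock_skew: int
) -> Tuple[Dict[str, str], int, bool]:
    """Return (visible rows, effective read time, whether a restart happened)."""
    upper = ht_read + max_clock_skew
    # Records in (ht_read, upper] are ambiguous: the client might have seen them.
    ambiguous = [ht for _, ht, _ in records if ht_read < ht <= upper]
    restarted = False
    if ambiguous:
        # Restart once at the highest ambiguous timestamp; nothing ambiguous
        # remains above it, so this tablet needs no second restart.
        ht_read = max(ambiguous)
        restarted = True
    visible: Dict[str, str] = {}
    for key, ht, value in sorted(records, key=lambda r: r[1]):
        if ht <= ht_read:
            visible[key] = value
    return visible, ht_read, restarted


if __name__ == "__main__":
    rows = [("a", 100, "a1"), ("b", 103, "b1"), ("c", 120, "c1")]
    print(read_with_restart(rows, ht_read=101, max_clock_skew=5))
```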
Distributed Transactions – Read Path
Distributed Transactions – Read Path Step 1: Client request; pick ht_read
Distributed Transactions – Read Path Step 2: Read from tablet servers
Distributed Transactions – Read Path Step 3: Resolve txn status
Distributed Transactions – Read Path Step 4: Respond to YQL Engine
Distributed Transactions – Read Path Step 5: Respond to client
Distributed Transactions – Conflicts & Retries
• Every transaction is assigned a random priority
• In a conflict, the higher-priority transaction wins
o The restarted transaction gets a new random priority
o Probability of success quickly increases with retries
• Restarting a transaction is the same as starting a new one
• A read-write transaction can be subject to read-restart
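The priority rule can be sketched as follows. The `Txn` dataclass and the helper names are hypothetical, and priorities are drawn uniformly from [0, 1) purely for illustration.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Txn:
    name: str
    priority: float


def start(name: str, rng: random.Random) -> Txn:
    """Every transaction is assigned a random priority when it starts."""
    return Txn(name, rng.random())


def resolve(a: Txn, b: Txn) -> Tuple[Txn, Txn]:
    """Return (winner, loser): the higher-priority transaction wins."""
    return (a, b) if a.priority >= b.priority else (b, a)


def restart(loser: Txn, rng: random.Random) -> Txn:
    """Restarting is the same as starting a new one, with a fresh priority."""
    return start(loser.name, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    t1, t2 = start("t1", rng), start("t2", rng)
    winner, loser = resolve(t1, t2)
    retried = restart(loser, rng)
    print(winner.name, "wins;", retried.name, "retries with priority", retried.priority)
```

Because each retry draws an independent priority, a transaction that keeps losing becomes increasingly likely to win at least one of its attempts.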
EXERCISE #3 and #4
SHARDING AND SCALE OUT
FAULT TOLERANCE
Questions?
Try it at
docs.yugabyte.com/latest/quick-start

Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
BradBedford3
 
Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Dat...
Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Dat...Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Dat...
Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Dat...
Alluxio, Inc.
 
OpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native BarcelonaOpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native Barcelona
Imma Valls Bernaus
 
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps CyclesFrom Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
Marjukka Niinioja
 
Marketo & Dynamics can be Most Excellent to Each Other – The Sequel
Marketo & Dynamics can be Most Excellent to Each Other – The SequelMarketo & Dynamics can be Most Excellent to Each Other – The Sequel
Marketo & Dynamics can be Most Excellent to Each Other – The Sequel
BradBedford3
 
Code and No-Code Journeys: The Coverage Overlook
Code and No-Code Journeys: The Coverage OverlookCode and No-Code Journeys: The Coverage Overlook
Code and No-Code Journeys: The Coverage Overlook
Applitools
 
Software Testing & it’s types (DevOps)
Software  Testing & it’s  types (DevOps)Software  Testing & it’s  types (DevOps)
Software Testing & it’s types (DevOps)
S Pranav (Deepu)
 
Essentials of Resource Planning in a Downturn
Essentials of Resource Planning in a DownturnEssentials of Resource Planning in a Downturn
Essentials of Resource Planning in a Downturn
OnePlan Solutions
 
Top 11 Fleet Management Software Providers in 2025 (2).pdf
Top 11 Fleet Management Software Providers in 2025 (2).pdfTop 11 Fleet Management Software Providers in 2025 (2).pdf
Top 11 Fleet Management Software Providers in 2025 (2).pdf
Trackobit
 
Generative Artificial Intelligence and its Applications
Generative Artificial Intelligence and its ApplicationsGenerative Artificial Intelligence and its Applications
Generative Artificial Intelligence and its Applications
SandeepKS52
 
AI and Deep Learning with NVIDIA Technologies
AI and Deep Learning with NVIDIA TechnologiesAI and Deep Learning with NVIDIA Technologies
AI and Deep Learning with NVIDIA Technologies
SandeepKS52
 
Best Inbound Call Tracking Software for Small Businesses
Best Inbound Call Tracking Software for Small BusinessesBest Inbound Call Tracking Software for Small Businesses
Best Inbound Call Tracking Software for Small Businesses
TheTelephony
 
Providing Better Biodiversity Through Better Data
Providing Better Biodiversity Through Better DataProviding Better Biodiversity Through Better Data
Providing Better Biodiversity Through Better Data
Safe Software
 
Porting Qt 5 QML Modules to Qt 6 Webinar
Porting Qt 5 QML Modules to Qt 6 WebinarPorting Qt 5 QML Modules to Qt 6 Webinar
Porting Qt 5 QML Modules to Qt 6 Webinar
ICS
 
14 Years of Developing nCine - An Open Source 2D Game Framework
14 Years of Developing nCine - An Open Source 2D Game Framework14 Years of Developing nCine - An Open Source 2D Game Framework
14 Years of Developing nCine - An Open Source 2D Game Framework
Angelo Theodorou
 
dp-700 exam questions sample docume .pdf
dp-700 exam questions sample docume .pdfdp-700 exam questions sample docume .pdf
dp-700 exam questions sample docume .pdf
pravkumarbiz
 
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink TemplateeeeeeeeeeeeeeeeeeeeeeeeeeNeuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
alexandernoetzold
 
Software Engineering Process, Notation & Tools Introduction - Part 3
Software Engineering Process, Notation & Tools Introduction - Part 3Software Engineering Process, Notation & Tools Introduction - Part 3
Software Engineering Process, Notation & Tools Introduction - Part 3
Gaurav Sharma
 
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...
Insurance Tech Services
 
Bonk coin airdrop_ Everything You Need to Know.pdf
Bonk coin airdrop_ Everything You Need to Know.pdfBonk coin airdrop_ Everything You Need to Know.pdf
Bonk coin airdrop_ Everything You Need to Know.pdf
Herond Labs
 
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
Maintaining + Optimizing Database Health: Vendors, Orchestrations, Enrichment...
BradBedford3
 
Ad

How YugaByte DB Implements Distributed PostgreSQL

  • 1. Distributed PostgreSQL with YugaByte DB. Karthik Ranganathan, PostgresConf Silicon Valley, Oct 16, 2018. © 2018 All rights reserved.
  • 2. CHECK OUT THIS REPO: github.com/YugaByte/yb-sql-workshop
  • 3. About Us. Founders: Kannan Muthukkaruppan, CEO (Nutanix ♦ Facebook ♦ Oracle; IIT-Madras, University of California-Berkeley); Karthik Ranganathan, CTO (Nutanix ♦ Facebook ♦ Microsoft; IIT-Madras, University of Texas-Austin); Mikhail Bautin, Software Architect (ClearStory Data ♦ Facebook ♦ D.E.Shaw; Nizhny Novgorod State University, Stony Brook) • Founded Feb 2016 • Apache HBase committers and early engineers on Apache Cassandra • Built Facebook’s NoSQL platform powered by Apache HBase • Scaled the platform to serve many mission-critical use cases: Facebook Messages (Messenger) and Operational Data Store (time series data) • Reassembled the same Facebook team at YugaByte along with engineers from Oracle, Google, Nutanix and LinkedIn
  • 4. WORKSHOP AGENDA • What is YugaByte DB? Why Another DB? • Exercise 1: BI Tools on YugaByte PostgreSQL • Exercise 2: Distributed PostgreSQL Architecture • Exercise 3: Sharding and Scale Out in Action • Exercise 4: Fault Tolerance in Action
  • 5. WHAT IS YUGABYTE DB?
  • 6. A transactional, planet-scale database for building high-performance cloud services.
  • 7. NoSQL + SQL, Cloud Native
  • 8. WHY ANOTHER DB?
  • 9. Typical Stack Today: fragile infra with several moving parts. Diagram labels: Datacenter 1, Datacenter 2, SQL Master, SQL Slave, Application Tier (Stateless Microservices) • SQL for OLTP data: manual sharding (cost: dev team); manual replication and manual failover (cost: ops team) • NoSQL for other data: app aware of data silo (cost: dev team) • Cache for low latency: app does caching (cost: dev team) • Data inconsistency/loss: fragile infra, hours of debugging (cost: dev + ops team)
  • 10. Does AWS change this? Diagram labels: Datacenter 1, Datacenter 2, SQL Master, SQL Slave, Elasticache, Aurora, DynamoDB, Application Tier (Stateless Microservices). Still complex: it’s the same architecture.
  • 11. System-of-Record DBs for Global Apps (comparison figure; labels: Not Portable, Open Source; High Performance, Transactional, Planet-Scale)
  • 12. Design Principles • TRANSACTIONAL: Single Shard & Distributed ACID Txns; Document-Based, Strongly Consistent Storage • HIGH PERFORMANCE: Low Latency, Tunable Reads; High Throughput • PLANET-SCALE: Auto Sharding & Rebalancing; Global Data Distribution • CLOUD NATIVE: Built For The Container Era; Self-Healing, Fault-Tolerant • OPEN SOURCE: Apache 2.0; Popular APIs Extended: Apache Cassandra, Redis and PostgreSQL (BETA)
  • 13. EXERCISE #1: BUSINESS INTELLIGENCE
  • 14. EXERCISE #2: DISTRIBUTED POSTGRES ARCHITECTURE
  • 15. ARCHITECTURE: Overview
  • 16. YugaByte DB Process Overview • Universe = cluster of nodes • Two sets of processes: YB-Master & YB-TServer • Example universe: 4 nodes, RF=3
  • 17. Sharding data • User table split into tablets
  • 18. One tablet for every key
  • 19. Tablets and replication • Tablet = set of tablet-peers in a Raft group • Number of tablet-peers in a tablet = replication factor (RF) o Tolerate 1 failure: RF=3 o Tolerate 2 failures: RF=5
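
As a rough illustration of slides 17 to 19 (every key maps to exactly one tablet, and the replication factor determines how many peer failures a tablet survives), here is a minimal Python sketch. The hash function, the tablet count, and the names `tablet_for_key` and `tolerated_failures` are assumptions made for illustration; they are not taken from the talk and do not describe YugaByte DB's actual partitioning code.

```python
import hashlib

NUM_TABLETS = 16  # hypothetical tablet count for one table


def tablet_for_key(key: str, num_tablets: int = NUM_TABLETS) -> int:
    """Map a primary key to exactly one tablet by hashing it (illustrative scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Treat the first two bytes as a 16-bit hash and split that
    # hash space evenly among the tablets.
    hash16 = int.from_bytes(digest[:2], "big")
    return hash16 * num_tablets // 65536


def tolerated_failures(replication_factor: int) -> int:
    """A Raft group stays available while a majority of its peers is alive."""
    return (replication_factor - 1) // 2
```

With this rule, `tolerated_failures(3)` is 1 and `tolerated_failures(5)` is 2, matching the RF=3 and RF=5 figures on slide 19.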
  • 20. YB-TServer • Process that does IO • Hosts tablets for tables • Hosts transaction manager • Auto memory sizing: block cache, memstores
  • 21. YB-Master • Not in critical path • System metadata store: keyspaces, tables, tablets; users/roles, permissions • Admin operations: create/alter/drop of tables; backups; load balancing (leader and data balancing); enforces data placement policy
  • 22. HANDLING DDL STATEMENTS
  • 23. DDL Statements in PostgreSQL. Diagram: DDL → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor → DISK (create table data file, update system tables)
  • 24. DDL Statements in YugaByte DB PostgreSQL. Diagram: DDL → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor → create sharded, replicated table as data source; store table metadata in YB-Master (in the works): YugaByte master1, YugaByte master2, YugaByte master3, …
  • 25. YugaByte Query Layer (YQL) • Stateless, runs in each YB-TServer process • GA goal: distributed stateless PostgreSQL layer • Current beta uses a single stateless PostgreSQL layer
  • 26. HANDLING DML QUERIES
  • 27. DML Queries in PostgreSQL. Diagram: QUERY → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor, then either the local table code path (WAL Writer, BG Writer, … → DISK) or FDW → EXTERNAL DATABASE
  • 28. DML Queries in YugaByte DB PostgreSQL: using FDW as a table storage API. Diagram: DML → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor → FDW (YugaByte DB code path) → YB Gateway → EXTERNAL DATABASE: YugaByte node1, YugaByte node2, YugaByte node3, YugaByte node4, …
  • 29. ARCHITECTURE: Data Persistence
  • 30. Data Persistence in DocDB • DocDB is YugaByte DB’s LSM storage engine • Persistent key to document store • Extends and enhances RocksDB • Designed to support high data-densities per node
  • 31. DocDB: Key-to-Document Store • Document key: CQL/SQL/Redis primary key • Document value: a CQL or SQL row, or a Redis data structure • Fine-grained reads and writes
  • 32. DocDB Data Format (figure labels: Example Insert, Encoding)
  • 33. Some of the RocksDB enhancements • WAL and MVCC enhancements o Removed RocksDB WAL, re-uses Raft log o MVCC at a higher layer o Coordinate RocksDB memstore flushing and Raft log garbage collection • File format changes o Sharded (multi-level) indexes and Bloom filters • Splitting data blocks & metadata into separate files for tiering support • Separate queues for large and small compactions
  • 34. More Enhancements to RocksDB • Data model aware Bloom filters • Per-SSTable key range metadata to optimize range queries • Server-global block caches & memstore limits • Scan-resistant block cache (single-touch and multi-touch)
  • 35. ARCHITECTURE: Data Replication
  • 36. Raft Replication for Consistency
  • 37. How Raft Replication Works
  • 38. How Raft Replication Works
  • 39. How Raft Replication Works
  • 40. How Raft Replication Works
  • 41. Raft Related Enhancements • Leader Leases • Multiple Raft groups (1 per tablet) • Leader Balancing • Group Commits • Observer Nodes / Read Replicas
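
The Raft replication slides (36 to 40) survive in this transcript only as titles. As a reminder of the core rule they rely on, the sketch below computes the highest log index that is stored on a majority of a tablet's peers, which is the point up to which a Raft leader may treat entries as committed. The function name `commit_index` and the list-of-integers input are assumptions for illustration, and the sketch leaves out Raft's additional requirement that the entry belong to the leader's current term.

```python
def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of peers (leader included).

    match_index holds, for every peer in the Raft group, the index of the
    last log entry known to be replicated on that peer.
    """
    ordered = sorted(match_index, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is present on at least `majority` peers.
    return ordered[majority - 1]
```

For an RF=3 tablet whose peers have replicated up to indexes 5, 3 and 2, the function returns 3: entry 3 is on two of the three peers, while entries 4 and 5 exist only on the leader.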
  • 42. ARCHITECTURE: Transactions
  • 43. Single Shard Transactions. Example: INSERT INTO table (k, v) VALUES ('k1', 'v1'). Steps: 1. Lock Manager (in memory, on leader only): acquire a lock on the key 2. DocDB / RocksDB: read the current value of the key 3. Submit a Raft operation for replication: Insert (k1, v1) at hybrid_time 100 4. Raft consensus protocol: append to the Raft log and replicate to a majority of tablet peers (tablet followers) 5. Apply to RocksDB (k1, v1 @ht=100) and release the lock
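
The five steps on slide 43 can be sketched as a toy tablet leader in Python. Everything here (the classes `Follower` and `TabletLeader`, the in-memory dictionaries standing in for DocDB/RocksDB, and the integer hybrid time) is a hypothetical simplification for illustration, not YugaByte DB code.

```python
import threading


class Follower:
    """Hypothetical tablet follower that appends entries to its copy of the Raft log."""

    def __init__(self, alive: bool = True):
        self.alive = alive
        self.log = []

    def append(self, entry) -> bool:
        if self.alive:
            self.log.append(entry)
        return self.alive


class TabletLeader:
    def __init__(self, followers):
        self.followers = followers
        self.locks = {}        # in-memory lock manager, on the leader only
        self.raft_log = []
        self.store = {}        # stands in for DocDB / RocksDB: key -> (value, hybrid_time)
        self.hybrid_time = 99  # so that the first write lands at hybrid_time 100

    def insert(self, key, value) -> int:
        lock = self.locks.setdefault(key, threading.Lock())
        with lock:                                    # 1. acquire lock (released on exit)
            if key in self.store:                     # 2. read the current value
                raise KeyError(f"duplicate key {key!r}")
            self.hybrid_time += 1
            entry = (key, value, self.hybrid_time)    # 3. submit a Raft operation
            self.raft_log.append(entry)
            acks = 1 + sum(f.append(entry) for f in self.followers)  # 4. replicate
            if acks <= (len(self.followers) + 1) // 2:
                # The entry stays in the log but is never applied in this sketch.
                raise RuntimeError("no majority of tablet peers acknowledged")
            self.store[key] = (value, self.hybrid_time)  # 5. apply
        return self.hybrid_time
```

With one of two followers down, `TabletLeader([Follower(), Follower(alive=False)]).insert("k1", "v1")` still succeeds, because the leader plus one follower form a majority of three peers.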
  • 44. MVCC for Lockless Reads • Achieved through HybridTime (HT), a monotonically increasing timestamp • Allows reads at a particular HT without locking • Multiple versions may exist temporarily; older values are reclaimed during compactions
  • 45. Single Shard Transactions • Each tablet maintains a “safe time” for reads o Highest timestamp such that the view as of that timestamp is fixed o In the common case it is just before the hybrid time of the next uncommitted record in the tablet
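
To make the MVCC idea on slide 44 concrete, the following sketch keeps every version of one key ordered by hybrid time, serves a read at a chosen hybrid time without taking any lock, and drops unreachable versions in a compaction step. The class name `VersionedValue`, the integer timestamps, and the `history_cutoff` parameter are assumptions for illustration only.

```python
import bisect


class VersionedValue:
    """All versions of one key, kept sorted by hybrid time."""

    def __init__(self):
        self.times = []
        self.values = []

    def write(self, hybrid_time: int, value) -> None:
        pos = bisect.bisect_right(self.times, hybrid_time)
        self.times.insert(pos, hybrid_time)
        self.values.insert(pos, value)

    def read(self, read_time: int):
        """Return the latest version with hybrid time <= read_time, or None."""
        pos = bisect.bisect_right(self.times, read_time)
        return self.values[pos - 1] if pos else None

    def compact(self, history_cutoff: int) -> None:
        """Drop versions that no read at or after history_cutoff can observe."""
        pos = bisect.bisect_right(self.times, history_cutoff)
        keep_from = max(pos - 1, 0)
        del self.times[:keep_from]
        del self.values[:keep_from]
```

A read at hybrid time 105 against versions written at 100, 110 and 120 sees the value from 100, no matter how many later writes arrive, which is why readers never need to block writers.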
  • 46. Distributed Transactions • Fully decentralized architecture • Every tablet server can act as a Transaction Manager • A distributed Transaction Status table tracks the state of active transactions • Transactions can have 3 states: pending, committed, aborted
  • 47. Distributed Transactions – Write Path
  • 48. Distributed Transactions – Write Path. Step 1: Client request
  • 49. Distributed Transactions – Write Path. Step 2: Create status record
  • 50. Distributed Transactions – Write Path. Step 2: Create status record
  • 51. Distributed Transactions – Write Path. Step 3: Write provisional records
  • 52. Distributed Transactions – Write Path. Step 4: Atomic commit
  • 53. Distributed Transactions – Write Path. Step 5: Respond to client
  • 54. Distributed Transactions – Write Path. Step 6: Apply provisional records
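
The write-path steps on slides 48 to 54 can be sketched in a few lines of Python. This is a single-process toy under loud assumptions: the `Cluster` class, its dictionaries, and the rule that any conflicting provisional record simply aborts the newer transaction are all invented for illustration (the talk resolves conflicts by priority, and real tablets are replicated Raft groups).

```python
import uuid


class Cluster:
    def __init__(self, tablet_ids):
        self.status = {}                                 # transaction status table
        self.committed = {t: {} for t in tablet_ids}     # regular records per tablet
        self.provisional = {t: {} for t in tablet_ids}   # key -> (txn_id, value)

    def write_transaction(self, writes):
        """writes: iterable of (tablet_id, key, value). Returns (txn_id, final state)."""
        txn_id = uuid.uuid4().hex
        self.status[txn_id] = "pending"                  # step 2: create status record
        written = set()
        for tablet_id, key, value in writes:             # step 3: provisional records
            owner = self.provisional[tablet_id].get(key)
            if owner is not None and owner[0] != txn_id:
                # Write-write conflict: abort and clean up our provisional records.
                self.status[txn_id] = "aborted"
                for t, k in written:
                    del self.provisional[t][k]
                return txn_id, "aborted"
            self.provisional[tablet_id][key] = (txn_id, value)
            written.add((tablet_id, key))
        self.status[txn_id] = "committed"                # step 4: one atomic status update
        result = (txn_id, "committed")                   # step 5: client could be answered here
        for t, k in written:                             # step 6: apply provisional records
            _, value = self.provisional[t].pop(k)
            self.committed[t][k] = value
        return result
```

The point of the design is visible in step 4: a transaction that touches several tablets still commits through a single update to its status record, and the provisional records are turned into regular records afterwards.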
  • 55. Isolation Levels • Currently Snapshot Isolation is supported o Write-write conflicts detected when writing provisional records • Serializable isolation (roadmap) o Reads in RW txns also need provisional records • Read-only transactions are always lock-free
  • 56. Clock Skew and Read Restarts • Need to ensure the read timestamp is high enough o Committed records the client might have seen must be visible • Optimistically use current Hybrid Time, re-read if necessary o Reads are restarted if a record with a higher timestamp that the client could have seen is encountered o Read restart happens at most once per tablet o Relying on bounded clock skew (NTP, AWS Time Sync) • Only affects multi-row reads of frequently updated records
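
One way to picture the read-restart rule on slide 56 is the sketch below. It assumes, purely for illustration, that a record is "one the client could have seen" when its hybrid time falls between the chosen read time and the read time plus the maximum clock skew; the function name `read_with_restart` and its parameters are hypothetical and the real decision logic may differ.

```python
def read_with_restart(versions, read_time: int, max_clock_skew: int):
    """versions: list of (hybrid_time, value) pairs for one key, in any order.

    Returns (value, effective_read_time, restarted).
    """
    # The upper bound is fixed when the read starts, so one restart is enough.
    limit = read_time + max_clock_skew
    ambiguous = [t for t, _ in versions if read_time < t <= limit]
    restarted = bool(ambiguous)
    if restarted:
        # Re-read at a timestamp high enough to include the ambiguous records.
        read_time = max(ambiguous)
    visible = [(t, v) for t, v in versions if t <= read_time]
    value = max(visible, key=lambda tv: tv[0])[1] if visible else None
    return value, read_time, restarted
```

Records written well after the skew window (beyond `limit`) are never ambiguous, which is consistent with the slide's remark that restarts mainly affect reads of frequently updated records.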
  • 57. Distributed Transactions – Read Path
  • 58. Distributed Transactions – Read Path. Step 1: Client request; pick ht_read
  • 59. Distributed Transactions – Read Path. Step 2: Read from tablet servers
  • 60. Distributed Transactions – Read Path. Step 3: Resolve txn status
  • 61. Distributed Transactions – Read Path. Step 4: Respond to YQL Engine
  • 62. Distributed Transactions – Read Path. Step 5: Respond to client
  • 63. Distributed Transactions – Conflicts & Retries • Every transaction is assigned a random priority • In a conflict, the higher-priority transaction wins o The restarted transaction gets a new random priority o Probability of success quickly increases with retries • Restarting a transaction is the same as starting a new one • A read-write transaction can be subject to read-restart
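
The priority rule on slide 63 can be sketched as follows. The helper names `resolve_conflict` and `run_with_retries`, the tie-breaking choice, and the retry limit are assumptions for illustration; the slide only states that priorities are random, that the higher priority wins, and that a restarted transaction draws a new priority.

```python
import random


def resolve_conflict(priority_a: float, priority_b: float) -> str:
    """Return which of two conflicting transactions wins (ties go to "a" here)."""
    return "a" if priority_a >= priority_b else "b"


def run_with_retries(attempt, max_retries: int = 10, rng=random) -> int:
    """Run attempt(priority) until it returns True; return the number of tries.

    Each restart is treated as a brand-new transaction with a fresh random priority.
    """
    for tries in range(1, max_retries + 1):
        priority = rng.random()
        if attempt(priority):
            return tries
    raise RuntimeError("transaction did not commit within the retry limit")
```

Because every retry draws an independent priority, a transaction that lost once is not permanently disadvantaged against the transaction that beat it.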
  • 64. EXERCISE #3 and #4: SHARDING AND SCALE OUT, FAULT TOLERANCE
  • 65. Questions? Try it at docs.yugabyte.com/latest/quick-start