MyRocks deployment at Facebook
and Roadmaps
Yoshinori Matsunobu
Production Engineer, Facebook
Feb/2018
Agenda
▪ MySQL at Facebook
▪ MyRocks overview
▪ Production Deployment
▪ MyRocks configuration and monitoring
▪ Future Plans
MySQL “User Database (UDB)” at Facebook
▪ Storing Social Graph
▪ Massively Sharded
▪ Low latency
▪ Automated Operations
▪ Pure Flash Storage (Constrained by space, not by CPU/IOPS)
What is MyRocks
▪ MySQL on top of RocksDB (RocksDB storage engine)
▪ Open source; also distributed by MariaDB and Percona
[Diagram: MySQL clients connect through the SQL/Connector layer to MySQL (parser, optimizer, replication, etc.), which runs on top of the InnoDB and RocksDB storage engines]
http://myrocks.io/
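MyRocks Initial Goal at Facebook
[Figure: bar charts of Space, IO and CPU usage against the machine limit. InnoDB in main database: space 90%, remaining bars 15% and 20%. MyRocks in main database: space 45%, remaining bars 15% and 21%, with a second set of bars (21%, 15%, 45%) stacked under the same machine limit]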
MyRocks features
▪ Clustered Index (same as InnoDB)
▪ Bloom Filter and Column Family
▪ Transactions, including consistency between binlog and RocksDB
▪ Faster data loading, deletes and replication
▪ Dynamic Options
▪ TTL
▪ Online logical and binary backup
MyRocks vs InnoDB
▪ MyRocks pros
▪ Much smaller space (half compared to compressed InnoDB)
▪ Gives better cache hit rate
▪ Writes are faster = Faster Replication
▪ Much smaller bytes written
▪ MyRocks cons (improvements in progress)
▪ Lack of several features
▪ No support for SBR (statement-based replication), gap locks, foreign keys, fulltext indexes or spatial indexes. A case-sensitive collation is needed for performance
▪ Reads are slower, especially if your data fits in memory
▪ More dependent on filesystem and OS. Lack of solid direct i/o. Must use newer 4.6 kernel
MyRocks migration -- Technical Challenges
▪ Initial Migration
▪ Creating MyRocks instances without downtime
▪ Loading into MyRocks tables within reasonable time
▪ Verifying data consistency between InnoDB and MyRocks
▪ Continuous Monitoring
▪ Resource Usage like space, iops, cpu and memory
▪ Query plan outliers
▪ Stalls and crashes
MyRocks migration -- Technical Challenges (2)
▪ When running MyRocks on master
▪ RBR (Row based binary logging)
▪ Removing queries relying on InnoDB Gap Lock
Creating first MyRocks instance without downtime
▪ Picking one of the InnoDB slave instances, then starting logical dump
and restore
▪ Stopping one slave does not affect services
Master (InnoDB)
Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (InnoDB) Slave4 (MyRocks)
Stop & Dump & Load
Faster Data Loading
Normal Write Path in MyRocks/RocksDB
Write Requests → MemTable + WAL → (Flush) → Level 0 SST → (Compaction) → Level 1 SST → … → (Compaction) → Level max SST
Faster Write Path
Write Requests → Level max SST
“SET SESSION rocksdb_bulk_load=1;”
Original data must be sorted by primary key
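Since the faster write path goes straight to the bottom-level SST files, the input order matters. The following is a minimal Python sketch of a pre-flight check, assuming the dump can be read as a sequence of primary key tuples. The function name and sample keys are hypothetical and not part of MyRocks, and Python tuple ordering only approximates the engine's own key comparison (it ignores collations, for instance).

```python
from typing import Iterable, Tuple


def is_sorted_by_primary_key(keys: Iterable[Tuple]) -> bool:
    """Return True if the primary keys arrive in strictly ascending order."""
    previous = None
    for key in keys:
        # A repeated or smaller key means the input is not usable for bulk load.
        if previous is not None and key <= previous:
            return False
        previous = key
    return True


# Hypothetical composite primary keys (id1, id2).
sorted_keys = [(1, 10), (1, 20), (2, 5)]
unsorted_keys = [(1, 10), (2, 5), (1, 20)]

print(is_sorted_by_primary_key(sorted_keys))
print(is_sorted_by_primary_key(unsorted_keys))
```

A check like this is cheap compared with a failed multi-hour load, which is why it is sketched as a streaming pass rather than a sort.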
Data migration steps
▪ Dst) Create table … ENGINE=ROCKSDB; (creating MyRocks tables with proper column families)
▪ Dst) ALTER TABLE DROP INDEX; (dropping secondary keys)
▪ Src) STOP SLAVE;
▪ mysqldump --host=innodb-host --order-by-primary --rocksdb-bulk-load | mysql --host=myrocks-host --init-command="set sql_log_bin=0"
▪ Dst) ALTER TABLE ADD INDEX; (adding secondary keys)
▪ Src, Dst) START SLAVE;
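The dump-and-load step above can be scripted. This is a minimal Python sketch that only assembles the shell pipeline as a string; it does not run anything. The flags are copied from the slide (`--rocksdb-bulk-load` is assumed to be available in the mysqldump build being used), while the function name and host names are placeholders.

```python
import shlex


def build_dump_and_load_command(src_host: str, dst_host: str) -> str:
    """Build the 'mysqldump | mysql' pipeline from the migration steps."""
    dump = [
        "mysqldump",
        f"--host={src_host}",
        "--order-by-primary",      # bulk load needs rows sorted by primary key
        "--rocksdb-bulk-load",     # flag as written on the slide
    ]
    load = [
        "mysql",
        f"--host={dst_host}",
        "--init-command=set sql_log_bin=0",  # keep the load out of the binlog
    ]
    # shlex.quote protects arguments that contain spaces.
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in (dump, load)
    )


command = build_dump_and_load_command("innodb-host", "myrocks-host")
print(command)
```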
Data Verification
▪ MyRocks/RocksDB is relatively new database technology
▪ Might have more bugs than robust InnoDB
▪ Ensuring data consistency helps avoid showing conflicting
results
Verification tests
▪ Index count check between primary key and secondary keys
▪ If any index is broken, it can be detected
▪ SELECT 'PRIMARY', COUNT(*) FROM t FORCE INDEX (PRIMARY)
UNION SELECT 'idx1', COUNT(*) FROM t FORCE INDEX (idx1)
▪ Can’t be used if there is no secondary key
▪ Index stats check
▪ Checking that “rows” in SHOW TABLE STATUS is not far from the actual row count
▪ Checksum tests w/ InnoDB
▪ Comparing between InnoDB instance and MyRocks instance
▪ Creating a transaction consistent snapshot at the same GTID position, scan, then compare
checksum
▪ Shadow correctness check
▪ Capturing read traffic
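The index count check above generalizes to any number of secondary keys. Here is a minimal Python sketch, assuming the counts have already been fetched into a dictionary. Both function names are hypothetical, and identifiers are interpolated without escaping, so this is only suitable for trusted table and index names.

```python
from typing import Dict, List


def build_index_count_query(table: str, indexes: List[str]) -> str:
    """Build one COUNT(*) per index, forcing each index in turn."""
    parts = [
        f"SELECT '{name}', COUNT(*) FROM {table} FORCE INDEX ({name})"
        for name in indexes
    ]
    return " UNION ".join(parts)


def find_inconsistent_indexes(counts: Dict[str, int]) -> List[str]:
    """Return the indexes whose row count differs from the primary key."""
    expected = counts["PRIMARY"]
    return sorted(name for name, count in counts.items() if count != expected)


print(build_index_count_query("t", ["PRIMARY", "idx1"]))
# Hypothetical counts, as if returned by the query above.
print(find_inconsistent_indexes({"PRIMARY": 100, "idx1": 100, "idx2": 99}))
```

Comparing every secondary key against the primary key keeps the check self-contained on one instance, with no InnoDB copy needed.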
Shadow traffic tests
▪ We have a shadow test framework
▪ MySQL audit plugin to capture read/write queries from production instances
▪ Replaying them into shadow master instances
▪ Shadow master tests
▪ Client errors
▪ Rewriting queries relying on Gap Lock
▪ gap_lock_raise_error=1, gap_lock_write_log=1
Creating second MyRocks instance without downtime
Master (InnoDB)
Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (MyRocks) Slave4 (MyRocks)
myrocks_hotbackup
(Online binary backup)
Promoting MyRocks as a master
Master (MyRocks)
Slave1 (InnoDB) Slave2 (InnoDB) Slave3 (InnoDB) Slave4 (MyRocks)
Promoting MyRocks as a master
Master (MyRocks)
Slave1 (MyRocks) Slave2 (MyRocks) Slave3 (MyRocks) Slave4 (MyRocks)
Our MyRocks configurations
▪ Using the latest 4.6 Linux kernel that fixed many filesystem (XFS)
and memory allocation issues
▪ Building MySQL/MyRocks with jemalloc
▪ Using 16KB block size
▪ Five major column families, depending on data characteristics
▪ Using ZSTD compression in the bottommost level, LZ4 for the rest
▪ Not creating bloom filter in the bottommost level, to save space
and memory
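The slide does not show the actual option strings, so the following Python sketch is an assumption about how the compression, bloom filter and block size choices above might be expressed as a RocksDB column family option string (for example in `rocksdb_default_cf_options`). The option names are standard RocksDB options as far as I know; the values are illustrative only, not Facebook's production settings.

```python
def build_cf_options(block_size_bytes: int = 16 * 1024) -> str:
    """Assemble a column family option string (illustrative assumption)."""
    options = [
        "compression=kLZ4Compression",       # LZ4 for the upper levels
        "bottommost_compression=kZSTD",      # ZSTD for the bottommost level
        "optimize_filters_for_hits=true",    # skip bloom filters in the last level
        f"block_based_table_factory={{block_size={block_size_bytes}}}",
    ]
    return ";".join(options)


print("rocksdb_default_cf_options=" + build_cf_options())
```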
Monitoring
▪ Column Family Statistics
▪ Lock contentions
▪ Bloom filter hit rates
▪ Tombstones (delete-marker)
SHOW ENGINE ROCKSDB STATUS
▪ Column Family Statistics, including size, read and write amp per
level
▪ Memory usage
*************************** 7. row ***************************
Type: CF_COMPACTION
Name: default
Status:
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 2/0 51.58 0.5 0.0 0.0 0.0 0.3 0.3 0.0 0.0 0.0 40.3 7 10 0.669 0 0
L3 6/0 109.36 0.9 0.7 0.7 0.0 0.6 0.6 0.0 0.9 43.8 40.7 16 3 5.172 7494K 297K
L4 61/0 1247.31 1.0 2.0 0.3 1.7 2.0 0.2 0.0 6.9 49.7 48.5 41 9 4.593 15M 176K
L5 989/0 12592.86 1.0 2.0 0.3 1.8 1.9 0.1 0.0 7.4 8.1 7.4 258 8 32.209 17M 726K
L6 4271/0 127363.51 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
Sum 5329/0 141364.62 0.0 4.7 1.2 3.5 4.7 1.2 0.0 17.9 15.0 15.0 321 30 10.707 41M 1200K
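To make the per-level sizes easier to read, here is a small Python sketch that recomputes the total from the Size(MB) column above and reports how much of the data sits in the bottommost level (L6). The dictionary is typed in by hand from the sample output; nothing here parses real server output.

```python
# Size(MB) per level, copied from the compaction stats above.
level_size_mb = {
    "L0": 51.58,
    "L3": 109.36,
    "L4": 1247.31,
    "L5": 12592.86,
    "L6": 127363.51,
}

total_mb = sum(level_size_mb.values())
bottommost_share = level_size_mb["L6"] / total_mb

print(f"total size: {total_mb:.2f} MB")
print(f"share in bottommost level: {bottommost_share:.1%}")
```

With roughly nine tenths of the data in the last level, the choice of compression and bloom filter settings for that level dominates the space footprint.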
Monitoring Lock Contentions
▪ “Snapshot Conflict” errors with Repeatable Read isolation
▪ Because of implementation differences in MyRocks (PostgreSQL Style
snapshot isolation)
▪ We use tx-isolation=READ-COMMITTED to solve the issue
▪ MyRocks supports basic lock contention metrics
▪ Queries hit deadlock errors
▪ The number of deadlocks and lock wait timeouts
▪ The number of lock waits and lock wait time
SHOW GLOBAL STATUS
mysql> show global status like 'rocksdb%';
+---------------------------------------+-------------+
| Variable_name | Value |
+---------------------------------------+-------------+
| rocksdb_rows_deleted | 216223 |
| rocksdb_rows_inserted | 1318158 |
| rocksdb_rows_read | 7102838 |
| rocksdb_rows_updated | 1997116 |
....
| rocksdb_bloom_filter_prefix_checked | 773124 |
| rocksdb_bloom_filter_prefix_useful | 308445 |
| rocksdb_bloom_filter_useful | 10108448 |
....
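One way to turn these counters into a bloom filter hit rate is to divide the "useful" counter by the "checked" counter. This is a minimal Python sketch using the sample values above, assuming that "useful" counts the prefix checks where the filter ruled out a file. The function name is hypothetical.

```python
from typing import Dict


def prefix_bloom_useful_ratio(status: Dict[str, int]) -> float:
    """Fraction of prefix bloom filter checks that avoided further work."""
    checked = status["rocksdb_bloom_filter_prefix_checked"]
    useful = status["rocksdb_bloom_filter_prefix_useful"]
    if checked == 0:
        return 0.0  # no checks yet, so no meaningful ratio
    return useful / checked


# Counter values copied from the SHOW GLOBAL STATUS sample above.
sample_status = {
    "rocksdb_bloom_filter_prefix_checked": 773124,
    "rocksdb_bloom_filter_prefix_useful": 308445,
}

ratio = prefix_bloom_useful_ratio(sample_status)
print(f"prefix bloom filter useful ratio: {ratio:.1%}")
```

Since the counters are cumulative, a monitoring job would normally take the difference between two samples before computing the ratio.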
Lessons Learned
▪ Linux kernel VM allocation stalls
▪ We use buffered i/o for MyRocks/RocksDB
▪ Linux fixed the VM allocation stalls in the newer 4.6 kernel
▪ Don’t delete files too fast on Flash
▪ Avoid multi-second TRIM stalls
▪ We stopped mounting with discard option for binlog and WAL
▪ Hard to generate core files
▪ MyRocks + jemalloc uses too much VIRT size
Our current production status
We COMPLETED InnoDB to MyRocks migration
in UDB
We saved 50% space in UDB
compared to compressed InnoDB
We started working on migrating
other large database tiers
Development Roadmaps
▪ Helping MariaDB and Percona Server to release with stable MyRocks
▪ Matching read performance vs InnoDB
▪ https://smalldatum.blogspot.com
▪ Supporting Mixed Engines
▪ Better Replication
▪ Supporting Bigger Instance Size
Mixed Engines
▪ Currently our production instances are either “MyRocks only” or “InnoDB
only”
▪ There are several internal/external use cases that want to use InnoDB
and MyRocks within the same instance, though a single transaction does
not span engines
▪ Online logical/binary Backup support and benchmarks are concerns
▪ Current plan is extending xtrabackup to integrate myrocks_hotbackup
▪ Considering backporting gtid_pos_auto_engines from MariaDB
Better Replication
▪ Removing engine log
▪ Both internal and external benchmarks (e.g. Amazon, Alibaba) show that qps
improves significantly with binlog disabled
▪ Real Problem would be two logs – binlog and engine log, which requires 2pc
and ordered commits
▪ One Log - use one log as the source of truth for commits -- either binlog,
binlog-like service or RocksDB WAL
▪ We heavily rely on binlogs (for semisync, binlog consumers), TBD is how
much perf we gain by stopping writing to WAL
▪ Parallel replication apply
▪ Batching
▪ Skipping using transactions on slaves
Supporting Bigger Instance Size
▪ Problem Statement: A shared-nothing database is not a general-purpose database
▪ MySQL Cluster, Spider, Vitess
▪ Good if you have specific purposes. Might cause issues if people lack expertise in
atomic transactions, joins and secondary keys
▪ Suggestion: Now we have 256GB+ RAM and 10TB+ Flash on commodity
servers. Why not run one big instance and put everything there?
▪ Bigger instances may help general purpose small-mid applications
▪ They don’t have to worry about sharding. Atomic trans, joins and secondary keys just work
▪ e.g. Amazon Aurora (supporting up to 60TB instance) and Alibaba PolarDB (~100TB
instance)
Future Plans to support Bigger Instance (1)
▪ Parallel transactional mysqldump
▪ Parallel Query
▪ e.g. how to make mysqldump of a 20TB table finish within 24 hours?
▪ Parallel binary copy
▪ e.g. how quickly can we create a 60TB replica instance in a remote region?
▪ Parallel DDL, Parallel Loading
▪ Resumable DDL
▪ e.g. if the DDL is expected to take 10 days, what will happen if mysqld restarts
after 8 days?
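The 20TB-in-24-hours example above implies a sustained throughput target. A short worked calculation in Python, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a hypothetical helper name:

```python
def required_throughput_mb_per_s(size_tb: float, hours: float) -> float:
    """Sustained throughput needed to move size_tb within the given hours."""
    size_bytes = size_tb * 10**12   # decimal terabytes
    seconds = hours * 3600
    return size_bytes / seconds / 10**6


dump_rate = required_throughput_mb_per_s(20, 24)
print(f"20 TB in 24 hours needs about {dump_rate:.0f} MB/s sustained")
```

A rate in this range is hard to reach with a single dump thread, which is the motivation given for parallel dump and parallel query.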
Future Plans to support Bigger Instance (2)
▪ Better join algorithm
▪ Much faster replication
▪ Can handle 10x connection requests and queries
▪ Good resource control
▪ H/W perspective: Shared Storage and Elastic Computing Units
▪ Can scale read replicas from the same shared storage
Summary
▪ We finished deploying MyRocks in our production user
database (UDB)
▪ You can start deploying slaves, with consistency check
▪ We have added many status counters for instance monitoring
▪ More interesting features will come this year
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

M|18 How Facebook Migrated to MyRocks
