Building an Analytic Extension to MySQL with ClickHouse
Vadim Tkachenko (Percona) and Kanthi Subramanian (Altinity)
2 March 2023
Who we are
Vadim Tkachenko, CTO, Percona
Kanthi Subramanian, open source contributor / data engineer / developer advocate
MySQL Strengths
- OLTP (operational) database
- Handles up to 1 million transactions per second
- Thousands of concurrent transactions
MySQL is good for
1. ACID transactions.
2. Excellent concurrency.
3. Very fast point lookups and short transactions.
4. Excellent tooling for building OLTP applications.
It is very good for running interactive online properties:
- e-commerce
- online gaming
- social networks
Analytics with MySQL
- Practical only for small data sets.
- Aggregation queries (GROUP BY) can be problematically slow on 10 million+ rows.
In summary: analyzing data across millions of small transactions is not a good use case for MySQL.
Some examples (next slides):
Query comparison (MySQL/ClickHouse)
The number of flights delayed by more than 10 minutes, grouped by the day of the week, for 2000-2008:
SELECT DayOfWeek, count(*) AS c
FROM ontime_snapshot
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;
176 million rows to process.
MySQL: 573 seconds (about 9.5 minutes). ClickHouse: 0.5 seconds.
Query comparison (MySQL/ClickHouse)
The percentage of flights delayed by more than 10 minutes, per year:
SELECT Year, avg(DepDelay>10)*100
FROM ontime
GROUP BY Year
ORDER BY Year;
176 million rows to process.
MySQL: 240 seconds (4 minutes). ClickHouse: 0.674 seconds.
What accounts for such a difference?
MySQL's defining features:
- stores data in rows
- single-threaded query execution
- optimized for high concurrency
These are exactly the opposite of what is needed to run analytic queries that compute aggregates on large datasets.
ClickHouse is designed for analytic processing:
- stores data in columns
- has optimizations to minimize I/O
- computes aggregates very efficiently
- parallelizes query processing
Why choose ClickHouse as a complement to MySQL?
For the same flight-delay query, MySQL must read all columns in every row, while ClickHouse reads only the selected columns.
Signs that MySQL needs Analytic Help
MySQL, hypothetical query: read all columns, 59 GB (100%).
ClickHouse, the same query:
- Read 3 columns only: 1.7 GB (3%)
- Read 3 compressed columns: 21 MB (0.035%)
- Read 3 compressed columns over 8 threads: 2.6 MB (0.0044%) per thread
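These ratios follow from simple arithmetic. A quick Python sketch, using the slide's hypothetical figures (a 59 GB table, 3 referenced columns, 8 threads), reproduces them:

# Back-of-the-envelope check of the figures above, using the slide's
# hypothetical numbers; none of these sizes come from a real benchmark run.
table = 59 * 1024**3            # bytes MySQL must scan: every column of every row
three_cols = 1.7 * 1024**3      # ClickHouse reads only the 3 referenced columns
compressed = 21 * 1024**2       # the same 3 columns, compressed on disk
per_thread = compressed / 8     # split across 8 parallel threads

for label, b in [("all columns", table), ("3 columns", three_cols),
                 ("3 compressed columns", compressed), ("per thread", per_thread)]:
    print(f"{label:>22}: {b / 1024**2:9.1f} MB  ({b / table:.4%})")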
Why is MySQL a natural complement to ClickHouse?
MySQL:
- Transactional processing
- Fast single-row updates
- High concurrency: MySQL supports a large number of concurrent queries
ClickHouse:
- Does not support ACID transactions
- Updating a single row is problematic: ClickHouse has to read and rewrite a lot of data
- A single query can use a lot of resources, so it is not a good fit for highly concurrent access
Leveraging the Analytical Benefits of ClickHouse
● Identify the databases/tables in MySQL to be replicated
● Create the schemas/databases in ClickHouse
● Transfer the data from MySQL to ClickHouse
https://github.com/Altinity/clickhouse-sink-connector
Fully wired, continuous replication
OLTP app -> MySQL -> binlog -> Debezium -> Kafka* event stream -> Altinity Sink Connector -> ClickHouse (ReplacingMergeTree table engine) -> analytic app
An initial dump/load seeds ClickHouse before the event stream takes over.
*Including Pulsar and RedPanda
Replication Setup
1. Initial dump/load
2. Validate data
3. Set up CDC replication
1. Initial Dump/Load
Why do we need custom load/dump tools?
● Data type limits and data types are not the same in MySQL and ClickHouse: the maximum Date is 9999-12-31 in MySQL but 2299-12-31 in ClickHouse.
● Translate/read the MySQL schema and create the ClickHouse schema: identify the primary key and partitioning and translate them into the ORDER BY of a ClickHouse ReplacingMergeTree (RMT), as sketched below.
● Faster transfer: leverage existing MySQL and ClickHouse tools.
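To make the schema-translation step concrete, here is an illustrative Python sketch of the kind of type mapping involved. The mapping table is simplified and the helper name is hypothetical; the real logic lives in db_load/clickhouse_loader.py:

# Illustrative sketch of MySQL -> ClickHouse type translation; simplified,
# and not the loader's actual code.
MYSQL_TO_CH = {
    "date": "Date32", "datetime": "DateTime64(0)",
    "tinyint": "Int8", "smallint": "Int16", "mediumint": "Int32",
    "int": "Int32", "bigint": "Int64",
    "double": "Float64", "text": "String", "varchar": "String",
}

def translate_column(name: str, mysql_type: str, nullable: bool,
                     unsigned: bool = False) -> str:
    ch_type = MYSQL_TO_CH.get(mysql_type, "String")
    if unsigned and ch_type.startswith("Int"):
        ch_type = "U" + ch_type           # e.g. "int unsigned" -> UInt32
    if nullable:
        ch_type = f"Nullable({ch_type})"  # MySQL NULL -> ClickHouse Nullable
    return f"`{name}` {ch_type}"

# The MySQL PRIMARY KEY becomes the ReplacingMergeTree ORDER BY key:
print(translate_column("emp_no", "int", nullable=False))
print(translate_column("salary", "bigint", nullable=True, unsigned=True))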
1. Initial Dump/Load (MySQL Shell)
https://dev.mysql.com/blog-archive/mysql-shell-8-0-21-speeding-up-the-dump-process/
https://blogs.oracle.com/mysql/post/mysql-shell-dump-load-and-compression
1. Initial Dump/Load
MySQL Shell: multi-threaded, splits large tables into smaller chunks, compression, speeds up to 3 GB/s.
ClickHouse client: multi-threaded, reads compressed data.
1. Initial Dump/Load
Install mysql-shell and dump the source tables (JS mode):
mysqlsh -uroot -proot -hlocalhost -e "util.dumpTables('test', ['employees'], '/tmp/employees_12');" --verbose
Then load the dump into ClickHouse:
python db_load/clickhouse_loader.py --clickhouse_host localhost --clickhouse_database $DATABASE --dump_dir $HOME/dbdumps/$DATABASE --clickhouse_user root --clickhouse_password root --threads 4 --mysql_source_database $DATABASE --mysqlshell
1. Initial Dump/Load
ClickHouse schema (generated by clickhouse_loader, which adds the _sign and _version columns):
CREATE TABLE IF NOT EXISTS `employees_predated` (
`emp_no` int NOT NULL,
`birth_date` Date32 NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` Date32 NOT NULL,
`salary` bigint unsigned DEFAULT NULL,
`num_years` tinyint unsigned DEFAULT NULL,
`bonus` mediumint unsigned DEFAULT NULL,
`small_value` smallint unsigned DEFAULT NULL,
`int_value` int unsigned DEFAULT NULL,
`discount` bigint DEFAULT NULL,
`num_years_signed` tinyint DEFAULT NULL,
`bonus_signed` mediumint DEFAULT NULL,
`small_value_signed` smallint DEFAULT NULL,
`int_value_signed` int DEFAULT NULL,
`last_modified_date_time` DateTime64(0) DEFAULT NULL,
`last_access_time` String DEFAULT NULL,
`married_status` char(1) DEFAULT NULL,
`perDiemRate` decimal(30,12) DEFAULT NULL,
`hourlyRate` double DEFAULT NULL,
`jobDescription` text DEFAULT NULL,
`updated_time` String NULL ,
`bytes_date` longblob DEFAULT NULL,
`binary_test_column` varbinary(255) DEFAULT NULL,
`blob_med` mediumblob DEFAULT NULL,
`blob_new` blob DEFAULT NULL,
`_sign` Int8 DEFAULT 1,
`_version` UInt64 DEFAULT 0
) ENGINE = ReplacingMergeTree(_version) ORDER BY (`emp_no`)
SETTINGS index_granularity = 8192;
MySQL schema (original):
CREATE TABLE `employees_predated` (
`emp_no` int NOT NULL,
`birth_date` date NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` date NOT NULL,
`salary` bigint unsigned DEFAULT NULL,
`num_years` tinyint unsigned DEFAULT NULL,
`bonus` mediumint unsigned DEFAULT NULL,
`small_value` smallint unsigned DEFAULT NULL,
`int_value` int unsigned DEFAULT NULL,
`discount` bigint DEFAULT NULL,
`num_years_signed` tinyint DEFAULT NULL,
`bonus_signed` mediumint DEFAULT NULL,
`small_value_signed` smallint DEFAULT NULL,
`int_value_signed` int DEFAULT NULL,
`last_modified_date_time` datetime DEFAULT NULL,
`last_access_time` time DEFAULT NULL,
`married_status` char(1) DEFAULT NULL,
`perDiemRate` decimal(30,12) DEFAULT NULL,
`hourlyRate` double DEFAULT NULL,
`jobDescription` text,
`updated_time` timestamp NULL DEFAULT NULL,
`bytes_date` longblob,
`binary_test_column` varbinary(255) DEFAULT NULL,
`blob_med` mediumblob,
`blob_new` blob,
PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY RANGE (`emp_no`)
(PARTITION p1 VALUES LESS THAN (1000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
*/
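The extra _sign and _version columns on the ClickHouse side are what let the connector represent deletes and pick the newest row version. A minimal read-side sketch, assuming the clickhouse-driver Python package and a local server; the _sign filter assumes deletes are written with a negative sign, which is a convention rather than something shown in the DDL above:

# A minimal sketch, assuming clickhouse-driver and default credentials.
from clickhouse_driver import Client

client = Client(host="localhost", user="root", password="root")
rows = client.execute(
    # FINAL makes ReplacingMergeTree collapse duplicates by _version at read time.
    "SELECT emp_no, first_name, last_name "
    "FROM employees_predated FINAL "
    "WHERE _sign > 0 "          # assumed convention: deleted rows carry _sign = -1
    "ORDER BY emp_no LIMIT 10"
)
print(rows)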
2. Validate Data
Why is a basic count check not enough?
● It is essential to validate the values themselves, for example decimal/floating-point precision and data type limits.
● Data types are different between MySQL and ClickHouse.
Solution: an MD5 checksum of the column data (courtesy: Sisense):
1. Take the MD5 of each column. Use a space for NULL values.
2. Concatenate those results, and MD5 this result.
3. Split into four 8-character hex strings.
4. Convert into 32-bit integers and sum.
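The four steps translate directly into Python. A minimal sketch, assuming rows arrive as tuples with identical column order and identically normalized value formatting (dates, decimals) on both sides, which is what the real scripts take care of:

# A minimal sketch of the four checksum steps above.
import hashlib

def row_checksum(row):
    # 1. MD5 each column value; use a single space for NULLs.
    col_md5s = [hashlib.md5((" " if v is None else str(v)).encode()).hexdigest()
                for v in row]
    # 2. Concatenate the per-column digests and MD5 the result.
    digest = hashlib.md5("".join(col_md5s).encode()).hexdigest()
    # 3. Split the 32-char hex digest into four 8-char strings.
    # 4. Convert each to a 32-bit integer and sum.
    return sum(int(digest[i:i + 8], 16) for i in range(0, 32, 8))

def table_checksum(rows):
    # Summing per-row values makes the total independent of scan order,
    # so MySQL and ClickHouse can return rows in any order.
    return sum(row_checksum(r) for r in rows)

print(table_checksum([(1, "Fluffy", None), (2, "Claws", "cat")]))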
Run the bundled checksum scripts on both sides, then diff the outputs:
python db_compare/mysql_table_checksum.py --mysql_host localhost --mysql_user root --mysql_password root --mysql_database menagerie --tables_regex "^pet" --debug_output
python db_compare/clickhouse_table_checksum.py --clickhouse_host localhost --clickhouse_user root --clickhouse_password root --clickhouse_database menagerie --tables_regex "^pet" --debug_output
diff out.pet.ch.txt out.pet.mysql.txt | grep -E "<|>"
Credits: Arnaud
3. Setup CDC Replication
MySQL (binlog file: mysql.bin.00001, binlog position: 100002, or GTID: 1233:223232323) -> Debezium -> Kafka* event stream -> Altinity Sink Connector -> ClickHouse
Set up Debezium to start from the binlog file/position or GTID:
https://github.com/Altinity/clickhouse-sink-connector/blob/develop/doc/debezium_setup.md
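For reference, a hedged sketch of registering the Debezium MySQL source through the Kafka Connect REST API. Host names, ports, and server ids are placeholders, the property names follow Debezium 2.x, and seeding a specific binlog file/position or GTID before first start is covered in the linked document:

# Hypothetical hosts/ids; property set varies by Debezium version.
import requests

connector = {
    "name": "mysql-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "root",
        "database.password": "root",
        "database.server.id": "5432",
        "topic.prefix": "SERVER5432",
        "database.include.list": "test",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.test",
    },
}

# Register the connector with a Kafka Connect worker.
resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())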
Final step - Deploy
● Docker Compose (Debezium Strimzi, Sink Strimzi): https://hub.docker.com/repository/docker/altinity/clickhouse-sink-connector
● Kubernetes (Docker images)
● JAR file
Simplified Architecture
MySQL (binlog file: mysql.bin.00001, binlog position: 100002, or GTID: 1233:223232323) -> Debezium + Altinity Sink Connector -> ClickHouse
One executable, one service.
Final step - Monitor
● Monitor Lag
● Connector Status
● Kafka monitoring
● CPU/Memory Stats
Challenges
- MySQL master failover
- Schema changes (DDL)
MySQL Master Failover
After a failover, the binlog file name and position reset on the new master, so the connector derives row versions from a Snowflake-style ID based on the binlog timestamp, which keeps them monotonically increasing across the switch.
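An illustrative Python sketch of the Snowflake-ID idea: the binlog event timestamp takes the high bits and a per-millisecond sequence the low bits. The 22-bit field split is an assumption for illustration, not the connector's exact layout:

# Illustrative only; field widths are assumed, not the connector's layout.
SEQUENCE_BITS = 22
_last_ts, _seq = 0, 0

def next_version(event_ts_ms: int) -> int:
    global _last_ts, _seq
    if event_ts_ms == _last_ts:
        _seq += 1               # several binlog events in the same millisecond
    else:
        _last_ts, _seq = event_ts_ms, 0
    # The timestamp dominates the ordering, so versions keep increasing even
    # when a failover resets the binlog file name and position.
    return (event_ts_ms << SEQUENCE_BITS) | _seq

print(next_version(1677750000000), next_version(1677750000000))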
Alter Table support
MySQL: ADD COLUMN <col_name> varchar(1000) NULL -> ClickHouse: ADD COLUMN <col_name> Nullable(String)
MySQL: ADD INDEX (type btree) -> ClickHouse: ADD INDEX (type minmax)
Replicating Schema Changes
● Debezium does not provide events for all DDL changes.
● The complete DDL is only available in a separate topic (not as a SinkRecord).
● Parallel Kafka workers might process messages out of order.
Where can I get more information?
Altinity Sink Connector for ClickHouse: https://github.com/Altinity/clickhouse-sink-connector
https://github.com/ClickHouse/ClickHouse
https://github.com/mydumper/mydumper
Project roadmap and next steps
- PostgreSQL, MongoDB, and SQL Server support
- ClickHouse shards/replicas support
- Transaction support
Thank you! Questions?
https://altinity.com | https://percona.com
Editor's Notes
• #14: Experience deploying to customers and the tools we have developed in the process. It's a complicated set of steps; it is easier to automate the entire process. Create schema/databases: we have scripts for the initial load that simplify this step, and the sink connector can also auto-create tables. A complete suite of tools simplifies the process end to end.
• #15: The existing data in MySQL might be big, so we need a solution that makes the initial transfer fast (ClickHouse needs to stay in sync). An end-to-end solution for transferring data from MySQL to ClickHouse in production deployments. Debezium timeout (STATEMENT execution timeout). The source DB might have limited permissions; you might not have permission to perform OUTFILE.
• #16: Step 1: perform a dump of the data from MySQL and load it into ClickHouse (Debezium's initial snapshot might not be faster). Step 2: after the dump is loaded, validate the data. Step 3: set up CDC replication using Debezium and the Altinity sink connector.
• #17: Debezium provides initial snapshotting, but it's slow (MAX_EXECUTION_TIMEOUT); Debezium load times are very slow.
• #18: mysqlsh requires a PK; if a PK is not present, it does not parallelize and does not provide chunking capabilities.
• #19: MySQL Shell uses the zstd compression standard by default; the --threads option provides parallelism.
• #20, #21: clickhouse_loader creates the ClickHouse schema and adds the version and sign columns for UPDATEs/DELETEs.
• #22: Compare the results of the aggregation table that drives your dashboard; sales numbers have to be accurate.
• #24, #25: Different environments. We also maintain images for Debezium/Strimzi and Sink/Strimzi.
• #26: Set up alerts when connectors are down, when there is lag, and when there are errors. We also bundle the Debezium dashboard and the Kafka dashboard.
• #32: Coordination is key! There is a tradeoff between parallelism and consistency.
• #33, #34, #37: Events: truncate table.