1
Streaming database
events over Kafka
using Debezium
2
https://medium.com/jobteaser-dev-team
About me and where I work
@knil_sama
JobTeaser
Preparing the new generation to reach its
full potential, embrace the future with
optimism and make its mark in the world
Clément Demonchy
Data engineering with:
Python, AWS, Kubernetes, Kafka,
and anything that works
We are hiring !
3
The need
Applicative data is used to
● Recommend offers to students
● Measure KPIs for the company
● Enrich users’ experience
4
Separated space between data and website
JobTeaser infra
5
Content
6
● Fetch all the data
● Follow schema updates
● Real time
● No impact on production performance
The requirements
7
Log-based Change-Data-Capture (CDC)
8
Debezium
Created by RedHat
Open source
Free
Supports SQL DBs
Compatible with Kafka Connect
9
We just happen to have a Kafka
NOT THIS ONE
10
Kafka connect
11
Install a connector
Upload connector runnable
on kafka connect worker
Write connector
configuration
Push connector
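The "write configuration, then push" steps can be sketched against the Kafka Connect REST API, where a PUT on /connectors/<name>/config creates or updates a connector. This is a minimal sketch, not the configuration used in the talk: the host names, credentials, connector name and server name are placeholders, and the property names assume a Debezium MySQL connector from the same era as these slides (newer releases rename several of them).

```python
import json
import urllib.request


def build_connector_request(connect_url, name, config):
    """Build (without sending) the PUT request that creates or updates a connector."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors/{name}/config",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values: every host, credential and name below is a placeholder.
config = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "change-me",
    "database.server.id": "184054",
    "database.server.name": "jobteaser",
    "database.history.kafka.bootstrap.servers": "kafka.example.internal:9092",
    "database.history.kafka.topic": "schema-changes.jobteaser",
}
request = build_connector_request(
    "http://connect.example.internal:8083", "mysql-source", config
)

# To actually push it to a running worker, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the configuration easy to review and test before it reaches a worker.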
12
A few more configurations to go
13
1. Create a user with the following rights
SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
2. Update configuration
Changes on MySQL
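As an illustration of step 1, the statements for such a user could be generated as below. This is a sketch only: the user name, host pattern and password placeholder are assumptions, and the privileges are granted globally (ON *.*) because RELOAD, SHOW DATABASES and the replication privileges are global privileges in MySQL.

```python
# Privileges listed on the slide.
PRIVILEGES = [
    "SELECT",
    "RELOAD",
    "SHOW DATABASES",
    "REPLICATION SLAVE",
    "REPLICATION CLIENT",
]


def debezium_user_statements(user, host="%"):
    """Return the SQL statements that create the CDC user and grant its rights."""
    account = f"'{user}'@'{host}'"
    return [
        # '<password>' is a placeholder to replace before running the statement.
        f"CREATE USER {account} IDENTIFIED BY '<password>';",
        f"GRANT {', '.join(PRIVILEGES)} ON *.* TO {account};",
    ]


statements = debezium_user_statements("debezium")
for statement in statements:
    print(statement)
```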
14
To avoid connection dropping/hanging
The expire_logs_days setting does not work on RDS
RDS specific changes
15
Increase workers memory
=> For initial snapshot surge (Database)
Connectors in distributed mode
=> Fully stateless
auto.create.topics.enable set to true
=> Large number of tables and new tables
Changes on Kafka
16
snapshot.mode :
With data: initial, when_needed or never
Without data: schema_only, schema_only_recovery
On RDS you need a global lock during the snapshot;
otherwise use snapshot.locking.mode: minimal, extended or none
Debezium Snapshot
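To make the two snapshot options concrete, here is a small sketch that builds the corresponding fragment of the connector configuration and rejects values not listed on the slide. The helper name and its defaults are illustrative assumptions; other Debezium versions may accept additional values.

```python
# Values taken from the slide.
SNAPSHOT_MODES = {"initial", "when_needed", "never", "schema_only", "schema_only_recovery"}
LOCKING_MODES = {"minimal", "extended", "none"}


def snapshot_settings(mode="initial", locking_mode="minimal"):
    """Return the snapshot part of a connector configuration, validating both values."""
    if mode not in SNAPSHOT_MODES:
        raise ValueError(f"unknown snapshot.mode: {mode!r}")
    if locking_mode not in LOCKING_MODES:
        raise ValueError(f"unknown snapshot.locking.mode: {locking_mode!r}")
    return {"snapshot.mode": mode, "snapshot.locking.mode": locking_mode}


# Example: capture only the schema and take no locks at all.
settings = snapshot_settings("schema_only", "none")
print(settings)
```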
17
Good to go ?
18
Anonymization strategies with Debezium
Tables: table.whitelist / table.blacklist
Columns: column.blacklist / column.whitelist
Content: column.mask.with.length.chars
Always prefer explicit whitelisting
because you can’t prevent columns changing name or being added
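A sketch of how explicit table whitelisting and content masking might be combined in one connector configuration. The database, table and column names are hypothetical, and the mask length of 10 is an arbitrary choice. The masking property embeds the length in its name (column.mask.with.<length>.chars) and replaces the listed columns with that many asterisks. Newer Debezium releases rename the whitelist and blacklist options to include and exclude lists.

```python
def anonymization_settings(tables, masked_columns, mask_length=10):
    """Return connector properties that whitelist tables and mask column content."""
    return {
        # Only the tables listed here are captured.
        "table.whitelist": ",".join(tables),
        # Values of these columns are replaced by `mask_length` asterisks.
        f"column.mask.with.{mask_length}.chars": ",".join(masked_columns),
    }


# Hypothetical schema: names are placeholders, not the real JobTeaser tables.
settings = anonymization_settings(
    tables=["app.users", "app.offers"],
    masked_columns=["app.users.email", "app.users.phone"],
)
print(settings)
```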
19
In our case the pre-existing database was
~ 40 GB db
> 100 Tables
Initial snapshot took 40 minutes to complete
In practice
20
Content
Pushing it in production
21
Big-row issue: we have rows larger than the default maximum size!
In theory you only need to increase max.request.size, but that is not enough
On Kafka Connect workers
On Kafka brokers
Issues with Debezium (part 1)
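The transcript does not preserve the exact worker and broker settings, so the property names below are assumptions based on the size limits Kafka usually enforces along this path: the producer embedded in the Connect worker (producer.max.request.size) and the brokers (message.max.bytes, replica.fetch.max.bytes). A sketch that keeps them aligned on one value:

```python
def max_message_settings(max_bytes):
    """Return size limits to apply on both sides so that large rows go through.

    Assumption: these are the usual limits involved; the slides do not list them.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    value = str(max_bytes)
    return {
        # Goes into the Kafka Connect worker properties.
        "connect_worker": {"producer.max.request.size": value},
        # Goes into the broker configuration.
        "broker": {
            "message.max.bytes": value,
            "replica.fetch.max.bytes": value,
        },
    }


# Example: allow messages of up to 5 MiB.
settings = max_message_settings(5 * 1024 * 1024)
print(settings)
```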
22
Some DCL commands
make the connector crash
So we had to set in the connector
Issues with Debezium (part 2)
23
If MySQL goes down, connector will fail
and you have to restart the task
1) Identify the failing task
2) Restart it
Easy case: binlog and consumer offset still exist
Connector stream recovery
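The two recovery steps map onto the Kafka Connect REST API: GET /connectors/<name>/status lists each task with its state, and POST /connectors/<name>/tasks/<id>/restart restarts one task. A minimal sketch, where the worker URL, connector name and the sample status payload are made up for illustration:

```python
import urllib.request


def failed_task_ids(status):
    """Step 1: extract the ids of FAILED tasks from a connector status payload."""
    return [
        task["id"]
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    ]


def build_restart_request(connect_url, name, task_id):
    """Step 2: build (without sending) the POST request that restarts one task."""
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors/{name}/tasks/{task_id}/restart",
        method="POST",
    )


# Sample payload shaped like the response of GET /connectors/mysql-source/status.
status = {
    "name": "mysql-source",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "worker_id": "10.0.0.1:8083"}],
}
requests_to_send = [
    build_restart_request("http://connect.example.internal:8083", status["name"], task_id)
    for task_id in failed_task_ids(status)
]

# To actually restart the tasks, uncomment:
# for request in requests_to_send:
#     urllib.request.urlopen(request).close()
```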
24
Basic monitoring:
● Prometheus
● Grafana
● Alerting on Slack
Be less strict and log errors instead of crashing for
Watch out for the DEBUG level on some loggers, or you will flood the worker
Staying alive
25
Kafka output
26
Connector created by Confluent
Streams Debezium events back to PostgreSQL
Handles creates, updates and schema changes
But … deleted records are not removed from the target database
(A PR was merged recently and the new default is to delete them)
JDBC Connect to the rescue
27
Debezium provides an option for use with the JDBC connector
that adds a flag column “__deleted” to every table
Other useful SMTs: RemoveNulls, MultiTimestampConverter
Single Message Transformations (SMTs)
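To show what the “__deleted” flag does to a record, here is a plain-Python imitation of flattening a Debezium change event. It is not the SMT itself. It assumes the usual Debezium envelope (op, before, after) and the behaviour of the delete.handling.mode=rewrite option of Debezium's record-flattening SMT, which keeps the deleted row's last known values and marks it with a string flag. The sample rows are made up.

```python
def flatten_with_deleted_flag(event):
    """Flatten a change event into a single row carrying a "__deleted" flag."""
    is_delete = event["op"] == "d"
    # A delete has no "after" state, so fall back to the last known values.
    row = dict(event["before"] if is_delete else event["after"])
    row["__deleted"] = "true" if is_delete else "false"
    return row


# Hypothetical events for one row of a users table.
update_event = {"op": "u", "before": {"id": 7, "name": "Ann"}, "after": {"id": 7, "name": "Anna"}}
delete_event = {"op": "d", "before": {"id": 7, "name": "Anna"}, "after": None}

updated_row = flatten_with_deleted_flag(update_event)
deleted_row = flatten_with_deleted_flag(delete_event)
print(updated_row)
print(deleted_row)
```

With this shape the sink can upsert every row and leave it to downstream queries to filter on the flag, instead of issuing deletes on the target database.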
28
Final result
29
Thanks for your attention
Gimme questions !
We are hiring !

From MySQL to PostgreSQL using Kafka + Debezium