© Instaclustr Pty Limited, 2021
Change Data Capture (CDC)
With Kafka Connect® and
the Debezium PostgreSQL®
Source Connector
Paul Brebner
Technology Evangelist, Instaclustr
December 2021
Instaclustr Managed Platform
A complete ecosystem
to support mission
critical open source
big data applications
This Talk Focuses On
Technologies
• Debezium CDC Use Case
• PostgreSQL® (source database)
• Kafka® + Kafka Connect®
(streaming)
• Elasticsearch/OpenSearch® (sink
system)
Open Source
• There’s nothing specific to our
platform
• I used Instaclustr managed Kafka
and Elasticsearch
(Source: Shutterstock)
Which Came
First?
Which Came
First?
(Source: Shutterstock)
The state or the
event?
What if you have
state and want
events?
Events and you
want state?
Or, how can we
speed up an
Elephant
(PostgreSQL)
to be as fast as
a Cheetah
(Kafka)? Cheetahs are the fastest land animal (top speed 120 km/hr;
they can accelerate from 0 to 100 km/hr in 3 seconds), three times
faster than elephants (40 km/hr)
(Source: Shutterstock)
1. The
Debezium
PostgreSQL
Connector
• The Debezium PostgreSQL connector captures
row-level database changes and streams them to
Kafka via Kafka Connect.
• Runs as a Kafka source connector
• How does it get PostgreSQL change events? Does it
poll with queries?
1. The
Debezium
PostgreSQL
Connector
• As of PostgreSQL 10+, there is a logical replication stream
mode, called pgoutput that is natively supported by
PostgreSQL
• This means that a Debezium PostgreSQL connector can
consume that replication stream [as a client] without the need
for additional plug-ins
[Diagram: PostgreSQL (pgoutput) → logical replication stream → pgoutput client (Debezium connector)]
1. The
Debezium
PostgreSQL
Connector—
Run It
• Download the Debezium PostgreSQL connector
• Deploy it:
o Upload to AWS S3 bucket
o Synchronise with Instaclustr managed Kafka Connect
o "io.debezium.connector.postgresql.PostgresConnector" will be in list
of available connectors on the console
• Configure PostgreSQL
o Set wal_level (write ahead log) to logical (3rd non-default level,
requires server restart)
o Create Debezium user with REPLICATION and LOGIN permissions
o These need PostgreSQL admin permissions
• Configure Debezium connector and run it
o plugin.name must be set to pgoutput; you also need the PG
username/password and IP
Configure
and
Run
curl https://KafkaConnectIP:8083/connectors -X POST -H
'Content-Type: application/json' -k -u kc_username:kc_password
-d '{
"name": "debezium-test1",
"config": {
"connector.class":
"io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "PG_IP",
"database.port": "5432",
"database.user": "pg_username",
"database.password": "pg_password",
"database.dbname" : "postgres",
"database.server.name": "test1",
"plugin.name": "pgoutput"
}
}
'
If it worked, you will see a single task
running; tasks.max can only be 1
Exploring the
Debezium
PostgreSQL
Connector
Change Data
Events
A terrifying “Giraffosaurus” (T-Raffe?)!
(Source: Shutterstock)
CRUD?
What
Operations
Result in
Change?
Create/Insert? Yes
Read? No
Update? Yes
Delete? Yes
Table ->
Topic
Mapping
What Does
the Kafka
Record Look
Like?
• CDC events from PostgreSQL: server + schema + table →
Kafka topic: server.schema.table
• Example insert record (table has id and v1 v2 integer columns):
• Struct{after=Struct{id=1,v1=2,v2=3},
source=Struct{version=1.6.1.Final, connector=postgresql, name=test1,
ts_ms=1632457564326, db=postgres, sequence=["1073751912","1073751912"],
schema=public, table=test1, txId=612, lsn=1073751968},
op=c, ts_ms=1632457564351}
• Operation types: “op=c” (insert), “op=u” (update), “op=d” (delete)
• For insert and update there’s an “after” record with id and values
after transaction committed
• For delete there’s a “before” record which shows id and NULL
value only
• And lots of metadata
• Documentation led me to believe I would be seeing JSON with
schema metadata? What’s wrong?
Updated
Connector
Configuration
and JSON
Record
Example
"value.converter":
"org.apache.kafka.connect.json.JsonConverter"
"value.converter.schemas.enable": "true"
"key.converter": "org.apache.kafka.connect.json.JsonConverter"
"key.converter.schemas.enable": "true”
The schema output is verbose (each record carries schema and
payload parts); turn it off, and records contain the implicit
payload only. An insert now looks like this:
{"before":null,"after":{"id":10,"v1":10,"v2":10},"source":{"version":"
1.6.1.Final","connector":"postgresql","name":"test1","ts_ms":1632
717503331,"snapshot":"false","db":"postgres","sequence":"["194
6172256","1946172256"]","schema":"public","table":"test1","txId
":1512,"lsn":59122909632,"xmin":null},"op":"c","ts_ms":16327175
03781,"transaction":null}
Two T’s
- Truncations
- Transactions
• Truncate
o Is also a PostgreSQL operation (removes all rows from a table)
o What would you expect to happen?
o Lots of deletes? No—nothing
o Turned off by default
• Transactions
o PostgreSQL is a real SQL transactional database
o What happens when multiple tables are changed in a
single transaction?
o You get multiple Kafka records, with the same transaction ID
o The transaction ID can optionally be written to another
Kafka topic
• Note that to process Truncations and Transactions the Kafka sink
connector needs to be pretty intelligent, and semantics will
depend on target sink system
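Because each table change arrives as its own Kafka record, a sink that cares about atomicity has to regroup records by transaction ID itself. A hedged sketch of that idea (the records below are made-up minimal events, not real connector output):

```python
from collections import defaultdict

# Hypothetical stream of change events: two tables, two transactions.
records = [
    {"source": {"table": "orders", "txId": 612}, "op": "c"},
    {"source": {"table": "order_lines", "txId": 612}, "op": "c"},
    {"source": {"table": "orders", "txId": 613}, "op": "u"},
]

# Regroup by transaction ID so all changes from one commit stay together.
by_tx = defaultdict(list)
for rec in records:
    by_tx[rec["source"]["txId"]].append(rec)

# Transaction 612 touched both tables; 613 touched one.
assert len(by_tx[612]) == 2
assert len(by_tx[613]) == 1
```

How (and whether) a grouped transaction can be applied atomically then depends entirely on the target sink system.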
Debezium
PostgreSQL
Connector
Throughput
How fast can a Debezium PostgreSQL Connector run?
(Source: Shutterstock)
Debezium
PostgreSQL
Connector
Throughput
1 Task Only
(Source: Shutterstock)
Debezium
PostgreSQL
Connector
Throughput
1 Task Only
Throughput limited to 7,000 events/s per task
Debezium
PostgreSQL
Connector
Throughput
1 Task Only
But the PostgreSQL server is capable of 41,000 inserts/s (6x more, from previous tests)
Solutions?
Most workloads will be
more balanced between
reads/writes
One lane may be fine!
(Sources: Paul Brebner &
Shutterstock)
Solutions?
A wider bridge?
(Source: Paul Brebner)
Solutions?
Multiple connectors
1 per table?
Multiple
replication
slots
Solutions?
Multiple connectors
1 per table?
Odd Behaviour
One connector watching
2 tables
Multiple changes to 1 table
before the other
Changes in the 1st table were all
processed before any changes
in the other (10 minute delay!)
Multiple connectors may be
best practice
Only 1 table at a time is processed?
(Source: Shutterstock)
What if there
are lots of
tables (and
databases)?
Better? 1 connector per
table “group” (tables
common to a service,
tables with similar change
rates, etc.)
[Diagram: Tables 1–3 owned by Service 1 → Debezium Connector 1;
Tables 4–6 owned by Service 2 → Debezium Connector 2]
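One connector per table group can be expressed with Debezium's table.include.list property. A rough sketch of two such configs (server names and table lists are made up for illustration):

```python
# Shared settings for both hypothetical connectors.
base = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.dbname": "postgres",
}

# One connector (and therefore one replication slot) per table group.
connector1 = {**base,
              "database.server.name": "service1",
              "table.include.list": "public.table1,public.table2,public.table3"}
connector2 = {**base,
              "database.server.name": "service2",
              "table.include.list": "public.table4,public.table5,public.table6"}

# The groups don't overlap, so each service's changes are
# decoded and delivered independently of the other's.
assert connector1["table.include.list"] != connector2["table.include.list"]
```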
Streaming
Debezium
PostgreSQL
Connector
Change Data
Capture Events
Into Elasticsearch
With Kafka Sink
Connectors
The final metamorphosis, from Cheetah (Kafka)
to Rhino (Elasticsearch)!
(Source: Shutterstock)
What Can
You Do With
the CDC Data
Once It’s in
Kafka?
Stream it into 1 or more sink systems, e.g. Elasticsearch
Pipeline Blog
Series
Berlin Beer Pipes?
(Source: Paul Brebner)
Reuse Kafka
Elasticsearch sink
connectors
Worked well with
schemaless JSON data
Camel Sink
Connector?
Missing a class
(“org.elasticsearch.rest.
BytesRestResponse”)
Gave up!
(Source: Shutterstock)
APACHE
Tried the
Lenses
Connector
Example configuration
To process 7,000
events/s you need more
tasks, partitions, and
Elasticsearch shards,
and probably the bulk API!
curl https://KC_IP:8083/connectors/elastic-sink-tides/config -k -u
KC_user:KC_password -X PUT -H 'Content-Type: application/json' -d '
{
"connector.class" :
"com.datamountaineer.streamreactor.connect.elastic7.ElasticSinkCon
nector",
"tasks.max" : 100,
"topics" : "test1.public.test1",
"connect.elastic.hosts" : "ES_IP",
"connect.elastic.port" : 9201,
"connect.elastic.kcql" : "INSERT INTO test-index SELECT * FROM
test1.public.test",
"connect.elastic.use.http.username" : "ES_user",
"connect.elastic.use.http.password" : "ES_password"
}'
All Events Are
“Inserts” Into
Elasticsearch
But we have “before”
and “after”?!
Get rid of before events
with Single Message
Transformation on
Source connector side
curl https://KC_IP:8083/connectors -X POST -H 'Content-Type: application/json' -k -u
kc_user:kc_password -d '{
"name": "debezium-test1",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "pg_ip",
"database.port": "5432",
"database.user": "pg_user",
"database.password": "pg_password",
"database.dbname" : "postgres",
"database.server.name": "test1",
"plugin.name": "pgoutput",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
}
}
'
This "event flattening" SMT extracts the after field from a Debezium change
event and creates a simple Kafka record with the after field contents.
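In Python terms, what ExtractNewRecordState does is roughly this (a simplified sketch of the default behaviour, ignoring the SMT's delete-rewriting and metadata options):

```python
def flatten(change_event):
    """Reduce a Debezium change event to just the new row state.

    Inserts, updates, and snapshot reads yield the "after" image;
    deletes yield None here (the real SMT drops deletes by default,
    or can optionally rewrite them).
    """
    if change_event["op"] in ("c", "u", "r"):  # create, update, snapshot read
        return change_event["after"]
    return None  # delete: no new row state to emit

insert = {"op": "c", "before": None, "after": {"id": 1, "v1": 2, "v2": 3}}
delete = {"op": "d", "before": {"id": 1}, "after": None}

assert flatten(insert) == {"id": 1, "v1": 2, "v2": 3}
assert flatten(delete) is None
```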
All Events Are
“Inserts” Into
Elasticsearch
(same source connector configuration as the previous slide)
How would we
process updates
and deletes?
A Clever
Test/Trick?!
Previous Tidal data ⇨
Elasticsearch pipeline,
V2 modified to use
PostgreSQL as sink
Pipeline 1: Tidal Data (REST source connector)
→ PostgreSQL
Pipeline 2: PostgreSQL → Elasticsearch
[Diagram: Pipeline 1 turns events into state; Pipeline 2 turns state into events, then state again]
A Clever
Test/Trick?!
So I used the
PostgreSQL Tidal data
as the source system!
Simple test, as there are
only “inserts”
Pipeline 1: Tidal Data (REST source connector)
→ PostgreSQL
Kibana
Visualization
of
Tidal Data ⇨
Kafka Connect ⇨
PostgreSQL ⇨
Kafka Connect ⇨
Elasticsearch ⇨ Kibana
Solving the
Chicken or
Egg Dilemma
i.e. It doesn’t matter as
long as we get to
eat the omelet
(Source: Shutterstock)
Debezium PostgreSQL Conclusions
PostgreSQL Configuration
• Required to run the Debezium source connector
• Not yet supported in Instaclustr’s managed PG service
1 Task Only
• Limits throughput
• Issues with multiple tables per connector?
• Best practice may be to run multiple connectors, maybe 1 per group of “related” tables
CDC Events
• Complex Kafka record structure
• Metadata and data
• Schema or schemaless?
• Truncate? Transactions?
Sink Connectors
• May need customization to understand CDC events and process them correctly for the target sink system
Debezium
PostgreSQL
Connector
- NOTES
■ This talk covers a generic open source solution
● Using Debezium
● PostgreSQL
● Apache Kafka Connect
● OpenSearch
■ For hosted PostgreSQL
● You may need help with PostgreSQL configuration from cloud
providers
■ But may be tricky to configure correctly
● For high throughput
● Many databases and tables
● For unbalanced changes across multiple tables
● I also didn’t test failover scenarios
■ The Debezium PostgreSQL Connector with
Instaclustr’s Managed PostgreSQL service is on
the roadmap for 2022
Further
Information
Blogs
■ www.instaclustr.com/paul-brebner/
■ Lots of blogs using open source technologies:
PostgreSQL, Apache Kafka, Apache Cassandra,
Apache Spark, Apache ZooKeeper, Redis,
Elasticsearch/OpenSearch, Cadence (new), etc.
■ For interesting use cases:
IoT, ML, anomaly detection, geospatial, fintech,
pipelines, etc.
■ Free Trial on homepage for all of these technologies
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!
© Instaclustr Pty Limited, 2020

More Related Content

Similar to Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Source Connector (20)

PDF
Webinar | Better Together: Apache Cassandra and Apache Kafka
DataStax
 
PDF
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
PDF
Kafka Summit SF 2017 - Database Streaming at WePay
confluent
 
PDF
What's New in PostgreSQL 9.6
EDB
 
PPTX
Data Stream Processing for Beginners with Kafka and CDC
Abhijit Kumar
 
PDF
How to Build an Apache Kafka® Connector
confluent
 
PDF
London Apache Kafka Meetup (Jan 2017)
Landoop Ltd
 
PPTX
Event processing without breaking production
nzender
 
PPTX
Rds data lake @ Robinhood
BalajiVaradarajan13
 
PPTX
Migrating with Debezium
Mike Fowler
 
PDF
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
ScyllaDB
 
PDF
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
PPTX
Event streaming webinar feb 2020
Maheedhar Gunturu
 
PDF
Confluent and Elastic
Paolo Castagna
 
PPTX
Using the PostgreSQL Extension Ecosystem for Advanced Analytics
Chartio
 
PDF
Indeed Flex: The Story of a Revolutionary Recruitment Platform
HostedbyConfluent
 
PDF
Lookout on Scaling Security to 100 Million Devices
ScyllaDB
 
PDF
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
PDF
Real Time Analytics with Dse
DataStax Academy
 
PPTX
Streaming Data from Scylla to Kafka
ScyllaDB
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
DataStax
 
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
Kafka Summit SF 2017 - Database Streaming at WePay
confluent
 
What's New in PostgreSQL 9.6
EDB
 
Data Stream Processing for Beginners with Kafka and CDC
Abhijit Kumar
 
How to Build an Apache Kafka® Connector
confluent
 
London Apache Kafka Meetup (Jan 2017)
Landoop Ltd
 
Event processing without breaking production
nzender
 
Rds data lake @ Robinhood
BalajiVaradarajan13
 
Migrating with Debezium
Mike Fowler
 
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
ScyllaDB
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
Event streaming webinar feb 2020
Maheedhar Gunturu
 
Confluent and Elastic
Paolo Castagna
 
Using the PostgreSQL Extension Ecosystem for Advanced Analytics
Chartio
 
Indeed Flex: The Story of a Revolutionary Recruitment Platform
HostedbyConfluent
 
Lookout on Scaling Security to 100 Million Devices
ScyllaDB
 
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
Real Time Analytics with Dse
DataStax Academy
 
Streaming Data from Scylla to Kafka
ScyllaDB
 

More from Paul Brebner (20)

PPTX
Streaming More For Less With Apache Kafka Tiered Storage
Paul Brebner
 
PDF
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
PDF
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
PDF
Architecting Applications With Multiple Open Source Big Data Technologies
Paul Brebner
 
PDF
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
Paul Brebner
 
PDF
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Paul Brebner
 
PDF
Spinning your Drones with Cadence Workflows and Apache Kafka
Paul Brebner
 
PDF
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
PDF
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
PDF
A Visual Introduction to Apache Kafka
Paul Brebner
 
PDF
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Paul Brebner
 
PDF
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Paul Brebner
 
PDF
Grid Middleware – Principles, Practice and Potential
Paul Brebner
 
PDF
Grid middleware is easy to install, configure, secure, debug and manage acros...
Paul Brebner
 
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
PPTX
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Paul Brebner
 
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
PPTX
0b101000 years of computing: a personal timeline - decade "0", the 1980's
Paul Brebner
 
PDF
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
Paul Brebner
 
Streaming More For Less With Apache Kafka Tiered Storage
Paul Brebner
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
Architecting Applications With Multiple Open Source Big Data Technologies
Paul Brebner
 
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
Paul Brebner
 
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Paul Brebner
 
Spinning your Drones with Cadence Workflows and Apache Kafka
Paul Brebner
 
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
A Visual Introduction to Apache Kafka
Paul Brebner
 
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Paul Brebner
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Paul Brebner
 
Grid Middleware – Principles, Practice and Potential
Paul Brebner
 
Grid middleware is easy to install, configure, secure, debug and manage acros...
Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
0b101000 years of computing: a personal timeline - decade "0", the 1980's
Paul Brebner
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
Paul Brebner
 
Ad

Recently uploaded (20)

PDF
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
PDF
Hello I'm "AI" Your New _________________
Dr. Tathagat Varma
 
PPSX
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
PDF
FME as an Orchestration Tool with Principles From Data Gravity
Safe Software
 
PPTX
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Pitch ...
Michele Kryston
 
PPTX
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
PDF
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
ScyllaDB
 
PPTX
Practical Applications of AI in Local Government
OnBoard
 
PDF
Enhancing Environmental Monitoring with Real-Time Data Integration: Leveragin...
Safe Software
 
PDF
The Growing Value and Application of FME & GenAI
Safe Software
 
PDF
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
PDF
Redefining Work in the Age of AI - What to expect? How to prepare? Why it mat...
Malinda Kapuruge
 
PDF
Darley - FIRST Copenhagen Lightning Talk (2025-06-26) Epochalypse 2038 - Time...
treyka
 
PPTX
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
PPTX
Simplifica la seguridad en la nube y la detección de amenazas con FortiCNAPP
Cristian Garcia G.
 
PPTX
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 
PDF
Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Saikat Basu
 
PDF
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
DOCX
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
PDF
From Chatbot to Destroyer of Endpoints - Can ChatGPT Automate EDR Bypasses (1...
Priyanka Aash
 
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
Hello I'm "AI" Your New _________________
Dr. Tathagat Varma
 
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
FME as an Orchestration Tool with Principles From Data Gravity
Safe Software
 
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Pitch ...
Michele Kryston
 
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
ScyllaDB
 
Practical Applications of AI in Local Government
OnBoard
 
Enhancing Environmental Monitoring with Real-Time Data Integration: Leveragin...
Safe Software
 
The Growing Value and Application of FME & GenAI
Safe Software
 
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
Redefining Work in the Age of AI - What to expect? How to prepare? Why it mat...
Malinda Kapuruge
 
Darley - FIRST Copenhagen Lightning Talk (2025-06-26) Epochalypse 2038 - Time...
treyka
 
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
Simplifica la seguridad en la nube y la detección de amenazas con FortiCNAPP
Cristian Garcia G.
 
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 
Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Saikat Basu
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
From Chatbot to Destroyer of Endpoints - Can ChatGPT Automate EDR Bypasses (1...
Priyanka Aash
 
Ad

Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Source Connector

  • 1. ©Instaclustr Pty Limited, 2021 Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL® Source Connector Paul Brebner Technology Evangelist, Instaclustr December 2021
  • 2. © Instaclustr Pty Limited, 2021 Instaclustr Managed Platform A complete ecosystem to support mission critical open source big data applications
  • 3. This Talk Focuses On Technologies • Debezium CDC Use Case • PostgreSQL® (source database) • Kafka® + Kafka Connect® (streaming) • Elasticsearch/OpenSearch® (sink system) Open Source • There’s nothing specific to our platform • I used Instaclustr managed Kafka and Elasticsearch © Instaclustr Pty Limited, 2021
  • 4. (Source: Shutterstock) Which Came First? © Instaclustr Pty Limited, 2021
  • 5. Which Came First? (Source: Shutterstock) The state or the event? What if you have state and want events? Events and you want state? © Instaclustr Pty Limited, 2021
  • 6. Or, how can speed up an Elephant (PostgreSQL) to be as fast as a Cheetah (Kafka)? Cheetahs are the fastest land animal (top speed 120km/hr. They can accelerate from 0 to 100km/hr in 3 seconds), three times faster than elephants (40km/hr) (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 7. 1. The Debezium PostgreSQL Connector • The Debezium PostgreSQL connector captures row-level database changes and streams them to Kafka via Kafka Connect. • Runs as a Kafka source connector • How does it get Postgresql change events? Does it poll with queries? © Instaclustr Pty Limited, 2021
  • 8. 1. The Debezium PostgreSQL Connector • As of PostgreSQL 10+, there is a logical replication stream mode, called pgoutput that is natively supported by PostgreSQL • This means that a Debezium PostgreSQL connector can consume that replication stream [as a client] without the need for additional plug-ins pgoutput pgoutput client Logical replication stream © Instaclustr Pty Limited, 2021
  • 9. 1. The Debezium PostgreSQL Connector— Run It • Download the Debezium PostgreSQL connector • Deploy it: o Upload to AWS S3 bucket o Synchronise with Instaclustr managed Kafka connect o "io.debezium.connector.postgresql.PostgresConnector" will be in list of available connectors on the console • Configure PostgreSQL o Set wal_level (write ahead log) to logical (3rd non-default level, requires server restart) o Create Debezium user with REPLICATION and LOGIN permissions o These need PostgreSQL admin permissions • Configure Debezium connector and run it o Plugin.name default must be set to pgoutput, need PG username/password and IP
  • 10. Configure and Run curl https://KafkaConnectIP:8083/connectors -X POST -H 'Content-Type: application/json' -k -u kc_username:kc_password -d '{ "name": "debezium-test1", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "PG_IP", "database.port": "5432", "database.user": "pg_username", "database.password": "pg_password", "database.dbname" : "postgres", "database.server.name": "test1", "plugin.name": "pgoutput" } } ‘ If it worked you will see a single task running, tasks.max can only = 1 © Instaclustr Pty Limited, 2021
  • 11. Exploring the Debezium PostgreSQL Connector Change Data Events A terrifying “Giraffosaurus” (T-Raffe?)! (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 12. CRUD? What Operations Result in Change? Create/Insert? Yes Read? No Update? Yes Delete? Yes © Instaclustr Pty Limited, 2021
  • 13. Table -> Topic Mapping What Does the Kafka Record Look Like? • CDC events from PostgreSQL: server + database + table à Kafka topic: server.database.name • Example insert record (table has id and v1 v2 integer columns): • Struct{after=Struct{id=1,v1=2,v2=3},source=Struct{version=1.6.1. Final,connector=postgresql,name=test1,ts_ms=1632457564326,d b=postgres,sequence=["1073751912","1073751912"],schema=pu blic,table=test1,txId=612,lsn=1073751968},op=c,ts_ms=1632457 564351} • Operation types: “op=c” (insert), “op=u” (update), “op=d” (delete) • For insert and update there’s an “after” record with id and values after transaction committed • For delete there’s a “before” record which shows id and NULL value only • And lots of metadata • Documentation led me to believe I would be seeing JSON with schema metadata? What’s wrong? © Instaclustr Pty Limited, 2021
  • 14. Updated Connector Configuration and JSON Record Example "value.converter": "org.apache.kafka.connect.json.JsonConverter" "value.converter.schemas.enable": "true" "key.converter": "org.apache.kafka.connect.json.JsonConverter" "key.converter.schemas.enable": "true” Schema is verbose, has Schema and payload records; turn it off, now have implicit payload only—Insert now looks like this: {"before":null,"after":{"id":10,"v1":10,"v2":10},"source":{"version":" 1.6.1.Final","connector":"postgresql","name":"test1","ts_ms":1632 717503331,"snapshot":"false","db":"postgres","sequence":"["194 6172256","1946172256"]","schema":"public","table":"test1","txId ":1512,"lsn":59122909632,"xmin":null},"op":"c","ts_ms":16327175 03781,"transaction":null} © Instaclustr Pty Limited, 2021
  • 15. Two T’s - Truncations - Transactions • Truncate o Is also a PostgreSQL operation (makes a table vanish) o What would you expect to happen? o Lots of deletes? No—nothing o Turned off by default • Transactions o PostgreSQL is a real SQL transactional database o What happens when multiple tables are changed in a single transaction? o You get multiple Kafka records, with the same transaction ID o The transaction ID can optionally be written to another Kafka topic • Note that to process Truncations and Transactions the Kafka sink connector needs to be pretty intelligent, and semantics will depend on target sink system © Instaclustr Pty Limited, 2021
  • 16. Debezium PostgreSQL Connector Throughput How fast can a Debezium PostgreSQL Connector run? (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 17. Debezium PostgreSQL Connector Throughput 1 Task Only (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 18. Debezium PostgreSQL Connector Throughput 1 Task Only Throughput limited to 7,000 events/s per task © Instaclustr Pty Limited, 2021
  • 19. Debezium PostgreSQL Connector Throughput 1 Task Only But PostgreSQL server is capable of 41,000 inserts/s (6x more, from previous tests) © Instaclustr Pty Limited, 2021
  • 20. Solutions? Most workloads will be more balanced between reads/writes One lane may be fine! © Instaclustr Pty Limited, 2021 (Sources: Paul Brebner & Shutterstock)
  • 21. Solutions? A wider bridge? © Instaclustr Pty Limited, 2021 (Source: Paul Brebner)
  • 22. Solutions? Multiple connectors 1 per table? © Instaclustr Pty Limited, 2021
  • 23. M ultiple replication slots © Instaclustr Pty Limited, 2021 Solutions? Multiple connectors 1 per table?
  • 24. Odd Behaviour One connector watching 2 tables Multiple changes to 1 before the other Changes in 1st table all processed before any changes in the other (10m delay!) Multiple connectors may be best practice Only 1 table at a time is processed? © Instaclustr Pty Limited, 2021 (Source: Shutterstock)
  • 25. What if there are lots of tables (and databases)? Better? 1 connector per table “group” (tables common to a service, tables with similar change rates, etc.) Table 1 Table 2 Table 3 Table 4 Table 5 Table 6 Debezium Connector 1 Debezium Connector 2 Service 1 Service 2 © Instaclustr Pty Limited, 2021
  • 26. Streaming Debezium PostgreSQL Connector Change Data Capture Events Into Elasticsearch With Kafka Sink Connectors The final metamorphosis, from Cheetah (Kafka) to Rhino (Elasticsearch)! (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 27. What Can You Do With the CDC Data Once It’s in Kafka? Stream it into 1 or more sink systems, e.g. Elasticsearch © Instaclustr Pty Limited, 2021
  • 28. Pipeline Blog Series Berlin Beer Pipes? (Source:Paul Brebner) Reuse Kafka Elasticsearch sink connectors Worked well with schema less JSON data © Instaclustr Pty Limited, 2021
  • 29. Camel Sink Connector? Missing a class (“org.elasticsearch.rest. BytesRestResponse”) Gave up! (Source: Shutterstock) APACHE © Instaclustr Pty Limited, 2021
  • 30. Tried the Lenses Connector Example configuration To process 7,000 events/s need more tasks, partitions, and Elasticsearch shards, and probably BULK API! curl https://KC_IP:8083/connectors/elastic-sink-tides/config -k -u KC_user:KC_password -X PUT -H 'Content-Type: application/json' -d ' { "connector.class" : "com.datamountaineer.streamreactor.connect.elastic7.ElasticSinkCon nector", "tasks.max" : 100, "topics" : "test1.public.test1", "connect.elastic.hosts" : "ES_IP", "connect.elastic.port" : 9201, "connect.elastic.kcql" : "INSERT INTO test-index SELECT * FROM test1.public.test", "connect.elastic.use.http.username" : "ES_user", "connect.elastic.use.http.password" : "ES_password" } }' © Instaclustr Pty Limited, 2021
  • 31. All Events Are “Inserts” Into Elasticsearch But we have “before” and “after”?! Get rid of before events with a Single Message Transformation on the source connector side: curl https://KC_IP:8083/connectors -X POST -H 'Content-Type: application/json' -k -u kc_user:kc_password -d '{ "name": "debezium-test1", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "pg_ip", "database.port": "5432", "database.user": "pg_user", "database.password": "pg_password", "database.dbname" : "postgres", "database.server.name": "test1", "plugin.name": "pgoutput", "value.converter": "org.apache.kafka.connect.json.JsonConverter", "value.converter.schemas.enable": "false", "key.converter": "org.apache.kafka.connect.json.JsonConverter", "key.converter.schemas.enable": "false", "transforms": "unwrap", "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState" } }' This “event flattening” SMT extracts the after field from a Debezium change event and creates a simple Kafka record with the after field contents. © Instaclustr Pty Limited, 2021
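The effect of the `ExtractNewRecordState` SMT can be illustrated in plain Python. This is a sketch of the transformation's core idea, not Debezium's actual implementation, and the sample event is simplified:

```python
# Sketch of what Debezium's ExtractNewRecordState ("event flattening") SMT does:
# given a full change-event envelope, keep only the "after" row image.

def flatten(change_event):
    """Return the 'after' state of a Debezium change event (None for deletes)."""
    return change_event.get("after")

# A (simplified) Debezium change event for an UPDATE:
event = {
    "before": {"id": 1, "name": "old"},
    "after":  {"id": 1, "name": "new"},
    "source": {"table": "test"},
    "op": "u",
}

flat = flatten(event)
print(flat)  # {'id': 1, 'name': 'new'}
```

The sink connector then sees a plain row-shaped record instead of the full envelope, which is what a generic Elasticsearch sink expects.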
  • 32. All Events Are “Inserts” Into Elasticsearch But we have “before” and “after”?! (Same connector configuration as the previous slide.) How would we process updates and deletes? © Instaclustr Pty Limited, 2021
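One hedged answer to the updates-and-deletes question: route on Debezium's `op` code, using the record key as the Elasticsearch document id so updates overwrite and deletes remove. This is a sketch of the routing logic only, not a tested connector; the index name `test-index` is carried over from the earlier KCQL example as an assumption:

```python
# Sketch: map Debezium op codes to Elasticsearch actions.
# 'c' (create), 'u' (update), and 'r' (snapshot read) become index operations
# keyed by the row's primary key, so updates overwrite; 'd' becomes a delete.

def to_es_action(key, change_event, index="test-index"):
    op = change_event.get("op")
    doc_id = key["id"]  # assumes the Kafka record key carries the primary key
    if op in ("c", "u", "r"):
        return {"index": index, "_id": doc_id, "doc": change_event["after"]}
    if op == "d":
        return {"delete": index, "_id": doc_id}
    return None  # ignore anything unrecognised

action = to_es_action({"id": 1}, {"op": "d", "before": {"id": 1}, "after": None})
print(action)  # {'delete': 'test-index', '_id': 1}
```

A real sink would translate these actions into Bulk API `index`/`delete` entries; the key point is that idempotent, id-keyed writes make CDC updates and deletes safe to replay.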
  • 33. A Clever Test/Trick?! Previous Tidal data ⇨ Elasticsearch pipeline, V2 modified to use PostgreSQL as sink Pipeline 1: Tidal Data (REST source connector) → PostgreSQL © Instaclustr Pty Limited, 2021
  • 34. A Clever Test/Trick?! Pipeline 1: Tidal Data (REST source connector) → PostgreSQL Pipeline 2: PostgreSQL → Elasticsearch So I used the PostgreSQL Tidal data as the source system! A simple test, as we only have “inserts” © Instaclustr Pty Limited, 2021
  • 35. Kibana Visualization of Tidal Data⇨ Kafka Connect ⇨ PostgreSQL ⇨ Kafka Connect ⇨ Elasticsearch ⇨ Kibana © Instaclustr Pty Limited, 2021
  • 36. Solving the Chicken or Egg Dilemma i.e. It doesn’t matter as long as we get to eat the omelet (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 37. Debezium PostgreSQL Conclusions PostgreSQL Configuration: required to run the Debezium source connector; not yet supported in Instaclustr’s managed PG service. 1 Task Only: limits throughput; issues with multiple tables per connector? Best practice may be to run multiple connectors, maybe 1 per group of “related” tables. CDC Events: complex Kafka record structure (metadata and data); schema or schemaless? Truncate? Transactions? Sink Connectors: may need customization to understand CDC events and process them correctly for the target sink system © Instaclustr Pty Limited, 2021
  • 38. Debezium PostgreSQL Connector - NOTES ■ This talk covers a generic open source solution ● Using Debezium ● PostgreSQL ● Apache Kafka Connect ● OpenSearch ■ For hosted PostgreSQL ● You may need help with PostgreSQL configuration from cloud providers ■ But may be tricky to configure correctly ● For high throughput ● Many databases and tables ● For unbalanced changes across multiple tables ● I also didn’t test failover scenarios ■ The Debezium PostgreSQL Connector with Instaclustr’s Managed PostgreSQL service is on the roadmap for 2022 © Instaclustr Pty Limited, 2021
  • 39. Further Information Blogs ■ www.instaclustr.com/paul-brebner/ ■ Lots of blogs using open source technologies: PostgreSQL, Apache Kafka, Apache Cassandra, Apache Spark, Apache ZooKeeper, Redis, Elasticsearch/OpenSearch, Cadence (new), etc. ■ For interesting use cases: IoT, ML, anomaly detection, geospatial, fintech, pipelines, etc. ■ Free Trial on homepage for all of these technologies © Instaclustr Pty Limited, 2021