© Instaclustr Pty Limited, 2021
Change Data Capture (CDC)
With Kafka Connect® and
the Debezium PostgreSQL®
Source Connector
Paul Brebner
Technology Evangelist, Instaclustr
December 2021
Instaclustr Managed Platform
A complete ecosystem
to support mission
critical open source
big data applications
This Talk Focuses On
Technologies
• Debezium CDC Use Case
• PostgreSQL® (source database)
• Kafka® + Kafka Connect®
(streaming)
• Elasticsearch/OpenSearch® (sink
system)
Open Source
• There’s nothing specific to our
platform
• I used Instaclustr managed Kafka
and Elasticsearch
(Source: Shutterstock)
Which Came
First?
Which Came
First?
(Source: Shutterstock)
The state or the
event?
What if you have
state and want
events?
Events and you
want state?
Or, how can we
speed up an
Elephant
(PostgreSQL)
to be as fast as
a Cheetah
(Kafka)? Cheetahs are the fastest land animal (top speed 120 km/hr;
they can accelerate from 0 to 100 km/hr in 3 seconds), three times
faster than elephants (40 km/hr)
(Source: Shutterstock)
1. The
Debezium
PostgreSQL
Connector
• The Debezium PostgreSQL connector captures
row-level database changes and streams them to
Kafka via Kafka Connect.
• Runs as a Kafka source connector
• How does it get PostgreSQL change events? Does it
poll with queries?
1. The
Debezium
PostgreSQL
Connector
• As of PostgreSQL 10+, there is a logical replication stream
mode, called pgoutput that is natively supported by
PostgreSQL
• This means that a Debezium PostgreSQL connector can
consume that replication stream [as a client] without the need
for additional plug-ins
[Diagram: PostgreSQL (pgoutput) → logical replication stream → pgoutput client (Debezium connector)]
1. The
Debezium
PostgreSQL
Connector—
Run It
• Download the Debezium PostgreSQL connector
• Deploy it:
o Upload to AWS S3 bucket
o Synchronise with Instaclustr managed Kafka Connect
o "io.debezium.connector.postgresql.PostgresConnector" will be in list
of available connectors on the console
• Configure PostgreSQL
o Set wal_level (write ahead log) to logical (3rd non-default level,
requires server restart)
o Create Debezium user with REPLICATION and LOGIN permissions
o These need PostgreSQL admin permissions
• Configure Debezium connector and run it
o plugin.name must be set to pgoutput; you also need the PG
username/password and IP
Configure
and
Run
curl https://KafkaConnectIP:8083/connectors -X POST -H
'Content-Type: application/json' -k -u kc_username:kc_password
-d '{
"name": "debezium-test1",
"config": {
"connector.class":
"io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "PG_IP",
"database.port": "5432",
"database.user": "pg_username",
"database.password": "pg_password",
"database.dbname" : "postgres",
"database.server.name": "test1",
"plugin.name": "pgoutput"
}
}
'
If it worked, you will see a single task
running; tasks.max can only be 1
Exploring the
Debezium
PostgreSQL
Connector
Change Data
Events
A terrifying “Giraffosaurus” (T-Raffe?)!
(Source: Shutterstock)
CRUD?
What
Operations
Result in
Change?
Create/Insert? Yes
Read? No
Update? Yes
Delete? Yes
Table ->
Topic
Mapping
What Does
the Kafka
Record Look
Like?
• CDC events from PostgreSQL: server + schema + table →
Kafka topic: server.schema.table
• Example insert record (table has id and v1 v2 integer columns):
• Struct{after=Struct{id=1,v1=2,v2=3},
source=Struct{version=1.6.1.Final, connector=postgresql, name=test1,
ts_ms=1632457564326, db=postgres, sequence=["1073751912","1073751912"],
schema=public, table=test1, txId=612, lsn=1073751968},
op=c, ts_ms=1632457564351}
• Operation types: “op=c” (insert), “op=u” (update), “op=d” (delete)
• For insert and update there’s an “after” record with id and values
after transaction committed
• For delete there’s a “before” record which shows id and NULL
value only
• And lots of metadata
• Documentation led me to believe I would be seeing JSON with
schema metadata? What’s wrong?
Updated
Connector
Configuration
and JSON
Record
Example
"value.converter":
"org.apache.kafka.connect.json.JsonConverter"
"value.converter.schemas.enable": "true"
"key.converter": "org.apache.kafka.connect.json.JsonConverter"
"key.converter.schemas.enable": "true”
The schema output is verbose (each record carries schema and
payload parts); turn it off, and records contain the implicit
payload only. An insert now looks like this:
{"before":null,"after":{"id":10,"v1":10,"v2":10},"source":{"version":"
1.6.1.Final","connector":"postgresql","name":"test1","ts_ms":1632
717503331,"snapshot":"false","db":"postgres","sequence":"["194
6172256","1946172256"]","schema":"public","table":"test1","txId
":1512,"lsn":59122909632,"xmin":null},"op":"c","ts_ms":16327175
03781,"transaction":null}
Two T’s
- Truncations
- Transactions
• Truncate
o Is also a PostgreSQL operation (removes all rows from a table)
o What would you expect to happen?
o Lots of deletes? No—nothing
o Turned off by default
• Transactions
o PostgreSQL is a real SQL transactional database
o What happens when multiple tables are changed in a
single transaction?
o You get multiple Kafka records, with the same transaction ID
o The transaction ID can optionally be written to another
Kafka topic
• Note that to process Truncations and Transactions the Kafka sink
connector needs to be pretty intelligent, and semantics will
depend on target sink system
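Because each table change arrives as its own Kafka record, a sink that cares about atomicity has to regroup records by transaction ID itself. A hedged sketch of that idea (the records below are made-up minimal events, not real connector output):

```python
from collections import defaultdict

# Hypothetical stream of change events: two tables, two transactions.
records = [
    {"source": {"table": "orders", "txId": 612}, "op": "c"},
    {"source": {"table": "order_lines", "txId": 612}, "op": "c"},
    {"source": {"table": "orders", "txId": 613}, "op": "u"},
]

# Regroup by transaction ID so all changes from one commit stay together.
by_tx = defaultdict(list)
for rec in records:
    by_tx[rec["source"]["txId"]].append(rec)

# Transaction 612 touched both tables; 613 touched one.
assert len(by_tx[612]) == 2
assert len(by_tx[613]) == 1
```

How (and whether) a grouped transaction can be applied atomically then depends entirely on the target sink system.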
Debezium
PostgreSQL
Connector
Throughput
How fast can a Debezium PostgreSQL Connector run?
(Source: Shutterstock)
Debezium
PostgreSQL
Connector
Throughput
1 Task Only
(Source: Shutterstock)
Debezium
PostgreSQL
Connector
Throughput
1 Task Only
Throughput limited to 7,000 events/s per task
Debezium
PostgreSQL
Connector
Throughput
1 Task Only
But the PostgreSQL server is capable of 41,000 inserts/s (6x more, from previous tests)
Solutions?
Most workloads will be
more balanced between
reads/writes
One lane may be fine!
(Sources: Paul Brebner &
Shutterstock)
Solutions?
A wider bridge?
(Source: Paul Brebner)
Solutions?
Multiple connectors
1 per table?
Multiple
replication
slots
Solutions?
Multiple connectors
1 per table?
Odd Behaviour
One connector watching
2 tables
Multiple changes to 1 table
before the other
Changes in the 1st table were all
processed before any changes
in the other (10 minute delay!)
Multiple connectors may be
best practice
Only 1 table at a time is processed?
(Source: Shutterstock)
What if there
are lots of
tables (and
databases)?
Better? 1 connector per
table “group” (tables
common to a service,
tables with similar change
rates, etc.)
[Diagram: Tables 1–3 owned by Service 1 → Debezium Connector 1;
Tables 4–6 owned by Service 2 → Debezium Connector 2]
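One connector per table group can be expressed with Debezium's table.include.list property. A rough sketch of two such configs (server names and table lists are made up for illustration):

```python
# Shared settings for both hypothetical connectors.
base = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.dbname": "postgres",
}

# One connector (and therefore one replication slot) per table group.
connector1 = {**base,
              "database.server.name": "service1",
              "table.include.list": "public.table1,public.table2,public.table3"}
connector2 = {**base,
              "database.server.name": "service2",
              "table.include.list": "public.table4,public.table5,public.table6"}

# The groups don't overlap, so each service's changes are
# decoded and delivered independently of the other's.
assert connector1["table.include.list"] != connector2["table.include.list"]
```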
Streaming
Debezium
PostgreSQL
Connector
Change Data
Capture Events
Into Elasticsearch
With Kafka Sink
Connectors
The final metamorphosis, from Cheetah (Kafka)
to Rhino (Elasticsearch)!
(Source: Shutterstock)
What Can
You Do With
the CDC Data
Once It’s in
Kafka?
Stream it into 1 or more sink systems, e.g. Elasticsearch
Pipeline Blog
Series
Berlin Beer Pipes?
(Source: Paul Brebner)
Reuse Kafka
Elasticsearch sink
connectors
Worked well with
schemaless JSON data
Camel Sink
Connector?
Missing a class
(“org.elasticsearch.rest.
BytesRestResponse”)
Gave up!
(Source: Shutterstock)
APACHE
Tried the
Lenses
Connector
Example configuration
To process 7,000
events/s you need more
tasks, partitions, and
Elasticsearch shards,
and probably the bulk API!
curl https://KC_IP:8083/connectors/elastic-sink-tides/config -k -u
KC_user:KC_password -X PUT -H 'Content-Type: application/json' -d '
{
"connector.class" :
"com.datamountaineer.streamreactor.connect.elastic7.ElasticSinkCon
nector",
"tasks.max" : 100,
"topics" : "test1.public.test1",
"connect.elastic.hosts" : "ES_IP",
"connect.elastic.port" : 9201,
"connect.elastic.kcql" : "INSERT INTO test-index SELECT * FROM
test1.public.test",
"connect.elastic.use.http.username" : "ES_user",
"connect.elastic.use.http.password" : "ES_password"
}'
All Events Are
“Inserts” Into
Elasticsearch
But we have “before”
and “after”?!
Get rid of before events
with Single Message
Transformation on
Source connector side
curl https://KC_IP:8083/connectors -X POST -H 'Content-Type: application/json' -k -u
kc_user:kc_password -d '{
"name": "debezium-test1",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "pg_ip",
"database.port": "5432",
"database.user": "pg_user",
"database.password": "pg_password",
"database.dbname" : "postgres",
"database.server.name": "test1",
"plugin.name": "pgoutput",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
}
}
'
This "event flattening" SMT extracts the after field from a Debezium change
event and creates a simple Kafka record with the after field contents.
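In Python terms, what ExtractNewRecordState does is roughly this (a simplified sketch of the default behaviour, ignoring the SMT's delete-rewriting and metadata options):

```python
def flatten(change_event):
    """Reduce a Debezium change event to just the new row state.

    Inserts, updates, and snapshot reads yield the "after" image;
    deletes yield None here (the real SMT drops deletes by default,
    or can optionally rewrite them).
    """
    if change_event["op"] in ("c", "u", "r"):  # create, update, snapshot read
        return change_event["after"]
    return None  # delete: no new row state to emit

insert = {"op": "c", "before": None, "after": {"id": 1, "v1": 2, "v2": 3}}
delete = {"op": "d", "before": {"id": 1}, "after": None}

assert flatten(insert) == {"id": 1, "v1": 2, "v2": 3}
assert flatten(delete) is None
```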
All Events Are
“Inserts” Into
Elasticsearch
(same source connector configuration as the previous slide)
How would we
process updates
and deletes?
A Clever
Test/Trick?!
Previous Tidal data ⇨
Elasticsearch pipeline,
V2 modified to use
PostgreSQL as sink
Pipeline 1: Tidal Data (REST source connector)
→ PostgreSQL
Pipeline 2: PostgreSQL → Elasticsearch
[Diagram: Pipeline 1 turns events into state; Pipeline 2 turns state into events, then state again]
A Clever
Test/Trick?!
So I used the
PostgreSQL Tidal data
as the source system!
Simple test, as there are
only “inserts”
Pipeline 1: Tidal Data (REST source connector)
→ PostgreSQL
Kibana
Visualization
of
Tidal Data ⇨
Kafka Connect ⇨
PostgreSQL ⇨
Kafka Connect ⇨
Elasticsearch ⇨ Kibana
Solving the
Chicken or
Egg Dilemma
i.e. It doesn’t matter as
long as we get to
eat the omelet
(Source: Shutterstock)
Debezium PostgreSQL Conclusions
PostgreSQL Configuration
• Required to run the Debezium source connector
• Not yet supported in Instaclustr’s managed PG service
1 Task Only
• Limits throughput
• Issues with multiple tables per connector?
• Best practice may be to run multiple connectors, maybe 1 per group of “related” tables
CDC Events
• Complex Kafka record structure
• Metadata and data
• Schema or schemaless?
• Truncate? Transactions?
Sink Connectors
• May need customization to understand CDC events and process them correctly for the target sink system
Debezium
PostgreSQL
Connector
- NOTES
■ This talk covers a generic open source solution
● Using Debezium
● PostgreSQL
● Apache Kafka Connect
● OpenSearch
■ For hosted PostgreSQL
● You may need help with PostgreSQL configuration from cloud
providers
■ But may be tricky to configure correctly
● For high throughput
● Many databases and tables
● For unbalanced changes across multiple tables
● I also didn’t test failover scenarios
■ The Debezium PostgreSQL Connector with
Instaclustr’s Managed PostgreSQL service is on
the roadmap for 2022
Further
Information
Blogs
■ www.instaclustr.com/paul-brebner/
■ Lots of blogs using open source technologies:
PostgreSQL, Apache Kafka, Apache Cassandra,
Apache Spark, Apache ZooKeeper, Redis,
Elasticsearch/OpenSearch, Cadence (new), etc.
■ For interesting use cases:
IoT, ML, anomaly detection, geospatial, fintech,
pipelines, etc.
■ Free Trial on homepage for all of these technologies
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!
© Instaclustr Pty Limited, 2020

More Related Content

Similar to Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Source Connector (20)

PDF
Webinar | Better Together: Apache Cassandra and Apache Kafka
DataStax
 
PDF
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
PDF
Kafka Summit SF 2017 - Database Streaming at WePay
confluent
 
PDF
What's New in PostgreSQL 9.6
EDB
 
PPTX
Data Stream Processing for Beginners with Kafka and CDC
Abhijit Kumar
 
PDF
How to Build an Apache Kafka® Connector
confluent
 
PDF
London Apache Kafka Meetup (Jan 2017)
Landoop Ltd
 
PPTX
Event processing without breaking production
nzender
 
PPTX
Rds data lake @ Robinhood
BalajiVaradarajan13
 
PPTX
Migrating with Debezium
Mike Fowler
 
PDF
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
ScyllaDB
 
PDF
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
PPTX
Event streaming webinar feb 2020
Maheedhar Gunturu
 
PDF
Confluent and Elastic
Paolo Castagna
 
PPTX
Using the PostgreSQL Extension Ecosystem for Advanced Analytics
Chartio
 
PDF
Indeed Flex: The Story of a Revolutionary Recruitment Platform
HostedbyConfluent
 
PDF
Lookout on Scaling Security to 100 Million Devices
ScyllaDB
 
PDF
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
PDF
Real Time Analytics with Dse
DataStax Academy
 
PPTX
Streaming Data from Scylla to Kafka
ScyllaDB
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
DataStax
 
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
Kafka Summit SF 2017 - Database Streaming at WePay
confluent
 
What's New in PostgreSQL 9.6
EDB
 
Data Stream Processing for Beginners with Kafka and CDC
Abhijit Kumar
 
How to Build an Apache Kafka® Connector
confluent
 
London Apache Kafka Meetup (Jan 2017)
Landoop Ltd
 
Event processing without breaking production
nzender
 
Rds data lake @ Robinhood
BalajiVaradarajan13
 
Migrating with Debezium
Mike Fowler
 
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
ScyllaDB
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
Event streaming webinar feb 2020
Maheedhar Gunturu
 
Confluent and Elastic
Paolo Castagna
 
Using the PostgreSQL Extension Ecosystem for Advanced Analytics
Chartio
 
Indeed Flex: The Story of a Revolutionary Recruitment Platform
HostedbyConfluent
 
Lookout on Scaling Security to 100 Million Devices
ScyllaDB
 
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
Real Time Analytics with Dse
DataStax Academy
 
Streaming Data from Scylla to Kafka
ScyllaDB
 

More from Paul Brebner (20)

PPTX
Streaming More For Less With Apache Kafka Tiered Storage
Paul Brebner
 
PDF
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
PDF
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
PDF
Architecting Applications With Multiple Open Source Big Data Technologies
Paul Brebner
 
PDF
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
Paul Brebner
 
PDF
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Paul Brebner
 
PDF
Spinning your Drones with Cadence Workflows and Apache Kafka
Paul Brebner
 
PDF
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
PDF
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
PDF
A Visual Introduction to Apache Kafka
Paul Brebner
 
PDF
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Paul Brebner
 
PDF
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Paul Brebner
 
PDF
Grid Middleware – Principles, Practice and Potential
Paul Brebner
 
PDF
Grid middleware is easy to install, configure, secure, debug and manage acros...
Paul Brebner
 
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
PPTX
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Paul Brebner
 
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
PPTX
0b101000 years of computing: a personal timeline - decade "0", the 1980's
Paul Brebner
 
PDF
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
Paul Brebner
 
Streaming More For Less With Apache Kafka Tiered Storage
Paul Brebner
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
Architecting Applications With Multiple Open Source Big Data Technologies
Paul Brebner
 
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
Paul Brebner
 
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Paul Brebner
 
Spinning your Drones with Cadence Workflows and Apache Kafka
Paul Brebner
 
Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
 
A Visual Introduction to Apache Kafka
Paul Brebner
 
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Paul Brebner
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Paul Brebner
 
Grid Middleware – Principles, Practice and Potential
Paul Brebner
 
Grid middleware is easy to install, configure, secure, debug and manage acros...
Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
0b101000 years of computing: a personal timeline - decade "0", the 1980's
Paul Brebner
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
Paul Brebner
 
Ad

Recently uploaded (20)

PDF
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
PDF
Hello I'm "AI" Your New _________________
Dr. Tathagat Varma
 
PPSX
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
PDF
FME as an Orchestration Tool with Principles From Data Gravity
Safe Software
 
PPTX
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Pitch ...
Michele Kryston
 
PPTX
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
PDF
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
ScyllaDB
 
PPTX
Practical Applications of AI in Local Government
OnBoard
 
PDF
Enhancing Environmental Monitoring with Real-Time Data Integration: Leveragin...
Safe Software
 
PDF
The Growing Value and Application of FME & GenAI
Safe Software
 
PDF
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
PDF
Redefining Work in the Age of AI - What to expect? How to prepare? Why it mat...
Malinda Kapuruge
 
PDF
Darley - FIRST Copenhagen Lightning Talk (2025-06-26) Epochalypse 2038 - Time...
treyka
 
PPTX
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
PPTX
Simplifica la seguridad en la nube y la detección de amenazas con FortiCNAPP
Cristian Garcia G.
 
PPTX
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 
PDF
Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Saikat Basu
 
PDF
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
DOCX
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
PDF
From Chatbot to Destroyer of Endpoints - Can ChatGPT Automate EDR Bypasses (1...
Priyanka Aash
 
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
Hello I'm "AI" Your New _________________
Dr. Tathagat Varma
 
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
FME as an Orchestration Tool with Principles From Data Gravity
Safe Software
 
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Pitch ...
Michele Kryston
 
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
ScyllaDB
 
Practical Applications of AI in Local Government
OnBoard
 
Enhancing Environmental Monitoring with Real-Time Data Integration: Leveragin...
Safe Software
 
The Growing Value and Application of FME & GenAI
Safe Software
 
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
Redefining Work in the Age of AI - What to expect? How to prepare? Why it mat...
Malinda Kapuruge
 
Darley - FIRST Copenhagen Lightning Talk (2025-06-26) Epochalypse 2038 - Time...
treyka
 
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
Simplifica la seguridad en la nube y la detección de amenazas con FortiCNAPP
Cristian Garcia G.
 
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 
Quantum AI Discoveries: Fractal Patterns Consciousness and Cyclical Universes
Saikat Basu
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
From Chatbot to Destroyer of Endpoints - Can ChatGPT Automate EDR Bypasses (1...
Priyanka Aash
 
Ad

Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Source Connector

  • 1. ©Instaclustr Pty Limited, 2021 Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL® Source Connector Paul Brebner Technology Evangelist, Instaclustr December 2021
  • 2. © Instaclustr Pty Limited, 2021 Instaclustr Managed Platform A complete ecosystem to support mission critical open source big data applications
  • 3. This Talk Focuses On Technologies • Debezium CDC Use Case • PostgreSQL® (source database) • Kafka® + Kafka Connect® (streaming) • Elasticsearch/OpenSearch® (sink system) Open Source • There’s nothing specific to our platform • I used Instaclustr managed Kafka and Elasticsearch © Instaclustr Pty Limited, 2021
  • 4. (Source: Shutterstock) Which Came First? © Instaclustr Pty Limited, 2021
  • 5. Which Came First? (Source: Shutterstock) The state or the event? What if you have state and want events? Events and you want state? © Instaclustr Pty Limited, 2021
  • 6. Or, how can speed up an Elephant (PostgreSQL) to be as fast as a Cheetah (Kafka)? Cheetahs are the fastest land animal (top speed 120km/hr. They can accelerate from 0 to 100km/hr in 3 seconds), three times faster than elephants (40km/hr) (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 7. 1. The Debezium PostgreSQL Connector • The Debezium PostgreSQL connector captures row-level database changes and streams them to Kafka via Kafka Connect. • Runs as a Kafka source connector • How does it get Postgresql change events? Does it poll with queries? © Instaclustr Pty Limited, 2021
  • 8. 1. The Debezium PostgreSQL Connector • As of PostgreSQL 10+, there is a logical replication stream mode, called pgoutput that is natively supported by PostgreSQL • This means that a Debezium PostgreSQL connector can consume that replication stream [as a client] without the need for additional plug-ins pgoutput pgoutput client Logical replication stream © Instaclustr Pty Limited, 2021
  • 9. 1. The Debezium PostgreSQL Connector— Run It • Download the Debezium PostgreSQL connector • Deploy it: o Upload to AWS S3 bucket o Synchronise with Instaclustr managed Kafka connect o "io.debezium.connector.postgresql.PostgresConnector" will be in list of available connectors on the console • Configure PostgreSQL o Set wal_level (write ahead log) to logical (3rd non-default level, requires server restart) o Create Debezium user with REPLICATION and LOGIN permissions o These need PostgreSQL admin permissions • Configure Debezium connector and run it o Plugin.name default must be set to pgoutput, need PG username/password and IP
  • 10. Configure and Run curl https://KafkaConnectIP:8083/connectors -X POST -H 'Content-Type: application/json' -k -u kc_username:kc_password -d '{ "name": "debezium-test1", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "PG_IP", "database.port": "5432", "database.user": "pg_username", "database.password": "pg_password", "database.dbname" : "postgres", "database.server.name": "test1", "plugin.name": "pgoutput" } } ‘ If it worked you will see a single task running, tasks.max can only = 1 © Instaclustr Pty Limited, 2021
  • 11. Exploring the Debezium PostgreSQL Connector Change Data Events A terrifying “Giraffosaurus” (T-Raffe?)! (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 12. CRUD? What Operations Result in Change? Create/Insert? Yes Read? No Update? Yes Delete? Yes © Instaclustr Pty Limited, 2021
  • 13. Table -> Topic Mapping What Does the Kafka Record Look Like? • CDC events from PostgreSQL: server + database + table à Kafka topic: server.database.name • Example insert record (table has id and v1 v2 integer columns): • Struct{after=Struct{id=1,v1=2,v2=3},source=Struct{version=1.6.1. Final,connector=postgresql,name=test1,ts_ms=1632457564326,d b=postgres,sequence=["1073751912","1073751912"],schema=pu blic,table=test1,txId=612,lsn=1073751968},op=c,ts_ms=1632457 564351} • Operation types: “op=c” (insert), “op=u” (update), “op=d” (delete) • For insert and update there’s an “after” record with id and values after transaction committed • For delete there’s a “before” record which shows id and NULL value only • And lots of metadata • Documentation led me to believe I would be seeing JSON with schema metadata? What’s wrong? © Instaclustr Pty Limited, 2021
  • 14. Updated Connector Configuration and JSON Record Example "value.converter": "org.apache.kafka.connect.json.JsonConverter" "value.converter.schemas.enable": "true" "key.converter": "org.apache.kafka.connect.json.JsonConverter" "key.converter.schemas.enable": "true” Schema is verbose, has Schema and payload records; turn it off, now have implicit payload only—Insert now looks like this: {"before":null,"after":{"id":10,"v1":10,"v2":10},"source":{"version":" 1.6.1.Final","connector":"postgresql","name":"test1","ts_ms":1632 717503331,"snapshot":"false","db":"postgres","sequence":"["194 6172256","1946172256"]","schema":"public","table":"test1","txId ":1512,"lsn":59122909632,"xmin":null},"op":"c","ts_ms":16327175 03781,"transaction":null} © Instaclustr Pty Limited, 2021
  • 15. Two T’s - Truncations - Transactions • Truncate o Is also a PostgreSQL operation (makes a table vanish) o What would you expect to happen? o Lots of deletes? No—nothing o Turned off by default • Transactions o PostgreSQL is a real SQL transactional database o What happens when multiple tables are changed in a single transaction? o You get multiple Kafka records, with the same transaction ID o The transaction ID can optionally be written to another Kafka topic • Note that to process Truncations and Transactions the Kafka sink connector needs to be pretty intelligent, and semantics will depend on target sink system © Instaclustr Pty Limited, 2021
  • 16. Debezium PostgreSQL Connector Throughput How fast can a Debezium PostgreSQL Connector run? (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 17. Debezium PostgreSQL Connector Throughput 1 Task Only (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 18. Debezium PostgreSQL Connector Throughput 1 Task Only Throughput limited to 7,000 events/s per task © Instaclustr Pty Limited, 2021
  • 19. Debezium PostgreSQL Connector Throughput 1 Task Only But PostgreSQL server is capable of 41,000 inserts/s (6x more, from previous tests) © Instaclustr Pty Limited, 2021
  • 20. Solutions? Most workloads will be more balanced between reads/writes One lane may be fine! © Instaclustr Pty Limited, 2021 (Sources: Paul Brebner & Shutterstock)
  • 21. Solutions? A wider bridge? © Instaclustr Pty Limited, 2021 (Source: Paul Brebner)
  • 22. Solutions? Multiple connectors 1 per table? © Instaclustr Pty Limited, 2021
  • 23. M ultiple replication slots © Instaclustr Pty Limited, 2021 Solutions? Multiple connectors 1 per table?
  • 24. Odd Behaviour One connector watching 2 tables Multiple changes to 1 before the other Changes in 1st table all processed before any changes in the other (10m delay!) Multiple connectors may be best practice Only 1 table at a time is processed? © Instaclustr Pty Limited, 2021 (Source: Shutterstock)
  • 25. What if there are lots of tables (and databases)? Better? 1 connector per table “group” (tables common to a service, tables with similar change rates, etc.) Table 1 Table 2 Table 3 Table 4 Table 5 Table 6 Debezium Connector 1 Debezium Connector 2 Service 1 Service 2 © Instaclustr Pty Limited, 2021
  • 26. Streaming Debezium PostgreSQL Connector Change Data Capture Events Into Elasticsearch With Kafka Sink Connectors The final metamorphosis, from Cheetah (Kafka) to Rhino (Elasticsearch)! (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 27. What Can You Do With the CDC Data Once It’s in Kafka? Stream it into 1 or more sink systems, e.g. Elasticsearch © Instaclustr Pty Limited, 2021
  • 28. Pipeline Blog Series Berlin Beer Pipes? (Source:Paul Brebner) Reuse Kafka Elasticsearch sink connectors Worked well with schema less JSON data © Instaclustr Pty Limited, 2021
  • 29. Camel Sink Connector? Missing a class (“org.elasticsearch.rest. BytesRestResponse”) Gave up! (Source: Shutterstock) APACHE © Instaclustr Pty Limited, 2021
  • 30. Tried the Lenses Connector Example configuration To process 7,000 events/s need more tasks, partitions, and Elasticsearch shards, and probably BULK API! curl https://KC_IP:8083/connectors/elastic-sink-tides/config -k -u KC_user:KC_password -X PUT -H 'Content-Type: application/json' -d ' { "connector.class" : "com.datamountaineer.streamreactor.connect.elastic7.ElasticSinkCon nector", "tasks.max" : 100, "topics" : "test1.public.test1", "connect.elastic.hosts" : "ES_IP", "connect.elastic.port" : 9201, "connect.elastic.kcql" : "INSERT INTO test-index SELECT * FROM test1.public.test", "connect.elastic.use.http.username" : "ES_user", "connect.elastic.use.http.password" : "ES_password" } }' © Instaclustr Pty Limited, 2021
  • 31. All Events Are “Inserts” Into Elasticsearch But we have “before” and “after”?! Get rid of before events with a Single Message Transformation on the source connector side: curl https://KC_IP:8083/connectors -X POST -H 'Content-Type: application/json' -k -u kc_user:kc_password -d '{ "name": "debezium-test1", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "pg_ip", "database.port": "5432", "database.user": "pg_user", "database.password": "pg_password", "database.dbname" : "postgres", "database.server.name": "test1", "plugin.name": "pgoutput", "value.converter": "org.apache.kafka.connect.json.JsonConverter", "value.converter.schemas.enable": "false", "key.converter": "org.apache.kafka.connect.json.JsonConverter", "key.converter.schemas.enable": "false", "transforms": "unwrap", "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState" } }' This “event flattening” SMT extracts the after field from a Debezium change event and creates a simple Kafka record with the after field contents. © Instaclustr Pty Limited, 2021
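The effect of the `ExtractNewRecordState` SMT can be illustrated in plain Python. This is a sketch of the transformation's core idea, not Debezium's actual implementation, and the sample event is simplified:

```python
# Sketch of what Debezium's ExtractNewRecordState ("event flattening") SMT does:
# given a full change-event envelope, keep only the "after" row image.

def flatten(change_event):
    """Return the 'after' state of a Debezium change event (None for deletes)."""
    return change_event.get("after")

# A (simplified) Debezium change event for an UPDATE:
event = {
    "before": {"id": 1, "name": "old"},
    "after":  {"id": 1, "name": "new"},
    "source": {"table": "test"},
    "op": "u",
}

flat = flatten(event)
print(flat)  # {'id': 1, 'name': 'new'}
```

The sink connector then sees a plain row-shaped record instead of the full envelope, which is what a generic Elasticsearch sink expects.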
  • 32. All Events Are “Inserts” Into Elasticsearch But we have “before” and “after”?! (Same connector configuration as the previous slide.) How would we process updates and deletes? © Instaclustr Pty Limited, 2021
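One hedged answer to the updates-and-deletes question: route on Debezium's `op` code, using the record key as the Elasticsearch document id so updates overwrite and deletes remove. This is a sketch of the routing logic only, not a tested connector; the index name `test-index` is carried over from the earlier KCQL example as an assumption:

```python
# Sketch: map Debezium op codes to Elasticsearch actions.
# 'c' (create), 'u' (update), and 'r' (snapshot read) become index operations
# keyed by the row's primary key, so updates overwrite; 'd' becomes a delete.

def to_es_action(key, change_event, index="test-index"):
    op = change_event.get("op")
    doc_id = key["id"]  # assumes the Kafka record key carries the primary key
    if op in ("c", "u", "r"):
        return {"index": index, "_id": doc_id, "doc": change_event["after"]}
    if op == "d":
        return {"delete": index, "_id": doc_id}
    return None  # ignore anything unrecognised

action = to_es_action({"id": 1}, {"op": "d", "before": {"id": 1}, "after": None})
print(action)  # {'delete': 'test-index', '_id': 1}
```

A real sink would translate these actions into Bulk API `index`/`delete` entries; the key point is that idempotent, id-keyed writes make CDC updates and deletes safe to replay.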
  • 33. A Clever Test/Trick?! Previous Tidal data ⇨ Elasticsearch pipeline, V2 modified to use PostgreSQL as sink Pipeline 1: Tidal Data (REST source connector) → PostgreSQL © Instaclustr Pty Limited, 2021
  • 34. A Clever Test/Trick?! Pipeline 1: Tidal Data (REST source connector) → PostgreSQL Pipeline 2: PostgreSQL → Elasticsearch So I used the PostgreSQL Tidal data as the source system! A simple test, as we only have “inserts” © Instaclustr Pty Limited, 2021
  • 35. Kibana Visualization of Tidal Data⇨ Kafka Connect ⇨ PostgreSQL ⇨ Kafka Connect ⇨ Elasticsearch ⇨ Kibana © Instaclustr Pty Limited, 2021
  • 36. Solving the Chicken or Egg Dilemma i.e. It doesn’t matter as long as we get to eat the omelet (Source: Shutterstock) © Instaclustr Pty Limited, 2021
  • 37. Debezium PostgreSQL Conclusions PostgreSQL Configuration: required to run the Debezium source connector; not yet supported in Instaclustr’s managed PG service. 1 Task Only: limits throughput; issues with multiple tables per connector? Best practice may be to run multiple connectors, maybe 1 per group of “related” tables. CDC Events: complex Kafka record structure (metadata and data); schema or schemaless? Truncate? Transactions? Sink Connectors: may need customization to understand CDC events and process them correctly for the target sink system © Instaclustr Pty Limited, 2021
  • 38. Debezium PostgreSQL Connector - NOTES ■ This talk covers a generic open source solution ● Using Debezium ● PostgreSQL ● Apache Kafka Connect ● OpenSearch ■ For hosted PostgreSQL ● You may need help with PostgreSQL configuration from cloud providers ■ But may be tricky to configure correctly ● For high throughput ● Many databases and tables ● For unbalanced changes across multiple tables ● I also didn’t test failover scenarios ■ The Debezium PostgreSQL Connector with Instaclustr’s Managed PostgreSQL service is on the roadmap for 2022 © Instaclustr Pty Limited, 2021
  • 39. Further Information Blogs ■ www.instaclustr.com/paul-brebner/ ■ Lots of blogs using open source technologies: PostgreSQL, Apache Kafka, Apache Cassandra, Apache Spark, Apache ZooKeeper, Redis, Elasticsearch/OpenSearch, Cadence (new), etc. ■ For interesting use cases: IoT, ML, anomaly detection, geospatial, fintech, pipelines, etc. ■ Free Trial on homepage for all of these technologies © Instaclustr Pty Limited, 2021