Technology Choices for Kafka
and Change Data Capture
Kate Stanley and Andrew Schofield
Apache Kafka London Meetup October 2019
IBM Event Streams / Apache Kafka
Change Data Capture identifies and captures the changes to a data store as a stream of Kafka events.
Point-to-point data integration

[Diagram: a master database wired directly to each consumer – recovery database, audit log, query cache, application – with a separate point-to-point link for each.]
It’s publish/subscribe for data

[Diagram: the same consumers – recovery database, audit log, query cache, application – now receive the master database’s changes as subscribers to a single published stream.]
Technology choices
These different approaches have all been used successfully
1. Data store natively generates a feed of changes
2. Repeated queries, with optimization or restrictions
3. Log scanning
Why use Kafka with CDC?
Kafka has lots of connectors to other systems
It acts as a buffer, loosening coupling between source and target
Publish/subscribe, instead of point-to-point
Makes it easy to process the CDC stream as events in Kafka client application code
Kafka Connect JDBC source
JDBC source connector

Uses JDBC to connect to any compliant relational database,
e.g. Oracle, Microsoft SQL Server, DB2, MySQL and Postgres.

Requires a Kafka Connect runtime

Can bulk copy tables with any columns
To receive just the changes, specific columns are needed

Open-source: https://github.com/confluentinc/kafka-connect-jdbc
Configuring the JDBC connector

You can validate a candidate configuration via the Kafka Connect REST API:

$ curl -X PUT -H "Content-Type: application/json" \
  -d '{"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector"}' \
  http://localhost:8083/connector-plugins/MyConnector/config/validate

Required config options:
name
connector.class
connection.url – JDBC connection URL
topic.prefix – prefix to prepend to table names
mode – bulk, incrementing, timestamp, timestamp+incrementing
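As a minimal sketch of creating a connector with those options (the connector name, database URL and credentials here are invented for illustration):

$ curl -X POST -H "Content-Type: application/json" \
  -d '{
        "name": "my-jdbc-source",
        "config": {
          "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
          "connection.url": "jdbc:mysql://localhost:3306/shop?user=kafka&password=kafka-pw",
          "topic.prefix": "shop-",
          "mode": "bulk"
        }
      }' \
  http://localhost:8083/connectors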
Incrementing mode

Use a strictly incrementing column on each table to detect only new rows.

id  First name  Surname   Amount
0   John        Smith     20
1   Daisy       Williams  25
2   Laura       Thomas    15

Requires incrementing.column.name to be set
Does not detect modifications or deletions of existing rows
ID column must be present on all tables
Identifier must be in a single column
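For the table above, the mode-specific settings would be (a sketch; "id" is the example table's column name):

"mode": "incrementing",
"incrementing.column.name": "id"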
Timestamp mode

Use a timestamp column to detect new and modified rows.

timestamp            First name  Surname   Amount
2019-10-09 18:10:15  John        Smith     20
2019-10-09 18:17:36  Daisy       Williams  25
2019-10-09 18:57:12  Laura       Thomas    15

Requires timestamp.column.name to be set
Timestamp column must be updated with each write
Timestamp column must be monotonically incrementing
Timestamp column must be present on all tables
Does not guarantee that all updated data is delivered, since timestamps aren’t unique.
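A sketch of the corresponding settings for this table ("timestamp" being the example column name):

"mode": "timestamp",
"timestamp.column.name": "timestamp"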
Timestamp+Incrementing mode

Uses both a timestamp column and incrementing id column.
Detects new and updated rows.
More robust than timestamp alone since the combination of id and timestamp should be unique.

timestamp            id  First name  Surname   Amount
2019-10-09 18:10:15  0   John        Smith     20
2019-10-09 18:17:36  1   Daisy       Williams  25
2019-10-09 18:57:12  2   Laura       Thomas    15
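Combining the two, a sketch of the settings for this table:

"mode": "timestamp+incrementing",
"timestamp.column.name": "timestamp",
"incrementing.column.name": "id"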
JDBC source connector

LICENSE: Confluent Community License Agreement Version 1.0
Building the JDBC connector from source

1. git clone https://github.com/confluentinc/kafka-connect-jdbc.git
2. Edit the pom.xml:
   a) Comment out the Confluent parts of the pom.xml
   b) Add a version
   c) Comment out checkstyle
   d) Add Java 8 enforcement
   e) Add versions for dependencies
3. cd kafka-connect-jdbc
   mvn install -DskipTests
Building the JDBC connector from source

Alternatively, build the connector together with its Confluent dependencies:

1. git clone https://github.com/confluentinc/kafka.git
   (Apache 2.0 license)
2. cd kafka
   gradle
   ./gradlew installAll
3. git clone https://github.com/confluentinc/common.git
   (Apache 2.0 license)
4. cd common
   mvn install
5. git clone https://github.com/confluentinc/kafka-connect-jdbc.git
   (Confluent Community license)
6. cd kafka-connect-jdbc
   mvn install
Running the JDBC connector

You must check that the JDBC driver has been loaded (SQLite and Postgres drivers are included by default):
1. Increase the log level to DEBUG
2. Check the JDBC driver JAR is in the "Loading plugin urls" list
3. Check for an "Added plugin" line immediately after

CLASSPATH=/Users/katherinestanley/connectors/mysql-connector-java-8.0.17.jar \
  ./bin/connect-distributed.sh config/connect-distributed.properties

https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector
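As a rough sketch of checks 2 and 3 above, assuming the worker's DEBUG output has been captured to a hypothetical connect.log:

$ grep -A3 "Loading plugin urls" connect.log   # the driver JAR should appear in this list
$ grep "Added plugin" connect.log              # confirms the connector plugin was registered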
Debezium
Debezium
Debezium is an open-source platform for change data capture using Kafka Connect
MySQL, MongoDB, PostgreSQL, SQL Server; incubator – Oracle, Cassandra, Db2 (soon)
Each supported database has separate code
Underlying technology depends on database
MySQL uses log scanning, SQL Server uses special CDC tables created by the database, …
Open-source – https://github.com/debezium/debezium
Proper open licence – Apache 2.0
Debezium – log scanning

[Diagram: Debezium, running inside a Kafka Connect worker, reads the interleaved transaction entries in the database log (T1 Ins, T2 Upd, T1 Ins, T2 Ins, T1 Del, T1 Cmt, T2 Pre, T2 Cmt) and publishes the row changes – Ins, Upd, Ins, Ins, Del – to Kafka.]
Debezium MySQL

Uses log scanning – requires configuration of row-based binary logs
WRITE_ROWS for a row insert
UPDATE_ROWS for a row update
DELETE_ROWS for a row delete
QUERY for all kinds of miscellaneous events, including transaction commit

Nice and efficient, but the connector code is very specific to MySQL internals
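A minimal sketch of the MySQL server settings this relies on (the server-id value is arbitrary, and exact option names can vary slightly by MySQL version):

[mysqld]
server-id        = 223344
log_bin          = mysql-bin
binlog_format    = ROW
binlog_row_image = FULL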
Database replication

[Diagram: on the source database, a capture program reads the database log (the same interleaved T1/T2 entries) and records changes to the source table in a change data table; on the target database, an apply program reads the change data and applies it to the target table.]
Debezium – replication tables

[Diagram: as in classic replication, the capture program populates a change data table from the database log; Debezium, running inside a Kafka Connect worker, then reads the change data table and publishes the row changes – Ins, Upd, Ins, Ins, Del – to Kafka.]
How can I try it?
Try the totally excellent Docker-based tutorial
https://debezium.io/documentation/reference/0.10/tutorial.html
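For flavour, registering the tutorial's MySQL connector looks roughly like this (a sketch: hostnames, credentials and names follow the tutorial's Docker setup and should be treated as illustrative):

$ curl -X POST -H "Content-Type: application/json" \
  -d '{
        "name": "inventory-connector",
        "config": {
          "connector.class": "io.debezium.connector.mysql.MySqlConnector",
          "database.hostname": "mysql",
          "database.port": "3306",
          "database.user": "debezium",
          "database.password": "dbz",
          "database.server.id": "184054",
          "database.server.name": "dbserver1",
          "database.whitelist": "inventory",
          "database.history.kafka.bootstrap.servers": "kafka:9092",
          "database.history.kafka.topic": "dbhistory.inventory"
        }
      }' \
  http://localhost:8083/connectors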
Record formatting

The default is comprehensive and very verbose:

{
  "schema" : {
  },
  "payload" : {
    "op" : "u",
    "source" : {
      ...
    },
    "ts_ms" : "...",
    "before" : {
      "field1" : "oldvalue1",
      "field2" : "oldvalue2"
    },
    "after" : {
      "field1" : "newvalue1",
      "field2" : "newvalue2"
    }
  }
}
Record formatting

Just use the provided ExtractNewRecordState SMT to flatten the record above down to the new row state:

{
  "field1" : "newvalue1",
  "field2" : "newvalue2"
}
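Enabling it in the connector configuration takes a couple of lines (a sketch; the transform alias "unwrap" is arbitrary):

"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"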
IBM InfoSphere Data Replication
IBM InfoSphere Data Replication
Enterprise-grade CDC built exclusively on log scanning
Focus on performance and transactionality
Can be customised with user code
Does not use Kafka Connect, because it wants tighter control over publishing
IIDR architecture

[Diagram: on the source server, the CDC source engine reads and parses the database log (interleaved T1/T2 entries) and sends the row changes – Ins, Upd, Ins, Ins, Del – to the CDC target engine on the target server, which transforms the changes and publishes them through writers; a management console administers both engines.]
Four Time-Interleaved Source Database Transactions

Transaction 1: Op1(Tab2) Op2(Tab3) Op3(Tab2) Commit
Transaction 2: Op1(Tab2) Op2(Tab2) Op3(Tab3) Op4(Tab2) Commit
Transaction 3: Op1(Tab1) Commit
Transaction 4: Op1(Tab1) Commit

===================== TIME =====================>
Transactionally Consistent Consumer

Recreates the order of operations in the source database across multiple topics and partitions, with no duplicates
Uses a "commitstream" topic to maintain transaction metadata
User topic data is not modified
Kafka records can be written out of strict order and the TCC sorts it all out
Summary
Summary
There is a variety of open-source and commercial CDC options for Kafka
The choice depends largely on desired throughput, flexibility, semantics and cost
IBM Cloud - London

This is a group for anyone interested in learning about #IBMCloud, the cloud built for business. You can be an existing #IBMCloud user, or someone who has never touched the #IBMCloud before. Meetup topics will vary and can be of interest to developers, administrators, or even business leaders!

We are interested in using amazing tech to grow business and make the world a better place. Some of the technology topics that we will talk about are: cloud platforms, artificial intelligence, blockchain, analytics, automation, cloud services / APIs, data science, integration, application development, and governance.

Past meetups include:
Humanizing your chatbot, how I digress!
Site Reliability Engineer to the rescue!
Blockchain: The Good, The Bad and The Ugly!
Unlocking the power of automation with AI and ML
Innovate with APIs (App Mod #2)

Sign up at https://www.meetup.com/IBM-Cloud-London/ to come along and take part in our events!
Thank you
Kate Stanley @katestanley91
Andrew Schofield https://medium.com/@andrew_schofield
Links: https://kafka.apache.org/documentation/#connect
https://github.com/confluentinc/kafka-connect-jdbc
https://debezium.io
https://github.com/debezium
IBM Event Streams: ibm.biz/aboutEventStreams