Using MongoDB with Kafka
Percona Live Online
20-21 October 2020
Antonios Giannopoulos
Senior Database Administrator
Pedro Albuquerque
Principal Database Engineer
Agenda
● Definitions
● Use cases
● Using MongoDB as a source
● Using MongoDB as a sink
● Real-world use case: TransferWise
● MongoDB to Kafka Connectors
● Takeaways
What is MongoDB?
● Document-oriented Database
● Flexible JSON-style schema
Use-Cases:
● Pretty much any workload
● Supports multi-document ACID transactions as of 4.0 (replica sets) and 4.2 (sharded clusters)
● Frequent schema changes
What is Apache Kafka?
● Distributed event streaming platform
Use-Cases:
● Publish and subscribe to streams of events
● Async RPC-style calls between services
● Log replay
● CQRS and Event Sourcing
● Real-time analytics
How can they work together?
Use cases - Topologies
MongoDB as a sink
MongoDB as a source
MongoDB as a source/sink
MongoDB as a Source
Selective Replication/EL/ETL
MongoDB doesn’t support selective replication
Tail the oplog or use Change Streams (the preferred method)
Kafka cluster, with one topic per collection
MongoDB to Kafka connectors
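A minimal Change Streams sketch in the mongo shell (the slides collection and the filtered operation types are only illustrative); each emitted event would be produced to that collection’s topic:
cursor = db.slides.watch([ { $match: { operationType: { $in: ["insert", "update", "delete"] } } } ])
while (cursor.hasNext()) { printjson(cursor.next()) }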
Debezium
Supports both Replica-set and Sharded clusters
Uses the oplog to capture and create events
Selective Replication: [database|collection].[include|exclude].list
EL: field.exclude.list & field.renames
snapshot.mode = initial | never
tasks.max
initial.sync.max.threads
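A minimal Debezium connector sketch tying these options together (host names, the logical name, and the filtered namespaces/fields are only illustrative):
name=debezium-mongodb-example
connector.class=io.debezium.connector.mongodb.MongoDbConnector
mongodb.hosts=rs0/mongod1:27017,mongod2:27017,mongod3:27017
mongodb.name=perconalive
database.include.list=perconalive
collection.include.list=perconalive.slides
field.exclude.list=perconalive.slides.ssn
field.renames=perconalive.slides.name:title
snapshot.mode=initial
tasks.max=1
initial.sync.max.threads=2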
MongoDB Kafka Source Connector
- Supports both Replica-set and Sharded clusters
- Uses MongoDB Change Streams to create events
- Selective Replication:
- mongodb db.collection -> db.collection kafka topic
- Multi-source replication:
- multiple collections to single kafka topic
- EL: Filter or modify change events with a MongoDB aggregation pipeline
- Sync historical data (copy.existing=true)
- copy.existing.max.threads
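A corresponding sketch for this connector (connection string, namespace, and pipeline are only illustrative); change events land in the perconalive.slides topic by default:
name=mongodb-source-example
connector.class=com.mongodb.kafka.connect.MongoSourceConnector
connection.uri=mongodb://mongod1:27017,mongod2:27017,mongod3:27017
database=perconalive
collection=slides
pipeline=[{"$match": {"operationType": {"$in": ["insert", "update"]}}}]
copy.existing=true
copy.existing.max.threads=4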
MongoDB as a Sink
Throttling
Throttling* may be a forbidden word, but it is extremely useful:
- During MongoDB scaling
- Planned or unplanned maintenance
- Unexpected growth events
- Provides workload priorities
The need for throttling: MongoDB 4.2 Flow control
You can configure Flow Control on the Replica-Set level
(Config settings: enableFlowControl, flowControlTargetLagSeconds)
Kafka provides a more flexible “flow control” that you can easily manage
* Throttling may not be suitable for every workload
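For reference, a minimal sketch of adjusting Flow Control at runtime (the target lag value is only illustrative; both settings can also go in the mongod configuration file):
> db.adminCommand({ setParameter: 1, enableFlowControl: true })
> db.adminCommand({ setParameter: 1, flowControlTargetLagSeconds: 10 })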
Throttling
The aim is to rate limit write operations
Kafka supports higher write throughput & scales faster
Kafka scales:
- Adding partitions
- Adding brokers
- Adding clusters
- Minimal application changes
MongoDB scales as well:
- Adding shards
- Balancing takes time
- Balancing affects performance
Throttling
Quotas can be applied to (user, client-id), user or client-id groups
producer_byte_rate : The total rate limit for the user’s producers without a client-id quota override
consumer_byte_rate : The total rate limit for the user’s consumers without a client-id quota override
Static changes: /config/users/ & /config/clients (watch out for the override order)
Dynamic changes:
> bin/kafka-configs.sh --bootstrap-server <host>:<port> --describe --entity-type users|clients --entity-name user|client-id
> bin/kafka-configs.sh --bootstrap-server <host>:<port> --alter --add-config
'producer_byte_rate=1024,consumer_byte_rate=2048' --entity-type users|clients --entity-name user|client-id
Throttling
Evaluate a MongoDB metric - read/write queues, latency, etc.
> db.serverStatus().globalLock.currentQueue.writers
0
Prometheus Alert Manager
- Tons of integrations
- Groups alerts
- Notify on resolution
[Diagram: Prometheus monitors Production; Alert Manager (or your favorite integration) drives kafka-configs.sh to adjust Producer/Consumer quotas]
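As an illustration of gluing these pieces together (the threshold, client-id, and host names are made up), an alert hook could apply and remove a producer quota:
#!/bin/bash
# Hypothetical hook fired by Alertmanager (or cron): throttle the "percona-app" producers
# while MongoDB write queues build up, and lift the quota once they drain.
WRITERS=$(mongo --quiet --eval 'db.serverStatus().globalLock.currentQueue.writers')
if [ "$WRITERS" -gt 10 ]; then
  bin/kafka-configs.sh --bootstrap-server kafka1:9092 --alter \
    --add-config 'producer_byte_rate=1048576' \
    --entity-type clients --entity-name percona-app
else
  bin/kafka-configs.sh --bootstrap-server kafka1:9092 --alter \
    --delete-config 'producer_byte_rate' \
    --entity-type clients --entity-name percona-app
fi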
Workload isolation
Kafka handles specific workloads better
A successful event website (for example: Percona Live 2020)
- Contains a stream of social media interactions
- Kafka serves the raw stream - all interactions
- MongoDB serves aggregated data - for example top tags
The raw stream is native to Kafka, as it is a commit log
MongoDB’s rich aggregation framework provides the aggregated data
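For the top-tags example, a sketch of the MongoDB side (collection and field names are made up):
db.interactions.aggregate([
  { $unwind: "$tags" },
  { $group: { _id: "$tags", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $limit: 10 }
])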
Workload isolation
Continuous aggregations
Useful for use-cases where the raw data is useless (or not very useful)
Kafka Streams is your friend - windowing
Examples:
Weather stations sending metrics every second
MongoDB serves the min()/max() for every hour
Website statistics - counters
MongoDB gets updated every N seconds with hits summary
MongoDB gets updated with hits per minute/hour
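A minimal Kafka Streams sketch of the hits-per-minute case (topic names and serdes are assumptions); a sink connector can then upsert the per-minute counts into MongoDB:
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("page-hits", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String())) // one group per page
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))          // tumbling 1-minute windows
    .count()
    .toStream()
    .map((window, count) -> KeyValue.pair(                      // key: page @ window start (epoch ms)
        window.key() + "@" + window.window().start(), count.toString()))
    .to("hits-per-minute", Produced.with(Serdes.String(), Serdes.String()));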
Journal
Data recovery is a common request in the database world
Human error, application bugs, and hardware failures are some of the reasons
Kafka can help with partial recovery or point-in-time recovery
A partial data recovery may require restoring a full backup
Restore the data from a full backup, then replay the changes from Kafka
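A sketch of the replay step with the plain Java consumer (topic, brokers, and the backup timestamp are only illustrative); every record consumed from that point is re-applied to MongoDB:
import java.time.Instant;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "journal-replay");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    List<TopicPartition> partitions = consumer.partitionsFor("perconalive.slides").stream()
        .map(p -> new TopicPartition(p.topic(), p.partition()))
        .collect(Collectors.toList());
    consumer.assign(partitions);
    // Seek every partition to the first offset at or after the backup timestamp
    long backupMillis = Instant.parse("2020-10-20T00:00:00Z").toEpochMilli();
    consumer.offsetsForTimes(partitions.stream()
            .collect(Collectors.toMap(tp -> tp, tp -> backupMillis)))
        .forEach((tp, offset) -> { if (offset != null) consumer.seek(tp, offset.offset()); });
    // poll() from here on and re-apply each change event to MongoDB
}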
Journal
TransferWise:
Activity Service
● Customer action
● Many types
● Different status
● Variety of categories
● Repository of all activities
● List of customer’s actions
● Activity list
● Ability to search and filter
TransferWise: Activity Service
[Architecture diagram]
Producers: Balance, Plastic, Transfer
Topics: Activity Updates, Activity Group Aggrs, Activity Deletes
Consumers: Activity Updates Consumer, Activity Group Aggrs Consumer, Activity Deletes Consumer
Processors: Updates Processor, Aggrs Processor, Deletes Processor
spring-kafka
Producer configuration
private ProducerFactory<Object, Object> producerFactory(KafkaProperties kafkaProperties) {
return new DefaultKafkaProducerFactory<>(
Map.of(
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaProperties.getServers(),
ProducerConfig.CLIENT_ID_CONFIG, kafkaProperties.getClientId(),
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, JsonSerializer.class,
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class
)
);
}
public KafkaTemplate<Object, Object> kafkaTemplate(KafkaProperties kafkaProperties) {
return new KafkaTemplate<>(producerFactory(kafkaProperties));
}
spring-kafka
Send message
public void send(String key, Object value, Runnable successCallback) {
String jsonBody = value.getClass() == String.class ? (String) value : JSON_SERIALIZER.writeAsJson(value);
kafkaTemplate.send(topic, key, jsonBody)
.addCallback(new ListenableFutureCallback<>() {
@Override
public void onFailure(Throwable ex) {
log.error("Failed sending message with key {} to {}", key, topic);
}
@Override
public void onSuccess(SendResult<String, String> result) {
successCallback.run();
}
});
}
spring-kafka
Consumer configuration
@EnableKafka
private ConsumerFactory<String, String> consumerFactory(KafkaProperties kafkaProperties) {
return new DefaultKafkaConsumerFactory<>(
Map.of(
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaProperties.getServers(),
ConsumerConfig.CLIENT_ID_CONFIG, kafkaProperties.getClientId(),
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class,
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class
));}
ConcurrentKafkaListenerContainerFactory<String, String> factory = buildListenerContainerFactory(objectMapper,
kafkaProperties);
KafkaRetryConfig retryConfig = new KafkaRetryConfig(KafkaProducerFactory.kafkaTemplate(kafkaProperties));
@KafkaListener(topics = "${activity-service.kafka.topics.activityUpdates}", containerFactory =
ActivityUpdatesKafkaListenersConfig.ACTIVITY_UPDATES_KAFKA_LISTENER_FACTORY)
TransferWise: Activity Service
[Architecture diagram repeated: Balance/Plastic/Transfer producers -> Activity Updates, Activity Group Aggrs, and Activity Deletes topics -> their consumers -> the Updates, Aggrs, and Deletes processors]
MongoDB Kafka Sink Connector
name=mongodb-sink-example
topics=topicA,topicB
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=1
# Specific global MongoDB Sink Connector configuration
connection.uri=mongodb://mongod1:27017,mongod2:27017,mongod3:27017
database=perconalive
collection=slides
MongoDB Kafka Sink connector: Configuration
# Message types
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
MongoDB Kafka Sink connector: Configuration
## Document manipulation settings
[key|value].projection.type=AllowList
[key|value].projection.list=name,age,address.post_code
## Id Strategy
document.id.strategy=com.mongodb.kafka.connect.sink.processor.id.strategy.BsonOidStrategy
post.processor.chain=com.mongodb.kafka.connect.sink.processor.DocumentIdAdder
MongoDB Kafka Sink connector: Configuration
## Dead letter queue
errors.tolerance=all
errors.log.enable=true
errors.log.include.messages=true
errors.deadletterqueue.topic.name=perconalive.deadletterqueue
errors.deadletterqueue.context.headers.enable=true
Recap/Takeaways
There are tons of use-cases for MongoDB & Kafka
We described a couple of use-cases
● Selective replication/ETL
● Throttling/Journaling/Workload Isolation
Kafka has a rich ecosystem that can expand the use-cases
Connectors are your friends, but you can also build your own connector
Large orgs like TransferWise use MongoDB & Kafka for complex projects
- Thank you!!! -
- Q&A -
Big thanks to:
John Moore, Principal Engineer @Eventador
Diego Furtado, Senior Software Engineer @TransferWise
for their guidance