Kafka 101
Just enough knowledge
to break everything
(Simplified) Glossary
Kafka ~ Distributed messaging system (distributed Pub Sub)
Brokers ~ The machines where the data is stored
Topic ~ Queue(s) of messages on cluster
Producer & Consumer ~ Pub Sub clients for the topic
Avro ~ A serialization format
OVERVIEW
Kafka Why and How ?
Producer - Consumer
Topics
A common format : Avro
Where is the data ?
Isn’t that just one big single point of failure ?
Kafka Why and How ?
Without a centralised communication pipe
DATA SOURCES
DATA OPERATION
With a centralised communication pipe
DATA SOURCES
DATA OPERATION
Articulated around 3 parts
Publish & Subscribe using a messaging queue
● Topic represented by a dedicated queue
● Writer and Reader don’t know each other
● Processing data is the reader’s responsibility
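The three points above can be sketched with a toy in-memory broker. This is only an illustration of the decoupling, not how Kafka is implemented; `TinyBroker` and the topic name `user.created` are made up for the example.

```python
from collections import defaultdict


class TinyBroker:
    """Toy stand-in for a broker: one append-only list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # The writer only names a topic; it never references a reader.
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1  # position of the new message

    def read(self, topic, position):
        # The reader pulls from a position that it tracks itself.
        return self._topics[topic][position:]


broker = TinyBroker()
broker.publish("user.created", {"name": "Ada"})
broker.publish("user.created", {"name": "Linus"})

# Two independent readers see the same messages and process them differently.
names = [m["name"] for m in broker.read("user.created", 0)]
count = len(broker.read("user.created", 0))
```

Reading does not remove anything from the topic, which is why several readers can consume the same data for different purposes.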
Processing in real time
Kafka storage
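By default on Kafka:
● Messages are written to disk (zero-copy)
● Messages are retained for 6 months per topic (the setting described here; stock Kafka ships with a 7-day default, log.retention.hours=168)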
● Topics are distributed for parallelism
● Topics are replicated for resilience
Producer - Consumer
Producer consumer model in Kafka
Kafka producer
Kafka producer pattern of publication
At-Least-Once:
=> Wait for ack from cluster, and resend if no ack arrives
At-Most-Once:
=> Don’t wait for ack from cluster
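A small simulation may help show why the two patterns get their names. `FlakyCluster` and both send functions are hypothetical helpers written for this sketch; with a real client the choice is typically made through the producer's `acks` and retry settings.

```python
class FlakyCluster:
    """Toy cluster that can lose the first write or the first ack."""

    def __init__(self, lose_first_write=False, lose_first_ack=False):
        self.log = []
        self._lose_write = lose_first_write
        self._lose_ack = lose_first_ack

    def append(self, message):
        """Return True if an ack reaches the producer."""
        if self._lose_write:
            self._lose_write = False
            return False  # message never stored, so no ack
        self.log.append(message)
        if self._lose_ack:
            self._lose_ack = False
            return False  # stored, but the ack got lost on the way back
        return True


def send_at_most_once(cluster, message):
    cluster.append(message)  # fire and forget: the result is ignored


def send_at_least_once(cluster, message, max_retries=3):
    for _ in range(max_retries + 1):
        if cluster.append(message):
            return True  # ack received, stop resending
    return False


lossy = FlakyCluster(lose_first_write=True)
send_at_most_once(lossy, "m1")   # the message is silently lost

dup = FlakyCluster(lose_first_ack=True)
send_at_least_once(dup, "m1")    # the message is stored twice
```

At-most-once can lose a message; at-least-once can duplicate one, so readers need to tolerate duplicates.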
Kafka consumer
Kafka consumer default pattern: “latest”
Kafka consumer pattern “earliest”
Kafka consumer using a specific offset
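The three consumer patterns differ only in where reading starts on a partition. The sketch below assumes the usual rule that the reset policy (the `auto.offset.reset` consumer setting) applies only when the group has no committed offset; `starting_offset` is a hypothetical helper, not a client API.

```python
def starting_offset(log_start, log_end, committed=None, reset="latest", seek_to=None):
    """Pick the offset a consumer reads first on one partition."""
    if seek_to is not None:
        return seek_to      # an explicit seek wins
    if committed is not None:
        return committed    # resume where the group left off
    if reset == "earliest":
        return log_start    # replay everything still retained
    if reset == "latest":
        return log_end      # only messages produced from now on
    raise ValueError(f"unknown reset policy: {reset!r}")


messages = ["m0", "m1", "m2", "m3"]
log_start, log_end = 0, len(messages)

from_latest = messages[starting_offset(log_start, log_end):]
from_earliest = messages[starting_offset(log_start, log_end, reset="earliest"):]
from_two = messages[starting_offset(log_start, log_end, seek_to=2):]
```

A new consumer on “latest” sees nothing until a producer writes again, which is a common surprise when testing.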
Topics and partitions
Topics are glorified log files (sic)
Splitting topics into partitions
Consumer groups
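A rough sketch of how partitions and consumer groups fit together: a message key picks the partition, and each partition is read by one consumer of the group. The hash below is CRC32 from the standard library, used only for illustration (Kafka's default partitioner uses a different hash), and the round-robin assignment is a simplification of the real assignors.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition; same key, same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Messages with the same key stay ordered because they share a partition.
same = partition_for("user-42", 3) == partition_for("user-42", 3)

two_consumers = assign([0, 1, 2], ["c1", "c2"])
four_consumers = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
```

With three partitions and four consumers, one consumer stays idle: the partition count caps the parallelism of a group.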
A common format: Avro
Avro example
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
● Binary file
● Strictly typed data structure
● Allows unions and default values
● Schema version attached to file
● Schema needed to Read/Write
● One schema but multiple versions
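To make “strictly typed” and “union” concrete, here is a minimal type check of a record against the User schema above. It is a sketch using only the standard library: it covers just the three primitive types in this schema and does no binary encoding, which a real Avro library would handle. `conforms` and `PYTHON_TYPES` are names invented for the example.

```python
import json

SCHEMA = json.loads("""
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
""")

# Only the primitive types used by this schema are mapped.
PYTHON_TYPES = {"string": str, "int": int, "null": type(None)}


def conforms(record, schema):
    """Return True if every field is present with an allowed type."""
    for field in schema["fields"]:
        if field["name"] not in record:
            return False  # this schema declares no defaults
        declared = field["type"]
        allowed = declared if isinstance(declared, list) else [declared]
        value = record[field["name"]]
        # A union accepts a value matching any one of its branches.
        if not any(type(value) is PYTHON_TYPES[t] for t in allowed):
            return False
    return True


good = conforms({"name": "Ada", "favorite_number": 7, "favorite_color": None}, SCHEMA)
bad = conforms({"name": "Ada", "favorite_number": "seven", "favorite_color": None}, SCHEMA)
```

The union `["int", "null"]` is what lets `favorite_number` be absent in meaning while still being present in the record.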
Avro usage in Kafka
Schema registry in action
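The deck does not say which registry is used. Assuming the Confluent Schema Registry, each message carries a small header instead of the full schema: one magic byte (0), then the schema ID as a 4-byte big-endian integer, then the Avro-encoded payload. The sketch below only shows that framing; `frame` and `unframe` are hypothetical helper names and the payload is treated as opaque bytes.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1 magic byte + 4-byte big-endian schema ID


def frame(schema_id, avro_payload):
    """Prefix an already-encoded Avro payload with the registry header."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = HEADER.unpack(message[:HEADER.size])
    if magic != MAGIC_BYTE:
        raise ValueError("not a schema-registry framed message")
    return schema_id, message[HEADER.size:]


framed = frame(42, b"opaque-avro-bytes")
schema_id, payload = unframe(framed)
```

The reader uses the ID to fetch the writer's schema from the registry, which is how one topic can carry several versions of the same schema.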
Where is the data ?
Brokers are where most of the stuff happens
Dump the data: The data sits on the brokers’ disk(s). Data flows to/from Kafka. It’s immutable; you can’t change it directly.
Retention: By default, data is kept for approx. 6 months, but it can stay there indefinitely. In all cases, its expiration is totally independent of its consumption.
Scalable: To increase space we can “simply” add a new broker.
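The point that expiration is independent of consumption can be sketched as a function that takes the log, the clock, and the retention window, and nothing about readers. This is a simplified model: real brokers expire whole log segments rather than single messages, and `expire` is a made-up name.

```python
def expire(log, now, retention_seconds):
    """Keep only (timestamp, message) entries younger than the retention window.

    Note what is NOT a parameter: whether anyone has read the messages.
    """
    return [(ts, msg) for ts, msg in log if now - ts < retention_seconds]


log = [(0, "a"), (50, "b"), (100, "c")]

# At time 120 with a 60-second window, only the entry written at 100 survives,
# whether or not any consumer has read "a" and "b".
kept = expire(log, now=120, retention_seconds=60)
```

A slow consumer can therefore miss data that expired before it was read, and a fast consumer does not free any disk space.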
Replication
Isn’t that just a big SPOF ?
Failure resilience
Partition follower failure
Partition leader failure
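The two failure cases can be sketched on a single partition with one leader and a set of in-sync replicas (ISR). This is a simplified model of the idea, not the controller's actual algorithm; the dictionary layout and `handle_broker_failure` are assumptions made for the example.

```python
def handle_broker_failure(partition, failed_broker):
    """Return the partition state after one broker goes down."""
    isr = [b for b in partition["isr"] if b != failed_broker]
    leader = partition["leader"]
    if leader == failed_broker:
        if not isr:
            raise RuntimeError("no in-sync replica left to take over")
        leader = isr[0]  # promote a replica that has all the data
    return {"leader": leader, "isr": isr}


partition = {"leader": "broker-1", "isr": ["broker-1", "broker-2", "broker-3"]}

# Follower failure: the leader keeps serving, the ISR just shrinks.
after_follower = handle_broker_failure(partition, "broker-3")

# Leader failure: an in-sync follower is promoted.
after_leader = handle_broker_failure(partition, "broker-1")
```

Because only in-sync replicas are eligible, a promoted follower already holds every acknowledged message.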
Zookeeper: the puppet master
Kafka at JobTeaser
Talent bank’s use case
Stream “Latest”
1 topic by domain.entity
3 partitions by topic
Retention > weeks
Data team’s use case with JT MySQL
Stream full content of DB
1 topic by table
1 partition by topic
Retention > months
Data team’s use case with Salesforce
Stream “Latest”
1 topic by “Object”
1 partition
Retention < 1 week
(Complete) Glossary
Kafka -> Your new best friend
Topic -> Log file of the messages (exists at cluster level)
Offset -> Primary key of the message (on partition level)
Brokers -> The machines that fully handle the topics
Producer & Consumer -> Your job
Avro -> So much better than json ;)
Join the movement !
Valuable resources
Kafka for beginners : https://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
Kafka overview : https://www.alibabacloud.com/blog/an-overview-of-kafka-distributed-message-system_594218
Is Kafka a database? : https://speakerdeck.com/ept/is-kafka-a-database
Putting the Power of Kafka into the Hands of Data Scientists : https://multithreaded.stitchfix.com/blog/2018/09/05/datahighway/
Why we chose Kafka : https://tech.trello.com/why-we-chose-kafka/
Salesforce notifications to Kafka topics : https://glenmazza.net/blog/entry/salesforce-notifications-to-kafka-topics
Streaming data out of the monolith : https://medium.com/blablacar-tech/streaming-data-out-of-the-monolith-building-a-highly-reliable-cdc-stack-d71599131acb
Kafka clients At Most Once, At Least Once, Exactly Once : https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o
Message serialization in Kafka using Avro part 1 : http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-1/
Message serialization in Kafka using Avro part 2 : http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-2/
Offset management in Kafka : https://fr.slideshare.net/jjkoshy/offset-management-in-kafka
Kafka listeners explained : https://rmoff.net/2018/08/02/kafka-listeners-explained/
The power of rebalancing in Kafka : https://www.youtube.com/watch?v=MmLezWRI3Ys
