(C) COPYRIGHT METAMAGIC GLOBAL INC., NEW JERSEY, USA
Content
Why Kafka
Kafka Ecosystem
Topics and Partitions
Brokers
Replication Factor
Segments
Leader Concept
Producer
Consumer
Consumer Group
Why Kafka Connect
Kafka Connect Architecture
Kafka Connect Demo Example
File Connector
JDBC Connector
Kafka Streams
Stream Processing
KStream & KTable
Kafka Streams Demo Example
Why Kafka
[Diagram] A simple application starts with one source system and one target system. A complex enterprise system ends up with many source systems and many target systems, each pair wired together directly; Apache Kafka sits in the middle and decouples them.
Why Kafka
Use Cases
• Messaging system
• De-coupling of system dependencies
• Gathering metrics from various systems
• Application log collection
• Stream processing (Kafka Streams API)
• Integration with different systems (Kafka Connect API)
• …

Features
• Distributed, resilient, and fault tolerant
• Horizontally scalable
• High performance
• Widely adopted by big companies: Netflix, Airbnb, Walmart, LinkedIn
Kafka Ecosystem
[Diagram] Producer → KAFKA → Consumer, moving data from source systems to target systems, with ZOOKEEPER coordinating the cluster.
Topics and Partitions
Topics
• A topic is a stream of data
• Similar to a table in a database, but without any constraints
• A topic is identified by a unique name
• There can be any number of topics
• Each topic is split into partitions
• Partitions are ordered
• Each topic can have any number of partitions
• Messages are stored in a partition with an incremental ID called the offset
[Diagram] A topic with Partition 0 holding offsets 0, 1, 2, 3 … 10 and Partition 1 holding offsets 0, 1, 2, 3 … 15.
Topics and Partitions
• A topic can have as many partitions as the user wants
• An offset has meaning only within a partition, not across partitions; e.g., the data at offset 1 of partition 1 is not the same as the data at offset 1 of partition 2
• Order is guaranteed within a partition, not across partitions
• Once data is written it cannot be changed, i.e. it is immutable
• Assignment of data to a partition is random unless a key is provided
• Data is retained in a partition for one week by default; this is configurable
Brokers
• A Kafka cluster is composed of multiple brokers, each identified by an integer ID.
• Each broker contains certain topic partitions.
• Once you are connected to any broker (a bootstrap broker), you are connected to the entire cluster.
[Diagram] Broker 1 holds Topic 1/Partition 2 and Topic 2/Partition 1; Broker 2 holds Topic 1/Partition 1 and Topic 2/Partition 2; Broker 3 holds Topic 1/Partition 3. Data is distributed: Topic 2 is not present on Broker 3.
Replication Factor
A topic should have a replication factor greater than 1; this guarantees that if one broker goes down, another can serve the data.
Example of a topic with 2 partitions and a replication factor of 2.
[Diagram] Topic 1/Partition 1 lives on Brokers 1 and 2; Topic 1/Partition 2 lives on Brokers 2 and 3. If any one broker goes down, the data is still served because it is replicated on another broker.
Partitions and Replication Factor
◦ The two most important factors when creating a topic, as they impact performance and durability
◦ It is always best to settle on these two factors when creating a topic instead of modifying them afterwards
◦ If you increase the number of partitions after creating a topic, key ordering breaks; e.g., if you created a topic with 3 partitions and increase it to 5, messages with the same key will no longer go to the same partition
◦ If you increase the replication factor after creating a topic, it puts more pressure on the Kafka cluster, which might affect performance
◦ Confluent blog guidelines:
◦ Roughly, each partition can sustain a throughput of about 10 MB/sec
◦ More partitions give better parallelism, but also mean more open files, and if a broker fails they lead to many concurrent leader elections
Partitions and Segments
Topics are made of partitions, and partitions are made of segments.
[Diagram] A partition is made of segments: Segment 0 (offsets 0–999), Segment 1 (offsets 1000–1999), and the newest segment (offset 2000 to current), which is the active segment for writes.
• At any given time only one segment is active
• Two important segment settings:
• Max size of a segment in bytes – log.segment.bytes (default is 1 GB)
• Time Kafka will wait before closing a segment – log.segment.ms (default is 1 week); see the sketch below for the per-topic equivalents
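
A minimal sketch of creating a topic with per-topic segment settings via the Java AdminClient. The broker address, topic name, and chosen values are assumptions; note that segment.bytes and segment.ms are the per-topic counterparts of the broker-level defaults above:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithSegmentConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, String> configs = new HashMap<>();
            configs.put("segment.bytes", String.valueOf(512L * 1024 * 1024));     // roll after 512 MB
            configs.put("segment.ms", String.valueOf(7L * 24 * 60 * 60 * 1000));  // or after one week

            NewTopic topic = new NewTopic("first_topic", 2, (short) 1).configs(configs);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}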
Segments and Indexes
Each segment comes with two index files:
• Offset index: allows Kafka to find a message at a given offset
• Timestamp index: allows Kafka to find messages for a given timestamp
[Diagram] Each segment (Segment 0: offsets 0–999, Segment 1: offsets 1000–1999, newest segment: offset 2000 to current) carries position entries (Position 0, 1, 2) and timestamp entries (Timestamp 0, 1, 2).
Note: The timestamp index exists from Kafka 0.10.1 onward.
Concept of leader
• At any given time only one broker can be the leader for a given partition
• Only the leader can receive and serve data for a partition
• The other brokers hold ISRs (in-sync replicas)
• Every partition has one leader and multiple ISRs
[Diagram] Topic 1/Partition 1 lives on Brokers 1 and 2; Topic 1/Partition 2 lives on Brokers 2 and 3. Broker 1 is the leader for Partition 1, Broker 3 is the leader for Partition 2, and Broker 2 holds the in-sync replicas.
Producers
◦ Producers write data to topics. They only have to specify the topic name and one broker to connect to; Kafka automatically takes care of routing the data to the right broker.
[Diagram] A producer writes to Topic 1 spread across Broker 1 (Partition 0), Broker 2 (Partition 1), and Broker 3 (Partition 3), with automatic load balancing.
acks   Description
0      Producer does not wait for acknowledgment; possible loss of data.
1      Producer waits for the leader's acknowledgment.
all    Producer waits for the leader and all in-sync replicas; no data loss.
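
A minimal Java producer sketch showing where acks is set; the broker address, topic name, and message are assumptions borrowed from the command-line demo later in the deck:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093"); // any one broker is enough
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // leader + in-sync replicas must acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("first_topic", "hello kafka"));
        }
    }
}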
Producers: Message Key
◦ A producer can send a message with a key
◦ If a key is present, messages with the same key always go to the same partition
◦ This enables ordering of data per key
◦ Example (see the keyed-record sketch after the diagram)
[Diagram] A producer sends messages with key = user_id to Topic 1 on Broker 1 (Partition 0) and Broker 2 (Partition 1). If user_id = 0 the message always goes to partition 0; if user_id = 1 it always goes to partition 1.
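
A hedged sketch of sending a keyed record with a delivery callback, reusing the producer from the sketch above; the key and value are illustrative only:

// Records with the same key hash to the same partition, preserving per-key order.
ProducerRecord<String, String> record =
        new ProducerRecord<>("first_topic", "0" /* key = user_id */, "some event payload");
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        System.out.printf("key=0 landed on partition %d at offset %d%n",
                metadata.partition(), metadata.offset());
    } else {
        exception.printStackTrace();
    }
});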
Consumers
◦ Consumers read data from a topic. They just have to specify the topic name and one broker; Kafka automatically takes care of reading messages from the right broker
◦ Data is read by a consumer in order within a partition
◦ A consumer can read from several partitions in parallel, but order holds only within each partition (see the sketch after the diagram)
[Diagram] One consumer reads Topic 1/Partition 0 (offsets 0–6) on Broker 1 and Topic 1/Partition 1 (offsets 0–8) on Broker 2, reading in order within each partition.
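
A minimal Java consumer sketch in this spirit; the broker address, topic, and group id are assumptions carried over from the command-line demo later in the deck:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093"); // any one broker is enough
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // like --from-beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Order is guaranteed per partition, hence partition and offset are printed.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}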
Consumer Groups
◦ Consumers read data as part of a consumer group
◦ Each consumer within a group reads from its own exclusive set of partitions
◦ There cannot be more active consumers than partitions; if there are, some of them will be inactive
[Diagram] Consumer Group 0 has Consumer 1 on Topic1/Partition 0 and Consumer 2 on Topic1/Partition 1, while Consumer Group 1 has a single Consumer 1 reading both partitions.
Zookeeper
◦ Zookeeper manages the brokers and keeps a list of them
◦ Zookeeper performs leader election for partitions
◦ Zookeeper notifies Kafka in case of:
◦ A new topic
◦ Deletion of a topic
◦ A new broker
◦ A broker dying
Demo using command line
Assuming Zookeeper & Kafka are installed, and Kafka is started with two brokers (ports 9093/9094)
• Create a topic with 2 partitions and a replication factor of 1
• sh kafka-topics.sh --zookeeper 127.0.0.1:2181 --create --topic first_topic --partitions 2 --replication-factor 1
• To list topics: sh kafka-topics.sh --zookeeper 127.0.0.1:2181 --list
• Produce messages on the topic created
• sh kafka-console-producer.sh --broker-list 127.0.0.1:9093,127.0.0.1:9094 --topic first_topic
• Consume messages from the topic created
• sh kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9093,127.0.0.1:9094 --topic first_topic --from-beginning
• Enter a couple of messages on the producer console and watch the corresponding messages appear on the consumer console.
• Consumer group – to test the consumer-group concept, run the command below in two different terminals. You will notice messages being consumed per partition.
• sh kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9093,127.0.0.1:9094 --topic first_topic --consumer-property group.id=mygroup1 --from-beginning
Kafka Client Example (Java)
◦ Check out the source code from git
◦ ProducerDemo.java publishes messages; a callback method is invoked once a message is acknowledged
◦ ConsumerDemo.java is configured with consumer group "mygroup1"; if you run multiple instances of this application, you will see messages being consumed per partition.
Why Kafka Connect & Streams
Common use cases for Kafka are:
• Source -> Kafka
• Kafka -> Sink
• Kafka -> Kafka
• Kafka -> App
Kafka Connect allows you to:
◦ Import data from various data sources such as databases, filesystems, Twitter, FTP, etc.
◦ Export data to various data sinks such as Twitter, filesystems, databases, Splunk, etc.
The Kafka Connect API (alongside the Kafka Streams and Kafka Consumer APIs) simplifies getting data in and out of Kafka.
Kafka Connect & Streams Architecture
[Diagram] Kafka Connect & Streams architecture: a source feeds Kafka Connect workers, the workers write into the brokers of the Kafka cluster, stream apps read from and write back to the cluster, and Connect workers deliver the results to a sink.
File and JDBC connector (Demo)
[Diagram] FileConnector demo: (1) source file → Kafka Connect → Kafka topic, then (2) Kafka topic → Kafka Connect → sink file.
[Diagram] KafkaJDBCConnector demo: (3) source DB → Kafka Connect → Kafka topic, then (4) Kafka topic → Kafka Connect → sink DB.
Git: https://github.com/MetaArivu/kafka-connect.git
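
For orientation, a minimal file-source connector config in the style of the connect-file-source.properties sample that ships with Kafka; the file and topic names here are placeholders, not taken from the demo repo:

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=source.txt
topic=connect-demo-topic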
Kafka Streams
Kafka Streams is a client library for processing and analyzing data stored in Kafka.
[Diagram] Typical Kafka Streams workloads around Kafka: data transformation, data enrichment, and monitoring and alerting.
• Simple and lightweight Java client library
• Has no external dependencies on systems other than Apache Kafka itself
• Supports exactly-once processing semantics to guarantee that each record is processed once
• Employs one-record-at-a-time processing to achieve millisecond processing latency
• Supports fault-tolerant local state
Stream Processing
◦ A stream represents an unbounded, continuously updating data set
◦ A stream processing application defines its computational logic through one or more processor topologies
◦ A stream processor is a node in the processor topology; there are two special processors in a topology:
◦ Source processor: has no up-stream processor. It produces an input stream for its topology from one or more Kafka topics and forwards it to down-stream processors
◦ Sink processor: has no down-stream processor. It sends any data received from its up-stream processors to a Kafka topic (see the topology sketch after the diagram)
[Diagram] A processor topology: streams flow from a source processor through intermediate stream processors to a sink processor.
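
A minimal sketch of such a topology using the low-level Processor API (pre-3.0 style, where AbstractProcessor is still available); the topic names and the upper-casing step are illustrative assumptions:

import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.AbstractProcessor;

public class TopologySketch {
    // An intermediate stream processor node: forwards upper-cased values downstream.
    static class UpperCaseProcessor extends AbstractProcessor<String, String> {
        @Override
        public void process(String key, String value) {
            context().forward(key, value == null ? null : value.toUpperCase());
        }
    }

    public static Topology build() {
        Topology topology = new Topology();
        topology.addSource("Source", "input-topic")                        // source processor: reads from Kafka
                .addProcessor("Upper", UpperCaseProcessor::new, "Source")  // stream processor node
                .addSink("Sink", "output-topic", "Upper");                 // sink processor: writes back to Kafka
        return topology;
    }
}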
KStream & KTable
KStream
◦ Every record is an insert
◦ An unbounded stream
◦ It is infinite
Example: input records on the topic – (Team A, 10), (Team B, 20), (Congress, 20), (Team A, 40).
The KStream view keeps all of them, in arrival order: Team A → 10, Team B → 20, Congress → 20, Team A → 40.
KTable
◦ It is an upsert on non-null values
◦ Delete on null values
For the same input records, the KTable view keeps only the latest value per key: Team A → 40, Team B → 20, Congress → 20.
Kafka Streams – Polling APP Demo
The goal of this application is to display a per-party count based on incoming data, where each incoming record carries a party and a candidate name. A hedged sketch of the topology follows the repo link below.
INPUT
Congress Candidate 2
Congress Candidate 3
Independent Candidate 4
OUTPUT
Congress 2
Independent 1
https://github.com/MetaArivu/kafka-streams
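
A hedged sketch of what such a polling topology could look like with the Kafka Streams DSL, assuming records arrive keyed by party name with the candidate as the value; the topic names and application id are illustrative, not taken from the demo repo:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

public class PollingAppSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "polling-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> votes = builder.stream("poll-input"); // key = party, value = candidate
        KTable<String, Long> partyCounts = votes
                .groupByKey() // group records by party (the message key)
                .count();     // KTable: latest count per party
        partyCounts.toStream() // emit count updates as a changelog stream
                   .to("poll-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}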