(C) COPYRIGHT METAMAGIC GLOBAL INC., NEW JERSEY, USA
Content
Why Kafka
Kafka Ecosystem
Topics and Partitions
Brokers
Replication Factor
Segments
Leader Concept
Producer
Consumer
Consumer Group
Why Kafka Connect
Kafka Connect Architecture
Kafka Connect Demo Example
File Connector
JDBC Connector
Kafka Streams
Stream Processing
KStream & KTable
Kafka Streams Demo Example
Why Kafka
[Diagram] A simple application starts with one source system and one target system. A complex enterprise system ends up with many source systems and many target systems, each pair wired together directly; Apache Kafka sits in the middle and decouples them.
Why Kafka
Use Cases
• Messaging system
• De-coupling of system dependencies
• Gathering metrics from various systems
• Application log collection
• Stream processing (Kafka Streams API)
• Integration with different systems (Kafka Connect API)
• …

Features
• Distributed, resilient, and fault tolerant
• Horizontally scalable
• High performance
• Widely adopted by big companies: Netflix, Airbnb, Walmart, LinkedIn
Kafka Ecosystem
[Diagram] Producer → KAFKA → Consumer, moving data from source systems to target systems, with ZOOKEEPER coordinating the cluster.
Topics and Partitions
Topics
• A topic is a stream of data
• Similar to a table in a database, but without any constraints
• A topic is identified by a unique name
• There can be any number of topics
• Each topic is split into partitions
• Partitions are ordered
• Each topic can have any number of partitions
• Messages are stored in a partition with an incremental ID called the offset
[Diagram] A topic with Partition 0 holding offsets 0, 1, 2, 3 … 10 and Partition 1 holding offsets 0, 1, 2, 3 … 15.
Topics and Partitions
• A topic can have as many partitions as the user wants
• An offset has meaning only within a partition, not across partitions; e.g., the data at offset 1 of partition 1 is not the same as the data at offset 1 of partition 2
• Order is guaranteed within a partition, not across partitions
• Once data is written it cannot be changed, i.e. it is immutable
• Assignment of data to a partition is random unless a key is provided
• Data is retained in a partition for one week by default; this is configurable
Brokers
• A Kafka cluster is composed of multiple brokers, each identified by an integer ID.
• Each broker contains certain topic partitions.
• Once you are connected to any broker (a bootstrap broker), you are connected to the entire cluster.
[Diagram] Broker 1 holds Topic 1/Partition 2 and Topic 2/Partition 1; Broker 2 holds Topic 1/Partition 1 and Topic 2/Partition 2; Broker 3 holds Topic 1/Partition 3. Data is distributed: Topic 2 is not present on Broker 3.
Replication Factor
A topic should have a replication factor greater than 1; this guarantees that if one broker goes down, another can serve the data.
Example of a topic with 2 partitions and a replication factor of 2.
[Diagram] Topic 1/Partition 1 lives on Brokers 1 and 2; Topic 1/Partition 2 lives on Brokers 2 and 3. If any one broker goes down, the data is still served because it is replicated on another broker.
Partitions and Replication Factor
◦ The two most important factors when creating a topic, as they impact performance and durability
◦ It is always best to settle on these two factors when creating a topic instead of modifying them afterwards
◦ If you increase the number of partitions after creating a topic, key ordering breaks; e.g., if you created a topic with 3 partitions and increase it to 5, messages with the same key will no longer go to the same partition
◦ If you increase the replication factor after creating a topic, it puts more pressure on the Kafka cluster, which might affect performance
◦ Confluent blog guidelines:
◦ Roughly, each partition can sustain a throughput of about 10 MB/sec
◦ More partitions give better parallelism, but also mean more open files, and if a broker fails they lead to many concurrent leader elections
Partitions and Segments
Topics are made of partitions, and partitions are made of segments.
[Diagram] A partition is made of segments: Segment 0 (offsets 0–999), Segment 1 (offsets 1000–1999), and the newest segment (offset 2000 to current), which is the active segment for writes.
• At any given time only one segment is active
• Two important segment settings:
• Max size of a segment in bytes – log.segment.bytes (default is 1 GB)
• Time Kafka will wait before closing a segment – log.segment.ms (default is 1 week); see the sketch below for the per-topic equivalents
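
A minimal sketch of creating a topic with per-topic segment settings via the Java AdminClient. The broker address, topic name, and chosen values are assumptions; note that segment.bytes and segment.ms are the per-topic counterparts of the broker-level defaults above:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithSegmentConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, String> configs = new HashMap<>();
            configs.put("segment.bytes", String.valueOf(512L * 1024 * 1024));     // roll after 512 MB
            configs.put("segment.ms", String.valueOf(7L * 24 * 60 * 60 * 1000));  // or after one week

            NewTopic topic = new NewTopic("first_topic", 2, (short) 1).configs(configs);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}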
Segments and Indexes
Each segment comes with two index files:
• Offset index: allows Kafka to find a message at a given offset
• Timestamp index: allows Kafka to find messages for a given timestamp
[Diagram] Each segment (Segment 0: offsets 0–999, Segment 1: offsets 1000–1999, newest segment: offset 2000 to current) carries position entries (Position 0, 1, 2) and timestamp entries (Timestamp 0, 1, 2).
Note: The timestamp index exists from Kafka 0.10.1 onward.
Concept of leader
• At any given time only one broker can be the leader for a given partition
• Only the leader can receive and serve data for a partition
• The other brokers hold ISRs (in-sync replicas)
• Every partition has one leader and multiple ISRs
[Diagram] Topic 1/Partition 1 lives on Brokers 1 and 2; Topic 1/Partition 2 lives on Brokers 2 and 3. Broker 1 is the leader for Partition 1, Broker 3 is the leader for Partition 2, and Broker 2 holds the in-sync replicas.
Producers
◦ Producers write data to topics. They only have to specify the topic name and one broker to connect to; Kafka automatically takes care of routing the data to the right broker.
[Diagram] A producer writes to Topic 1 spread across Broker 1 (Partition 0), Broker 2 (Partition 1), and Broker 3 (Partition 3), with automatic load balancing.
acks   Description
0      Producer does not wait for acknowledgment; possible loss of data.
1      Producer waits for the leader's acknowledgment.
all    Producer waits for the leader and all in-sync replicas; no data loss.
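
A minimal Java producer sketch showing where acks is set; the broker address, topic name, and message are assumptions borrowed from the command-line demo later in the deck:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093"); // any one broker is enough
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // leader + in-sync replicas must acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("first_topic", "hello kafka"));
        }
    }
}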
Producers: Message Key
◦ A producer can send a message with a key
◦ If a key is present, messages with the same key always go to the same partition
◦ This enables ordering of data per key
◦ Example (see the keyed-record sketch after the diagram)
[Diagram] A producer sends messages with key = user_id to Topic 1 on Broker 1 (Partition 0) and Broker 2 (Partition 1). If user_id = 0 the message always goes to partition 0; if user_id = 1 it always goes to partition 1.
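
A hedged sketch of sending a keyed record with a delivery callback, reusing the producer from the sketch above; the key and value are illustrative only:

// Records with the same key hash to the same partition, preserving per-key order.
ProducerRecord<String, String> record =
        new ProducerRecord<>("first_topic", "0" /* key = user_id */, "some event payload");
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        System.out.printf("key=0 landed on partition %d at offset %d%n",
                metadata.partition(), metadata.offset());
    } else {
        exception.printStackTrace();
    }
});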
Consumers
◦ Consumers read data from a topic. They just have to specify the topic name and one broker; Kafka automatically takes care of reading messages from the right broker
◦ Data is read by a consumer in order within a partition
◦ A consumer can read from several partitions in parallel, but order holds only within each partition (see the sketch after the diagram)
[Diagram] One consumer reads Topic 1/Partition 0 (offsets 0–6) on Broker 1 and Topic 1/Partition 1 (offsets 0–8) on Broker 2, reading in order within each partition.
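
A minimal Java consumer sketch in this spirit; the broker address, topic, and group id are assumptions carried over from the command-line demo later in the deck:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093"); // any one broker is enough
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // like --from-beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Order is guaranteed per partition, hence partition and offset are printed.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}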
Consumer Groups
◦ Consumers read data as part of a consumer group
◦ Each consumer within a group reads from its own exclusive set of partitions
◦ There cannot be more active consumers than partitions; if there are, some of them will be inactive
[Diagram] Consumer Group 0 has Consumer 1 on Topic1/Partition 0 and Consumer 2 on Topic1/Partition 1, while Consumer Group 1 has a single Consumer 1 reading both partitions.
Zookeeper
◦ Zookeeper manages the brokers and keeps a list of them
◦ Zookeeper performs leader election for partitions
◦ Zookeeper notifies Kafka in case of:
◦ A new topic
◦ Deletion of a topic
◦ A new broker
◦ A broker dying
Demo using command line
Assuming Zookeeper & Kafka are installed, and Kafka is started with two brokers (ports 9093/9094)
• Create a topic with 2 partitions and a replication factor of 1
• sh kafka-topics.sh --zookeeper 127.0.0.1:2181 --create --topic first_topic --partitions 2 --replication-factor 1
• To list topics: sh kafka-topics.sh --zookeeper 127.0.0.1:2181 --list
• Produce messages on the topic created
• sh kafka-console-producer.sh --broker-list 127.0.0.1:9093,127.0.0.1:9094 --topic first_topic
• Consume messages from the topic created
• sh kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9093,127.0.0.1:9094 --topic first_topic --from-beginning
• Enter a couple of messages on the producer console and watch the corresponding messages appear on the consumer console.
• Consumer group – to test the consumer-group concept, run the command below in two different terminals. You will notice messages being consumed per partition.
• sh kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9093,127.0.0.1:9094 --topic first_topic --consumer-property group.id=mygroup1 --from-beginning
Kafka Client Example (Java)
◦ Check out the source code from git
◦ ProducerDemo.java publishes messages; a callback method is invoked once a message is acknowledged
◦ ConsumerDemo.java is configured with consumer group "mygroup1"; if you run multiple instances of this application, you will see messages being consumed per partition.
Why Kafka Connect & Streams
Common use cases for Kafka are:
• Source -> Kafka
• Kafka -> Sink
• Kafka -> Kafka
• Kafka -> App
Kafka Connect allows you to:
◦ Import data from various data sources such as databases, filesystems, Twitter, FTP, etc.
◦ Export data to various data sinks such as Twitter, filesystems, databases, Splunk, etc.
The Kafka Connect API (alongside the Kafka Streams and Kafka Consumer APIs) simplifies getting data in and out of Kafka.
Kafka Connect & Streams Architecture
[Diagram] Kafka Connect & Streams architecture: a source feeds Kafka Connect workers, the workers write into the brokers of the Kafka cluster, stream apps read from and write back to the cluster, and Connect workers deliver the results to a sink.
File and JDBC connector (Demo)
[Diagram] FileConnector demo: (1) source file → Kafka Connect → Kafka topic, then (2) Kafka topic → Kafka Connect → sink file.
[Diagram] KafkaJDBCConnector demo: (3) source DB → Kafka Connect → Kafka topic, then (4) Kafka topic → Kafka Connect → sink DB.
Git: https://github.com/MetaArivu/kafka-connect.git
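
For orientation, a minimal file-source connector config in the style of the connect-file-source.properties sample that ships with Kafka; the file and topic names here are placeholders, not taken from the demo repo:

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=source.txt
topic=connect-demo-topic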
Kafka Streams
Kafka Streams is a client library for processing and analyzing data stored in Kafka.
[Diagram] Typical Kafka Streams workloads around Kafka: data transformation, data enrichment, and monitoring and alerting.
• Simple and lightweight Java client library
• Has no external dependencies on systems other than Apache Kafka itself
• Supports exactly-once processing semantics to guarantee that each record is processed once
• Employs one-record-at-a-time processing to achieve millisecond processing latency
• Supports fault-tolerant local state
Stream Processing
◦ A stream represents an unbounded, continuously updating data set
◦ A stream processing application defines its computational logic through one or more processor topologies
◦ A stream processor is a node in the processor topology; there are two special processors in a topology:
◦ Source processor: has no up-stream processor. It produces an input stream for its topology from one or more Kafka topics and forwards it to down-stream processors
◦ Sink processor: has no down-stream processor. It sends any data received from its up-stream processors to a Kafka topic (see the topology sketch after the diagram)
[Diagram] A processor topology: streams flow from a source processor through intermediate stream processors to a sink processor.
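
A minimal sketch of such a topology using the low-level Processor API (pre-3.0 style, where AbstractProcessor is still available); the topic names and the upper-casing step are illustrative assumptions:

import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.AbstractProcessor;

public class TopologySketch {
    // An intermediate stream processor node: forwards upper-cased values downstream.
    static class UpperCaseProcessor extends AbstractProcessor<String, String> {
        @Override
        public void process(String key, String value) {
            context().forward(key, value == null ? null : value.toUpperCase());
        }
    }

    public static Topology build() {
        Topology topology = new Topology();
        topology.addSource("Source", "input-topic")                        // source processor: reads from Kafka
                .addProcessor("Upper", UpperCaseProcessor::new, "Source")  // stream processor node
                .addSink("Sink", "output-topic", "Upper");                 // sink processor: writes back to Kafka
        return topology;
    }
}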
KStream & KTable
KStream
◦ Every record is an insert
◦ An unbounded stream
◦ It is infinite
Example: input records on the topic – (Team A, 10), (Team B, 20), (Congress, 20), (Team A, 40).
The KStream view keeps all of them, in arrival order: Team A → 10, Team B → 20, Congress → 20, Team A → 40.
KTable
◦ It is an upsert on non-null values
◦ Delete on null values
For the same input records, the KTable view keeps only the latest value per key: Team A → 40, Team B → 20, Congress → 20.
Kafka Streams – Polling APP Demo
The goal of this application is to display a per-party count based on incoming data, where each incoming record carries a party and a candidate name. A hedged sketch of the topology follows the repo link below.
INPUT
Congress Candidate 2
Congress Candidate 3
Independent Candidate 4
OUTPUT
Congress 2
Independent 1
https://github.com/MetaArivu/kafka-streams
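
A hedged sketch of what such a polling topology could look like with the Kafka Streams DSL, assuming records arrive keyed by party name with the candidate as the value; the topic names and application id are illustrative, not taken from the demo repo:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

public class PollingAppSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "polling-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9093");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> votes = builder.stream("poll-input"); // key = party, value = candidate
        KTable<String, Long> partyCounts = votes
                .groupByKey() // group records by party (the message key)
                .count();     // KTable: latest count per party
        partyCounts.toStream() // emit count updates as a changelog stream
                   .to("poll-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}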