Developing Real-Time Data Pipelines with Apache Kafka

SPRINGONE2GX
WASHINGTON, DC
Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Joe Stein
@allthingshadoop

whoami
CEO of Elodina (http://www.elodina.net/), a big data as a service platform built on top of open source
software. The Elodina platform enables customers to analyze data streams and programmatically
react to the results in real time. We solve today’s data analytics needs by providing the tools and
support necessary to utilize open source technologies. As users, contributors and committers,
Elodina also provides support for frameworks that run on Mesos, including Apache Kafka, Exhibitor
(Zookeeper), Apache Storm, Apache Cassandra and a whole lot more!
Apache Kafka Committer & PMC Member
LinkedIn: http://linkedin.com/in/charmalloc
Twitter: @allthingshadoop

Contents
• Introduction To Kafka
• Overview
• Topics, Partitions & Segments
• Data Durability
• Replication
• Producers
• Consumers
• Performance
• Integration
• Quick Start
• Operations
• Designs
• Distributed RPC
o Request
o Process
o Response
• Storage & Analytics
o Stream
o Transform
o Analyze
o Store
o Search

Apache Kafka

Apache Kafka
Apache Kafka was first open sourced by LinkedIn in 2011
Papers
● Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf
● Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
● Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf
● The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://kafka.apache.org/

How Big Data Usually Starts

More Big Data!

Ah!

eesh

Kafka de-couples data pipelines

Distributed Replicated Log
Read and write
In real time
As much as you want
As fast as your network can go
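A minimal in-memory sketch of the log idea on this slide, assuming a single partition on a single machine. It is not Kafka's implementation: replication, persistence and networking are left out, and the names `PartitionLog` and `Reader` are hypothetical. It only illustrates that writes append at the next offset and that each reader tracks its own position, so readers do not affect one another.

```python
class PartitionLog:
    """Append-only log: each record gets the next sequential offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Reader:
    """Hypothetical reader that remembers its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_records=10):
        """Fetch the next batch and advance this reader's position only."""
        batch = self.log.read(self.position, max_records)
        self.position += len(batch)
        return batch


log = PartitionLog()
first = log.append("This is a message")
second = log.append("This is another message")

# Two readers consume the same log at different speeds.
fast, slow = Reader(log), Reader(log)
fast_batch = fast.poll()
slow_batch = slow.poll(max_records=1)
```

Because reading never removes records, the slow reader can catch up later by polling again from its own position.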

Topics and Partitions

Log Segments

Distributed Replicated Log

Data Durability

Replication

Producers

Consumers

Consumer Failover

Producer Performance
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Consumer Performance
http://kafka.apache.org/documentation.html#maximizingefficiency

Client Libraries
Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
● Go (aka golang) - Pure Go implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
● Python - Pure Python implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
● C - High performance C library with full protocol support
● Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy
compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
● Clojure - Clojure DSL for the Kafka API
● JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation
Wire Protocol Developer's Guide
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

Spring Integration
Good blog about it https://spring.io/blog/2015/04/15/using-apache-kafka-for-integration-and-data-processing-pipelines-with-spring
Kafka Integration Source https://github.com/spring-projects/spring-integration-kafka
Spring XD samples
https://github.com/spring-projects/spring-xd-samples/tree/master/kafka-source

Quick Start
https://kafka.apache.org/documentation.html#quickstart
Download the 0.8.2.2 release and un-tar it.
> tar -xzf kafka_2.10-0.8.2.2.tgz
> cd kafka_2.10-0.8.2.2
(use at least four terminal windows: ZooKeeper, the broker, the producer and the consumer each run in the foreground)
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message

Operationalizing Kafka
https://kafka.apache.org/documentation.html#basic_ops
Basic Kafka Operations
● Adding and removing topics
● Modifying topics
● Graceful shutdown
● Balancing leadership
● Checking consumer position
● Mirroring data between clusters
● Expanding your cluster
● Decommissioning brokers
● Increasing replication factor

Running on Mesos

Static Partitioning

Scaling is manual (even if orchestrated)

Static failures require manual intervention

Application Elasticity

An operating system for your data center

Everything goes on Mesos

Kafka on Mesos
https://github.com/mesos/kafka
● smart broker.id assignment.
● preservation of broker placement (through constraints and/or new features).
● ability to make configuration changes.
● rolling restarts (for things like configuration changes).
● scaling the cluster up and down with automatic, programmatic and manual options.
● smart partition assignment via constraints vis-à-vis roles, resources and attributes.

Kafka on Mesos
Scheduler
● Provides the operational automation for a Kafka Cluster.
● Manages the changes to the broker's configuration.
● Exposes a REST API for the CLI to use or any other client.
● Runs on Marathon for high availability.
Executor
● The executor interacts with the Kafka broker as an intermediary to the scheduler.

REST API & CLI
● scheduler - starts the scheduler.
● add - adds one or more brokers to the cluster.
● update - changes resources, constraints or broker properties for one or more brokers.
● remove - takes a broker out of the cluster.
● start - starts a broker up.
● stop - either shuts a broker down gracefully or force kills it (./kafka-mesos.sh help stop).
● rebalance - allows you to rebalance a cluster either by selecting the brokers or topics to rebalance. Manual assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor on a topic.
● help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command}

Launch 20 brokers in seconds

Kafka 0.9
KIP (Kafka Improvement Proposals)
• https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
New Consumer
• https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
Security
• https://cwiki.apache.org/confluence/display/KAFKA/Security
• https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=51809888
JIRA
• https://issues.apache.org/jira/browse/KAFKA/fixforversion/12328745/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

Distributed RPC

Reference Architecture

Questions?
http://www.elodina.net
