SPRINGONE2GX
WASHINGTON, DC
Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
@allthingshadoop
whoami
CEO of Elodina http://www.elodina.net/, a big data as a service platform built on top of open source
software. The Elodina platform enables customers to analyze data streams and programmatically
react to the results in real time. We solve today’s data analytics needs by providing the tools and
support necessary to utilize open source technologies. As users, contributors and committers,
Elodina also provides support for frameworks that run on Mesos including Apache Kafka, Exhibitor
(ZooKeeper), Apache Storm, Apache Cassandra and a whole lot more!
Apache Kafka Committer & PMC Member
LinkedIn: http://linkedin.com/in/charmalloc
Twitter: @allthingshadoop
Contents
• Introduction To Kafka
• Overview
• Topics, Partitions & Segments
• Data Durability
• Replication
• Producers
• Consumers
• Performance
• Integration
• Quick Start
• Operations
• Designs
• Distributed RPC
o Request
o Process
o Response
• Storage & Analytics
o Stream
o Transform
o Analyze
o Store
o Search
Apache Kafka
Apache Kafka was first open sourced by LinkedIn in 2011
Papers
● Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf
● Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
● Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf
● The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://kafka.apache.org/
How Big Data Usually Starts
More Big Data!
Ah!
eesh
Kafka de-couples data pipelines
Distributed Replicated Log
Read and write
In real time
As much as you want
As fast as your network can go
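The idea of a log that many parties read and write independently can be sketched in a few lines of Python. This is a toy, single-process model and not Kafka's implementation: the names `ToyLog` and `ToyReader` are hypothetical, everything lives in memory, and replication and distribution are not modeled at all. It only shows that records get an offset when appended and that each reader keeps its own position.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyLog:
    """Hypothetical append-only log: records are only ever added at the end."""
    records: List[bytes] = field(default_factory=list)

    def append(self, record: bytes) -> int:
        """Append a record and return its offset (its position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[bytes]:
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


@dataclass
class ToyReader:
    """Hypothetical reader that tracks its own offset, independent of others."""
    log: ToyLog
    offset: int = 0

    def poll(self, max_records: int = 10) -> List[bytes]:
        """Read the next batch and advance this reader's offset past it."""
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = ToyLog()
log.append(b"This is a message")
log.append(b"This is another message")

fast = ToyReader(log)
slow = ToyReader(log)

first_batch = fast.poll()               # fast reader takes everything so far
slow_batch = slow.poll(max_records=1)   # slow reader takes one record
log.append(b"a third message")          # writer is unaffected by either reader
second_batch = fast.poll()              # fast reader sees only the new record
```

Because the writer never waits on a reader and readers never affect one another, a slow reader simply falls behind and catches up later.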
Topics and Partitions
Log Segments
Distributed Replicated Log
Data Durability
Replication
Producers
Consumers
Consumer Failover
Producer Performance
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Consumer Performance
http://kafka.apache.org/documentation.html#maximizingefficiency
Client Libraries
Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
● Go (aka golang) - Pure Go implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
● Python - Pure Python implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
● C - High performance C library with full protocol support
● Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy
compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
● Clojure - Clojure DSL for the Kafka API
● JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation
Wire Protocol Developer's Guide
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
Spring Integration
Good blog about it https://spring.io/blog/2015/04/15/using-apache-kafka-for-integration-and-data-processing-pipelines-with-spring
Kafka Integration Source https://github.com/spring-projects/spring-integration-kafka
Spring XD samples
https://github.com/spring-projects/spring-xd-samples/tree/master/kafka-source
Quick Start
https://kafka.apache.org/documentation.html#quickstart
Download the 0.8.2.2 release and un-tar it.
> tar -xzf kafka_2.10-0.8.2.2.tgz
> cd kafka_2.10-0.8.2.2
(use at least four terminal windows)
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
Operationalizing Kafka
https://kafka.apache.org/documentation.html#basic_ops
Basic Kafka Operations
● Adding and removing topics
● Modifying topics
● Graceful shutdown
● Balancing leadership
● Checking consumer position
● Mirroring data between clusters
● Expanding your cluster
● Decommissioning brokers
● Increasing replication factor
Running on Mesos
Static Partitioning
Scaling is manual (even if orchestrated)
Static failures require manual intervention
Application Elasticity
An operating system for your data center
Everything goes on Mesos
Kafka on Mesos
https://github.com/mesos/kafka
● smart broker.id assignment.
● preservation of broker placement (through constraints and/or new features).
● ability to make configuration changes.
● rolling restarts (for things like configuration changes).
● scaling the cluster up and down with automatic, programmatic and manual options.
● smart partition assignment via constraints vis-à-vis roles, resources and attributes.
Kafka on Mesos
Scheduler
● Provides the operational automation for a Kafka Cluster.
● Manages the changes to the broker's configuration.
● Exposes a REST API for the CLI to use or any other client.
● Runs on Marathon for high availability.
Executor
● The executor interacts with the Kafka broker as an intermediary to the scheduler.
REST API & CLI
● scheduler - starts the scheduler.
● add - adds one or more brokers to the cluster.
● update - changes resources, constraints or broker properties of one or more brokers.
● remove - takes a broker out of the cluster.
● start - starts a broker up.
● stop - either shuts a broker down gracefully or force kills it (./kafka-mesos.sh help stop).
● rebalance - rebalances a cluster by selecting either the brokers or the topics to rebalance. Manual
assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor
on a topic.
● help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command}
Launch 20 brokers in seconds
Kafka 0.9
KIP (Kafka Improvement Proposals)
• https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
New Consumer
• https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
Security
• https://cwiki.apache.org/confluence/display/KAFKA/Security
• https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=51809888
JIRA
• https://issues.apache.org/jira/browse/KAFKA/fixforversion/12328745/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel
Distributed RPC
Reference Architecture
Questions?
http://www.elodina.net
40

More Related Content

PPTX
Developing with the Go client for Apache Kafka
PDF
Developing Real-Time Data Pipelines with Apache Kafka
PDF
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
ODP
Introduction to Apache Kafka- Part 1
PPTX
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
PDF
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
PPTX
Architecture of a Kafka camus infrastructure
PPTX
Data Architectures for Robust Decision Making
Developing with the Go client for Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Introduction to Apache Kafka- Part 1
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
Architecture of a Kafka camus infrastructure
Data Architectures for Robust Decision Making

What's hot (19)

PDF
fluentd -- the missing log collector
PPTX
Matt Franklin - Apache Software (Geekfest)
PPTX
Spark Streaming & Kafka-The Future of Stream Processing
PPTX
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
PDF
Kafka and Spark Streaming
PPTX
Kafka & Hadoop - for NYC Kafka Meetup
PDF
Cooperative Data Exploration with iPython Notebook
PDF
Stream Processing using Apache Spark and Apache Kafka
PDF
Using the flipn stack for edge ai (flink, nifi, pulsar)
PPTX
Zoo keeper in the wild
PDF
Using FLiP with influxdb for edgeai iot at scale 2022
PPTX
Apache Kafka at LinkedIn
PPTX
Introduction Apache Kafka
PDF
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
PDF
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
PDF
An Introduction to Apache Kafka
PDF
Cloud lunch and learn real-time streaming in azure
PPTX
kafka for db as postgres
PPTX
Spark optimization
fluentd -- the missing log collector
Matt Franklin - Apache Software (Geekfest)
Spark Streaming & Kafka-The Future of Stream Processing
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Kafka and Spark Streaming
Kafka & Hadoop - for NYC Kafka Meetup
Cooperative Data Exploration with iPython Notebook
Stream Processing using Apache Spark and Apache Kafka
Using the flipn stack for edge ai (flink, nifi, pulsar)
Zoo keeper in the wild
Using FLiP with influxdb for edgeai iot at scale 2022
Apache Kafka at LinkedIn
Introduction Apache Kafka
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
Building a Real-Time Data Pipeline with Spark, Kafka, and Python
An Introduction to Apache Kafka
Cloud lunch and learn real-time streaming in azure
kafka for db as postgres
Spark optimization
Ad

Viewers also liked (20)

PPTX
Developing Frameworks for Apache Mesos
PDF
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...
PPTX
Real-time streaming and data pipelines with Apache Kafka
PPTX
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
PPTX
Real time Analytics with Apache Kafka and Apache Spark
PPTX
jstein.cassandra.nyc.2011
PDF
Data Pipeline with Kafka
PPTX
Storing Time Series Metrics With Cassandra and Composite Columns
PPTX
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
PDF
Developing Realtime Data Pipelines With Apache Kafka
PPTX
Containerized Data Persistence on Mesos
PPTX
Apache Cassandra 2.0
PPTX
Making Apache Kafka Elastic with Apache Mesos
PPTX
Kafka at scale facebook israel
PPTX
Building data pipelines
PPTX
Flink history, roadmap and vision
PPTX
Introduction To Apache Mesos
PPTX
Current and Future of Apache Kafka
PPTX
Apache Kafka, HDFS, Accumulo and more on Mesos
PDF
SMACK Stack 1.1
Developing Frameworks for Apache Mesos
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...
Real-time streaming and data pipelines with Apache Kafka
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Real time Analytics with Apache Kafka and Apache Spark
jstein.cassandra.nyc.2011
Data Pipeline with Kafka
Storing Time Series Metrics With Cassandra and Composite Columns
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Developing Realtime Data Pipelines With Apache Kafka
Containerized Data Persistence on Mesos
Apache Cassandra 2.0
Making Apache Kafka Elastic with Apache Mesos
Kafka at scale facebook israel
Building data pipelines
Flink history, roadmap and vision
Introduction To Apache Mesos
Current and Future of Apache Kafka
Apache Kafka, HDFS, Accumulo and more on Mesos
SMACK Stack 1.1
Ad

Similar to Developing Real-Time Data Pipelines with Apache Kafka (20)

PDF
Cassandra and DataStax Enterprise on PCF
PPTX
Building Highly Scalable Spring Applications using In-Memory Data Grids
PDF
Lattice: A Cloud-Native Platform for Your Spring Applications
PPTX
Ratpack - SpringOne2GX 2015
PDF
SpringOnePlatform2017 recap
PPTX
12 Factor, or Cloud Native Apps – What EXACTLY Does that Mean for Spring Deve...
PDF
12 Factor, or Cloud Native Apps - What EXACTLY Does that Mean for Spring Deve...
PDF
Federated Queries with HAWQ - SQL on Hadoop and Beyond
PDF
Cloud-Native Streaming and Event-Driven Microservices
PDF
Introduction to Reactive Streams and Reactor 2.5
PDF
riffing on Knative - Scott Andrews
PDF
Cloud-Native Streaming Platform: Running Apache Kafka on PKS (Pivotal Contain...
PDF
Under the Hood of Reactive Data Access (2/2)
PDF
Migrating to Angular 5 for Spring Developers
PDF
State of Securing Restful APIs s12gx2015
PDF
Migrating to Angular 4 for Spring Developers
PPTX
Kafka Summit NYC 2017 - Cloud Native Data Streaming Microservices with Spring...
PPTX
Connecting All Abstractions with Istio
PDF
Reactive Web Applications
PDF
Cloud Native Java with Spring Cloud Services
Cassandra and DataStax Enterprise on PCF
Building Highly Scalable Spring Applications using In-Memory Data Grids
Lattice: A Cloud-Native Platform for Your Spring Applications
Ratpack - SpringOne2GX 2015
SpringOnePlatform2017 recap
12 Factor, or Cloud Native Apps – What EXACTLY Does that Mean for Spring Deve...
12 Factor, or Cloud Native Apps - What EXACTLY Does that Mean for Spring Deve...
Federated Queries with HAWQ - SQL on Hadoop and Beyond
Cloud-Native Streaming and Event-Driven Microservices
Introduction to Reactive Streams and Reactor 2.5
riffing on Knative - Scott Andrews
Cloud-Native Streaming Platform: Running Apache Kafka on PKS (Pivotal Contain...
Under the Hood of Reactive Data Access (2/2)
Migrating to Angular 5 for Spring Developers
State of Securing Restful APIs s12gx2015
Migrating to Angular 4 for Spring Developers
Kafka Summit NYC 2017 - Cloud Native Data Streaming Microservices with Spring...
Connecting All Abstractions with Istio
Reactive Web Applications
Cloud Native Java with Spring Cloud Services

More from Joe Stein (7)

PDF
Streaming Processing with a Distributed Commit Log
PDF
Get started with Developing Frameworks in Go on Apache Mesos
PPTX
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
PPTX
Building and Deploying Application to Apache Mesos
PPTX
Introduction to Apache Mesos
PPTX
Apache Kafka
PPTX
Hadoop Streaming Tutorial With Python
Streaming Processing with a Distributed Commit Log
Get started with Developing Frameworks in Go on Apache Mesos
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Building and Deploying Application to Apache Mesos
Introduction to Apache Mesos
Apache Kafka
Hadoop Streaming Tutorial With Python

Recently uploaded (20)

PDF
Electronic commerce courselecture one. Pdf
PDF
Encapsulation theory and applications.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPT
Teaching material agriculture food technology
PDF
Approach and Philosophy of On baking technology
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
Spectroscopy.pptx food analysis technology
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
cuic standard and advanced reporting.pdf
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Electronic commerce courselecture one. Pdf
Encapsulation theory and applications.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Spectral efficient network and resource selection model in 5G networks
The Rise and Fall of 3GPP – Time for a Sabbatical?
Teaching material agriculture food technology
Approach and Philosophy of On baking technology
sap open course for s4hana steps from ECC to s4
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Network Security Unit 5.pdf for BCA BBA.
Programs and apps: productivity, graphics, security and other tools
Spectroscopy.pptx food analysis technology
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
The AUB Centre for AI in Media Proposal.docx
Dropbox Q2 2025 Financial Results & Investor Presentation
cuic standard and advanced reporting.pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx

Developing Real-Time Data Pipelines with Apache Kafka

  • 1. SPRINGONE2GX WASHINGTON, DC Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Developing Real-Time Data Pipelines with Apache Kafka Joe Stein @allthingshadoop
  • 2. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ CEO of Elodina http://www.elodina.net/ a big data as a service platform built on top open source software. The Elodina platform enables customers to analyze data streams and programmatically react to the results in real-time. We solve today’s data analytics needs by providing the tools and support necessary to utilize open source technologies. As users, contributors and committers, Elodina also provides support for frameworks that run on Mesos including Apache Kafka, Exhibitor (Zookeeper), Apache Storm, Apache Cassandra and a whole lot more! Apache Kafka Committer & PMC Member LinkedIn: http://linkedin.com/in/charmalloc Twitter : @allthingshadoop whoami 2
  • 3. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Contents • Introduction To Kafka • Overview • Topics, Partitions & Segments • Data Durability • Replication • Producers • Consumers • Performance • Integration • Quick Start • Operations 3 • Designs • Distributed RPC o Request o Process o Response • Storage & Analytics o Stream o Transform o Analyze o Store o Search
  • 4. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Apache Kafka 4
  • 5. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Apache Kafka Apache Kafka was first open sourced by LinkedIn in 2011 Papers ● Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf ● Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en- us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf ● Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf ● The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying http://kafka.apache.org/ 5
  • 6. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ How Big Data Usually Starts 6
  • 7. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ More Big Data! 7
  • 8. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Ah! 8
  • 9. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ eesh 9
  • 10. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Kafka de-couples data pipelines 10
  • 11. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Distributed Replicated Log Read and write In real time As much as you want As fast as your network can go 11
  • 12. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Topics and Partitions 12
  • 13. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Log Segments 13
  • 14. Distributed Replicated Log
  • 15. Data Durability
  • 16. Replication
  • 17. Producers
  • 18. Consumers
  • 19. Consumer Failover
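Consumer failover can be pictured as follows: the partitions of a topic are divided among the members of a consumer group, and when a member fails its partitions are handed to the survivors, which resume from the last committed offsets. The sketch below uses a simple round-robin split; `assign`, the consumer names, and the offset values are all hypothetical, and this is not the assignment algorithm Kafka itself uses.

```python
def assign(partitions, consumers):
    """Spread partitions over the live consumers in round-robin order."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        if members:
            assignment[members[i % len(members)]].append(partition)
    return assignment


# Hypothetical committed offsets per partition for one consumer group.
committed = {0: 42, 1: 17, 2: 8, 3: 99}

# Two live consumers share the four partitions.
before = assign(committed, ["consumer-a", "consumer-b"])

# consumer-b fails: the survivor takes over every partition.
after = assign(committed, ["consumer-a"])

# The survivor resumes each inherited partition at its committed offset.
resume_at = {p: committed[p] for p in after["consumer-a"]}
```

Anything the failed consumer processed after its last commit may be read again by the survivor, which is why consumers in such a setup are usually written to tolerate duplicates.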
  • 20. Producer Performance https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  • 21. Consumer Performance http://kafka.apache.org/documentation.html#maximizingefficiency
  • 22. Client Libraries
      Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
      ● Go (aka golang) - Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.
      ● Python - Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.
      ● C - High performance C library with full protocol support
      ● Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
      ● Clojure - Clojure DSL for the Kafka API
      ● JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation
      Wire Protocol Developer's Guide https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
  • 23. Spring Integration
      Good blog about it https://spring.io/blog/2015/04/15/using-apache-kafka-for-integration-and-data-processing-pipelines-with-spring
      Kafka Integration Source https://github.com/spring-projects/spring-integration-kafka
      Spring XD samples https://github.com/spring-projects/spring-xd-samples/tree/master/kafka-source
  • 24. Quick Start https://kafka.apache.org/documentation.html#quickstart
      Download the 0.8.2.2 release and un-tar it.
      > tar -xzf kafka_2.10-0.8.2.2.tgz
      > cd kafka_2.10-0.8.2.2
      (use at least four terminal windows)
      > bin/zookeeper-server-start.sh config/zookeeper.properties
      > bin/kafka-server-start.sh config/server.properties
      > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
      > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
      This is a message
      This is another message
      > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
      This is a message
      This is another message
  • 25. Operationalizing Kafka https://kafka.apache.org/documentation.html#basic_ops
      Basic Kafka Operations
      ● Adding and removing topics
      ● Modifying topics
      ● Graceful shutdown
      ● Balancing leadership
      ● Checking consumer position
      ● Mirroring data between clusters
      ● Expanding your cluster
      ● Decommissioning brokers
      ● Increasing replication factor
  • 26. Running on Mesos
  • 27. Static Partitioning
  • 28. Scaling is manual (even if orchestrated)
  • 29. Static failures require manual intervention
  • 30. Application Elasticity
  • 31. An operating system for your data center
  • 32. Everything goes on Mesos
  • 33. Kafka on Mesos https://github.com/mesos/kafka
      ● Smart broker.id assignment.
      ● Preservation of broker placement (through constraints and/or new features).
      ● Ability to make configuration changes.
      ● Rolling restarts (for things like configuration changes).
      ● Scaling the cluster up and down with automatic, programmatic and manual options.
      ● Smart partition assignment via constraints vis-à-vis roles, resources and attributes.
  • 34. Kafka on Mesos
      Scheduler
      ● Provides the operational automation for a Kafka cluster.
      ● Manages the changes to the broker's configuration.
      ● Exposes a REST API for the CLI or any other client to use.
      ● Runs on Marathon for high availability.
      Executor
      ● The executor interacts with the Kafka broker as an intermediary to the scheduler.
  • 35. REST API & CLI
      ● scheduler - starts the scheduler.
      ● add - adds one or more brokers to the cluster.
      ● update - changes resources, constraints or broker properties for one or more brokers.
      ● remove - takes a broker out of the cluster.
      ● start - starts a broker up.
      ● stop - either shuts a broker down gracefully or force-kills it (./kafka-mesos.sh help stop).
      ● rebalance - rebalances a cluster, selecting either the brokers or the topics to rebalance. Manual assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor on a topic.
      ● help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command}
  • 36. Launch 20 brokers in seconds
  • 37. Kafka 0.9
      KIP (Kafka Improvement Process)
      • https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
      New Consumer
      • https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
      Security
      • https://cwiki.apache.org/confluence/display/KAFKA/Security
      • https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=51809888
      JIRA
      • https://issues.apache.org/jira/browse/KAFKA/fixforversion/12328745/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel
  • 38. Distributed RPC
  • 39. Reference Architecture
  • 40. Questions? http://www.elodina.net