Containerising
Distributed Pipes
a story about convenience
Hagen Tönnies
www.linkedin.com/in/hagen-toennies
1
Agenda
• Introduction
• A Distributed Pipe
• Tool Application Stack
• Recap
2
Unix Pipe Recap
• […] In Unix-like computer operating systems, a
pipeline is a sequence of processes chained
together by their standard streams, so that the
output of each process (stdout) feeds directly as
input (stdin) to the next one.
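A minimal sketch of the same idea in Python, assuming nothing beyond the standard library: two child processes are chained so that the stdout of the first becomes the stdin of the second. The function name `run_pipeline` and the two tiny stages (split into words, upper-case each line) are illustrative choices, not part of the talk.

```python
import subprocess
import sys


def run_pipeline(text: str) -> str:
    """Chain two processes: stdout of stage 1 feeds stdin of stage 2."""
    # Stage 1 (hypothetical): emit one word per line.
    split = subprocess.Popen(
        [sys.executable, "-c",
         "import sys\nfor w in sys.stdin.read().split(): print(w)"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    # Stage 2 (hypothetical): upper-case every line it receives.
    upper = subprocess.Popen(
        [sys.executable, "-c",
         "import sys\nfor line in sys.stdin: print(line.strip().upper())"],
        stdin=split.stdout, stdout=subprocess.PIPE, text=True,
    )
    # The parent no longer needs its copy of the connecting pipe.
    split.stdout.close()
    split.stdin.write(text)
    split.stdin.close()
    output, _ = upper.communicate()
    split.wait()
    return output


if __name__ == "__main__":
    print(run_pipeline("unix pipes compose"), end="")
```

Each stage only knows about its own stdin and stdout, which is the property the rest of the deck carries over to Kafka topics.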
3
Unix Philosophy
• Write programs that do one thing and do it well.
• Write programs to work together.
• Write programs to handle text streams,
because that is a universal interface.
(the Unix philosophy as summarised by Peter H. Salus)
4
Apache Kafka
5
Apache Kafka Cluster
Kafka Broker
6
Zookeeper
Apache Kafka Topic
older → newer
7
topic
Apache Kafka Partitions
8
Apache Kafka Replications
9
Apache Kafka Distribute
Partitions
10
Apache Kafka Producer
older → newer
Producer
11
Apache Kafka Consumer
Consumer
older → newer
12
Apache Kafka Consumer
Groups
older → newer
Group1: Consumer1, Consumer1
Group2: Consumer1, Consumer1
13
Kafka Streaming API
14
Low-Level
• Topology Builder
• Custom Aggregators
• Custom Processors
High-Level
• Stream and Table abstractions
• Simple transformations
• Simple joins of tables and streams
Kafka Streaming API
15
Stream (change-log)    Table
("alice", 1)           alice | 1
("charlie", 1)         alice | 1, charlie | 1
("alice", 2)           alice | 2, charlie | 1
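The relationship between the change-log stream and the table can be sketched in a few lines of Python. This is only an illustration of the idea, not the Kafka Streams API: `table_states` is a hypothetical helper that replays the stream and records the table after each record.

```python
from typing import Dict, Iterable, Iterator, Tuple


def table_states(changelog: Iterable[Tuple[str, int]]) -> Iterator[Dict[str, int]]:
    """Yield the table snapshot after each change-log record is applied."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        table[key] = value   # upsert: the latest value per key wins
        yield dict(table)    # copy, so earlier snapshots stay unchanged


stream = [("alice", 1), ("charlie", 1), ("alice", 2)]
snapshots = list(table_states(stream))

if __name__ == "__main__":
    for record, snapshot in zip(stream, snapshots):
        print(record, "->", snapshot)
```

Replaying the stream from the beginning always rebuilds the same table, which is why a table can be stored as nothing more than its change-log.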
Kafka featuring Unix
(split %1 "s")
16
Kafka featuring Unix
(split %1 "s")
kafka
Message Broker Stream processing job
17
Kafka featuring Unix
(split %1 "s") (sketch %1) (agg-by-key %1) (store %1)
18
Containers
• […] Operating-system-level virtualization is a
server virtualization method in which the kernel of
an operating system allows the existence of
multiple isolated user-space instances
19
Containers
• docker run -t -i IMAGE-NAME …args
• java -jar …args
20
Our Distributed Pipe
Application Stack in Docker
• Consul: discovers and configures services in your infrastructure.
• ZooKeeper: enables highly reliable distributed coordination.
• Kafka: distributed append-only log, a.k.a. message broker.
• Elasticsearch: provides a distributed full-text search engine.
21
consul:
  image: qnib/alpn-consul
  networks:
    - network
  cpu_shares: 4
  mem_limit: 1g
  environment:
    - DC_NAME=es
  ports:
    - "8500:8500"
22
zookeeper:
  image: qnib/zookeeper
  extends:
    file: base.yml
  ports:
    - "2181:2181"
ZooKeeper
23
kafka-broker:
  image: kafka:0.10.0.1
  extends:
    file: base.yml
    service: gaikai
  volumes:
    - /tmp/kafka-logs
  ports:
    - "9092:9092"
kafka
24
elasticsearch:
  image: elasticsearch:1.7
  command: "elasticsearch -Des.cluster.name=dp-es -Dnetwork.bind_host=0.0.0.0"
  environment:
    - ES_HEAP_SIZE=2g
  ports:
    - "9200:9200"
    - "9300:9300"
25
26
Stream Processing Recap
27
Power
Simplicity
kafka
28
kafka
kafka streams
Stream Processing Recap
• clj-kstream-cutter
• clj-kstream-hh
• clj-kstream-string-long-window-aggregate
• clj-kstream-elasticsearch-sink
https://github.com/sojoner
Our Distributed Tool
Application Stack in Docker
29
https://xkcd.com/297/
30
clj-kstream-cutter
Input:
Edmilson Alves 0 Edmilson Alves -LRB- born February 17 , 1976 -RRB- , is a Brazilian midfielder who currently plays for Roasso Kumamoto in the J. League Division 2 .
Output:
[ Edmilson, Alves, 0, Edmilson, Alves, LRB, born …]
31
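The clj-kstream-cutter step turns one sentence into a list of tokens. A rough Python sketch of that transformation follows. The exact tokenisation rules of clj-kstream-cutter are not shown on the slide, so the regular expression here (keep runs of letters and digits, drop everything else) is an assumption that merely reproduces the example output.

```python
import re
from typing import List


def cut(sentence: str) -> List[str]:
    """Split a sentence into alphanumeric tokens, dropping punctuation."""
    # Assumption: a token is any run of ASCII letters or digits,
    # so "-LRB-" becomes "LRB" and "," disappears.
    return re.findall(r"[A-Za-z0-9]+", sentence)


sentence = ("Edmilson Alves 0 Edmilson Alves -LRB- born February 17 , 1976 "
            "-RRB- , is a Brazilian midfielder")

if __name__ == "__main__":
    print(cut(sentence))
```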
clj-kstream-hh
Input:
[ Edmilson, Alves, 0, Edmilson, Alves, LRB, born, …]
Output:
Edmilson ~10 Alves ~8
32
33
Count Min (CM) sketch
34
CM sketch retrieval
And ~10
Bob ~7
Alice ~5
Foo ~3
Bar ~2
take top_n
retrieve sketched value
Heavy Hitter for t_1
35
Heavy Hitter
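The Count-Min sketch and the top-n retrieval can be sketched in Python as follows. This is an illustration under stated assumptions, not the clj-kstream-hh code: the class and function names are invented, the row hashes are derived from SHA-256 for simplicity, and the candidate set is a plain dictionary trimmed to `top_n` entries. The defaults (10 hash functions, 1000 buckets, top 5) mirror the values that appear in the processor on the following slides.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


class CountMinSketch:
    """Fixed-size counter table; estimates never undercount."""

    def __init__(self, depth: int = 10, width: int = 1000) -> None:
        self.depth = depth    # number of hash functions (rows)
        self.width = width    # buckets per row
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, row: int, item: str) -> int:
        # One hash function per row, derived by salting with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._bucket(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.rows[row][self._bucket(row, item)]
                   for row in range(self.depth))


def heavy_hitters(words: Iterable[str], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the top_n words by sketched count, highest first."""
    sketch = CountMinSketch()
    hitters: Dict[str, int] = {}
    for word in words:
        sketch.add(word)
        hitters[word] = sketch.estimate(word)
        if len(hitters) > top_n:
            # Evict the weakest candidate to keep memory bounded.
            del hitters[min(hitters, key=lambda w: (hitters[w], w))]
    return sorted(hitters.items(), key=lambda kv: (-kv[1], kv[0]))


words = (["And"] * 10 + ["Bob"] * 7 + ["Alice"] * 5
         + ["Foo"] * 3 + ["Bar"] * 2 + ["Baz"])

if __name__ == "__main__":
    for word, count in heavy_hitters(words):
        print(word, "~" + str(count))
```

The counts are written with a tilde on the slides for the same reason the sketch returns an estimate: hash collisions can push a count up, but never down.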
(defn- heavy-hitter-processor
  "Main stream processor; takes a configuration map, builds the topology and starts it."
  [conf]
  (let [streamBuilder (-> (new TopologyBuilder)
                          ;; Source: read string keys and values from the input topic.
                          (.addSource (:name conf) string_dser string_dser (into-array [(:input-topic conf)]))
                          ;; Processor: the heavy-hitter logic, fed by the source.
                          (.addProcessor "HeavyHitter"
                                         (reify ProcessorSupplier
                                           (get [this]
                                             (get-processor)))
                                         (into-array [(:name conf)]))
                          ;; State store: in-memory String -> Long store attached to the processor.
                          (.addStateStore
                            (->> (Stores/create storeName)
                                 (.withStringKeys)
                                 (.withLongValues)
                                 (.inMemory)
                                 (.build))
                            (into-array ["HeavyHitter"]))
                          ;; Sink: write the processor output to the output topic.
                          (.addSink
                            "Sink"
                            (:output-topic conf)
                            string_ser
                            string_ser
                            (into-array ["HeavyHitter"])))]
    (.start
      (KafkaStreams.
        streamBuilder
        (get-props conf)))))
36
clj-kstream-hh Topology
(defn ^Processor get-processor []
  (reify org.apache.kafka.streams.processor.Processor
    ;; Called once: schedule punctuation and reset the sketch state.
    (init [this context]
      (.schedule (:context @application-state) (:time-window @application-state))
      (swap! hh/state assoc
             :top-n 5
             :number-of-hashfn 10N
             :bucket-size 1000N)
      (reset! hh/hitter (priority-map))
      (reset! hh/min-sketch (make-array Integer/TYPE 10N 1000N)) …)

    ;; Called per record: update the sketch and the heavy-hitter candidates.
    (process [this key value]
      (debug "Process (k,v)::" key value)
      (hh/sketch-value value)
      (hh/add-to-hitter value) …)

    ;; Called once per scheduled time window.
    (punctuate [this timestamp …)

    (close [this]
      (.close (:store @application-state)))))
37
clj-kstream-hh Processor
38
clj-kstream-hh Container
clj-kstream-hh:
  image: sojoner/clj-kstream-hh:0.1.0
  hostname: clj-kstream-hh
  container_name: clj-kstream-hh
  extends:
    file: base.yml
    service: sojoner
  command: "--broker kafka-broker:9092 --input-topic mapped-test-json --output-topic heavy-hitters --window-size 1 --name stream-hh"
Input:
Alves ~8 Alves ~10 Edmilson ~5 Edmilson ~3
Output:
Alves ~18 Edmilson ~8
(key, value)
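The aggregation from several per-window (key, value) pairs to one total per key can be sketched in Python. This is a minimal illustration, not the windowing logic of clj-kstream-string-long-window-aggregate: the function name is invented and all records are assumed to fall into the same window.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_window(records: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the values of all records that share a key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, count in records:
        totals[key] += count
    return dict(totals)


window = [("Alves", 8), ("Alves", 10), ("Edmilson", 5), ("Edmilson", 3)]

if __name__ == "__main__":
    print(aggregate_window(window))
```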
39
clj-kstream-string-long-window-aggregate
clj-kstream-elasticsearch-sink
Input:
Alves ~18
Output:
{
"name": "Alves",
"count": 18,
"time": "January 26th 2017, 17:03:00.000"
}
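The shape of the document handed to Elasticsearch can be sketched in Python. This is only a sketch: `to_document` is a hypothetical helper, it does not talk to Elasticsearch, and it writes the time as an ISO 8601 string, whereas the slide shows a human-readable rendering of the same instant.

```python
import json
from datetime import datetime, timezone


def to_document(name: str, count: int, when: datetime) -> str:
    """Serialise one aggregated (key, value) pair as a JSON document."""
    return json.dumps({
        "name": name,
        "count": count,
        "time": when.isoformat(),  # assumption: ISO 8601 timestamps
    })


doc = to_document("Alves", 18, datetime(2017, 1, 26, 17, 3, tzinfo=timezone.utc))

if __name__ == "__main__":
    print(doc)
```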
40
Our Distributed Tool
Application Stack in Docker
Edmilson Alves 0 Edmilson Alves -LRB- born February 17 , 1976 -RRB- , is
a Brazilian midfielder who currently plays for Roasso Kumamoto in the J.
League Division 2 .
[ Edmilson, Alves, 0, Edmilson, Alves, LRB, born …]
Alves ~10 Alves ~8
Alves ~18
{"name": "Alves", "count": 18, "time": "January 26th 2017, 17:03:00.000"}
41
Our Distributed Tool
Application Stack in Docker
42
From the development
setup…
$ export DOCKER_HOST=tcp://my.desktop.de:2576
43
…to a datacenter setup
$ export DOCKER_HOST=tcp://my.datacenter.de:2576
Build a Docker Swarm
44
Current Challenges
• Kafka Streams still offers only at-least-once processing
• Persistent and durable storage
• Still need capacity planning
• Testing / debugging is still a challenge
• Consistency of the state storage
• Processing time vs. event time
• How about Amdahl's law?
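For the last point, Amdahl's law bounds the speedup from adding workers when part of the job stays sequential. A small Python sketch of the formula follows. The function name and the 90 % figure are illustrative and not measurements from this stack.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup = 1 / ((1 - p) + p / n) for parallel fraction p and n workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical pipeline where 90 % of the work parallelises.
    for n in (1, 10, 100):
        print(n, "workers ->", round(amdahl_speedup(0.9, n), 2), "x")
```

With a 10 % sequential share the speedup can never exceed 10x, however many partitions or containers are added.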
45
Recap
(split %1 "s")
46
as Message broker as Stream Processor
(split %1 "s")
Containerising
Distributed Pipes
a story about convenience
(thanks (listening [this]))
47
Sources 1
• http://kafka.apache.org/
• https://martin.kleppmann.com/2015/05/06/data-agility-at-strata.html
• https://speakerdeck.com/ept/kafka-and-samza-distributed-stream-processing-in-practice
• https://github.com/mhausenblas/dnpipes
• https://en.wikipedia.org/wiki/Pipeline_%28Unix%29
• https://zookeeper.apache.org/doc/trunk/zookeeperOver.html
• https://github.com/sojoner/container-stacks/tree/master/kafkaelasticsearch
48
Sources 2
• https://kafka.apache.org/documentation/streams#streams_processor
• https://kafka.apache.org/documentation/streams#streams_dsl
• https://hub.docker.com/r/sojoner/clj-kstream-elasticsearch-sink/
• https://hub.docker.com/r/sojoner/clj-kstream-cutter/
• https://hub.docker.com/r/sojoner/clj-kstream-hh/
• https://hub.docker.com/r/sojoner/clj-kstream-string-long-window-aggregate/
• https://blog.acolyer.org/2016/07/21/time-adaptive-sketches-ada-sketches-for-summarizing-data-streams/
49
