Real-Time Data Processing Pipeline &
Visualization with Docker, Spark, Kafka
and Cassandra
Roberto G. Hashioka – 2016-10-04 – TIAD – Paris
Personal Information
• Roberto Gandolfo Hashioka
• @rogaha (GitHub) and @rhashioka (Twitter)
• Finance -> Software Engineer
• Growth & Data Engineer at Docker
Summary
• Background / Motivation
• Project Goals
• How to build it?
• DEMO
Background
• Gather data from multiple sources and process it in “real time”
• Transform raw data into meaningful, useful information that supports more effective
decision-making
• Provide more visibility into trends on: 1) user behavior 2) feature engagement 3) opportunities
for future investments
• Data transparency and standardization
Project Goals
• Create a data processing pipeline that can handle a large number of events per second
• Automate the development environment: Docker Compose
• Automate remote machine management: Docker for AWS / Docker Machine
• Reduce time to market and development time: new hires / new features
Project / Language Stack
How to build it?
• Step 1: Install Docker for Mac/Win and dockerize all the applications
link: https://www.docker.com/products/docker
Dockerfile example
-----------------------------------------------------------------------------------------------------------
FROM ubuntu:14.04
MAINTAINER Roberto Hashioka (roberto@docker.com)
RUN apt-get update && apt-get install -y nginx
RUN echo "Hello World! #TIAD" > /usr/share/nginx/html/index.html
EXPOSE 80
# Run nginx in the foreground so the container keeps running
CMD ["nginx", "-g", "daemon off;"]
------------------------------------------------------------------------------------------------------------
$ docker build -t rogaha/web_demotiad2016 .
$ docker run -d -p 80:80 --name web_demotiad2016 rogaha/web_demotiad2016
How to build it?
• Step 2: Define your services stack with a docker-compose file
Docker Compose
version: "2"
services:
  web:
    build: .
    command: python app.py
    ports:
      - "5000:5000"
    volumes:
      - .:/code
    links:
      - redis
    environment:
      - PYTHONUNBUFFERED=1
  redis:
    image: redis:latest
    command: redis-server --appendonly yes
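The Compose file runs `python app.py` for the `web` service, but the slides do not show that file. A minimal sketch of what such an app might look like follows, using only the Python standard library. It assumes a hit counter as the workload and keeps the count in memory; in the stack above that state would presumably live in the linked `redis` service instead. The names `HitCounter`, `make_handler` and `serve` are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HitCounter:
    """Thread-safe in-memory counter (stand-in for the linked redis service)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._hits = 0

    def incr(self):
        with self._lock:
            self._hits += 1
            return self._hits


def make_handler(counter):
    """Build a request handler class bound to the given counter."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = f"Hello World! Seen {counter.incr()} times.\n".encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep container logs quiet in this sketch

    return Handler


def serve(host="0.0.0.0", port=5000):
    """Listen on the port published by the Compose file ("5000:5000")."""
    HTTPServer((host, port), make_handler(HitCounter())).serve_forever()
```

Calling `serve()` at the bottom of the file would start the server on port 5000. Binding to `0.0.0.0` rather than `127.0.0.1` matters inside a container, because the published port is forwarded to the container's network interface, not to its loopback.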
How to build it?
• Step 3: Test the applications locally from your laptop using containers
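One simple local check, once the containers are started (for example with `docker-compose up`), is to poll the published port until the service answers. A minimal sketch, assuming the `web` service from the Compose example is reachable at `http://localhost:5000/`; the helper name `wait_until_up` and its parameters are hypothetical and not part of the project.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url, timeout=30.0, interval=0.5, fetch=None):
    """Poll url until it answers with HTTP 200 or the timeout expires.

    Returns True if the service came up in time, False otherwise.
    `fetch` is injectable for testing; by default it performs a real GET.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=interval + 1) as resp:
                return resp.status

    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # container not accepting connections yet; retry
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test script could call `wait_until_up("http://localhost:5000/")` before running its assertions, so that it does not fail just because a container is still starting.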
How to build it?
• Step 4: Provision your remote servers and deploy your containers
How to build it?
• Step 5: Scale your services with Docker Swarm
DEMO
source code: https://github.com/rogaha/data-processing-pipeline
Open Source Projects Used
• Docker (https://github.com/docker/docker)
  • An open platform for distributed applications for developers and sysadmins
• Apache Spark / Spark SQL (https://github.com/apache/spark)
  • A fast, in-memory data processing engine. Spark SQL lets you query structured data inside Spark programs using SQL or the DataFrame API
• Apache Kafka (https://github.com/apache/kafka)
  • A fast and scalable pub-sub messaging service
• Apache ZooKeeper (https://github.com/apache/zookeeper)
  • A distributed configuration service, synchronization service, and naming registry for large distributed systems
• Apache Cassandra (https://github.com/apache/cassandra)
  • A scalable, highly available, distributed wide-column NoSQL database
• D3 (https://github.com/mbostock/d3)
  • A JavaScript visualization library for HTML and SVG.
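To make the division of labour between these projects concrete, here is a toy, in-process sketch of the data flow the talk title implies: events arrive on a queue (standing in for a Kafka topic), are aggregated in small batches (standing in for a Spark job), are written to a keyed table (standing in for Cassandra), and are exported as JSON that a D3 chart could read. Everything here is an assumption for illustration: the event fields, the function names, and the per-minute count are not taken from the project's source code, and no real Kafka, Spark or Cassandra API is used.

```python
import json
import queue
from collections import Counter


def drain(topic, max_events):
    """Pull up to max_events from the queue without blocking (one micro-batch)."""
    batch = []
    while len(batch) < max_events:
        try:
            batch.append(topic.get_nowait())
        except queue.Empty:
            break
    return batch


def aggregate(batch):
    """Count events per (minute, event type); the 'processing' step."""
    counts = Counter()
    for event in batch:
        minute = event["ts"] - event["ts"] % 60  # truncate epoch seconds to the minute
        counts[(minute, event["type"])] += 1
    return counts


def upsert(table, counts):
    """Add batch counts to the stored totals, keyed like a (minute, type) row."""
    for key, n in counts.items():
        table[key] = table.get(key, 0) + n


def to_chart_json(table):
    """Serialize rows in a shape a front-end chart could consume."""
    rows = [{"minute": m, "type": t, "count": n} for (m, t), n in sorted(table.items())]
    return json.dumps(rows)


# Hypothetical events: epoch seconds plus an event type.
topic = queue.Queue()
for ev in [{"ts": 60, "type": "pull"}, {"ts": 75, "type": "pull"},
           {"ts": 90, "type": "push"}, {"ts": 130, "type": "pull"}]:
    topic.put(ev)

table = {}
while True:
    batch = drain(topic, max_events=2)
    if not batch:
        break
    upsert(table, aggregate(batch))

print(to_chart_json(table))
```

Keeping each stage behind its own function mirrors why the real stack runs each component in its own container: a stage can be replaced or scaled without touching the others, as long as the data shape between stages stays the same.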
Thanks!
Questions?
@rhashioka
