Stream processing and visualization for transaction investigation
Using Kafka, Spark, and D3.js
Ben Laird
Capital One Labs
C1 Labs
Data
Science
About me
Cornell Engineering ’07
BS, Operations Research
Johns Hopkins ’12
MS, Applied Math
• Data Engineer
• Northrop Grumman
• IBM
• Space Debris Tracking
• NLP of intel documents
• Counter-IED GIS analysis
Cornell expectations
Cornell reality
Now: Data Scientist at Capital One Labs
A technical challenge: Build a dynamic, rich visualization of large, streaming data
Normally, we have two options
• Small data: easy visualization
• Big data: no visualization
Data Science: More than just Hadoop
• Understanding all the requirements of your problem, and the architecture that meets those demands, is ever more important for a data scientist
• A data processing solution doesn’t matter if the browser takes an hour to load.
• Visualization doesn’t matter if there is no way to process/store the data
Stream Handling → Stream Processing → Intermediate Storage → Web Server/Framework → Event-Based Comm → Browser Viz
Our system must be able to process and visualize a real-time transaction stream
• Requirement: System must handle 1B+ transactions
• Loading 1B records on the client side isn’t feasible
• Our data is not only big, it is live.
• Assume a stream of 50 records/second
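A quick back-of-the-envelope check of these numbers, as a sketch: the rate and the 1B target come from the slide, while the variable names are only illustrative. At 50 records/second the stream produces about 4.3 million records per day, so 1B records corresponds to roughly 231 days of traffic.

```scala
// Back-of-the-envelope sizing for the stream described above.
// The rate and the 1B target come from the slide; the names are illustrative.
val recordsPerSecond = 50L
val secondsPerDay = 24L * 60 * 60                     // 86,400
val recordsPerDay = recordsPerSecond * secondsPerDay  // 4,320,000
val targetRecords = 1000000000L                       // 1B transactions
val daysToReachTarget = targetRecords.toDouble / recordsPerDay

println(s"records per day: $recordsPerDay")
println(f"days to accumulate 1B records: $daysToReachTarget%.1f")
```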
Proposed solution: Use existing big data tools to
process stream before web stack
Tool                     Purpose
Apache Kafka             Distributed messaging for the transaction stream
Apache Spark Streaming   Distributed processing of the transaction stream; aggregates to levels the browser can handle
MongoDB                  Intermediate storage in a capped collection for web server access
Node.js                  Server-side framework for the web server and Mongo interaction
Socket.io                Event-based communication: passes new data from the stream into the browser
Crossfilter              Client-side data index
DC.js/D3.js              D3.js graphics and integration with Crossfilter
How/Why did I pick these for our architecture?
A foray into data visualization tools
From the beautiful: Minard Map, 1869
Source: http://www.edwardtufte.com/tufte/minard
to the ‘not beautiful’
Sources: http://www.excelcharts.com/, http://www.datavis.ca/gallery/evil-pies.php
With most solutions, you face a trade-off between ease of use and flexibility
• If you need a quick solution or don’t need full control or customization, there are fantastic options
• Tableau
• Elasticsearch Kibana
D3.js provides an extremely powerful way of joining data with completely custom graphics
Limitless possibilities. Complete control over data and viz. Not trivial to use, though!
Bind data directly to elements in the DOM. Create graphics from scratch
http://bl.ocks.org/mbostock/7341714
All about finding the right level of abstraction. Introducing DC.js
• You don’t always want to construct bar charts from the ground up.
• Building axes and ticks and setting colors, scales, bar widths, heights, and projections is sometimes too tedious
• DC.js adds a thin layer on top of d3.js to construct most chart types and to link charts together for fast filtering.
DC.js combines d3.js with Square’s crossfilter
• Built by Square
• JavaScript library for very fast (<50ms) filtering of multi-dimensional datasets
• Developed for transaction analysis (Perfect!)
• Very fast sorting and filtering
• Downside: Only practical up to a couple million records.
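The idea behind coordinated filtering can be sketched in plain Scala. This is a conceptual illustration only and is not the Crossfilter API: the `Purchase` type and the sample records are made up for the example. Filtering on one dimension (zip code) changes what a grouping on another dimension (merchant) reports, which is what lets linked charts update together.

```scala
// Conceptual sketch only: this is NOT the Crossfilter API.
// The Purchase type and the sample data are hypothetical.
final case class Purchase(merchant: String, zip: String, amount: Double)

val purchases = Seq(
  Purchase("coffee", "22102", 4.50),
  Purchase("coffee", "20001", 3.75),
  Purchase("grocer", "22102", 82.10),
  Purchase("grocer", "22102", 15.00),
  Purchase("fuel", "20001", 40.00)
)

// "Dimension" filter: keep only the purchases in one zip code.
val inZip = purchases.filter(_.zip == "22102")

// "Group" on a second dimension: purchase count per merchant within the filter.
val countByMerchant: Map[String, Int] =
  inZip.groupBy(_.merchant).map { case (merchant, ps) => merchant -> ps.size }

println(countByMerchant)
```

This naive version rescans the whole dataset on every filter change. Crossfilter avoids that by maintaining sorted indexes and updating groups incrementally, which is where its speed comes from.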
Need some backend processing to aggregate data before we hit the web stack
Transaction Stream: Apache Kafka
• Developed by LinkedIn
• Fast, scalable publish-subscribe messaging service that runs on a distributed cluster
Transaction Processing: Apache Spark Streaming
• Part of the larger Apache Spark compute engine
• Fast, in-memory stream processing over sliding windows
• Handles data aggregation steps
• Can be used to run ML algorithms
What is Apache Spark?
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Source: http://spark-summit.org/wp-content/uploads/2013/10/McDonough-spark-tutorial_spark-summit-2013.pdf
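The split between transformations and actions can be illustrated without a cluster. The sketch below uses a plain Scala collection view as an analogy and does not use Spark itself: chaining `filter` and `map` only describes the computation, and nothing runs until a result is demanded. In Spark, transformations on an RDD are deferred in a similar way until an action such as `count` or `collect` is called. The counter variable is an assumption added purely to make the deferral visible.

```scala
// Analogy only: plain Scala views, not Spark RDDs.
var evaluated = 0  // counts how many times the filter predicate has run

// "Transformations": building the pipeline does no work yet.
val pipeline = (1 to 10).view
  .filter { n => evaluated += 1; n % 2 == 0 }
  .map(n => n * n)

val beforeAction = evaluated  // still 0: nothing has been computed

// "Action": asking for a concrete result triggers the computation.
val result = pipeline.toList
val afterAction = evaluated   // now non-zero: the predicate ran while the list was built

println(s"before: $beforeAction, after: $afterAction, result: $result")
```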
Word Count in Spark vs Java MapReduce
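// Read the corpus, one line per record
scala> val rdd = sc.textFile("all_text_corpus.txt")
scala> val allWords = rdd.flatMap(sentence => sentence.split(" "))
// Pair each word with 1, then sum the counts per word
scala> val counts = allWords.map(word => (word, 1)).reduceByKey(_ + _)
// Swap to (count, word), sort descending, swap back, take the top 25
scala> counts.map { case (k, v) => (v, k) }
.sortByKey(ascending = false)
.map { case (v, k) => (k, v) }.take(25)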
Array(("",70230), (the,63641), (and,38896), (of,34986), (to,31743), (a,22481),
(in,18710), (his,14712), (was,13963), (that,13735), (he,13588), (I,11761),
(with,11308), (had,9303), (her,8429), (not,7900), (as,7641), (it,7626), (for,7619),
(at,7574), (on,7350), (is,6383), (you,6173), (be,5525), (by,5315))
Word Count in Spark vs Java MapReduce
Transaction Aggregation with Spark
Batch up incoming transactions every 30 seconds, and compute the average transaction size and total number of transactions for each merchant and zip code over a 5-minute sliding window. Write the batched results to MongoDB
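The windowing logic described above can be sketched in plain Scala. This is not the Spark Streaming API and not the code from the talk: the `Transaction` and `Stats` types, the field names, and the sample data are all assumptions for illustration. Every 30 seconds, the sketch recomputes the count and average amount per (merchant, zip code) over the preceding 5 minutes.

```scala
// Plain-Scala sketch of the windowed aggregation; not the Spark Streaming API.
// Types, field names, and sample data are assumptions for illustration.
final case class Transaction(epochSec: Long, merchant: String, zip: String, amount: Double)
final case class Stats(count: Long, avgAmount: Double)

val batchIntervalSec = 30L   // a new batch every 30 seconds
val windowLengthSec = 300L   // 5-minute sliding window

// Aggregate every transaction in (windowEnd - 300, windowEnd] per (merchant, zip).
def windowStats(txns: Seq[Transaction], windowEndSec: Long): Map[(String, String), Stats] =
  txns
    .filter(t => t.epochSec > windowEndSec - windowLengthSec && t.epochSec <= windowEndSec)
    .groupBy(t => (t.merchant, t.zip))
    .map { case (key, group) =>
      key -> Stats(group.size.toLong, group.map(_.amount).sum / group.size)
    }

val sample = Seq(
  Transaction(10L, "coffee", "22102", 4.0),
  Transaction(40L, "coffee", "22102", 6.0),
  Transaction(45L, "grocer", "20001", 80.0),
  Transaction(320L, "coffee", "22102", 5.0)
)

// One result per batch interval; each would be written out as a single batch.
val windowEnds = batchIntervalSec to 330L by batchIntervalSec
val results = windowEnds.map(end => end -> windowStats(sample, end))

results.foreach { case (end, stats) => println(s"t=$end -> $stats") }
```

In Spark Streaming, the same shape is usually expressed with a keyed stream and a windowed reduce, so that the per-key sums and counts are maintained by the engine rather than recomputed from scratch.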
MongoDB for intermediate storage
• Use a capped collection to find the last element immediately.
• No costly O(N) or worse searches.
• Tap into Mongo with Node.js
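Why a capped collection makes the latest result cheap to find can be sketched with a fixed-size ring buffer. This is a conceptual model only and not the MongoDB driver API: the `CappedBuffer` class is hypothetical, and it is sized by element count, whereas a real capped collection is sized in bytes. What it shares with a capped collection is that entries keep insertion order, the oldest entry is overwritten once the buffer is full, and the newest entry is reachable without a scan.

```scala
// Conceptual sketch of capped-collection behaviour; not the MongoDB driver API.
// CappedBuffer is a hypothetical class, sized by element count for simplicity.
final class CappedBuffer[A](capacity: Int) {
  require(capacity > 0, "capacity must be positive")
  private val slots = new Array[Any](capacity)
  private var written = 0L  // total number of inserts so far

  // Insert in arrival order, overwriting the oldest entry once full.
  def insert(item: A): Unit = {
    slots((written % capacity).toInt) = item
    written += 1
  }

  // O(1) access to the most recent insert, with no scan.
  def last: Option[A] =
    if (written == 0) None
    else Some(slots(((written - 1) % capacity).toInt).asInstanceOf[A])

  // Number of entries currently retained (never more than capacity).
  def size: Int = math.min(written, capacity.toLong).toInt
}

val buffer = new CappedBuffer[String](3)
Seq("batch-1", "batch-2", "batch-3", "batch-4").foreach(buffer.insert)

println(s"latest: ${buffer.last}, retained: ${buffer.size}")
```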
Node.js and Socket.io for server side updates
• Add socket.io listener in client side javascript
Demo!

Real time data viz with Spark Streaming, Kafka and D3.js
