Real time data viz with Spark Streaming, Kafka and D3.js

Stream processing and visualization for transaction investigation
Using Kafka, Spark, and D3.js
Ben Laird
Capital One Labs

C1 Labs
Data
Science
About me
Cornell Engineering ’07
BS, Operations Research
Johns Hopkins ’12
MS, Applied Math
• Data Engineer
• Northrop Grumman
• IBM
• Space Debris Tracking
• NLP of intel documents
• Counter-IED GIS analysis
Cornell expectations
Cornell reality

Now: Data Scientist at Capital One Labs

A technical challenge: Build a dynamic, rich visualization of large, streaming data
Normally, we have two options
• Small data: easy visualization
• Big data: no visualization

Data Science: More than just Hadoop
• Understanding all the requirements of your problem, and the architecture that meets those demands, is ever more important for a data scientist
• A data processing solution doesn’t matter if you have a 1 hr load time in the browser
• Visualization doesn’t matter if there is no way to process/store the data
Stream Handling → Stream Processing → Intermediate Storage → Web Server/Framework → Event Based Comm → Browser Viz

Our system must be able to process and visualize a real time transaction stream
• Requirement: System must handle 1B+ transactions
• Loading 1B records on the client side isn’t feasible
• Our data is not only big, it is live.
• Assume a stream of 50 records/second

Proposed solution: Use existing big data tools to process stream before web stack
Tool                     Purpose
Apache Kafka             Distributed messaging for transaction stream
Apache Spark Streaming   Distributed processing of transaction stream; aggregate to levels that can be handled by browser
MongoDB                  Intermediate storage in capped collection for web server access
Node.js                  Server-side framework for web server and Mongo interaction
Socket.io                Event-based communication: pass new data from stream into browser
Crossfilter              Client-side data index
DC.js/D3.js              D3.js graphics and integration with Crossfilter
How/Why did I pick these for our architecture?

A foray into data visualization tools
From the beautiful: Minard Map, 1869
Source: http://www.edwardtufte.com/tufte/minard

to the ‘not beautiful’
Sources: http://www.excelcharts.com/, http://www.datavis.ca/gallery/evil-pies.php

With most solutions, you face a trade-off between ease of use and flexibility
• If you need a quick solution or don’t need full control or customization, there are fantastic options:
• Tableau
• Elasticsearch with Kibana

D3.js provides an extremely powerful way of joining data with completely custom graphics
Limitless possibilities. Complete control over data and viz. Not trivial to use though!

Bind data directly to elements in the DOM. Create graphics from scratch
http://bl.ocks.org/mbostock/7341714
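To make the idea of binding data to elements concrete, here is a minimal sketch of a keyed data join in plain JavaScript. It does not call D3; it only mirrors the enter / update / exit split that D3's data join is built around. The function name `dataJoin` and the sample merchants are hypothetical, chosen for illustration.

```javascript
// Split incoming data into enter / update / exit sets by key.
// Plain JS illustration of the data-join idea; no D3 calls are made.
function dataJoin(boundKeys, data, keyOf) {
  const incoming = new Set(data.map(keyOf));
  const bound = new Set(boundKeys);
  return {
    enter: data.filter((d) => !bound.has(keyOf(d))),  // data with no element yet
    update: data.filter((d) => bound.has(keyOf(d))),  // data whose element exists
    exit: boundKeys.filter((k) => !incoming.has(k)),  // elements with no data left
  };
}

// Hypothetical example: bars on screen are keyed by merchant name.
const onScreen = ["Acme", "Globex"];
const latest = [
  { merchant: "Globex", total: 120 },
  { merchant: "Initech", total: 75 },
];

const joined = dataJoin(onScreen, latest, (d) => d.merchant);
console.log(joined);
```

In a real page, the enter set would get new DOM elements, the update set would have attributes refreshed, and the exit set would be removed.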

All about finding the right level of abstraction. Introduce DC.js
• Don’t always want to construct bar charts from the ground up.
• Building axes, ticks, colors, scales, bar widths, heights, projections... is sometimes too tedious
• DC.js adds a thin layer on top of D3.js to construct most chart types and to link charts together for fast filtering.

DC.js combines D3.js with Square’s Crossfilter
• Built by Square
• JavaScript library for very fast (<50ms) filtering of multi-dimensional datasets
• Developed for transaction analysis (Perfect!)
• Very fast sorting and filtering
• Downside: Only practical up to a couple million records.
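One way to see why filtering on a dimension can be this fast: if record indices are kept sorted by the dimension's value, a range filter becomes two binary searches instead of a full scan. The sketch below shows that idea in plain JavaScript. It is not Crossfilter's API or implementation; `makeDimension`, `bisectLeft`, `filterRange`, and the sample records are hypothetical names and data for illustration.

```javascript
// Hypothetical sample records; field names are illustrative only.
const transactions = [
  { merchant: "A", amount: 12.5 },
  { merchant: "B", amount: 80 },
  { merchant: "C", amount: 5 },
  { merchant: "D", amount: 40 },
  { merchant: "E", amount: 200 },
];

// Build a "dimension": record indices sorted by one numeric field.
function makeDimension(records, accessor) {
  const order = records
    .map((_, i) => i)
    .sort((a, b) => accessor(records[a]) - accessor(records[b]));
  const values = order.map((i) => accessor(records[i]));
  return { order, values };
}

// First position in a sorted array whose value is >= target.
function bisectLeft(sorted, target) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Half-open range filter [min, max): two binary searches, then a slice.
function filterRange(records, dim, min, max) {
  const start = bisectLeft(dim.values, min);
  const end = bisectLeft(dim.values, max);
  return dim.order.slice(start, end).map((i) => records[i]);
}

const byAmount = makeDimension(transactions, (t) => t.amount);
const midRange = filterRange(transactions, byAmount, 10, 100);
console.log(midRange.map((t) => t.merchant));
```

The sort is paid once up front, which fits a browser holding up to a couple million pre-aggregated records but not the raw stream.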

Need some backend processing to aggregate data before we hit the web stack
Transaction Stream: Apache Kafka
• Developed by LinkedIn
• Fast, scalable publish-subscribe messaging service that runs on a distributed cluster
Transaction Processing: Apache Spark Streaming
• Part of the larger Apache Spark compute engine
• Fast, in-memory stream processing over sliding windows
• Handles data aggregation steps
• Can be used to run ML algorithms

What is Apache Spark?
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Source: http://spark-summit.org/wp-content/uploads/2013/10/McDonough-spark-tutorial_spark-summit-2013.pdf

Word Count in Spark vs Java MapReduce
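// Load the corpus as an RDD of lines
scala> val rdd = sc.textFile("all_text_corpus.txt")
// Split each line on spaces to get individual words
scala> val allWords = rdd.flatMap(sentence => sentence.split(" "))
// Pair each word with 1, then sum the counts per word
scala> val counts = allWords.map(word => (word, 1)).reduceByKey(_ + _)
// Swap to (count, word), sort descending, swap back, take the top 25
scala> counts.map { case (k, v) => (v, k) }
         .sortByKey(ascending = false)
         .map { case (v, k) => (k, v) }.take(25)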
Array(("",70230), (the,63641), (and,38896), (of,34986), (to,31743), (a,22481),
(in,18710), (his,14712), (was,13963), (that,13735), (he,13588), (I,11761),
(with,11308), (had,9303), (her,8429), (not,7900), (as,7641), (it,7626), (for,7619),
(at,7574), (on,7350), (is,6383), (you,6173), (be,5525), (by,5315))

Transaction Aggregation with Spark
Batch up incoming transactions every 30 seconds, and compute the average transaction size and total number of transactions for every merchant and zip code over a 5 min sliding window. Write the batched results to MongoDB.
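The windowing logic described here can be sketched in plain JavaScript. This is not the Spark job from the talk (Spark Streaming provides windowed operations such as `reduceByKeyAndWindow` for this); it only illustrates the bookkeeping. At the assumed 50 records/second, a 30-second batch holds about 1,500 records and a 5-minute window spans 10 batches, or about 15,000 records. The class name `SlidingAggregator`, the field names `merchant`, `zip`, and `amount`, and the choice to key on the (merchant, zip) pair are assumptions for illustration.

```javascript
const BATCH_SECONDS = 30;
const WINDOW_SECONDS = 5 * 60;
const BATCHES_PER_WINDOW = WINDOW_SECONDS / BATCH_SECONDS; // 10

class SlidingAggregator {
  constructor(batchesPerWindow) {
    this.batchesPerWindow = batchesPerWindow;
    this.batches = []; // per-batch summaries, oldest first
  }

  // Summarize one batch, slide the window, and return the window totals.
  addBatch(transactions) {
    const summary = new Map();
    for (const t of transactions) {
      const key = JSON.stringify([t.merchant, t.zip]); // assumed grouping key
      const s = summary.get(key) ||
        { merchant: t.merchant, zip: t.zip, count: 0, total: 0 };
      s.count += 1;
      s.total += t.amount;
      summary.set(key, s);
    }
    this.batches.push(summary);
    if (this.batches.length > this.batchesPerWindow) this.batches.shift();
    return this.windowResult();
  }

  // Merge the per-batch summaries currently inside the window.
  windowResult() {
    const merged = new Map();
    for (const batch of this.batches) {
      for (const [key, s] of batch) {
        const m = merged.get(key) ||
          { merchant: s.merchant, zip: s.zip, count: 0, total: 0 };
        m.count += s.count;
        m.total += s.total;
        merged.set(key, m);
      }
    }
    return [...merged.values()].map((m) => ({
      merchant: m.merchant,
      zip: m.zip,
      count: m.count,
      avgAmount: m.total / m.count,
    }));
  }
}

// One aggregator for the 5 min window; call addBatch every 30 seconds.
const aggregator = new SlidingAggregator(BATCHES_PER_WINDOW);
console.log(aggregator.addBatch([
  { merchant: "Acme", zip: "22102", amount: 10 },
  { merchant: "Acme", zip: "22102", amount: 30 },
]));
```

Keeping one small summary per batch means the window can slide by dropping the oldest summary rather than re-reading raw transactions, and each returned array is small enough to hand to the web stack.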

MongoDB for intermediate storage
• Use a capped collection to immediately find the last element.
• No costly O(N) or worse searches.
• Tap into Mongo with Node.js
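A capped collection is fixed in size, keeps documents in insertion order, and overwrites the oldest documents once it is full. The sketch below imitates that behavior with an in-memory ring buffer in plain JavaScript to show why reading the newest element needs no search. It does not use MongoDB or its driver; `CappedBuffer` and its methods are hypothetical names for illustration.

```javascript
// In-memory stand-in for capped-collection behavior (not MongoDB itself).
class CappedBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.slots = new Array(capacity);
    this.next = 0; // slot the next insert will write to
    this.size = 0; // number of documents currently stored
  }

  // Insert a document, overwriting the oldest one when full.
  insert(doc) {
    this.slots[this.next] = doc;
    this.next = (this.next + 1) % this.capacity;
    if (this.size < this.capacity) this.size += 1;
  }

  // Newest document in constant time: it sits just before `next`.
  last() {
    if (this.size === 0) return undefined;
    return this.slots[(this.next - 1 + this.capacity) % this.capacity];
  }

  // All stored documents, oldest to newest.
  toArray() {
    const start = (this.next - this.size + this.capacity) % this.capacity;
    const out = [];
    for (let i = 0; i < this.size; i++) {
      out.push(this.slots[(start + i) % this.capacity]);
    }
    return out;
  }
}

// Hypothetical usage: keep only the three most recent batch results.
const recent = new CappedBuffer(3);
for (let batchId = 1; batchId <= 5; batchId++) {
  recent.insert({ batchId });
}
console.log(recent.last(), recent.toArray());
```

The fixed size also bounds storage: old aggregates age out on their own, so the web server only ever sees recent results.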

Node.js and Socket.io for server-side updates
• Add a Socket.io listener in the client-side JavaScript
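The event-based pattern here is: the server emits a named event whenever a new batch of aggregates lands, and a listener on the client reacts to it. The sketch below models that flow with Node's built-in `EventEmitter` as a stand-in, so it runs without a network or the Socket.io package. The event name `newAggregates`, the function `publishBatch`, and the sample rows are assumptions for illustration, not names from the talk.

```javascript
const { EventEmitter } = require("events");

// Stand-in for the server-to-browser channel. Socket.io exposes the same
// on/emit style, but this sketch uses only Node's built-in EventEmitter.
const channel = new EventEmitter();

// "Client" side: register a listener and keep the rows received so far.
const clientRows = [];
channel.on("newAggregates", (rows) => {
  clientRows.push(...rows);
  // A real page would add the rows to its index and redraw charts here.
});

// "Server" side: push each freshly stored batch to every listener.
function publishBatch(rows) {
  channel.emit("newAggregates", rows);
}

publishBatch([{ merchant: "Acme", zip: "22102", count: 3, avgAmount: 20 }]);
publishBatch([{ merchant: "Globex", zip: "10001", count: 1, avgAmount: 50 }]);
console.log(clientRows.length);
```

Pushing on each new batch means the browser never polls; it only does work when there is new data to draw.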
