Sessionization
with Spark streaming
Ramūnas Urbonas
@ Platform Lunar
Disclosure
• This work was implemented at Adform
• Thanks to the Hadoop team for permission and help
History
• Original idea from Ted Malaska, 2014
How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
• Hands-on implementation in 2016 at Adform
The Problem
• Constant flow of page visits
110 GB per day on average, with volume variations and catch-up scenarios
• Wait for session interrupts
Timeout, specific action, midnight, sanity checks (sketched below)
• Calculate session duration, length, and reaction times
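A minimal sketch of such interrupt rules, assuming epoch-second timestamps (as in the checkpoint sample later in this deck); the Record fields, the 30-minute timeout, and the "logout" action are illustrative assumptions, not the production rules:

case class Record(userId: String, timestamp: Long, page: String)

val sessionTimeoutSec = 30 * 60 // assumed inactivity timeout

def crossesMidnight(a: Long, b: Long): Boolean = {
  import java.time.{Instant, ZoneOffset}
  Instant.ofEpochSecond(a).atZone(ZoneOffset.UTC).toLocalDate !=
    Instant.ofEpochSecond(b).atZone(ZoneOffset.UTC).toLocalDate
}

def isInterrupt(prev: Record, next: Record): Boolean =
  next.timestamp - prev.timestamp > sessionTimeoutSec || // timeout
    prev.page == "logout" ||                             // specific action (assumed)
    crossesMidnight(prev.timestamp, next.timestamp)      // midnight boundary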
The Problem
• Constant ingress / egress
One car enters, a car trailer exits
A join for every incoming car
• Some cars loop for hours
• Uncontrollable loop volume
Stream / Not
• Still not 100% sure if it’s worth streaming
People still frown when this topic is brought up
• More frequent ingress means less effective join
Is a 2-minute ingress period still streaming? :)
• Another degree of complexity
Cons
• More complex application
Just like cars: a ride to work vs. travelling to Portugal
• Steady pace is required
Throttling is mandatory, volume control is essential, and GC must be well tuned
• Permanently reserved resources
Pros
• Fun
If this one is on your list, you should probably not do it :)
• Speed
This is “result speed”. Do you actually need it?
• Stability
You have to work really hard to get this benefit
Extra context
• User data is partitioned by nature
User ID (range) is the obvious partition key
It helps us control ingress size and, most importantly, loop volume
• Loop volume is hard to control
Average flow was around 150 MB, while the loop varied from 2 to 8 GB
Algorithm
[diagram] ingress + stored state → updateStateByKey / join
[diagram] joined sessions → decision: complete → calculate results; incomplete → store for later
Copy & Paste
• Ted's solution relies on updateStateByKey (sketched below)
This method requires checkpointing
• Checkpoints
Are good only on paper
They are meant for soft recovery
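For reference, a minimal sketch of that approach, assuming events is a DStream[Record]; the update function and checkpoint path are illustrative, not the original code:

import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(2))
ssc.checkpoint("hdfs:///checkpoints/sessionization") // mandatory for stateful operators

def updateSessions(incoming: Seq[Record],
                   state: Option[List[Record]]): Option[List[Record]] = {
  val all = state.getOrElse(Nil) ++ incoming
  if (all.isEmpty) None else Some(all) // None drops the key from the state
}

val sessions = events.map(r => (r.userId, r)).updateStateByKey(updateSessions _)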
The Thought
val sc = new SparkContext(…)
val ssc = new StreamingContext(sc, Minutes(2))
val ingress = ssc.textFileStream("folder").groupBy(userId)
val checkpoint = sc.textFile("checkpoint").groupBy(userId)
val sessions = checkpoint.fullOuterJoin(ingress)(userId).cache
sessions.filter(complete).map(enrich).saveAsTextFile("output")
sessions.filter(incomplete).saveAsTextFile("checkpoint")
fileStream
• Works based on file timestamps, with some memory
A bit fuzzy, and ugly for testing
• We wanted more control and monitoring
Our file names carried meta information (source, oldest record time)
Custom implementation with external state (a key-value store), sketched below
That let us control ingress size
Tip: persist the actual job plan
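A hedged sketch of that idea: choose the next batch of files yourself from the file-name metadata instead of trusting textFileStream's timestamp tracking. The name parsing matches the pattern shown later (server_oldest-record-timestamp.txt.gz); the 2 GB cap and the external already-processed set are assumptions:

import org.apache.hadoop.fs.{FileSystem, Path}

case class IngressFile(path: Path, source: String, oldestRecord: Long, size: Long)

def parse(p: Path, size: Long): IngressFile = {
  val Array(source, ts) = p.getName.stripSuffix(".txt.gz").split("_")
  IngressFile(p, source, ts.toLong, size)
}

def nextBatch(fs: FileSystem, dir: Path, alreadyProcessed: Set[String],
              maxBytes: Long = 2L << 30): Seq[IngressFile] = {
  val candidates = fs.listStatus(dir).toSeq
    .filterNot(s => alreadyProcessed(s.getPath.getName)) // lookup in external key-value store
    .map(s => parse(s.getPath, s.getLen))
    .sortBy(_.oldestRecord)                              // oldest data first
  var total = 0L                                         // greedy size cap on the ingress
  candidates.takeWhile { f => total += f.size; total <= maxBytes }
}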
Checkpoint
user-1 1477983123 page-26
user-1 1477983256 page-2
user-2 1477982342 home
user-2 1477982947 page-9
user-2 1477984343 home
Checkpoint
• Custom implementation
We wanted to maintain checkpoint grouping
• Nothing fancy
class SessionInputFormat
extends FileInputFormat[SessionKey, List[Record]]
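A hedged skeleton of that format, assuming the checkpoint file keeps each user's records on consecutive lines (as in the sample above); SessionRecordReader is a hypothetical reader that folds those lines into one (SessionKey, List[Record]) pair:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class SessionInputFormat extends FileInputFormat[SessionKey, List[Record]] {

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[SessionKey, List[Record]] =
    new SessionRecordReader // hypothetical: groups consecutive lines of one user

  // never split a file, so a user's group stays within a single reader
  override def isSplitable(context: JobContext, file: Path): Boolean = false
}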
fullOuterJoin
• Probably the most expensive operation
The average ingress-to-state ratio is 1:35, with extremes of 1:100
We found the IndexedRDD contribution
IndexedRDD
• IndexedRDD
https://github.com/amplab/spark-indexedrdd
• Partition control is essential (example below)
Avoid extra stages and shuffles in your job
Use an explicit partitioner, even if it is just HashPartitioner
Get used to specifying a partitioner for every groupBy / combineByKey
Keep an exact and controllable partition count
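An illustration of that discipline, reusing names from the Algorithm slide near the end (records and addSessionKey are assumed to be an RDD[Record] and a Record => (SessionKey, Record) function):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(20) // exact, controllable partition count

val delta = records
  .map(addSessionKey)
  .combineByKey[List[Record]](
    (r: Record) => List(r),
    (acc: List[Record], r: Record) => r :: acc,
    (a: List[Record], b: List[Record]) => a ::: b,
    partitioner) // explicit, even though it is just a HashPartitioner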
cache & repetition
• Remember?
.cache .filter(complete).doStuff .filter(incomplete).doStuff
• You never want to repeat actions when streaming
We had to scan the entire dataset twice
Also… two independent save actions mean a two-phase-commit problem: one can succeed while the other fails
Multi Output Format
• Custom implementation
We wanted a different format for each output
Not that hard, but lots of copy-paste (sketch below)
Communication via the Hadoop configuration
• MultipleOutputFormat
Why didn't we use it?
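A hedged sketch of such a format; the real class from the talk is not public, so the key marker, the configuration key names, and the plain-text encoding are assumptions, and the output-committer protocol and error handling are glossed over:

import org.apache.hadoop.fs.{FSDataOutputStream, Path}
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object SessionMultiOutputFormat {
  val COMPLETE_SESSIONS_PATH = "session.output.complete" // assumed key names,
  val ONGOING_SESSION_PATH = "session.output.ongoing"    // set on the Gotcha slide below
}

class SessionMultiOutputFormat extends FileOutputFormat[Text, Text] {
  import SessionMultiOutputFormat._

  override def getRecordWriter(ctx: TaskAttemptContext): RecordWriter[Text, Text] = {
    val conf = ctx.getConfiguration
    val task = ctx.getTaskAttemptID.getTaskID.getId

    def open(dir: String): FSDataOutputStream = {
      val path = new Path(dir, f"part-$task%05d")
      path.getFileSystem(conf).create(path, true)
    }

    val complete = open(conf.get(COMPLETE_SESSIONS_PATH))
    val ongoing = open(conf.get(ONGOING_SESSION_PATH))

    new RecordWriter[Text, Text] {
      // dispatch on a marker that the session-splitting step put into the key
      override def write(key: Text, value: Text): Unit = {
        val out = if (key.toString.endsWith(":complete")) complete else ongoing
        out.write(value.copyBytes()); out.write('\n')
      }
      override def close(ctx: TaskAttemptContext): Unit = {
        complete.close(); ongoing.close()
      }
    }
  }
}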
Gotcha
val conf = new JobConf(rdd.context.hadoopConfiguration)

conf.set("mapreduce.job.outputformat.class",
  classOf[SessionMultiOutputFormat].getName)

conf.set(COMPLETE_SESSIONS_PATH, job.outputPath)
conf.set(ONGOING_SESSION_PATH, job.checkpointPath)

sessions.saveAsNewAPIHadoopDataset(conf)
Non-natural partitioning
• Our ingress comes pre-partitioned
File names like server_oldest-record-timestamp.txt.gz
Where each server works on a range of user IDs
• Just foreachRDD
… or is it? :D
Resource utilisation
[two chart slides: resource utilisation over time, y-axis 0–100%]
Parallelise
• Just rdds.par.foreach(processOne)
… or is it? :D
• Limit the thread pool
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool // java.util.concurrent.ForkJoinPool on Scala 2.12+

val par = rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne)
The Algorithm
val stream = new OurCustomDStream(..)
stream.foreachRDD(processUnion)
…
val par = unionRdd.rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne) // reuse the pool-limited collection, not a fresh .par
The Algorithm
val delta = one.map(addSessionKey).combineByKey[List[Record]](..., new HashPartitioner(20))
val checkpoint = sc.newAPIHadoopFile[SessionKey, List[Record], SessionInputFormat](...)
val withHash = HashPartitionerRDD(sc, checkpoint, Some(new HashPartitioner(20)))
val sessions = IndexedRDD(withHash).fullOuterJoin(delta)(joinFunc)
val split = sessions.flatMap(splitSessionFunc)
val conf = new JobConf(...)
split.saveAsNewAPIHadoopDataset(conf)
Result
Configuration
• Current configuration
Driver: 6 GB RAM
15 executors: 4 GB RAM and 2 cores each
• Total size is not that big
60 GB RAM and 30 cores overall
Previously it was 52 SQL instances… though those did other things too
• Hasn't changed for half a year now
Metrics
My Pride
Other tips
• -XX:+UseG1GC
For both driver and executors (example below)
• Plan & store jobs, repeat if failed
When repeating, the environment may have changed
• Use named RDDs
Helps you read your DAGs
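A hedged Scala illustration of the first and last tips; the extraJavaOptions keys are standard Spark settings, and the RDD names are just examples:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")

// named RDDs show up in the Spark UI and make the DAGs readable
checkpoint.setName("checkpoint")
delta.setName("ingress-delta")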
Thanks
