Sessionization
with Spark streaming
Ramūnas Urbonas
@ Platform Lunar
Disclosure
• This work was implemented at Adform
• Thanks to the Hadoop team for permission and help
History
• Original idea from Ted Malaska, 2014
How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
• Hands-on implementation in 2016 at Adform
The Problem
• Constant flow of page visits
110 GB per day on average, with volume variations and catch-up scenarios
• Wait for session interrupts
Timeout, specific action, midnight, sanity checks (sketched below)
• Calculate session duration, length, and reaction times
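A minimal sketch of such interrupt rules, assuming epoch-second timestamps (as in the checkpoint sample later in this deck); the Record fields, the 30-minute timeout, and the "logout" action are illustrative assumptions, not the production rules:

case class Record(userId: String, timestamp: Long, page: String)

val sessionTimeoutSec = 30 * 60 // assumed inactivity timeout

def crossesMidnight(a: Long, b: Long): Boolean = {
  import java.time.{Instant, ZoneOffset}
  Instant.ofEpochSecond(a).atZone(ZoneOffset.UTC).toLocalDate !=
    Instant.ofEpochSecond(b).atZone(ZoneOffset.UTC).toLocalDate
}

def isInterrupt(prev: Record, next: Record): Boolean =
  next.timestamp - prev.timestamp > sessionTimeoutSec || // timeout
    prev.page == "logout" ||                             // specific action (assumed)
    crossesMidnight(prev.timestamp, next.timestamp)      // midnight boundary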
The Problem
• Constant ingress / egress
One car enters, a car trailer exits
A join for every incoming car
• Some cars loop for hours
• Uncontrollable loop volume
Stream / Not
• Still not 100% sure if it’s worth streaming
People still frown when this topic is brought up
• More frequent ingress means less effective join
Is a 2-minute ingress period still streaming? :)
• Another degree of complexity
Cons
• More complex application
Just like cars: a ride to work vs. travelling to Portugal
• Steady pace is required
Throttling is mandatory, volume control is essential, and GC must be well tuned
• Permanently reserved resources
Pros
• Fun
If this one is on your list, you should probably not do it :)
• Speed
This is “result speed”. Do you actually need it?
• Stability
You have to work really hard to get this benefit
Extra context
• User data is partitioned by nature
User ID (range) is the obvious partition key
It helps us control ingress size and, most importantly, loop volume
• Loop volume is hard to control
Average flow was around 150 MB, while the loop varied from 2 to 8 GB
Algorithm
[diagram] ingress + stored state → updateStateByKey / join
[diagram] joined sessions → decision: complete → calculate results; incomplete → store for later
Copy & Paste
• Ted's solution relies on updateStateByKey (sketched below)
This method requires checkpointing
• Checkpoints
Are good only on paper
They are meant for soft recovery
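For reference, a minimal sketch of that approach, assuming events is a DStream[Record]; the update function and checkpoint path are illustrative, not the original code:

import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(2))
ssc.checkpoint("hdfs:///checkpoints/sessionization") // mandatory for stateful operators

def updateSessions(incoming: Seq[Record],
                   state: Option[List[Record]]): Option[List[Record]] = {
  val all = state.getOrElse(Nil) ++ incoming
  if (all.isEmpty) None else Some(all) // None drops the key from the state
}

val sessions = events.map(r => (r.userId, r)).updateStateByKey(updateSessions _)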
The Thought
val sc = new SparkContext(…)
val ssc = new StreamingContext(sc, Minutes(2))
val ingress = ssc.textFileStream("folder").groupBy(userId)
val checkpoint = sc.textFile("checkpoint").groupBy(userId)
val sessions = checkpoint.fullOuterJoin(ingress)(userId).cache
sessions.filter(complete).map(enrich).saveAsTextFile("output")
sessions.filter(incomplete).saveAsTextFile("checkpoint")
fileStream
• Works based on file timestamps, with some memory
A bit fuzzy, and ugly for testing
• We wanted more control and monitoring
Our file names carried meta information (source, oldest record time)
Custom implementation with external state (a key-value store), sketched below
That let us control ingress size
Tip: persist the actual job plan
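A hedged sketch of that idea: choose the next batch of files yourself from the file-name metadata instead of trusting textFileStream's timestamp tracking. The name parsing matches the pattern shown later (server_oldest-record-timestamp.txt.gz); the 2 GB cap and the external already-processed set are assumptions:

import org.apache.hadoop.fs.{FileSystem, Path}

case class IngressFile(path: Path, source: String, oldestRecord: Long, size: Long)

def parse(p: Path, size: Long): IngressFile = {
  val Array(source, ts) = p.getName.stripSuffix(".txt.gz").split("_")
  IngressFile(p, source, ts.toLong, size)
}

def nextBatch(fs: FileSystem, dir: Path, alreadyProcessed: Set[String],
              maxBytes: Long = 2L << 30): Seq[IngressFile] = {
  val candidates = fs.listStatus(dir).toSeq
    .filterNot(s => alreadyProcessed(s.getPath.getName)) // lookup in external key-value store
    .map(s => parse(s.getPath, s.getLen))
    .sortBy(_.oldestRecord)                              // oldest data first
  var total = 0L                                         // greedy size cap on the ingress
  candidates.takeWhile { f => total += f.size; total <= maxBytes }
}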
Checkpoint
user-1 1477983123 page-26
user-1 1477983256 page-2
user-2 1477982342 home
user-2 1477982947 page-9
user-2 1477984343 home
Checkpoint
• Custom implementation
We wanted to maintain checkpoint grouping
• Nothing fancy
class SessionInputFormat
extends FileInputFormat[SessionKey, List[Record]]
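A hedged skeleton of that format, assuming the checkpoint file keeps each user's records on consecutive lines (as in the sample above); SessionRecordReader is a hypothetical reader that folds those lines into one (SessionKey, List[Record]) pair:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class SessionInputFormat extends FileInputFormat[SessionKey, List[Record]] {

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[SessionKey, List[Record]] =
    new SessionRecordReader // hypothetical: groups consecutive lines of one user

  // never split a file, so a user's group stays within a single reader
  override def isSplitable(context: JobContext, file: Path): Boolean = false
}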
fullOuterJoin
• Probably the most expensive operation
The average ingress-to-state ratio is 1:35, with extremes of 1:100
We found the IndexedRDD contribution
IndexedRDD
• IndexedRDD
https://github.com/amplab/spark-indexedrdd
• Partition control is essential (example below)
Avoid extra stages and shuffles in your job
Use an explicit partitioner, even if it is just HashPartitioner
Get used to specifying a partitioner for every groupBy / combineByKey
Keep an exact and controllable partition count
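An illustration of that discipline, reusing names from the Algorithm slide near the end (records and addSessionKey are assumed to be an RDD[Record] and a Record => (SessionKey, Record) function):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(20) // exact, controllable partition count

val delta = records
  .map(addSessionKey)
  .combineByKey[List[Record]](
    (r: Record) => List(r),
    (acc: List[Record], r: Record) => r :: acc,
    (a: List[Record], b: List[Record]) => a ::: b,
    partitioner) // explicit, even though it is just a HashPartitioner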
cache & repetition
• Remember?
.cache .filter(complete).doStuff .filter(incomplete).doStuff
• You never want to repeat actions when streaming
We had to scan the entire dataset twice
Also… two independent save actions mean a two-phase-commit problem: one can succeed while the other fails
Multi Output Format
• Custom implementation
We wanted a different format for each output
Not that hard, but lots of copy-paste (sketch below)
Communication via the Hadoop configuration
• MultipleOutputFormat
Why didn't we use it?
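A hedged sketch of such a format; the real class from the talk is not public, so the key marker, the configuration key names, and the plain-text encoding are assumptions, and the output-committer protocol and error handling are glossed over:

import org.apache.hadoop.fs.{FSDataOutputStream, Path}
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object SessionMultiOutputFormat {
  val COMPLETE_SESSIONS_PATH = "session.output.complete" // assumed key names,
  val ONGOING_SESSION_PATH = "session.output.ongoing"    // set on the Gotcha slide below
}

class SessionMultiOutputFormat extends FileOutputFormat[Text, Text] {
  import SessionMultiOutputFormat._

  override def getRecordWriter(ctx: TaskAttemptContext): RecordWriter[Text, Text] = {
    val conf = ctx.getConfiguration
    val task = ctx.getTaskAttemptID.getTaskID.getId

    def open(dir: String): FSDataOutputStream = {
      val path = new Path(dir, f"part-$task%05d")
      path.getFileSystem(conf).create(path, true)
    }

    val complete = open(conf.get(COMPLETE_SESSIONS_PATH))
    val ongoing = open(conf.get(ONGOING_SESSION_PATH))

    new RecordWriter[Text, Text] {
      // dispatch on a marker that the session-splitting step put into the key
      override def write(key: Text, value: Text): Unit = {
        val out = if (key.toString.endsWith(":complete")) complete else ongoing
        out.write(value.copyBytes()); out.write('\n')
      }
      override def close(ctx: TaskAttemptContext): Unit = {
        complete.close(); ongoing.close()
      }
    }
  }
}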
Gotcha
val conf = new JobConf(rdd.context.hadoopConfiguration)

conf.set("mapreduce.job.outputformat.class",
  classOf[SessionMultiOutputFormat].getName)

conf.set(COMPLETE_SESSIONS_PATH, job.outputPath)
conf.set(ONGOING_SESSION_PATH, job.checkpointPath)

sessions.saveAsNewAPIHadoopDataset(conf)
Non-natural partitioning
• Our ingress comes pre-partitioned
File names like server_oldest-record-timestamp.txt.gz
Where each server works on a range of user IDs
• Just foreachRDD
… or is it? :D
Resource utilisation
[two chart slides: resource utilisation over time, y-axis 0–100%]
Parallelise
• Just rdds.par.foreach(processOne)
… or is it? :D
• Limit the thread pool
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool // java.util.concurrent.ForkJoinPool on Scala 2.12+

val par = rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne)
The Algorithm
val stream = new OurCustomDStream(..)
stream.foreachRDD(processUnion)
…
val par = unionRdd.rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne) // reuse the pool-limited collection, not a fresh .par
The Algorithm
val delta = one.map(addSessionKey).combineByKey[List[Record]](..., new HashPartitioner(20))
val checkpoint = sc.newAPIHadoopFile[SessionKey, List[Record], SessionInputFormat](...)
val withHash = HashPartitionerRDD(sc, checkpoint, Some(new HashPartitioner(20)))
val sessions = IndexedRDD(withHash).fullOuterJoin(delta)(joinFunc)
val split = sessions.flatMap(splitSessionFunc)
val conf = new JobConf(...)
split.saveAsNewAPIHadoopDataset(conf)
Result
Configuration
• Current configuration
Driver: 6 GB RAM
15 executors: 4 GB RAM and 2 cores each
• Total size is not that big
60 GB RAM and 30 cores overall
Previously it was 52 SQL instances… though those did other things too
• Hasn't changed for half a year now
Metrics
My Pride
Other tips
• -XX:+UseG1GC
For both driver and executors (example below)
• Plan & store jobs, repeat if failed
When repeating, the environment may have changed
• Use named RDDs
Helps you read your DAGs
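A hedged Scala illustration of the first and last tips; the extraJavaOptions keys are standard Spark settings, and the RDD names are just examples:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")

// named RDDs show up in the Spark UI and make the DAGs readable
checkpoint.setName("checkpoint")
delta.setName("ingress-delta")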
Thanks
