SlideShare a Scribd company logo
Solving sessionization
problem with Apache
Spark batch and
streaming processing
Bartosz Konieczny
@waitingforcode1
About me
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode #github.com/bartosz25
#canalplus #Paris
2
3
Sessions
"user activity followed by a
closing action or a period of
inactivity"
4
5
© https://pixabay.com/users/maxmann-665103/ from https://pixabay.com
Batch architecture
6
data producer
sync consumer input logs
(DFS)
input logs
(streaming broker)
orchestrator
sessions
generator
<triggers>
previous
window raw
sessions
(DFS)
output
sessions
(DFS)
Streaming architecture
7
data producer
sessions
generator
output
sessions
(DFS)
metadata state
<uses>
checkpoint location
input logs
(streaming broker)
Batch
implementation
The code
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound)).cache()
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
9
Full outer join
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound))
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
10
processing logic
previous
window
active
sessions
new input
logs
full outer join
Watermark simulation
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound))
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
case class SessionIntermediaryState(userId:
Long, … expirationTimeMillisUtc: Long,
isActive: Boolean)
11
Save modes
val previousSessions = loadPreviousWindowSessions(sparkSession,
previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
.json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
.groupByKey(log => SessionGeneration.resolveGroupByKey(log))
.flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5),
windowUpperBound))
joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
.flatMap(state => state.toSessionOutputState)
.coalesce(50).write.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputDir)
SaveMode.Append ⇒
duplicates & invalid results
(e.g. multiplied revenue!)
SaveMode.ErrorIfExists ⇒
failures & maintenance
burden
SaveMode.Ignore ⇒ no
data & old data present in
case of reprocessing
SaveMode.Overwrite ⇒
always fresh data & easy
maintenance
12
Streaming
implementation
The code
val writeQuery = query.writeStream.outputMode(OutputMode.Update())
.option("checkpointLocation", s"s3://my-checkpoint-bucket")
.foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => {
BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}")
})
val dataFrame = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaConfiguration.broker).option(...) .load()
val query = dataFrame.selectExpr("CAST(value AS STRING)")
.select(functions.from_json($"value", Visit.Schema).as("data"))
.select($"data.*").withWatermark("event_time", "3 minutes")
.groupByKey(row => row.getAs[Long]("user_id"))
.mapGroupsWithState(GroupStateTimeout.EventTimeTimeout())
(mapStreamingLogsToSessions(sessionTimeout))
watermark - late events & state
expiration
stateful processing - sessions
generation
checkpoint - fault-tolerance
14
Checkpoint - fault-tolerance
load state
for t0
query
load offsets
to process &
write them
for t1
query
process data
write
processed
offsets
write state
checkpoint location
state store offset log commit log
val writeQuery = query.writeStream.outputMode(OutputMode.Update())
.option("checkpointLocation", s"s3://sessionization-demo/checkpoint")
.foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => {
BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}")
})
.start()
15
Checkpoint - fault-tolerance
load state
for t1
query
load offsets
to process &
write them
for t1
query
process data
confirm
processed
offsets &
next
watermark
commit state
t2
partition-based
checkpoint location
state store offset log commit log
16
Stateful processing
update
remove
get
getput,remove
write update
finalize file
make snapshot
recover state
def mapStreamingLogsToSessions(timeoutDurationMs: Long)(key: Long, logs: Iterator[Row],
currentState: GroupState[SessionIntermediaryState]): SessionIntermediaryState = {
if (currentState.hasTimedOut) {
val expiredState = currentState.get.expire
currentState.remove()
expiredState
} else {
val newState = currentState.getOption.map(state => state.updateWithNewLogs(logs, timeoutDurationMs))
.getOrElse(SessionIntermediaryState.createNew(logs, timeoutDurationMs))
currentState.update(newState)
currentState.setTimeoutTimestamp(currentState.getCurrentWatermarkMs() + timeoutDurationMs)
currentState.get
}
}
17
Stateful processing
update
remove
get
getput,remove
- write update
- finalize file
- make snapshot
recover state
18
.mapGroupsWithState(...)
state store
TreeMap[Long,
ConcurrentHashMap[UnsafeRow,
UnsafeRow]
]
in-memory storage for the most
recent versions
1.delta
2.delta
3.snapshot
checkpoint
location
Watermark
val sessionTimeout = TimeUnit.MINUTES.toMillis(5)
val query = dataFrame.selectExpr("CAST(value AS STRING)")
.select(functions.from_json($"value", Visit.Schema).as("data"))
.select($"data.*")
.withWatermark("event_time", "3 minutes")
.groupByKey(row => row.getAs[Long]("user_id"))
.mapGroupsWithState(GroupStateTimeout.EventTimeTimeout())
(Mapping.mapStreamingLogsToSessions(sessionTimeout))
19
Watermark - late events
on-time
event
late
event
20
.mapGroupsWithState(...)
Watermark - expired state
State representation [simplified]
{value, TTL configuration}
Algorithm:
1. Update all states with new data → eventually extend TTL
2. Retrieve TTL configuration for the query → here: watermark
3. Retrieve all states that expired → no new data in this query & TTL expired
4. Call mapGroupsWithState on it with hasTimedOut param = true & no new data
(Iterator.empty)
// full implementation: org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec.InputProcessor
21
Data reprocessing
Batch
reschedule your job
© https://pics.me.me/just-one-click-and-the-zoo-is-mine-8769663.png
Streaming
Using Apache Spark to Solve Sessionization Problem in Batch and Streaming
State store
1. Restored state is the most recent snapshot
2. Restored state is not the most recent snapshot but a snapshot exists
3. Restored state is not the most recent snapshot and a snapshot doesn't exist
27
1.delta 3.snapshot2.delta
1.delta 3.snapshot2.delta 4.delta
1.delta 3.delta2.delta 4.delta
State store configuration
spark.sql.streaming.stateStore:
→ .minDeltasForSnapshot
→ .maintenanceInterval
28
spark.sql.streaming:
→ .maxBatchesToRetainInMemory
Checkpoint configuration
spark.sql.streaming.minBatchesToRetain
29
Few takeaways
● yet another TDD acronym - Trade-Off Driven Development
○ simplicity for latency
○ simplicity for accuracy
○ scaling for latency
● AWS
○ Kinesis - short retention period = reprocessing boundary, connector
○ S3 - trade reliability for performance
○ EMR - transient cluster
○ Redshift - COPY
● Apache Spark
○ watermarks everywhere - batch simulation
○ state store configuration
○ restore mechanism
○ overwrite idempotent mode
30
Resources
● https://github.com/bartosz25/sessionization-de
mo
● https://www.waitingforcode.com/tags/spark-ai-s
ummit-europe-2019-articles
31
Thank you!Bartosz Konieczny
@waitingforcode / github.com/bartosz25 / waitingforcode.com
Canal+
@canaltechteam
Ad

Recommended

Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
confluent
 
Multi cluster, multitenant and hierarchical kafka messaging service slideshare
Multi cluster, multitenant and hierarchical kafka messaging service slideshare
Allen (Xiaozhong) Wang
 
Enhance your multi-cloud application performance using Redis Enterprise P2
Enhance your multi-cloud application performance using Redis Enterprise P2
Ashnikbiz
 
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Daniel Hochman
 
Exploring KSQL Patterns
Exploring KSQL Patterns
confluent
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Ververica
 
NetApp XCP データ移行ツールインストールと設定
NetApp XCP データ移行ツールインストールと設定
Kan Itani
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
NVMe over Fabric
NVMe over Fabric
singh.gurjeet
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing Guide
Jose De La Rosa
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
confluent
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache Cassandra
DataStax Academy
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
ScyllaDB
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Introduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
DataWorks Summit
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
OFI Overview 2019 Webinar
OFI Overview 2019 Webinar
seanhefty
 
Introduction àJava
Introduction àJava
Christophe Vaudry
 
An Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 
[AWSマイスターシリーズ] Instance Store & Elastic Block Store
[AWSマイスターシリーズ] Instance Store & Elastic Block Store
Amazon Web Services Japan
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
[JEEConf-2017] RxJava as a key component in mature Big Data product
[JEEConf-2017] RxJava as a key component in mature Big Data product
Igor Lozynskyi
 

More Related Content

What's hot (20)

One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
NVMe over Fabric
NVMe over Fabric
singh.gurjeet
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing Guide
Jose De La Rosa
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
confluent
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache Cassandra
DataStax Academy
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
ScyllaDB
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Introduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
DataWorks Summit
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
OFI Overview 2019 Webinar
OFI Overview 2019 Webinar
seanhefty
 
Introduction àJava
Introduction àJava
Christophe Vaudry
 
An Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 
[AWSマイスターシリーズ] Instance Store & Elastic Block Store
[AWSマイスターシリーズ] Instance Store & Elastic Block Store
Amazon Web Services Japan
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing Guide
Jose De La Rosa
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
confluent
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache Cassandra
DataStax Academy
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
ScyllaDB
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Introduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
DataWorks Summit
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
OFI Overview 2019 Webinar
OFI Overview 2019 Webinar
seanhefty
 
An Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 
[AWSマイスターシリーズ] Instance Store & Elastic Block Store
[AWSマイスターシリーズ] Instance Store & Elastic Block Store
Amazon Web Services Japan
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 

Similar to Using Apache Spark to Solve Sessionization Problem in Batch and Streaming (20)

Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
[JEEConf-2017] RxJava as a key component in mature Big Data product
[JEEConf-2017] RxJava as a key component in mature Big Data product
Igor Lozynskyi
 
Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19
confluent
 
Online Meetup: Why should container system / platform builders care about con...
Online Meetup: Why should container system / platform builders care about con...
Docker, Inc.
 
Meetup spark structured streaming
Meetup spark structured streaming
José Carlos García Serrano
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
Imre Nagi
 
Spark streaming with kafka
Spark streaming with kafka
Dori Waldman
 
Spark stream - Kafka
Spark stream - Kafka
Dori Waldman
 
A new kind of BPM with Activiti
A new kind of BPM with Activiti
Alfresco Software
 
Reactive programming every day
Reactive programming every day
Vadym Khondar
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
Anyscale
 
JavaZone 2014 - goto java;
JavaZone 2014 - goto java;
Martin (高馬丁) Skarsaune
 
こわくないよ❤️ Playframeworkソースコードリーディング入門
こわくないよ❤️ Playframeworkソースコードリーディング入門
tanacasino
 
Introducing the WSO2 Complex Event Processor
Introducing the WSO2 Complex Event Processor
WSO2
 
Scalable Angular 2 Application Architecture
Scalable Angular 2 Application Architecture
FDConf
 
VPN Access Runbook
VPN Access Runbook
Taha Shakeel
 
spring aop.pptx aspt oreinted programmin
spring aop.pptx aspt oreinted programmin
zmulani8
 
Event Sourcing - what could go wrong - Devoxx BE
Event Sourcing - what could go wrong - Devoxx BE
Andrzej Ludwikowski
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
Martin Zapletal
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
DataWorks Summit/Hadoop Summit
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
[JEEConf-2017] RxJava as a key component in mature Big Data product
[JEEConf-2017] RxJava as a key component in mature Big Data product
Igor Lozynskyi
 
Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19
confluent
 
Online Meetup: Why should container system / platform builders care about con...
Online Meetup: Why should container system / platform builders care about con...
Docker, Inc.
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
Imre Nagi
 
Spark streaming with kafka
Spark streaming with kafka
Dori Waldman
 
Spark stream - Kafka
Spark stream - Kafka
Dori Waldman
 
A new kind of BPM with Activiti
A new kind of BPM with Activiti
Alfresco Software
 
Reactive programming every day
Reactive programming every day
Vadym Khondar
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
Anyscale
 
こわくないよ❤️ Playframeworkソースコードリーディング入門
こわくないよ❤️ Playframeworkソースコードリーディング入門
tanacasino
 
Introducing the WSO2 Complex Event Processor
Introducing the WSO2 Complex Event Processor
WSO2
 
Scalable Angular 2 Application Architecture
Scalable Angular 2 Application Architecture
FDConf
 
VPN Access Runbook
VPN Access Runbook
Taha Shakeel
 
spring aop.pptx aspt oreinted programmin
spring aop.pptx aspt oreinted programmin
zmulani8
 
Event Sourcing - what could go wrong - Devoxx BE
Event Sourcing - what could go wrong - Devoxx BE
Andrzej Ludwikowski
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
Martin Zapletal
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

All the DataOps, all the paradigms .
All the DataOps, all the paradigms .
Lars Albertsson
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Presentation by Tariq & Mohammed (1).pptx
Presentation by Tariq & Mohammed (1).pptx
AbooddSandoqaa
 
624753984-Annex-A3-RPMS-Tool-for-Proficient-Teachers-SY-2024-2025.pdf
624753984-Annex-A3-RPMS-Tool-for-Proficient-Teachers-SY-2024-2025.pdf
CristineGraceAcuyan
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Camuflaje Tipos Características Militar 2025.ppt
Camuflaje Tipos Características Militar 2025.ppt
e58650738
 
Daily, Weekly, Monthly Report MTC March 2025.pptx
Daily, Weekly, Monthly Report MTC March 2025.pptx
PanjiDewaPamungkas1
 
The Influence off Flexible Work Policies
The Influence off Flexible Work Policies
sales480687
 
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 
NASA ESE Study Results v4 05.29.2020.pptx
NASA ESE Study Results v4 05.29.2020.pptx
CiroAlejandroCamacho
 
Artigo - Playing to Win.planejamento docx
Artigo - Playing to Win.planejamento docx
KellyXavier15
 
Data Visualisation in data science for students
Data Visualisation in data science for students
confidenceascend
 
一比一原版(TUC毕业证书)开姆尼茨工业大学毕业证如何办理
一比一原版(TUC毕业证书)开姆尼茨工业大学毕业证如何办理
taqyed
 
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
gun3awan88
 
Crafting-Research-Recommendations Grade 12.pptx
Crafting-Research-Recommendations Grade 12.pptx
DaryllWhere
 
Allotted-MBBS-Student-list-batch-2021.pdf
Allotted-MBBS-Student-list-batch-2021.pdf
subhansaifi0603
 
Informatics Market Insights AI Workforce.pdf
Informatics Market Insights AI Workforce.pdf
karizaroxx
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
taqyea
 
All the DataOps, all the paradigms .
All the DataOps, all the paradigms .
Lars Albertsson
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Presentation by Tariq & Mohammed (1).pptx
Presentation by Tariq & Mohammed (1).pptx
AbooddSandoqaa
 
624753984-Annex-A3-RPMS-Tool-for-Proficient-Teachers-SY-2024-2025.pdf
624753984-Annex-A3-RPMS-Tool-for-Proficient-Teachers-SY-2024-2025.pdf
CristineGraceAcuyan
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Camuflaje Tipos Características Militar 2025.ppt
Camuflaje Tipos Características Militar 2025.ppt
e58650738
 
Daily, Weekly, Monthly Report MTC March 2025.pptx
Daily, Weekly, Monthly Report MTC March 2025.pptx
PanjiDewaPamungkas1
 
The Influence off Flexible Work Policies
The Influence off Flexible Work Policies
sales480687
 
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 
NASA ESE Study Results v4 05.29.2020.pptx
NASA ESE Study Results v4 05.29.2020.pptx
CiroAlejandroCamacho
 
Artigo - Playing to Win.planejamento docx
Artigo - Playing to Win.planejamento docx
KellyXavier15
 
Data Visualisation in data science for students
Data Visualisation in data science for students
confidenceascend
 
一比一原版(TUC毕业证书)开姆尼茨工业大学毕业证如何办理
一比一原版(TUC毕业证书)开姆尼茨工业大学毕业证如何办理
taqyed
 
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
gun3awan88
 
Crafting-Research-Recommendations Grade 12.pptx
Crafting-Research-Recommendations Grade 12.pptx
DaryllWhere
 
Allotted-MBBS-Student-list-batch-2021.pdf
Allotted-MBBS-Student-list-batch-2021.pdf
subhansaifi0603
 
Informatics Market Insights AI Workforce.pdf
Informatics Market Insights AI Workforce.pdf
karizaroxx
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
taqyea
 

Using Apache Spark to Solve Sessionization Problem in Batch and Streaming

  • 1. Solving sessionization problem with Apache Spark batch and streaming processing Bartosz Konieczny @waitingforcode1
  • 2. About me Bartosz Konieczny #dataEngineer #ApacheSparkEnthusiast #AWSuser #waitingforcode.com #becomedataengineer.com #@waitingforcode #github.com/bartosz25 #canalplus #Paris 2
  • 3. 3
  • 4. Sessions "user activity followed by a closing action or a period of inactivity" 4
  • 6. Batch architecture 6 data producer sync consumer input logs (DFS) input logs (streaming broker) orchestrator sessions generator <triggers> previous window raw sessions (DFS) output sessions (DFS)
  • 7. Streaming architecture 7 data producer sessions generator output sessions (DFS) metadata state <uses> checkpoint location input logs (streaming broker)
  • 9. The code val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)).cache() joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) 9
  • 10. Full outer join val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)) joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) 10 processing logic previous window active sessions new input logs full outer join
  • 11. Watermark simulation val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)) joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) case class SessionIntermediaryState(userId: Long, … expirationTimeMillisUtc: Long, isActive: Boolean) 11
  • 12. Save modes val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir) val sessionsInWindow = sparkSession.read.schema(Visit.Schema) .json(inputDir) val joinedData = previousSessions.join(sessionsInWindow, sessionsInWindow("user_id") === previousSessions("userId"), "fullouter") .groupByKey(log => SessionGeneration.resolveGroupByKey(log)) .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound)) joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir) joinedData.filter(state => !state.isActive) .flatMap(state => state.toSessionOutputState) .coalesce(50).write.mode(SaveMode.Overwrite) .option("compression", "gzip") .json(outputDir) SaveMode.Append ⇒ duplicates & invalid results (e.g. multiplied revenue!) SaveMode.ErrorIfExists ⇒ failures & maintenance burden SaveMode.Ignore ⇒ no data & old data present in case of reprocessing SaveMode.Overwrite ⇒ always fresh data & easy maintenance 12
  • 14. The code val writeQuery = query.writeStream.outputMode(OutputMode.Update()) .option("checkpointLocation", s"s3://my-checkpoint-bucket") .foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => { BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}") }) val dataFrame = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaConfiguration.broker).option(...) .load() val query = dataFrame.selectExpr("CAST(value AS STRING)") .select(functions.from_json($"value", Visit.Schema).as("data")) .select($"data.*").withWatermark("event_time", "3 minutes") .groupByKey(row => row.getAs[Long]("user_id")) .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout()) (mapStreamingLogsToSessions(sessionTimeout)) watermark - late events & state expiration stateful processing - sessions generation checkpoint - fault-tolerance 14
  • 15. Checkpoint - fault-tolerance load state for t0 query load offsets to process & write them for t1 query process data write processed offsets write state checkpoint location state store offset log commit log val writeQuery = query.writeStream.outputMode(OutputMode.Update()) .option("checkpointLocation", s"s3://sessionization-demo/checkpoint") .foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => { BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}") }) .start() 15
  • 16. Checkpoint - fault-tolerance load state for t1 query load offsets to process & write them for t1 query process data confirm processed offsets & next watermark commit state t2 partition-based checkpoint location state store offset log commit log 16
  • 17. Stateful processing update remove get getput,remove write update finalize file make snapshot recover state def mapStreamingLogsToSessions(timeoutDurationMs: Long)(key: Long, logs: Iterator[Row], currentState: GroupState[SessionIntermediaryState]): SessionIntermediaryState = { if (currentState.hasTimedOut) { val expiredState = currentState.get.expire currentState.remove() expiredState } else { val newState = currentState.getOption.map(state => state.updateWithNewLogs(logs, timeoutDurationMs)) .getOrElse(SessionIntermediaryState.createNew(logs, timeoutDurationMs)) currentState.update(newState) currentState.setTimeoutTimestamp(currentState.getCurrentWatermarkMs() + timeoutDurationMs) currentState.get } } 17
  • 18. Stateful processing update remove get getput,remove - write update - finalize file - make snapshot recover state 18 .mapGroupsWithState(...) state store TreeMap[Long, ConcurrentHashMap[UnsafeRow, UnsafeRow] ] in-memory storage for the most recent versions 1.delta 2.delta 3.snapshot checkpoint location
  • 19. Watermark val sessionTimeout = TimeUnit.MINUTES.toMillis(5) val query = dataFrame.selectExpr("CAST(value AS STRING)") .select(functions.from_json($"value", Visit.Schema).as("data")) .select($"data.*") .withWatermark("event_time", "3 minutes") .groupByKey(row => row.getAs[Long]("user_id")) .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout()) (Mapping.mapStreamingLogsToSessions(sessionTimeout)) 19
  • 20. Watermark - late events on-time event late event 20 .mapGroupsWithState(...)
  • 21. Watermark - expired state State representation [simplified] {value, TTL configuration} Algorithm: 1. Update all states with new data → eventually extend TTL 2. Retrieve TTL configuration for the query → here: watermark 3. Retrieve all states that expired → no new data in this query & TTL expired 4. Call mapGroupsWithState on it with hasTimedOut param = true & no new data (Iterator.empty) // full implementation: org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec.InputProcessor 21
  • 23. Batch
  • 24. reschedule your job © https://pics.me.me/just-one-click-and-the-zoo-is-mine-8769663.png
  • 27. State store 1. Restored state is the most recent snapshot 2. Restored state is not the most recent snapshot but a snapshot exists 3. Restored state is not the most recent snapshot and a snapshot doesn't exist 27 1.delta 3.snapshot2.delta 1.delta 3.snapshot2.delta 4.delta 1.delta 3.delta2.delta 4.delta
  • 28. State store configuration spark.sql.streaming.stateStore: → .minDeltasForSnapshot → .maintenanceInterval 28 spark.sql.streaming: → .maxBatchesToRetainInMemory
  • 30. Few takeaways ● yet another TDD acronym - Trade-Off Driven Development ○ simplicity for latency ○ simplicity for accuracy ○ scaling for latency ● AWS ○ Kinesis - short retention period = reprocessing boundary, connector ○ S3 - trade reliability for performance ○ EMR - transient cluster ○ Redshift - COPY ● Apache Spark ○ watermarks everywhere - batch simulation ○ state store configuration ○ restore mechanism ○ overwrite idempotent mode 30
  • 32. Thank you!Bartosz Konieczny @waitingforcode / github.com/bartosz25 / waitingforcode.com Canal+ @canaltechteam