Customizing Apache Spark - beyond SparkSessionExtensions
Implementing a custom state store
Bartosz Konieczny @waitingforcode
About me
Bartosz Konieczny
Data Engineer @OCTOTechnology
#ApacheSparkEnthusiast #DataOnTheCloud
👓 read my data & Spark articles at waitingforcode.com
🎓 learn data engineering with me at becomedataengineer.com
follow me @waitingforcode
check github.com/bartosz25 for data code snippets
A customized Apache Spark?
3 levels of customization (subjective)
User-Defined-*
SQL plans, data sources/sinks, plugins, file committers,
checkpoint manager, state stores
topology mapper, recovery mode 😱
A customized state store?
a simplified state store definition (my own)
A versioned, partition-based map used to store the intermediary
results (state) of stateful operations (aggregations, streaming
joins, arbitrary stateful processing, deduplication, global limit).
State store customization 101
▪ How?
▪ spark.sql.streaming.stateStore.providerClass
▪ What?
▪ org.apache.spark.sql.execution.streaming.state.StateStoreProvider
▪ org.apache.spark.sql.execution.streaming.state.StateStore
▪ Why?
▪ RocksDB rocks 🤘
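Wiring in a custom provider is a single configuration entry; a minimal sketch, assuming a hypothetical MyStateStoreProvider class that extends StateStoreProvider:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: com.example.state.MyStateStoreProvider is a hypothetical
// class name standing in for your own StateStoreProvider implementation.
val spark = SparkSession.builder()
  .appName("custom-state-store-demo")
  .config("spark.sql.streaming.stateStore.providerClass",
          "com.example.state.MyStateStoreProvider")
  .getOrCreate()
```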
APIs - 5 main operation types

trait StateStore
  // CRUD
  def get(key: UnsafeRow): UnsafeRow
  def put(key: UnsafeRow, value: UnsafeRow): Unit
  def remove(key: UnsafeRow): Unit
  // "transaction" management
  def commit(): Long
  def abort(): Unit
  def hasCommitted: Boolean
  // state expiration
  def iterator(): Iterator[UnsafeRowPair]
  def getRange(start: Option[UnsafeRow],
               end: Option[UnsafeRow]): Iterator[UnsafeRowPair]
  // state store metrics
  def metrics: StateStoreMetrics

trait StateStoreProvider
  // maintenance
  def doMaintenance(): Unit
  def supportedCustomMetrics: Seq[StateStoreCustomMetric]
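To make the contract concrete, here is a minimal in-memory sketch of those five operation types, with plain strings standing in for Spark's UnsafeRow (the class name and simplifications are mine, not Spark's):

```scala
import scala.collection.mutable

// Simplified stand-in for the StateStore contract: strings replace
// UnsafeRow, and versioning is reduced to a single counter.
class InMemoryStateStore(val version: Long,
                         initial: Map[String, String] = Map.empty) {
  private val data = mutable.Map(initial.toSeq: _*)
  private var committed = false

  // CRUD
  def get(key: String): Option[String] = data.get(key)
  def put(key: String, value: String): Unit = data.put(key, value)
  def remove(key: String): Unit = data.remove(key)

  // "transaction" management: commit returns the new version
  def commit(): Long = { committed = true; version + 1 }
  def abort(): Unit = data.clear()
  def hasCommitted: Boolean = committed

  // state expiration support: expose all states for iteration
  def iterator(): Iterator[(String, String)] = data.iterator
}
```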
CRUD with API
▪ initialize state store → StateStoreProvider#createAndInit +
  StateStore#getStore(version: Long): StateStore, through
  StateStoreOps#mapPartitionsWithStateStore or StateStoreRDD
▪ get current value (state) → StateStore#get
▪ set new value (state) → StateStore#put, directly or through a state store manager
▪ transform state → Spark-defined function, or a user-defined function
  for arbitrary stateful processing
examples:
⚪ StreamingDeduplicateExec: store.put(key, EMPTY_ROW)
⚪ FlatMapGroupsWithStateExec: stateManager.putState(store,
  stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
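On the user-facing side, the arbitrary-stateful-processing path is reached through flatMapGroupsWithState; every state read and update in the sketch below ultimately lands in StateStore#get/#put (the Event type and the counting logic are illustrative):

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, value: Long)

// User-defined function for arbitrary stateful processing: keeps a
// per-key running count in the state store.
def countEvents(key: String, events: Iterator[Event],
                state: GroupState[Long]): Iterator[(String, Long)] = {
  val newCount = state.getOption.getOrElse(0L) + events.size // StateStore#get
  state.update(newCount)                                     // StateStore#put
  Iterator((key, newCount))
}

// Usage (events: Dataset[Event] from a streaming source):
// events.groupByKey(_.key)
//   .flatMapGroupsWithState(OutputMode.Update,
//     GroupStateTimeout.NoTimeout)(countEvents _)
```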
State expiration - with API
▪ list all states → StateStore#getRange / StateStore#iterator
▪ for every key apply the expiration predicate, e.g. the watermark predicate
▪ remove the state → StateStore#remove

store.getRange(None, None).map { p =>
  stateData.withNew(p.key, p.value,
    getStateObject(p.value),
    getTimestamp(p.value))
}

// StateStore default implementation
def getRange(start: Option[UnsafeRow],
             end: Option[UnsafeRow]): Iterator[UnsafeRowPair] = {
  iterator()
}

// StreamingAggregationStateManagerBaseImpl
override def iterator(store: StateStore): Iterator[UnsafeRowPair] = {
  store.iterator()
}
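The expiration pass can be sketched in isolation with a plain map; timestamps and the watermark value are made up, and the loop mirrors iterator() followed by remove():

```scala
import scala.collection.mutable

// key -> event-time timestamp (ms); values below are illustrative.
val states = mutable.Map("k1" -> 100L, "k2" -> 250L, "k3" -> 150L)
val watermarkMs = 200L

// Iterate over all states (no range pruning, as in the default getRange)
// and drop every entry behind the watermark - the StateStore#remove step.
states.toSeq.foreach { case (key, timestamp) =>
  if (timestamp < watermarkMs) states.remove(key)
}
// only "k2" remains
```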
State finalization with API
▪ after processing alive and expired states (CompletionIterator, NextIterator):
  validate the modified state → StateStore#commit,
  or StateStore#abort if failure (version not committed)
▪ task completed / all tasks terminated → the task completion listener
  invokes the state store listener
▪ gather & log state metrics → StateStore#metrics, e.g.:

"customMetrics" : {
  "loadedMapCacheHitCount": 12,
  "loadedMapCacheMissCount": 0,
  "stateOnCurrentVersionSizeBytes": 208
}
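The abort-on-failure hook relies on Spark's task completion listener; a hedged sketch, assuming `store` is an already opened StateStore instance inside a running task:

```scala
import org.apache.spark.TaskContext

// Sketch: when the task ends without the version having been committed,
// abort the store so the uncommitted version is discarded.
TaskContext.get().addTaskCompletionListener[Unit] { _ =>
  if (!store.hasCommitted) {
    store.abort()
  }
}
```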
State maintenance - with API
▪ a background thread per partition (store)
▪ every spark.sql.streaming.stateStore.maintenanceInterval it starts a
  maintenance job → StateStoreProvider#doMaintenance
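A typical doMaintenance implementation compacts deltas into snapshots and drops versions that are no longer needed; this pure-Scala sketch shows only the retention part (the retention count mirrors spark.sql.streaming.minBatchesToRetain, everything else is hypothetical):

```scala
// Given the list of persisted versions, compute which ones can be deleted
// while keeping the most recent `minVersionsToRetain` for recovery.
def versionsToDelete(versions: Seq[Long], minVersionsToRetain: Int): Seq[Long] = {
  val sorted = versions.sorted
  sorted.dropRight(minVersionsToRetain) // everything but the newest N
}

// versionsToDelete(Seq(3L, 1L, 2L, 4L), 2) == Seq(1L, 2L)
```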
Remember
▪ getRange(start, end) - no range: the default implementation ignores the bounds
▪ state expiration - iteration over all states
▪ iterator() - UnsafeRowPair is mutable
▪ put() - UnsafeRow can be reused, use the copies, Luke!
▪ consistency awareness - spark.sql.streaming.minBatchesToRetain
▪ state reloading semantics - incremental changes (delta) vs snapshot in time
▪ state reloading semantics - delete markers
▪ state store implementation is immutable - it remains the same between runs
▪ state store commit - micro-batch/epoch + 1!
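The row-reuse pitfall is easy to demonstrate with any mutable buffer; this pure-Scala sketch (MutableRow is my stand-in for UnsafeRow) shows why put() must store copies:

```scala
import scala.collection.mutable

// Stand-in for UnsafeRow: a mutable object Spark may reuse across records.
class MutableRow(var value: Int) {
  def copyRow(): MutableRow = new MutableRow(value)
}

val reused = new MutableRow(1)
val unsafeStore = mutable.Map("a" -> reused)         // keeps the live reference
val safeStore = mutable.Map("a" -> reused.copyRow()) // defensive copy

reused.value = 99 // the "row" is reused for the next record

// unsafeStore("a").value == 99 (silently corrupted state)
// safeStore("a").value == 1   (the copy is unaffected)
```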
Resources
▪ follow-up blog posts series: https://www.waitingforcode.com/tags/data-ai-summit-europe-2020-articles
▪ Github project - MapDB-backed state store, customized checkpoint manager and file committer:
  https://github.com/bartosz25/data-ai-summit-2020
▪ blog posts/talks about custom:
  data sources: https://databricks.com/session_eu19/extending-spark-sql-2-4-with-new-data-sources-live-coding-session-continues
  plugins: https://issues.apache.org/jira/browse/SPARK-28091
  https://databricks.com/session_eu20/what-is-new-with-apache-spark-performance-monitoring-in-spark-3-0
  SQL plan: https://databricks.com/session/how-to-extend-apache-spark-with-customized-optimizations
  https://www.waitingforcode.com/tags/spark-sql-customization
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
Thank you!
@waitingforcode / waitingforcode.com
@OCTOTechnology / blog.octo.com/en
