Structured Streaming
For Machine Learning! :)
kroszk@
Built with experimental APIs :)
Holden:
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data Processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Seth:
● Machine learning engineer at IBM’s Spark Technology Center
○ Working on high-performance, distributed machine learning for Spark MLlib
○ Also, structured streaming!
● Previously electrical engineering
● @shendrickson16
● Linkedin https://www.linkedin.com/in/sethah
● Github https://github.com/sethah
● SlideShare http://www.slideshare.net/SethHendrickson
What is going to be covered:
● Who we think y’all are
● Abridged Introduction to Datasets
● Abridged Introduction to Structured Streaming
● What Structured Streaming is and is not
● How to write simple structured streaming queries
● The exciting part: Building machine learning on top of structured streaming
● Possible future changes to make structured streaming & ML work together nicely
Torsten Reuschling
Who we think you wonderful humans are
● Nice* people
● Don’t mind pictures of cats
● Know some Apache Spark
● May or may not know the Dataset API
● Want to take advantage of Spark’s Structured Streaming
● May care about machine learning
● Possibly distracted with Pokemon GO
ALPHA =~ Please don’t use this
Image by Mr Thinktank
What are Datasets?
● New in Spark 1.6
● Provide a templated, compile-time strongly typed version of DataFrames
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● Still an experimental component (API will change in future versions)
● The basis of Structured Streaming
Houser Wolf
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
So what was that?
ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
A typed query (specifies the return type). Without the as[] it will return a DataFrame (Dataset[Row]).
Traditional functional reduction: arbitrary Scala code :)
Robert Couse-Baker
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
And now we can use it for streaming too!
● Structured Streaming - new in Spark 2.0
○ Emphasis on new - be cautious when using
● Extends the Dataset & DataFrame APIs to represent continuous tables
● Still very early stages - but lots of really cool optimizations possible now
● We can build a machine learning pipeline with it together :)
○ Well we have to use some hacks - but ssssssh don’t tell TD
https://github.com/holdenk/spark-structured-streaming-ml
Get a streaming dataframe
// Read a streaming dataframe
val schema = new StructType()
.add("happiness", "double")
.add("coffees", "integer")
val streamingDS = spark
.readStream
.schema(schema)
.format("parquet")
.load(path)
[Diagram: Dataset (isStreaming = true) with logical plan: source → scan]
Build the recipe for each query
val happinessByCoffee = streamingDS
.groupBy($"coffees")
.agg(avg($"happiness"))
[Diagram: Dataset (isStreaming = true) with logical plan: source → scan → groupBy → avg]
Start a continuous query
val query = happinessByCoffee
.writeStream
.format("parquet")
.outputMode("complete")
.trigger(ProcessingTime(5.seconds))
.start()
[Diagram: StreamingQuery with logicalPlan = source → scan → groupBy → avg]
Launch a new thread to listen for new data
[Diagram: StreamingQuery (logicalPlan = source → scan → groupBy → avg); the MicroBatch Thread is listening; Source available offsets: none yet; Sink committed offsets: none yet]
Write new offsets to WAL
[Diagram: the MicroBatch Thread commits the new offsets to the WAL; Source available offsets: 0, 1, 2; Sink committed offsets: none yet]
Check the source for new offsets
[Diagram: the MicroBatch Thread calls getBatch() for batchId = 42; Source available offsets: 0, 1, 2; Sink committed offsets: none yet]
Get the “recipe” for this micro batch
[Diagram: the logical plan (source → scan → groupBy → avg) is copied as the recipe for batchId = 42; Source available offsets: 0, 1, 2; Sink committed offsets: none yet]
Send the micro batch Dataset to the sink
[Diagram: addBatch() hands the Sink a MicroBatch Dataset (isStreaming = false) for batchId = 42, backed by an incremental execution plan; Source available offsets: 0, 1, 2]
Commit and listen again
[Diagram: the MicroBatch Thread returns to listening; Source available offsets: 0, 1, 2; Sink committed offsets: 0, 1, 2]
Execution Summary
● Each query has its own thread - asynchronous
● Sources must be replayable
● Use write-ahead-logs for durability
● Sinks must be idempotent
● Each batch is executed with an incremental execution plan
● Sinks get a micro batch view of the data
○ We exploit this for ML
Cool - let's build some ML with it!
Lauren Coolman
Batch ML pipelines
[Diagram: batch pipeline Tokenizer → HashingTF → String Indexer → Naive Bayes; fit(df) maps the Estimator pipeline to a Transformer pipeline]
● In the batch setting, an estimator is trained on a dataset, and produces a static, immutable transformer.
● There is no communication between the two.
Streaming ML Pipelines (Proof of Concept)
[Diagram: streaming pipeline Tokenizer → HashingTF → Streaming String Indexer → Streaming Naive Bayes; fit(df) maps the Estimator pipeline to a mutable Transformer pipeline that carries state]
● For streaming, an estimator is not fit once, but repeatedly on incoming data
● In one implementation, the estimator produces an initial transformer, and communicates updates to a specialized StreamingTransformer.
● Streaming transformers must provide a means of incorporating model updates into predictions.
Lauren Coolman
Streaming Estimator/Transformer (POC)
trait StreamingModel[S] extends Transformer {
  def update(updates: S): Unit
}

// S = the sufficient statistics for model updates
trait StreamingEstimator[S] extends Estimator {
  def model: StreamingModel[S]
  def update(batch: Dataset[_]): Unit
}
BlinkenArea
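To make the traits concrete, here's a minimal sketch of our own (not from the talk): a running-mean model whose sufficient statistics S are a (count, sum) pair, with the usual Spark ML Transformer/Estimator plumbing elided.

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{count, sum}

// Mutable model: folds (count, sum) updates into a running mean.
class StreamingMeanModel {
  private var n: Long = 0L
  private var total: Double = 0.0
  def update(updates: (Long, Double)): Unit = synchronized {
    n += updates._1
    total += updates._2
  }
  def mean: Double = synchronized { if (n == 0) 0.0 else total / n }
}

// Estimator: aggregates each micro batch into sufficient statistics,
// then merges them into the model (called once per micro batch).
class StreamingMeanEstimator(spark: SparkSession) {
  import spark.implicits._
  val model = new StreamingMeanModel()
  def update(batch: Dataset[Double]): Unit = {
    val stats = batch.agg(count("value"), sum("value")).as[(Long, Double)].head()
    model.update(stats)
  }
}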
Getting a micro-batch view with distributed collection*
case class ForeachDatasetSink(func: DataFrame => Unit) extends Sink {
override def addBatch(batchId: Long, data: DataFrame): Unit = {
func(data)
}
}
https://github.com/holdenk/spark-structured-streaming-ml
And doing some ML with it:
def evilTrain(df: DataFrame): StreamingQuery = {
val sink = new ForeachDatasetSink({df: DataFrame => update(df)})
val sparkSession = df.sparkSession
val evilStreamingQueryManager = EvilStreamingQueryManager(sparkSession.streams)
evilStreamingQueryManager.startQuery(
Some("snb-train"),
None,
df,
sink,
OutputMode.Append())
}
And doing some ML with it:
def update(batch: Dataset[_]): Unit = {
  val newCountsByClass = add(batch) // Aggregate new batch
  model.update(newCountsByClass) // Merge with previous aggregates
}
And doing some ML with it*
(Algorithm specific)
def update(updates: Array[(Double, (Long, DenseVector))]): Unit = {
updates.foreach { case (label, (numDocs, termCounts)) =>
countsByClass.get(label) match {
case Some((n, c)) =>
axpy(1.0, termCounts, c)
countsByClass(label) = (n + numDocs, c)
case None =>
// new label encountered
countsByClass += (label -> (numDocs, termCounts))
}
}
}
Non-Evil alternatives to our Evil:
● ForeachWriter exists
● Since everything runs on the executors it's difficult to update the model
● You could:
○ Use accumulators (see the sketch after this list)
○ Write the updates to Kafka
○ Send the updates to a param server of some type with RPC
○ Or do the evil things we did instead :)
● Wait for the “future?”: https://github.com/apache/spark/pull/15178
_torne
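For instance, the accumulator route might look like this rough sketch (our illustration; ClassStats and the draining step are hypothetical): the ForeachWriter adds sufficient statistics to a CollectionAccumulator on the executors, and the driver drains acc.value between batches.

import org.apache.spark.sql.{ForeachWriter, SparkSession}
import org.apache.spark.util.CollectionAccumulator

// Hypothetical per-record sufficient statistics
case class ClassStats(label: Double, numDocs: Long)

val spark = SparkSession.builder.getOrCreate()

// Created on the driver; tasks add to it on the executors,
// and the driver reads (and resets) acc.value between batches.
val acc: CollectionAccumulator[ClassStats] =
  spark.sparkContext.collectionAccumulator[ClassStats]("snb-stats")

val statsWriter = new ForeachWriter[ClassStats] {
  def open(partitionId: Long, version: Long): Boolean = true
  def process(record: ClassStats): Unit = acc.add(record) // executor side
  def close(errorOrNull: Throwable): Unit = {}
}

// Caveat: there's no batch-completion hook here, so deciding when the
// driver folds acc.value into the model is on you - part of why we went
// the evil Sink route instead.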
Working with the results - foreach (1 of 2)
val foreachWriter: ForeachWriter[T] =
new ForeachWriter[T] {
def open(partitionId: Long, version: Long): Boolean = {
true // always open
}
def close(errorOrNull: Throwable): Unit = {
// No close logic - if we wanted to copy updates per-batch
}
def process(record: T): Unit = {
db.update(record)
}
}
Working with the results - foreach (2 of 2)
// Apply foreach
happinessByCoffee.writeStream
  .outputMode(OutputMode.Complete())
  .foreach(foreachWriter)
  .start()
Structured Streaming in Review:
● Structured Streaming still uses Spark's micro-batch approach
● JIRA discussion indicates an interest in swapping out the execution engine
(but no public design document has emerged yet)
● One of the areas that Matei is researching
○ Researching ==~ future , research !~ today
Windell Oskay
Ok but where can we not use it?
● A lot of random methods on DataFrames & Datasets won’t work
● They will fail at runtime rather than compile time - so have tests!
● Anything which roundtrips through an rdd() is going to be pretty sad (aka fail)
○ Lots of internals randomly do this (like toJson) for historical reasons
● Need to run a query inside of a sink? That is not going to work
● Need a complex receiver type? Most receivers are not ported yet
● Also you will need distinct query names - even if you stop the previous query.
● Aggregations can't be used with the Append output mode (and the only file sink requires Append)
● Need Kafka or other non-file-based receivers
● DataFrame/Dataset transformations inside of a sink
Open questions for ML pipelines
● How to train and predict simultaneously, on the same data?
○ Transform thread should be executed first
○ Do we actually need to support this or is this just a common demo?
● How to ensure robustness to failures?
○ Store models in main memory for quick access
○ Back models by durable model store - write out every update
○ Similar to aggregate state store in structured streaming, but happens on the driver
● Model training must be idempotent - should not train on the same data twice
○ Leverage batch ID, similar to `FileStreamSink` (see the sketch below)
● How to extend MLWritable for streaming
○ Spark’s format isn’t really all that useful - maybe PMML or PFA
Photo by bullet101
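A minimal sketch (ours, reusing the same internal Sink API as our evil approach) of leveraging the batch ID: remember the last batch folded into the model and skip replays after recovery.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Guard training with the batch ID so a replayed micro batch (the same
// batchId re-delivered after a failure) is not trained on twice.
case class IdempotentTrainingSink(train: DataFrame => Unit) extends Sink {
  @volatile private var lastBatchId: Long = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId > lastBatchId) {
      train(data)
      lastBatchId = batchId // a real version would persist this durably
    }
    // else: already trained on this batch - skip
  }
}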
Structured Streaming ML vs DStreams ML
What could be different for ML on structured streaming vs ML on DStreams?
● Structured streaming is built on the Spark SQL engine
○ Catalyst optimizer
○ Project Tungsten
● Pipeline integration
○ ML pipelines have been improved and iterated across 5 releases; we can leverage their mature design for streaming pipelines
○ This will make adding and working with new algorithms much easier than in the past
● Event time handling
○ Streaming ML algorithms typically use a decay factor
○ Structured streaming provides native support for event time, which is more appropriate for decay (see the sketch below)
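As an illustration (our sketch, not the talk's code; the eventTime column and lambda rate are assumed names), a decay weight computed from event time rather than arrival time:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_timestamp, exp, lit, unix_timestamp}

// Weight each record by exp(-lambda * ageInSeconds), measuring age from
// the record's event time - late data decays by when it happened,
// not by when it arrived.
def withDecayWeight(df: DataFrame, lambda: Double): DataFrame =
  df.withColumn("weight",
    exp(lit(-lambda) *
      (unix_timestamp(current_timestamp()) - unix_timestamp(col("eventTime")))))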
Batch vs Streaming Pipelines (Draft POC API)
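// Batch pipeline: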
val df = spark
.read
.schema(schema)
.parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val nb = new NaiveBayes()
val pipeline = new Pipeline()
.setStages(Array(tokenizer, htf, nb))
val pipelineModel = pipeline.fit(df)
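// Streaming pipeline (POC):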
val df = spark
.readStream
.schema(schema)
.parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val snb = new StreamingNaiveBayes()
val pipeline = new StreamingPipeline()
.setStages(Array(tokenizer, htf, snb))
.setCheckpointLocation(path)
val query = pipeline.fitStreaming(df)
query.awaitTermination()
https://github.com/sethah/spark/tree/structured-streaming-fun
Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ follow me on twitter for future ones - https://twitter.com/holdenkarau
Structured Streaming Resources
● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
● https://github.com/holdenk/spark-structured-streaming-ml
● TD's deep dive: https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Coming soon: Spark in Action
● Coming soon: High Performance Spark
And the next book…..
First five chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Probably a book signing too!
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome book.
Surveys!!!!!!!! :D
● Interested in Structured Streaming?
○ http://bit.ly/structuredStreamingML - Let us know your thoughts
● Pssst: Care about Python DataFrame UDF performance?
○ http://bit.ly/pySparkUDF
● Care about Spark Testing?
○ http://bit.ly/holdenTestingSpark
Michael Himbeault
And some upcoming talks:
● September
○ This Talk (yay)
● October
○ PyData DC - Making Spark go fast in Python (vroom vroom)
○ Salt Lake City Spark Meetup - TBD
○ London - OSCON - Getting Started Contributing to Spark
● November
○ Possible London Office Hours if interest?
● December
○ Strata Singapore (Introduction to Datasets)
k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark users: have some simple UDFs you wish ran faster that you're willing to share?
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
