Writing Continuous
Applications with Structured
Streaming in PySpark
Jules S. Damji
Spark + AI Summit, SF
April 24, 2019
I have used Apache Spark 2.x Before…
Apache Spark Community & Developer Advocate @ Databricks
Developer Advocate @ Hortonworks
Software engineering @Sun Microsystems, Netscape, @Home, VeriSign,
Scalix, Centrify, LoudCloud/Opsware, ProQuest
Program Chair Spark + AI Summit
https://www.linkedin.com/in/dmatrix
@2twitme
VISION
Accelerate innovation by unifying data science, engineering and business
WHO WE ARE
• Original creators of
• 2000+ global companies use our platform across the big data & machine learning lifecycle
SOLUTION
Unified Analytics Platform
Agenda for Today’s Talk
• Why Apache Spark
• Why Streaming Applications are Difficult
• What’s Structured Streaming
• Anatomy of a Continuous Application
• Tutorials
• Q & A
How to think about data in 2019 - 2020
“Data is the new currency”
10101010. . . 10101010. . .
Why Apache Spark?
What is Apache Spark?
• General cluster computing engine
that extends MapReduce
• Rich set of APIs and libraries
• Unified Engine
• Large community: 1000+ orgs,
clusters up to 8000 nodes
• Supports DL Frameworks
Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation
SQL | Streaming | ML | Graph | … | DL
Unique Thing about Spark
• Unification: same engine and same API for diverse use cases
• Streaming, batch, or interactive
• ETL, SQL, machine learning, or graph
• Deep Learning Frameworks w/Horovod
– TensorFlow
– Keras
– PyTorch
Faster, Easier to Use, Unified
First Distributed Processing Engine → Specialized Data Processing Engines → Unified Data Processing Engine
Benefits of Unification
1. Simpler to use and operate
2. Code reuse: e.g. only write monitoring, FT, etc. once
3. New apps that span processing types: e.g. interactive
queries on a stream, online machine learning
An Analogy
Specialized devices Unified device
New applications
Why Are Streaming Applications Inherently Difficult?
Building robust stream processing apps is hard.
Complexities in stream processing
COMPLEX DATA
Diverse data formats
(json, avro, txt, csv, binary, …)
Data can be dirty
and late (out-of-order)
COMPLEX SYSTEMS
Diverse storage systems
(Kafka, S3, Kinesis, RDBMS, …)
System failures
COMPLEX WORKLOADS
Combining streaming with
interactive queries
Machine learning
Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
You should not have to reason about streaming.
Treat Streams as Unbounded Tables
data stream = unbounded input table
new data in the data stream = new rows appended to an unbounded table
You should write queries, and Apache Spark should continuously update the answer.
DataFrames, Datasets, SQL

input = (spark.readStream
    .format("kafka")
    .option("subscribe", "topic")
    .load())

result = (input
    .select("device", "signal")
    .where("signal > 15"))

(result.writeStream
    .format("parquet")
    .start("dest-path"))

Logical Plan: Read from Kafka → Project device, signal → Filter signal > 15 → Write to Parquet
Apache Spark automatically streamifies!
Spark SQL converts a batch-like query into a series of incremental execution plans operating on each new batch of data.
Optimized Physical Plan: Kafka Source → Optimized Operators (codegen, off-heap, etc.) → Parquet Sink
t = 1, t = 2, t = 3: process new data at each trigger
Structured Streaming – Processing Modes
Anatomy of a Continuous Application
Streaming word count
Anatomy of a Streaming Query
Simple Streaming ETL
Anatomy of a Streaming Query: Step 1

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load())

Source
• Specify one or more locations to read data from
• Built-in support for files/Kafka/socket sources; pluggable
Anatomy of a Streaming Query: Step 2

from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value")))

Transformation
• Using DataFrames, Datasets and/or SQL
• Internal processing is always exactly-once
Anatomy of a Streaming Query: Step 3

from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("complete")
    .option("checkpointLocation", "…")
    .start())

Sink
• Accepts the output of each batch
• Where supported, sinks are transactional and exactly-once (e.g., files)
Anatomy of a Streaming Query: Output Modes

from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("update")
    .option("checkpointLocation", "…")
    .start())

Output mode – what's output
• Complete – output the whole answer every time
• Update – output changed rows only
• Append – output new rows only

Trigger – when to output
• Specified as a time interval; may eventually support data size
• No trigger means run micro-batches as fast as possible
Anatomy of a Streaming Query: Checkpoint

from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .withWatermark("timestamp", "2 minutes")
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("update")
    .option("checkpointLocation", "…")
    .start())

Checkpoint & Watermark
• The checkpoint tracks the progress of a query in persistent storage
• It can be used to restart the query if there is a failure
• Set the checkpoint location, and a watermark to drop very late events
• Note: in PySpark, withWatermark is applied to the DataFrame before the aggregation
• For continuous processing, use .trigger(continuous="1 second")
Fault-tolerance with Checkpointing
Checkpointing tracks the progress (offsets) of consuming data from the source, plus intermediate state.
Offsets and metadata are saved as JSON in a write-ahead log.
You can resume after changing your streaming transformations.
Result: end-to-end exactly-once guarantees.
t = 1, t = 2, t = 3: process new data, recording progress at each step
Complex Streaming ETL
Traditional ETL
• Raw, dirty, un/semi-structured data is dumped as files
• Periodic jobs run every few hours to convert raw data to structured data ready for further analytics
• Problem:
• Hours of delay before taking decisions on the latest data
• Unacceptable when time is of the essence
– [intrusion, anomaly or fraud detection, monitoring IoT devices, etc.]
[Diagram: file dump → table (hours later) → SQL, Web, ML]
1. Streaming ETL w/ Structured Streaming
Structured Streaming changes the equation:
• eliminates latencies
• adds immediacy
• transforms data continuously
[Diagram: stream → table (seconds) → SQL, Web, ML]
2. Streaming ETL w/ Structured Streaming & Delta Lake
[Diagram: stream → Delta Lake (transactional log + Parquet files) in seconds]
Delta Lake ensures data reliability
Key Features
● ACID Transactions
● Schema Enforcement
● Unified Batch & Streaming
● Time Travel/Data Snapshots
High-quality, reliable data always ready for analytics, for both streaming writes and batch updates/deletes.
Streaming ETL w/ Structured Streaming
Example
1. JSON data being received in Kafka
2. Parse nested JSON and flatten it
3. Store in a structured Parquet (or Delta) table
4. Get end-to-end failure guarantees

from pyspark.sql.functions import from_json

rawData = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load())

parsedData = (rawData
    .selectExpr("cast(value as string) as json")
    .select(from_json("json", schema).alias("data"))
    .select("data.*"))   # do your ETL/transformation

query = (parsedData.writeStream
    .option("checkpointLocation", "/checkpoint")
    .partitionBy("date")
    .format("parquet")                 # or .format("delta")
    .trigger(processingTime="5 seconds")
    .start("/parquetTable"))           # or .start("/deltaTable")
Reading from Kafka

rawData = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load())

The rawData DataFrame has the following columns:

key      value    topic    partition offset timestamp
[binary] [binary] "topicA" 0         345    1486087873
[binary] [binary] "topicB" 3         2890   1486086721
Transforming Data
Cast the binary value to a string, naming the column json.
Parse the json string and expand it into nested columns, named data.

parsedData = (rawData
    .selectExpr("cast(value as string) as json")
    .select(from_json("json", schema).alias("data"))
    .select("data.*"))

json
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}

from_json("json") as "data" →

data (nested)
timestamp  device …
1486087873 devA   …
1486086721 devX   …
Transforming Data
Cast the binary value to a string, naming the column json.
Parse the json string and expand it into nested columns, named data.
Flatten the nested columns.

parsedData = (rawData
    .selectExpr("cast(value as string) as json")
    .select(from_json("json", schema).alias("data"))
    .select("data.*"))

Powerful built-in Python APIs to perform complex data transformations:
from_json, to_json, explode, … 100s of functions
(see our blog post & tutorial)
Writing to Parquet or Delta
Save parsed data as a Parquet table or Delta table at the given path.
Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data.

queryP = (parsedData.writeStream
    .option("checkpointLocation", ...)
    .partitionBy("date")
    .format("parquet")
    .start("/parquetTable"))   # path name

queryD = (parsedData.writeStream
    .option("checkpointLocation", ...)
    .partitionBy("date")
    .format("delta")
    .start("/deltaTable"))     # path name
Tutorials
https://dbricks.co/sais_pyspark_sf
Enter your cluster name.
Use DBR 5.3 with Apache Spark 2.4, Scala 2.11.
Summary
• Apache Spark is best suited for unified analytics & processing at scale
• Structured Streaming APIs enable continuous applications
• Populate Parquet tables or Delta Lake
• Demonstrated a continuous application
Resources
• Getting Started Guide with Apache Spark on Databricks
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Anthology of Technical Assets for Structured Streaming
• Databricks Engineering Blogs
• https://databricks.com/training/instructor-led-training
• https://delta.io
Thank You!
jules@databricks.com
@2twitme
https://www.linkedin.com/in/dmatrix/