BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Spark (Structured) Streaming vs.
Kafka Streams
Two stream processing platforms compared
Guido Schmutz
25.4.2018
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Our company.
Trivadis is a market leader in IT consulting, system integration, solution engineering
and the provision of IT services focusing on and
technologies
in Switzerland, Germany, Austria and Denmark. We offer our services in the following
strategic business fields:
Trivadis Services takes over the interacting operation of your IT systems.
OPERATION
COPENHAGEN
MUNICH
LAUSANNE
BERN
ZURICH
BRUGG
GENEVA
HAMBURG
DÜSSELDORF
FRANKFURT
STUTTGART
FREIBURG
BASEL
VIENNA
With over 600 specialists and IT experts in your region.
14 Trivadis branches and more than
600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget:
CHF 5.0 million
Financially self-supporting and
sustainably profitable
Experience from more than 1,900
projects per year at over 800
customers
Agenda
1. Introducing Stream Processing
2. Spark Streaming vs. Kafka Streams – Overview
3. Spark Streaming vs. Kafka Streams – in Action
4. Demo
5. Summary
Introducing Stream Processing
When to use Stream Processing / When not?
• Real-Time – constant low latency (milliseconds and under)
• Near-Real-Time – low milliseconds to seconds, delay in case of failures
• Batch – 10s of seconds or more, re-run in case of failures
Source: adapted from Cloudera
Typical Stream Processing Use Cases
• Notifications and Alerting - a notification or alert should be triggered if
some sort of event or series of events occurs.
• Real-Time Reporting – run real-time dashboards that
employees/customers can look at
• Incremental ETL – still ETL, but not in Batch but in streaming, continuous
mode
• Update data to serve in real-time – compute data that get served
interactively by other applications
• Real-Time decision making – analyzing new inputs and responding to
them automatically using business logic, e.g. fraud detection
• Online Machine Learning – train a model on a combination of historical
and streaming data and use it for real-time decision making
"Data at Rest" vs. "Data in Motion"
Data at Rest Data in Motion
Stream Processing & Analytics Ecosystem
[Figure: products grouped by category (Simple Event Processing, Event Stream Processing, Complex Event Processing, Edge) and by license (Open Source, Closed Source). Source: adapted from Tibco]
Demo Use Case
[Diagram labels: Truck-1, Truck-2, Truck-3; truck_position; detect_dangerous_driving; Truck Driver; jdbc-source; join_dangerous_driving_driver; dangerous_driving_driver; Count By Event Type, Window (1m, 30s); count_by_event_type]
Spark Streaming vs. Kafka Streams – Overview
Apache Spark Streaming as part of Spark Stack
• Libraries: Spark (Structured) Streaming, Advanced Analytics, Libraries & Ecosystem
• Structured API: Datasets, Data Frame, SQL
• Low Level API: Resilient Distributed Dataset (RDD), Distributed Variables
• Cluster Resource Managers: Spark Standalone, MESOS, YARN
• Data Stores: HDFS, Elastic Search, NoSQL, S3
Spark Streaming – 1st Generation
• one of the first APIs to enable stream processing using high-level functional operators
like map and reduce
• Like RDD API the DStreams API is based on
relatively low-level operations on
Java/Python objects
• Micro-batching
• Used by many organizations in production
• Spark 2.0 added a Structured API with support for DataFrame / Dataset and SQL
tables
Spark Structured Streaming – 2nd Generation
• Stream processing on Structured API
• DataFrames / Datasets rather than RDDs
• Code reuse between batch and streaming
• Potential to increase performance (Catalyst
SQL optimizer and Data Frame optimizations)
• Windowing and late out-of-order data handling
is much easier
• Traditional Spark Streaming to be considered
obsolete going forward
• marked production ready in Spark 2.2.0
• Support for Java, Scala, Python, R and SQL
Apache Kafka – A Streaming Platform
High-Level Architecture
Distributed Log at the Core
Scale-Out Architecture
Logs do not (necessarily) forget
Kafka Streams - Introduction
• Designed as a simple and lightweight library in Apache Kafka
• no external dependencies on systems
other than Apache Kafka
• Part of open source Apache Kafka,
introduced in 0.10+
• Leverages Kafka as its internal
messaging layer
• Supports fault-tolerant local state
• Continuous processing with millisecond latency
• Windowing with out-of-order data
• Support for Java and SQL (KSQL)
Stream-Table Duality
We can view a table as a stream
We can view a stream as a table
A stream can be considered a
changelog of a table, where each
data record in the stream captures
a state change of the table
A table can be considered a
snapshot of the latest value for
each key in a stream
Source: Confluent
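The duality described above can be sketched in plain Java, without any Kafka dependency. This is only an illustration under stated assumptions: the class name `StreamTableDuality`, the `{key, value}` array encoding and the sample driver keys are hypothetical, and ordinary collections stand in for a KStream and a KTable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java collections stand in for a stream and a table.
final class StreamTableDuality {

    // Stream -> table: replay the changelog and keep the latest value per key.
    static Map<String, String> toTable(List<String[]> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : changelog) {
            table.put(record[0], record[1]); // a later record overwrites an earlier one
        }
        return table;
    }

    // Table -> stream: emit every current row as one change record.
    static List<String[]> toStream(Map<String, String> table) {
        List<String[]> changelog = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            changelog.add(new String[] {row.getKey(), row.getValue()});
        }
        return changelog;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: {key, value} pairs in arrival order.
        List<String[]> changelog = Arrays.asList(
            new String[] {"driver-11", "Normal"},
            new String[] {"driver-12", "Normal"},
            new String[] {"driver-11", "Overspeed"});
        // prints {driver-11=Overspeed, driver-12=Normal}
        System.out.println(toTable(changelog));
    }
}
```

Replaying three change records yields a two-row table, because the second record for `driver-11` replaces the first.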
Spark Streaming vs. Kafka Streams – in Action
Concepts – Main Abstractions
Dataset/Data Frame API
• DataFrames and Datasets can represent
static, bounded data, as well as streaming,
unbounded data
• Use readStream() instead of read()
Transformation & Actions
• Almost all transformations from working on
bounded data (Batch) are also usable for
streaming
• Transformations are lazy
• Only action is starting a stream
Input Sources and Sinks
Triggers
• triggers define when data is output
• As soon as the previous micro-batch has finished (default)
• Fixed interval between micro-batches
• One-time micro-batch
Output Mode
• Define how data is output
• Append – only add new records to
output
• Update – update changed records in
place
• Complete – rewrite full output
Concepts – Main Abstractions
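Topology
import org.apache.spark.sql.types.StructType

val schema = new StructType()
  .add(...)

// Input: read the source as an unbounded DataFrame
val inputDf = spark
  .readStream
  .format(...)
  .option(...)
  .load()

// Filter: transformations are lazy
val filteredDf = inputDf.where(...)

// Output: starting the query is the only action
val query = filteredDf
  .writeStream
  .format(...)
  .option(...)
  .start()

query.stop()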
Concepts – Main Abstractions
Stream Processing Application
• any program that makes use of the Kafka
Streams library
Application Instance
• any running instance or "copy" of your
application
Topology
• defines logic that needs to be performed by
stream processing
• Defined using functional DSL or low-level
Processor API
Stream Processor
• a node in the processor topology
KStream
• Abstraction of a record stream
• Interpreted as events
• partitioned
KTable
• Abstraction of a change log stream
• Interpreted as update of same record key
• partitioned
GlobalKTable
• Like KTable, but not partitioned => all data
is available on all parallel application
instances
Concepts – Main Abstractions
Topology
public static void main(String[] args) {
  Properties streamsConfiguration = new Properties();
  streamsConfiguration.put(...);

  final StreamsBuilder builder = new StreamsBuilder();
  // Input
  KStream<..,..> stream = builder.stream(...);
  // Filter
  KStream<..,..> filtered = stream.filter(...);
  // Output
  filtered.to(...);

  KafkaStreams streams = new KafkaStreams(
      builder.build(), streamsConfiguration);
  streams.start();
}
Streaming Data Sources
• File Source
• Reads files as a stream of data
• Supports text, csv, json, orc, parquet
• Files must be atomically placed
• Kafka Source
• Reads from Kafka Topic
• Supports Kafka broker 0.10.0 or higher
• Socket Source (for testing)
• Reads UTF8 text from socket
connection
• Rate Source (for testing)
• Generate data at specified number of
rows per second
val rawDf = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("subscribe", "truck_position")
.load()
Streaming Data Sources
Supports "Kafka only"
KStream from Topic
KTable from Topic
Use Kafka Connect for reading
other data sources into Kafka
first
KStream<String, TruckPosition> positions =
builder.stream("truck_position"
, Consumed.with(Serdes.String()
, truckPositionSerde));
KTable<String, Driver> driver =
builder.table("trucking_driver"
, Consumed.with(Serdes.String()
, driverSerde)
, Materialized.as("driver-store"));
Streaming Sinks
• File Sink – stores output to a directory
• Kafka Sink – publishes to Kafka
• Foreach Sink – runs arbitrary computation on the records in the output
• Console Sink – for debugging, prints output to console
• Memory Sink – for debugging, stores output in-memory table
val query = jsonTruckPlusDriverDf
.selectExpr("to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker-1:9092")
.option("topic", "dangerous_driving")
.option("checkpointLocation", "/tmp")
.start()
Streaming Sinks
Supports "Kafka only"
Use Kafka Connect for
writing out to other targets
KStream<String, TruckPositionDriver> posDriver = ..
posDriver.to("dangerous_driving"
, Produced.with(Serdes.String()
, truckPositionDriverSerde));
For testing only:
// print to system output
posDriver.print(Printed.toSysOut());
// shortcut for
posDriver.foreach((key,value) ->
System.out.println(key + "=" + value));
Stateless Operations – Selection & Projection
Spark Structured Streaming:
Most common operations on DataFrame/Dataset are supported for streaming as well
select, filter, map, flatMap, …
val filteredDf =
truckPosDf.where(
"eventType != 'Normal'")
Kafka Streams:
KStream and KTable interfaces support a variety of transformation operations
filter, filterNot, map, mapValues, flatMap, flatMapValues, branch, selectKey, groupByKey …
KStream<String, TruckPosition> filtered =
positions.filter((key,value) ->
!value.eventType.equals("Normal")
);
Stateful Operations – Aggregations
Spark Structured Streaming:
State is held in distributed memory with the option to spill to disk (fault tolerant through checkpointing to a Hadoop-like FS)
Output modes: Complete, Append, Update
count, sum, mapGroupsWithState, flatMapGroupsWithState, reduce ...
val c = source
.withWatermark("timestamp"
, "10 minutes")
.groupBy()
.count()
Kafka Streams:
Requires a state store, which can be in-memory, RocksDB or a custom implementation (fault tolerant through Kafka topics)
Result of an aggregation is a KTable
count, reduce, aggregate (sum and avg are expressed through reduce or aggregate) ...
KTable<..> c = stream
.groupByKey(..)
.count(...);
Stateful Operations – Time Abstraction
[Figure: timeline relating Event Time, Ingestion Time and Processing Time to the clock; adapted from Matthias Niehoff (codecentric)]
Stateful Operations – Time Abstraction
Spark Structured Streaming:
Event-Time
• New with Spark Structured Streaming
• Extracted from the message (payload)
Processing Time
• Spark Streaming only supported processing time
• Generate the timestamp upon processing:
df.withColumn("processingTime"
,current_timestamp())
Ingestion Time
• Only for sources which capture the ingestion time:
.option("includeTimestamp", true)
Kafka Streams:
Event-time
• Point in time when event occurred
• Extracted from the message (payload or header)
Processing-time
• Point in time when event happens to be processed by stream processing application
Ingestion Time
• Point in time when event is stored in Kafka (sent in message header)
Stateful Operations - Windowing
Because a stream is large and never-ending, it is not feasible to keep all of its data in memory
Computations over events are done using windows of data
• Fixed Window (aka Tumbling Window) - eviction policy is always based on the window being full, and the trigger policy is based on either the count of items in the window or time
• Hopping Window (aka Sliding Window) - uses eviction and trigger policies that are based on time: window length and sliding interval length
• Session Window – sessions are composed of sequences of temporally related events terminated by a gap of inactivity greater than some timeout
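How a single event maps to time-based tumbling and hopping windows can be sketched in plain Java. This is a minimal sketch, not the implementation of either framework: the class and method names are hypothetical, timestamps are assumed to be non-negative milliseconds, and windows are assumed to be aligned to time 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: window assignment for non-negative timestamps,
// assuming windows are aligned to time 0.
final class WindowAssignment {

    // Tumbling window: each timestamp falls into exactly one window.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Hopping window: a timestamp falls into every window whose
    // half-open range [start, start + size) contains it.
    static List<Long> hoppingWindowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long latestStart = timestampMs - (timestampMs % advanceMs);
        for (long start = latestStart;
             start >= 0 && start > timestampMs - sizeMs;
             start -= advanceMs) {
            starts.add(start);
        }
        Collections.reverse(starts); // oldest window first
        return starts;
    }

    public static void main(String[] args) {
        long eventTime = 70_000L; // 70 s after time 0
        // 1-minute tumbling window: prints 60000
        System.out.println(tumblingWindowStart(eventTime, 60_000L));
        // 1-minute window advancing every 30 s: prints [30000, 60000]
        System.out.println(hoppingWindowStarts(eventTime, 60_000L, 30_000L));
    }
}
```

With a 1-minute window and a 30-second advance, the event at 70 s is counted twice: once in the window starting at 30 s and once in the window starting at 60 s.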
Stateful Operations - Windowing
Support for Tumbling & Hopping (Sliding) Time Windows
Handling Late Data with Watermarking
val c = source
.withWatermark("timestamp"
, "10 minutes")
.groupBy(window($"timestamp"
, "1 minute"
, "30 seconds")
, $"word")
.count()
Data older than the watermark is not expected and gets discarded
[Figure: event time plotted against processing time; the watermark trails the max event time by a gap of 10 mins]
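The watermark rule (max event time seen so far minus a trailing gap) can be sketched in plain Java. This is a simplification under stated assumptions: the class `WatermarkTracker` is hypothetical, time is counted in minutes since midnight, and the watermark advances per record here, whereas Spark advances it between micro-batches and does not guarantee that every late record is dropped.

```java
// Illustrative only: a per-record watermark check in arbitrary time units.
final class WatermarkTracker {
    private final long trailingGap;
    private boolean seenEvent = false;
    private long maxEventTime;

    WatermarkTracker(long trailingGap) {
        this.trailingGap = trailingGap;
    }

    // Watermark = max event time seen so far minus the trailing gap.
    long watermark() {
        return seenEvent ? maxEventTime - trailingGap : Long.MIN_VALUE;
    }

    // Returns false for events older than the current watermark.
    boolean accept(long eventTime) {
        if (eventTime < watermark()) {
            return false;
        }
        if (!seenEvent || eventTime > maxEventTime) {
            maxEventTime = eventTime;
            seenEvent = true;
        }
        return true;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(10); // trailing gap of 10 minutes
        System.out.println(tracker.accept(740)); // 12:20 -> true, watermark becomes 12:10
        System.out.println(tracker.accept(725)); // 12:05 -> false, older than the watermark
        System.out.println(tracker.accept(735)); // 12:15 -> true, late but within the gap
    }
}
```

An out-of-order event is still accepted as long as it is not older than the watermark, which is what makes late-data handling bounded in state.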
Stateful Operations - Windowing
Support for Tumbling & Hopping Windows
Supports Session Windows
Handling Late Data with Data
Retention (optional)
KTable<..> c = stream
.groupByKey(...)
.windowedBy(
SessionWindows
.with(5 * 60 * 1000)
).count();
KTable<..> c = stream
.groupByKey(..)
.windowedBy(
TimeWindows.of(60 * 1000)
.advanceBy(30 * 1000)
.until(10 * 60 * 1000)
).count(...);
Data older than the data retention period is not expected and gets discarded
[Figure: event time plotted against processing time; data retention trails the max event time by a gap of 10 mins]
Stateful Operations - Joins
Joining streaming-to-static and
streaming-to-streaming (since 2.3)
Dataset/DataFrame
Watermarking helps Spark to know for
how long to retain data
• Optional for Inner Joins
• Mandatory for Outer Joins
Support for Inner, Left Outer, Right
Outer and Full Outer
val jsonTruckPlusDriverDf =
jsonFilteredDf.join(driverDf
, Seq("driverId")
, "left")
Source: Spark Documentation
Supports following joins
• KStream-to-KStream
• KTable-to-KTable
• KStream-to-KTable
• KStream-to-GlobalKTable
• KTable-to-GlobalKTable
Stateful Operations - Joins
KStream<String, TruckPositionDriver> joined =
filteredRekeyed.leftJoin(driver
// right is null when no driver exists for the key (left join)
, (left,right) -> new TruckPositionDriver(left
, right == null ? "" : StringUtils.defaultIfEmpty(right.first_name,"")
, right == null ? "" : StringUtils.defaultIfEmpty(right.last_name,""))
, Joined.with(Serdes.String()
, truckPositionSerde
, driverSerde));
Source: Confluent Documentation
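The semantics of a stream-to-table left join can be sketched in plain Java: each stream event is joined against whatever the table holds for its key at that moment, and the table side is null when no row exists yet. This is a sketch under assumptions, not Kafka Streams code: the class name, the `{kind, key, value}` input encoding and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: inputs are processed in arrival order.
final class StreamTableLeftJoin {

    // Each input is {kind, key, value}: kind "T" updates the table,
    // kind "S" is a stream event that gets joined against the table.
    static List<String> process(List<String[]> inputs) {
        Map<String, String> table = new HashMap<>();
        List<String> joined = new ArrayList<>();
        for (String[] input : inputs) {
            String kind = input[0];
            String key = input[1];
            String value = input[2];
            if ("T".equals(kind)) {
                table.put(key, value);
            } else {
                String right = table.get(key); // null when no table row exists yet
                joined.add(key + ":" + value + ":" + (right == null ? "" : right));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        List<String[]> inputs = Arrays.asList(
            new String[] {"S", "11", "Overspeed"},       // arrives before the driver row
            new String[] {"T", "11", "Paul"},
            new String[] {"S", "11", "Lane Departure"});
        // prints [11:Overspeed:, 11:Lane Departure:Paul]
        System.out.println(process(inputs));
    }
}
```

The first event is emitted with an empty driver name because a left join keeps stream records that have no match, which is why the joiner has to tolerate a null right-hand side.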
Streaming SQL with KSQL
Enables stream processing with zero
coding required
The simplest way to process streams of
data in real-time
Powered by Kafka Streams
available as Developer preview!
STREAM and TABLE as first-class
citizens
STREAM = data in motion
TABLE = collected state of a stream
join STREAM and TABLE
ksql> CREATE STREAM truck_position_s
(timestamp BIGINT,
truckId BIGINT,
driverId BIGINT,
routeId BIGINT,
eventType VARCHAR,
latitude DOUBLE,
longitude DOUBLE,
correlationid VARCHAR)
WITH (kafka_topic='truck_position',
value_format='JSON');
ksql> SELECT * FROM truck_position_s;
1506922133306 | "truck/13/position0 |
2017-10-02T07:28:53 | 31 | 13 | 371182829
| Memphis to Little Rock | Normal | 41.76 |
-89.6 | -2084263951914664106
There is more ….
• Streaming Deduplication
• Run-Once Trigger / fixed Interval
Micro-Batching
• Continuous Trigger with fixed
checkpoint interval (experimental in
2.3)
• Streaming Machine Learning
• REPL
• KSQL
• Queryable State
• Processor API
• At-least Once vs. Exactly Once
• Microservices with Kafka Streams
• Scale-up / Scale-Down
• Stand-by replica of local state
Demo
Demo Use Case
[Diagram labels: Truck-1, Truck-2, Truck-3; truck_position; detect_dangerous_driving; Truck Driver; jdbc-source; trucking_driver; join_dangerous_driving_driver; dangerous_driving_driver; Count By Event Type, Window (1m, 30s); count_by_event_type]
Summary
Spark Structured Streaming vs. Kafka Streams
Spark Structured Streaming:
• Runs on top of a Spark cluster
• Reuse your investments into Spark (knowledge and maybe code)
• An HDFS-like file system needs to be available
• Higher latency due to micro-batching
• Multi-language support: Java, Python, Scala, R
• Supports ad-hoc, notebook-style development/environment
Kafka Streams:
• Available as a Java library
• Can be the implementation choice of a microservice
• Can only work with Kafka for both input and output
• Low latency due to continuous processing
• Currently only supports Java; Scala support available soon
• KSQL abstraction provides SQL on top of Kafka Streams
Comparison
| | Kafka Streams | Spark Streaming | Spark Structured Streaming |
| Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL |
| Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching / Continuous Streaming (experimental) |
| Core Abstraction | KStream / KTable | DStream (RDD) | Data Frame / Dataset |
| Programming Model | Declarative/Imperative | Declarative | Declarative |
| Time Support | Event / Ingestion / Processing | Processing | Event / Processing |
| State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk |
| Join | Stream-Stream, Stream-Static | Stream-Static | Stream-Static, Stream-Stream (2.3) |
| Event Pattern detection | No | No | No |
| Queryable State | Interactive Queries | No | No |
| Scalability & Reliability | Yes | Yes | Yes |
| Guarantees | At Least Once/Exactly Once | At Least Once/Exactly Once (partial) | At Least Once/Exactly Once (partial) |
| Latency | Sub-second | Seconds | Sub-second |
| Deployment | Java Library | Cluster (HDFS-like FS needed for resiliency) | Cluster (HDFS-like FS needed for resiliency) |
Technology on its own won't help you.
You need to know how to use it properly.


Spark (Structured) Streaming vs. Kafka Streams - two stream processing platforms compared

  • 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Spark (Structured) Streaming vs. Kafka Streams Two stream processing platforms compared Guido Schmutz 25.4.2018 @gschmutz guidoschmutz.wordpress.com
  • 2. Guido Schmutz Working at Trivadis for more than 21 years Oracle ACE Director for Fusion Middleware and SOA Consultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: [email protected] Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz
  • 3. Our company. Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields: Trivadis Services takes over the interacting operation of your IT systems. O P E R A T I O N
  • 4. COPENHAGEN MUNICH LAUSANNE BERN ZURICH BRUGG GENEVA HAMBURG DÜSSELDORF FRANKFURT STUTTGART FREIBURG BASEL VIENNA With over 600 specialists and IT experts in your region. 14 Trivadis branches and more than 600 employees 200 Service Level Agreements Over 4,000 training participants Research and development budget: CHF 5.0 million Financially self-supporting and sustainably profitable Experience from more than 1,900 projects per year at over 800 customers
  • 5. Agenda 1. Introducing Stream Processing 2. Spark Streaming vs. Kafka Streams – Overview 3. Spark Streaming vs. Kafka Streams – in Action 4. Demo 5. Summary
  • 7. When to use Stream Processing / When not? Constant low Milliseconds & under Low milliseconds to seconds, delay in case of failures 10s of seconds of more, Re-run in case of failures Real-Time Near-Real-Time Batch Source: adapted from Cloudera
  • 8. Typical Stream Processing Use Cases • Notifications and Alerting - a notification or alert should be triggered if some sort of event or series of events occurs. • Real-Time Reporting – run real-time dashboards that employees/customers can look at • Incremental ETL – still ETL, but not in Batch but in streaming, continuous mode • Update data to serve in real-time – compute data that get served interactively by other applications • Real-Time decision making – analyzing new inputs and responding to them automatically using business logic, i.e. Fraud Detection • Online Machine Learning – train a model on a combination of historical and streaming data and use it for real-time decision making
  • 9. "Data at Rest" vs. "Data in Motion" Data at Rest Data in Motion
  • 10. Stream Processing & Analytics Ecosystem Complex Event Processing Simple Event Processing Open Source Closed Source Event Stream Processing Source: adapted from Tibco Edge
  • 12. Spark Streaming vs. Kafka Streams - Overview
  • 13. Apache Spark Streaming as part of Spark Stack Spark (Structured) Streaming Resilient Distributed Dataset (RDD) Spark Standalone MESOS YARN HDFS Elastic Search NoSQL S3 Libraries Low Level API Cluster Resource Managers Data Stores Advanced Analytics Libraries & Ecosystem Data Frame Structured API Datasets SQL Distributed Variables
  • 14. Spark Streaming – 1st Generation • one of the first APIs to enable stream processing using high-level functional operators like map and reduce • Like RDD API the DStreams API is based on relatively low-level operations on Java/Python objects • Micro-batching • Used by many organizations in production • Spark 2.0 added a Structured API with support for DataFrame / Dataset and SQL tables
  • 15. Spark Structured Streaming – 2nd Generation • Stream processing on Structured API • DataFrames / Datasets rather than RDDs • Code reuse between batch and streaming • Potential to increase performance (Catalyst SQL optimizer and Data Frame optimizations) • Windowing and late out-of-order data handling is much easier • Traditional Spark Streaming to be considered obsolete going forward • marked production ready in Spark 2.2.0 • Support for Java, Scala, Python, R and SQL
  • 16. Apache Kafka – A Streaming Platform High-Level Architecture Distributed Log at the Core Scale-Out Architecture Logs do not (necessarily) forget
  • 17. Kafka Streams - Introduction • Designed as a simple and lightweight library in Apache Kafka • no external dependencies on systems other than Apache Kafka • Part of open source Apache Kafka, introduced in 0.10+ • Leverages Kafka as its internal messaging layer • Supports fault-tolerant local state • Continuous processing with millisecond latency • Windowing with out-of-order data • Support for Java and SQL (KSQL)
  • 18. Stream-Table Duality We can view a table as a stream We can view a stream as a table A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table A table can be considered a snapshot of the latest value for each key in a stream Source: Confluent
  • 19. Spark Streaming vs. Kafka Streams – in Action
  • 20. Concepts – Main Abstractions Dataset/Data Frame API • DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data • Use readStream() instead of read() Transformation & Actions • Almost all transformations from working on bounded data (Batch) are also usable for streaming • Transformations are lazy • Only action is starting a stream Input Sources and Sinks Triggers • triggers define when data is output • As soon as last group is finished • Fixed interval between micro-batches • One-time micro-batch Output Mode • Define how data is output • Append – only add new records to output • Update – update changed records in place • Complete – rewrite full output
  • 21. Concepts – Main Abstractions • Topology: val schema = new StructType() .add(...) val inputDf = spark .readStream .format(...) .option(...) .load() val filteredDf = inputDf.where(...) val query = filteredDf .writeStream .format(...) .option(...) .start() query.stop()
  • 22. Concepts – Main Abstractions (Kafka Streams) • Stream Processing Application: any program that makes use of the Kafka Streams library • Application Instance: any running instance or "copy" of your application • Topology: defines the logic that needs to be performed by stream processing; defined using the functional DSL or the low-level Processor API • Stream Processor: a node in the processor topology • KStream: abstraction of a record stream; interpreted as events; partitioned • KTable: abstraction of a changelog stream; interpreted as updates of the same record key; partitioned • GlobalKTable: like KTable, but not partitioned => all data is available on all parallel application instances
  • 23. Concepts – Main Abstractions • Topology: public static void main(String[] args) { Properties streamsConfiguration = new Properties(); streamsConfiguration.put(...); final StreamsBuilder builder = new StreamsBuilder(); KStream<..,..> stream = builder.stream(...); KStream<..,..> filtered = stream.filter(...); filtered.to(...); KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration); streams.start(); }
  • 24. Streaming Data Sources • File Source: reads files as a stream of data; supports text, csv, json, orc, parquet; files must be atomically placed • Kafka Source: reads from a Kafka topic; supports Kafka broker 0.10.0 or higher • Socket Source (for testing): reads UTF-8 text from a socket connection • Rate Source (for testing): generates data at a specified number of rows per second • val rawDf = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "broker-1:9092") .option("subscribe", "truck_position") .load()
  • 25. Streaming Data Sources • Supports "Kafka only" • KStream from Topic • KTable from Topic • Use Kafka Connect for reading other data sources into Kafka first • KStream<String, TruckPosition> positions = builder.stream("truck_position", Consumed.with(Serdes.String(), truckPositionSerde)); • KTable<String, Driver> driver = builder.table("trucking_driver", Consumed.with(Serdes.String(), driverSerde), Materialized.as("driver-store"));
  • 26. Streaming Sinks • File Sink – stores output to a directory • Kafka Sink – publishes to Kafka • Foreach Sink – runs arbitrary computation on the records in the output • Console Sink – for debugging, prints output to the console • Memory Sink – for debugging, stores output in an in-memory table • val query = jsonTruckPlusDriverDf .selectExpr("to_json(struct(*)) AS value") .writeStream .format("kafka") .option("kafka.bootstrap.servers", "broker-1:9092") .option("topic", "dangerous_driving") .option("checkpointLocation", "/tmp") .start()
  • 27. Streaming Sinks • Supports "Kafka only" • Use Kafka Connect for writing out to other targets • KStream<String, TruckPosition> posDriver = ..; posDriver.to("dangerous_driving", Produced.with(Serdes.String(), truckPositionDriverSerde)); • For testing only: KStream<String, TruckPosition> posDriver = ..; // print to system output posDriver.print(Printed.toSysOut()); // shortcut for posDriver.foreach((key, value) -> System.out.println(key + "=" + value));
  • 28. Stateless Operations – Selection & Projection • Spark Structured Streaming: most common operations on DataFrame/Dataset are supported for streaming as well: select, filter, map, flatMap, … • val filteredDf = truckPosDf.where("eventType != 'Normal'") • Kafka Streams: the KStream and KTable interfaces support a variety of transformation operations: filter, filterNot, map, mapValues, flatMap, flatMapValues, branch, selectKey, groupByKey, … • KStream<..,..> filtered = positions.filter((key, value) -> !value.eventType.equals("Normal"));
  • 29. Stateful Operations – Aggregations • Spark Structured Streaming: state is held in distributed memory with the option to spill to disk (fault tolerant through checkpointing to a Hadoop-like FS) • Output modes: Complete, Append, Update • count, sum, mapGroupsWithState, flatMapGroupsWithState, reduce ... • val c = source .withWatermark("timestamp", "10 minutes") .groupBy() .count() • Kafka Streams: requires a state store, which can be in-memory, RocksDB or a custom implementation (fault tolerant through Kafka topics) • Result of an aggregation is a KTable • count, reduce, aggregate (sum, avg via aggregate) ... • KTable<..> c = stream .groupByKey(..) .count(...);
  • 30. Stateful Operations – Time Abstraction • [Figure: Event Time, Ingestion Time and Processing Time shown against a clock for events 1 to 5; adapted from Matthias Niehoff (codecentric)]
  • 31. Stateful Operations – Time Abstraction • Spark: Event Time – new with Spark Structured Streaming; extracted from the message (payload) • Processing Time – the older Spark Streaming only supported processing time; the timestamp is generated upon processing: df.withColumn("processingTime", current_timestamp()) • Ingestion Time – only for sources which capture the ingestion time: .option("includeTimestamp", true) • Kafka Streams: Event Time – point in time when the event occurred; extracted from the message (payload or header) • Processing Time – point in time when the event happens to be processed by the stream processing application • Ingestion Time – point in time when the event is stored in Kafka (sent in the message header)
  • 32. Stateful Operations - Windowing • Because a stream is large and never-ending, it is not feasible to keep the entire stream of data in memory • Computations over events are done using windows of data • Fixed Window (aka Tumbling Window) - the eviction policy is always based on the window being full, and the trigger policy is based on either the count of items in the window or time • Hopping Window (aka Sliding Window) - uses eviction and trigger policies that are based on time: window length and sliding interval length • Session Window – sessions are composed of sequences of temporally related events, terminated by a gap of inactivity greater than some timeout
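As a sketch of how time-based tumbling and hopping windows relate to each other, the following plain-Java helper computes which windows a given event timestamp falls into. It assumes non-negative millisecond timestamps and windows aligned to time 0, covering [start, start + size); the class and method names are hypothetical and not taken from Spark or Kafka Streams.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating window assignment; assumes timestamps >= 0
// and windows aligned to time 0.
final class WindowAssignment {

    /** Start of the single tumbling window of length sizeMs that contains ts. */
    static long tumblingWindowStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    /**
     * Starts (ascending) of all hopping windows of length sizeMs, advancing by
     * advanceMs, that contain ts. Each window covers [start, start + sizeMs).
     */
    static List<Long> hoppingWindowStarts(long ts, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long latestStart = ts - (ts % advanceMs); // latest window start that is <= ts
        for (long start = latestStart; start > ts - sizeMs && start >= 0; start -= advanceMs) {
            starts.add(0, start); // prepend to keep ascending order
        }
        return starts;
    }
}
```

With a window length of 60 seconds and an advance of 30 seconds, an event at second 70 belongs to one tumbling window (starting at second 60) but to two hopping windows (starting at seconds 30 and 60). A tumbling window is the special case where the advance equals the window length.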
  • 33. Stateful Operations - Windowing (Spark Structured Streaming) • Support for Tumbling & Hopping (Sliding) Time Windows • Handling Late Data with Watermarking: data older than the watermark is not expected and gets discarded • val c = source .withWatermark("eventTime", "10 minutes") .groupBy(window($"eventTime", "1 minute", "30 seconds"), $"word") .count() • [Figure: event time vs. processing time; the watermark trails the max event time by a gap of 10 mins (time labels 12:10, 12:20, 12:25)]
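The watermark rule from this slide (watermark = max event time seen so far minus the allowed delay) can be sketched in plain Java. This is a simplified per-record model under stated assumptions: Spark itself advances the watermark between micro-batches rather than per record, and `WatermarkTracker` is a hypothetical class, not a Spark API.

```java
// Hypothetical, simplified per-record model of watermarking; not Spark code.
final class WatermarkTracker {

    private final long delayMs;     // allowed lateness, e.g. 10 minutes
    private long maxEventTimeMs;    // highest event time seen so far
    private boolean seenAny = false;

    WatermarkTracker(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Current watermark: max event time minus the delay (MIN_VALUE before any event). */
    long watermark() {
        return seenAny ? maxEventTimeMs - delayMs : Long.MIN_VALUE;
    }

    /** Returns false if the event is older than the watermark and should be discarded. */
    boolean accept(long eventTimeMs) {
        if (eventTimeMs < watermark()) {
            return false; // too late: older than the trailing gap allows
        }
        maxEventTimeMs = seenAny ? Math.max(maxEventTimeMs, eventTimeMs) : eventTimeMs;
        seenAny = true;
        return true;
    }
}
```

With a delay of 10 minutes and a max event time of 12:20, the watermark sits at 12:10: an event stamped 12:15 is still accepted, while one stamped 12:05 is discarded.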
  • 34. Stateful Operations - Windowing (Kafka Streams) • Support for Tumbling & Hopping Windows • Supports Session Windows • Handling Late Data with Data Retention (optional): data older than the data retention period is not expected and gets discarded • KTable<..> c = stream .groupByKey(...) .windowedBy(SessionWindows.with(5 * 60 * 1000)) .count(); • KTable<..> c = stream .groupByKey(..) .windowedBy(TimeWindows.of(60 * 1000) .advanceBy(30 * 1000) .until(10 * 60 * 1000)) .count(...); • [Figure: event time vs. processing time; the data retention boundary trails the max event time by a gap of 10 mins (time labels 12:10, 12:20, 12:25)]
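The session window behaviour can be illustrated with a small plain-Java sketch that splits event timestamps into sessions whenever the inactivity gap is exceeded. It assumes the timestamps arrive already sorted; Kafka Streams additionally merges sessions when out-of-order records arrive, which this sketch does not model. The class `SessionWindowing` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of session windowing on sorted timestamps; not Kafka Streams code.
final class SessionWindowing {

    /**
     * Groups ascending timestamps into sessions. A new session starts when the
     * distance to the previous event is greater than gapMs.
     * Each result entry is {sessionStart, sessionEnd}.
     */
    static List<long[]> sessions(long[] sortedTimestamps, long gapMs) {
        List<long[]> result = new ArrayList<>();
        if (sortedTimestamps.length == 0) {
            return result;
        }
        long start = sortedTimestamps[0];
        long end = start;
        for (int i = 1; i < sortedTimestamps.length; i++) {
            long ts = sortedTimestamps[i];
            if (ts - end > gapMs) {          // inactivity gap exceeded: close the session
                result.add(new long[] {start, end});
                start = ts;
            }
            end = ts;
        }
        result.add(new long[] {start, end}); // close the last open session
        return result;
    }
}
```

With the 5-minute gap from the slide, events at minutes 0, 1 and 2 form one session, and events at minutes 10 and 10.5 form a second one.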
  • 35. Stateful Operations - Joins (Spark Structured Streaming) • Joining a streaming Dataset/DataFrame to a static one, and streaming-to-streaming (since 2.3) • Watermarking helps Spark to know for how long to retain data: optional for Inner Joins, mandatory for Outer Joins • Support for Inner, Left Outer, Right Outer and Full Outer • val jsonTruckPlusDriverDf = jsonFilteredDf.join(driverDf, Seq("driverId"), "left") • Source: Spark Documentation
  • 36. Stateful Operations - Joins (Kafka Streams) • Supports the following joins: KStream-to-KStream, KTable-to-KTable, KStream-to-KTable, KStream-to-GlobalKTable, KTable-to-GlobalKTable • KStream<String, TruckPositionDriver> joined = filteredRekeyed.leftJoin(driver, (left, right) -> new TruckPositionDriver(left, StringUtils.defaultIfEmpty(right.first_name, ""), StringUtils.defaultIfEmpty(right.last_name, "")), Joined.with(Serdes.String(), truckPositionSerde, driverSerde)); • Source: Confluent Documentation
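The semantics of a stream-to-table left join can be sketched in plain Java: every stream record is looked up in the table by key, and a record without a match still produces output, with the table side missing. The class below is a hypothetical illustration (the driver ids, names and the output format are made up) and does not use the Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of stream-to-table left join semantics; not Kafka Streams code.
final class StreamTableLeftJoin {

    /**
     * Joins each stream record {driverId, eventType} with the current table row
     * for that driverId. In a left join the stream record is always emitted;
     * the table side is null when there is no matching row.
     */
    static List<String> leftJoin(List<String[]> positions, Map<String, String> drivers) {
        List<String> joined = new ArrayList<>();
        for (String[] position : positions) {
            String driverId = position[0];
            String eventType = position[1];
            String driverName = drivers.get(driverId); // null if the driver is unknown
            joined.add(driverId + ":" + eventType + ":" + (driverName == null ? "" : driverName));
        }
        return joined;
    }
}
```

The explicit null check matters: in a left join the joiner has to be prepared for a missing table row rather than dereferencing it directly.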
  • 37. Streaming SQL with KSQL • Enables stream processing with zero coding required • The simplest way to process streams of data in real-time • Powered by Kafka Streams • Available as Developer preview! • STREAM and TABLE as first-class citizens • STREAM = data in motion • TABLE = collected state of a stream • Join STREAM and TABLE • ksql> CREATE STREAM truck_position_s (timestamp BIGINT, truckId BIGINT, driverId BIGINT, routeId BIGINT, eventType VARCHAR, latitude DOUBLE, longitude DOUBLE, correlationid VARCHAR) WITH (kafka_topic='truck_position', value_format='JSON'); • ksql> SELECT * FROM truck_position_s; 1506922133306 | "truck/13/position0 | 2017-10-02T07:28:53 | 31 | 13 | 371182829 | Memphis to Little Rock | Normal | 41.76 | -89.6 | -2084263951914664106
  • 38. There is more …. • Streaming Deduplication • Run-Once Trigger / fixed Interval Micro-Batching • Continuous Trigger with fixed checkpoint interval (experimental in 2.3) • Streaming Machine Learning • REPL • KSQL • Queryable State • Processor API • At-least Once vs. Exactly Once • Microservices with Kafka Streams • Scale-up / Scale-Down • Stand-by replica of local state
  • 39. Demo
  • 45. Spark Structured Streaming vs. Kafka Streams • Spark Structured Streaming: runs on top of a Spark cluster • Reuse your investments in Spark (knowledge and maybe code) • An HDFS-like file system needs to be available • Higher latency due to micro-batching • Multi-language support: Java, Python, Scala, R • Supports ad-hoc, notebook-style development/environment • Kafka Streams: available as a Java library • Can be the implementation choice of a microservice • Can only work with Kafka for both input and output • Low latency due to continuous processing • Currently only supports Java; Scala support available soon • KSQL abstraction provides SQL on top of Kafka Streams
  • 46. Comparison
      Criterion | Kafka Streams | Spark Streaming | Spark Structured Streaming
      Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL
      Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching / Continuous Streaming (experimental)
      Core Abstraction | KStream / KTable | DStream (RDD) | DataFrame / Dataset
      Programming Model | Declarative/Imperative | Declarative | Declarative
      Time Support | Event / Ingestion / Processing | Processing | Event / Processing
      State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk
      Join | Stream-Stream, Stream-Static | Stream-Static | Stream-Static, Stream-Stream (2.3)
      Event Pattern Detection | No | No | No
      Queryable State | Interactive Queries | No | No
      Scalability & Reliability | Yes | Yes | Yes
      Guarantees | At Least Once / Exactly Once | At Least Once / Exactly Once (partial) | At Least Once / Exactly Once (partial)
      Latency | Sub-second | Seconds | Sub-second
      Deployment | Java Library | Cluster (HDFS-like FS needed for resiliency) | Cluster (HDFS-like FS needed for resiliency)
  • 47. Technology on its own won't help you. You need to know how to use it properly.