Spark (Structured) Streaming vs. Kafka Streams - two stream processing platforms compared

Guido Schmutz
25.4.2018
@gschmutz guidoschmutz.wordpress.com

Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz

Our company.
Trivadis is a market leader in IT consulting, system integration, solution engineering
and the provision of IT services focusing on and
technologies
in Switzerland, Germany, Austria and Denmark. We offer our services in the following
strategic business fields:
Trivadis Services takes over the interacting operation of your IT systems.

Branches: Basel, Bern, Brugg, Copenhagen, Düsseldorf, Frankfurt, Freiburg, Geneva, Hamburg, Lausanne, Munich, Stuttgart, Vienna, Zurich
With over 600 specialists and IT experts in your region.
14 Trivadis branches and more than
600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget:
CHF 5.0 million
Financially self-supporting and
sustainably profitable
Experience from more than 1,900
projects per year at over 800
customers

Agenda
1. Introducing Stream Processing
2. Spark Streaming vs. Kafka Streams – Overview
3. Spark Streaming vs. Kafka Streams – in Action
4. Demo
5. Summary

When to use Stream Processing / When not?
Real-Time: constant low latency, milliseconds and under
Near-Real-Time: low milliseconds to seconds, delay in case of failures
Batch: 10s of seconds or more, re-run in case of failures
Source: adapted from Cloudera

Typical Stream Processing Use Cases
• Notifications and Alerting - a notification or alert should be triggered if
some sort of event or series of events occurs.
• Real-Time Reporting – run real-time dashboards that
employees/customers can look at
• Incremental ETL – still ETL, but not in Batch but in streaming, continuous
mode
• Update data to serve in real-time – compute data that get served
interactively by other applications
• Real-Time decision making – analyzing new inputs and responding to
them automatically using business logic, e.g. fraud detection
• Online Machine Learning – train a model on a combination of historical
and streaming data and use it for real-time decision making

"Data at Rest" vs. "Data in Motion"

Stream Processing & Analytics Ecosystem
Figure labels: Edge, Simple Event Processing, Event Stream Processing, Complex Event Processing; Open Source vs. Closed Source
Source: adapted from Tibco

Demo Use Case
Diagram labels:
• Truck-1, Truck-2, Truck-3
• truck_position
• detect_dangerous_driving
• Truck Driver
• jdbc-source
• join_dangerous_driving_driver
• dangerous_driving_driver
• Count By Event Type, Window (1m, 30s)
• count_by_event_type

Spark Streaming vs. Kafka Streams – Overview

Apache Spark Streaming as part of Spark Stack
• Libraries: Spark (Structured) Streaming, Advanced Analytics, Libraries & Ecosystem
• Structured API: Datasets, Data Frame, SQL
• Low Level API: Resilient Distributed Dataset (RDD), Distributed Variables
• Cluster Resource Managers: Spark Standalone, MESOS, YARN
• Data Stores: HDFS, Elastic Search, NoSQL, S3

Spark Streaming – 1st Generation
• One of the first APIs to enable stream processing using high-level functional operators
like map and reduce
• Like the RDD API, the DStreams API is based on
relatively low-level operations on
Java/Python objects
• Micro-batching
• Used by many organizations in production
• Spark 2.0 added a Structured API with support for DataFrame / Dataset and SQL
tables

Spark Structured Streaming – 2nd Generation
• Stream processing on Structured API
• DataFrames / Datasets rather than RDDs
• Code reuse between batch and streaming
• Potential to increase performance (Catalyst
SQL optimizer and Data Frame optimizations)
• Windowing and handling of late, out-of-order data
are much easier
• Traditional Spark Streaming to be considered
obsolete going forward
• Marked production-ready in Spark 2.2.0
• Support for Java, Scala, Python, R and SQL

Apache Kafka – A Streaming Platform
High-Level Architecture
Distributed Log at the Core
Scale-Out Architecture
Logs do not (necessarily) forget

Kafka Streams - Introduction
• Designed as a simple and lightweight library in Apache Kafka
• No external dependencies on systems
other than Apache Kafka
• Part of open source Apache Kafka,
introduced in version 0.10
• Leverages Kafka as its internal
messaging layer
• Supports fault-tolerant local state
• Continuous processing with millisecond latency
• Windowing with out-of-order data
• Support for Java and SQL (KSQL)

Stream-Table Duality
We can view a table as a stream
We can view a stream as a table
A stream can be considered a
changelog of a table, where each
data record in the stream captures
a state change of the table
A table can be considered a
snapshot of the latest value for
each key in a stream
Source: Confluent
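
The duality can be sketched in plain Java without any Kafka dependency. The class and method names below (StreamTableDuality, toTable, toStream) are hypothetical and only illustrate the idea: replaying a changelog yields the table, and reading the table yields one record per key with its latest value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; names are hypothetical and no
// Kafka Streams API is used.
class StreamTableDuality {

    // Stream -> table: replay the changelog, keeping the latest value per key.
    static Map<String, String> toTable(List<Map.Entry<String, String>> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> record : changelog) {
            table.put(record.getKey(), record.getValue());
        }
        return table;
    }

    // Table -> stream: emit one record per key carrying its current value.
    static List<Map.Entry<String, String>> toStream(Map<String, String> table) {
        List<Map.Entry<String, String>> stream = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            stream.add(Map.entry(row.getKey(), row.getValue()));
        }
        return stream;
    }

    public static void main(String[] args) {
        // Assumed sample data: driver id -> last known route.
        List<Map.Entry<String, String>> changelog = List.of(
            Map.entry("driver-10", "Memphis"),
            Map.entry("driver-11", "Little Rock"),
            Map.entry("driver-10", "Little Rock"));
        Map<String, String> table = toTable(changelog);
        System.out.println("table:  " + table);
        System.out.println("stream: " + toStream(table));
    }
}
```

In this sketch the second record for driver-10 overwrites the first, which is the "state change" view of a stream described above.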

Spark Streaming vs. Kafka Streams – in Action

Concepts – Main Abstractions
Dataset/Data Frame API
• DataFrames and Datasets can represent
static, bounded data, as well as streaming,
unbounded data
• Use readStream() instead of read()
Transformation & Actions
• Almost all transformations from working on
bounded data (Batch) are also usable for
streaming
• Transformations are lazy
• Only action is starting a stream
Input Sources and Sinks
Triggers
• Triggers define when data is output:
• As soon as the previous micro-batch has finished (default)
• Fixed interval between micro-batches
• One-time micro-batch
Output Mode
• Define how data is output
• Append – only add new records to
output
• Update – update changed records in
place
• Complete – rewrite full output
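
The difference between Update and Complete output can be illustrated with a running count in plain Java. This is a rough sketch, not Spark code: OutputModes, processBatch and completeOutput are hypothetical names, and Append mode (rows emitted once, when they are final) is not modelled.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; names are hypothetical, not part of Spark.
class OutputModes {
    // Result table of a running "count by event type" aggregation.
    private final Map<String, Long> counts = new TreeMap<>();

    // Processes one micro-batch and returns only the rows that changed,
    // which corresponds to what Update mode would write to the sink.
    Map<String, Long> processBatch(List<String> eventTypes) {
        Map<String, Long> changed = new TreeMap<>();
        for (String type : eventTypes) {
            long updated = counts.merge(type, 1L, Long::sum);
            changed.put(type, updated);
        }
        return changed;
    }

    // Complete mode would rewrite the whole result table after every batch.
    Map<String, Long> completeOutput() {
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        OutputModes query = new OutputModes();
        query.processBatch(List.of("Normal", "Overspeed"));
        System.out.println("update:   " + query.processBatch(List.of("Normal")));
        System.out.println("complete: " + query.completeOutput());
    }
}
```

After the second batch, the update view contains only the changed row for "Normal", while the complete view still contains both event types.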

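Topology
// schema of the incoming records (fields elided)
val schema = new StructType()
  .add(...)
// input: a streaming DataFrame
val inputDf = spark
  .readStream
  .format(...)
  .option(...)
  .load()
// filter: same transformation API as for batch
val filteredDf = inputDf.where(...)
// output: starting the query is the only action
val query = filteredDf
  .writeStream
  .format(...)
  .option(...)
  .start()
query.stop()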

Concepts – Main Abstractions (Kafka Streams)
Stream Processing Application
• any program that makes use of the Kafka
Streams library
Application Instance
• any running instance or "copy" of your
application
Topology
• defines logic that needs to be performed by
stream processing
• Defined using functional DSL or low-level
Processor API
Stream Processor
• a node in the processor topology
KStream
• Abstraction of a record stream
• Interpreted as events
• partitioned
KTable
• Abstraction of a changelog stream
• Each record is interpreted as an update for the same record key
• partitioned
GlobalKTable
• Like KTable, but not partitioned => all data
is available on all parallel application
instances

Topology
public static void main(String[] args) {
    // configuration (settings elided)
    Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(...);
    final StreamsBuilder builder = new StreamsBuilder();
    // input -> filter -> output
    KStream<..,..> stream = builder.stream(...);
    KStream<..,..> filtered = stream.filter(...);
    filtered.to(...);
    KafkaStreams streams = new KafkaStreams(
        builder.build(), streamsConfiguration);
    streams.start();
}

Streaming Data Sources (Spark Structured Streaming)
• File Source
• Reads files as a stream of data
• Supports text, csv, json, orc, parquet
• Files must be atomically placed
• Kafka Source
• Reads from Kafka Topic
• Supports Kafka broker 0.10.0 or higher
• Socket Source (for testing)
• Reads UTF8 text from socket
connection
• Rate Source (for testing)
• Generates data at a specified number of
rows per second
val rawDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")
  .option("subscribe", "truck_position")
  .load()

Streaming Data Sources (Kafka Streams)
Supports "Kafka only"
• KStream from Topic
• KTable from Topic
Use Kafka Connect for reading
other data sources into Kafka
first
KStream<String, TruckPosition> positions =
    builder.stream("truck_position",
        Consumed.with(Serdes.String(), truckPositionSerde));
KTable<String, Driver> driver =
    builder.table("trucking_driver",
        Consumed.with(Serdes.String(), driverSerde),
        Materialized.as("driver-store"));

Streaming Sinks (Spark Structured Streaming)
• File Sink – stores output to a directory
• Kafka Sink – publishes to Kafka
• Foreach Sink – runs arbitrary computation on the records in the output
• Console Sink – for debugging, prints output to console
• Memory Sink – for debugging, stores output in an in-memory table
val query = jsonTruckPlusDriverDf
  .selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")
  .option("topic", "dangerous_driving")
  .option("checkpointLocation", "/tmp")
  .start()

Streaming Sinks (Kafka Streams)
Supports "Kafka only"
Use Kafka Connect for
writing out to other targets
KStream<String, TruckPosition> posDriver = ..
posDriver.to("dangerous_driving",
    Produced.with(Serdes.String(), truckPositionDriverSerde));
For testing only:
// print to system output
posDriver.print(Printed.toSysOut());
// shortcut for
posDriver.foreach((key, value) ->
    System.out.println(key + "=" + value));

Stateless Operations – Selection & Projection
Spark Structured Streaming
• Most common operations on DataFrame/Dataset are supported for
streaming as well
• select, filter, map, flatMap, …
val filteredDf =
  truckPosDf.where("eventType != 'Normal'")
Kafka Streams
• KStream and KTable interfaces support a
variety of transformation operations
• filter, filterNot, map, mapValues,
flatMap, flatMapValues, branch,
selectKey, groupByKey …
KStream<String, TruckPosition> filtered =
    positions.filter((key, value) ->
        !value.eventType.equals("Normal"));

Stateful Operations – Aggregations
Spark Structured Streaming
• State is held in distributed memory with the option to
spill to disk (fault tolerant through
checkpointing to a Hadoop-like FS)
• Output modes: Complete, Append,
Update
• count, sum, mapGroupsWithState,
flatMapGroupsWithState, reduce ...
val c = source
  .withWatermark("timestamp", "10 minutes")
  .groupBy()
  .count()
Kafka Streams
• Requires a state store, which can be in-memory,
RocksDB or a custom implementation (fault
tolerant through Kafka topics)
• Result of an aggregation is a KTable
• count, reduce, aggregate (e.g. for sum, avg)
...
KTable<..> c = stream
    .groupByKey(..)
    .count(...);

Stateful Operations – Time Abstraction
Figure labels: Clock, Event Time, Ingestion Time, Processing Time (events numbered 1 to 5)
adapted from Matthias Niehoff (codecentric)

Stateful Operations – Time Abstraction
Spark Structured Streaming
Event Time
• New with Spark Structured Streaming
• Extracted from the message (payload)
Processing Time
• Spark Streaming only supported processing
time
• Generates the timestamp upon processing
Ingestion Time
• Only for sources which capture the
ingestion time
// processing time: add a timestamp column when the record is processed
df.withColumn("processingTime", current_timestamp())
// ingestion time: source option, for sources that capture it
.option("includeTimestamp", true)
Kafka Streams
Event Time
• Point in time when the event occurred
• Extracted from the message (payload or
header)
Processing Time
• Point in time when the event happens to be
processed by the stream processing application
Ingestion Time
• Point in time when the event is stored in Kafka
(sent in message header)

Stateful Operations - Windowing
Because a stream is large and never-ending, it is
not feasible to keep the entire stream of data in memory
Computations over events are done using windows of data
• Fixed Window (aka Tumbling Window) - eviction policy is always based on the
window being full and the trigger policy is based on either the count of items in the
window or time
• Hopping Window (aka Sliding Window) - uses eviction and trigger policies that are
based on time: window length and sliding interval length
• Session Window – sessions are composed of sequences of temporally related
events terminated by a gap of inactivity greater than some timeout
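
How an event is assigned to time-based tumbling or hopping windows can be sketched in plain Java. This is a minimal illustration under assumptions: windows are aligned to timestamp 0, timestamps are non-negative milliseconds, and WindowAssignment and windowStarts are hypothetical names, not an API of Spark or Kafka Streams.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; names are hypothetical.
class WindowAssignment {

    // Returns the start timestamps (ms) of all windows [start, start + sizeMs)
    // that contain timestampMs. With advanceMs == sizeMs the windows tumble
    // and exactly one start is returned; with advanceMs < sizeMs they hop
    // and overlap, so an event can fall into several windows.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the event.
        long lastStart = timestampMs - (timestampMs % advanceMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= advanceMs) {
            if (start >= 0) {
                starts.add(0, start); // keep ascending order
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        long event = 70_000L; // 70 seconds after the assumed origin
        System.out.println("tumbling 1m:     " + windowStarts(event, 60_000L, 60_000L));
        System.out.println("hopping 1m/30s:  " + windowStarts(event, 60_000L, 30_000L));
    }
}
```

With a 1 minute window advancing every 30 seconds, the event at 70 s belongs to the windows starting at 30 s and 60 s; with a tumbling 1 minute window it belongs only to the window starting at 60 s.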

Spark Structured Streaming
• Support for Tumbling & Hopping (Sliding)
Time Windows
• Handling Late Data with
Watermarking
val c = source
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "1 minute", "30 seconds"),
    $"word")
  .count()
Figure (event time vs. processing time): the watermark trails the max event time by a gap of 10 mins; data older than the watermark is not expected and gets discarded. Time labels shown: 12:10, 12:20, 12:25.
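
The watermark idea (max event time seen so far minus a trailing gap) can be sketched in plain Java. This is a simplification under assumptions: the watermark here moves with every event, whereas Spark advances it between micro-batches, and WatermarkTracker with its methods is a hypothetical name, not Spark API.

```java
// Plain-Java illustration only; names are hypothetical.
class WatermarkTracker {
    private final long trailingGapMs;
    private long maxEventTimeMs;
    private boolean seenAny = false;

    WatermarkTracker(long trailingGapMs) {
        this.trailingGapMs = trailingGapMs;
    }

    // Watermark = max event time seen so far minus the trailing gap.
    // Only meaningful after at least one event has been accepted.
    long watermarkMs() {
        return maxEventTimeMs - trailingGapMs;
    }

    // Returns true if the event is accepted, false if it is older than
    // the current watermark and would therefore be discarded.
    boolean accept(long eventTimeMs) {
        if (seenAny && eventTimeMs < watermarkMs()) {
            return false;
        }
        maxEventTimeMs = seenAny ? Math.max(maxEventTimeMs, eventTimeMs) : eventTimeMs;
        seenAny = true;
        return true;
    }

    // Helper for readable sample data: hours and minutes to milliseconds.
    static long at(int hour, int minute) {
        return (hour * 60L + minute) * 60_000L;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(10 * 60_000L); // 10 minutes
        System.out.println(tracker.accept(at(12, 20))); // first event
        System.out.println(tracker.accept(at(12, 15))); // late, but within the gap
        System.out.println(tracker.accept(at(12, 5)));  // older than the watermark
    }
}
```

The second event is late but still inside the 10 minute gap, so it is kept; the third is older than the watermark and is dropped.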

Kafka Streams
• Support for Tumbling & Hopping Windows
• Supports Session Windows
• Handling Late Data with Data
Retention (optional)
// hopping window: size 1 minute, advance 30 seconds, retained for 10 minutes
stream.groupByKey(..)
    .windowedBy(
        TimeWindows.of(60 * 1000)
            .advanceBy(30 * 1000)
            .until(10 * 60 * 1000))
    .count(...);
// session window: inactivity gap of 5 minutes
stream.groupByKey(...)
    .windowedBy(
        SessionWindows.with(5 * 60 * 1000))
    .count();
Figure (event time vs. processing time): the data retention trails the max event time by a gap of 10 mins; older data is not expected and gets discarded. Time labels shown: 12:10, 12:20, 12:25.
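
Session windows differ from time windows in that their boundaries depend on the data. A rough plain-Java sketch of the grouping rule, assuming the timestamps of one key arrive sorted in ascending order (SessionGrouping and sessions are hypothetical names, not Kafka Streams API):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; names are hypothetical.
class SessionGrouping {

    // Groups ascending timestamps (ms) into sessions. A new session starts
    // when the gap to the previous event is greater than inactivityGapMs.
    // Each result element is {sessionStart, sessionEnd}.
    static List<long[]> sessions(long[] sortedTimestampsMs, long inactivityGapMs) {
        List<long[]> result = new ArrayList<>();
        for (long ts : sortedTimestampsMs) {
            long[] current = result.isEmpty() ? null : result.get(result.size() - 1);
            if (current != null && ts - current[1] <= inactivityGapMs) {
                current[1] = ts; // extend the running session
            } else {
                result.add(new long[] {ts, ts}); // open a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] events = {0L, 60_000L, 120_000L, 600_000L, 660_000L};
        for (long[] s : sessions(events, 5 * 60_000L)) {
            System.out.println("session from " + s[0] + " to " + s[1]);
        }
    }
}
```

With a 5 minute inactivity gap, the first three events form one session and the last two form a second one, because 8 minutes pass between them.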

Stateful Operations - Joins (Spark Structured Streaming)
• Joining streaming-to-static and
streaming-to-streaming (since 2.3)
Dataset/DataFrame
• Watermarking helps Spark to know for
how long to retain data
• Optional for Inner Joins
• Mandatory for Outer Joins
• Join types: Inner, Left Outer, Right
Outer and Full Outer; which ones are supported depends on the
combination of streaming and static inputs
val jsonTruckPlusDriverDf =
  jsonFilteredDf.join(driverDf, Seq("driverId"), "left")
Source: Spark Documentation

Stateful Operations - Joins (Kafka Streams)
Supports the following joins
• KStream-to-KStream
• KTable-to-KTable
• KStream-to-KTable
• KStream-to-GlobalKTable
• KTable-to-GlobalKTable (listed as not supported in the Kafka Streams documentation)
KStream<String, TruckPositionDriver> joined =
    filteredRekeyed.leftJoin(driver,
        // right is null when no driver matches (left join)
        (left, right) -> new TruckPositionDriver(left,
            right == null ? "" : StringUtils.defaultIfEmpty(right.first_name, ""),
            right == null ? "" : StringUtils.defaultIfEmpty(right.last_name, "")),
        Joined.with(Serdes.String(), truckPositionSerde, driverSerde));
Source: Confluent Documentation
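
The semantics of a stream-to-table left join can be sketched in plain Java: every stream record produces an output, and the table side is simply missing (null) when the key has no match. This is a simplified illustration with a fixed table snapshot; StreamTableLeftJoin and leftJoin are hypothetical names, and the sample keys and values are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; names and data are hypothetical.
class StreamTableLeftJoin {

    // For each stream record (key, event) look up the key in the table.
    // A left join keeps records without a match and leaves the table side empty.
    static List<String> leftJoin(List<Map.Entry<String, String>> stream,
                                 Map<String, String> table) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> record : stream) {
            String driverName = table.get(record.getKey()); // null when no match
            joined.add(record.getValue() + "|" + (driverName != null ? driverName : ""));
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> drivers = Map.of("10", "Ann");
        List<Map.Entry<String, String>> events = List.of(
            Map.entry("10", "Overspeed"),
            Map.entry("11", "Lane Departure"));
        System.out.println(leftJoin(events, drivers));
    }
}
```

The record with key "11" has no driver in the table but still appears in the output, which is why the value joiner of a left join has to handle a missing right side.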

Streaming SQL with KSQL
Enables stream processing with zero
coding required
The simplest way to process streams of
data in real-time
Powered by Kafka Streams
available as Developer preview!
STREAM and TABLE as first-class
citizens
STREAM = data in motion
TABLE = collected state of a stream
join STREAM and TABLE
ksql> CREATE STREAM truck_position_s
        (timestamp BIGINT,
         truckId BIGINT,
         driverId BIGINT,
         routeId BIGINT,
         eventType VARCHAR,
         latitude DOUBLE,
         longitude DOUBLE,
         correlationid VARCHAR)
      WITH (kafka_topic='truck_position',
            value_format='JSON');
ksql> SELECT * FROM truck_position_s;
1506922133306 | "truck/13/position0 |
2017-10-02T07:28:53 | 31 | 13 | 371182829
| Memphis to Little Rock | Normal | 41.76 |
-89.6 | -2084263951914664106

There is more …
Spark Structured Streaming
• Streaming Deduplication
• Run-Once Trigger / fixed Interval
Micro-Batching
• Continuous Trigger with fixed
checkpoint interval (experimental in
2.3)
• Streaming Machine Learning
• REPL
Kafka Streams
• KSQL
• Queryable State
• Processor API
• At-least Once vs. Exactly Once
• Microservices with Kafka Streams
• Scale-up / Scale-Down
• Stand-by replica of local state

Demo Use Case
Diagram labels:
• Truck-1, Truck-2, Truck-3
• truck_position
• detect_dangerous_driving
• Truck Driver
• jdbc-source
• trucking_driver
• join_dangerous_driving_driver
• dangerous_driving_driver
• Count By Event Type, Window (1m, 30s)
• count_by_event_type

Spark Structured Streaming vs. Kafka Streams
Spark Structured Streaming
• Runs on top of a Spark cluster
• Reuse your investments into Spark
(knowledge and maybe code)
• An HDFS-like file system needs to be
available
• Higher latency due to micro-batching
• Multi-language support: Java, Python,
Scala, R
• Supports ad-hoc, notebook-style
development/environment
Kafka Streams
• Available as a Java library
• Can be the implementation choice of a
microservice
• Can only work with Kafka for both input and
output
• Low latency due to continuous processing
• Currently only supports Java; Scala support
available soon
• KSQL abstraction provides SQL on top of
Kafka Streams

Comparison
Feature | Kafka Streams | Spark Streaming | Spark Structured Streaming
Language Options | Java (KIP for Scala), KSQL | Scala, Java, Python, R, SQL | Scala, Java, Python, R, SQL
Processing Model | Continuous Streaming | Micro-Batching | Micro-Batching / Continuous Streaming (experimental)
Core Abstraction | KStream / KTable | DStream (RDD) | Data Frame / Dataset
Programming Model | Declarative/Imperative | Declarative | Declarative
Time Support | Event / Ingestion / Processing | Processing | Event / Processing
State Support | Memory / RocksDB + Kafka | Memory / Disk | Memory / Disk
Join | Stream-Stream, Stream-Static | Stream-Static | Stream-Static, Stream-Stream (2.3)
Event Pattern detection | No | No | No
Queryable State | Interactive Queries | No | No
Scalability & Reliability | Yes | Yes | Yes
Guarantees | At Least Once/Exactly Once | At Least Once/Exactly Once (partial) | At Least Once/Exactly Once (partial)
Latency | Sub-second | Seconds | Sub-second
Deployment | Java Library | Cluster (HDFS-like FS needed for resiliency) | Cluster (HDFS-like FS needed for resiliency)

Technology on its own won't help you.
You need to know how to use it properly.
