Design and
Implementation of Spark
Streaming Connectors
ARIJIT TARAFDAR
NAN ZHU
MARCH 23, 2017
About Us
Arijit Tarafdar
Software Engineer@Azure HDInsight. Work on Spark Streaming/Structured Streaming
service in Azure. Committee Member of XGBoost@DMLC and Apache MxNet (incubator).
Spark Contributor. Known as CodingCat on GitHub.
Nan Zhu
Software Engineer@Azure HDInsight. Work on Spark/Spark Streaming on Azure.
Previously worked with other distributed platforms like DryadLINQ and MPI. Also
worked on graph coloring algorithms which were contributed to ADOL-C
(https://projects.coin-or.org/ADOL-C).
Real Time Data Analytics Results
Processing Engine
Continuous Data Source Control Manager
Continuous Data Source API
Persistent Data Storage Layer
Spark Streaming, Structured Streaming
Deliver real time data to Spark at scale
Real time view of data (message queue
or files filtered by timestamp)
Blobs/Queues/Tables/Files
Continuous Application Architecture and Role of Spark Connectors
Not only is the size of data increasing, but also its velocity
◦ Sensors, IoT devices, social networks and online transactions are all generating
data that needs to be monitored constantly and acted upon quickly.
Outline
•Recap of Spark Streaming
•Introduction to Event Hubs
•Connecting Azure Event Hubs and Spark Streaming
•Design Considerations for Spark Streaming Connector
•Contributions Back to Community
•Future Work
Spark Streaming - Background
Task 1
Task 2
Task L
RDD 1 @ t RDD 1 @ t-1 RDD 1 @ 0
Stream 1
Task 1
Task 2
Task M
RDD N @ t RDD N @ t-1 RDD N @ 0
Stream N
Micro Batch @ t
Task 1
Task 2
Task L
Task 1
Task 2
Task M
Window Duration
Batch Duration
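For reference, the batch and window durations in the figure map directly onto Spark's DStream API. A minimal sketch (the 10-second batch, 30-second window, and socket source are arbitrary choices for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCountSketch {
  def main(args: Array[String]): Unit = {
    // Local master only for illustration; use your cluster settings in practice.
    val conf = new SparkConf().setAppName("windowed-count").setMaster("local[2]")
    // Batch duration: one RDD is generated per stream every 10 seconds.
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")

    // Any input DStream works here; socketTextStream is just the simplest one.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Window duration (30s) is a multiple of the batch duration (10s):
    // each windowed RDD covers the last three micro batches.
    val counts = lines.countByWindow(Seconds(30), Seconds(10))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}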
Azure Event Hubs - Introduction
Partition 1
Partition 2
Partition J
Event Hubs 1
Partition 1
Partition 2
Partition K
Event Hubs L
Event Hubs Namespace 1
Partition 1
Partition 2
Partition K
Event Hubs 1
Partition 1
Partition 2
Partition P
Event Hubs N
Event Hubs Namespace M
Azure Event Hubs - Introduction
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs
Data Flow – Event Hubs
• Proactive message delivery
• Efficient in terms of communication cost
• Data source treated as commit log of events
• Events are read in batches per receive() call
New Old
Event Hubs Partition
(Event Hubs Server)
Prefetch Queue
(Event Hubs Client)
Streaming
Application
Event Hubs – Offset Management
• Event Hubs expects offset management to be performed on the receiver side
• Spark Streaming uses a DFS-based persistent store (HDFS, ADLS, etc.)
• Offsets are stored per consumer group, per partition, per event hub, per Event Hubs namespace
/* An interface to read/write offset for a given Event Hubs
namespace/name/partition */
@SerialVersionUID(1L)
trait OffsetStore extends Serializable {
def open(): Unit
def write(offset: String): Unit
def read() : String
def close(): Unit
}
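A minimal sketch of what a DFS-backed implementation of this trait could look like, assuming Hadoop's FileSystem API; the path layout and the "-1" start-of-stream marker are illustrative, not the connector's actual scheme:

import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class DfsOffsetStore(checkpointDir: String, key: String) extends OffsetStore {
  // Illustrative layout: one file per namespace/eventhub/partition/consumer-group.
  private val offsetFile = s"$checkpointDir/offsets/$key"
  @transient private var fs: FileSystem = _

  override def open(): Unit = {
    fs = new Path(offsetFile).getFileSystem(new Configuration())
  }

  override def write(offset: String): Unit = {
    // Overwrite the previously saved offset for this partition.
    val out = fs.create(new Path(offsetFile), true)
    try out.write(offset.getBytes(StandardCharsets.UTF_8)) finally out.close()
  }

  override def read(): String = {
    val path = new Path(offsetFile)
    if (!fs.exists(path)) "-1" // illustrative start-of-stream marker
    else {
      val in = fs.open(path)
      try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
    }
  }

  override def close(): Unit = () // FileSystem instances are cached by Hadoop
}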
First Version: Receiver-based
Spark Streaming Connector for
Azure Event Hub
Application
Driver (Spark)
Receiver
Executor (Spark)
Streaming
Context
Spark
Context
Eventhubs
Receiver
Task Executor
(Spark)
User Defined
Functions
ADLS
ADLS
Write Ahead Log (WAL)
Checkpoint Directory
Memory
Block
Data
Block
Metadata
Jobs
Tasks
Checkpoint
Data
Azure
Eventhubs
Input
Stream
ADLS
WASB
Output Storage
Fault Tolerance – Spark Receiver Based Event Hubs Connector
Restarted
Application
Driver (Spark)
Restarted Receiver
Executor (Spark)
Restarted
Streaming
Context
Restarted
Spark
Context
Restarted
Eventhubs
Receiver
Restarted Task
Executor (Spark)
User Defined
Functions
ADLS
ADLS
Write Ahead Log (WAL)
Checkpoint Directory
Memory
Recover Block
Data
Recover Block
Metadata
Jobs
Tasks
Restart
Computation
From
Checkpoint
Data Azure
Eventhubs
ADLS
WASB
Output Storage
Spark Streaming – Recovery After Failure
Event Hubs Receiver – Class Signature
private[eventhubs] class EventHubsReceiver(
eventhubsParams: Map[String, String],
partitionId: String,
storageLevel: StorageLevel,
offsetStore: Option[OffsetStore],
receiverClient: EventHubsClientWrapper,
maximumEventRate: Int) extends
Receiver[Array[Byte]](storageLevel) with
Logging { ... }
Event Hubs Receiver – Methods Used/Implemented
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
def onStart(): Unit
def onStop(): Unit
def store(dataItem: T) {
supervisor.pushSingle(dataItem)
}
def store(dataBuffer: ArrayBuffer[T]) {
supervisor.pushArrayBuffer(dataBuffer, None, None)
}
def restart(message: String, error: Throwable) {
supervisor.restartReceiver(message, Some(error))
}
}
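For context, a simplified sketch of how a receiver built on this API pumps events into Spark. client.receive(...), client.lastOffset, and client.close() are hypothetical stand-ins for the actual EventHubsClientWrapper methods, and error handling/restart is omitted:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Simplified sketch only; not the connector's actual EventHubsReceiver.
class EventHubsReceiverSketch(
    client: EventHubsClientWrapper,
    offsetStore: OffsetStore,
    storageLevel: StorageLevel)
  extends Receiver[Array[Byte]](storageLevel) {

  @volatile private var stopped = false

  override def onStart(): Unit = {
    stopped = false
    new Thread("eventhubs-receiver-thread") {
      override def run(): Unit = {
        offsetStore.open()
        while (!stopped) {
          // Drain a batch of events from the client's prefetch queue ...
          val events: Seq[Array[Byte]] = client.receive(999) // hypothetical call
          events.foreach(store) // ... hand them to Spark's BlockManager ...
          // ... and persist the last offset so a restarted receiver resumes here.
          offsetStore.write(client.lastOffset) // hypothetical accessor
        }
        offsetStore.close()
      }
    }.start()
  }

  override def onStop(): Unit = {
    stopped = true
    client.close()
  }
}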
Why a receiver-based connector?
Azure Event Hubs/Spark Expectations → Receiver-based Connection:
• (Event Hubs) Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks (Spark)
• (Event Hubs) Logging Data Before Ack → WAL / Spark Checkpoint
• (Event Hubs) Client-side Offset Management → Offset Store
A Natural Fit!!!
Lessons learnt from the receiver-based connector?
Requirements in Event Hubs → Receiver-based Connection → Problems:
• Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks → Extra Resource Requirements
Lessons learnt from the receiver-based connector?
Requirements in Event Hubs → Receiver-based Connection → Problems:
• Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks → Extra Resource Requirements
• Logging Data Before Ack → WAL / Spark Checkpoint → Performance/data loss due to Spark bug; no easy recovery from code update
https://issues.apache.org/jira/browse/SPARK-18957
Lessons learnt from the receiver-based connector?
Requirements in Event Hubs → Receiver-based Connection → Problems:
• Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks → Extra Resource Requirements
• Logging Data Before Ack → WAL / Spark Checkpoint → Performance/data loss due to Spark bug; no easy recovery from code update
• Client-side Offset Management → Offset Store → Looks fine….
Bridging Spark Streaming and
Event Hubs WITHOUT Receiver
How the Idea Extends to Other
Data Sources (in Azure & Your IT
Infrastructure)?
Extra Resources
Requirements in Event Hubs → Receiver-based Connection → Problems:
• Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks → Extra Resource Requirements
• Fault Tolerance Mechanism → WAL / Spark Checkpoint → Perf./Data Loss due to Spark Bug / No Recovery from Code Update
• Client-side Offset Management → Offset Store → Looks fine….
From Event Hubs to General
Data Sources (1)
•Communication Pattern
• Azure Event Hubs: Long-Running Receiver, Proactive Data Delivery
• Kafka: Receivers can be started and shut down freely, Passive Data
Delivery
Most Critical Factor in Designing a Resource-Efficient
Spark Streaming Connector!
Tackling Extra Resource
Requirement
Azure EventHubs
EvH-
Namespace-1
EventHub-1
P1
PN
.
.
.
Reduce Resource Requirements:
Customized Receiver
Logic
User-Defined
Lambdas
EventHubsRDD
.map()
MapPartitionsRDD
Spark Tasks
Compact Data Receiving and Processing in the same Task
Inspired by Kafka
Direct DStream!
Being More Challenging with a
Different Communication Pattern!
Bridging Spark Execution Model and
Communication Pattern Expectation
Azure EventHubs
EvH-
Namespace-1
EventHub-1
P1
PN
.
.
.
Customized Receiver
Logic
User-Defined
Lambdas
EventHubsRDD
.map()
MapPartitionsRDD
Spark Task
Passive
Message
Delivery Layer
Recv(expectedMsgNum:
Int) – Blocking API
Long-running/Proactive Receiver expected by Event Hubs
vs.
Transient Tasks started for each Batch by Spark
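A rough sketch of the receiver-less alternative: each task creates a short-lived client, blocks on a receive call for exactly the number of messages planned for the batch, and closes the client when the partition is done. EventHubsClient, its receive() call, and OffsetRange here are illustrative placeholders, not the connector's actual types:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative placeholder for the per-batch plan of one partition.
case class OffsetRange(partitionId: Int, fromOffset: String, msgCount: Int)

class EventHubsRDDSketch(sc: SparkContext, ranges: Array[OffsetRange])
  extends RDD[Array[Byte]](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    ranges.indices.map(i => new Partition { override def index: Int = i }).toArray

  override def compute(split: Partition, ctx: TaskContext): Iterator[Array[Byte]] = {
    val range = ranges(split.index)
    // Transient, per-batch client instead of a long-running receiver.
    val client = EventHubsClient.connect(range.partitionId, range.fromOffset) // hypothetical
    ctx.addTaskCompletionListener { _ => client.close() }
    // Blocking receive: return exactly the messages planned for this micro batch,
    // so receiving and processing happen in the same task.
    client.receive(expectedMsgNum = range.msgCount).iterator
  }
}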
Takeaways (1)
Requirements in Event Hubs → Receiver-based Connection → Problems → Solution:
• Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks → Extra Resource Requirements → Compact data receiving/processing, facilitated by passive message delivery
The Communication Pattern in Data Sources Plays the Key Role in the Resource-Efficient Design of a Spark Streaming Connector
Next Problem…
Fault Tolerance
Requirements in Event Hubs → Receiver-based Connection → Problems:
• Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks → Extra Resource Requirements
• Fault Tolerance Mechanism → WAL / Spark Checkpoint → Perf./Data Loss due to Spark Bug / No Recovery from Code Update
• Client-side Offset Management → Offset Store → Looks fine….
From Event Hubs to General
Data Sources (2)
•Fault-Tolerance
• Capability
• Guarantee graceful recovery (no data loss, recover from where
you stopped, etc.) when the application stops for various reasons
• Efficiency
• Minimal impact on application performance and user
deployment
…RDD L-t RDD L-(t-1) RDD L-0 Stream L
Unexpected Application Stop
Checkpoint Time
RDD L-(t-1) RDD L-t
Recovery
From Checkpoint, or Re-evaluated
Capability – Recover from
unexpected stop
…RDD L-(t-1) RDD L-0 Stream L
Application Upgrade
…
Application Stop
Spark Checkpoint Mechanism Serializes
Everything and does not recognize a re-compiled
class
Capability – Recover from
planned stop
RDD L-(2t)
Resume Application
with updated
Implementation
Fetch the latest offset
Offset Store
Your Connector shall maintain this!!!
Efficiency - What Should Be
Contained in the Checkpoint Files?
• Checkpointing takes your computing resources!!!
• Received Event Data
• too large
• The range of messages to be processed in each batch
• Small enough to quickly persist data
Azure EventHubs
EvH-
Namespace-1
EventHub-1
P1
PN
.
.
.
EventHubsRDD
.map()
MapPartitionsRDD
Passive
Message
Delivery Layer
Recv(expectedMsgNum:
Int) – Blocking API
Persist this mapping relationship, i.e. using EventHubs itself as data backup
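Concretely, the checkpoint then only needs to carry per-batch, per-partition metadata along these lines (a hypothetical shape, not the connector's actual classes):

// A few hundred bytes of metadata per partition per batch; the event payloads
// themselves stay in Event Hubs and can be re-fetched on recovery.
case class EventHubNameAndPartition(eventHubName: String, partitionId: Int)

case class OffsetRecord(
    batchTime: Long,       // which micro batch this range belongs to
    startOffset: String,   // byte offset of the first event to read
    startSeqNumber: Long,  // sequence number of the first event
    messageCount: Int)     // how many events the batch will consume

object CheckpointedRanges {
  // batchTime -> (partition -> range)
  type OffsetRanges = Map[Long, Map[EventHubNameAndPartition, OffsetRecord]]
}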
Efficiency - Checkpoint
Cleanup
•Connectors for data sources requiring client-side offset
management generate data/files for each batch
• You have to clean up SAFELY
• Keep recovery feasible
• Coordinate with Spark’s checkpoint process
• Override clearCheckpointData() in EventHubsDStream (our
implementation of DStream)
• Triggered by Batch Completion
• Delete all offset records out of the remembering window
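A simplified, standalone sketch of that cleanup; in the real connector it lives in the clearCheckpointData() override, and the progress-directory layout and "progress-<batchTimeMs>" file naming used here are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.streaming.{Duration, Time}

class OffsetCleanupSketch(progressDir: String, rememberDuration: Duration) {

  def cleanup(batchTime: Time): Unit = {
    val threshold = batchTime - rememberDuration      // edge of the remembering window
    val dir = new Path(progressDir)
    val fs = dir.getFileSystem(new Configuration())
    fs.listStatus(dir).foreach { status =>
      val name = status.getPath.getName
      if (name.startsWith("progress-")) {
        val batchMs = name.stripPrefix("progress-").toLong
        // Delete only offset records that fall outside the remembering window,
        // so recovery of recent batches from the checkpoint stays possible.
        if (batchMs < threshold.milliseconds) fs.delete(status.getPath, true)
      }
    }
  }
}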
Takeaways (2)
Requirements in Event Hubs → Receiver-based Connection → Problems → Solution:
• Fault Tolerance Mechanism → WAL / Spark Checkpoint → Perf./Data Loss due to Spark Bug / No Recovery from Code Update → Checkpoint the mapping relationship instead of the data; self-managed Offset Store; coordinated checkpoint cleanup
Fault Tolerance Design is about Interaction with the Spark Streaming Checkpoint
No Problem any more?
Offset Management
Requirements in Event Hubs → Receiver-based Connection → Problems:
• Long-Running Receiver / Proactive Message Fetching → Long-Running Receiver Tasks → Extra Resource Requirements
• Fault Tolerance Mechanism → WAL / Spark Checkpoint → Data Loss due to Spark Bug
• Client-side Offset Management → Offset Store → Looks fine…. Is it really fine???
From Event Hubs to General
Data Sources (3)
•Message Addressing
•Rate Control
Message Addressing
• Why Message Addressing?
• When creating a client instance of the data source in a Spark task, where should it start receiving?
• Without this info, you have to replay the stream for every newly created client
Data
Source
Client
Start from the first msg
Data
Source
Client
Start from where?
• Design Options:
• Xth message (X: 0, 1, 2, 3, 4….)
• Needs server-side metadata to map the message ID to the offset in the storage system
• Actual offset
• Simpler server-side design
Fault
Or
Next Batch
Rate Control
• Why Rate Control
• Prevent messages from flooding the processing pipeline
• e.g. when you have just started processing a backlogged data source
• Design Options
• Number of messages: I want to consume 1000 messages in the next batch
• Assumes homogeneous processing overhead per message
• Size of messages: I want to receive at most 1000 bytes in the next batch
• Complicated server-side logic -> must track the delivered size
• "Larger messages take longer to process" is not always true
Data
Source
Client
Start from the first msg
Data
Source
Client
A Long Stop!!!
Consume all messages at once? May crash your processing engine!!!
Kafka's Choice
• Message Addressing:
• Xth message: 0, 1, 2, 3, 4, ..
• Rate Control
• Number of Messages: 0, 1, 2, 3, 4, …
Driver
Executor
Executor
Kafka
Message Addressing and Rate Control:
Batch 0: How many messages are to be
processed in next batch, and where to start? 0
- 999
Batch 1: How many messages are to be processed
in next batch, and where to start? 1000 - 1999
Azure Event Hubs’ Choice
• Message Addressing:
• Offset of messages: 0, size of msg 0, size of (msg 0 + msg 1),…
• Rate Control
• Number of Messages: 0, 1, 2, 3, 4, …
This brings a totally different connector
design/implementation!!!
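Sketched below: with Kafka's sequential message numbers the driver can plan the next range arithmetically, while with Event Hubs the byte offset that starts the next batch is only learned on the executor side from the last received event. The range classes here are illustrative only:

object AddressingSketch {
  // Kafka: addressing and rate control share the same unit (the message number),
  // so the driver alone can derive batch N+1 from batch N.
  case class KafkaRange(fromMessage: Long, untilMessage: Long)

  def nextKafkaRange(prev: KafkaRange, messagesPerBatch: Long): KafkaRange =
    KafkaRange(prev.untilMessage, prev.untilMessage + messagesPerBatch)

  // Event Hubs: rate control is in messages, but addressing is in byte offsets.
  // The start offset of the next batch is only known after the executor has seen
  // the metadata of the last event received in the previous batch.
  case class EventHubsRange(fromOffset: String, fromSeq: Long, messageCount: Int)

  def nextEventHubsRange(lastReceivedOffset: String,
                         lastReceivedSeq: Long,
                         messagesPerBatch: Int): EventHubsRange =
    EventHubsRange(lastReceivedOffset, lastReceivedSeq + 1, messagesPerBatch)
}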
Distributed Information for Rate
Control and Message Addressing
Driver
Executor
Executor
Rate Control:
Batch 0: How many messages are to
be processed in next batch, and
where to start? 0 - 999
Azure EventHubs
Batch 1: How many messages are to be
processed in next batch, and where to
start? 1000 - 1999
What’s the offset of the 1000th message???
The answer appears on the Executor side (when the task receives the message, as part of the message metadata)
Build a Channel to Pass
Information from Executor to
Driver!!!
HDFS-based Channel
Implementation
LastOffset_P1_Batch_i
LastOffset_PN_Batch_i
EventHubsRDD
Tasks
.map()
MapPartitionsRDD
Tasks
What’s the next step??? Simply let
Driver-side logic read the files?
•APIs like RDD.take(x) evaluate only some of the partitions… Batch 0
generates 3 files, and Batch 1 generates 5 files…
•You have to merge the latest files with the historical results, commit
them, and then direct the driver-side logic to read
No!!!
HDFS-based Channel
Implementation
LastOffset_P1_Batch_i
LastOffset_PN_Batch_i
EventHubsRDD
Tasks
.map()
MapPartitionsRDD
Tasks
•APIs like RDD.take(x) evaluate only some of the partitions… Batch 0
generates 3 files, and Batch 1 generates 5 files…
•You have to merge the latest files with the historical results and
commit ...
•Ensure that all streams’ offsets are committed transactionally
•Discard the partially merged/committed results to rerun the batch
HDFS-based Channel
Implementation
RDD Generation “Thread” Job Execution “Thread”
Generate RDD
Blocking
(wait)
Processing
RDD
BatchComplete
Event
SparkListenerBus
CustomizedListener
CommitOffsets
and Notify
HDFS-based Channel
Implementation
RDD Generation “Thread” Job Execution “Thread”
Generate RDD
Blocking
(wait)
Processing
Micro Batch
BatchComplete
Event
SparkListenerBus
CustomizedListener
CommitOffsets
and Notify
DStream.graph: DStreamGraph
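A compressed sketch of that listener, built on Spark's StreamingListener API; ProgressTracker.commit(...) is a stand-in for the connector's merge-and-commit logic described above:

import org.apache.spark.streaming.Time
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Placeholder for the connector's logic that merges the per-partition offset
// files of a batch and commits them atomically.
trait ProgressTracker { def commit(batchTime: Time): Unit }

class OffsetCommittingListener(tracker: ProgressTracker) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = synchronized {
    tracker.commit(batch.batchInfo.batchTime) // merge + commit this batch's offsets
    notifyAll()                               // wake the blocked RDD-generation thread
  }
}

// Driver side (sketch): register with ssc.addStreamingListener(...); the
// RDD-generation thread blocks in listener.synchronized { listener.wait() }
// until the previous batch's offsets are committed.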
Takeaways (3)
• There are multiple options on the Server-side design for Message
Addressing and Rate Control
• To design and implement a Spark Streaming connector, you have to
understand which options are adopted on the server side
The key is the combination!!!
Contribute Back to
Community
Failed recovery from checkpoint caused by a multi-threading issue in the
Spark Streaming scheduler
https://issues.apache.org/jira/browse/SPARK-19280
One Realistic Example of its Impact: You are potentially getting wrong
data when you use Kafka and reduceByWindow and recover from a
failure
Data loss caused by improper post-batch-completed processing
https://issues.apache.org/jira/browse/SPARK-18905
Inconsistent Behavior of Spark Streaming Checkpoint
https://issues.apache.org/jira/browse/SPARK-19233
Summary
• Spark Streaming Connector for Azure Event Hubs enables the user to perform
various types of analytics over streaming data from a fully managed, cloud-scale
message telemetry ingestion service
• https://github.com/hdinsight/spark-eventhubs
• Design and Implementation of Spark Streaming Connectors
• Coordinate Execution Model and Communication Pattern
• Fault Tolerance (Spark Streaming checkpoint vs. self-managed fault-tolerance facilities)
• Message Addressing and Rate Control (Server & Connector Co-Design)
• Contributing Back to the Community
• Microsoft is the organization with the most open source contributors in 2016!!!
• http://www.businessinsider.com/microsoft-github-open-source-2016-9
If you do not want to handle
this complexity
Move to Azure HDInsight…
Future Work
Structured Streaming integration with Event Hubs (to be released at the end of the month)
Streaming Data Visualization with PowerBI (alpha release)
Streaming ETL Solutions on Azure HDInsight!
Thank You!!!
Build a Powerful & Robust Data Analytics
Pipeline with Spark@Azure HDInsight!!!
Editor's Notes
  • #4: Two types of datasets – bounded (finite, unchanging) and unbounded (infinite, appended to continuously). Unbounded data is generated all the time and we want insight now. The connector is the glue between an unbounded data source like Event Hubs and a powerful processing engine like Spark. The goal is to deliver near-real-time analysis or views.
  • #6: Micro-batching mechanism that processes continuous, infinite data sources. A batch is scheduled at a regular time interval or after a certain number of events is received. A discretized stream is the highest-level abstraction over the continuous creation and expiration of RDDs. Batch duration – a single RDD is generated; window duration – a multiple of the batch duration, may span multiple RDDs. RDDs contain partitions, with one task per partition.
  • #7: High throughput and low latency, offered as platform-as-a-service on Azure. No cluster setup or monitoring required; the user can concentrate only on ingress and egress of data. An Event Hubs namespace is a collection of event hubs, an event hub is a collection of partitions, and a partition is a sequential collection of events. Up to 32 partitions per event hub, but this can be increased if required.
  • #8: HTTP or AMQP with transport-level security (TLS/SSL). HTTP has higher message-transmission overhead; AMQP has higher connection-setup overhead. A consumer group gives a logical view of event hub partitions, including addressing the same partition at different offsets. Up to 20 consumer groups per event hub; 1 receiver per consumer group.
  • #9: Each partition can be viewed as a commit log. The Event Hubs client maintains a prefetch queue to proactively get messages from the server. A receive call by the application gets messages in batches from the prefetch queue to the caller.
  • #10: No support from the Event Hubs server yet; offsets are managed by the Event Hubs connector on the Spark application side, using a distributed file system like HDFS, ADLS, etc. An offset is stored per consumer group, per partition, per event hub, per Event Hubs namespace. Event Hubs clients are initialized with an initial offset from which Event Hubs will start sending data. The offset is determined in one of three ways – start of stream, previously saved offset, or enqueue time.
  • #11: How do we bridge…
  • #12: Reliable receivers – received data is backed up in a reliable persistent store (WAL), so no data is lost between application restarts. Reliable receivers – the offset is saved after writing to the persistent store and pushing to the block manager. Both executors and the driver use the WAL.
  • #13: On application restart, data is processed from the WAL first, up to the offset saved before the previous application stop. Receiver tasks then start the Event Hubs clients, one per partition, with the last offset saved for each partition.
  • #14: Describe each parameter. Extends the Spark-provided Receiver class with the specific type Array[Byte], which is the exact content of the user data per event. The storage level determines whether to spill to disk when memory usage reaches capacity.
  • #15: onStart establishes connections to Event Hubs; onStop cleans up the connections. store reliably stores data to the block manager. restart calls stop and then start.