Unlocking the Power of Apache Flink: An Introduction in 4 Acts

Unlocking the Power of Apache Flink:
An Introduction in 4 Actsin
David Anderson
Software Practice Lead, Conﬂuent
Apache Flink Committer

Today’s consumers expect real-time services

Real-time
Data
A Sale
A Shipment
A Trade
Rich Front-End
Customer Experiences
A Customer
Experience
Real-Time Backend
Operations
Real-time services rely on stream processing
Real-time Stream Processing

Driving business value with Apache Flink
Real-time analytics Event-driven applications
Streaming data
pipelines
Continuously produce and
update results which are
displayed and delivered to users
as real-time data streams are
consumed
● Ad/campaign performance
● Content performance
● Quality monitoring of Telco
networks
● Usage metering and billing
Recognize patterns and react to
incoming events by triggering
computations, state updates, or
external actions
● Fraud detection
● Anomaly detection
● Business process
monitoring
● Geo-fencing
Real-time data pipelines that
continuously ingest, enrich,
and transform data streams,
loading them into destination
systems for timely action (vs.
batch processing)
● Continuous ETL
● Real-time search index
building
● ML pipelines
● Data lake ingestion

Developers are choosing Flink because of
Its performance and rich feature set
Scalability & high
performance
Flink supports stream
processing workloads at
tremendous scale
Flink supports Java,
Python, & SQL, enabling
developers to work in
their language of choice
Flink supports stream
processing, batch
processing, and ad-hoc
analytics through one
technology
Uniﬁed processing
Flink's checkpointing
mechanism provides
exactly-once guarantees
automatically
Fault tolerance &
high availability
Language ﬂexibility
Flink is a top 5 Apache project and has a very active community

@yourtwitterhandle | developer.confluent.io
Streaming
The four cornerstones on which Flink is built
State
Time Snapshots

● A stream is a sequence of events
● Business data is always a stream: bounded or unbounded
● For Flink, batch processing is just a special case in the runtime
now
past future
bounded stream
unbounded stream
Streaming

Kafka
Databases
Key/Value Stores
Files
Apps
Sources
Sinks

Kafka
Databases
Key/Value Stores
Files
Apps
Sources Sinks

The Job Graph (or Topology)
OPERATOR
CONNECTION

Stream processing
• Parallel
• Forward
• Repartition
• Rebalance
grouped by
shape
SOURCE

Stream processing
• Parallel
• Forward
• Repartition
• Rebalance
group by
color
FILTER

Stream processing
• Parallel
• Forward
• Repartition
• Rebalance
COUNT
1
2
3
1
2
3
4

Stream processing with SQL
INSERT INTO results
SELECT color, COUNT(*)
FROM events
WHERE color <> orange
GROUP BY color;
GROUP BY color
results
COUNT
events

INSERT INTO results
FROM events
GROUP BY color;
GROUP BY color
events
results
COUNT

INSERT INTO results
FROM events
GROUP BY color;
GROUP BY color
results
COUNT
events
events

Flink’s APIs
Apache Flink Runtime
Low-Level
Stream Operator API
Optimizer / Planner
Table / SQL API
DataStream API

Flink supports streaming
● Bounded or unbounded streams
● Entire pipeline must always be running
● Input must be processed as it arrives
● Results are reported as they become ready
● Failure recovery resumes from a recent snapshot
● Flink guarantees effectively exactly-once results
despite out-of-order data and restarts due to
failures, etc.
● Only bounded streams
● Execution proceeds in stages, running as needed
● Input may be pre-sorted by time and key
● Results are reported at the end of the job
● Failure recovery does a reset and full restart
● Effectively exactly-once guarantees are more
straightforward
and batch

@yourtwitterhandle | developer.confluent.io
Streaming State
Time Snapshots

Stateful stream processing with Flink SQL
INSERT INTO results
FROM events
GROUP BY color;
GROUP BY color
events
results
COUNT

Stateful stream processing with Flink SQL
● Counting requires
state
GROUP BY color
events
results
COUNT

State
• Local
• Fast
• Fault tolerant

Time
• Synchronize
• Wait
• Timeout
09:05:44
When the event was created at its
original source.
Event time
09:08:01
When the event is being processed.
This time varies between applications.
Processing time

● Streams are (roughly) ordered by time
Out-of-order event streams
10:10
10:14
10:10
10:14

Coping with out of order events
This event will
be read next

These events
follow

Imagine a window counting events for the hour ending at 2:00. How long should this
window wait before producing its results?

Watermarks measure progress of event time
Watermark
● This watermark has been generated by assuming that the stream is at most 5 minutes
out-of-order

out-of-order
● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate
1:50 - 5 = 1:45

out-of-order
● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate
● A watermark is an assertion about the completeness of the stream
Now this stream is
complete up to 1:45

Imagine a window counting events for the hour ending at 2:00. How long should this
window wait before producing its results?
It should wait for a watermark with a timestamp of at least 2:00.

What are
watermarks for?
They make things happen when the time is right.

The idle stream
problem
● Streams that are idle do not
advance the watermark
● This prevents windows from
producing results

The idle stream
problem
● Streams that are idle do not
advance the watermark
● This prevents windows from
producing results
Solutions
● Balance the partitions so none are
empty or idle, or
● Send keep-alive events, or
● Conﬁgure the watermarking to
use idleness detection

Watermarks
● Not needed for applications that only use wall-clock (processing) time
● Not needed for batch processing
● Are needed for triggering actions based on event-time, e.g., closing a
window
● Are generated based on an assumption of how out of order the data might
be
● Provide control over the tradeoff between completeness and latency
● Flink SQL drops late events; the DataStream API offers more control
● Allow for consistent, reproducible results
● Potentially idle sources require special attention

A checkpoint is an
automatic snapshot
created by Flink,
primarily for the
purpose of failure
recovery

A checkpoint is an
automatic snapshot
created by Flink,
primarily for the
purpose of failure
recovery
A savepoint is a
manual snapshot
created for some
operational purpose
(e.g., a stateful
upgrade)

Snapshots
events
results
COUNT
FILTER
GROUP BY color

Snapshots
Source Filter Count by color Sink
events
results
COUNT
FILTER
GROUP BY color

Snapshots
Offsets for some
partitions
Offsets for other
partitions
events
results
COUNT
FILTER
GROUP BY color

Snapshots
Offsets for some
partitions
______________________
Offsets for other
partitions
______________________
events
results
COUNT
FILTER
GROUP BY color

Snapshots
Offsets for some
partitions
______________________ Counters for some
colors
Offsets for other
partitions
______________________ Counters for other
colors
events
results
COUNT
FILTER
GROUP BY color

Snapshots
Offsets for some
partitions
______________________ Counters for some
colors
Transaction ID
Offsets for other
partitions
______________________ Counters for other
colors
__________________
events
results
COUNT
FILTER
GROUP BY color

Taking a
snapshot
does NOT
stop the
world
Checkpoints and savepoints are created
asynchronously, while the job continues to
process events and produce results

Because
these are
self-consistent,
global
snapshots
● Flink provides (effectively) exactly-once
guarantees
● Recovery involves restarting the entire job
from the most recent checkpoint
Recovery

Streaming
Unfamiliar to many
developers, but
ultimately
straightforward
Watermarks
encapsulate something
complex in one place –
the sources
● how out-of-order?
● can it be idle?
Transparent to
application developers
State snapshots for
recovery
Delightfully simple
● local
● key/value
● single-threaded
State Event time and
watermarks

Where is the
Flink community?
To subscribe to the mailing lists, or get an invite
to Slack, see https://ﬂink.apache.org/community/

Unlocking the Power of Apache Flink: An Introduction in 4 Acts

Your Apache Flink®
journey begins here
developer.conﬂuent.io

Unlocking the Power of Apache Flink: An Introduction in 4 Acts

More Related Content

What's hot (20)

Similar to Unlocking the Power of Apache Flink: An Introduction in 4 Acts (20)

More from HostedbyConfluent (20)

Recently uploaded (20)

Unlocking the Power of Apache Flink: An Introduction in 4 Acts