© 2019 Ververica
Seth Wiesman, Solutions Architect
Deep Dive on Apache Flink State
Agenda
• Serialization
• State Backends
• Checkpoint Tuning
• Schema Migration
• Upcoming Features
Serializers
Flink’s Serialization System
• Natively Supported Types
• Primitive Types
• Tuples, Scala Case Classes
• POJOs
• Unsupported Types Fall Back to Kryo
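To catch an unintended Kryo fallback early, you can disable generic types on the ExecutionConfig - a minimal sketch (not from the slides); the job then fails fast instead of silently serializing with Kryo:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Throw instead of falling back to Kryo (GenericTypeInfo) for unsupported types.
env.getConfig().disableGenericTypes();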
Flink’s Serialization System
Benchmark Results For Flink 1.8
Serializer Ops/s
PojoSerializer 305 / 293*
RowSerializer 475
TupleSerializer 498
Kryo 102 / 67*
Avro (Reflect API) 127
Avro (SpecificRecord API) 297
Protobuf (via Kryo) 376
Apache Thrift (via Kryo) 129 / 112*
public static class MyPojo {
  public int id;
  private String name;
  private String[] operationNames;
  private MyOperation[] operations;
  private int otherId1;
  private int otherId2;
  private int otherId3;
  private Object someObject; // used with String
}
public static class MyOperation {
  int id;
  protected String name;
}
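To check which serializer Flink will pick for a type such as MyPojo above, inspect its TypeInformation - a small sketch (not part of the slides):

TypeInformation<MyPojo> info = TypeInformation.of(MyPojo.class);
// PojoTypeInfo -> PojoSerializer; GenericTypeInfo -> Kryo fallback
System.out.println(info.getClass().getSimpleName());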
Custom Serializers
• registerKryoType(Class<?>)
• Registers a type with Kryo for more compact binary format
• registerTypeWithKryoSerializer(Class<?>, Class<? extends Serializer>)
• Provides a default serializer for the given class
• Provided serializer class must extend com.esotericsoftware.kryo.Serializer
• addDefaultKryoSerializer(Class<?>, Serializer<?> serializer)
• Registers a serializer as the default serializer for the given type
Registration with Kryo via ExecutionConfig
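A hedged sketch of these three calls on the ExecutionConfig (MyType and MyKryoSerializer are placeholder names):

ExecutionConfig config = env.getConfig(); // env = StreamExecutionEnvironment

// More compact binary format: Kryo writes a registration id instead of the full class name.
config.registerKryoType(MyType.class);

// Serializer for exactly this registered type.
config.registerTypeWithKryoSerializer(MyType.class, MyKryoSerializer.class);

// Default serializer for this type (and, per Kryo semantics, its subtypes).
config.addDefaultKryoSerializer(MyType.class, MyKryoSerializer.class);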
Custom Serializers
@TypeInfo Annotation
@TypeInfo(MyTupleTypeInfoFactory.class)
public class MyTuple<T0, T1> {
  public T0 myfield0;
  public T1 myfield1;
}
public class MyTupleTypeInfoFactory extends TypeInfoFactory<MyTuple> {
  @Override
  public TypeInformation<MyTuple> createTypeInfo(Type t, Map<String, TypeInformation<?>> genericParameters) {
    return new MyTupleTypeInfo(genericParameters.get("T0"), genericParameters.get("T1"));
  }
}
State Backends
Task Manager Process Memory Layout
Diagram (repeated across three slides): the Task Manager JVM process is split into the Java heap and off-heap / native memory, holding the Flink framework, network buffers, timer state, and keyed state; typical sizes of the regions are indicated, with keyed state usually the largest.
Keyed State Backends
• Based on Java Heap Objects
• Based on RocksDB
Heap Keyed State Backend
• State lives as Java objects on the heap
• Organized as chained hash table, key ↦ state
• One hash table per registered state
• Supports asynchronous state snapshots
• Data is de / serialized only during state snapshot and restore
• Highest Performance
• Affected by garbage collection overhead / pauses
• Currently no incremental checkpoints
• High memory overhead of representation
• State is limited by available heap memory
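For reference, a minimal sketch (paths are placeholders) of selecting a heap-based keyed state backend - in Flink 1.8 this is what MemoryStateBackend and FsStateBackend use under the hood:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// State lives as objects on the heap; snapshots are written to the given URI.
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));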
Heap State Table Architecture
- Hash buckets (Object[]), 4B-8B per slot
- Load factor <= 75%
- Incremental rehash
Each entry in the chain holds:
▪ 4 references: Key, Namespace, State, Next
▪ 3 ints: Entry Version, State Version, Hash Code
Per-entry size: 4 x (4B-8B) + 3 x 4B + ~8B-16B object overhead.
(Object sizes and overhead are approximate; some objects might be shared.)
Heap State Table Snapshot
Diagram (Original vs. Snapshot): right after the snapshot, both bucket arrays point to the same entries; the copy of the hash bucket array is the snapshot overhead.
Heap State Table Snapshot
Diagram (Original vs. Snapshot): new entries (e.g. D) only affect the original - no conflicting modification = no overhead.
Heap State Table Snapshot
Diagram (Original vs. Snapshot): entry A is modified to A’ in the original, so the snapshot keeps a copy of A. Modifications trigger a deep copy of the entry - only as much as required, depending on what was modified and what is immutable (as determined by the type serializer). Worst-case overhead = size of the original at the time of the snapshot.
Heap Backend Tuning Considerations
• Choose TypeSerializers with efficient copy-methods
• Flag immutability of objects where possible to avoid copy completely
• Flatten POJOs / avoid deep objects
• Reduces object overheads and following references
• GC choice / tuning
• Scale out using multiple task managers per node
RocksDB Keyed State Backend Characteristics
• State lives as serialized byte-strings in off-heap memory and on local disk
• One column family per registered state (~table)
• Key / Value store, organized as a log-structured merge tree (LSM tree)
• Key: serialized bytes of <keygroup, key, namespace>
• LSM naturally supports MVCC
• Data is de / serialized on every read and update
• Not affected by garbage collection
• Relatively low overhead of representation
• LSM naturally supports incremental snapshots
• State size is limited by available local disk space
• Lower performance (~ order of magnitude compared to Heap state backend)
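A matching sketch (paths are placeholders) of selecting the RocksDB backend, with incremental checkpoints enabled via the constructor flag:

// Requires the flink-statebackend-rocksdb dependency; the constructor may throw IOException.
RocksDBStateBackend backend =
    new RocksDBStateBackend("hdfs:///flink/checkpoints", true); // true = incremental checkpoints
backend.setDbStoragePath("/mnt/local-ssd/rocksdb");             // fast local working directory
env.setStateBackend(backend);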
RocksDB Architecture
Diagram (write path): a WriteOp goes to the active MemTable in memory; when full, it is switched to a read-only MemTable and flushed to an SST file on local disk, where SST files are merged by compaction. MemTable settings are set per column family (~table). In Flink, the WAL and sync are disabled; persistence comes from checkpoints.
Diagram (read path): a ReadOp merges data from the active and read-only MemTables with data from the SST files, which are served through a read-only block cache.
RocksDB Resource Consumption
• One RocksDB instance per operator subtask
• block_cache_size
• Size of the block cache
• write_buffer_size
• Max size of a MemTable
• max_write_buffer_number
• The maximum number of MemTables allowed in memory before flushing to an SST file
• Indexes and bloom filters
• Optional
• Table Cache
• Caches open file descriptors to SST files
• Default: unlimited!
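These settings can be applied from Flink through an OptionsFactory on the RocksDB backend - a hedged sketch using org.apache.flink.contrib.streaming.state.OptionsFactory and the org.rocksdb option classes (all values are illustrative):

public class TunedRocksDBOptions implements OptionsFactory {

  @Override
  public DBOptions createDBOptions(DBOptions currentOptions) {
    // Bound the table cache: limit open SST file descriptors (default is unlimited).
    return currentOptions.setMaxOpenFiles(2048);
  }

  @Override
  public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
    return currentOptions
        .setWriteBufferSize(64 * 1024 * 1024)  // write_buffer_size: max MemTable size
        .setMaxWriteBufferNumber(3)            // max_write_buffer_number before flush
        .setTableFormatConfig(
            new BlockBasedTableConfig().setBlockCacheSize(256 * 1024 * 1024)); // block_cache_size
  }
}

// backend.setOptions(new TunedRocksDBOptions());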
Performance Tuning
Amplification Factors
Diagram: the tuning parameter space trades off write amplification, read amplification, and space amplification. Example: more compaction effort = increased write amplification and reduced read amplification.
More details: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
General Performance Considerations
• Use efficient TypeSerializers and serialization formats
• Decompose user code objects (see the sketch after this list)
• ValueState<List<Integer>> → ListState<Integer>
• ValueState<Map<Integer, Integer>> → MapState<Integer, Integer>
• Use the correct configuration for your hardware setup
• Consider enabling RocksDB native metrics to profile your applications
• File Systems
• Working directory on fast storage, ideally local SSD. Could even be memory.
• EBS performance can be problematic
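A small sketch of the decomposition above, e.g. inside open() of a rich keyed function (names and types are illustrative):

// Before: the whole map is (de)serialized on every access.
ValueState<Map<Integer, Integer>> coarse = getRuntimeContext().getState(
    new ValueStateDescriptor<>("counts-as-value", Types.MAP(Types.INT, Types.INT)));

// After: entries are stored individually; with RocksDB only the touched entry is (de)serialized.
MapState<Integer, Integer> fine = getRuntimeContext().getMapState(
    new MapStateDescriptor<>("counts", Types.INT, Types.INT));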
Timer Service
Heap Timers
Each timer holds:
▪ 2 references: Key, Namespace
▪ 1 long: Timestamp
▪ 1 int: Array Index
(Object sizes and overhead are approximate; some objects might be shared.)
Timers are kept in a binary heap backed by an array:
Peek: O(1)
Poll: O(log(n))
Insert: O(log(n))
Delete: O(n)
Contains: O(n)
Adding a HashMap<Timer, Timer> alongside the heap gives fast deduplication and deletes:
Delete: O(log(n))
Contains: O(1)
Snapshot: the values of a timer are immutable.
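For context, a minimal hedged sketch of how user code creates such timers with a KeyedProcessFunction (types and the one-minute timeout are illustrative):

public class TimeoutFunction extends KeyedProcessFunction<String, String, String> {

  @Override
  public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
    // One timer per (key, namespace, timestamp); duplicates are deduplicated by the timer service.
    ctx.timerService().registerProcessingTimeTimer(
        ctx.timerService().currentProcessingTime() + 60_000L);
  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
    out.collect("timeout for key " + ctx.getCurrentKey() + " at " + timestamp);
  }
}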
RocksDB Timers
Timers are stored in a dedicated column family as lexicographically ordered byte sequences - only a key, no value:
Key Group | Timestamp | Key | Namespace
0         | 20        | A   | X
0         | 40        | D   | Z
1         | 10        | D   | Z
1         | 20        | C   | Y
2         | 50        | B   | Y
2         | 60        | A   | X
On top of the column family, Flink keeps per-key-group queues (caching the first k timers) and a priority queue over those key group queues.
3 Task Manager Memory Layouts
Diagram: three Task Manager JVM process layouts side by side (Java heap and off-heap / native memory holding the Flink framework, network buffers, timer state, and keyed state), showing where keyed state and timer state live depending on the chosen state and timer backends.
Full / Incremental Checkpoints
Full Checkpoint
Diagram: state at t1, t2, and t3; Checkpoint 1, Checkpoint 2, and Checkpoint 3 each contain a complete copy of the state at their point in time.
Full Checkpoint Overview
• Creation iterates and writes full database snapshots as a stream to stable storage
• Restore reads data as a stream from stable storage and re-inserts into the state backend
• Each checkpoint is self-contained, and its size is proportional to the size of the full state
• Optional: compression with snappy
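A small sketch tying this together (intervals are illustrative) - enable checkpointing and, optionally, snappy compression of full snapshots:

env.enableCheckpointing(60_000);                                  // checkpoint every 60 s
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000);
env.getConfig().setUseSnapshotCompression(true);                  // snappy compression of keyed state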
Incremental Checkpoint
Diagram: state at t1, t2, and t3; Checkpoint 1 contains the full state, while Checkpoint 2 and Checkpoint 3 contain only the deltas (𝚫) and build upon the previous checkpoints.
Incremental Checkpoints with RocksDB
Diagram: the same RocksDB write path as before; an incremental checkpoint observes which SST files were created and deleted since the last checkpoint and uploads only the new ones.
Incremental Checkpoint Overview
• Expected trade-off: faster* checkpoints, slower recovery
• Creation only copies deltas (new local SST files) to stable storage
• Creates write amplification because we also upload compacted SST files so that we can prune the checkpoint history
• Sum of all increments that we read from stable storage can be larger than the full state size
• No rebuild is required because we simply re-open the RocksDB backend from the SST files
• SST files are snappy compressed by default
Schema Migration
Anatomy of a Flink Stream Job Upgrade
Diagram: the Flink job user code performs local reads / writes that manipulate state in the local state backend; the backend is persisted to a savepoint.
Diagram: Application Upgrade - the job's user code is upgraded while the state is preserved in the persistent savepoint.
Diagram: Continue To Access State - the upgraded job restores its local state backend from the persistent savepoint and continues to read and write its state.
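The cycle above is usually driven through the Flink CLI - a hedged sketch (job id, paths, and jar name are placeholders):

# stop the running job and take a savepoint
bin/flink cancel -s hdfs:///flink/savepoints <jobId>

# start the upgraded job from that savepoint
bin/flink run -s hdfs:///flink/savepoints/savepoint-xxxx upgraded-job.jar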
Upcoming Features
• A new state backend
• Unified savepoint binary format
• State Processor API
Questions?
