The State of the Table API: 2022
David Anderson – @alpinegizmo – Flink Forward 22
Flink 1.14 (Sep 2021)
● legacy planner removed
● streaming/batch unification
● DataStream <-> Table interop
Flink 1.15 (May 2022)
● SQL version upgrades
● window TVFs in batch
● JSON functions
● Table Store
Flink 1.16 (Aug-Sep 2022)
● MATCH_RECOGNIZE batch
● SQL Gateway
Intro
About me
Apache Flink
● Flink Committer
● Focus on training, documentation, FLIP-220
● Release manager for Flink 1.15.1
● Prolific author of answers about Flink on Stack Overflow
Career
● Researcher: Carnegie Mellon, Mitsubishi Electric, Sun Labs
● Consultant: Machine Learning and Data Engineering
● Trainer: Data Science Retreat and data Artisans / Ververica
● Community Engineering @ immerok
David Anderson
@alpinegizmo
Business data is naturally in streams: either bounded or unbounded
Batch processing is a special case of stream processing
(timeline diagram: an unbounded stream runs from the past, through now, into the future; a bounded stream covers a fixed slice of it)
Flink jobs are organized as dataflow graphs
(dataflow graph: Transactions and Customers sources feed a Join, which feeds a Sink)
Flink jobs are stateful
(the same dataflow graph, with state held by the Join operator)
Flink jobs are executed in parallel
(dataflow graph: partition 1 and partition 2 of the Transactions and Customers sources are shuffled by customerId into parallel Join instances that feed the Sink)
DataStreams & Tables, Batch & Streaming
Looking back at Flink’s legacy API stack: the DataSet API and the DataStream API sat directly on the runtime, and the Table / SQL API (unified batch & streaming) was implemented on top of them.
Today the Table API is entirely its own thing: the runtime exposes an internal operator API, and the relational planner / optimizer behind the Table / SQL API (unified batch & streaming) sits directly on it, alongside the DataStream API (also unified batch & streaming).
Latest Transaction for each Customer (Table)
SELECT
t_id,
t_customer_id,
t_amount,
t_time
FROM (
SELECT *, ROW_NUMBER()
OVER (PARTITION BY t_customer_id
ORDER BY t_time DESC)
AS rownum
FROM Transactions )
WHERE rownum <= 1;
{
"t_id": 1,
"t_customer_id": 1,
"t_amount": 99.08,
"time": 1657144244000
}
Batch vs Streaming
Batch
+-------------------------+---------------+-----------------------+--------------+
| t_time | t_id | t_customer_id | t_amount |
+-------------------------+---------------+-----------------------+--------------+
| 2022-07-24 08:00:00.000 | 2 | 0 | 500 |
| 2022-07-24 09:00:00.000 | 3 | 1 | 11 |
+-------------------------+---------------+-----------------------+--------------+
Streaming
+----+-------------------------+--------------+----------------------+--------------+
| op | t_time | t_id | t_customer_id | t_amount |
+----+-------------------------+--------------+----------------------+--------------+
| +I | 2022-08-03 09:17:25.505 | 0 | 1 | 316 |
| +I | 2022-08-03 09:17:26.871 | 1 | 0 | 660 |
| -U | 2022-08-03 09:17:26.871 | 1 | 0 | 660 |
| +U | 2022-08-03 09:17:27.952 | 2 | 0 | 493 |
| -U | 2022-08-03 09:17:25.505 | 0 | 1 | 316 |
| +U | 2022-08-03 09:17:29.046 | 3 | 1 | 35 |
| … | … | … | … | … |
Latest Transaction for each Customer (DataStream)
DataStream<Transaction> results =
transactionStream
.keyBy(t -> t.t_customer_id)
.process(new LatestTransaction());
public void processElement(
Transaction incoming,
Context context,
Collector<Transaction> out) {
Transaction latest = latestTransaction.value();
if (latest == null ||
(incoming.t_time.isAfter(latest.t_time))) {
latestTransaction.update(incoming);
out.collect(incoming);
}
}
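The processElement method above reads and updates a latestTransaction value state that the slide does not show being declared. A minimal sketch of how that state might be set up inside the LatestTransaction function (the descriptor name and generic types here are illustrative assumptions, not taken from the original code):

public class LatestTransaction
    extends KeyedProcessFunction<Long, Transaction, Transaction> {

  // per-key state holding the most recent transaction seen so far
  private transient ValueState<Transaction> latestTransaction;

  @Override
  public void open(Configuration parameters) {
    // register the state with the runtime; "latest" is an illustrative name
    latestTransaction =
        getRuntimeContext()
            .getState(new ValueStateDescriptor<>("latest", Transaction.class));
  }

  // processElement(...) as shown above
}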
Two different programming models
DataStreams
● inputs and outputs: event streams
○ user implements classes for event objects
○ user supplies ser/de
● business logic: low-level code that
reacts to events and timers by
○ reading and writing state
○ creating timers
○ emitting events
Dynamic Tables
● inputs and outputs: event streams
are a history of changes to Tables
○ events insert, update, or delete Rows
○ user provides Table schemas
○ user specifies formats (e.g. CSV or JSON)
● business logic: SQL queries
○ high-level, declarative description compiled
into a dataflow graph
○ the dataflow reacts to these changes and
updates the result(s) (akin to materialized
view maintenance)
Interoperability
Customers
{
"c_id": 1,
"c_name": "Ramon Stehr"
}
{
"t_id": 1,
"t_customer_id": 1,
"t_amount": 99.08,
"time": 1657144244000
}
Transactions
In this example, the transaction stream may contain duplicates.
(dataflow graph: Transactions → Deduplicate → Join ← Customers, with the Join feeding the Sink)
INSERT INTO Sink
SELECT t_id, c_name, t_amount
FROM Customers
JOIN (SELECT DISTINCT * FROM Transactions) ON c_id = t_customer_id;
Results
+I[25, Renaldo Walsh, 280.49]
+I[27, Stuart Altenwerth, 818.16]
+I[19, Kizzie Reichert, 60.71]
+I[29, Renaldo Walsh, 335.59]
+I[31, Stuart Altenwerth, 948.26]
+I[23, Ashley Towne, 784.84]
+I[41, Louis White, 578.81]
+I[35, Ashley Towne, 585.44]
+I[43, Renaldo Walsh, 503.11]
+I[39, Kizzie Reichert, 625.32]
+I[13, Kizzie Reichert, 840.47]
...
Starting point: POJOs for Customers and Transactions
public class Customer {
// A Flink POJO must have public fields, or getters and setters
public long c_id;
public String c_name;
// A Flink POJO must have a no-args default constructor
public Customer() {}
. . .
}
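The matching Transaction POJO is not shown on this slide; by analogy with Customer, a plausible sketch (the field types are assumptions inferred from the JSON record and the queries elsewhere in this deck) might look like:

public class Transaction {
  // public fields so that Flink recognizes this class as a POJO
  public long t_id;
  public long t_customer_id;
  public java.math.BigDecimal t_amount; // cast to DECIMAL(5, 2) in a later query
  public java.time.Instant t_time;      // compared with isAfter(...) in LatestTransaction

  // a Flink POJO also needs a no-args default constructor
  public Transaction() {}
}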
Seamless interoperability between DataStreams and Tables
KafkaSource<Customer> customerSource =
KafkaSource.<Customer>builder()
.setBootstrapServers("localhost:9092")
.setTopics(CUSTOMER_TOPIC)
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(new CustomerDeserializer())
.build();
DataStream<Customer> customerStream =
env.fromSource(
customerSource, WatermarkStrategy.noWatermarks(), "Customers");
tableEnv.createTemporaryView("Customers", customerStream);
Seamless interoperability between DataStreams and Tables
// use Flink SQL to do the heavy lifting
Table resultTable =
tableEnv.sqlQuery(
String.join(
"n",
"SELECT t_id, c_name, CAST(t_amount AS DECIMAL(5, 2))",
"FROM Customers",
"JOIN (SELECT DISTINCT * FROM Transactions)”,
"ON c_id = t_customer_id"));
// switch back from Table API to DataStream
DataStream<Row> resultStream = tableEnv.toChangelogStream(resultTable);
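The deck stops at toChangelogStream; as an illustrative continuation (assuming the usual StreamExecutionEnvironment named env, as in the earlier snippet), the changelog stream could simply be printed and the job submitted:

// print the +I/-U/+U rows and run the pipeline (the job name is illustrative)
resultStream.print();
env.execute("Customers joined with deduplicated Transactions");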
(talk outline: Intro · DataStreams vs Tables · Interoperability · Interlude · Version Upgrades · Table Store · SQL Gateway)
Interlude
Flink now has a powerful and versatile SQL engine
● Batch / Streaming unification
● The new type system
● DataStream / Table interoperability
● Scala-free classpath
● Catalogs, connectors, formats, CDC
● PyFlink
● Improved semantics
● Optimizations
● Bug fixes, new features, etc.
Use cases?
● ETL (esp joins)
● Analytics
● Anything, really
○ in combination with UDFs
and/or the DataStream API
SQL Features in Flink 1.16

Streaming and Batch
● SELECT FROM WHERE
● GROUP BY [HAVING]
○ non-windowed
○ TUMBLE, HOP, SESSION windows
● Window Table-Valued Functions (illustrated in the sketch after this list)
○ TUMBLE, HOP, CUMULATE windows
● OVER window
● JOIN
○ time-windowed INNER + OUTER JOIN
○ non-windowed INNER + OUTER JOIN
● MATCH_RECOGNIZE
● Set Operations
● User-Defined Functions
○ scalar, aggregation, table-valued
● Statement Sets

Streaming only
● ORDER BY time
● INNER JOIN with a temporal table or an external lookup table

Batch only
● ORDER BY anything
● Full TPC-DS support
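Window table-valued functions deserve a quick illustration. A hedged sketch of an hourly TUMBLE aggregation over the Transactions table used throughout this deck, assuming t_time has been declared as a time attribute (that declaration is not shown in the original examples):

// window TVF: hourly sums of transaction amounts
Table hourlyTotals =
    tableEnv.sqlQuery(
        String.join(
            "\n",
            "SELECT window_start, window_end, SUM(t_amount) AS total",
            "FROM TABLE(",
            "  TUMBLE(TABLE Transactions, DESCRIPTOR(t_time), INTERVAL '1' HOUR))",
            "GROUP BY window_start, window_end"));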
Table API: Long-term initiatives (FLIPs, rolled out across releases 1.9 – 1.16)
● Blink planner: FLIP-32
● Python: FLIPs 38, 58, 78, 96, 97, 106, 112, 114, 121, 139
● Hive: FLIPs 30, 123, 152
● CDC: FLIPs 87, 95, 105
● Connectors, Formats
● DataStream/Table interop: FLIP-136
● Version upgrades: FLIP-190
● Table Store: FLIPs 188, 226, 230, 254
● SQL Gateway: FLIP-91
Version Upgrades
Stateful restarts of Flink jobs
● Flink jobs can be restarted from
checkpoints and savepoints
● This requires that each stateful
operator be able to find and load its
state
● Things may have changed, making this
difficult/impossible
○ types
○ topology
DataStream API
● You have enough low-level control to
be able to avoid or cope with
potential problems
Table/SQL API
● New Flink versions can introduce
changes to the SQL planner that
render old state un-restorable
FLIP-190: Flink Version Upgrades for Table/SQL API Programs
Goals
● The same query can always be restarted correctly after upgrading Flink
● Schema and query evolution are out of scope
Status
● Released as BETA in 1.15
Usage
● Only supports streaming
● Must be a complete pipeline, i.e., INSERT INTO sink SELECT . . .
Example: before upgrade
String streamingQueryWithInsert =
String.join(
"n",
"INSERT INTO sink",
"SELECT t_id, c_name, t_amount",
"FROM Customers",
"JOIN (SELECT DISTINCT * FROM Transactions)",
"ON c_id = t_customer_id");
tableEnv.compilePlanSql(streamingQueryWithInsert).writeToFile(planLocation);
Example: after upgrade
TableResult execution =
tableEnv.executePlan(PlanReference.fromFile(planLocation));
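FLIP-190 also exposes the same mechanism as SQL statements. As a sketch based on the 1.15 documentation (syntax worth double-checking against the docs for your Flink version; the file path is illustrative), the plan can be persisted and later executed with COMPILE PLAN / EXECUTE PLAN:

// before the upgrade: persist the plan for the INSERT statement
tableEnv.executeSql(
    "COMPILE PLAN 'file:///tmp/plan.json' FOR " + streamingQueryWithInsert);

// after the upgrade: run the previously compiled plan
tableEnv.executeSql("EXECUTE PLAN 'file:///tmp/plan.json'");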
Table Store
Typical use case / scenario
(diagram: Joins write intermediate results and Aggregations write aggregated results into the Table Store)
Tables backed by connectors vs built-in table storage
CREATE CATALOG my_catalog WITH (
'type'='table-store',
'warehouse'='file:/tmp/table_store'
);
USE CATALOG my_catalog;
-- create a word count table
CREATE TABLE word_count (
word STRING PRIMARY KEY NOT ENFORCED,
cnt BIGINT
);
-- create a word count table
CREATE TABLE word_count (
word STRING PRIMARY KEY NOT ENFORCED,
cnt BIGINT
) WITH (
'connector' = 'filesystem',
'path' = '/tmp/word_count',
'format' = 'csv'
);
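Once the catalog and table exist, the Table Store table behaves like a normal table for both writes and reads. A minimal illustrative sketch, expressed in the same Java TableEnvironment style as the earlier snippets (the word_source table and its schema are assumptions, not part of the original example):

// continuously maintain word counts in the built-in table
tableEnv.executeSql(
    "INSERT INTO word_count SELECT word, COUNT(*) FROM word_source GROUP BY word");

// later, an ad-hoc query against the same table
tableEnv.executeSql("SELECT * FROM word_count").print();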
Architecture of this built-in table storage
Advantages of the Table Store
● Easy to use
○ drop in the JAR file and start using it
○ provides “normal” tables
● Flexible
○ streaming pipelines
○ batch jobs
○ ad-hoc queries
● Low-latency
● Integrates with
○ Spark
○ Trino
○ Hive
SQL Gateway
SQL Gateway: Architecture
(diagram: a Client talks to the REST endpoint, which hands work to the Session Manager and Executor; the Executor uses the Catalog and submits queries to a Flink Cluster)
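As an illustrative sketch of how a client might drive this architecture over REST (the port and endpoint paths below follow FLIP-91 but are assumptions to verify against the SQL Gateway documentation):

// open a session on the gateway's REST endpoint (port 8083 is an assumed default)
HttpClient client = HttpClient.newHttpClient();
HttpResponse<String> session =
    client.send(
        HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/v1/sessions"))
            .POST(HttpRequest.BodyPublishers.ofString("{}"))
            .build(),
        HttpResponse.BodyHandlers.ofString());

// the response carries a session handle; statements would then be submitted to
// /v1/sessions/{sessionHandle}/statements, and results fetched from the
// corresponding operation endpoints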
Wrap-up
The ongoing efforts to add version upgrade support, built-in table storage, and a SQL gateway will expand the Table API into many new use cases.
Thanks!
David Anderson
@alpinegizmo
danderson@apache.org
These examples and more can be found in the Immerok Apache Flink Cookbook at https://docs.immerok.cloud