The State of the Table API: 2022
David Anderson – @alpinegizmo – Flink Forward 22
Flink 1.14 (Sep 2021)
● legacy planner removed
● streaming/batch unification
● DataStream <-> Table interop
Flink 1.15 (May 2022)
● SQL version upgrades
● window TVFs in batch
● JSON functions
● Table Store
Flink 1.16 (Aug-Sep 2022)
● MATCH_RECOGNIZE batch
● SQL Gateway
Intro
About me
Apache Flink
● Flink Committer
● Focus on training, documentation, FLIP-220
● Release manager for Flink 1.15.1
● Prolific author of answers about Flink on Stack Overflow
Career
● Researcher: Carnegie Mellon, Mitsubishi Electric, Sun Labs
● Consultant: Machine Learning and Data Engineering
● Trainer: Data Science Retreat and data Artisans / Ververica
● Community Engineering @ immerok
David Anderson
@alpinegizmo
Business data is naturally in streams: either bounded or unbounded
Batch processing is a special case of stream processing
(timeline diagram: an unbounded stream runs from the past, through now, into the future; a bounded stream covers a fixed slice of it)
Flink jobs are organized as dataflow graphs
(dataflow graph: Transactions and Customers sources feed a Join, which feeds a Sink)
Flink jobs are stateful
(the same dataflow graph, with state held by the Join operator)
Flink jobs are executed in parallel
(dataflow graph: partition 1 and partition 2 of the Transactions and Customers sources are shuffled by customerId into parallel Join instances that feed the Sink)
DataStreams & Tables, Batch & Streaming
Looking back at Flink’s legacy API stack: the DataSet API and the DataStream API sat directly on the runtime, and the Table / SQL API (unified batch & streaming) was implemented on top of them.
Today the Table API is entirely its own thing: the runtime exposes an internal operator API, and the relational planner / optimizer behind the Table / SQL API (unified batch & streaming) sits directly on it, alongside the DataStream API (also unified batch & streaming).
Latest Transaction for each Customer (Table)
SELECT
t_id,
t_customer_id,
t_amount,
t_time
FROM (
SELECT *, ROW_NUMBER()
OVER (PARTITION BY t_customer_id
ORDER BY t_time DESC)
AS rownum
FROM Transactions )
WHERE rownum <= 1;
{
"t_id": 1,
"t_customer_id": 1,
"t_amount": 99.08,
"time": 1657144244000
}
Batch vs Streaming
Batch
+-------------------------+---------------+-----------------------+--------------+
| t_time | t_id | t_customer_id | t_amount |
+-------------------------+---------------+-----------------------+--------------+
| 2022-07-24 08:00:00.000 | 2 | 0 | 500 |
| 2022-07-24 09:00:00.000 | 3 | 1 | 11 |
+-------------------------+---------------+-----------------------+--------------+
Streaming
+----+-------------------------+--------------+----------------------+--------------+
| op | t_time | t_id | t_customer_id | t_amount |
+----+-------------------------+--------------+----------------------+--------------+
| +I | 2022-08-03 09:17:25.505 | 0 | 1 | 316 |
| +I | 2022-08-03 09:17:26.871 | 1 | 0 | 660 |
| -U | 2022-08-03 09:17:26.871 | 1 | 0 | 660 |
| +U | 2022-08-03 09:17:27.952 | 2 | 0 | 493 |
| -U | 2022-08-03 09:17:25.505 | 0 | 1 | 316 |
| +U | 2022-08-03 09:17:29.046 | 3 | 1 | 35 |
| … | … | … | … | … |
Latest Transaction for each Customer (DataStream)
DataStream<Transaction> results =
transactionStream
.keyBy(t -> t.t_customer_id)
.process(new LatestTransaction());
public void processElement(
Transaction incoming,
Context context,
Collector<Transaction> out) {
Transaction latest = latestTransaction.value();
if (latest == null ||
(incoming.t_time.isAfter(latest.t_time))) {
latestTransaction.update(incoming);
out.collect(incoming);
}
}
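The processElement method above reads and updates a latestTransaction value state that the slide does not show being declared. A minimal sketch of how that state might be set up inside the LatestTransaction function (the descriptor name and generic types here are illustrative assumptions, not taken from the original code):

public class LatestTransaction
    extends KeyedProcessFunction<Long, Transaction, Transaction> {

  // per-key state holding the most recent transaction seen so far
  private transient ValueState<Transaction> latestTransaction;

  @Override
  public void open(Configuration parameters) {
    // register the state with the runtime; "latest" is an illustrative name
    latestTransaction =
        getRuntimeContext()
            .getState(new ValueStateDescriptor<>("latest", Transaction.class));
  }

  // processElement(...) as shown above
}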
Two different programming models
DataStreams
● inputs and outputs: event streams
○ user implements classes for event objects
○ user supplies ser/de
● business logic: low-level code that
reacts to events and timers by
○ reading and writing state
○ creating timers
○ emitting events
Dynamic Tables
● inputs and outputs: event streams
are a history of changes to Tables
○ events insert, update, or delete Rows
○ user provides Table schemas
○ user specifies formats (e.g. CSV or JSON)
● business logic: SQL queries
○ high-level, declarative description compiled
into a dataflow graph
○ the dataflow reacts to these changes and
updates the result(s) (akin to materialized
view maintenance)
Interoperability
Customers
{
"c_id": 1,
"c_name": "Ramon Stehr"
}
{
"t_id": 1,
"t_customer_id": 1,
"t_amount": 99.08,
"time": 1657144244000
}
Transactions
In this example, the transaction stream may contain duplicates.
(dataflow graph: Transactions → Deduplicate → Join ← Customers, with the Join feeding the Sink)
INSERT INTO Sink
SELECT t_id, c_name, t_amount
FROM Customers
JOIN (SELECT DISTINCT * FROM Transactions) ON c_id = t_customer_id;
Results
+I[25, Renaldo Walsh, 280.49]
+I[27, Stuart Altenwerth, 818.16]
+I[19, Kizzie Reichert, 60.71]
+I[29, Renaldo Walsh, 335.59]
+I[31, Stuart Altenwerth, 948.26]
+I[23, Ashley Towne, 784.84]
+I[41, Louis White, 578.81]
+I[35, Ashley Towne, 585.44]
+I[43, Renaldo Walsh, 503.11]
+I[39, Kizzie Reichert, 625.32]
+I[13, Kizzie Reichert, 840.47]
...
Starting point: POJOs for Customers and Transactions
public class Customer {
// A Flink POJO must have public fields, or getters and setters
public long c_id;
public String c_name;
// A Flink POJO must have a no-args default constructor
public Customer() {}
. . .
}
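The matching Transaction POJO is not shown on this slide; by analogy with Customer, a plausible sketch (the field types are assumptions inferred from the JSON record and the queries elsewhere in this deck) might look like:

public class Transaction {
  // public fields so that Flink recognizes this class as a POJO
  public long t_id;
  public long t_customer_id;
  public java.math.BigDecimal t_amount; // cast to DECIMAL(5, 2) in a later query
  public java.time.Instant t_time;      // compared with isAfter(...) in LatestTransaction

  // a Flink POJO also needs a no-args default constructor
  public Transaction() {}
}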
Seamless interoperability between DataStreams and Tables
KafkaSource<Customer> customerSource =
KafkaSource.<Customer>builder()
.setBootstrapServers("localhost:9092")
.setTopics(CUSTOMER_TOPIC)
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(new CustomerDeserializer())
.build();
DataStream<Customer> customerStream =
env.fromSource(
customerSource, WatermarkStrategy.noWatermarks(), "Customers");
tableEnv.createTemporaryView("Customers", customerStream);
Seamless interoperability between DataStreams and Tables
// use Flink SQL to do the heavy lifting
Table resultTable =
tableEnv.sqlQuery(
String.join(
"n",
"SELECT t_id, c_name, CAST(t_amount AS DECIMAL(5, 2))",
"FROM Customers",
"JOIN (SELECT DISTINCT * FROM Transactions)”,
"ON c_id = t_customer_id"));
// switch back from Table API to DataStream
DataStream<Row> resultStream = tableEnv.toChangelogStream(resultTable);
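The deck stops at toChangelogStream; as an illustrative continuation (assuming the usual StreamExecutionEnvironment named env, as in the earlier snippet), the changelog stream could simply be printed and the job submitted:

// print the +I/-U/+U rows and run the pipeline (the job name is illustrative)
resultStream.print();
env.execute("Customers joined with deduplicated Transactions");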
(talk outline: Intro · DataStreams vs Tables · Interoperability · Interlude · Version Upgrades · Table Store · SQL Gateway)
Interlude
Flink now has a powerful and versatile SQL engine
● Batch / Streaming unification
● The new type system
● DataStream / Table interoperability
● Scala-free classpath
● Catalogs, connectors, formats, CDC
● PyFlink
● Improved semantics
● Optimizations
● Bug fixes, new features, etc.
Use cases?
● ETL (esp joins)
● Analytics
● Anything, really
○ in combination with UDFs
and/or the DataStream API
SQL Features in Flink 1.16

Streaming and Batch
● SELECT FROM WHERE
● GROUP BY [HAVING]
○ non-windowed
○ TUMBLE, HOP, SESSION windows
● Window Table-Valued Functions (illustrated in the sketch after this list)
○ TUMBLE, HOP, CUMULATE windows
● OVER window
● JOIN
○ time-windowed INNER + OUTER JOIN
○ non-windowed INNER + OUTER JOIN
● MATCH_RECOGNIZE
● Set Operations
● User-Defined Functions
○ scalar, aggregation, table-valued
● Statement Sets

Streaming only
● ORDER BY time
● INNER JOIN with a temporal table or an external lookup table

Batch only
● ORDER BY anything
● Full TPC-DS support
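Window table-valued functions deserve a quick illustration. A hedged sketch of an hourly TUMBLE aggregation over the Transactions table used throughout this deck, assuming t_time has been declared as a time attribute (that declaration is not shown in the original examples):

// window TVF: hourly sums of transaction amounts
Table hourlyTotals =
    tableEnv.sqlQuery(
        String.join(
            "\n",
            "SELECT window_start, window_end, SUM(t_amount) AS total",
            "FROM TABLE(",
            "  TUMBLE(TABLE Transactions, DESCRIPTOR(t_time), INTERVAL '1' HOUR))",
            "GROUP BY window_start, window_end"));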
Table API: Long-term initiatives (FLIPs, rolled out across releases 1.9 – 1.16)
● Blink planner: FLIP-32
● Python: FLIPs 38, 58, 78, 96, 97, 106, 112, 114, 121, 139
● Hive: FLIPs 30, 123, 152
● CDC: FLIPs 87, 95, 105
● Connectors, Formats
● DataStream/Table interop: FLIP-136
● Version upgrades: FLIP-190
● Table Store: FLIPs 188, 226, 230, 254
● SQL Gateway: FLIP-91
Version Upgrades
Stateful restarts of Flink jobs
● Flink jobs can be restarted from
checkpoints and savepoints
● This requires that each stateful
operator be able to find and load its
state
● Things may have changed, making this
difficult/impossible
○ types
○ topology
DataStream API
● You have enough low-level control to
be able to avoid or cope with
potential problems
Table/SQL API
● New Flink versions can introduce
changes to the SQL planner that
render old state un-restorable
FLIP-190: Flink Version Upgrades for Table/SQL API Programs
Goals
● The same query can always be restarted correctly after upgrading Flink
● Schema and query evolution are out of scope
Status
● Released as BETA in 1.15
Usage
● Only supports streaming
● Must be a complete pipeline, i.e., INSERT INTO sink SELECT . . .
Example: before upgrade
String streamingQueryWithInsert =
String.join(
"n",
"INSERT INTO sink",
"SELECT t_id, c_name, t_amount",
"FROM Customers",
"JOIN (SELECT DISTINCT * FROM Transactions)",
"ON c_id = t_customer_id");
tableEnv.compilePlanSql(streamingQueryWithInsert).writeToFile(planLocation);
Example: after upgrade
TableResult execution =
tableEnv.executePlan(PlanReference.fromFile(planLocation));
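FLIP-190 also exposes the same mechanism as SQL statements. As a sketch based on the 1.15 documentation (syntax worth double-checking against the docs for your Flink version; the file path is illustrative), the plan can be persisted and later executed with COMPILE PLAN / EXECUTE PLAN:

// before the upgrade: persist the plan for the INSERT statement
tableEnv.executeSql(
    "COMPILE PLAN 'file:///tmp/plan.json' FOR " + streamingQueryWithInsert);

// after the upgrade: run the previously compiled plan
tableEnv.executeSql("EXECUTE PLAN 'file:///tmp/plan.json'");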
Table Store
Typical use case / scenario
(diagram: Joins write intermediate results and Aggregations write aggregated results into the Table Store)
Tables backed by connectors vs built-in table storage
CREATE CATALOG my_catalog WITH (
'type'='table-store',
'warehouse'='file:/tmp/table_store'
);
USE CATALOG my_catalog;
-- create a word count table
CREATE TABLE word_count (
word STRING PRIMARY KEY NOT ENFORCED,
cnt BIGINT
);
-- create a word count table
CREATE TABLE word_count (
word STRING PRIMARY KEY NOT ENFORCED,
cnt BIGINT
) WITH (
'connector' = 'filesystem',
'path' = '/tmp/word_count',
'format' = 'csv'
);
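Once the catalog and table exist, the Table Store table behaves like a normal table for both writes and reads. A minimal illustrative sketch, expressed in the same Java TableEnvironment style as the earlier snippets (the word_source table and its schema are assumptions, not part of the original example):

// continuously maintain word counts in the built-in table
tableEnv.executeSql(
    "INSERT INTO word_count SELECT word, COUNT(*) FROM word_source GROUP BY word");

// later, an ad-hoc query against the same table
tableEnv.executeSql("SELECT * FROM word_count").print();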
Architecture of this built-in table storage
Advantages of the Table Store
● Easy to use
○ drop in the JAR file and start using it
○ provides “normal” tables
● Flexible
○ streaming pipelines
○ batch jobs
○ ad-hoc queries
● Low-latency
● Integrates with
○ Spark
○ Trino
○ Hive
SQL Gateway
SQL Gateway: Architecture
(diagram: a Client talks to the REST endpoint, which hands work to the Session Manager and Executor; the Executor uses the Catalog and submits queries to a Flink Cluster)
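As an illustrative sketch of how a client might drive this architecture over REST (the port and endpoint paths below follow FLIP-91 but are assumptions to verify against the SQL Gateway documentation):

// open a session on the gateway's REST endpoint (port 8083 is an assumed default)
HttpClient client = HttpClient.newHttpClient();
HttpResponse<String> session =
    client.send(
        HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/v1/sessions"))
            .POST(HttpRequest.BodyPublishers.ofString("{}"))
            .build(),
        HttpResponse.BodyHandlers.ofString());

// the response carries a session handle; statements would then be submitted to
// /v1/sessions/{sessionHandle}/statements, and results fetched from the
// corresponding operation endpoints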
Wrap-up
The ongoing efforts to add version upgrade support, built-in table storage, and a SQL gateway will expand the Table API into many new use cases.
Thanks!
David Anderson
@alpinegizmo
danderson@apache.org
These examples and more can be found in the Immerok Apache Flink Cookbook at https://docs.immerok.cloud