Fabian Hueske
@fhueske
Flink Forward Berlin
September 13, 2017
Stream Analytics with SQL on Apache Flink®
Original creators of Apache Flink®
Providers of dA Platform 2, including open source Apache Flink + dA Application Manager
The DataStream API
▪ Flink’s DataStream API is very expressive
• Application logic implemented as user-defined functions
• Windows, triggers, evictors, state, timers, async calls, …
▪ Many applications follow similar patterns
• Do not require the expressiveness of the DataStream API
• Can be specified more concisely and easily with a DSL
Q: What’s the most popular DSL for data processing?
A: SQL!
Apache Flink’s relational APIs
▪ Standard SQL & LINQ-style Table API
▪ Unified APIs for batch & streaming data
A query specifies exactly the same result regardless of whether its input is static batch data or streaming data.
▪ Common translation layers
• Optimization based on Apache Calcite
• Type system & code generation
• Table sources & sinks
Show me some code!
// Table API (Scala)
tableEnvironment
  .scan("clicks")
  .filter('url.like("https://www.xyz.com%"))
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

-- SQL
SELECT user, COUNT(url) AS cnt
FROM clicks
WHERE url LIKE 'https://www.xyz.com%'
GROUP BY user

“clicks” can be a
- file,
- database table,
- stream, …
What if “clicks” is a file?
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

user  cnt
Mary  2
Bob   1
Liz   1

Q: What if we get more click data?
A: We run the query again.
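The batch case can be tried out with any SQL engine. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in (it is not Flink); the table layout and the helper name count_clicks are assumptions made for illustration, and the URL strings are the elided placeholders from the slide.

```python
import sqlite3

# Click rows from the slide; the URLs are elided there, so the same
# placeholder string is used here.
CLICKS = [
    ("Mary", "12:00:00", "https://…"),
    ("Bob", "12:00:00", "https://…"),
    ("Mary", "12:00:02", "https://…"),
    ("Liz", "12:00:03", "https://…"),
]


def count_clicks(rows):
    """Load a static snapshot of clicks and run the GROUP BY query once."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE clicks ("user" TEXT, cTime TEXT, url TEXT)')
    conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", rows)
    result = conn.execute(
        'SELECT "user", COUNT(url) AS cnt FROM clicks GROUP BY "user"'
    ).fetchall()
    conn.close()
    return dict(result)


# "Running the query again" means recomputing over the grown snapshot.
first_run = count_clicks(CLICKS)
second_run = count_clicks(CLICKS + [("Bob", "12:00:05", "https://…")])
```

Each run scans the whole input and terminates, which is exactly the behavior that does not carry over to unbounded streams.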
What if “clicks” is a stream?
▪ We want the same results as for batch input!
▪ Can we query a stream with SQL as well?
SQL was not designed for streams
▪ Relations are bounded (multi-)sets. ↔ Streams are infinite sequences.
▪ DBMS can access all data. ↔ Streaming data arrives over time.
▪ SQL queries return a result and complete. ↔ Streaming queries continuously emit results and never complete.
DBMSs run queries on streams
▪ Materialized views (MV) are similar to regular views, but persisted to disk or memory
• Used to speed up analytical queries
• MVs need to be updated when the base tables change
▪ MV maintenance is very similar to SQL on streams
• Base table updates are a stream of DML statements
• MV definition query is evaluated on that stream
• MV is the query result and is continuously updated
Continuous Queries in Flink
▪ Core concept is a “Dynamic Table”
• Dynamic tables change over time
▪ Queries on dynamic tables
• produce new dynamic tables (which are updated based on input)
• do not terminate
▪ Stream ↔ Dynamic table conversions
Stream → Dynamic Table
▪ Append mode
• Stream records are appended to table
• Table grows as more data arrives

Stream records:
Mary, 12:00:00, ./home
Bob, 12:00:00, ./cart
Mary, 12:00:05, ./prod?id=1
Liz, 12:01:00, ./home
Bob, 12:01:30, ./prod?id=3
Mary, 12:01:45, ./prod?id=7

Resulting table:
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:00:05  ./prod?id=1
Liz   12:01:00  ./home
Bob   12:01:30  ./prod?id=3
Mary  12:01:45  ./prod?id=7
…     …
Stream → Dynamic Table
▪ Upsert mode
• Stream records have (composite) key attributes
• Records are inserted, or they update existing records with the same key

Stream records:
Mary, 2017-03-01
Bob, 2017-03-15
Mary, 2017-04-01
Liz, 2017-05-01
Bob, 2017-06-01
Mary, 2017-07-01

Resulting table:
user  lastLogin
Mary  2017-07-01
Bob   2017-06-01
Liz   2017-05-01
…
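The difference between the two conversion modes can be sketched in a few lines of plain Python. This is only an illustration of the semantics, not Flink code; the function names append_mode and upsert_mode are made up for this sketch, and the first field of each record is assumed to be the key.

```python
def append_mode(records):
    """Every stream record becomes a new row; the table only grows."""
    table = []
    for record in records:
        table.append(record)
    return table


def upsert_mode(records):
    """The first field is treated as the key: a record either inserts a
    new row or replaces the existing row with the same key."""
    table = {}
    for user, last_login in records:
        table[user] = last_login
    return table


# Login records from the upsert-mode slide, in arrival order.
LOGINS = [
    ("Mary", "2017-03-01"),
    ("Bob", "2017-03-15"),
    ("Mary", "2017-04-01"),
    ("Liz", "2017-05-01"),
    ("Bob", "2017-06-01"),
    ("Mary", "2017-07-01"),
]

appended = append_mode(LOGINS)
upserted = upsert_mode(LOGINS)
```

In append mode the table ends up with one row per record, while in upsert mode it holds one row per key with the most recent value.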
Querying a Dynamic Table
clicks
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:00:05  ./prod?id=1
Liz   12:01:00  ./home
Liz   12:01:30  ./prod?id=3
Mary  12:01:45  ./prod?id=7

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

result
user  cnt
u1    1
u2    1
u3    1
u1    2
u3    2
u1    3
Rows of result table are updated.
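How a non-windowed aggregate keeps updating its result rows can be sketched as follows. This is a plain-Python simulation under simplifying assumptions (single process, in-memory state), not Flink's implementation; continuous_count is a hypothetical helper name.

```python
from collections import Counter

# Click rows from the slide, in arrival order.
CLICKS = [
    ("Mary", "12:00:00", "./home"),
    ("Bob", "12:00:00", "./cart"),
    ("Mary", "12:00:05", "./prod?id=1"),
    ("Liz", "12:01:00", "./home"),
    ("Liz", "12:01:30", "./prod?id=3"),
    ("Mary", "12:01:45", "./prod?id=7"),
]


def continuous_count(clicks):
    """Yield (user, cnt) each time an arriving click changes a result row."""
    counts = Counter()
    for user, _ctime, _url in clicks:
        counts[user] += 1
        yield user, counts[user]


updates = list(continuous_count(CLICKS))
# Keeping only the latest value per user gives the current result table.
result_table = dict(updates)
```

Each click touches exactly one row of the result: the first click of a user inserts a row, every later click updates it.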
What about windows?
// Table API (Scala)
tableEnvironment
  .scan("clicks")
  .window(Tumble over 1.hour on 'cTime as 'w)
  .groupBy('w, 'user)
  .select('user, 'w.end as 'endT, 'url.count as 'cnt)

-- SQL
SELECT user,
       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
       COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR),
         user
Computing window aggregates
clicks
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:02:00  ./prod?id=2
Mary  12:55:00  ./home
Bob   13:01:00  ./prod?id=4
Liz   13:30:00  ./cart
Liz   13:59:00  ./home
Mary  14:00:00  ./prod?id=1
Liz   14:02:00  ./prod?id=8
Bob   14:30:00  ./prod?id=7
Bob   14:40:00  ./home

SELECT
  user,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
  COUNT(url) AS cnt
FROM clicks
GROUP BY
  user,
  TUMBLE(cTime, INTERVAL '1' HOUR)

result
user  endT      cnt
u1    13:00:00  3
u2    13:00:00  1
u2    14:00:00  1
u3    14:00:00  2
u1    15:00:00  1
u2    15:00:00  2
u3    15:00:00  1
Rows are appended to result table.
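The window assignment behind this result can be sketched in plain Python. The sketch assumes one-hour tumbling windows of the form [start, start + 1 hour) and timestamps given as HH:MM:SS strings within a single day; the helper names are invented for illustration and this is not how Flink evaluates the query internally.

```python
from collections import Counter

# Click rows from the slide.
CLICKS = [
    ("Mary", "12:00:00", "./home"),
    ("Bob", "12:00:00", "./cart"),
    ("Mary", "12:02:00", "./prod?id=2"),
    ("Mary", "12:55:00", "./home"),
    ("Bob", "13:01:00", "./prod?id=4"),
    ("Liz", "13:30:00", "./cart"),
    ("Liz", "13:59:00", "./home"),
    ("Mary", "14:00:00", "./prod?id=1"),
    ("Liz", "14:02:00", "./prod?id=8"),
    ("Bob", "14:30:00", "./prod?id=7"),
    ("Bob", "14:40:00", "./home"),
]


def to_seconds(hms):
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds):
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def tumble_end(ctime, size=3600):
    """End of the tumbling window [start, start + size) that contains ctime."""
    ts = to_seconds(ctime)
    return to_hms((ts // size + 1) * size)


def window_counts(clicks):
    """Count clicks per (user, window end)."""
    counts = Counter()
    for user, ctime, _url in clicks:
        counts[(user, tumble_end(ctime))] += 1
    return dict(counts)


result = window_counts(CLICKS)
```

With u1, u2, u3 read as Mary, Bob, and Liz, the computed counts match the result table on the slide.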
Why are results always appended?
▪ cTime attribute is an event-time attribute
• Guarded by watermarks
• Internally represented as special type
• User-facing as TIMESTAMP
▪ Special plans for queries that operate on event-time attributes

SELECT user,
       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
       COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR),
         user
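Why watermarks make the result append-only can be illustrated with a small simulation. This is a sketch under stated assumptions, not Flink's operator: timestamps are integer seconds, a watermark w is taken to mean that all windows ending at or before w are complete, and late records are simply dropped. The class name TumblingWindowCounter is hypothetical.

```python
from collections import defaultdict

WINDOW = 3600  # window size in seconds (one hour)


class TumblingWindowCounter:
    """Counts events per (window, user) and emits a window only once the
    watermark has passed its end, so emitted rows are never revised."""

    def __init__(self):
        self.counts = defaultdict(int)  # (window_end, user) -> count
        self.watermark = -1

    def on_event(self, user, ts):
        window_end = (ts // WINDOW + 1) * WINDOW
        if window_end <= self.watermark:
            return  # late record: its window was already emitted (dropped here)
        self.counts[(window_end, user)] += 1

    def on_watermark(self, wm):
        """Advance the watermark and return final rows (user, window_end, cnt)."""
        self.watermark = max(self.watermark, wm)
        ready = sorted(key for key in self.counts if key[0] <= self.watermark)
        return [(user, end, self.counts.pop((end, user))) for end, user in ready]


counter = TumblingWindowCounter()
counter.on_event("Mary", 12 * 3600)
counter.on_event("Bob", 12 * 3600)
counter.on_event("Mary", 12 * 3600 + 120)
too_early = counter.on_watermark(12 * 3600 + 1800)  # window still open
counter.on_event("Mary", 12 * 3600 + 55 * 60)
emitted = counter.on_watermark(13 * 3600)  # the 12:00-13:00 window closes
```

Because a row is emitted only after its window is complete, the downstream table never needs an update or delete, only appends.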
Dynamic Table → Stream
▪ Converting a dynamic table into a stream
• Dynamic tables might update or delete existing rows
• Updates must be encoded in outgoing stream
▪ Conversion of tables to streams inspired by DBMS logs
• DBMSs use logs to restore databases (and tables)
• REDO logs store new records to redo changes
• UNDO logs store old records to undo changes
Dynamic Table → Stream: REDO/UNDO
clicks
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./home
Bob   ./prod?id=3

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Stream, in emission order (+ INSERT / - DELETE):
+ Mary,1   + Bob,1   - Mary,1   + Mary,2   + Liz,1   - Bob,1   + Bob,2
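The REDO/UNDO encoding can be sketched as a generator that emits a delete for the old row followed by an insert for the new row whenever a count changes. This is an illustrative simulation in plain Python; the function name to_retract_stream and the tuple layout are assumptions of this sketch.

```python
# Click rows from the slide, in arrival order.
CLICKS = [
    ("Mary", "./home"),
    ("Bob", "./cart"),
    ("Mary", "./prod?id=1"),
    ("Liz", "./home"),
    ("Bob", "./prod?id=3"),
]


def to_retract_stream(clicks):
    """Encode updates of the count table as '-' (old row) and '+' (new row)."""
    counts = {}
    for user, _url in clicks:
        old = counts.get(user)
        if old is not None:
            yield ("-", user, old)  # UNDO: remove the previous result row
        counts[user] = (old or 0) + 1
        yield ("+", user, counts[user])  # REDO: add the new result row


messages = list(to_retract_stream(CLICKS))
```

An update costs two messages in this encoding, but the consumer needs no notion of a key: it only adds and removes complete rows.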
Dynamic Table → Stream: REDO
clicks
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./home
Liz   ./prod?id=3
Mary  ./prod?id=7

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Stream, in emission order (* UPSERT by KEY / - DELETE by KEY):
* Mary,1   * Bob,1   * Mary,2   * Liz,1   * Liz,2   * Mary,3
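The REDO-only encoding relies on a key. The sketch below, again a plain-Python simulation with invented names (to_upsert_stream, apply_upserts), shows both sides: the query emitting one upsert message per change, and a consumer rebuilding the result table from those messages.

```python
# Click rows from the slide, in arrival order.
CLICKS = [
    ("Mary", "./home"),
    ("Bob", "./cart"),
    ("Mary", "./prod?id=1"),
    ("Liz", "./home"),
    ("Liz", "./prod?id=3"),
    ("Mary", "./prod?id=7"),
]


def to_upsert_stream(clicks):
    """Emit one '*' (upsert by key) message per changed result row."""
    counts = {}
    for user, _url in clicks:
        counts[user] = counts.get(user, 0) + 1
        yield ("*", user, counts[user])


def apply_upserts(messages):
    """Consumer side: maintain a keyed table from upsert/delete messages."""
    table = {}
    for kind, key, value in messages:
        if kind == "*":
            table[key] = value  # insert or overwrite by key
        elif kind == "-":
            table.pop(key, None)  # delete by key
    return table


messages = list(to_upsert_stream(CLICKS))
table = apply_upserts(messages)
```

Compared with the REDO/UNDO encoding, an update needs a single message, at the price of requiring a unique key on the result.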
Can we run any query on a dynamic table?
▪ No, there are space and computation constraints ☹
▪ State size may not grow infinitely as more data arrives
  SELECT sessionId, COUNT(url) FROM clicks GROUP BY sessionId;
▪ A change of an input table may only trigger a partial re-computation of the result table
  SELECT user, RANK() OVER (ORDER BY lastLogin) FROM users;
Bounding the size of query state
▪ Adapt the semantics of the query
• Aggregate data of last 24 hours. Discard older data.
  SELECT sessionId, COUNT(url) AS cnt
  FROM clicks
  WHERE last(cTime, INTERVAL '1' DAY)
  GROUP BY sessionId
▪ Trade the accuracy of the result for size of state
• Remove state for keys that became inactive.
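The second option, dropping state for inactive keys, can be sketched as follows. This is a simplified plain-Python model and not Flink's state cleanup mechanism: timestamps are assumed to arrive in non-decreasing order, and the class name CountWithIdleRetention and its retention parameter are hypothetical.

```python
class CountWithIdleRetention:
    """Per-key counter that forgets keys not seen for longer than `retention`."""

    def __init__(self, retention):
        self.retention = retention
        self.state = {}  # key -> (count, last_seen_timestamp)

    def on_event(self, key, ts):
        self._expire(ts)
        count, _last_seen = self.state.get(key, (0, ts))
        self.state[key] = (count + 1, ts)
        return count + 1

    def _expire(self, now):
        stale = [
            key
            for key, (_count, last_seen) in self.state.items()
            if now - last_seen > self.retention
        ]
        for key in stale:
            del self.state[key]


counter = CountWithIdleRetention(retention=10)
results = [
    counter.on_event("s1", 0),
    counter.on_event("s1", 5),
    counter.on_event("s2", 6),
    counter.on_event("s1", 20),  # s1 and s2 were idle for more than 10
]
```

The last call shows the accuracy trade-off: the state stays bounded, but a key that returns after its state was removed starts counting from scratch.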
Current state of SQL & Table API
▪ Flink’s relational APIs are rapidly evolving
• Lots of interest from the community and many contributors
• Used in production at large scale by Alibaba and others
▪ Features released in Flink 1.3
• GroupBy & Over windowed aggregates
• Non-windowed aggregates (with update changes)
• User-defined aggregation functions
▪ Features coming with Flink 1.4
• Windowed joins
• Reworked connector APIs
What can be built with this?
▪ Continuous ETL
• Continuously ingest data
• Process with transformations & window aggregates
• Write to files (Parquet, ORC), Kafka, PostgreSQL, HBase, …
What can be built with this?
▪ Dashboards, reporting & event-driven architectures
• Flink updates query results with low latency
• Result can be written to KV store, DBMS, compacted Kafka topic
• Maintain result table as queryable state
Wrap-up!
▪ Table API & SQL support many streaming use cases
• High-level / declarative specification
• Automatic optimization and translation
• Efficient execution
• Scalar, table, aggregation UDFs for flexibility
▪ Updating results enable many exciting applications
▪ Check it out!
Thank you!
@fhueske
@ApacheFlink
@dataArtisans
Available on O’Reilly Early Release!
We are hiring!
data-artisans.com/careers
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing with Apache Flink's Relational APIs
Tables are materialized streams
▪ A table is the materialization of a stream of modifications
• SQL DML statements: INSERT, UPDATE, and DELETE
• DBMSs process statements by modifying tables

INSERT (u1, Mary, "2017-03-01")
INSERT (u2, Peter, "2017-05-01")
DELETE WHERE (user = u2)
UPDATE (lastLogin = "2017-06-01") WHERE (user = u1)

user  name   lastLogin
u1    Mary   2017-03-01 (updated to 2017-06-01)
u2    Peter  2017-05-01 (deleted)
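Materializing a table from a stream of DML statements can be sketched in a few lines. This is a simplified plain-Python illustration, not DBMS code: each statement is modeled as an (operation, key, values) tuple, the user column is assumed to be the key, and apply_dml is a name invented for the sketch.

```python
def apply_dml(statements):
    """Replay a stream of DML statements into a keyed in-memory table."""
    table = {}  # user -> {"name": ..., "lastLogin": ...}
    for op, key, values in statements:
        if op == "INSERT":
            table[key] = dict(values)
        elif op == "UPDATE":
            table[key].update(values)
        elif op == "DELETE":
            del table[key]
    return table


# The four statements from the slide, in order.
STATEMENTS = [
    ("INSERT", "u1", {"name": "Mary", "lastLogin": "2017-03-01"}),
    ("INSERT", "u2", {"name": "Peter", "lastLogin": "2017-05-01"}),
    ("DELETE", "u2", None),
    ("UPDATE", "u1", {"lastLogin": "2017-06-01"}),
]

final_table = apply_dml(STATEMENTS)
```

Replaying only a prefix of the statements gives the table as it looked at that point in time, which is the sense in which the table is a materialization of the stream.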

Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing with Apache Flink's Relational APIs

  • 1. 1 Fabian Hueske @fhueske Flink Forward Berlin September, 13th 2017 Stream Analytics with SQL on Apache Flink®
  • 2. 2 Original creators of Apache Flink® Providers of dA Platform 2, including open source Apache Flink + dA Application Manager
  • 3. The DataStream API  Flink’s DataStream API is very expressive • Application logic implemented as user-defined functions • Windows, triggers, evictors, state, timers, async calls, …  Many applications follow similar patterns • Do not require the expressiveness of the DataStream API • Can be specified more concisely and easily with a DSL Q: What’s the most popular DSL for data processing? A: SQL! 3
  • 4. Apache Flink’s relational APIs  Standard SQL & LINQ-style Table API  Unified APIs for batch & streaming data A query specifies exactly the same result regardless whether its input is static batch data or streaming data.  Common translation layers • Optimization based on Apache Calcite • Type system & code-generation • Table sources & sinks 4
  • 5. Show me some code! tableEnvironment .scan("clicks") .filter('url.like("https://www.xyz.com%") .groupBy('user) .select('user, 'url.count as 'cnt) SELECT user, COUNT(url) AS cnt FROM clicks WHERE url LIKE 'https://www.xyz.com%' GROUP BY user 5 “clicks” can be a - file - database table, - stream, …
  • 6. What if “clicks” is a file? 6 user cTime url Mary 12:00:00 https://… Bob 12:00:00 https://… Mary 12:00:02 https://… Liz 12:00:03 https://… user cnt Mary 2 Bob 1 Liz 1 Q: What if we get more click data? A: We run the query again. SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user
  • 7. What if “clicks” is a stream?
      • We want the same results as for batch input!
      • Can we query a stream with SQL as well?
  • 8. SQL was not designed for streams
      • Relations are bounded (multi-)sets. ↔ Streams are infinite sequences.
      • DBMS can access all data. ↔ Streaming data arrives over time.
      • SQL queries return a result and complete. ↔ Streaming queries continuously emit results and never complete.
  • 9. DBMSs run queries on streams
      • Materialized views (MV) are similar to regular views, but persisted to disk or memory
          • Used to speed up analytical queries
          • MVs need to be updated when the base tables change
      • MV maintenance is very similar to SQL on streams
          • Base table updates are a stream of DML statements
          • MV definition query is evaluated on that stream
          • MV is the query result and is continuously updated
  • 10. Continuous Queries in Flink
      • Core concept is a “Dynamic Table”
          • Dynamic tables are changing over time
      • Queries on dynamic tables
          • produce new dynamic tables (which are updated based on input)
          • do not terminate
      • Stream ↔ Dynamic table conversions
  • 11. Stream → Dynamic Table
      • Append mode
          • Stream records are appended to table
          • Table grows as more data arrives
      Stream records, in arrival order:
          Mary, 12:00:00, ./home
          Bob, 12:00:00, ./cart
          Mary, 12:00:05, ./prod?id=1
          Liz, 12:01:00, ./home
          Bob, 12:01:30, ./prod?id=3
          Mary, 12:01:45, ./prod?id=7
      Resulting table:
          user  cTime     url
          Mary  12:00:00  ./home
          Bob   12:00:00  ./cart
          Mary  12:00:05  ./prod?id=1
          Liz   12:01:00  ./home
          Bob   12:01:30  ./prod?id=3
          Mary  12:01:45  ./prod?id=7
          …     …         …
  • 12. Stream → Dynamic Table
      • Upsert mode
          • Stream records have (composite) key attributes
          • Records are inserted, or update existing records with the same key
      Stream records, in arrival order:
          Mary, 2017-03-01
          Bob, 2017-03-15
          Mary, 2017-04-01
          Liz, 2017-05-01
          Bob, 2017-06-01
          Mary, 2017-07-01
      Resulting table:
          user  lastLogin
          Mary  2017-07-01
          Bob   2017-06-01
          Liz   2017-05-01
          …     …
  • 13. Querying a Dynamic Table
      clicks:
          user  cTime     url
          Mary  12:00:00  ./home
          Bob   12:00:00  ./cart
          Mary  12:00:05  ./prod?id=1
          Liz   12:01:00  ./home
          Liz   12:01:30  ./prod?id=3
          Mary  12:01:45  ./prod?id=7
      Query:
          SELECT user, COUNT(url) AS cnt
          FROM clicks
          GROUP BY user
      result (user, cnt), successive row versions: u1 1, u2 1, u3 1, u1 2, u3 2, u1 3
      Rows of result table are updated.
  • 14. What about windows?
      Table API:
          tableEnvironment
            .scan("clicks")
            .window(Tumble over 1.hour on 'cTime as 'w)
            .groupBy('w, 'user)
            .select('user, 'w.end as 'endT, 'url.count as 'cnt)
      SQL:
          SELECT user,
                 TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
                 COUNT(url) AS cnt
          FROM clicks
          GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR), user
  • 15. Computing window aggregates
      clicks:
          user  cTime     url
          Mary  12:00:00  ./home
          Bob   12:00:00  ./cart
          Mary  12:02:00  ./prod?id=2
          Mary  12:55:00  ./home
          Bob   13:01:00  ./prod?id=4
          Liz   13:30:00  ./cart
          Liz   13:59:00  ./home
          Mary  14:00:00  ./prod?id=1
          Liz   14:02:00  ./prod?id=8
          Bob   14:30:00  ./prod?id=7
          Bob   14:40:00  ./home
      Query:
          SELECT user,
                 TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
                 COUNT(url) AS cnt
          FROM clicks
          GROUP BY user, TUMBLE(cTime, INTERVAL '1' HOUR)
      result:
          user  endT      cnt
          u1    13:00:00  3
          u2    13:00:00  1
          u2    14:00:00  1
          u3    14:00:00  2
          u1    15:00:00  1
          u2    15:00:00  2
          u3    15:00:00  1
      Rows are appended to result table.
  • 16. Why are results always appended?
      • cTime attribute is an event-time attribute
          • Guarded by watermarks
          • Internally represented as special type
          • User-facing as TIMESTAMP
      • Special plans for queries that operate on event-time attributes
          SELECT user,
                 TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
                 COUNT(url) AS cnt
          FROM clicks
          GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR), user
  • 17. Dynamic Table → Stream
      • Converting a dynamic table into a stream
          • Dynamic tables might update or delete existing rows
          • Updates must be encoded in outgoing stream
      • Conversion of tables to streams inspired by DBMS logs
          • DBMS use logs to restore databases (and tables)
          • REDO logs store new records to redo changes
          • UNDO logs store old records to undo changes
  • 18. Dynamic Table → Stream: REDO/UNDO
      clicks:
          user  url
          Mary  ./home
          Bob   ./cart
          Mary  ./prod?id=1
          Liz   ./home
          Bob   ./prod?id=3
      Query:
          SELECT user, COUNT(url) AS cnt
          FROM clicks
          GROUP BY user
      Emitted stream, in order (+ INSERT / - DELETE):
          + Mary,1
          + Bob,1
          - Mary,1
          + Mary,2
          + Liz,1
          - Bob,1
          + Bob,2
  • 19. Dynamic Table → Stream: REDO
      clicks:
          user  url
          Mary  ./home
          Bob   ./cart
          Mary  ./prod?id=1
          Liz   ./home
          Liz   ./prod?id=3
          Mary  ./prod?id=7
      Query:
          SELECT user, COUNT(url) AS cnt
          FROM clicks
          GROUP BY user
      Emitted stream, in order (* UPSERT by KEY / - DELETE by KEY):
          * Mary,1
          * Bob,1
          * Mary,2
          * Liz,1
          * Liz,2
          * Mary,3
  • 20. Can we run any query on a dynamic table?
      • No, there are space and computation constraints
      • State size may not grow infinitely as more data arrives
          SELECT sessionId, COUNT(url) FROM clicks GROUP BY sessionId;
      • A change of an input table may only trigger a partial re-computation of the result table
          SELECT user, RANK() OVER (ORDER BY lastLogin) FROM users;
  • 21. Bounding the size of query state
      • Adapt the semantics of the query
          • Aggregate data of last 24 hours. Discard older data.
          SELECT sessionId, COUNT(url) AS cnt
          FROM clicks
          WHERE last(cTime, INTERVAL '1' DAY)
          GROUP BY sessionId
      • Trade the accuracy of the result for size of state
          • Remove state for keys that became inactive.
  • 22. Current state of SQL & Table API
      • Flink’s relational APIs are rapidly evolving
          • Lots of interest from the community and many contributors
          • Used in production at large scale by Alibaba and others
      • Features released in Flink 1.3
          • GroupBy & Over windowed aggregates
          • Non-windowed aggregates (with update changes)
          • User-defined aggregation functions
      • Features coming with Flink 1.4
          • Windowed Joins
          • Reworked connector APIs
  • 23. What can be built with this?
      • Continuous ETL
          • Continuously ingest data
          • Process with transformations & window aggregates
          • Write to files (Parquet, ORC), Kafka, PostgreSQL, HBase, …
  • 24. What can be built with this?
      • Dashboards, reporting & event-driven architectures
          • Flink updates query results with low latency
          • Result can be written to KV store, DBMS, compacted Kafka topic
          • Maintain result table as queryable state
  • 25. Wrap-up!
      • Table API & SQL support many streaming use cases
          • High-level / declarative specification
          • Automatic optimization and translation
          • Efficient execution
          • Scalar, table, aggregation UDFs for flexibility
      • Updating results enable many exciting applications
      • Check it out!
  • 29. Tables are materialized streams
      • A table is the materialization of a stream of modifications
          • SQL DML statements: INSERT, UPDATE, and DELETE
          • DBMSs process statements by modifying tables
      Stream of DML statements:
          INSERT (u1, Mary, "2017-03-01")
          INSERT (u2, Peter, "2017-05-01")
          DELETE WHERE (user = u2)
          UPDATE (lastLogin = "2017-06-01") WHERE (user = u1)
      Table:
          user  name   lastLogin
          u2    Peter  2017-05-01  (removed by the DELETE)
          u1    Mary   2017-03-01  (replaced by the UPDATE)
          u1    Mary   2017-06-01
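
The changelog encodings on slides 18 and 19, and the idea from slide 29 that a table is a materialized stream of modifications, can be sketched in a few lines of plain Python. This is only an illustrative simulation, not Flink code: the function names (`batch_count`, `retract_stream`, `upsert_stream`, `materialize`) are hypothetical, the input is the click data from slide 19, and the result table is assumed to be keyed by `user`. It shows that applying either change stream reproduces the result of the batch query.

```python
from collections import Counter

# Click events (user, url), taken from the slide 19 example.
CLICKS = [
    ("Mary", "./home"),
    ("Bob", "./cart"),
    ("Mary", "./prod?id=1"),
    ("Liz", "./home"),
    ("Liz", "./prod?id=3"),
    ("Mary", "./prod?id=7"),
]


def batch_count(clicks):
    """Batch view of: SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user."""
    return dict(Counter(user for user, _ in clicks))


def retract_stream(clicks):
    """Continuous view encoded as REDO/UNDO: '-' deletes the old row, '+' inserts the new one."""
    counts = {}
    for user, _ in clicks:
        old = counts.get(user)
        if old is not None:
            yield ("-", user, old)       # undo the previous result row
        counts[user] = (old or 0) + 1
        yield ("+", user, counts[user])  # redo with the updated result row


def upsert_stream(clicks):
    """Continuous view encoded as REDO only: '*' upserts the row for its key."""
    counts = {}
    for user, _ in clicks:
        counts[user] = counts.get(user, 0) + 1
        yield ("*", user, counts[user])


def materialize(changes):
    """Rebuild the result table (assumed keyed by user) from a stream of changes."""
    table = {}
    for op, user, cnt in changes:
        if op == "-":
            # Only remove the row if it matches the retracted value.
            if table.get(user) == cnt:
                del table[user]
        else:  # '+' insert or '*' upsert
            table[user] = cnt
    return table


if __name__ == "__main__":
    for change in retract_stream(CLICKS):
        print(*change)
    print(materialize(upsert_stream(CLICKS)))
```

The upsert encoding emits one message per input record, while the retract encoding emits two for every update but does not require the consumer to know a key.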

Editor's Notes

  • #3: A little bit about myself: I am a committer for Apache Flink and a software engineer at data Artisans, the original creators of Apache Flink and the providers of the dA Platform.