Marton Balassi – data Artisans
Gyula Fora – SICS
Flink committers
mbalassi@apache.org / gyfora@apache.org
Real-time Stream Processing with Apache Flink
Stream Processing
2
▪ Data stream: an infinite sequence of data arriving in a continuous fashion.
▪ Stream processing: analyzing and acting on real-time streaming data, using continuous queries.
Streaming landscape
3
Apache Storm
• True streaming, low latency but lower throughput
• Low-level API (Bolts, Spouts) + Trident
Spark Streaming
• Stream processing on top of a batch system, high throughput but higher latency
• Functional API (DStreams), restricted by the batch runtime
Apache Samza
• True streaming built on top of Apache Kafka, state is a first-class citizen
• Slightly different stream notion, low-level API
Apache Flink
• True streaming with an adjustable latency-throughput trade-off
• Rich functional API exploiting the streaming runtime, e.g. rich windowing semantics
Apache Storm
4
▪ True streaming, low latency but lower throughput
▪ Low-level API (Bolts, Spouts) + Trident
▪ At-least-once processing guarantees
Issues
▪ Costly fault tolerance
▪ Serialization
▪ Low-level API
Spark Streaming
5
▪ Stream processing emulated on a batch system
▪ High throughput but higher latency
▪ Functional API (DStreams)
▪ Exactly-once processing guarantees
Issues
▪ Restricted streaming semantics
▪ Windowing
▪ High latency
Apache Samza
6
▪ True streaming built on top of Apache Kafka
▪ Slightly different stream notion, low-level API
▪ At-least-once processing guarantees with state
Issues
▪ High disk IO
▪ Low-level API
Apache Flink
7
▪ True streaming with adjustable latency and throughput
▪ Rich functional API exploiting the streaming runtime
▪ Flexible windowing semantics
▪ Exactly-once processing guarantees with (small) state
Issues
▪ Limited state size
▪ No job manager high availability yet
Apache Flink
8
What is Flink
9
A "use-case complete" framework to unify
batch and stream processing
Event logs
Historic data
ETL
Relational
Graph analysis
Machine learning
Streaming analysis
What is Flink
An engine that puts equal emphasis on streaming and batch
10
[Diagram: Flink ingesting historic data (HDFS, JDBC, ...) and real-time event streams (Kafka, RabbitMQ, ...); batch side: ETL, graphs, machine learning, relational, ...; streaming side: low-latency windowing, aggregations, ...]
Flink stack
11
[Diagram: the Flink stack* — libraries: Python API, Gelly, Table, FlinkML, SAMOA; APIs: DataSet (Java/Scala) with the Batch Optimizer, DataStream (Java/Scala) with the Streaming Optimizer, Hadoop M/R and Dataflow compatibility; Flink Runtime; deployment: Local, Remote, YARN, Tez, Embedded]
*current Flink master + a few PRs
Flink Streaming
12
Overview of the API
▪ Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality
▪ Stream transformations
• Basic transformations: Map, Reduce, Filter, Aggregations…
• Binary stream transformations: CoMap, CoReduce…
• Windowing semantics: policy-based flexible windowing (Time, Count, Delta…)
• Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
▪ Data stream outputs
▪ For the details please refer to the programming guide:
• http://flink.apache.org/docs/latest/streaming_guide.html
13
[Diagram: example topology — two sources (Src) feeding Map, Filter, Reduce, Merge, and Sum operators into a Sink]
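To make the API overview concrete, here is a minimal end-to-end sketch: one source, a few transformations, one sink. It assumes the 2015-era Scala DataStream API used on the following slides (groupBy was the keying operation that later became keyBy); the word-count program itself is hypothetical, not taken from the talk.

import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 9999)   // source: lines of text from a socket
      .flatMap(_.toLowerCase.split("\\W+"))   // transformation: tokenize
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)                             // key by word (pre-keyBy API)
      .sum(1)                                 // rolling count per word
      .print()                                // sink: print results to stdout
    env.execute("Streaming WordCount")
  }
}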
Use-case: Financial analytics
14
▪ Reading from multiple inputs
• Merge stock data from various sources
▪ Window aggregations
• Compute simple statistics over windows of data
▪ Data-driven windows
• Define arbitrary windowing semantics
▪ Combine with sentiment analysis
• Enrich your analytics with social media feeds (Twitter)
▪ Streaming joins
• Join multiple data streams
▪ Detailed explanation and source code on our blog
• http://flink.apache.org/news/2015/02/09/streaming-example.html
Reading from multiple inputs
15
case class StockPrice(symbol: String, price: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

val socketStockStream = env.socketTextStream("localhost", 9999)
  .map(x => { val split = x.split(",")
    StockPrice(split(0), split(1).toDouble) })

val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)
val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)
Example: the socket lines "HDP, 23.8" and "HDP, 26.6" are parsed into StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6), then merged with the generated StockPrice(SPX, 2113.9) and StockPrice(FTSE, 6931.7) events into a single stockStream.
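generateStock is not defined on the slide; its curried call shape, generateStock("SPX")(10) _, suggests a source function whose last parameter list takes the output collector. A plausible sketch (the random-walk body is an assumption):

import org.apache.flink.util.Collector
import scala.util.Random

// Hypothetical source: emits a random walk of prices for one symbol.
// Partially applying symbol and sigma leaves a Collector[StockPrice] => Unit,
// matching the addSource(...) calls above.
def generateStock(symbol: String)(sigma: Int)(out: Collector[StockPrice]): Unit = {
  var price = 1000.0
  while (true) {
    price += Random.nextGaussian * sigma   // sigma controls the volatility
    out.collect(StockPrice(symbol, price))
    Thread.sleep(Random.nextInt(200))      // throttle the emission rate
  }
}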
Window aggregations
16
val windowedStream = stockStream
  .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

val lowest = windowedStream.minBy("price")
val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
Example: for a window holding StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6), rollingMean emits the per-symbol means, e.g. StockPrice(HDP, 25.2).
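The mean helper handed to mapWindow is not shown either; assuming the Collector-based mapWindow signature of this API generation, it could look like:

import org.apache.flink.util.Collector

// Hypothetical helper: average price of one (grouped) window. Because the
// stream is grouped by symbol, every element in ts carries the same symbol.
def mean(ts: Iterable[StockPrice], out: Collector[StockPrice]): Unit =
  if (ts.nonEmpty)
    out.collect(StockPrice(ts.head.symbol, ts.map(_.price).sum / ts.size))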
Data-driven windows
17
case class Count(symbol: String, count: Int)

val priceWarnings = stockStream.groupBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

val warningsPerStock = priceWarnings.map(Count(_, 1))
  .groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
Example: StockPrice(HDP, 23.8) followed by StockPrice(HDP, 26.6) is a price change of more than 5%, so the delta window fires a warning and the 30-second count emits Count(HDP, 1).
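Delta.of takes a threshold, a delta function, and an initial reference element; priceChange, defaultPrice, and sendWarning are not shown on the slide, so the definitions below are assumptions that fit those roles:

import org.apache.flink.util.Collector

// Hypothetical delta function: relative change between two prices.
def priceChange(p1: StockPrice, p2: StockPrice): Double =
  math.abs(p1.price / p2.price - 1)

// Hypothetical initial reference element for the delta comparison.
def defaultPrice: StockPrice = StockPrice("", 1000.0)

// Hypothetical window mapper: emit the symbol that triggered the warning.
def sendWarning(ts: Iterable[StockPrice], out: Collector[String]): Unit =
  if (ts.nonEmpty) out.collect(ts.head.symbol)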
Combining with a Twitter stream
18
val tweetStream = env.addSource(generateTweets _)

val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
  .map(_.toUpperCase())
  .filter(symbols.contains(_))

val tweetsPerStock = mentionedSymbols.map(Count(_, 1)).groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
"hdp is on the rise!"
"I wish I bought more
YHOO and HDP stocks"
Count(HDP, 2)
Count(YHOO, 1)(1)
(2)
(4)
(3)
(1)
(2)
(4)
(3)
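Both symbols (the tracked-symbol set used by the filter) and generateTweets are undefined on the slide; a minimal sketch, with contents purely illustrative:

import org.apache.flink.util.Collector
import scala.util.Random

// Hypothetical set of tracked stock symbols.
val symbols = Set("SPX", "FTSE", "HDP", "YHOO")

// Hypothetical tweet source: endlessly replays the sample tweets above.
def generateTweets(out: Collector[String]): Unit = {
  val pool = List("hdp is on the rise!", "I wish I bought more YHOO and HDP stocks")
  while (true) {
    out.collect(pool(Random.nextInt(pool.size)))
    Thread.sleep(Random.nextInt(500))
  }
}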
Streaming joins
19
val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
  .onWindow(30, SECONDS)
  .where("symbol")
  .equalTo("symbol") { (c1, c2) => (c1.count, c2.count) }

val rollingCorrelation = tweetsAndWarning
  .window(Time.of(30, SECONDS))
  .mapWindow(computeCorrelation _)
Example: joining Count(HDP, 1) from warningsPerStock with Count(HDP, 2) from tweetsPerStock yields the pair (1, 2); computeCorrelation then emits a rolling correlation value such as 0.5.
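computeCorrelation is also left undefined; a self-contained Pearson-correlation sketch over the joined (warnings, tweets) pairs — an assumption consistent with the example output above:

import org.apache.flink.util.Collector

// Hypothetical helper: Pearson correlation of the pairs in one window.
def computeCorrelation(input: Iterable[(Int, Int)], out: Collector[Double]): Unit =
  if (input.nonEmpty) {
    val n  = input.size
    val xs = input.map(_._1.toDouble)
    val ys = input.map(_._2.toDouble)
    val (mx, my) = (xs.sum / n, ys.sum / n)
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / n
    val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum / n)
    val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum / n)
    if (sx > 0 && sy > 0) out.collect(cov / (sx * sy))  // undefined when variance is 0
  }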
Fault tolerance
▪ Exactly-once semantics
• Asynchronous barrier snapshotting (sketched below)
• Checkpoint barriers streamed from the sources
• Operator state checkpointing + source backup
• Pluggable backend for state management
20
[Diagram: snapshot barriers 1, 2, 3 flowing through the dataflow; legend: JM = job manager, SM = state manager, operator, snapshot barrier, event channel, data channel, checkpoint]
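A conceptual sketch of the barrier protocol, greatly simplified and not Flink's actual internals (the real runtime buffers records on already-barriered channels during alignment, snapshots asynchronously, and acknowledges checkpoints to the job manager):

// One operator's view of asynchronous barrier snapshotting (simplified model).
class BarrierAligner(
    numInputChannels: Int,
    snapshotState: () => Unit,     // checkpoint operator state to the state backend
    forwardBarrier: () => Unit) {  // emit the barrier to downstream operators

  private var barriersSeen = 0

  // Called when a checkpoint barrier arrives on one input channel. Only when
  // every channel has delivered the barrier is the state snapshotted and the
  // barrier streamed onward, so the snapshot is consistent across all inputs.
  def onBarrier(): Unit = {
    barriersSeen += 1
    if (barriersSeen == numInputChannels) {
      snapshotState()
      forwardBarrier()
      barriersSeen = 0
    }
  }
}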
Performance
21
▪ Performance optimizations
• Effective serialization due to strongly typed topologies
• Operator chaining (thread sharing / no serialization)
• Various automatic query optimizations
▪ Competitive performance
• ~1.5m events / sec / core
• For comparison, Storm promises ~1m tuples / sec / node
Roadmap
22
▪ Persistent, high-throughput state backend
▪ Job manager high availability
▪ Application libraries
• General statistics over streams
• Pattern matching
• Machine learning pipelines library
• Streaming graph processing library
▪ Integration with other frameworks
• Zeppelin (notebook)
• SAMOA (online ML)
Summary
23
▪ Flink is a use-case complete framework to unify batch and stream processing
▪ True streaming runtime with high-level APIs
▪ Flexible, data-driven windowing semantics
▪ Competitive performance
▪ We are just getting started!
Flink Community
24
[Chart: unique git contributors over time, axis Jul-09 to May-16, scale 0–120]
flink.apache.org
@ApacheFlink
Editor's Notes
• #14: Three main components to the system: connectors (sources), operators (transformations), and sinks (outputs). The source interface is as general as it gets, plus pre-implemented connectors. Rich set of operators designed for true streaming analytics (long-standing, stateful, windowing). Sinks are very general, same as the sources: simple interfaces plus pre-implemented ones.
• #15: The goal is to showcase the main features of the API on a "real world" example. Use-case: analyze streams of stock market data consisting of (stock symbol, stock price) pairs. Sentiment analysis: combine the market information with information acquired from social media feeds (in this case the number of times the stock symbol was mentioned in the Twitter stream). Use stream joins for this.
• #16: As a first step we need to connect to our data streams and parse our inputs. Here we use a simple socket stream and convert it to a case class. We could have used Kafka or any other message queue, or a more advanced stock representation.
• #17: Talk about policy-based windowing: eviction (window size) and trigger (slide size). Window operations: reduce, mapWindow. Grouped vs. non-grouped windowing.
• #18: Flexible windowing enables awesome features: the Delta policy, the Count case class for convenience. Mention other cool use-cases: detecting user sessions.
• #19: Simple windowed word count on the filtered tweet stream for each symbol.