Re-introducing the Stream Processor
A Universal Tool for Continuous Data Analysis
Paris Carbone
Committer @ Apache Flink
PhD Candidate @ KTH
Data Stream Processors
A data stream processor can set up any data pipeline for you
Image: http://edge.alluremedia.com.au/m/l/2014/10/CoolingPipes.jpg
Is this really a step forward in data processing?
A growing open-source ecosystem, e.g., Kafka, Flink, Beam, Apex
General Idea of the tech:
• Processes pipeline computation in a cluster
• Computation is continuous and parallel (like data)
• Event-processing logic <-> Application state
• It’s production-ready and aims to simplify analytics
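A minimal sketch of the "event-processing logic <-> application state" idea, written in plain Python rather than against any real stream processor API. The function name `count_per_key` and the sample events are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict


def count_per_key(events):
    """Consume events one at a time, updating per-key state and emitting results."""
    state = defaultdict(int)      # application state: one counter per key
    for key in events:            # `events` could just as well be an unbounded iterator
        state[key] += 1           # the event-processing logic updates the state
        yield key, state[key]     # an updated result is emitted immediately


clicks = ["alice", "bob", "alice"]
print(list(count_per_key(clicks)))  # [('alice', 1), ('bob', 1), ('alice', 2)]
```

Each incoming event both reads and updates the state, which is why the logic and the state are kept together in one operator.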
Data Stream Processors
streams
4 Aspects of Data Processing
[Diagram labels: complex event proc., fast approximate, streaming, ETL, event logs, production database, rules, data warehouses + historical data, application state + failover, “microservices”, complex analytics, large-scale processing systems, interactive queries, data science, reports; roles: dev, user, analyst, data engineer]
1. Speed
Low-Latency Data Processing
Traditionally the sole reason stream processing was used
How do stream processors achieve low latency?
• No intermediate scheduling (you let it run)
• No physical blocking (pre-compute on the go)
• Copy-on-write for state and output
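The "no physical blocking" point can be sketched with ordinary Python generators: each record flows through every stage before the next record is read, so the first result is available without materializing any intermediate dataset. This is only an analogy for pipelined execution; the stage names `parse` and `square` and the `trace` list are hypothetical.

```python
def parse(lines, trace):
    """First stage: turn each text line into an integer."""
    for line in lines:
        trace.append(f"parse {line}")
        yield int(line)


def square(numbers, trace):
    """Second stage: square each number as soon as it arrives."""
    for n in numbers:
        trace.append(f"square {n}")
        yield n * n


def first_result(lines):
    """Return the first output and a trace of the work done to produce it."""
    trace = []
    pipeline = square(parse(iter(lines), trace), trace)
    return next(pipeline), trace


result, trace = first_result(["3", "4", "5"])
print(result, trace)  # 9 ['parse 3', 'square 3']
```

Only the first record was parsed before the first result came out, which is the behavior a staged (blocking) execution would not give.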
But is this only relevant for live data?
CEP semantics etc. are nowadays provided as additional libraries for stream processors
2. History
Offline Data Processing
What can stream processors do for historical data?
It is possible, and better, for bulk historical data analysis
• Ability to define custom state to build up models
• Large-scale support is a given (inherits cluster computing benefits)
• Separation of notions of time and out-of-order processing
But isn’t it hard to deal with failures in streaming?
e.g., event-time windows, session windows
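A minimal sketch of session windows computed on event time, assuming a bounded batch of historical `(timestamp, value)` records. Because the input is bounded, the sketch simply sorts by timestamp; a real stream processor would handle out-of-order arrival incrementally (typically with watermarks), which is not shown here. The function name `session_windows` and the `gap` parameter are illustrative.

```python
def session_windows(events, gap):
    """Group (timestamp, value) events into sessions separated by more than `gap`."""
    sessions = []
    # Order by the time the event happened, not the order in which it arrived.
    for ts, value in sorted(events, key=lambda e: e[0]):
        last_ts = sessions[-1][-1][0] if sessions else None
        if last_ts is not None and ts - last_ts <= gap:
            sessions[-1].append((ts, value))   # still inside the current session
        else:
            sessions.append([(ts, value)])     # gap exceeded: start a new session
    return sessions


# Events arrive out of order with respect to their timestamps.
events = [(10, "a"), (1, "b"), (3, "c"), (12, "d")]
print(session_windows(events, gap=5))
# [[(1, 'b'), (3, 'c')], [(10, 'a'), (12, 'd')]]
```

The result depends only on the timestamps carried by the events, so replaying the same history gives the same windows regardless of arrival order.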
3. Durability
Exactly-Once Data Processing
Traditionally streaming ~ lossy, approximate processing
This is no longer true. Forget the ‘lambda architecture’.
• Input records are durably stored and indexed in logs (e.g., Kafka)
• Systems handle state snapshotting & transactions with external stores transparently
• Idempotent and transactional writes to external stores
e.g., on Flink each stream computation (part 1, part 2, part 3, part 4) either completes or repeats
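The combination of a replayable input log, state snapshots and idempotent writes can be sketched as a toy simulation. This is not Flink's actual snapshotting protocol; the function `run_with_recovery`, the single injected failure at `fail_at`, and the dictionary used as an external store are all assumptions made for illustration.

```python
def run_with_recovery(log, snapshot_every, fail_at):
    """Sum the records of a replayable log, surviving one simulated failure."""
    state = {"offset": 0, "total": 0}   # processor state, including the read position
    snapshot = dict(state)              # last durable snapshot of that state
    sink = {}                           # external store, keyed by log offset
    failed = False
    while state["offset"] < len(log):
        if not failed and state["offset"] == fail_at:
            failed = True
            state = dict(snapshot)      # roll back; work since the snapshot repeats
            continue
        offset = state["offset"]
        state["total"] += log[offset]
        sink[offset] = state["total"]   # idempotent write: a replay overwrites the same key
        state["offset"] = offset + 1
        if state["offset"] % snapshot_every == 0:
            snapshot = dict(state)      # snapshot state and input offset together
    return state["total"], sink


total, sink = run_with_recovery([1, 2, 3, 4, 5], snapshot_every=2, fail_at=3)
print(total)  # 15
print(sink)   # {0: 1, 1: 3, 2: 6, 3: 10, 4: 15}
```

The record at offset 2 is processed twice, yet it counts once in the final state and appears once in the store, because the state is rolled back together with the input offset and the write is keyed by that offset.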
[Diagram: input streams, application states, stream processor, rollback]
4. Interactivity
Querying Data Processing State
Stream Processor ~ Inverse DBMS
Application state holds fresh knowledge we want to query:
• In some systems (e.g., Kafka Streams) we can use the changelog
• In other systems (e.g., Flink) we can query the state externally… or run stream queries on a custom query processor built on top of them*
[Diagram: Alice asks “Bob?” and receives “Bob=…”]
*https://techblog.king.com/rbea-scalable-real-time-analytics-king/
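A minimal sketch of the changelog approach to querying state: replaying a stream of `(key, value)` updates reconstructs a table that can then be queried like a key-value store. It does not use the Kafka Streams API; the function `materialize`, the sample changelog and the use of `None` as a delete marker are assumptions for illustration.

```python
def materialize(changelog):
    """Replay (key, value) updates into a queryable table; None deletes the key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # delete marker: forget the key if present
        else:
            table[key] = value     # later updates overwrite earlier ones
    return table


changelog = [("alice", 1), ("bob", 5), ("alice", 2), ("bob", None), ("bob", 7)]
table = materialize(changelog)
print(table.get("bob"))  # 7
print(table)             # {'alice': 2, 'bob': 7}
```

Any consumer that replays the same changelog ends up with the same table, so the latest value per key can be served outside the processor that produced it.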
4 Aspects of Data Processing
stream processor
1. Speed
• no physical blocking/staging
• no rescheduling
• efficient pipelining
• copy-on-write data structures
2. History
• different notions of time
• flexible stateful processing
• high throughput
3. Durability
• durable input logging is a standard
• automated state management
• exactly-once processing
• output commit & idempotency
4. Interactivity
• external access to state/changelogs
• ability to ‘stream queries’ over state
@SenorCarbone
Try out Stream Processing
https://flink.apache.org/
https://kafka.apache.org/
https://beam.apache.org/
