Imre Nagi
Traveloka Data
@imrenagi
Jakarta
Google Cloud Dataflow
Unified Model for Stream and Batch Processing
Imre Nagi
Ping me @imrenagi
Previously:
Software Engineer @CERN @eBay Inc
Currently:
Software Engineer @Traveloka Data
Docker Community Leader, Indonesia
Agenda
What is Dataflow?
Dataflow Abstraction
Dataflow Common Pipeline
Stream Analytics
What is Dataflow?
Apache Beam
A set of SDKs that define the programming model you use to build stream and batch processing pipelines
Cloud Dataflow
A fully managed distributed service that runs and optimizes your Beam pipeline
Dataflow for ..
ETL
● Move
● Filter
● Enrich
I/O Operation
● Connecting to Cloud Pub/Sub
● Read and Write to BigQuery, Bigtable, etc.
Analytics
● Streaming Computing
● Batch Computing
● Machine Learning
Unified Programming Model
Unified: Stream & Batch Pipeline
Open Source:
Java SDK
Python SDK
Go SDK (New)
Unified Model (Streaming)
Source: Cloud Pub/Sub
Processing: Cloud Dataflow (Streaming)
Data Store: Cloud BigQuery
Unified Model (Streaming & Backup)
Source: Cloud Pub/Sub
Backup: Cloud Storage
Processing: Cloud Dataflow (Streaming)
Data Store: Cloud BigQuery
Unified Model (Batch)
Source: Cloud Storage
Processing: Cloud Dataflow (Batch)
Data Store: Cloud BigQuery
Dataflow Abstraction
Beam Pipeline
● Represents a graph of data processing transformations
● PCollections flow through the pipeline
● Can have multiple inputs and outputs at the beginning and end of the pipeline
// Define the pipeline option
PipelineOptions options = PipelineOptionsFactory.create();
// Create the pipeline
Pipeline p = Pipeline.create(options);
Data Model
● PCollection<T> is a collection of elements of type T
● May be bounded or unbounded in size
● Elements carry an implicit or explicit timestamp
// Create the PCollection 'lines' by applying a 'Read' transform.
PCollection<String> lines = p.apply(TextIO.read().from("/path/to/some/inputData.txt"));
// The same transform can read every file matching a Cloud Storage glob.
PCollection<String> linesGCS = p.apply(TextIO.read().from("gs://deeptech/*"));

// Generate a PCollection from in-memory data.
static final List<String> LINES = Arrays.asList(
    "This is the first line",
    "You will say this one is the second",
    "But it's not. ");
PCollection<String> linesInMemory = p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of());

// Generate a bounded PCollection.
PCollection<Long> bounded = p.apply(GenerateSequence.from(0).to(1000));
// Generate an unbounded PCollection.
PCollection<Long> unbounded = p.apply(GenerateSequence.from(0));
PTransform: Transforming the Data
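// Element-wise transform: wrap each input string in a greeting.
public class HelloDoFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext context) {
    String name = context.element();
    context.output("Hello, " + name + " ! ");
  }
}
// Element-wise transform: map each input string to its length.
public class StringToLongDoFn extends DoFn<String, Long> {
  @ProcessElement
  public void processElement(ProcessContext context) {
    String name = context.element();
    // Widen explicitly: an int cannot be boxed into the Long output type.
    context.output((long) name.length());
  }
}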
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
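To make the per-key sum concrete, here is a minimal plain-Java sketch of what a transform like Sum.integersPerKey() computes conceptually. It is not Beam code: the class name PerKeySumSketch, the method sumPerKey, and the sample data are assumptions made for illustration, and a real runner would do this work in parallel across workers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a per-key sum (hypothetical names, not the Beam API).
class PerKeySumSketch {
    // Adds up the integer values that share the same key.
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : input) {
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = Arrays.<Map.Entry<String, Integer>>asList(
            new SimpleEntry<String, Integer>("alice", 3),
            new SimpleEntry<String, Integer>("bob", 5),
            new SimpleEntry<String, Integer>("alice", 4));
        System.out.println(sumPerKey(input)); // prints {alice=7, bob=5}
    }
}
```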
I/O Transform
Dataflow Common Pipeline
Linear Pipeline
Combining Multiple PCollections
Producing Multiple PCollections
Multiple Transformations for a PCollection
Joining PCollections
Stream Analytics
Data..
Can be big..
Bigger and bigger..
...maybe infinitely big...
[Figure: data accumulating across Tuesday, Wednesday, and Thursday, on a timeline from 8:00 to 14:00]
Oops.. unknown delays
[Figure: events stamped 8:00 arriving at later points on the 8:00 to 14:00 timeline]
The Lambda Architecture argues that stream processing alone cannot produce
accurate analytics results, so batch processing is needed to correct the
inaccuracy of the stream processing.
Grouping via Processing-Time Windows
[Figure: events summed (∑) per hourly processing-time window from 8:00 to 14:00; events stamped 8:00 land in later windows]
Grouping via event-time windows
[Figure: input plotted against processing time (10:00 to 15:00) and output sums (∑) grouped by event time (10:00 to 15:00)]
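The difference between the two groupings can be sketched in plain Java. This is only a model, not Beam code: the Event class, the minute-based timestamps, and the sample values are assumptions. The second event happened at minute 7 but was only seen at minute 65, so it lands in a different window depending on which time domain drives the grouping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of time-based grouping (hypothetical names, not the Beam API).
class TimeDomainSketch {
    // A hypothetical event: when it happened, when the pipeline saw it, and its value.
    static final class Event {
        final long eventTime;
        final long processingTime;
        final int value;

        Event(long eventTime, long processingTime, int value) {
            this.eventTime = eventTime;
            this.processingTime = processingTime;
            this.value = value;
        }
    }

    // Sums values per fixed window, keyed by window start.
    // The window is chosen by event time or by processing time.
    static Map<Long, Integer> sumPerWindow(List<Event> events, long windowSize, boolean byEventTime) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            long t = byEventTime ? e.eventTime : e.processingTime;
            long windowStart = t - Math.floorMod(t, windowSize);
            sums.merge(windowStart, e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(5, 6, 1),
            new Event(7, 65, 2),   // delayed by almost an hour
            new Event(62, 63, 4));
        System.out.println(sumPerWindow(events, 60, true));  // prints {0=3, 60=4}
        System.out.println(sumPerWindow(events, 60, false)); // prints {0=1, 60=6}
    }
}
```

Only the event-time grouping attributes the delayed value to the hour in which it actually occurred.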
What is windowing?
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded
data.
[Figure: fixed and sliding windows for keys 1 to 3 along a time axis]
A windowing function computes which window(s) an element belongs to.
Temporal functions can be parameterized with duration and frequency.
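A windowing function of this kind can be sketched in a few lines of plain Java. This is an illustrative model rather than the Beam implementation: the names WindowAssignSketch, fixedWindowStart, and slidingWindowStarts are assumptions, "size" plays the role of the duration, and "period" plays the role of the frequency.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of window assignment (hypothetical names, not the Beam API).
class WindowAssignSketch {
    // Fixed windows: each timestamp falls into exactly one window of length `size`.
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Sliding windows: a window of length `size` starts every `period`,
    // so one timestamp can fall into several overlapping windows.
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125, 60));        // prints 120
        System.out.println(slidingWindowStarts(125, 60, 30)); // prints [120, 90]
    }
}
```

With a size of 60 and a period of 30, timestamp 125 belongs to the windows [120, 180) and [90, 150).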
What about data-dependent windowing?
Sessions
[Figure: session windows of varying length for several keys along a time axis]
Unique per key - you can't know a priori when a session ends,
so the windowing function is now also parameterized by state.
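One way to picture session windowing is as merging: every element opens a small window of one gap length, and windows that overlap are merged. The plain-Java sketch below models this for a single key. It is an assumption-laden illustration, not Beam code: SessionSketch, sessions, and the gap value are hypothetical, and a streaming engine would merge incrementally as elements arrive instead of sorting a complete array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of session windows for one key (hypothetical names, not the Beam API).
class SessionSketch {
    // Returns sessions as {start, end} pairs, where end is exclusive.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts < last[1]) {
                // The element falls inside the open session: extend it.
                last[1] = Math.max(last[1], ts + gap);
            } else {
                // The gap has elapsed: start a new session.
                result.add(new long[] {ts, ts + gap});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] s : sessions(new long[] {1, 3, 20, 22, 50}, 5)) {
            System.out.println(Arrays.toString(s));
        }
        // prints [1, 8], then [20, 27], then [50, 55]
    }
}
```

The session boundaries depend on the data already seen for the key, which is the state the slide refers to.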
PCollection<KV<String, Integer>> scores = input.apply(
Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply(Sum.integersPerKey());
Windowing specifies where events are aggregated in event time,
but when are events emitted in processing time?
Trigger
Triggers: A trigger is a mechanism for declaring
when the output for a window should be
materialized relative to some external signal.
Triggers provide flexibility in choosing when
outputs should be emitted.
They also make it possible to observe the output for a window
multiple times as it evolves
Windowed summation on a streaming engine with perfect (left) and heuristic
(right) watermarks.
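// Fire early every minute of processing time, once when the watermark passes
// the end of the window, and again for each late element.
// Note: the Beam Java SDK also requires an allowed lateness and an accumulation
// mode whenever a trigger is set; they are left out here, as on the original slide.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1))))
    .apply(Sum.integersPerKey());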
Windowed summation on a streaming engine with early and late
firings.
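PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Keep window state for one more minute after the watermark passes the window end.
        .withAllowedLateness(Duration.standardMinutes(1))
        // The Beam Java SDK also requires an accumulation mode; accumulating is assumed here.
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());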
Windowed summation on a streaming engine with early and late firings
and allowed lateness
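The effect of allowed lateness can be approximated with a small decision rule, sketched below in plain Java. This is a simplified model under stated assumptions, not the Beam implementation: LatenessSketch, Status, and classify are hypothetical names, times are plain numbers, and the exact boundary handling in a real engine may differ.

```java
// Plain-Java model of allowed lateness (hypothetical names, not the Beam API).
class LatenessSketch {
    enum Status { ON_TIME, LATE, DROPPED }

    // Classifies an element for a window, given where the watermark is when the element arrives.
    static Status classify(long windowEnd, long watermark, long allowedLateness) {
        if (watermark < windowEnd) {
            return Status.ON_TIME;  // the window is still open
        }
        if (watermark < windowEnd + allowedLateness) {
            return Status.LATE;     // the window is closed but its state is kept; a late firing can include it
        }
        return Status.DROPPED;      // the state has been discarded; the element is ignored
    }

    public static void main(String[] args) {
        System.out.println(classify(120, 100, 60)); // prints ON_TIME
        System.out.println(classify(120, 150, 60)); // prints LATE
        System.out.println(classify(120, 200, 60)); // prints DROPPED
    }
}
```

A longer allowed lateness keeps results correctable for longer, at the cost of holding window state for longer.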
First trigger firing: [5, 8, 3]
Second trigger firing: [5, 8, 3, 15, 19, 23]
Third trigger firing: [5, 8, 3, 15, 19, 23, 9, 13, 10]
Accumulation Modes
First trigger firing: [5, 8, 3]
Second trigger firing: [15, 19, 23]
Third trigger firing: [9, 13, 10]
Discarding Modes
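The two modes can be compared on the same three firings with a plain-Java sketch. It is a model for illustration only, not Beam code: PaneSketch and paneSums are hypothetical names, and the aggregation is assumed to be a sum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of accumulation modes (hypothetical names, not the Beam API).
class PaneSketch {
    // Returns the sum emitted at each trigger firing.
    static List<Integer> paneSums(List<List<Integer>> arrivalsPerFiring, boolean accumulating) {
        List<Integer> emitted = new ArrayList<>();
        int state = 0;
        for (List<Integer> arrivals : arrivalsPerFiring) {
            for (int value : arrivals) {
                state += value;
            }
            emitted.add(state);
            if (!accumulating) {
                state = 0; // discarding mode forgets elements that have already fired
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<List<Integer>> arrivals = Arrays.asList(
            Arrays.asList(5, 8, 3),
            Arrays.asList(15, 19, 23),
            Arrays.asList(9, 13, 10));
        System.out.println(paneSums(arrivals, true));  // prints [16, 73, 105]
        System.out.println(paneSums(arrivals, false)); // prints [16, 57, 32]
    }
}
```

In accumulating mode the latest pane replaces the earlier ones, while in discarding mode a consumer has to add the panes together to get the total.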
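// Note: the Beam Java SDK also requires an allowed lateness whenever a trigger
// is set; it is left out here, as on the original slide.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Each pane contains only the elements that arrived since the previous firing.
        .discardingFiredPanes())
    .apply(Sum.integersPerKey());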
Discarding mode version of early/late firings on a streaming engine
Credits to:
1. http://streamingsystems.org/Presentations/Jelena%20Pjesivac-grbovic.pdf
2. Stream Analytics with Google Cloud Dataflow: Use Cases & Patterns, Gaurav Anand
3. Streaming 101 & 102, Tyler Akidau
4. https://streamingbook.net
5. Apache Beam Documentation
The Google Slides version of this deck can be accessed from:
https://docs.google.com/presentation/d/1Ws73JxlVH39HiKiYuF3vW903j8wFzxPQihXz4CQ_HZM/edit?usp=sharing

GDG Jakarta Meetup - Streaming Analytics With Apache Beam