Predictive Maintenance with Deep Learning and Flink
Dongwon Kim, PhD
Solution R&D Center, SK Telecom
About me (@eastcirclek)
• Big Data processing engines
  • MapReduce, Tez, Spark, Flink, Hive, Presto
• Recent interests (covered in this talk)
  • Deep learning model serving (TensorFlow Serving)
  • Containerization (Docker, Kubernetes)
  • Time-series data (InfluxDB, Prometheus, Grafana)
• Flink Forward 2015 Berlin
  • A Comparative Performance Evaluation of Flink
Refinery and semiconductor companies depend on their equipment; an equipment breakdown has a large impact on company profit.
Equipment maintenance to minimize breakdown
• Planned Maintenance: shut down equipment on a regular basis for parts replacement, cleaning, adjustments, etc.
• Predictive Maintenance (PdM): unplanned maintenance based on a prediction of the equipment's condition
<diagram: equipment sensors feeding a predictive maintenance system>
Our approach to Predictive Maintenance
Better prediction of equipment condition using Deep Learning
<diagram: maintenance split into PdM / Planned Maintenance / Breakdown, comparing Machine Learning & Statistics with Machine Learning & Statistics + Deep Learning>
Contents
1. Why we use Flink for our time-series prediction model
2. Flink pipeline design for rendezvous and DNN ensemble
3. Solution packaging and monitoring with Docker and Prometheus
<time-series prediction model stack: Flink, MySQL, TF Serving, InfluxDB, Prometheus, Grafana>
My team consists of two groups (roles and toolboxes)
* The diagram is based on Ted Dunning's slides (Flink Forward 2017 SF)
• Model Developers: train the DNN model on history data
• Data Engineers: apply the DNN model to live data for prediction & alarm
Model developers give a Convolutional LSTM to the engineers
• Input (X): a 1-week time-series (10080 records, assuming a one-minute sampling rate)
• Architecture: a CNN feeding an RNN (LSTM)
• Output (Ŷ): expected sensor values after 2 days
• Model developer's note: it does not return whether the equipment breaks down after 2 days
Data engineers apply the Convolutional LSTM to live sensor data
• We have multi-sensor data: each reading from the equipment is a vector [y1, y2, y3, ..., ym].
• Multi-sensor data arrive at a fixed interval of 1 minute.
• We maintain a count window over the latest 10080 records; it slides as each new record arrives (a sliding count window).
• Whenever the sliding window moves, we apply the Convolutional LSTM to the window contents X (10080 records).
• Given the one-week time-series X, the DNN returns the predicted sensor values after 2 days: Ŷ = [ŷ1, ŷ2, ŷ3, ..., ŷm].
• Raise an alarm if the distance between the predicted vector Ŷ and the measured vector Y = [y1, y2, y3, ..., ym] is above a defined threshold.
Desired streaming topology by the data engineers
• Stream source → Outlier filter → Sliding count window → Apply DNN to X → Prediction stream (Ŷ)
• Prediction stream (Ŷ) joined with the measurement stream (Y) → Score → Sink
Requirement 1: a count window that maintains 10080 records, instead of a 1-week event-time window
Requirement 2: joining of two streams on event time (rendezvous)
Proof-of-Concept with Spark Structured Streaming
• Types: Input, Prediction, and Score as streaming Datasets
• Generate an input stream from local files
• Apply the DNN to a 1-week event-time window rather than to 10080 records (a sliding count window is not supported)
• Join the two streams: an inner join between two streaming Datasets is not supported in Spark Structured Streaming
Unsupported operations in Spark Structured Streaming (v2.2) block both of our requirements:
• Requirement 1: a count window maintaining 10080 records instead of a 1-week event-time window
• Requirement 2: joining of two streams on event time (rendezvous)
That's why we moved to the Flink DataStream API
Spark Structured Streaming:
• Sliding count window: not supported
• Joining of two streams: not supported
• Micro-batch behind the scenes (continuous processing proposed in SPARK-20928)
Flink DataStream API:
• Sliding count window: supported
• Joining of two streams: supported
• Scalability and performance proven by other use cases
* It may be possible to use our Convolutional LSTM model with Spark Structured Streaming in some other way.
Data processing pipeline with the Flink DataStream API:
addSource → process → countWindowAll (custom evictor) → apply → assignTimestampsAndWatermarks (+2 days) → join with the measurement stream (on W(t) @t) → window → apply → sinks (Input, Outlier, Prediction, and Score sinks)
Flink can faithfully implement our streaming topology design
<Topology design>: Stream source → Outlier filter → Sliding count window → Apply DNN to X (Ŷ) → Join with Y → Score → Sink
<Flink implementation>: addSource → process → countWindowAll (custom evictor) → apply → assignTimestampsAndWatermarks → join → window → apply → sinks
Contents
1. Why we use Flink for our time-series prediction model
2. Flink pipeline design for rendezvous and DNN ensemble
3. Solution packaging and monitoring with Docker and Prometheus
<time-series prediction model stack: Flink, MySQL, TF Serving, InfluxDB, Prometheus, Grafana>
Data processing pipeline with the Flink DataStream API — first step: addSource, where we read data from MySQL.
Stateful custom source to read from MySQL
• We assume that sensor data arrive in order.
• For each row read from the input table (columns: timestamp, measured = [y1, y2, ..., ym]), emit an input record and a watermark with the same timestamp, then advance lastTimestamp (e.g. 11:15 → 11:16).
• Exactly-once semantics: store lastTimestamp (operator state) when taking a snapshot; restore lastTimestamp when restarted.
• The query issued over the JDBC connection (a sketch of the source follows):
SELECT timestamp, measured
FROM input
WHERE timestamp > $lastTimestamp
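Below is a minimal sketch of such a stateful source, assuming a simple (timestamp, measured) schema; the Measurement type, the connection string, and the polling interval are illustrative, not the actual implementation from the talk.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlMeasurementSource extends RichSourceFunction<Measurement>
    implements CheckpointedFunction {

  private volatile boolean running = true;
  private long lastTimestamp = Long.MIN_VALUE;          // restored from state on recovery
  private transient ListState<Long> checkpointedTimestamp;

  @Override
  public void run(SourceContext<Measurement> ctx) throws Exception {
    // Connection string is illustrative.
    try (Connection conn = DriverManager.getConnection("jdbc:mysql://mysql/pdm", "user", "pw")) {
      while (running) {
        try (PreparedStatement ps = conn.prepareStatement(
                 "SELECT timestamp, measured FROM input WHERE timestamp > ? ORDER BY timestamp")) {
          ps.setLong(1, lastTimestamp);
          try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
              long ts = rs.getTimestamp(1).getTime();
              Measurement m = Measurement.parse(ts, rs.getString(2));   // illustrative helper
              // Records arrive in order, so the record timestamp can double as the watermark.
              synchronized (ctx.getCheckpointLock()) {
                ctx.collectWithTimestamp(m, ts);
                ctx.emitWatermark(new Watermark(ts));
                lastTimestamp = ts;                     // advance only after emitting
              }
            }
          }
        }
        Thread.sleep(1000);                             // poll for new rows
      }
    }
  }

  @Override
  public void cancel() { running = false; }

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
    checkpointedTimestamp.clear();
    checkpointedTimestamp.add(lastTimestamp);           // store lastTimestamp in the snapshot
  }

  @Override
  public void initializeState(FunctionInitializationContext context) throws Exception {
    checkpointedTimestamp = context.getOperatorStateStore().getListState(
        new ListStateDescriptor<>("lastTimestamp", Long.class));
    for (Long ts : checkpointedTimestamp.get()) {
      lastTimestamp = ts;                               // restore lastTimestamp when restarted
    }
  }
}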
Data processing pipeline with the Flink DataStream API — the process and countWindowAll steps:
• process: filter out outliers and emit them to a side output (the Outlier sink).
• countWindowAll: maintain the last N elements.
• An event-time window of 1 week cannot guarantee 10080 records, as data can be missing or filtered out, so we define a custom evictor.
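As a rough illustration of the process step, here is a sketch of an outlier filter that sends dropped records to a side output; the Measurement type, the withinRange check, and the tag name are assumptions.

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class OutlierFilter extends ProcessFunction<Measurement, Measurement> {

  // Side output for records that are dropped from the main stream.
  public static final OutputTag<Measurement> OUTLIERS =
      new OutputTag<Measurement>("outliers") {};

  private final double min, max;

  public OutlierFilter(double min, double max) {
    this.min = min;
    this.max = max;
  }

  @Override
  public void processElement(Measurement m, Context ctx, Collector<Measurement> out) {
    if (m.withinRange(min, max)) {      // illustrative per-sensor range check
      out.collect(m);                   // normal records continue to the count window
    } else {
      ctx.output(OUTLIERS, m);          // outliers go to the Outlier sink via side output
    }
  }
}

In the pipeline, stream.process(new OutlierFilter(lo, hi)) would yield the main stream, and getSideOutput(OutlierFilter.OUTLIERS) would feed the Outlier sink.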
What if data is absent or filtered out for a long period of time? With 3 missing days, the windows before and after the gap look totally different, so we'd better start a new sliding window for the time-series!
How to start a new sliding count window after a long break
• CountTrigger.of(1) fires every time a record comes in, and the EvictingWindowOperator adds each new input record to its InternalListState.
• CountEvictor.of(3) simply evicts elements beyond its capacity, keeping the last 3 records regardless of any time gap.
• CustomEvictor.of(3, timeThreshold=4) additionally evicts all but the last element when that element occurs more than timeThreshold timestamps after the previous one, so a new window starts after missing 4 timestamps (a sketch follows).
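A minimal sketch of such an evictor, assuming Flink 1.4's Evictor interface; the class name, the gap check, and the way the threshold is expressed are illustrative.

import org.apache.flink.streaming.api.windowing.evictors.Evictor;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue;

import java.util.Iterator;

public class GapAwareCountEvictor<T> implements Evictor<T, GlobalWindow> {

  private final long maxCount;       // e.g. 10080 records
  private final long gapThreshold;   // expressed in the same units as record timestamps

  public GapAwareCountEvictor(long maxCount, long gapThreshold) {
    this.maxCount = maxCount;
    this.gapThreshold = gapThreshold;
  }

  @Override
  public void evictBefore(Iterable<TimestampedValue<T>> elements, int size,
                          GlobalWindow window, EvictorContext ctx) {
    long newestTs = Long.MIN_VALUE;
    long previousTs = Long.MIN_VALUE;
    for (TimestampedValue<T> e : elements) {            // elements are in insertion order
      previousTs = newestTs;
      newestTs = e.getTimestamp();
    }
    boolean longBreak = previousTs != Long.MIN_VALUE
        && newestTs - previousTs > gapThreshold;

    int toEvict = longBreak
        ? size - 1                                       // long break: keep only the newest element
        : (int) Math.max(0, size - maxCount);            // otherwise behave like CountEvictor
    Iterator<TimestampedValue<T>> it = elements.iterator();
    for (int i = 0; i < toEvict && it.hasNext(); i++) {
      it.next();
      it.remove();                                       // evict from the front of the window
    }
  }

  @Override
  public void evictAfter(Iterable<TimestampedValue<T>> elements, int size,
                         GlobalWindow window, EvictorContext ctx) {
    // nothing to evict after the window function has run
  }
}

It would be plugged in roughly as stream.windowAll(GlobalWindows.create()).trigger(CountTrigger.of(1)).evictor(new GapAwareCountEvictor<>(10080, gap)).apply(...), which is what countWindowAll with a custom evictor boils down to.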
Data processing pipeline with the Flink DataStream API — next: the apply step, where the DNN is applied to the windowed time-series.
Working with model developers
• They stick to using Python and develop models using a Python library called Keras.
• Data engineer: "I want to develop our solution on the JVM! Why don't we develop models using Deeplearning4J?"
• Model developer: "I don't want to use Deeplearning4J because that's Java... We use Keras on Python!"
How to load Keras models in Flink? At first, we didn't know how.
Loading Keras models in the JVM
• Convert Keras models to a TensorFlow SavedModel
  • using tensorflow.python.saved_model.builder.SavedModelBuilder
  • TensorFlow Java API (Flink TensorFlow): do inference inside the JVM process
  • TensorFlow Serving: do inference outside the JVM process
• Execute Keras through CPython inside the JVM
  • Do inference inside the JVM process
  • Java Embedded Python (JEP) to ease the use of CPython: https://github.com/ninia/jep
• Use KerasModelImport from Deeplearning4J
  • Not mature enough
Comparison of approaches to use Keras models in the JVM
• TensorFlow Java API: the RichWindowFunction calls the TensorFlow Java API (a very thin wrapper over the TensorFlow native library) inside the TaskManager process; the Keras model is exported as a SavedModel.
• Java Embedded Python (JEP): the RichWindowFunction drives CPython embedded in the TaskManager process through JEP's Java object and native code, executing Python commands: import keras, load a model & weights, pass X and get Ŷ.
• TensorFlow Serving: the RichWindowFunction is a gRPC client of a separate TFServing process, whose DynamicManager and Loader serve SavedModel versions (v1, v2, ...) exported from the Keras model.
Comparison of runtime inference performance
• TensorFlow Java API: 77.7 milliseconds per inference
• TensorFlow Serving: 71.2 milliseconds per inference
• Keras inside CPython w/ TensorFlow backend: 32 milliseconds per inference
(* The Theano backend is extremely slow in our case; we do not batch inference calls.)
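Since the Keras-inside-CPython route was the fastest for us, here is a minimal sketch of calling a Keras model through JEP from a window function; the model path, the Measurement/Prediction helper types, and the exact numpy conversions are assumptions rather than the talk's actual code.

import jep.Jep;
import jep.NDArray;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.windowing.RichAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

import java.util.List;

public class KerasInferenceFunction
    extends RichAllWindowFunction<Measurement, Prediction, GlobalWindow> {

  private transient Jep jep;   // a JEP interpreter must stay on the thread that created it

  @Override
  public void open(Configuration parameters) throws Exception {
    jep = new Jep();
    jep.eval("from keras.models import load_model");
    jep.eval("model = load_model('/models/conv_lstm.h5')");   // path is illustrative
  }

  @Override
  public void apply(GlobalWindow window, Iterable<Measurement> values,
                    Collector<Prediction> out) throws Exception {
    double[] x = Measurements.flatten(values);                // illustrative helper: 10080 * m values
    int m = x.length / 10080;
    jep.set("x", new NDArray<>(x, 1, 10080, m));              // exposed to Python as a numpy array
    jep.eval("y = model.predict(x)[0].tolist()");             // Ŷ: expected sensor values after 2 days
    @SuppressWarnings("unchecked")
    List<Number> yHat = (List<Number>) jep.getValue("y");
    out.collect(Prediction.of(Measurements.lastTimestamp(values), yHat));  // illustrative types
  }

  @Override
  public void close() throws Exception {
    if (jep != null) jep.close();
  }
}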
Data processing pipeline with the Flink DataStream API — the join step uses TumblingEventTimeWindows (size = the sampling interval).
Joining two streams on event time
• At a certain time t:
  • Y of timestamp t is arriving on the measurement stream
  • Ŷ of timestamp t+2d is arriving on the prediction stream
  • Ŷ of timestamp t already arrived two days ago
• TumblingEventTimeWindows.of( Time.seconds(timeUnit) ) maintains a window for a single pair of Y and Ŷ.
• A window is triggered when watermarks from both streams have arrived.
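A minimal sketch of this rendezvous with the DataStream API, assuming the prediction stream's timestamps have already been shifted by +2 days; the Measurement, Prediction, and Score types and the scoring function are illustrative.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class Rendezvous {

  /** Joins measurements with the (already +2-days-shifted) prediction stream. */
  public static DataStream<Score> join(DataStream<Measurement> measurements,
                                       DataStream<Prediction> predictions,
                                       long intervalSeconds,
                                       double threshold) {
    return measurements
        .join(predictions)
        .where(Measurement::getTimestamp)      // Y of timestamp t
        .equalTo(Prediction::getTimestamp)     // Ŷ predicted two days ago for timestamp t
        .window(TumblingEventTimeWindows.of(Time.seconds(intervalSeconds)))
        .apply((y, yHat) -> {
          double dist = euclidean(y.values(), yHat.values());
          return new Score(y.getTimestamp(), dist, dist > threshold);   // alarm if above threshold
        });
  }

  private static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}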
Data processing pipeline with the Flink DataStream API — the window/apply step after the join: raise an alarm if the distance between Ŷ and Y goes beyond a defined threshold.
Data processing pipeline with the Flink DataStream API — the final step: writing to the Input, Prediction, Score, and Outlier sinks.
Input, Prediction, Score sinks write records to InfluxDB
We then plot time-series using Grafana
Predicting from a single DNN is not enough!
• Prediction from a single DNN vs. the measurement: possibly biased prediction.
• Prediction from an ensemble of 10 DNNs vs. the measurement: more reliable prediction!
DNN ensemble for reliable prediction
• Different Convolutional LSTMs return slightly different prediction results.
• Given X (the one-week time-series), we take the mean of the ensemble's predictions as Ŷ.
• Two days later, raise an alarm if the distance between Ŷ and the measured Y is above a defined threshold.
Data processing pipeline with the Flink DataStream API — how do we implement the ensemble pipeline? The single DNN apply step must become a DNN ensemble whose predictions Ŷ are averaged (mean) before the join.
Data processing pipeline with the Flink DataStream API — ensemble version (see the sketch below):
• flatMap: replicate each record 10 times with different keys.
• keyBy → countWindow (custom evictor) → apply, with setParallelism(ensembleSize=10): each key feeds one DNN of the ensemble, producing 10 predictions Ŷ.
• assignTimestampsAndWatermarks (+2 days).
• windowAll (TumblingEventTimeWindow) → apply, with setParallelism(1): take the mean of the 10 Ŷ vectors.
• Then join with the measurement stream and write to the Input, Outlier, Prediction, and Score sinks as before.
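A minimal wiring sketch of the ensemble stage under these assumptions; ReplicateFlatMap, KeyedMeasurement, KeyedKerasInference, ShiftByTwoDays, and MeanOfPredictions are illustrative names, and GapAwareCountEvictor is the evictor sketched earlier.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public class EnsembleStage {

  public static DataStream<Prediction> build(DataStream<Measurement> filtered,
                                             int ensembleSize, long gapThreshold,
                                             long intervalSeconds) {
    DataStream<Prediction> perModel = filtered
        .flatMap(new ReplicateFlatMap(ensembleSize))          // each record emitted once per key KEY_0..KEY_9
        .keyBy(KeyedMeasurement::getKey)                      // one key per DNN replica
        .window(GlobalWindows.create())
        .trigger(CountTrigger.of(1))                          // slide on every new record
        .evictor(new GapAwareCountEvictor<>(10080, gapThreshold))
        .apply(new KeyedKerasInference())                     // keyed variant of the inference function
        .setParallelism(ensembleSize);                        // 10 parallel inference tasks

    return perModel
        .assignTimestampsAndWatermarks(new ShiftByTwoDays())  // Ŷ computed at t is stamped t + 2 days
        .windowAll(TumblingEventTimeWindows.of(Time.seconds(intervalSeconds)))
        .apply(new MeanOfPredictions())                       // average the 10 Ŷ vectors into one
        .setParallelism(1);
  }
}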
Distribute 10 keys evenly over 10 partitions
• flatMap replicates each record 10 times with different keys (KEY_0, KEY_1, ..., KEY_9).
• keyBy partitions by murmur hash, so we carefully generate keys that do not end up in the same partition: KEY_0 → PARTITION_0, KEY_1 → PARTITION_1, ..., KEY_9 → PARTITION_9.
• Flink assigns a key to a partition roughly as PARTITION = (murmurHash(KEY) % maxParallelism) * parallelism / maxParallelism, so we search for key strings that map to distinct partitions (see the sketch below).
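One way to find such keys is to probe Flink's own key-group assignment until every subtask has a key; this sketch uses the internal KeyGroupRangeAssignment utility, and the KEY_n format is an illustrative assumption.

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EnsembleKeys {

  /** Finds one key per subtask so the 10 replicas spread over 10 parallel inference tasks. */
  public static List<String> generate(int parallelism, int maxParallelism) {
    String[] keys = new String[parallelism];
    int found = 0;
    for (int candidate = 0; found < parallelism; candidate++) {
      String key = "KEY_" + candidate;
      int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(
          key, maxParallelism, parallelism);              // same mapping keyBy uses at runtime
      if (keys[subtask] == null) {                        // first key landing on this subtask wins
        keys[subtask] = key;
        found++;
      }
    }
    return new ArrayList<>(Arrays.asList(keys));
  }
}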
Contents
1. Why we use Flink for our time-series prediction model
2. Flink pipeline design for rendezvous and DNN ensemble
3. Solution packaging and monitoring with Docker and Prometheus
<time-series prediction model stack: Flink, MySQL, TF Serving, InfluxDB, Prometheus, Grafana>
A simple software stack on top of Docker
• On the customer machine, the Docker engine runs official images only: Flink, MySQL, InfluxDB, Prometheus, Grafana, and TensorFlow Serving.
• No custom Docker image! A single yml file is enough to deploy our software stack.
Launch the JobManager & TaskManager with some changes to the official repository of the Docker image for Flink.
You need to get flink-metrics-prometheus-1.4-SNAPSHOT.jar yourself until Flink 1.4 is officially released.
metrics.reporter: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
jobmanager.heap.mb: 10240
taskmanager.heap.mb: 10240
Solution deployment (* every process runs inside a Docker container)
• Flink: one JobManager and several TaskManagers; we submit a Flink job to launch our pipeline (MySQL source, TensorFlow Serving inference if using TFServing, InfluxDB sink).
• Prometheus scrapes the HTTP endpoints of the metrics exporters specified in its configuration:
  • Flink runtime metrics & custom metrics at :9249/metrics
  • MySQL metrics via MySQLd Exporter at :9104/metrics
  • System metrics (CPU/disk/memory/network of the servers) via Node Exporter at :9100/metrics
  • Container metrics via cAdvisor at :8080/metrics
• Grafana hosts the sensor time-series dashboard (from InfluxDB) and the solution monitoring dashboard (from Prometheus).
Solution monitoring dashboard (* this dashboard is based on "Docker Dashboard" by Brian Christner)
• Server memory/CPU/filesystem usage (by Node Exporter)
• Container CPU usage (by cAdvisor)
• Inference time from each DNN (custom metrics)
• TaskManager JVM memory usage
• # records written to sinks (custom metrics)
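A minimal sketch of how such custom metrics could be registered with Flink's metric system, and thereby exposed to Prometheus by the reporter; the sink class, the metric names, and where the inference-time gauge lives are assumptions.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class InfluxDbScoreSink extends RichSinkFunction<Score> {   // Score is an illustrative type

  private transient Counter recordsWritten;   // "# records written to sinks" on the dashboard

  @Override
  public void open(Configuration parameters) throws Exception {
    recordsWritten = getRuntimeContext().getMetricGroup().counter("recordsWritten");
    // The per-DNN inference time would be registered similarly in the inference
    // function, e.g. as a gauge named "inferenceMillis" in its open() method.
  }

  @Override
  public void invoke(Score score) throws Exception {
    // write the score to InfluxDB here (omitted)
    recordsWritten.inc();                     // exported at :9249/metrics by the Prometheus reporter
  }
}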
Recap – contents
1. Why we use Flink for our time-series prediction model
2. Flink pipeline design for rendezvous and DNN ensemble
3. Solution packaging and monitoring with Docker and Prometheus
<time-series prediction model stack: Flink, MySQL, TF Serving, InfluxDB, Prometheus, Grafana>
Conclusion
• Flink helps us concentrate on the core logic
  • The DataStream API reads like a natural language for expressing streaming topologies
  • Flexible windowing mechanisms (count window and evictor)
  • Joining of two streams on event time
• Thanks to that, we can focus on
  • Implementing custom sources/sinks to meet customer requirements
  • Interaction with the DNN ensemble
• Flink has a nice ecosystem that helps build a solution
  • Docker
  • Prometheus metrics reporter
THE END