Till Rohrmann
Flink committer
trohrmann@apache.org
@stsffap
Machine Learning with Apache Flink
What is Flink
§  Large-scale data processing engine
§  Easy and powerful APIs for batch and real-time
streaming analysis (Java / Scala)
§  Backed by a very robust execution backend
•  with true streaming capabilities,
•  custom memory manager,
•  native iteration execution,
•  and a cost-based optimizer.
Technology inside Flink
§  Technology inspired by compilers +
MPP databases + distributed systems
§  For ease of use, reliable performance,
and scalability
    case class Path(from: Long, to: Long)

    val tc = edges.iterate(10) {
      paths: DataSet[Path] =>
        val next = paths
          .join(edges)
          .where("to")
          .equalTo("from") {
            (path, edge) =>
              Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
    }
  
[Figure: Flink components spread across the pre-flight (client), master, and worker stages: cost-based optimizer, type extraction stack, memory manager, out-of-core algorithms, real-time streaming, task scheduling, recovery metadata, data serialization stack, streaming network stack, ...]
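To make the transitive-closure example concrete, here is a minimal sketch of the same join / union / distinct step on plain Scala collections, runnable without Flink. The object name `TransitiveClosureSketch` and the early exit once no new path appears are illustrative assumptions; the Flink program on the slide simply runs 10 iterations.

```scala
object TransitiveClosureSketch {
  final case class Path(from: Long, to: Long)

  // Repeats the join/union/distinct step at most maxIterations times.
  // Assumption: stops early once an iteration adds no new path.
  def transitiveClosure(edges: Set[Path], maxIterations: Int): Set[Path] = {
    var paths = edges
    var i = 0
    var changed = true
    while (i < maxIterations && changed) {
      // Join paths.to with edges.from, like where("to").equalTo("from").
      val extended = for {
        p <- paths
        e <- edges
        if p.to == e.from
      } yield Path(p.from, e.to)
      // Set union plays the role of union(paths).distinct().
      val next = paths ++ extended
      changed = next.size != paths.size
      paths = next
      i += 1
    }
    paths
  }
}
```

For a chain 1 → 2 → 3 → 4, the result contains the three edges plus the derived paths 1 → 3, 2 → 4, and 1 → 4.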
How do you use Flink?
Example: WordCount
    import org.apache.flink.api.scala._

    case class Word(word: String, frequency: Int)

    val env = ExecutionEnvironment.getExecutionEnvironment

    val lines = env.readTextFile(...)

    lines
      .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

    env.execute()
  	
  	
  	
  
Flink has mirrored Java and Scala APIs that offer the same
functionality, including by-name addressing.
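The same WordCount logic can be tried on plain Scala collections before running it on a cluster. This is only a sketch: `WordCountSketch` and the `filter(_.nonEmpty)` step are illustrative additions, and grouping by the `word` field stands in for Flink's by-name `groupBy("word")`.

```scala
object WordCountSketch {
  final case class Word(word: String, frequency: Int)

  // Splits lines into words, pairs each word with a count of 1,
  // groups by the word field, and sums the frequencies per group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").filter(_.nonEmpty).map(w => Word(w, 1)))
      .groupBy(_.word)
      .map { case (word, ws) => word -> ws.map(_.frequency).sum }
}
```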
Flink API in a Nutshell
§  map, flatMap, filter, groupBy, reduce, reduceGroup, aggregate, join, coGroup, cross, project, distinct, union, iterate, iterateDelta, ...
§  All Hadoop input formats are supported
§  API similar for data sets and data streams with slightly different operator semantics
§  Window functions for data streams
§  Counters, accumulators, and broadcast variables
Machine learning with Flink
Does ML work like that?
More realistic scenario!
Machine learning pipelines
§  Pipelining inspired by scikit-learn
§  Transformer: Modify data
§  Learner: Train a model
§  Reusable components
§  Lets you quickly build ML pipelines
§  Model inherits pipeline of learner
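The chaining idea (a transformer chained with a learner yields a learner, and the fitted model keeps the transformation) can be sketched in a few lines of plain Scala. Everything here is hypothetical and is not the Flink-ML API: the traits, `PolynomialExpansion`, and the trivial `MeanLearner` exist only to show the pattern.

```scala
object PipelineSketch {
  type Features = Vector[Double]
  final case class LabeledPoint(label: Double, features: Features)

  trait Model { def predict(features: Features): Double }
  trait Learner { def fit(data: Seq[LabeledPoint]): Model }

  trait Transformer { self =>
    def transform(features: Features): Features

    // Chaining a transformer with a learner yields a new learner.
    def chain(learner: Learner): Learner = new Learner {
      def fit(data: Seq[LabeledPoint]): Model = {
        val transformed = data.map(p => p.copy(features = self.transform(p.features)))
        val inner = learner.fit(transformed)
        // The model inherits the pipeline: it transforms before predicting.
        new Model {
          def predict(features: Features): Double = inner.predict(self.transform(features))
        }
      }
    }
  }

  // Hypothetical transformer: expands each x into (x, x^2, ..., x^degree).
  final class PolynomialExpansion(degree: Int) extends Transformer {
    def transform(features: Features): Features =
      features.flatMap(x => (1 to degree).map(d => math.pow(x, d)))
  }

  // Hypothetical learner: always predicts the mean training label.
  object MeanLearner extends Learner {
    def fit(data: Seq[LabeledPoint]): Model = {
      val mean = data.map(_.label).sum / data.size
      new Model { def predict(features: Features): Double = mean }
    }
  }
}
```

Usage follows the slide's shape: `new PolynomialExpansion(3).chain(MeanLearner).fit(trainingData)` returns a model that can be applied to raw, untransformed features.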
Linear regression in polynomial space
    val polynomialBase = PolynomialBase()
    val learner = MultipleLinearRegression()

    val pipeline = polynomialBase.chain(learner)

    val trainingDS = env.fromCollection(trainingData)

    val parameters = ParameterMap()
      .add(PolynomialBase.Degree, 3)
      .add(MultipleLinearRegression.Stepsize, 0.002)
      .add(MultipleLinearRegression.Iterations, 100)

    val model = pipeline.fit(trainingDS, parameters)
[Figure: Input data → Polynomial base mapper → Multiple linear regression → Linear model]
Current state of Flink-ML
§  Existing learners
•  Multiple linear regression
•  Alternating least squares
•  Communication efficient distributed dual
coordinate ascent (PR pending)
§  Feature transformer
•  Polynomial base feature mapper
§  Tooling
Distributed linear algebra
§  Linear algebra is a universal language for data analysis
§  High-level abstraction
§  Fast prototyping
§  Pre- and post-processing
step
Example: Gaussian non-negative matrix factorization
§  Given input matrix V, find W and H such that $V \approx WH$
§  Iterative approximation
$$H_{t+1} = H_t \ast \left( W_t^T V \,/\, W_t^T W_t H_t \right)$$
$$W_{t+1} = W_t \ast \left( V H_{t+1}^T \,/\, W_t H_{t+1} H_{t+1}^T \right)$$
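    var i = 0
    var H: CheckpointedDrm[Int] = randomMatrix(k, V.numCols)
    var W: CheckpointedDrm[Int] = randomMatrix(V.numRows, k)

    while (i < maxIterations) {
      // *, / and %*% share one precedence level in Scala and associate to the
      // left, so the numerator and denominator need explicit parentheses.
      H = H * ((W.t %*% V) / (W.t %*% W %*% H))
      W = W * ((V %*% H.t) / (W %*% H %*% H.t))
      i += 1
    }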
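For experimenting with the update rules on small inputs, the following is a minimal sketch using dense `Array[Array[Double]]` matrices instead of distributed row matrices. It assumes that `∗` and `/` in the formulas are element-wise, and it adds a small constant `Eps` to the denominators to avoid division by zero; `NmfSketch` and its helpers are illustrative names, not part of any library.

```scala
object NmfSketch {
  type Matrix = Array[Array[Double]]

  // Small constant guarding against division by zero (not on the slide).
  private val Eps = 1e-12

  // Plain matrix product a * b.
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val bt = b.transpose
    a.map(row => bt.map(col => row.zip(col).map { case (x, y) => x * y }.sum))
  }

  // Combines two equally sized matrices entry by entry.
  def elementwise(a: Matrix, b: Matrix)(f: (Double, Double) => Double): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => f(x, y) } }

  // One iteration: update H first, then update W using the new H.
  def step(v: Matrix, w: Matrix, h: Matrix): (Matrix, Matrix) = {
    val wt = w.transpose
    val hRatio = elementwise(multiply(wt, v), multiply(multiply(wt, w), h))((n, d) => n / (d + Eps))
    val hNext = elementwise(h, hRatio)(_ * _)
    val hNextT = hNext.transpose
    val wRatio = elementwise(multiply(v, hNextT), multiply(multiply(w, hNext), hNextT))((n, d) => n / (d + Eps))
    val wNext = elementwise(w, wRatio)(_ * _)
    (wNext, hNext)
  }

  // Squared Frobenius distance between V and W * H.
  def squaredError(v: Matrix, w: Matrix, h: Matrix): Double =
    elementwise(v, multiply(w, h))((x, y) => (x - y) * (x - y)).map(_.sum).sum
}
```

Calling `step` repeatedly and tracking `squaredError` gives a quick check that the approximation improves from one iteration to the next.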
  
Why is Flink a good fit for ML?
Flink’s features
§  Stateful iterations
•  Keep state across iterations
§  Delta iterations
•  Limit computation to elements which matter
§  Pipelining
•  Avoiding materialization of large
intermediate state
CoCoA
$$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_i\left( w^T x_i \right) \right]$$
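As a worked instance of the objective, the sketch below evaluates P(w) for a small data set. The hinge loss is only one possible choice for the loss functions; it and all names in `ObjectiveSketch` are assumptions for illustration and do not come from the slides.

```scala
object ObjectiveSketch {
  def dot(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Hinge loss for a label y in {-1, +1}; an assumed choice of loss function.
  def hinge(y: Double)(a: Double): Double = math.max(0.0, 1.0 - y * a)

  // P(w) = lambda / 2 * ||w||^2 + 1 / n * sum_i loss_i(w^T x_i)
  def primal(w: Seq[Double], xs: Seq[Seq[Double]], ys: Seq[Double], lambda: Double): Double = {
    val regularizer = lambda / 2.0 * dot(w, w)
    val averageLoss = xs.zip(ys).map { case (x, y) => hinge(y)(dot(w, x)) }.sum / xs.size
    regularizer + averageLoss
  }
}
```

With w = (1, 0), examples (2, 0) labeled +1 and (0, 1) labeled -1, and λ = 0.5, the regularizer is 0.25 and the average loss is 0.5, so P(w) = 0.75.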
Bulk Iterations
[Figure: bulk iteration dataflow. Labels: initial solution, partial solution, step function, other datasets (X, Y), replace, iteration result.]
Delta iterations
[Figure: delta iteration dataflow. Labels: initial solution, initial workset, partial solution, workset, A, B, other datasets (X, Y), delta set, merge deltas, replace, iteration result.]
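The workset idea can be illustrated on a single machine with connected components by minimum-label propagation. This is a sketch under assumptions: the algorithm choice, the object name, and the per-iteration update counter are illustrative, and every vertex mentioned in an edge is assumed to be listed in `vertices`.

```scala
object DeltaIterationSketch {
  // Propagates the minimum vertex id through an undirected graph.
  // Returns the final labels and the number of vertices updated per iteration.
  def connectedComponents(vertices: Set[Int], edges: Set[(Int, Int)]): (Map[Int, Int], List[Int]) = {
    val neighbors: Map[Int, Set[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    var solution: Map[Int, Int] = vertices.map(v => v -> v).toMap // solution set
    var workset: Set[Int] = vertices                               // initial workset
    var updatesPerIteration = List.empty[Int]

    while (workset.nonEmpty) {
      // Only vertices in the workset send their label to their neighbors.
      val candidates: Map[Int, Int] = workset.toSeq
        .flatMap(v => neighbors.getOrElse(v, Set.empty[Int]).toSeq.map(n => n -> solution(v)))
        .groupBy(_._1)
        .map { case (n, labels) => n -> labels.map(_._2).min }
      // The delta set holds only vertices whose label actually improves.
      val delta = candidates.filter { case (n, label) => label < solution(n) }
      solution = solution ++ delta // merge deltas into the solution set
      workset = delta.keySet       // changed vertices form the next workset
      updatesPerIteration = updatesPerIteration :+ delta.size
    }
    (solution, updatesPerIteration)
  }
}
```

On the graph with edges 1-2, 2-3, and 4-5, the update counts per iteration are 3, 1, 0: the work shrinks as labels converge, whereas a bulk iteration would recompute all five vertices every round.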
Effect of delta iterations
[Figure: number of elements updated (0 to 45,000,000) per iteration (1 to 61).]
Iteration performance
[Figure: time in minutes (0 to 60) for Hadoop, Flink bulk, and Flink delta, with series for 30 iterations and 61 iterations.]
61 iterations and 30 iterations of PageRank on a Twitter follower graph with Hadoop MapReduce and Flink using bulk and delta iterations
How to factorize really large matrices?
Collaborative Filtering
§  Recommend items based on users with
similar preferences
§  Latent factor models capture underlying
characteristics of items and preferences
of user
§  Predicted preference: $\hat{r}_{u,i} = x_u^T y_i$
Matrix factorization
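$$\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)$$
$$R \approx X^T Y$$
[Figure: the rating matrix R approximated by the factor matrices X and Y.]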
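The objective can be evaluated directly for small examples. The sketch below assumes that the list of ratings contains exactly the observed (non-zero) entries and that n_u and n_i are the numbers of ratings per user and per item; `FactorizationCostSketch` and its field names are illustrative.

```scala
object FactorizationCostSketch {
  final case class Rating(user: Int, item: Int, value: Double)

  def dot(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // Predicted preference: inner product of user and item factors.
  def predict(x: Map[Int, Seq[Double]], y: Map[Int, Seq[Double]], user: Int, item: Int): Double =
    dot(x(user), y(item))

  // Squared error over observed ratings plus weighted regularization.
  def cost(ratings: Seq[Rating], x: Map[Int, Seq[Double]], y: Map[Int, Seq[Double]], lambda: Double): Double = {
    val fit = ratings.map { r =>
      val diff = r.value - predict(x, y, r.user, r.item)
      diff * diff
    }.sum
    // n_u and n_i: number of observed ratings per user and per item.
    val userTerm = ratings.groupBy(_.user).map { case (u, rs) => rs.size * dot(x(u), x(u)) }.sum
    val itemTerm = ratings.groupBy(_.item).map { case (i, rs) => rs.size * dot(y(i), y(i)) }.sum
    fit + lambda * (userTerm + itemTerm)
  }
}
```

For a single rating of 3 with x = y = (1, 1) and λ = 0.5, the prediction is 2, the squared error is 1, and the regularization contributes 0.5 · (2 + 2) = 2, giving a cost of 3.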
Alternating least squares
§  Fixing one matrix gives a quadratic form
§  The solution is guaranteed to decrease the overall cost function
§  To calculate $x_u$, all rated item vectors and ratings are needed
$$x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$$
$$S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$$
Data partitioning
Naïve ALS
    case class Rating(userID: Int, itemID: Int, rating: Double)
    case class ColumnVector(columnIndex: Int, vector: Array[Double])

    val items: DataSet[ColumnVector] = ???   // input elided on the slide
    val ratings: DataSet[Rating] = ???       // input elided on the slide

    // Generate tuples of items with their ratings:
    // join items.columnIndex (field 0) with ratings.itemID (field 1)
    val uVA = items.join(ratings).where(0).equalTo(1) {
      (item, ratingEntry) => {
        val Rating(uID, _, rating) = ratingEntry
        (uID, rating, item.vector)
      }
    }
  
	
  
	
  
Naïve ALS contd.
    uVA.groupBy(0).reduceGroup {
      vectors => {
        var uID = -1
        val matrix = FloatMatrix.zeros(factors, factors)
        val vector = FloatMatrix.zeros(factors)
        var n = 0

        for ((id, rating, v) <- vectors) {
          uID = id
          vector += rating * v
          matrix += outerProduct(v, v)
          n += 1
        }

        for (idx <- 0 until factors) {
          matrix(idx, idx) += lambda * n
        }

        new ColumnVector(uID, Solve(matrix, vector))
      }
    }
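The slide leaves `outerProduct` and `Solve` undefined. A dependency-free sketch of the same per-user update, using plain arrays and a small Gaussian elimination in place of a linear algebra library, might look as follows. `AlsUpdateSketch`, `solve`, and `updateUser` are hypothetical names, and the solver is meant for small, well-conditioned systems only.

```scala
object AlsUpdateSketch {
  // Solves A x = b by Gaussian elimination with partial pivoting.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    val m = a.map(_.clone)
    val rhs = b.clone
    for (col <- 0 until n) {
      // Swap in the row with the largest absolute entry in this column.
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmpRow = m(col); m(col) = m(pivot); m(pivot) = tmpRow
      val tmpB = rhs(col); rhs(col) = rhs(pivot); rhs(pivot) = tmpB
      for (row <- col + 1 until n) {
        val factor = m(row)(col) / m(col)(col)
        for (k <- col until n) m(row)(k) -= factor * m(col)(k)
        rhs(row) -= factor * rhs(col)
      }
    }
    // Back substitution on the upper triangular system.
    val x = new Array[Double](n)
    for (row <- n - 1 to 0 by -1) {
      val known = (row + 1 until n).map(k => m(row)(k) * x(k)).sum
      x(row) = (rhs(row) - known) / m(row)(row)
    }
    x
  }

  // ratedItems: pairs (rating, item vector) for the items rated by one user.
  def updateUser(ratedItems: Seq[(Double, Array[Double])], factors: Int, lambda: Double): Array[Double] = {
    val a = Array.fill(factors, factors)(0.0)
    val b = new Array[Double](factors)
    for ((rating, y) <- ratedItems; i <- 0 until factors) {
      b(i) += rating * y(i)                                  // vector += rating * v
      for (j <- 0 until factors) a(i)(j) += y(i) * y(j)      // matrix += outerProduct(v, v)
    }
    for (i <- 0 until factors) a(i)(i) += lambda * ratedItems.size // + lambda * n on the diagonal
    solve(a, b)
  }
}
```

With one rated item y = (1, 0), rating 2, and λ = 1, the system is diag(2, 1) · x = (2, 0), so the updated user vector is (1, 0).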
  
Problems of naïve ALS
§  Problem:
•  Item vectors are sent redundantly → High network load
§  Solution:
•  Blocking of user and item vectors to share
common data
•  Avoids blown up intermediate state
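A rough way to see the saving from blocking is to count how many item-vector copies each approach ships. The sketch below is a simplification under stated assumptions: users are assigned to blocks by `userID % userBlocks`, every shipped copy crosses the network, and `BlockingSketch` with its two functions is purely illustrative.

```scala
object BlockingSketch {
  final case class Rating(userID: Int, itemID: Int, rating: Double)

  // Naive join: an item vector is shipped once per rating of that item.
  def naiveTransfers(ratings: Seq[Rating]): Int = ratings.size

  // Blocked: an item vector is shipped once per user block that rated the item.
  // Assumption: users are assigned to blocks by userID modulo userBlocks.
  def blockedTransfers(ratings: Seq[Rating], userBlocks: Int): Int =
    ratings.map(r => (r.itemID, r.userID % userBlocks)).distinct.size
}
```

If four users in two blocks all rate the same item, the naive join ships that item vector four times and the blocked variant ships it twice.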
Data partitioning
Performance comparison
•  40-node GCE cluster, highmem-8
•  10 ALS iterations with 50 latent factors
[Figure: runtime in minutes (0 to 900) versus number of non-zero entries in billions (0 to 30) for Blocked ALS, Blocked ALS on highmem-16, and Naive ALS. Data labels: 5.5h, 14h, 2.5h, 1h.]
Table 2 (units are not given in the source; the leading value 80 has no column header)
Entries in billion | Naive Join | Naive Join / 60 | Broadcast | Broadcast / 60
0.08               | 201.326    | 3.3554          | 190.723   | 3.1787
Streaming machine learning
Why is streaming ML important?
§  Spam detection in mails
§  Patterns might change over time
§  Retraining of model necessary
§  Best solution: Online models
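An online model updates its parameters one example at a time instead of being retrained from scratch. As a minimal sketch, the following uses a perceptron, which is an assumed stand-in and not an algorithm named on the slides; `OnlineLearningSketch`, `update`, and `train` are illustrative names.

```scala
object OnlineLearningSketch {
  // One perceptron update for an example with label y in {-1, +1}:
  // the weights change only if the example is currently misclassified.
  def update(w: Vector[Double], x: Vector[Double], y: Int, rate: Double): Vector[Double] = {
    val score = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    if (y * score <= 0) w.zip(x).map { case (wi, xi) => wi + rate * y * xi } else w
  }

  // Folds the update over a stream of examples, starting from zero weights.
  def train(stream: Iterator[(Vector[Double], Int)], dim: Int, rate: Double): Vector[Double] =
    stream.foldLeft(Vector.fill(dim)(0.0)) { case (w, (x, y)) => update(w, x, y, rate) }
}
```

Because each example is seen once and then discarded, the model can follow patterns that change over time, such as new kinds of spam.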
Applications
§  Spam detection
§  Recommendation
§  News feed
personalization
§  Credit card fraud
detection
Apache SAMOA
§  Scalable Advanced Massive Online
Analysis
§  Distributed streaming machine learning
framework
§  Incubation at the Apache Software
Foundation
§  Runs on multiple stream processing engines (S4, Storm, Samza)
§  Support for Flink is pending in a pull request
Supported algorithms
§  Classification: Vertical
Hoeffding Tree
§  Clustering: CluStream
§  Regression: Adaptive
Model Rules
§  Frequent pattern mining:
PARMA
Closing
Flink-ML Outlook
§  Support more algorithms
§  Support for distributed linear algebra
§  Integration with streaming machine learning
§  Interactive programs and Zeppelin
flink.apache.org
@ApacheFlink

Machine Learning with Apache Flink at Stockholm Machine Learning Group
