Machine Learning with Apache Flink at Stockholm Machine Learning Group

Till Rohrmann
Flink committer
trohrmann@apache.org
@stsffap
Machine Learning
with
Apache Flink

What is Flink
§  Large-scale data processing engine
§  Easy and powerful APIs for batch and real-time
streaming analysis (Java / Scala)
§  Backed by a very robust execution backend
•  with true streaming capabilities,
•  custom memory manager,
•  native iteration execution,
•  and a cost-based optimizer.
2

Technology inside Flink
§  Technology inspired by compilers +
MPP databases + distributed systems
§  For ease of use, reliable performance,
and scalability
case
class
Path
(from:
Long,
to:
Long)

val
tc
=
edges.iterate(10)
{

paths:
DataSet[Path]
=>

val
next
=
paths

.join(edges)

.where("to")

.equalTo("from")
{

(path,
edge)
=>

Path(path.from,
edge.to)

}

.union(paths)

.distinct()

next

}

Cost-based
optimizer
Type extraction
stack
Memory
manager
Out-of-core
algos
real-time
streaming
Task
scheduling
Recovery
metadata
Data
serialization
stack
Streaming
network
stack
...
Pre-ﬂight
(client) Master
Workers

Example: WordCount
5
case
class
Word
(word:
String,
frequency:
Int)

val
env
=
ExecutionEnvironment.getExecutionEnvironment()

val
lines
=
env.readTextFile(...)

lines

.flatMap
{line
=>
line.split("
").map(word
=>
Word(word,1))}

.groupBy("word").sum("frequency”)

.print()

env.execute()

Flink has mirrored Java and Scala APIs that offer the same
functionality, including by-name addressing.

Flink API in a Nutshell
§  map, ﬂatMap, ﬁlter,
groupBy, reduce,
reduceGroup,
aggregate, join,
coGroup, cross,
project, distinct, union,
iterate, iterateDelta, ...
§  All Hadoop input
formats are supported
§  API similar for data sets
and data streams with
slightly different
operator semantics
§  Window functions for
data streams
§  Counters,
accumulators, and
broadcast variables
6

Machine learning pipelines
§  Pipelining inspired by scikit-learn
§  Transformer: Modify data
§  Learner: Train a model
§  Reusable components
§  Let’s you quickly build ML pipelines
§  Model inherits pipeline of learner
10

Linear regression in polynomial space
val
polynomialBase
=
PolynomialBase()

val
learner
=
MultipleLinearRegression()

val
pipeline
=
polynomialBase.chain(learner)

val
trainingDS
=
env.fromCollection(trainingData)

val
parameters
=
ParameterMap()

.add(PolynomialBase.Degree,
3)

.add(MultipleLinearRegression.Stepsize,
0.002)

.add(MultipleLinearRegression.Iterations,
100)

val
model
=
pipeline.fit(trainingDS,
parameters)

11
Input
Data

Polynomial

Base

Mapper

Mul4ple

Linear

Regression

Linear

Model

Current state of Flink-ML
§  Existing learners
•  Multiple linear regression
•  Alternating least squares
•  Communication efﬁcient distributed dual
coordinate ascent (PR pending)
§  Feature transformer
•  Polynomial base feature mapper
§  Tooling
12

Distributed linear algebra
§  Linear algebra universal
language for data
analysis
§  High-level abstraction
§  Fast prototyping
§  Pre- and post-processing
step
13

Example: Gaussian non-negative matrix
factorization
§  Given input matrix V, ﬁnd W and H such
that
§  Iterative approximation
14
Ht+1 = Ht ∗ Wt
T
V /Wt
T
Wt Ht( )
Wt+1 = Wt ∗ VHt+1
T
/Wt Ht+1Ht+1
T
( )
V ≈ WH
var
i
=
0

var
H:
CheckpointedDrm[Int]
=
randomMatrix(k,
V.numCols)

var
W:
CheckpointedDrm[Int]
=
randomMatrix(V.numRows,
k)

while(i
<
maxIterations)
{

H
=
H
*
(W.t
%*%
V
/
W.t
%*%
W
%*%
H)

W
=
W
*
(V
%*%

H.t
/
W
%*%
H
%*%
H.t)

i
+=
1

}

Why is Flink a good ﬁt for ML?
15

Flink’s features
§  Stateful iterations
•  Keep state across iterations
§  Delta iterations
•  Limit computation to elements which matter
§  Pipelining
•  Avoiding materialization of large
intermediate state
16

CoCoA
17
minw∈Rd P(w):=
λ
2
w
2
+
1
n
ℓi wT
xi( )
i=1
n
∑
#
$
%
&
'
(

Bulk Iterations
18
partial
solution
partial
solutionX
other
datasets
Y
initial
solution
iteration
result
Replace
Step function

Delta iterations
19
partial
solution
delta
setX
other
datasets
Y
initial
solution
iteration
result
workset A B workset
Merge deltas
Replace
initial
workset

Effect of delta iterations
0
5000000
10000000
15000000
20000000
25000000
30000000
35000000
40000000
45000000
1 6 11 16 21 26 31 36 41 46 51 56 61
#ofelementsupdated
iteration

Iteration performance
21
0
10
20
30
40
50
60
Hadoop Flink bulk Flink delta
Time(minutes)
61 iterations and 30 iterations of
PageRank on a Twitter follower
graph with Hadoop MapReduce
and Flink using bulk and delta
iterations
30 iterations
61 iterations
MapReduce

How to factorize really large
matrices?
22

Collaborative Filtering
§  Recommend items based on users with
similar preferences
§  Latent factor models capture underlying
characteristics of items and preferences
of user
§  Predicted preference:
23
ˆru,i = xu
T
yi

Matrix factorization
24
minX,Y ru,i − xu
T
yi( )
2
+ λ nu xu
2
+ ni yi
2
i
∑
u
∑
#
$
%
&
'
(
ru,i≠0
∑
R ≈ XT
Y
R
X
Y

Alternating least squares
§  Fixing one matrix gives a quadratic form
§  Solution guarantees to decrease overall
cost function
§  To calculate , all rated item vectors and
ratings are needed
25
xu = YSu
YT
+ λnuΙ( )
−1
Yru
T
Sii
u
=
1 if ru,i ≠ 0
0 else
"
#
$
%$
xu

Naïve ALS
case
class
Rating(userID:
Int,
itemID:
Int,
rating:
Double)

case
class
ColumnVector(columnIndex:
Int,
vector:
Array[Double])

val
items:
DataSet[ColumnVector]
=
_

val
ratings:
DataSet[Rating]
=
_

//
Generate
tuples
of
items
with
their
ratings

val
uVA
=
items.join(ratings).where(0).equalTo(1)
{

(item,
ratingEntry)
=>
{

val
Rating(uID,
_,
rating)
=
ratingEntry

(uID,
rating,
item.vector)

}

}

27

Naïve ALS contd.
uVA.groupBy(0).reduceGroup
{

vectors
=>
{

var
uID
=
-‐1

val
matrix
=
FloatMatrix.zeros(factors,
factors)

val
vector
=
FloatMatrix.zeros(factors)

var
n
=
0

for((id,
rating,
v)
<-‐
vectors)
{

uID
=
id

vector
+=
rating
*
v

matrix
+=
outerProduct(v
,
v)

n
+=
1

}

for(idx
<-‐
0
until
factors)
{

matrix(idx,
idx)
+=
lambda
*
n

}

new
ColumnVector(uID,
Solve(matrix,
vector))

}

}

28

Problems of naïve ALS
§  Problem:
•  Item vectors are sent redundantly à High
network load
§  Solution:
•  Blocking of user and item vectors to share
common data
•  Avoids blown up intermediate state
29

Performance comparison
31
•  40
node
GCE
cluster,
highmem-‐8

•  10
ALS
itera4on
with
50
latent
factors

Runtimeinminutes
0
225
450
675
900
Number of non-zero entries (billion)
0 7.5 15 22.5 30
Blocked ALS Blocked ALS highmem-16 Naive ALS
5.5h
14h
2.5h
1h
Table 2
Entries in billion Naive Join Naive Join Broadcast Broadcast
80 0.08 201.326 3.35543333333333 190.723 3.17871666666667

Why is streaming ML important?
§  Spam detection in mails
§  Patterns might change over time
§  Retraining of model necessary
§  Best solution: Online models
33

Applications
§  Spam detection
§  Recommendation
§  News feed
personalization
§  Credit card fraud
detection
34

Apache SAMOA
§  Scalable Advanced Massive Online
Analysis
§  Distributed streaming machine learning
framework
§  Incubation at the Apache Software
Foundation
§  Runs on multiple streaming processing
engines (S4, Storm, Samza)
§  Support for Flink is pending pull request
35

Supported algorithms
§  Classiﬁcation: Vertical
Hoeffding Tree
§  Clustering: CluStream
§  Regression: Adaptive
Model Rules
§  Frequent pattern mining:
PARMA
36

Flink-ML Outlook
§  Support more algorithms
§  Support for distributed linear algebra
§  Integration with streaming machine learning
§  Interactive programs and Zeppelin
38

ﬂink.apache.org
@ApacheFlink

Machine Learning with Apache Flink at Stockholm Machine Learning Group

More Related Content

What's hot (20)

Similar to Machine Learning with Apache Flink at Stockholm Machine Learning Group (20)

More from Till Rohrmann (17)

Recently uploaded (20)

Machine Learning with Apache Flink at Stockholm Machine Learning Group