SystemML - Declarative Machine Learning

IBM Spark Technology Center
Apache Big Data Seville 2016
Apache SystemML
Declarative Machine Learning
Luciano Resende
IBM | Spark Technology Center

About Me
Luciano Resende (lresende@apache.org)
• Architect and community liaison at IBM – Spark Technology Center
• Contributing to open source at the ASF for over 10 years
• Currently contributing to Apache Bahir, Apache Spark, Apache Zeppelin, and Apache SystemML (incubating)
@lresende1975
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
http://slideshare.net/luckbr1975
lresende

Origins of the SystemML Project
2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop.
2009: A dedicated team for scalable ML was created.
2009-2010: Through engagements with customers, we observed how data scientists create machine learning algorithms.

State-of-the-Art: Small Data
[Diagram: a Data Scientist works in R or Python on a Personal Computer; Data goes in, Results come out.]

State-of-the-Art: Big Data
[Diagram: the Data Scientist writes R or Python; a Systems Programmer translates it to Scala to produce Results.]
😞 Days or weeks per iteration
😞 Errors while translating algorithms

The SystemML Vision
[Diagram: the Data Scientist writes R or Python; SystemML produces the Results.]
😃 Fast iteration
😃 Same answer

Running Example:
Alternating Least Squares
Problem: Movie Recommendations
• Input: a sparse Users × Movies matrix in which a nonzero at (i, j) means user i liked movie j.
• Factor it into a Users factor and a Movies factor.
• Multiply these two factors to produce a less-sparse matrix.
• New nonzero values become movie suggestions.
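A toy illustration of the last two bullets, in plain Python. The 3 × 4 matrix and the rank-2 factors below are hand-picked for the example (they are not learned and not from the talk), and the helper names `matmul` and `suggest` and the 0.5 threshold are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def suggest(X, U, V, threshold=0.5):
    """Map each user to the movies that are 0 in X but score above threshold in U x V."""
    P = matmul(U, V)
    return {
        i: [j for j, score in enumerate(P[i]) if X[i][j] == 0 and score > threshold]
        for i in range(len(X))
    }

# Rows are users, columns are movies; 1 means "user i liked movie j".
X = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
# Hand-picked factors (assumption): two latent "tastes".
U = [[0.6, 0.6], [1.0, 0.0], [0.0, 1.0]]   # Users factor, 3 x 2
V = [[1, 1, 0, 0], [0, 0, 1, 1]]           # Movies factor, 2 x 4

print(suggest(X, U, V))  # {0: [1, 3], 1: [], 2: []}
```

User 0 liked one movie of each taste, so the product fills in the other movie of each taste as a suggestion.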

Alternating Least Squares (in R)
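# Inputs assumed to be defined by the caller: X (ratings), W (weight matrix, same
# shape as X), r (rank), lambda (regularization), mi / mii (max outer / inner iterations).
U = rand(nrow(X), r, min = -1.0, max = 1.0);   # random initial Users factor
V = rand(r, ncol(X), min = -1.0, max = 1.0);   # random initial Movies factor
i = 0; is_U = TRUE;                            # assumed initialization: update U first
while(i < mi) {
  i = i + 1; ii = 1;
  # Gradient with respect to the factor being updated
  if (is_U)
    G = (W * (U %*% V - X)) %*% t(V) + lambda * U;
  else
    G = t(U) %*% (W * (U %*% V - X)) + lambda * V;
  norm_G2 = sum(G ^ 2); norm_R2 = norm_G2;
  R = -G; S = R;
  # Inner loop: conjugate gradient steps on the current factor
  while(norm_R2 > 10E-9 * norm_G2 & ii <= mii) {
    if (is_U) {
      HS = (W * (S %*% V)) %*% t(V) + lambda * S;
      alpha = norm_R2 / sum (S * HS);
      U = U + alpha * S;
    } else {
      HS = t(U) %*% (W * (U %*% S)) + lambda * S;
      alpha = norm_R2 / sum (S * HS);   # step size, computed as in the U branch
      V = V + alpha * S;
    }
    R = R - alpha * HS;
    old_norm_R2 = norm_R2; norm_R2 = sum(R ^ 2);
    S = R + (norm_R2 / old_norm_R2) * S;
    ii = ii + 1;
  }
  is_U = ! is_U;   # alternate between the two factors
}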
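The inner loop of the listing above follows the conjugate gradient pattern. As a minimal sketch, here is the same pattern in plain Python, solving a small system H x = b. The function name `conjugate_gradient` and the 2 × 2 example are hypothetical; the variable names R, S, HS, and alpha mirror the listing.

```python
def conjugate_gradient(H, b, max_iter=50, tol=10e-9):
    """Solve H x = b for a symmetric positive-definite H given as a list of rows."""
    x = [0.0] * len(b)
    R = list(b)                     # residual b - H x, with x = 0
    S = list(R)                     # first search direction
    norm_R2 = sum(r * r for r in R)
    norm_G2 = norm_R2               # reference norm for the relative stopping test
    ii = 0
    while norm_R2 > tol * norm_G2 and ii < max_iter:
        HS = [sum(h * s for h, s in zip(row, S)) for row in H]
        alpha = norm_R2 / sum(s * hs for s, hs in zip(S, HS))    # step size
        x = [xi + alpha * si for xi, si in zip(x, S)]
        R = [ri - alpha * hsi for ri, hsi in zip(R, HS)]
        old_norm_R2, norm_R2 = norm_R2, sum(r * r for r in R)
        S = [ri + (norm_R2 / old_norm_R2) * si for ri, si in zip(R, S)]
        ii += 1
    return x

# Example system: 4x + y = 1, x + 3y = 2, whose exact solution is (1/11, 7/11).
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In the ALS listing the role of H is played by the operator that maps S to HS, so H is never built explicitly.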

1. Start with random factors.
2. Hold the Movies factor constant and
find the best value for the Users factor.
(Value that most closely approximates the original matrix)
3. Hold the Users factor constant and find
the best value for the Movies factor.
4. Repeat steps 2-3 until convergence.
[Slide graphic: the loop from the listing above, repeated with callouts 1-4 that tie each numbered step to the lines implementing it.]
Every line has a clear purpose!
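Steps 1-4 can also be sketched in plain Python for the simplest case. This is not the talk's algorithm: it assumes rank-1 factors, so each "find the best value" step has a closed-form solution and no conjugate gradient loop is needed. The names `loss` and `als_rank1` and the default parameters are hypothetical.

```python
import random

def loss(X, W, u, v, lam):
    """Weighted squared error over all cells plus L2 regularization of both factors."""
    err = sum(W[i][j] * (u[i] * v[j] - X[i][j]) ** 2
              for i in range(len(u)) for j in range(len(v)))
    return err + lam * (sum(a * a for a in u) + sum(b * b for b in v))

def als_rank1(X, W, lam=0.1, max_iter=50, tol=1e-9, seed=0):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Step 1: start with random factors.
    u = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    history = [loss(X, W, u, v, lam)]
    for _ in range(max_iter):
        # Step 2: hold the Movies factor v constant, solve for each u[i] exactly.
        u = [sum(W[i][j] * X[i][j] * v[j] for j in range(n)) /
             (sum(W[i][j] * v[j] ** 2 for j in range(n)) + lam)
             for i in range(m)]
        # Step 3: hold the Users factor u constant, solve for each v[j] exactly.
        v = [sum(W[i][j] * X[i][j] * u[i] for i in range(m)) /
             (sum(W[i][j] * u[i] ** 2 for i in range(m)) + lam)
             for j in range(n)]
        history.append(loss(X, W, u, v, lam))
        # Step 4: repeat until the loss stops improving.
        if history[-2] - history[-1] < tol:
            break
    return u, v, history

X = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
u, v, history = als_rank1(X, W=X)   # weights = observed cells (an assumption)
```

Each half-step minimizes the loss exactly over one factor, so the recorded loss never increases from one sweep to the next.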

Alternating Least Squares (spark.ml)

25 lines’ worth of algorithm…
…mixed with 800 lines of performance code

SystemML can compile and run this algorithm at scale (written in SystemML’s subset of R).
No additional performance code needed!

How fast does it run?
Running time comparisons between machine learning algorithms are problematic
• Different, equally valid answers
• Different convergence rates on different data
• But we’ll do one anyway

Performance Comparison: ALS
[Bar chart: running time (sec), axis 0 to 20,000, for R, MLlib, and SystemML at three data sizes: 1.2GB (sparse binary), 12GB, and 120GB. Bar annotations on the chart: >24h (twice) and OOM (twice).]
Synthetic data, 0.01 sparsity, 10^5 products × {10^5, 10^6, 10^7} users. Data generated by multiplying two rank-50 matrices of normally distributed data, sampling from the resulting product, then adding Gaussian noise. Cluster of 6 servers with 12 cores and 96GB of memory per server. Number of iterations tuned so that all algorithms produce comparable result quality. Details:
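For readers who want to reproduce the shape of this data set at toy scale, here is a minimal sketch of the generation recipe described above. The function name `synthetic_ratings`, the standard-normal factors, the per-cell sampling rule, and the noise level are all assumptions; the talk does not give these details.

```python
import random

def synthetic_ratings(n_users, n_products, rank, sparsity, noise_sd, seed=0):
    """Return {(user, product): value} sampled from a noisy low-rank product."""
    rng = random.Random(seed)
    # Two rank-`rank` factor matrices of normally distributed data.
    A = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(n_users)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(n_products)]
    cells = {}
    for i in range(n_users):
        for j in range(n_products):
            # Keep each cell of the product with probability `sparsity` (assumed rule).
            if rng.random() < sparsity:
                value = sum(a * b for a, b in zip(A[i], B[j]))
                cells[(i, j)] = value + rng.gauss(0.0, noise_sd)   # add Gaussian noise
    return cells

# Toy scale: 200 users x 100 products instead of 10^5 or more.
cells = synthetic_ratings(200, 100, rank=5, sparsity=0.01, noise_sd=0.1)
```

The double loop is only practical at toy sizes; at the sizes in the chart the generation itself would need to be distributed.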

Takeaway Points
SystemML runs the R script in parallel
• Same answer as original R script
• Performance is comparable to a low-level RDD-based implementation
How does SystemML achieve this result?

The SystemML Runtime for Spark
Automates critical performance decisions
• Distributed or local computation?
• How to partition the data?
• To persist or not to persist?
Distributed vs local: Hybrid runtime
• Multithreaded computation in Spark Driver
• Distributed computation in Spark Executors
• Optimizer makes a cost-based choice
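To make "cost-based choice" concrete, here is a deliberately simplified sketch of one possible rule: run an operation in the driver when the estimated size of its operands fits a memory budget, otherwise distribute it. This is an illustration only, not SystemML's optimizer; the function names, the bytes-per-cell constants, and the budget are hypothetical.

```python
def estimate_bytes(rows, cols, sparsity):
    """Rough in-memory size: 8 bytes per dense cell, ~16 bytes per sparse nonzero (assumed)."""
    dense = 8 * rows * cols
    sparse = 16 * rows * cols * sparsity
    return min(dense, sparse)

def choose_backend(operands, driver_budget_bytes):
    """operands: (rows, cols, sparsity) for each input and output of one operation."""
    total = sum(estimate_bytes(r, c, s) for r, c, s in operands)
    return "local" if total <= driver_budget_bytes else "distributed"

GB = 2 ** 30
small = choose_backend([(1000, 1000, 1.0)], driver_budget_bytes=1 * GB)
large = choose_backend([(10**7, 10**5, 0.01)], driver_budget_bytes=64 * GB)
print(small, large)  # local distributed
```

The point of the sketch is that the decision is made per operation from size estimates, so one script can mix driver-side and executor-side work.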
High-Level Operations (HOPs): general representation of statements in the data analysis language
Low-Level Operations (LOPs): general representation of operations in the runtime framework
[Architecture diagram labels: high-level language front-ends, cost-based optimizer, multiple execution environments]

But wait, there’s more!
Many other rewrites
Cost-based selection of physical operators
Dynamic recompilation for accurate stats
Parallel FOR (ParFor) optimizer
Direct operations on RDD partitions
YARN and MapReduce support

Summary
Cost-based compilation of machine learning algorithms generates execution plans
• for single-node in-memory, cluster, and hybrid execution
• for varying data characteristics:
– varying number of observations (1,000s to 10s of billions), number of variables (10s to 10s of millions), dense and sparse data
• for varying cluster characteristics (memory configurations, degree of parallelism)
Out-of-the-box, scalable machine learning algorithms
• e.g. descriptive statistics, regression, clustering, and classification
"Roll-your-own" algorithms
• Enable programmer productivity (no worry about scalability, numeric stability, and optimizations)
• Fast turn-around for new algorithms
Higher-level language shields algorithm development investment from platform progression
• YARN for resource negotiation and elasticity
• Spark for in-memory, iterative processing

Algorithms
Category: Description
Descriptive Statistics: Univariate; Bivariate; Stratified Bivariate
Classification: Logistic Regression (multinomial); Multi-Class SVM; Naïve Bayes (multinomial); Decision Trees; Random Forest
Clustering: k-Means
Regression:
  Linear Regression: system of equations; CG (conjugate gradient)
  Generalized Linear Models (GLM):
    Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli
    Links for all distributions: identity, log, sq. root, inverse, 1/μ^2
    Links for Binomial / Bernoulli: logit, probit, cloglog, cauchit
  Stepwise: Linear; GLM
Dimension Reduction: PCA
Matrix Factorization: ALS: direct solve; CG (conjugate gradient descent)
Survival Models: Kaplan Meier Estimate; Cox Proportional Hazard Regression
Predict: Algorithm-specific scoring
Transformation (native): Recoding, dummy coding, binning, scaling, missing value imputation
PMML models: lm, kmeans, svm, glm, mlogit

Live Demo

Demo – Movie Recommendation
The demo environment
https://github.com/lresende/docker-systemml-notebook
Docker image: lresende/systemml
[Diagram: three Executor nodes]

The Netflix Data Set
• Movies
  Movie  Year  Description
  1      2003  Dinosaur Planet
• Historical Ratings (training set)
  Movie  User   Rating  Date
  1      30878  4       2005-12-26
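As a small sketch of how rating records like the one above could be turned into the Users × Movies structure used in the ALS example, assuming each record is already parsed into a (movie, user, rating, date) tuple. The function name `build_user_movie_matrix` is hypothetical, and the on-disk format of the data set is not covered here.

```python
from collections import defaultdict

def build_user_movie_matrix(ratings):
    """Turn (movie_id, user_id, rating, date) tuples into {user_id: {movie_id: rating}}."""
    matrix = defaultdict(dict)
    for movie_id, user_id, rating, _date in ratings:
        matrix[user_id][movie_id] = rating   # the date is not needed for the matrix
    return dict(matrix)

# The single training row shown on the slide.
rows = [(1, 30878, 4, "2005-12-26")]
print(build_user_movie_matrix(rows))  # {30878: {1: 4}}
```

A nested dictionary keeps only the observed cells, which matters because almost all user and movie pairs have no rating.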

What’s new in SystemML

VLDB 2016 Best Paper Award
VLDB 2016 Best Paper and Demonstration
Read “Compressed Linear Algebra for Large-Scale Machine Learning”:
http://www.vldb.org/pvldb/vol9/p960-elgohary.pdf

SystemML 0.11-incubating Release
Features
• SystemML frames
• New MLContext API
• Transform functions based on SystemML frames
• Various bug fixes
Experimental Features / Algorithms
• New built-in functions for deep learning (convolution and pooling)
• Deep learning library (DML-bodied functions)
• Python DSL integration
• GPU support
• Compressed Linear Algebra

SystemML 0.11-incubating Release
New Algorithms
• Lasso
• kNN
• Lanczos
• PPCA
Deep Learning Algorithms
• CNN (LeNet)
• RBM

New SystemML Website

SystemML use cases
Using deep learning to assess tumor proliferation, by Mike Dusenberry
[Diagram: Whole-Slide Image → Sample Image → Deep ConvNet → Tumor Score]

Come contribute to SystemML

Apache SystemML
SystemML is open source!
• Announced in June 2015
• Available on GitHub since September 1
• First open-source binary release (0.8.0) in October 2015
• Entered Apache incubation in November 2015
• First Apache open-source binary release (0.9) available now
• Latest 0.11-incubating release came out a couple of days ago
We are actively seeking contributors and users!

References
SystemML
http://systemml.apache.org
DML (R) Language Reference
https://apache.github.io/incubator-systemml/dml-language-reference.html
Algorithms Reference
http://systemml.apache.org/algorithms
Runtime Reference
https://apache.github.io/incubator-systemml/#running-systemml
