IBM SparkTechnology Center
Apache Big Data Seville 2016
Apache SystemML
Declarative Machine Learning
Luciano Resende
IBM | Spark Technology Center
About Me
Luciano Resende (lresende@apache.org)
• Architect and community liaison at IBM – Spark Technology Center
• Has been contributing to open source at the ASF for over 10 years
• Currently contributing to the Apache Bahir, Apache Spark, Apache Zeppelin, and Apache SystemML (incubating) projects
@lresende1975
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
http://slideshare.net/luckbr1975
lresende
Origins of the SystemML Project
2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop.
2009: A dedicated team for scalable ML was created.
2009-2010: Through engagements with customers, we observed how data scientists create machine learning algorithms.
State-of-the-Art: Small Data
[Diagram labels: Data Scientist, R or Python, Personal Computer, Data, Results]
State-of-the-Art: Big Data
[Diagram labels: Data Scientist, R or Python, Systems Programmer, Scala, Results]
😞 Days or weeks per iteration
😞 Errors while translating algorithms
The SystemML Vision
[Diagram labels: Data Scientist, R or Python, SystemML, Results]
😃 Fast iteration
😃 Same answer
Running Example: Alternating Least Squares
Problem: Movie Recommendations
[Diagram: a sparse Users × Movies matrix; a nonzero entry (i, j) means user i liked movie j.]
Multiply the Users factor and the Movies factor to produce a less-sparse matrix.
New nonzero values become movie suggestions.
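The idea on this slide can be sketched in a few lines of plain Python. This is only an illustration: the tiny matrices, the `suggest` helper, and the 0.5 threshold are assumptions made for the example and do not come from the talk.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def suggest(X, U, V, threshold=0.5):
    """Return (user, movie) pairs that are empty in X but score above
    the (assumed) threshold in the product of the two factors."""
    P = matmul(U, V)
    return [(i, j)
            for i, row in enumerate(X)
            for j, liked in enumerate(row)
            if liked == 0 and P[i][j] > threshold]


# Hypothetical data: 3 users x 3 movies, rank-2 factors.
X = [[1, 1, 0],   # user 0 liked movies 0 and 1
     [1, 0, 0],   # user 1 liked movie 0 only
     [0, 0, 1]]   # user 2 liked movie 2 only
U = [[1, 0], [1, 0], [0, 1]]    # Users factor (users x rank)
V = [[1, 1, 0], [0, 0, 1]]      # Movies factor (rank x movies)

print(suggest(X, U, V))  # [(1, 1)]: movie 1 is suggested to user 1
```

Users 0 and 1 share the same row in the Users factor, so the product fills in the one movie user 0 liked that user 1 has not rated yet.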
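Alternating Least Squares (in R)
# Assumed to be defined before this excerpt: X (ratings matrix), W (weight matrix),
# r (rank), lambda (regularization), mi and mii (outer and inner iteration limits),
# and the initial values of i and is_U.
U = rand(nrow(X), r, min = -1.0, max = 1.0);   # random initial Users factor
V = rand(r, ncol(X), min = -1.0, max = 1.0);   # random initial Movies factor
while(i < mi) {
  i = i + 1; ii = 1;
  # Gradient with respect to the factor being updated in this pass
  if (is_U)
    G = (W * (U %*% V - X)) %*% t(V) + lambda * U;
  else
    G = t(U) %*% (W * (U %*% V - X)) + lambda * V;
  norm_G2 = sum(G ^ 2); norm_R2 = norm_G2;
  R = -G; S = R;
  # Inner conjugate-gradient loop: improve one factor while the other is held constant
  while(norm_R2 > 10E-9 * norm_G2 & ii <= mii) {
    if (is_U) {
      HS = (W * (S %*% V)) %*% t(V) + lambda * S;
      alpha = norm_R2 / sum (S * HS);
      U = U + alpha * S;
    } else {
      HS = t(U) %*% (W * (U %*% S)) + lambda * S;
      alpha = norm_R2 / sum (S * HS);
      V = V + alpha * S;
    }
    R = R - alpha * HS;
    old_norm_R2 = norm_R2; norm_R2 = sum(R ^ 2);
    S = R + (norm_R2 / old_norm_R2) * S;
    ii = ii + 1;
  }
  is_U = ! is_U;   # switch to the other factor for the next pass
}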
Alternating Least Squares (in R)
1. Start with random factors.
2. Hold the Movies factor constant and find the best value for the Users factor (the value that most closely approximates the original matrix).
3. Hold the Users factor constant and find the best value for the Movies factor.
4. Repeat steps 2-3 until convergence.
Every line has a clear purpose!
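One way to read the script is as alternating minimization of a weighted, regularized squared error. The objective below is inferred from the gradient expressions in the code, not stated on the slides, so treat it as an assumption. Here \circ is the elementwise product (`*` in R) and \langle A, B \rangle is the sum of elementwise products (`sum(A * B)`).

```latex
\[
f(U, V) = \tfrac{1}{2} \sum_{i,j} W_{ij} \bigl( (UV)_{ij} - X_{ij} \bigr)^2
        + \tfrac{\lambda}{2} \bigl( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \bigr)
\]
\[
\nabla_U f = \bigl( W \circ (UV - X) \bigr) V^{\top} + \lambda U ,
\qquad
\nabla_V f = U^{\top} \bigl( W \circ (UV - X) \bigr) + \lambda V
\]
% Inner loop when updating U (the V case is symmetric), starting from R = S = -\nabla_U f:
\[
HS = \bigl( W \circ (SV) \bigr) V^{\top} + \lambda S ,
\qquad
\alpha = \frac{\lVert R \rVert_F^2}{\langle S, HS \rangle}
\]
\[
U \leftarrow U + \alpha S ,
\qquad
R_{\text{new}} = R - \alpha \, HS ,
\qquad
S \leftarrow R_{\text{new}} + \frac{\lVert R_{\text{new}} \rVert_F^2}{\lVert R \rVert_F^2} \, S
\]
```

The two gradient formulas match the two assignments to `G`, and the inner-loop updates match `HS`, `alpha`, `R`, and `S` in the script term by term.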
Alternating Least Squares (spark.ml)
25 lines’ worth of algorithm…
…mixed with 800 lines of performance code
Alternating Least Squares (in R)
SystemML can compile and run this algorithm at scale, as written above in SystemML’s subset of R.
No additional performance code needed!
How fast does it run?
Running time comparisons between machine learning algorithms are problematic
• Different, equally valid answers
• Different convergence rates on different data
• But we’ll do one anyway
Performance Comparison: ALS
[Bar chart: running time in seconds (axis from 0 to 20,000) for R, MLlib, and SystemML on inputs of 1.2GB (sparse binary), 12GB, and 120GB. Two bars are labeled ">24h" and two are labeled "OOM"; the transcript does not preserve which bars these are.]
Synthetic data, 0.01 sparsity, 10^5 products × {10^5, 10^6, 10^7} users. Data generated by multiplying two rank-50 matrices of normally-distributed data, sampling from the resulting product, then adding Gaussian noise. Cluster of 6 servers with 12 cores and 96GB of memory per server. Number of iterations tuned so that all algorithms produce comparable result quality.
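The data-generation recipe in the caption can be sketched as follows. This is a scaled-down illustration in plain Python, not the benchmark's actual generator: the function name, the seed, the noise level, and the tiny dimensions are all assumptions.

```python
import random


def synthetic_ratings(n_products, n_users, rank, sparsity, noise_std, seed=0):
    """Multiply two rank-`rank` matrices of normally distributed values,
    keep each cell with probability `sparsity`, and add Gaussian noise.
    Returns a dict mapping (product, user) to the noisy value."""
    rng = random.Random(seed)
    left = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(n_products)]
    right = [[rng.gauss(0.0, 1.0) for _ in range(n_users)] for _ in range(rank)]
    cells = {}
    for i in range(n_products):
        for j in range(n_users):
            if rng.random() < sparsity:  # sample from the product
                value = sum(left[i][k] * right[k][j] for k in range(rank))
                cells[(i, j)] = value + rng.gauss(0.0, noise_std)
    return cells


# Toy sizes; the talk used rank 50 and 10^5 products by up to 10^7 users.
data = synthetic_ratings(n_products=100, n_users=200, rank=5,
                         sparsity=0.01, noise_std=0.1)
print(len(data), "nonzero cells out of", 100 * 200)
```

Only the sampled cells are ever computed, so the full dense product never has to be materialized.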
Takeaway Points
SystemML runs the R script in parallel
• Same answer as original R script
• Performance is comparable to a low-level RDD-based implementation
How does SystemML achieve this result?
The SystemML Runtime for Spark
Automates critical performance decisions
• Distributed or local computation?
• How to partition the data?
• To persist or not to persist?
Distributed vs local: Hybrid runtime
• Multithreaded computation in Spark Driver
• Distributed computation in Spark Executors
• Optimizer makes a cost-based choice
High-Level Operations (HOPs): general representation of statements in the data analysis language
Low-Level Operations (LOPs): general representation of operations in the runtime framework
[Diagram labels: high-level language front-ends, Cost-Based Optimizer, multiple execution environments]
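The local-versus-distributed decision can be illustrated with a toy memory-based rule. This is a hypothetical sketch, not SystemML's optimizer: the function names, the byte constants, and the single "does it fit in the driver" test are assumptions chosen to show what a cost-based choice looks like.

```python
BYTES_PER_DOUBLE = 8
BYTES_PER_SPARSE_CELL = 12  # assumed: 8-byte value plus 4-byte column index


def estimate_bytes(rows, cols, sparsity):
    """Rough in-memory size of a matrix, taking the cheaper of a dense
    and a sparse layout."""
    dense = rows * cols * BYTES_PER_DOUBLE
    sparse = int(rows * cols * sparsity) * BYTES_PER_SPARSE_CELL
    return min(dense, sparse)


def choose_execution(operands, driver_budget_bytes):
    """Pick 'local' when all operands fit in the driver's memory budget,
    otherwise 'distributed'. Operands are (rows, cols, sparsity) tuples."""
    total = sum(estimate_bytes(r, c, s) for r, c, s in operands)
    return "local" if total <= driver_budget_bytes else "distributed"


budget = 4 * 1024**3                     # hypothetical 4 GiB driver budget
small = [(1_000, 1_000, 1.0)]            # dense 1000 x 1000 matrix
large = [(10**7, 10**5, 0.01)]           # sparse 10^7 x 10^5 matrix

print(choose_execution(small, budget))   # local
print(choose_execution(large, budget))   # distributed
```

A real optimizer would also weigh compute and data-transfer costs; the point here is only that the choice is made from size estimates rather than hard-coded by the script author.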
But wait, there’s more!
Many other rewrites
Cost-based selection of physical operators
Dynamic recompilation for accurate stats
Parallel FOR (ParFor) optimizer
Direct operations on RDD partitions
YARN and MapReduce support
Summary
Cost-based compilation of machine learning algorithms generates execution plans
• for single-node in-memory, cluster, and hybrid execution
• for varying data characteristics:
  – varying number of observations (1,000s to 10s of billions), number of variables (10s to 10s of millions), dense and sparse data
• for varying cluster characteristics (memory configurations, degree of parallelism)
Out-of-the-box, scalable machine learning algorithms
• e.g. descriptive statistics, regression, clustering, and classification
"Roll-your-own" algorithms
• Enable programmer productivity (no worry about scalability, numeric stability, and optimizations)
• Fast turn-around for new algorithms
Higher-level language shields algorithm development investment from platform progression
• YARN for resource negotiation and elasticity
• Spark for in-memory, iterative processing
Algorithms
Category | Description
Descriptive Statistics | Univariate; Bivariate; Stratified Bivariate
Classification | Logistic Regression (multinomial); Multi-Class SVM; Naïve Bayes (multinomial); Decision Trees; Random Forest
Clustering | k-Means
Regression | Linear Regression: system of equations, CG (conjugate gradient)
Regression | Generalized Linear Models (GLM): Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli. Links for all distributions: identity, log, sq. root, inverse, 1/μ^2. Links for Binomial / Bernoulli: logit, probit, cloglog, cauchit
Regression | Stepwise: Linear, GLM
Dimension Reduction | PCA
Matrix Factorization | ALS: direct solve, CG (conjugate gradient descent)
Survival Models | Kaplan Meier Estimate; Cox Proportional Hazard Regression
Predict | Algorithm-specific scoring
Transformation (native) | Recoding, dummy coding, binning, scaling, missing value imputation
PMML models | lm, kmeans, svm, glm, mlogit
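Two of the transformations listed in the table, recoding and dummy coding, are easy to show on a toy column. The sketch below is plain Python for illustration only; the helper names and the first-seen code assignment are assumptions and do not describe how SystemML's transform functions assign codes.

```python
def recode(values):
    """Map each distinct category to an integer code (1, 2, 3, ...),
    assigned in order of first appearance (an assumption of this sketch)."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    return [codes[v] for v in values], codes


def dummy_code(coded, n_levels):
    """Turn integer codes into one 0/1 indicator column per level."""
    return [[1 if c == level else 0 for level in range(1, n_levels + 1)]
            for c in coded]


genres = ["drama", "comedy", "drama", "action"]   # hypothetical column
coded, mapping = recode(genres)
print(coded)                             # [1, 2, 1, 3]
print(dummy_code(coded, len(mapping)))   # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Recoding makes a categorical column numeric; dummy coding then avoids implying an order between categories.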
Live Demo
Demo – Movie Recommendation
The demo environment
https://github.com/lresende/docker-systemml-notebook
Docker image: lresende/systemml
[Diagram labels: three Executors]
Demo – Movie Recommendation
The Netflix Data Set
• Movies
  Movie | Year | Description
  1 | 2003 | Dinosaur Planet
• Historical Ratings (training set)
  Movie | User | Rating | Date
  1 | 30878 | 4 | 2005-12-26
What’s new in SystemML
VLDB 2016 Best Paper Award
VLDB 2016 Best Paper and Demonstration
Read "Compressed Linear Algebra for Large-Scale Machine Learning":
http://www.vldb.org/pvldb/vol9/p960-elgohary.pdf
SystemML 0.11-incubating Release
Features
• SystemML frames
• New MLContext API
• Transform functions based on SystemML frames
• Various bug fixes
Experimental Features / Algorithms
• New built-in functions for deep learning (convolution and pooling)
• Deep learning library (DML-bodied functions)
• Python DSL integration
• GPU support
• Compressed Linear Algebra
SystemML 0.11-incubating Release
New Algorithms
• Lasso
• kNN
• Lanczos
• PPCA
Deep Learning Algorithms
• CNN (LeNet)
• RBM
New SystemML Website
SystemML use cases
Using deep learning to assess tumor proliferation, by Mike Dusenberry
[Diagram labels: Whole-Slide Image, Sample Image, Deep ConvNet, Tumor Score]
Come contribute to SystemML
Apache SystemML
SystemML is open source!
• Announced in June 2015
• Available on GitHub since September 1
• First open-source binary release (0.8.0) in October 2015
• Entered Apache incubation in November 2015
• First Apache open-source binary release (0.9) available now
• Latest 0.11-incubating release came out a couple of days ago
We are actively seeking contributors and users!
References
SystemML
http://systemml.apache.org
DML (R) Language Reference
https://apache.github.io/incubator-systemml/dml-language-reference.html
Algorithms Reference
http://systemml.apache.org/algorithms
Runtime Reference
https://apache.github.io/incubator-systemml/#running-systemml
