Building Custom
Machine Learning Algorithms
with Apache SystemML
Fred Reiss
Chief Architect, IBM Spark Technology Center
Member, IBM Academy of Technology
Roadmap
• What is Apache SystemML?
• Demo!
• How to get SystemML
What is Apache SystemML?
Origins of the SystemML Project
[Timeline: 2007 to 2016. You are here: 2015-2016.]
• 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop.
• 2009: We form a dedicated team for scalable ML.
• 2009-2010: Through engagements with customers, we observe how data scientists create ML solutions.
Case Study: An Auto Manufacturer
• Data: warranty claims, repair history, diagnostic readouts
• Goal: predict reacquired cars
• Pipeline: features and labels feed a machine learning algorithm (the diagram shows several alternative algorithms)
• Result: 25x improvement in precision! (diagram label: false positives)
The Iterative Development Process
• Build a pipeline
• Results good enough?
– Yes: exit the loop
– No: customize part of the pipeline, then check the results again
State-of-the-Art: Small Data
• A data scientist writes R or Python
• The code runs on a personal computer
• Data in, results out
State-of-the-Art: Big Data
• A data scientist writes the algorithm in R or Python
• A systems programmer translates it to Scala
• Only then do results come back
😞 Days or weeks per iteration
😞 Errors while translating algorithms
The SystemML Vision
• A data scientist writes R or Python
• SystemML takes the place of the systems programmer and the Scala rewrite
• Results come back directly
😃 Fast iteration
😃 Same answer
[Timeline: 2007 to 2016]
• 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop.
• 2009: We form a dedicated team for scalable ML.
• 2009-2010: Through engagements with customers, we observe how data scientists create machine learning algorithms.
• 2011-2014: Research
• 2015-2016: Apache SystemML
– June 2015: IBM announces open-source SystemML
– September 2015: Code available on GitHub
– November 2015: SystemML enters Apache incubation
– February 2016: First release (0.9) of Apache SystemML
– June 2016: Second Apache release (0.10)
SystemML at
• Built algorithms for predicting treatment outcomes
– Substantial improvement in accuracy
• Moved from Hadoop MapReduce to Spark
– SystemML supports both frameworks
– Exact same code
– 300X faster on 1/40th as many nodes
SystemML at Cadent Technology
“SystemML allows Cadent to implement advanced numerical programming methods in Apache Spark, empowering us to leverage specialized algorithms in our predictive analytics software.”
Michael Zargham, Chief Scientist
Cadent is a leading provider of TV advertising and data solutions, reaching over 140 million homes and trusted by the world’s largest service providers.
Demo!
Demo Scenario
• Application: Targeted ads using demographic information tied to cookies
• Problem: The information is incomplete
• Solution: Estimate the missing values
– Treat the problem as a matrix completion problem
Data
• The U.S. Census Public Use Microdata Sample (PUMS) data set for 2010
• 10% sample of the U.S. population
– We’ll use just California today
• Use this full data set to generate synthetic incomplete data
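The slides do not say how the synthetic incomplete data was produced. One simple possibility, shown here only as a sketch, is to hide entries of the full matrix uniformly at random. The function name `mask_entries`, the `keep_fraction` parameter, and the tiny example matrix are all hypothetical and are not taken from the demo or from PUMS.

```python
import random


def mask_entries(matrix, keep_fraction, seed=0):
    """Return a copy of `matrix` with entries hidden (set to None) at random.

    Assumption: values are hidden independently and uniformly at random.
    The talk does not state which masking scheme the demo used.
    """
    rng = random.Random(seed)
    return [
        [value if rng.random() < keep_fraction else None for value in row]
        for row in matrix
    ]


# Hypothetical stand-in for a full users x demographic-fields matrix.
full = [
    [34, 2, 1, 5],
    [51, 1, 0, 3],
    [27, 3, 1, 4],
    [45, 2, 0, 2],
]

# Keep roughly 70% of the entries; the rest become "missing".
incomplete = mask_entries(full, keep_fraction=0.7)
```

Because the full matrix is kept, the hidden values can later be compared against whatever the completion algorithm estimates for them.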
Demo Scenario
• Application: Identify products that are complementary (often purchased together)
• Problem: Customers are not currently buying the best complements at the same time
• Solution: Suggest new product pairings
– Treat the problem as a matrix completion problem
[Diagram: a Users × Demographics matrix. Entry (i, j) is the value of demographic field j for customer i.]
Matrix Factorization
• Two factors: a left factor and a top factor
• Multiply these two factors to produce a less-sparse matrix.
• New nonzero values become interpolated demographic information.
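The factorization idea can be sketched in plain Python. This is not the demo's code (the demo ran on SystemML and Spark) and not necessarily the algorithm used there: it is a minimal stochastic-gradient fit, and the names `factorize` and `multiply`, the hyperparameters, and the toy data are assumptions made for illustration.

```python
import random


def factorize(observed, n_rows, n_cols, rank=2, epochs=2000,
              lr=0.01, reg=0.02, seed=0):
    """Fit a left factor (n_rows x rank) and a top factor (rank x n_cols).

    `observed` maps (i, j) -> known value. Only known cells drive the fit.
    Plain stochastic gradient descent with a small L2 penalty (an assumption;
    the talk does not name the optimizer).
    """
    rng = random.Random(seed)
    left = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    top = [[rng.uniform(0.1, 1.0) for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            pred = sum(left[i][k] * top[k][j] for k in range(rank))
            err = value - pred
            for k in range(rank):
                l_ik, t_kj = left[i][k], top[k][j]
                left[i][k] += lr * (err * t_kj - reg * l_ik)
                top[k][j] += lr * (err * l_ik - reg * t_kj)
    return left, top


def multiply(left, top):
    """Multiply the two factors to get a dense (less-sparse) matrix."""
    rank, n_cols = len(top), len(top[0])
    return [
        [sum(row[k] * top[k][j] for k in range(rank)) for j in range(n_cols)]
        for row in left
    ]


# Toy 3 x 4 matrix with three cells missing: (0, 2), (1, 1), (2, 0).
observed = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 3): 2.0,
    (1, 0): 2.0, (1, 2): 2.0, (1, 3): 4.0,
    (2, 1): 6.0, (2, 2): 3.0, (2, 3): 6.0,
}
left, top = factorize(observed, n_rows=3, n_cols=4)
completed = multiply(left, top)  # every cell now has a value
```

The cells of `completed` that were absent from `observed` play the role of the interpolated values. How well they match the truth depends on the rank, the regularization, and how many cells are observed, so a sketch like this would need validation against held-out entries before its estimates are trusted.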
Demo Part 1: Data wrangling
Demo Part 2: Custom algorithm
Key Points
• SystemML, Spark, and Zeppelin work together
• Linear algebra is great for data science
• Customization is important
How to get Apache SystemML
The Apache SystemML Web Site
http://systemml.apache.org
• Download the binary release!
• Try out some tutorials!
• Browse the source!
• Contribute to the project!
THANK YOU.
Please try out Apache SystemML!
http://systemml.apache.org
Special thanks to Nakul Jindal and Mike Dusenberry for helping with the demo!
Building Custom Machine Learning Algorithms With Apache SystemML

  • 1. Building Custom Machine Learning Algorithms with Apache SystemML. Fred Reiss, Chief Architect, IBM Spark Technology Center; Member, IBM Academy of Technology
  • 2. Roadmap • What is Apache SystemML? • Demo! • How to get SystemML
  • 3. What is Apache SystemML?
  • 4. Origins of the SystemML Project. Timeline, 2015 to 2016: You are here.
  • 6. Timeline, 2007 to 2010. 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop. 2009: We form a dedicated team for scalable ML. 2009-2010: Through engagements with customers, we observe how data scientists create ML solutions.
  • 7. Case Study: An Auto Manufacturer. Diagram labels: Warranty Claims; Repair History; Diagnostic Readouts; Predict Reacquired Cars.
  • 8. Case Study: An Auto Manufacturer. Diagram labels: Warranty Claims; Repair History; Diagnostic Readouts; Features; Labels; Machine Learning Algorithm; Predict Reacquired Cars; False Positives. Result: 25x improvement in precision!
  • 9. The Iterative Development Process. Build a pipeline. Results good enough? (Yes / No.) No: Customize part of the pipeline.
  • 10. State-of-the-Art: Small Data. Diagram labels: Data Scientist; R or Python; Personal Computer; Data; Results.
  • 11. State-of-the-Art: Big Data. Diagram labels: Data Scientist; R or Python; Systems Programmer; Scala; Results.
  • 12. State-of-the-Art: Big Data. Diagram labels: Data Scientist; R or Python; Systems Programmer; Scala; Results. 😞 Days or weeks per iteration. 😞 Errors while translating algorithms.
  • 13. The SystemML Vision. Diagram labels: Data Scientist; R or Python; SystemML; Results.
  • 14. The SystemML Vision. Diagram labels: Data Scientist; R or Python; SystemML; Results. 😃 Fast iteration. 😃 Same answer.
  • 15. Timeline, 2007 to 2010. 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop. 2009: We form a dedicated team for scalable ML. 2009-2010: Through engagements with customers, we observe how data scientists create machine learning algorithms.
  • 17. Timeline, 2015 to 2016: Apache SystemML. June 2015: IBM announces open-source SystemML. September 2015: Code available on GitHub. November 2015: SystemML enters Apache incubation. February 2016: First release (0.9) of Apache SystemML. June 2016: Second Apache release (0.10).
  • 18. SystemML at • Built algorithms for predicting treatment outcomes – Substantial improvement in accuracy • Moved from Hadoop MapReduce to Spark – SystemML supports both frameworks – Exact same code – 300X faster on 1/40th as many nodes
  • 19. SystemML at Cadent Technology. “SystemML allows Cadent to implement advanced numerical programming methods in Apache Spark, empowering us to leverage specialized algorithms in our predictive analytics software.” Michael Zargham, Chief Scientist. Cadent is a leading provider of TV advertising and data solutions, reaching over 140 million homes and trusted by the world’s largest service providers.
  • 20. Demo!
  • 21. Demo Scenario • Application: Targeted ads using demographic information tied to cookies • Problem: The information is incomplete • Solution: Estimate the missing values – Treat the problem as a matrix completion problem
  • 22. Data • The U.S. Census Public Use Microdata Sample (PUMS) data set for 2010 • 10% sample of the U.S. population – We’ll use just California today • Use this full data set to generate synthetic incomplete data
  • 23. Demo Scenario • Application: Identify products that are complementary (often purchased together) • Problem: Customers are not currently buying the best complements at the same time • Solution: Suggest new product pairings – Treat the problem as a matrix completion problem
  • 24. Matrix Factorization. Matrix axes: Users, Demographics; entry (i, j) is the value of demographic field j for customer i. Factors: Left Factor × Top Factor. Multiply these two factors to produce a less-sparse matrix. New nonzero values become interpolated demographic information.
  • 25. Demo Part 1: Data wrangling
  • 26. Demo Part 2: Custom algorithm
  • 27. Key Points • SystemML, Spark, and Zeppelin work together • Linear algebra is great for data science • Customization is important
  • 28. How to get Apache SystemML
  • 29. The Apache SystemML Web Site: http://systemml.apache.org Download the binary release! Try out some tutorials! Browse the source! Contribute to the project!
  • 30. THANK YOU. Please try out Apache SystemML! http://systemml.apache.org Special thanks to Nakul Jindal and Mike Dusenberry for helping with the demo!
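
The matrix completion idea from slides 21, 22 and 24 (hide some values of a full matrix, fit two factors to the values that remain, then multiply the factors to fill the gaps) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the demo's code: the talk does not show its algorithm, so the stochastic gradient descent fit, the helper names (`mask_entries`, `factorize`, `multiply`, `observed_mse`), the hyperparameters, and the toy 3 x 4 matrix are all assumptions made here for illustration.

```python
import random


def mask_entries(full, missing_fraction, rng):
    """Copy `full`, replacing roughly `missing_fraction` of the cells with None."""
    return [
        [None if rng.random() < missing_fraction else value for value in row]
        for row in full
    ]


def multiply(left, top):
    """Multiply the left factor (rows x rank) by the top factor (rank x cols)."""
    rank, n_cols = len(top), len(top[0])
    return [
        [sum(row[k] * top[k][j] for k in range(rank)) for j in range(n_cols)]
        for row in left
    ]


def observed_mse(observed, left, top):
    """Mean squared error over the cells of `observed` that are not None."""
    product = multiply(left, top)
    errors = [
        (value - product[i][j]) ** 2
        for i, row in enumerate(observed)
        for j, value in enumerate(row)
        if value is not None
    ]
    return sum(errors) / len(errors)


def factorize(observed, rank=2, steps=500, lr=0.02, reg=0.01, seed=0):
    """Fit observed ~= left x top by gradient steps on the known cells only."""
    rng = random.Random(seed)
    n_rows, n_cols = len(observed), len(observed[0])
    left = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    top = [[rng.uniform(0.1, 1.0) for _ in range(n_cols)] for _ in range(rank)]
    cells = [
        (i, j, value)
        for i, row in enumerate(observed)
        for j, value in enumerate(row)
        if value is not None
    ]
    for _ in range(steps):
        for i, j, value in cells:
            error = value - sum(left[i][k] * top[k][j] for k in range(rank))
            for k in range(rank):
                l_ik, t_kj = left[i][k], top[k][j]
                # Move both factors toward a smaller error, with light shrinkage.
                left[i][k] += lr * (error * t_kj - reg * l_ik)
                top[k][j] += lr * (error * l_ik - reg * t_kj)
    return left, top


# Toy stand-in for the demographics matrix: 3 users x 4 fields (made-up values).
full = [[1.0, 0.5, 2.0, 1.5], [2.0, 1.0, 4.0, 3.0], [3.0, 1.5, 6.0, 4.5]]
incomplete = mask_entries(full, missing_fraction=0.3, rng=random.Random(42))
left, top = factorize(incomplete)
completed = multiply(left, top)  # every cell now has a value, including hidden ones
for row in completed:
    print([round(value, 2) for value in row])
```

Because the full matrix is known in this setup, the interpolated cells in `completed` can be compared against the values that were hidden, which is presumably why the demo generates its incomplete data synthetically from the complete census sample.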