Machine Learning with Big Data
using Apache Spark
Mukundan Agaram
Amit Singh
Agenda
1. Machine Learning Concepts
2. Prevalent Use Cases
3. Platform & Data
4. Econometrics Model for Recession Prediction; Apache Spark Code Review
5. Other ML Concepts and Wrap Up; QA
What is Machine Learning
• Branch of AI
– Alan Turing: “Can machines think?”
– “Field of study that gives Computers the ability to learn without
being explicitly programmed” – Arthur Samuel
• Learn from data
• Improve with experience
• Iteratively refine a model that can be used to predict outcomes of
questions based on previous learning
Types of Machine Learning
• Supervised
– Regression: interest rate prediction
– Classification: spam/no-spam
• Unsupervised
– Clustering: social network analysis
• Recommender Systems
– Collaborative Filtering: Netflix recommendation
Prevalent Use Cases
• Spam Detection – Google Gmail
• Voice Recognition – Apple Siri
• Stock Trading
– High Frequency
– Recommendation Systems
– Algorithmic Trading
• Robotics
– Acquire skills: grasping objects, locomotion, automated driving and navigation
• Medicine and Healthcare
– Healthcare analytics, prediction based on genomes, health sensor analysis
• Advertising
– Targeted advertising based on interests and social media
• Retail and E-Commerce
– Frequency club cards, targeting coupons and promotions
– Recommendation Engines
Use Cases
• Gaming Analytics
– Predictive Analytics for Sports Games, Console based gaming profiles, upsell and
targeting in-app purchases and mods
• Internet of Things
– Large scale sensor data analysis for prediction, classification
• Social Network Analysis
– Facebook, LinkedIn
• Astronomy
– Galaxy formation
Languages and Platforms
• Apache Spark – MLlib
– Scala, Java, Python
• Mahout
• Python Libraries
– Scikit-learn, PyML, PyBrain, matplotlib
• R
– Open Source statistical programming language
• Matlab
• SAS
• Weka
• Octave
• Clojure
Apache Spark
Data Repositories
• UC Irvine Machine Learning Repository
• Infochimps
• Kaggle
• FRED – Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis
• Many others...
Logistic Regression (Linear)
Logistic Regression (Non-Linear)
Logistic Regression
Model Design
• Data Collection
–Identify key inputs to the model
• Data Transformation and Curation
– A human ‘analyst’ should be able to view the data sets and make predictions
– Data needs to be cleaned, scrubbed and transformed (normalized)
– Generally the most important step for any type of supervised learning algorithm
• Review the data
• Visually make predictions of individual learning indicators
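The indicator charts later in the deck are labeled "Normalized", but the slides do not say which normalization was applied. A minimal sketch, assuming z-score standardization (subtract the mean, divide by the standard deviation); the function name and the sample values are made up for illustration:

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series carries no signal; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical monthly readings of one indicator (not real data).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = zscore(sample)
```

Putting every indicator on a comparable scale this way keeps one large-valued series from dominating the others when they are fed to the same model.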
Sample Econometrics Model
• Objective: Predict Economic Conditions (Growth/Recession)
• Supervised Learning
• Widely used algorithms
– Logistic Regression
– SVM
– Random Forest (Decision Trees)
• Current Challenges
– Forecasts are either too early (6 to 12 months before the contraction starts)
– Or too late: NBER (National Bureau of Economic Research) reports a recession only after it has started
– As a result, individuals and corporations cannot plan effectively based on prevailing economic conditions
Sample Econometrics Model
• Use ‘leading indicators’ for economic health
– Treasury yield spread between the 10-year and 3-month rates (T10Y3M)
– Industrial Production (INDPRO)
– Unemployment insurance claims (CCSA)
– Market returns: S&P 500
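One way indicators like these can feed a classifier: each month becomes a feature vector with a 1/0 recession label, and logistic regression learns weights that turn the features into a probability. The sketch below is plain Python with batch gradient descent, not the presenters' Spark code; the two-feature layout, the learning rate and every number in it are made-up assumptions.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """Probability of the positive class; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Toy, made-up normalized features: [yield spread, industrial production].
rows = [[1.2, 0.8], [0.9, 0.5], [0.4, 0.6],
        [-0.8, -1.0], [-1.1, -0.4], [-0.5, -1.3]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = recession, 0 = growth
weights = train_logistic(rows, labels)
```

Calling `predict_proba(weights, [-1.0, -0.9])` then returns the model's estimated recession probability for a month with weak readings on both toy features.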
Data Transformation and Curation
[Figure: Treasury Yield Curve (T10Y3M), normalized, plotted with the recession indicator (1/0 = Y/N), 2/1986 to 7/2017]
Data Transformation and Curation
[Figure: Industrial Production (INDPRO), normalized, monthly, plotted with the recession indicator (1/0 = Y/N), 2/1986 to 7/2017]
Data Transformation and Curation
[Figure: Unemployment Insurance (CCSA), normalized, monthly, plotted with the recession indicator (1/0 = Y/N), 2/1986 to 2/2017]
Data Transformation and Curation
[Figure: S&P500, normalized, monthly, plotted with the recession indicator (1/0 = Y/N), 2/1986 to 2/2017]
Data Plots in Spark Shell
• Data Plots in Spark Shell (Demo)
Bias versus Variance
Model Training and Testing
• Model data should be divided into three sets:
– Training
– Cross-validation
– Testing
• Splitting the data this way helps improve real-world model performance: it exposes bias (underfitting) and variance (overfitting) so they can be reduced, bringing the model closer to optimal results
• More features do not necessarily mean a better prediction
• MLlib provides APIs to help with these operations
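The three-way split can be sketched in plain Python as follows. The 60/20/20 ratio, the seed and the function name are assumptions, since the slides give no proportions. In PySpark, `RDD.randomSplit(weights, seed)` offers a comparable operation on distributed data.

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows and cut them into training, cross-validation and test sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n = len(shuffled)
    train_end = round(n * fractions[0])
    cv_end = train_end + round(n * fractions[1])
    return shuffled[:train_end], shuffled[train_end:cv_end], shuffled[cv_end:]

# 100 placeholder row ids stand in for real observations.
train, cv, test = three_way_split(range(100))
```

The model is fitted on `train`, tuned against `cv`, and scored once on `test`, so the final score comes from data the model never saw during fitting or tuning.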
Model Performance Measurements
• Precision
• Recall
• F1 Score
• Confusion Matrix
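All four measurements start from the same counts. A minimal sketch of a confusion matrix for 1/0 labels, with made-up label lists and a hypothetical helper name:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 1/0 labels."""
    counts = Counter(zip(actual, predicted))  # keys are (actual, predicted)
    return {
        "tp": counts[(1, 1)],  # predicted positive, actually positive
        "fp": counts[(0, 1)],  # predicted positive, actually negative
        "fn": counts[(1, 0)],  # predicted negative, actually positive
        "tn": counts[(0, 0)],  # predicted negative, actually negative
    }

# Made-up labels for eight observations.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(actual, predicted)
```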
Model Performance Measurements
• Precision
– Of all the cases our algorithm flags as positive, what fraction is actually positive?
– = true positives / # predicted positives
– = true positives / (true positives + false positives)
– High precision is good (i.e. closer to 1)
– You want a big number, because you want false positives to be as close to 0 as possible
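A worked instance of the precision formula, using made-up counts. Returning 0.0 when nothing is predicted positive is a convention chosen for this sketch; the ratio is undefined in that case.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # convention for this sketch; mathematically undefined
    return tp / predicted_positive

# Hypothetical counts: 40 positive predictions, 30 of them correct.
p = precision(tp=30, fp=10)
```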
Model Performance Measurements
• Recall
– How sensitive is our algorithm?
– Example: of all patients in the set who actually have cancer, what fraction did we correctly detect?
– = true positives / # actual positives
– = true positives / (true positives + false negatives)
– High recall is good (i.e. closer to 1)
– You want a big number, because you want false negatives to be as close to 0 as possible
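The recall formula differs from precision only in its denominator, which counts actual positives instead of predicted positives. A short sketch with made-up counts; the zero-denominator convention is again an assumption.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model detected."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # convention for this sketch; mathematically undefined
    return tp / actual_positive

# Hypothetical counts: 50 actual positives, 30 of them detected.
r = recall(tp=30, fn=20)
```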
Model Performance Measurements
F1 Score (F-score)
• F1 = 2 * (P * R) / (P + R), where P is precision and R is recall
• The F-score is the harmonic mean of precision and recall, so it sits closer to the lower of the two values than a simple average would
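A small sketch showing how the F1 score penalizes an imbalance between precision and recall more than a plain average does. The input values are made up.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0  # both are zero, so the score is taken as zero
    return 2 * p * r / (p + r)

balanced = f1_score(0.75, 0.6)  # about 0.667, near the plain average of 0.675
lopsided = f1_score(0.9, 0.1)   # about 0.18, far below the plain average of 0.5
```

A model cannot reach a high F1 score by excelling at only one of the two measurements.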
Model Results and Code Review
• Logistic Regression
• SVM
• Random Forest
Apache Spark MLlib
• Algorithms Supported:
–Linear SVM
–Logistic Regression SGD
–Classification and Regression Tree
–K-Means Clustering
–Recommendation via Alternating Least Squares (ALS)
–Singular Value Decomposition
–Linear Regression with L1 and L2 Regularization
–Multinomial Naïve Bayes
–Basic Statistics
–Feature Transformations
Unsupervised Learning
• K Means Clustering
– Customer Segmentation
– Social Network Analysis
– Computer Data Center Analysis
– Astronomical Galaxy formations
• Recommendation Engines
Unsupervised Learning – K Means
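The slide for this section appears to have been an image, so the steps are not spelled out in the text. A minimal sketch of the standard k-means loop (Lloyd's algorithm) in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its members. The sample points and the hand-picked starting centroids are assumptions; choosing the starting centroids is a separate decision that the sketch leaves to the caller.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate nearest-centroid assignment and centroid recomputation."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```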
Recommender Systems
• Class of information filtering system that
predicts the ‘rating’ or ‘preference’ a user
would give to an item
• Examples:
– Netflix
– Amazon
– Apple Genius
Recommender
• Collaborative Filtering
– User-User
– Item-Item
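A minimal sketch of the user-user variant: predict a user's rating for an unseen item as a similarity-weighted average of other users' ratings. It assumes cosine similarity computed over co-rated items, which is one common choice among several. The users, titles and ratings are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted, total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        weighted += sim * other_ratings[item]
        total += sim
    return weighted / total if total else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob": {"film_a": 5, "film_b": 5, "film_c": 1, "film_d": 5},
    "cat": {"film_a": 1, "film_b": 2, "film_c": 5, "film_d": 1},
}
predicted = predict_rating(ratings, "ann", "film_d")
```

Because "ann" rates films much as "bob" does, the prediction lands closer to bob's rating than to cat's. The item-item variant applies the same idea with the roles of users and items swapped.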
Recommender Systems
Q/A