Machine Learning with MLlib
Spark - MLlib
MACHINE LEARNING
“Programming computers to optimize performance using example data or past experience.”
MACHINE LEARNING?
Field of study that gives "computers the ability
to learn without being explicitly programmed."
-- Arthur Samuel, 1959
HAVE YOU PLAYED MARIO?
How long did it take you to learn the game and rescue the princess?
How about automating it?
● A program learns to play Mario
● It observes the game and presses keys
● It maximises the score
● The program learnt to play Mario and other games without being explicitly programmed for each game
So?
Question: To make this program learn another game, such as Pac-Man, we will have to:
1. Write new rules as per the game
2. Just hook it to the new game and let it play for a while
MACHINE LEARNING
● Branch of Artificial Intelligence
● Design and development of algorithms
● Computers evolve behaviour based on empirical data
MACHINE LEARNING - APPLICATIONS
● Recommend friends, dates, and products to end users.
● Classify content into predefined groups.
● Identify key topics in large collections of text.
● Computer vision: identifying objects.
● Natural language processing.
● Find similar content based on object properties.
● Detect anomalies within given data.
● Rank search results by learning from user feedback.
● Classify DNA sequences.
● Sentiment analysis / opinion mining.
● Bioinformatics.
● Speech and handwriting recognition.
MACHINE LEARNING - TYPES?
● Supervised: given example inputs and outputs, learn to map inputs to outputs (classification, regression)
● Unsupervised: no labels given; find structure (clustering)
● Reinforcement: act in a dynamic environment to achieve a certain goal
MACHINE LEARNING - CLASSIFICATION?
Check email: spam? Yes / No
We use logistic regression
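As a rough illustration of the spam check above, here is a minimal logistic regression sketch in plain Python rather than MLlib. The single feature (a count of suspicious words per email), the sample data, and the helper names `train_logistic` and `is_spam` are all made up for this example.

```python
import math

def sigmoid(z):
    # Squash any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Learn weight w and bias b so that sigmoid(w * x + b) approximates P(spam | x)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def is_spam(x, w, b):
    # Classify as spam when the predicted probability is at least 0.5
    return sigmoid(w * x + b) >= 0.5

# Hypothetical training data: suspicious-word counts and labels (1 = spam, 0 = not spam)
counts = [0, 1, 2, 6, 7, 8]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(counts, labels)
print(is_spam(1, w, b), is_spam(7, w, b))
```

The model outputs a probability, and the Yes/No answer comes from thresholding it at 0.5.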
MACHINE LEARNING - REGRESSION?
Predicting a continuous-valued attribute associated with an object.
In linear regression, out of all the possible lines through the data, we look for the one that is closest to all the points.
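One common way to make "closest to all the points" precise is to minimise the sum of squared vertical distances (ordinary least squares). The sketch below, in plain Python with made-up sample points, computes that line directly. The function name `fit_line` is hypothetical and this is not an MLlib API.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up sample points that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```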
MACHINE LEARNING - CLUSTERING?
● Form clusters based on some definition of nearness
MACHINE LEARNING - TOOLS
DATA SIZE          CLASSIFICATION    TOOLS
Lines              Sample Data       Analysis and Visualization: Whiteboard, ...
KBs - low MBs      Prototype Data    Analysis and Visualization: Matlab, Octave, R, Processing, ...
MBs - low GBs      Online Data       Analysis: NumPy, SciPy, Weka, ...
                                     Visualization: Flare, AmCharts, Raphael, Protovis
GBs - TBs - PBs    Big Data          Analysis: MLlib, SparkR, GraphX, Mahout, Giraph
Machine Learning Library (MLlib)
Goal is to make practical machine learning scalable and easy
Consists of common learning algorithms and utilities, including:
● Classification
● Regression
● Clustering
● Collaborative filtering
● Dimensionality reduction
● Lower-level optimization primitives
● Higher-level pipeline APIs
MLlib Structure
● ML Algorithms: common learning algorithms, e.g. classification, regression, clustering, and collaborative filtering
● Featurization: feature extraction, transformation, dimensionality reduction, and selection
● Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
● Persistence: saving and loading algorithms, models, and Pipelines
● Utilities: linear algebra, statistics, data handling, etc.
MLlib - Collaborative Filtering
● Commonly used for recommender systems
● Techniques aim to fill in the missing entries of a user-item association
matrix
● Supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.
● MLlib uses the alternating least squares (ALS) algorithm to learn these
latent factors.
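To make the idea of latent factors concrete, here is a toy alternating least squares sketch in plain Python with a single latent factor per user and per item (rank 1). With rank 1, each least-squares sub-problem reduces to a simple ratio. The ratings, the function name `als_rank1`, and the regularisation value are assumptions for illustration, and MLlib's own ALS is more general than this.

```python
def als_rank1(ratings, n_users, n_items, iterations=20, reg=0.01):
    """Approximate each observed rating r[(u, i)] by user_f[u] * item_f[i]."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; solve the regularised least-squares problem per user
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(item_f[i] ** 2 for (uu, i), r in ratings.items() if uu == u) + reg
            user_f[u] = num / den
        # Hold user factors fixed; solve the same kind of problem per item
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(user_f[u] ** 2 for (u, ii), r in ratings.items() if ii == i) + reg
            item_f[i] = num / den
    return user_f, item_f

# Made-up user-item ratings; the entry for (user 1, item 2) is missing
ratings = {
    (0, 0): 4.0, (0, 1): 2.0, (0, 2): 4.0,
    (1, 0): 2.0, (1, 1): 1.0,
}

user_f, item_f = als_rank1(ratings, n_users=2, n_items=3)
predicted = user_f[1] * item_f[2]  # fill in the missing entry
print(round(predicted, 2))
```

User 1 rates everything at half of user 0's rating, so the filled-in value should come out close to 2.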
Example - MovieLens Recommendation
https://github.com/cloudxlab/bigdata/blob/master/spark/examples/mllib/ml-recommender.scala
Demo
Exercise - Movie suggestions for you!
1. Find the maximum user id
2. Create the next user id to represent yourself
3. Add your ratings for various movies
4. Generate your movie recommendations
5. Write down the steps in your Google Doc and share it with support@cloudxlab.com.
spark.mllib - Data Types
Local vector
    Integer-typed, 0-based indices and double-typed values
    dv2 = [1.0, 0.0, 3.0]
Labeled point
    A local vector, either dense or sparse, associated with a label/response
    pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
Matrices:
    Local matrix
    Distributed matrix
        RowMatrix
        IndexedRowMatrix
        CoordinateMatrix
        BlockMatrix
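The difference between a dense and a sparse local vector can be sketched in plain Python. A sparse vector is described here by its size plus parallel lists of 0-based indices and non-zero values, which mirrors how MLlib describes sparse vectors. The helper names below are hypothetical and are not MLlib functions.

```python
def dense_to_sparse(dense):
    """Keep only the non-zero entries, together with their 0-based indices."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) back into a full list of doubles."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

dv = [1.0, 0.0, 3.0]  # the dense vector from the slide
size, indices, values = dense_to_sparse(dv)
print(size, indices, values)                    # 3 [0, 2] [1.0, 3.0]
print(sparse_to_dense(size, indices, values))   # [1.0, 0.0, 3.0]
```

The sparse form pays off when most entries are zero, since only the non-zero values and their positions are stored.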
Pipelines
DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: All Transformers and Estimators now share a common API for specifying parameters.
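The relationship between these pieces can be sketched with a toy analogue in plain Python. This is not the Spark API: lists of dicts stand in for DataFrames, and every class and function name below is invented for illustration.

```python
class Lowercase:
    """Transformer: maps one dataset to another without learning anything."""
    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]

class KeywordModel:
    """Transformer produced by fitting: adds a 'prediction' column."""
    def __init__(self, spam_words):
        self.spam_words = spam_words
    def transform(self, rows):
        return [dict(row, prediction=1.0 if set(row["text"].split()) & self.spam_words else 0.0)
                for row in rows]

class KeywordEstimator:
    """Estimator: fit() learns from labeled rows and returns a model (a Transformer)."""
    def fit(self, rows):
        spam = {w for r in rows if r["label"] == 1.0 for w in r["text"].split()}
        ham = {w for r in rows if r["label"] == 0.0 for w in r["text"].split()}
        return KeywordModel(spam - ham)

class Pipeline:
    """Chains stages; fit() returns the list of fitted Transformers."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # an Estimator becomes a Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return fitted

def run_pipeline(fitted, rows):
    # Apply every fitted stage in order
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

training = [
    {"text": "WIN money now", "label": 1.0},
    {"text": "meeting at noon", "label": 0.0},
]
fitted = Pipeline([Lowercase(), KeywordEstimator()]).fit(training)
result = run_pipeline(fitted, [{"text": "Win a prize"}])
print(result[0]["prediction"])
```

Fitting runs the stages in order and replaces each Estimator with the model it produces, so the fitted pipeline consists only of Transformers.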
spark.mllib - Basic Statistics
Summary statistics
Correlations
Stratified sampling
Hypothesis testing
Random data generation
Kernel density estimation
See https://spark.apache.org/docs/latest/mllib-statistics.html
MLlib - Classification and Regression
MLlib supports various methods:
● Binary classification: linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes
● Multiclass classification: logistic regression, decision trees, random forests, naive Bayes
● Regression: linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression
MLlib - Other Classes of Algorithms
Dimensionality reduction:
https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
Feature extraction and transformation:
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Frequent pattern mining:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
Evaluation metrics:
https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
PMML model export:
https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
Optimization (developer):
https://spark.apache.org/docs/latest/mllib-optimization.html
Thank you!
MLLib
reachus@cloudxlab.com
MACHINE LEARNING - TYPES
● Supervised: uses labeled training data to create a classifier that can predict the output for unseen inputs. Example 1: spam filter.
● Unsupervised: uses unlabeled training data to find structure in the data.
● Semi-supervised: makes use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.
● Reinforcement: a computer program interacts with a dynamic environment to reach a goal, and gets feedback as it navigates its problem space.
MACHINE LEARNING - GRADIENT DESCENT
● Instead of trying all lines, move in the direction that yields better results.
Imagine yourself blindfolded on mountainous terrain, and you have to find the lowest point.
If your last step took you higher, you go in the opposite direction.
Otherwise, you keep going, just faster.
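The idea can be sketched in plain Python by fitting a line y = w * x + b to made-up points. This is basic gradient descent with a fixed step size (an assumed learning rate of 0.05), so it does not speed up the way the analogy suggests. The function name `gradient_descent` is hypothetical and not an MLlib API.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point with the current line
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step "downhill": opposite to the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up sample points that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

The gradient tells the "blindfolded walker" which way is uphill, so each step goes the other way. No line is ever tried at random.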
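Example - MovieLens Reco (ver 2.0)
import org.apache.spark.mllib.recommendation._

// Load the raw MovieLens ratings; fields are separated by "::"
val raw = sc.textFile("/data/ml-1m/ratings.dat")

// Your own (movie id, rating) pairs, stored under user id 0
val mydata = Seq((2, 0.01) /* , .... */)
val mydatardd = sc.parallelize(mydata).map(x => Rating(0, x._1, x._2))

// Parse one line into a Rating(user, product, rating); the fourth field is ignored
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}

val ratings = raw.map(parseRating)
val totalratings = ratings.union(mydatardd)

// Train ALS with rank = 8, iterations = 5, lambda = 1
val model = ALS.train(totalratings, 8, 5, 1)

// Top 10 recommendations for user 1 (pass 0 for the ratings added above)
val products = model.recommendProducts(1, 10)
// Load data from movies, join it, and display the names ordered by ratings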
spark.mllib - Basic Statistics - Summary
from pyspark.mllib.stat import Statistics
sc = ... # SparkContext
mat = ... # an RDD of Vectors
# Compute column summary statistics.
summary = Statistics.colStats(mat)
print(summary.mean())
print(summary.variance())
print(summary.numNonzeros())
MLlib - Clustering
● Clustering is an unsupervised learning problem
● Group subsets of entities with one another based on some notion of
similarity.
● Often used for exploratory analysis
MLlib supports the following models:
● K-means: clusters the data points into a predefined number of clusters
● Gaussian mixture: models subgroups within an overall population
● Power iteration clustering (PIC): clusters the vertices of a graph given pairwise similarities as edge properties
● Latent Dirichlet allocation (LDA): infers topics from a collection of text documents
● Streaming k-means
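To show what k-means does with "a predefined number of clusters", here is a small sketch of the standard assign-and-update loop (Lloyd's algorithm) in plain Python on made-up 2-D points. Using the first k points as the initial centers is a simplification chosen for this example and is not how MLlib initialises its centers. The function name `kmeans` is hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points into k groups; returns the final cluster centers."""
    centers = [points[i] for i in range(k)]  # simplistic initialisation (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: attach each point to its nearest center
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points (keep it if the cluster is empty)
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else centers[c]
            for c, cl in enumerate(clusters)
        ]
    return centers

# Made-up points forming two obvious groups, near (0, 0) and near (9, 9)
points = [(0.0, 0.0), (9.0, 9.0), (0.1, 0.1), (0.2, 0.2), (9.1, 9.1), (9.2, 9.2)]

centers = kmeans(points, k=2)
print(centers)
```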
MLlib - k-means Example
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("/data/spark/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
// Save and load model
clusters.save(sc, "KMeansModel1")
val sameModel = KMeansModel.load(sc, "KMeansModel1")
MLlib - k-means Example (Python)
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

# Load and parse the data: one space-separated point per line
data = sc.textFile("/data/spark/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data into two clusters)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    # Squared distance from the point to the center of its assigned cluster
    center = clusters.centers[clusters.predict(point)]
    return sum([x**2 for x in (point - center)])

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
clusters.save(sc, "myModelPath")
sameModel = KMeansModel.load(sc, "myModelPath")
Example - MovieLens Recommendation
https://github.com/cloudxlab/bigdata/blob/master/spark/examples/mllib/ml-recommender.scala
[Figure: MovieLens movie ratings are split into a training set (80%) and a test set (20%). MLlib builds a model from the training set. Ratings are removed from the test set and the model is applied to produce recommendations.]

More Related Content

What's hot (20)

PDF
Sparse Data Support in MLlib
Xiangrui Meng
 
PDF
Practical Machine Learning Pipelines with MLlib
Databricks
 
PPTX
Apache Spark Streaming
Zahra Eskandari
 
PPTX
Introduction to Spark - DataFactZ
DataFactZ
 
PDF
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
PDF
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
PDF
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
PPTX
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
PDF
Spark Summit EU talk by Reza Karimi
Spark Summit
 
PPTX
Spark real world use cases and optimizations
Gal Marder
 
PDF
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
PPTX
JVM languages "flame wars"
Gal Marder
 
PDF
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
PDF
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
Spark Summit
 
PDF
Apache Spark: The Analytics Operating System
Adarsh Pannu
 
PDF
MLlib: Spark's Machine Learning Library
jeykottalam
 
PPTX
Apache Spark MLlib
Zahra Eskandari
 
PDF
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks
 
PDF
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
PPTX
Spark MLlib - Training Material
Bryan Yang
 
Sparse Data Support in MLlib
Xiangrui Meng
 
Practical Machine Learning Pipelines with MLlib
Databricks
 
Apache Spark Streaming
Zahra Eskandari
 
Introduction to Spark - DataFactZ
DataFactZ
 
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Spark Summit EU talk by Reza Karimi
Spark Summit
 
Spark real world use cases and optimizations
Gal Marder
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
JVM languages "flame wars"
Gal Marder
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
Spark Summit
 
Apache Spark: The Analytics Operating System
Adarsh Pannu
 
MLlib: Spark's Machine Learning Library
jeykottalam
 
Apache Spark MLlib
Zahra Eskandari
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks
 
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
Spark MLlib - Training Material
Bryan Yang
 

Similar to Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | CloudxLab (20)

PPTX
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud
 
PDF
Spark m llib
Milad Alshomary
 
PPTX
Alpine innovation final v1.0
alpinedatalabs
 
PPTX
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
PDF
Summary machine learning and model deployment
Novita Sari
 
PDF
Big Data Science in Scala V2
Anastasia Bobyreva
 
PPTX
Data science on big data. Pragmatic approach
Pavel Mezentsev
 
PDF
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Jim Dowling
 
PDF
2014-08-14 Alpine Innovation to Spark
DB Tsai
 
PPTX
Learning spark ch11 - Machine Learning with MLlib
phanleson
 
PPTX
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Codemotion
 
PDF
Sparking Science up with Research Recommendations
Maya Hristakeva
 
PDF
Splice Machine's use of Apache Spark and MLflow
Databricks
 
PPTX
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
PPTX
Sparking Science up with Research Recommendations by Maya Hristakeva
Spark Summit
 
PDF
Recent Developments in Spark MLlib and Beyond
Xiangrui Meng
 
PPTX
Using Apache Spark with IBM SPSS Modeler
Global Knowledge Training
 
PDF
Open and Automated Machine Learning
Joaquin Vanschoren
 
PDF
Python Machine Learning - Getting Started
Rafey Iqbal Rahman
 
PPTX
Is Spark the right choice for data analysis ?
Ahmed Kamal
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud
 
Spark m llib
Milad Alshomary
 
Alpine innovation final v1.0
alpinedatalabs
 
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Summary machine learning and model deployment
Novita Sari
 
Big Data Science in Scala V2
Anastasia Bobyreva
 
Data science on big data. Pragmatic approach
Pavel Mezentsev
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Jim Dowling
 
2014-08-14 Alpine Innovation to Spark
DB Tsai
 
Learning spark ch11 - Machine Learning with MLlib
phanleson
 
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Codemotion
 
Sparking Science up with Research Recommendations
Maya Hristakeva
 
Splice Machine's use of Apache Spark and MLflow
Databricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Spark Summit
 
Recent Developments in Spark MLlib and Beyond
Xiangrui Meng
 
Using Apache Spark with IBM SPSS Modeler
Global Knowledge Training
 
Open and Automated Machine Learning
Joaquin Vanschoren
 
Python Machine Learning - Getting Started
Rafey Iqbal Rahman
 
Is Spark the right choice for data analysis ?
Ahmed Kamal
 
Ad

More from CloudxLab (20)

PDF
Understanding computer vision with Deep Learning
CloudxLab
 
PDF
Deep Learning Overview
CloudxLab
 
PDF
Recurrent Neural Networks
CloudxLab
 
PDF
Natural Language Processing
CloudxLab
 
PDF
Naive Bayes
CloudxLab
 
PDF
Autoencoders
CloudxLab
 
PDF
Training Deep Neural Nets
CloudxLab
 
PDF
Reinforcement Learning
CloudxLab
 
PDF
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
PDF
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
PDF
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
PDF
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
PDF
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
PDF
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
PDF
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
PDF
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab
 
PPTX
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
PPTX
Introduction to Deep Learning | CloudxLab
CloudxLab
 
PPTX
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
PPTX
Ensemble Learning and Random Forests
CloudxLab
 
Understanding computer vision with Deep Learning
CloudxLab
 
Deep Learning Overview
CloudxLab
 
Recurrent Neural Networks
CloudxLab
 
Natural Language Processing
CloudxLab
 
Naive Bayes
CloudxLab
 
Autoencoders
CloudxLab
 
Training Deep Neural Nets
CloudxLab
 
Reinforcement Learning
CloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
CloudxLab
 
Ad

Recently uploaded (20)

PDF
Dev Dives: Accelerating agentic automation with Autopilot for Everyone
UiPathCommunity
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PPTX
Enabling the Digital Artisan – keynote at ICOCI 2025
Alan Dix
 
DOCX
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
PDF
Understanding AI Optimization AIO, LLMO, and GEO
CoDigital
 
PPTX
Reimaginando la Ciberdefensa: De Copilots a Redes de Agentes
Cristian Garcia G.
 
PDF
How to Visualize the ​Spatio-Temporal Data Using CesiumJS​
SANGHEE SHIN
 
PDF
Unlocking FME Flow’s Potential: Architecture Design for Modern Enterprises
Safe Software
 
PDF
Bridging CAD, IBM TRIRIGA & GIS with FME: The Portland Public Schools Case
Safe Software
 
PPTX
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Poster...
Michele Kryston
 
PDF
FME as an Orchestration Tool with Principles From Data Gravity
Safe Software
 
PDF
Redefining Work in the Age of AI - What to expect? How to prepare? Why it mat...
Malinda Kapuruge
 
PDF
TrustArc Webinar - Navigating APAC Data Privacy Laws: Compliance & Challenges
TrustArc
 
PDF
5 Things to Consider When Deploying AI in Your Enterprise
Safe Software
 
PDF
Automating the Geo-Referencing of Historic Aerial Photography in Flanders
Safe Software
 
PPTX
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
PPSX
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
PDF
Understanding The True Cost of DynamoDB Webinar
ScyllaDB
 
PDF
My Journey from CAD to BIM: A True Underdog Story
Safe Software
 
PPTX
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 
Dev Dives: Accelerating agentic automation with Autopilot for Everyone
UiPathCommunity
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
Enabling the Digital Artisan – keynote at ICOCI 2025
Alan Dix
 
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
Understanding AI Optimization AIO, LLMO, and GEO
CoDigital
 
Reimaginando la Ciberdefensa: De Copilots a Redes de Agentes
Cristian Garcia G.
 
How to Visualize the ​Spatio-Temporal Data Using CesiumJS​
SANGHEE SHIN
 
Unlocking FME Flow’s Potential: Architecture Design for Modern Enterprises
Safe Software
 
Bridging CAD, IBM TRIRIGA & GIS with FME: The Portland Public Schools Case
Safe Software
 
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Poster...
Michele Kryston
 
FME as an Orchestration Tool with Principles From Data Gravity
Safe Software
 
Redefining Work in the Age of AI - What to expect? How to prepare? Why it mat...
Malinda Kapuruge
 
TrustArc Webinar - Navigating APAC Data Privacy Laws: Compliance & Challenges
TrustArc
 
5 Things to Consider When Deploying AI in Your Enterprise
Safe Software
 
Automating the Geo-Referencing of Historic Aerial Photography in Flanders
Safe Software
 
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
Understanding The True Cost of DynamoDB Webinar
ScyllaDB
 
My Journey from CAD to BIM: A True Underdog Story
Safe Software
 
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 

Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. Spark - MLlib MACHINE LEARNING “Programming Computers to optimize performance using Example Data or Past Experience”
  • 3. Spark - MLlib MACHINE LEARNING? Field of study that gives "computers the ability to learn without being explicitly programmed." -- Arthur Samuel, 1959
  • 4. Spark - MLlib HAVE YOU PLAYED MARIO? How much time did it take you to learn & win the princess?
  • 5. Spark - MLlib How about automating it?
  • 6. Spark - MLlib ‱ Program Learns to Play Mario Observes the game & presses keys Maximises Score How about automating it?
  • 8. Spark - MLlib ‱ Program Learnt to play Mario and other games Without any need of programming So?
  • 9. Spark - MLlib 1. Write new rules as per the game 2. Just hook it to new game and let it play for a while Question: To make this program learn any other games such as PacMan we will have to 

  • 10. Spark - MLlib 1. Write new rules as per the game 2. Just hook it to new game and let it play for a while Question: To make this program learn any other games such as PacMan we will have to 

  • 11. Spark - MLlib MACHINE LEARNING ‱ Branch of Artificial Intelligence ‱ Design and Development of Algorithms ‱ Computers Evolve Behaviour based on Empirical Data
  • 12. Spark - MLlib Recommend Friends, Dates, Products to end-user. MACHINE LEARNING - APPLICATIONS
  • 13. Spark - MLlib Classify content into predefined groups. MACHINE LEARNING - APPLICATIONS
  • 14. Spark - MLlib Identify key topics in large Collections of Text. MACHINE LEARNING - APPLICATIONS
  • 15. Spark - MLlib Computer Vision - Identifying Objects MACHINE LEARNING - APPLICATIONS
  • 16. Spark - MLlib Natural Language Processing MACHINE LEARNING - APPLICATIONS
  • 17. Spark - MLlib MACHINE LEARNING - APPLICATIONS ‱ Find Similar content based on Object Properties. ‱ Detect Anomalies within given data. ‱ Ranking Search Results with User Feedback Learning. ‱ Classifying DNA sequences. ‱ Sentiment Analysis/ Opinion Mining ‱ BioInformatics. ‱ Speech and HandWriting Recognition.
  • 18. Spark - MLlib MACHINE LEARNING - TYPES? Machine Learning Supervised Given example inputs & outputs, learn to map inputs to outputs
  • 19. Spark - MLlib MACHINE LEARNING - TYPES? Machine Learning Supervised Unsupervised Given example inputs & outputs, learn to map inputs to outputs No labels given, find structure
  • 20. Spark - MLlib MACHINE LEARNING - TYPES? Machine Learning Supervised Unsupervised Reinforcement Given example inputs & outputs, learn to map inputs to outputs No labels given, find structure Dynamic environment, perform a certain goal
  • 21. Spark - MLlib MACHINE LEARNING - TYPES? Machine Learning Supervised Unsupervised Reinforcement Classification Regression Clustering
  • 22. Spark - MLlib MACHINE LEARNING - CLASSIFICATION? Spam? Ye s No Check Email We Use Logistic Regression
  • 23. Spark - MLlib MACHINE LEARNING - REGRESSION? Predicting a continuous-valued attribute associated with an object. In linear regression, we draw all possible lines going through the points such that it is closest to all.
  • 24. Spark - MLlib MACHINE LEARNING - CLUSTERING? ‱ To form a cluster ‱ based on some definition of nearness
  • 25. Spark - MLlib MACHINE LEARNING - TOOLS DATA SIZE CLASSFICATION TOOLS Lines Sample Data Analysis and Visualization Whiteboard,
 KBs - low MBs Prototype Data Analysis and Visualization Matlab, Octave, R, Processing, MBs - low GBs Online Data Analysis NumPy, SciPy, Weka, Visualization Flare, AmCharts, Raphael, Protovis GBs - TBs - PBs Big Data Analysis MLlib, SparkR, GraphX, Mahout, Giraph
  • 26. Spark - MLlib Machine Learning Library (MLlib) Goal is to make practical machine learning scalable and easy Consists of common learning algorithms and utilities, including: ● Classification ● Regression ● Clustering ● Collaborative filtering ● Dimensionality reduction ● Lower-level optimization primitives ● Higher-level pipeline APIs
  • 27. Spark - MLlib MlLib Structure ML Algorithms Common learning algorithms e.g. classification, regression, clustering, and collaborative filtering Featurization Feature extraction, Transformation, Dimensionality reduction, and Selection Pipelines Tools for constructing, evaluating, and tuning ML Pipelines Persistence Saving and load algorithms, models, and Pipelines Utilities Linear algebra, statistics, data handling, etc.
  • 28. Spark - MLlib MLlib - Collaborative Filtering ● Commonly used for recommender systems ● Techniques aim to fill in the missing entries of a user-item association matrix ● Supports model-based collaborative filtering, ● Users and products are described by a small set of latent factors ○ that can be used to predict missing entries. ● MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.
  • 29. Spark - MLlib Example - Movie Lens Recommendation (1)
  • 30. Spark - MLlib Example - Movie Lens Recommendation https://github.com/cloudxlab/bigdata/blob/master/spark/examples/mllib/ml-recommender.scala Demo
  • 31. Spark - MLlib Exercise - Movies suggestions for you! 1. Find the maximum user id 2. Create the next user id denoting yourselves 3. Put your ratings of various movies 4. Generate your movies recommendations 5. Write down the steps in your Google Doc and share with [email protected].
  • 32. Spark - MLlib spark.mllib - DataTypes Local vector integer-typed and 0-based indices and double-typed values dv2 = [1.0, 0.0, 3.0] Labeled point a local vector, either dense or sparse, associated with a label/response pos = LabeledPoint(1.0, [1.0, 0.0, 3.0]) Matrices: Local matrix Distributed matrix RowMatrix IndexedRowMatrix CoordinateMatrix BlockMatrix
  • 33. Spark - MLlib Pipe Lines DataFrame:This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions. Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions. Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model. Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. Parameter: All Transformers and Estimators now share a common API for specifying parameters.
  • 35. Spark - MLlib spark.mllib - Basic Statistics Summary statistics Correlations Stratified sampling Hypothesis testing Random data generation Kernel density estimation See https://spark.apache.org/docs/latest/mllib-statistics.html
  • 36. Spark - MLlib MLlib - Classification and Regression MLlib supports various methods: Binary Classification linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes Multiclass Classification logistic regression, decision trees, random forests, naive Bayes Regression linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression More Details>>
  • 37. Spark - MLlib MlLib - Other Classes of Algorithms Dimensionality reduction: https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html Feature extraction and transformation: https://spark.apache.org/docs/latest/mllib-feature-extraction.html Frequent pattern mining: https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html Evaluation metrics: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html PMML model export: https://spark.apache.org/docs/latest/mllib-pmml-model-export.html Optimization (developer): https://spark.apache.org/docs/latest/mllib-optimization.html
  • 39. Spark - MLlib MACHINE LEARNING - TYPES Supervised Unsupervised Semi-Supervised Reinforcement
  • 40. Spark - MLlib MACHINE LEARNING - TYPES Supervised: Using labeled training data to create a classifier that can predict output for unseen inputs. Supervised Unsupervised Semi-Supervised Reinforcement
  • 41. Spark - MLlib MACHINE LEARNING - TYPES Example1: Spam Filter Supervised Unsupervised Semi-Supervised Reinforcement
  • 43. Spark - MLlib MACHINE LEARNING - TYPES Unsupervised: Using unlabeled training data to create a function that can predict output. Supervised Unsupervised Semi-Supervised Reinforcement
  • 44. Spark - MLlib MACHINE LEARNING - TYPES Semi-Supervised: Makes use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Supervised Unsupervised Semi-Supervised Reinforcement
  • 45. Spark - MLlib MACHINE LEARNING - TYPES Reinforcement: A computer program interacts with a dynamic environment to reach a goal and gets feedback as it navigates its problem space. Supervised Unsupervised Semi-Supervised Reinforcement
  • 46. Spark - MLlib MACHINE LEARNING - TYPES
  • 47. Spark - MLlib MACHINE LEARNING - GRADIENT DESCENT Instead of trying all lines, move in the direction that yields better results. Imagine yourself blindfolded on mountainous terrain, trying to find the lowest point. If your last step took you higher, you go in the opposite direction. Otherwise, you keep going, just faster.
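The blindfolded walk above can be sketched as a small line-fitting routine. This is a minimal sketch, not the deck's or MLlib's implementation: the function name `fit_line`, the learning rate, the step count and the sample points are all assumptions made for the example, and it uses a plain fixed step size rather than the "go faster" speed-up the slide hints at.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Slope of the error surface at the current (w, b).
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step downhill: move against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up sample points lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
```

With these points the search settles very close to w = 2 and b = 1. A learning rate that is too large makes the steps overshoot and diverge, which is the usual reason to tune it.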
  • 48. Spark - MLlib Example - Movie Lens Reco (ver 2.0)
      import org.apache.spark.mllib.recommendation._
      // Load the raw ratings; each line has four fields separated by "::"
      val raw = sc.textFile("/data/ml-1m/ratings.dat")
      // Our own (movieId, rating) pairs; the rest of the list is elided on the slide
      val mydata = Seq((2, 0.01) /* , .... */)
      val mydatardd = sc.parallelize(mydata).map(x => Rating(0, x._1, x._2))
      def parseRating(str: String): Rating = {
        val fields = str.split("::")
        assert(fields.size == 4)
        Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
      }
      val ratings = raw.map(parseRating)
      val totalratings = ratings.union(mydatardd)
      // ALS.train(ratings, rank, iterations, lambda)
      val model = ALS.train(totalratings, 8, 5, 1.0)
      val products = model.recommendProducts(1, 10)
      // load data from movies, join it and display the names ordered by ratings
  • 49. Spark - MLlib spark.mllib - Basic Statistics - Summary
      from pyspark.mllib.stat import Statistics
      sc = ...   # SparkContext
      mat = ...  # an RDD of Vectors
      # Compute column summary statistics.
      summary = Statistics.colStats(mat)
      print(summary.mean())
      print(summary.variance())
      print(summary.numNonzeros())
  • 50. Spark - MLlib MLlib - Clustering ● Clustering is an unsupervised learning problem ● Group subsets of entities with one another based on some notion of similarity. ● Often used for exploratory analysis
  • 51. Spark - MLlib MLlib supports the following models. K-means: clusters the data points into a predefined number of clusters. Gaussian mixture: subgroups within an overall population. Power iteration clustering (PIC): clustering vertices of a graph given pairwise similarities as edge properties. Latent Dirichlet allocation (LDA): infers topics from a collection of text documents. Streaming k-means.
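To make "clusters the data points into a predefined number of clusters" concrete, here is a minimal plain-Python sketch of the classic assign-then-average loop (Lloyd's algorithm). It is an illustration under assumptions, not MLlib's implementation: the helper names (`kmeans`, `sq_dist`, `mean_point`), the sample points and the hand-picked starting centers are all invented for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm; `centers` holds the initial guesses (k = len(centers))."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

# Made-up data: two obvious groups near (0.1, 0.1) and (9.1, 9.1).
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, centers=[(0.0, 0.0), (9.0, 9.0)])

# Within-set sum of squared errors: each point's squared distance to its nearest center.
wssse = sum(min(sq_dist(p, c) for c in centers) for p in points)
```

The number of clusters is fixed up front by how many starting centers are supplied, which is what "predefined" means on the slide.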
  • 52. Spark - MLlib MLlib - k-means Example
      import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
      import org.apache.spark.mllib.linalg.Vectors
      // Load and parse the data
      val data = sc.textFile("/data/spark/kmeans_data.txt")
      val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
      // Cluster the data into two classes using KMeans
      val numClusters = 2
      val numIterations = 20
      val clusters = KMeans.train(parsedData, numClusters, numIterations)
      // Evaluate clustering by computing Within Set Sum of Squared Errors
      val WSSSE = clusters.computeCost(parsedData)
      println("Within Set Sum of Squared Errors = " + WSSSE)
      // Save and load model
      clusters.save(sc, "KMeansModel1")
      val sameModel = KMeansModel.load(sc, "KMeansModel1")
  • 53. Spark - MLlib MLlib - k-means Example
      from pyspark.mllib.clustering import KMeans, KMeansModel
      from numpy import array
      from math import sqrt
      # Load and parse the data
      data = sc.textFile("/data/spark/mllib/kmeans_data.txt")
      parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
      # Build the model (cluster the data)
      # Note: `runs` is an older parameter; newer Spark releases may reject it, in which case drop it.
      clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")
  • 54. Spark - MLlib MLlib - k-means Example
      # Evaluate clustering by computing Within Set Sum of Squared Errors
      def error(point):
          center = clusters.centers[clusters.predict(point)]
          return sqrt(sum([x**2 for x in (point - center)]))
      WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
      print("Within Set Sum of Squared Error = " + str(WSSSE))
      # Save and load model
      clusters.save(sc, "myModelPath")
      sameModel = KMeansModel.load(sc, "myModelPath")
  • 55. Spark - MLlib Example - Movie Lens Recommendation https://github.com/cloudxlab/bigdata/blob/master/spark/examples/mllib/ml-recommender.scala [Diagram: the Movie Lens movie ratings are split into a Training Set (80%) and a Test Set (20%); MLlib builds a Model from the training set; the ratings are removed from the test set and the model is applied to it to produce Recommendations.]
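The 80% / 20% workflow on this slide can be sketched without Spark. This is a hedged illustration only: the ratings are synthetic values generated for the example, `split_ratings` and `rmse` are hypothetical helper names, and a global-mean predictor stands in for the MLlib model so that the evaluation step can be shown end to end.

```python
import math
import random

def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle (user, movie, rating) triples and split them into train and test."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def rmse(predict, test):
    """Hide the true ratings, predict them, and measure root mean squared error."""
    squared_errors = [(predict(user, movie) - rating) ** 2 for user, movie, rating in test]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Synthetic ratings (NOT MovieLens data): 10 users x 10 movies, scores from 1 to 5.
ratings = [(u, m, float((u + m) % 5 + 1)) for u in range(10) for m in range(10)]
train, test = split_ratings(ratings)

# Stand-in "model": always predict the mean rating seen in training.
global_mean = sum(r for _, _, r in train) / len(train)
score = rmse(lambda user, movie: global_mean, test)
print("RMSE on held-out ratings:", score)
```

In the real pipeline the lambda would be replaced by predictions from the trained recommender; keeping the test ratings out of training is what makes the error figure meaningful.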