Introduction to ML with Apache Spark MLlib
#javaone
https://ua.linkedin.com/in/tarasmatyashovsky
2
I am not a data science engineer
3
4
lyrics
genre
5
“I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you”
https://github.com/tmatyashovsky/spark-ml-samples
6
7
8
• Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.?
• Count length of a verse?
• Count unique words in a verse?
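The hand-crafted ideas above can be sketched in a few lines. This is only an illustration, not code from the talk: the keyword set reuses the words quoted on the slide, and the function name and feature names are assumptions made for the example.

```python
import re

# Hypothetical keyword set, reusing the words quoted on the slide.
DARK_WORDS = {"fear", "fight", "kill", "devil", "death"}

def naive_features(verse: str) -> dict:
    """Compute three hand-crafted features for one verse."""
    # Lower-case the verse and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", verse.lower())
    return {
        "dark_word_hits": sum(1 for w in words if w in DARK_WORDS),
        "length": len(words),
        "unique_words": len(set(words)),
    }

features = naive_features("Nobody's putting up a fight")
print(features)
```

Rules like these are easy to write but hard to tune by hand, which is the motivation for learning the decision from data instead.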
9
10
15-20
11
Machine learning is the study of computer algorithms that improve automatically through experience
12
Supervised learning
Unsupervised learning
Reinforcement learning
13
14
• Date & time
• Conference name
• Speaker
• Talk name
• Track
• Duration
• Type
• Overall impression
• Overall rating
• Number of slides
• Time spent on live coding
• Number of jokes
• Etc.
15
Learning algorithms
Hypotheses:
Cost function:
Features:
Target variable:
Training example:
Training set:
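The formulas that accompanied these terms did not survive extraction. As a rough illustration only, and assuming the simplest case of a linear hypothesis with one feature and a squared-error cost (both are assumptions, not taken from the slide), the terms relate to each other like this. The training set values are made up.

```python
def hypothesis(theta0, theta1, x):
    """Assumed linear hypothesis: predicted target for feature value x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, training_set):
    """Assumed squared-error cost, averaged over the m training examples."""
    m = len(training_set)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in training_set) / (2 * m)

# Made-up training set; each training example is (feature, target variable),
# e.g. (number of jokes, speaker's rating).
training_set = [(0, 1.0), (2, 2.0), (4, 3.0)]

perfect = cost(1.0, 0.5, training_set)  # hypothesis passes through every example
worse = cost(0.0, 0.5, training_set)    # same slope, but shifted down by 1
print(perfect, worse)
```

A learning algorithm searches for the parameters that make the cost as small as possible.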
16
http://www.slideshare.net/liweiyang5/spark-mllib-training-material
17
Plot: speaker’s rating vs. number of jokes during a talk
18
19
20
21
22
23
24
Plot: impression (positive or negative) vs. number of jokes during a talk
25
26
27
28
29
30
31
Plot: number of jokes during a talk vs. time (min.) spent on live coding
Number of clusters: K = 5, K = 2
32
33
• Initialize cluster centroids
• Assign each example to the closest cluster centroid
• Recalculate each centroid as the average (mean) of the examples assigned to its cluster
• Repeat until the centroids no longer move
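The loop described on this slide can be sketched as follows. This is a minimal plain-Python illustration, not Spark's implementation: the initial centroids are passed in by the caller (the slide's initialization rule was lost), distance is assumed to be Euclidean, and the sample points are made up.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, centroids, max_iterations=100):
    for _ in range(max_iterations):
        # Assignment step: put each example into the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up examples: (time spent on live coding, number of jokes).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
print(centroids)
```

The result depends on the initial centroids and on the chosen K, which is why the slides compare K = 2 and K = 5 on the same data.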
34
35
36
• Collect data set of lyrics:
  • Pop: Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Heavy metal: Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
• Create training set, i.e. label (0|1) + features
• Train logistic regression (or other classification algorithm)
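To make the "label + features" idea concrete, here is a small plain-Python sketch of logistic regression trained with batch gradient descent. It is not the code from the talk and not Spark's optimizer; the single feature, the data values, the learning rate and the choice of which genre gets label 1 are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Estimated probability that example x has label 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train(examples, learning_rate=0.5, iterations=2000):
    """Batch gradient descent; examples is a list of (label, features) pairs."""
    n = len(examples[0][1])
    m = len(examples)
    weights, bias = [0.0] * n, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * n, 0.0
        for label, x in examples:
            error = predict(weights, bias, x) - label
            grad_b += error
            for j in range(n):
                grad_w[j] += error * x[j]
        bias -= learning_rate * grad_b / m
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
    return weights, bias

# Made-up training set: label (0|1) + one hypothetical feature per verse.
training_set = [(0, [0.0]), (0, [0.1]), (1, [0.8]), (1, [0.9])]
weights, bias = train(training_set)
print(predict(weights, bias, [0.9]), predict(weights, bias, [0.05]))
```

In the talk the features are not hand-made numbers like these but vectors produced by a feature extractor.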
37
38
39
GloVe
Bag of Words
Word2Vec
TF-IDF
http://spark.apache.org/docs/latest/ml-features.html#feature-extractors
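Two of the extractors named above are simple enough to sketch in plain Python. This is an illustration only, not Spark's implementation: the function names and the tiny corpus are made up, and the smoothed IDF formula log((n + 1) / (df + 1)) is the variant described in the Spark feature-extraction guide linked above.

```python
import math
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def tf_idf(documents):
    """documents: list of token lists. Returns (vocabulary, one TF-IDF vector per document)."""
    vocabulary = sorted({t for doc in documents for t in doc})
    n = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}
    # Smoothed inverse document frequency.
    idf = {w: math.log((n + 1) / (doc_freq[w] + 1)) for w in vocabulary}
    vectors = [
        [tf * idf[w] for w, tf in zip(vocabulary, bag_of_words(doc, vocabulary))]
        for doc in documents
    ]
    return vocabulary, vectors

# Made-up mini corpus of tokenized verses.
docs = [["highway", "to", "hell"], ["hell", "bells"], ["baby", "one", "more", "time"]]
vocabulary, vectors = tf_idf(docs)
print(vocabulary)
print(vectors[0])
```

Both produce one vector position per dictionary word, so the vectors grow with the vocabulary and are mostly zeros.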
40
• Produces unique fixed-size dense vectors
• Captures semantic and morphological similarity
https://code.google.com/archive/p/word2vec/
41
Similar scores (cos ~ 1)
Opposite scores (cos ~ -1)
Unrelated scores (cos ~ 0)
http://bionlp-www.utu.fi/wv_demo/
http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png
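The three cases on this slide follow directly from the definition of cosine similarity, the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up two-dimensional vectors (real Word2Vec vectors have many more dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

similar = cosine_similarity([1, 2], [2, 4])     # same direction
opposite = cosine_similarity([1, 0], [-1, 0])   # opposite direction
unrelated = cosine_similarity([1, 0], [0, 1])   # orthogonal
print(similar, opposite, unrelated)
```

Only the direction of the vectors matters here, not their length.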
42
43
Verse                                  Cosine similarity
baby one more time                     0.482028
crazy for you                          0.437875
show me the meaning of being lonely    0.258147
highway to hell                        -0.1120049
kill them all                          -0.231876
44
Under-fitting (high bias)
Over-fitting (high variance)
Appropriate fitting
http://mlwiki.org/index.php/Overfitting
47
Training set (66.6%)
Test set (33.3%)
K = 3
48
Training set (66.6%)
Test set (33.3%)
K = 3
49
Training set (33.3%)
Test set (33.3%)
Training set (33.3%)
K = 3
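The K = 3 splits shown on these slides can be produced with a few lines of plain Python. This is a sketch of the general technique, not Spark's CrossValidator: the function name is made up, and the folds are formed by simple striding rather than random shuffling.

```python
def k_fold_splits(examples, k):
    """Yield (training_set, test_set) pairs, one per fold."""
    # Fold i takes every k-th example starting at position i.
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield training_set, test_set

# Nine made-up examples, K = 3: each round trains on two thirds and tests on one third.
splits = list(k_fold_splits(list(range(9)), k=3))
for training_set, test_set in splits:
    print(training_set, test_set)
```

Every example is used for testing exactly once, and the K evaluation scores are usually averaged.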
50
51
Java
52
Weka
Encog
Aerosolve
FlinkML
https://github.com/josephmisiti/awesome-machine-learning
53
Ease of use
Cloud computing
Speed
Generality
Data processing
54
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
55
Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster
56
• Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
• Lets you invoke various algorithms on distributed datasets (RDD/Dataset)
http://spark.apache.org/docs/latest/mllib-guide.html
57
http://spark.apache.org/docs/latest/mllib-guide.html
spark.mllib: built on top of RDDs
spark.ml: built on top of Datasets
58
• Utilities: linear algebra, statistics, etc.
• Feature extraction, feature transformation, etc.
• Regression
• Classification
• Clustering
• Collaborative filtering, e.g. alternating least squares
• Dimensionality reduction
• And many more
http://spark.apache.org/docs/latest/mllib-guide.html
59
“All” spark.mllib features plus:
• Pipelines
• Persistence
• Model selection and tuning:
  • Train-validation split
  • K-fold cross-validation
http://spark.apache.org/docs/latest/ml-guide.html
60
Diagram: raw data enters as a Dataset and passes through a chain of Transformers and Estimators, each configured with its own [parameters]; a Cross Validator [pipeline, evaluator, parameters] wraps the pipeline.
http://spark.apache.org/docs/latest/ml-pipeline.html
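The roles in this diagram can be mimicked with a toy analog in plain Python. To be clear about the assumptions: none of these classes are the Spark API. They only mirror the idea that a Transformer maps a dataset to a dataset, an Estimator is fitted on a dataset and yields a Transformer (the model), and a pipeline chains such stages. All class names and data are made up.

```python
class Lowercase:
    """Transformer: maps a dataset (list of strings) to a new dataset."""
    def transform(self, rows):
        return [row.lower() for row in rows]

class VocabularyModel:
    """Fitted model: itself a Transformer that turns rows into count vectors."""
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [[row.split().count(word) for word in self.vocabulary] for row in rows]

class VocabularyEstimator:
    """Estimator: learns a vocabulary from a dataset and returns a fitted model."""
    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row.split()})
        return VocabularyModel(vocabulary)

class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Chains stages; fitting replaces each Estimator with the model it produces."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)

model = Pipeline([Lowercase(), VocabularyEstimator()]).fit(["Highway to hell", "Kill them all"])
vectors = model.transform(["HELL to all"])
print(vectors)
```

A cross validator would sit one level above this: it fits the whole pipeline several times with different parameters and data splits and keeps the best model.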
61
Using Spark MLlib Pipeline
Lyrics
63
I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you
64
Pipeline: Lyrics → Cleanser (a Dataset is passed between stages)
65
66
Pipeline: Lyrics → Cleanser → Numerator (a Dataset is passed between stages)
67
Im a rolling thunder a pouring rain
Im comin on like a hurricane
My lightnings flashing across the sky
Youre only young but youre gonna die
I wont take no prisoners wont spare no lives
Nobodys putting up a fight
I got my bell Im gonna take you to hell
Im gonna get you Satan get you
68
Lines numbered 1-8
Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover (a Dataset is passed between stages)
69
im a rolling thunder a pouring rain
im comin on like a hurricane
my lightnings flashing across the sky
youre only young but youre gonna die
i wont take no prisoners wont spare no lives
nobodys putting up a fight
i got my bell im gonna take you to hell
im gonna get you satan get you
70
Lines numbered 1-8
Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter (a Dataset is passed between stages)
71
im rolling thunder pouring rain
im comin like hurricane
lightnings flashing across sky
youre young youre gonna die
wont take prisoners wont spare lives
nobodys putting fight
got bell im gonna take hell
im gonna get satan get
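The first three text transformations shown on these slides (cleansing, tokenizing, stop-word removal) can be approximated in plain Python. This is a sketch of the idea, not the talk's pipeline stages: the function names and the deliberately tiny stop-word list are made up, and a real stop-word list is much longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list chosen for this example.
STOP_WORDS = {"a", "i", "my", "the", "on", "but", "only", "you", "to", "up", "no"}

def cleanse(line):
    """Drop everything except letters and spaces, so "I'm" becomes "Im"."""
    return re.sub(r"[^A-Za-z ]", "", line)

def tokenize(line):
    """Lower-case the line and split it on whitespace."""
    return line.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

line = "I'm a rolling thunder, a pouring rain"
tokens = remove_stop_words(tokenize(cleanse(line)))
print(" ".join(tokens))
```

Applied to the first lyric line, the three steps reproduce the first line of the stop-word-free text on the slide.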
72
Lines numbered 1-8
Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse] (a Dataset is passed between stages)
73
4
im roll thunder pour rain
im comin like hurrican
lightn flash across sky
your young your gonna die
wont take prison wont spare live
nobodi put fight
got bell im gonna take hell
im gonna get satan get
74
Lines numbered 1-8
verse1
verse2
8
im roll thunder pour rain
im comin like hurrican
lightn flash across sky
your young your gonna die
wont take prison wont spare live
nobodi put fight
got bell im gonna take hell
im gonna get satan get
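Grouping numbered lines into verses amounts to chunking a list. The sketch below assumes that the "Sentences in verse" parameter means the number of consecutive lines joined into one verse; that reading is inferred from the slide labels, and the function name is made up.

```python
def to_verses(lines, sentences_in_verse):
    """Join every sentences_in_verse consecutive lines into one verse string."""
    return [
        " ".join(lines[i:i + sentences_in_verse])
        for i in range(0, len(lines), sentences_in_verse)
    ]

# The eight stemmed lines shown on the slide.
lines = [
    "im roll thunder pour rain",
    "im comin like hurrican",
    "lightn flash across sky",
    "your young your gonna die",
    "wont take prison wont spare live",
    "nobodi put fight",
    "got bell im gonna take hell",
    "im gonna get satan get",
]

verses_of_4 = to_verses(lines, 4)
verses_of_8 = to_verses(lines, 8)
print(len(verses_of_4), len(verses_of_8))
```

Under this assumption, four lines per verse yields two training examples from the sample and eight lines per verse yields one.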
75
Lines numbered 1-8
verse1
Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse] → Word2Vec [Vector size] (a Dataset is passed between stages)
76
4
[0.036463763926011056,
-0.013076733228398295,
...
0.03816963326281462]
77
feature1
feature2
[-0.013962931134021625,
0.049275818325650804,
...
-0.058982484615766086]
8
[0.036463763926011056,
-0.013076733228398295,
0.044362547532774695,
0.03816963326281462,
...
-0.013962931134021625,
0.049275818325650804,
-0.058982484615766086]
78
feature1
Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse] → Word2Vec [Vector size] → Logistic Regression [Max iterations, Reg parameter] (a Dataset is passed between stages)
79
Probability: [0.9212126972383768, 0.07878730276162313]
Prediction: 0.0
80
Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse] → Word2Vec [Vector size] → Logistic Regression [Max iterations, Reg parameter] (a Dataset is passed between stages)
The whole pipeline is wrapped in a Cross Validator, which produces the Model.
81
[0.8454839775240359,
0.9061236588248319,
0.9527128936788524,
0.9522790271664413,
...
0.9526248129757111,
0.9522790271664411]
82
83
84
85
86
• ML is not as complex as it seems from an applied perspective
• Existing libraries and frameworks take care of a lot of tedious work
• For instance, Spark MLlib can help build nice ML pipelines
87
• https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms
• Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
• https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
• https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
• https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
• https://www.kaggle.com/c/dogs-vs-cats/
• http://yann.lecun.com/exdb/mnist/
• http://www.bcl.hamilton.ie/~barak/teach/F98/ECE547/hw1/index.html
• http://www.slideshare.net/jeykottalam/pipelines-ampcamp
• https://github.com/master/spark-stemming
• https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-apache-spark.html
• http://www.degeneratestate.org/posts/2016/Apr/20/heavy-metal-and-natural-language-processing-part-1/
• https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html
• http://www.slideshare.net/liweiyang5/spark-mllib-training-material
• https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
• http://www.slideshare.net/databricks/combining-machine-learning-frameworks-with-apache-spark
• https://databricks.com/blog/2015/10/20/audience-modeling-with-apache-spark-ml-pipelines.html
• https://github.com/deeplearning4j/deeplearning4j
• http://deeplearning4j.org/spark
• http://mlwiki.org/index.php/Overfitting
• http://bionlp-www.utu.fi/wv_demo/
• https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/
88

Editor's Notes

  • #19: Score of the speaker based on xxx.
  • #26: Quantity of jokes used. Liked or not liked the speaker.
  • #34: Assign (index) each example to the cluster centroid closest to it. Recalculate (move) each centroid to the average (mean) of the examples assigned to its cluster. Repeat until the centroids no longer move.
  • #41: Bag of words – a single word is a one-hot encoded vector with the size of the dictionary. As a result – a lot of sparse vectors.
  • #42: Behind the scenes - a two-layer neural net that processes text. It captures semantic and morphological similarity, so similar words are close in the vector space. Similar words would be clustered together in the high-dimensional sphere.
  • #43: If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close. For two completely random words, the similarity is pretty close to 0. At the opposite end there is usually not an antonym, just noise. Used Google News Negative 300.
  • #44: My corpus - 8316 words
  • #53: Let’s finally go to the implementation using a library or framework that is going to help us to avoid tedious transformations and provide algorithms as well as feature extractors out-of-the-box.