Log Analytics in Datacenter with Apache Spark and Machine Learning
DataMass Summit 2017
Agnieszka Potulska
Intel Technology Poland
agnieszka.potulska@intel.com
Piotr Tylenda
Intel Technology Poland
piotr.tylenda@intel.com
© 2017 Intel Corporation 2
Agenda
1. What problem would we like to solve?
2. Data pipeline and key components
3. Cluster analysis
4. TF-IDF, word2vec
5. k-means algorithm
6. PySpark example
7. Clustering visualization sneak-peek
8. Lessons learned
Source: https://www.iconsfind.com/20151011/check-checklist-document-list-menu-todo-todo-list-icon/
Problem statement
Workload: a stimulus applied to the observed target, with predefined actions and observable parameters. In other words, the actions that we execute on a specified server.
1. Failure information in workload logs needs to be managed.
2. Work is duplicated: engineers analyze similar problems independently.
Standard workflow: Workload execution → Logs collection → Expert analysis → Decision
Machine Learning workflow: Workload execution → Logs collection pipeline → Machine learning analysis → Automated decision
Log Collection & Analysis
(Architecture diagram; component labels: Workload scheduler, Metadata, Full Text Search, Machine Learning, Clusters.)
* Other names and brands may be claimed as the property of others.
Key Components
• Apache Kafka*
– Enables publishing data from numerous producers
– Logs are streamed as small messages in real time
• Apache Spark Streaming*
– Feeds data to HDFS* in micro-batches
• ELK* stack
– Full-text search and visualization of data in cluster
• Apache Spark*
– Machine learning batch processing
• Apache Zeppelin*
– Web-based workbook repository for data scientists
Cluster Analysis
• Workload logs can be treated as text documents, and suitable clustering algorithms exist for text.
• The objective is to group the dataset into clusters.
• Objects assigned to the same cluster are more similar to each other (under a predefined similarity measure) than to objects in other clusters.
• A major technique in exploratory data mining.
• A form of unsupervised machine learning.
Cluster Analysis – s1 Dataset Example
Dataset: P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006
Cluster Analysis Algorithms
• Hierarchical clustering (e.g. single-linkage clustering)
• Density-based clustering (e.g. DBSCAN)
• Distribution-based clustering (e.g. EM algorithm)
• Centroid-based clustering (methods derived from the k-means algorithm)
Source: https://upload.wikimedia.org/wikipedia/commons/1/12/Iris_dendrogram.png
Log Data Machine Learning Steps
Filtering and Stopwords Removal
Tokenization
TF-IDF / word2vec Conversion
Normalization (optional)
k-means Clustering
Feature Vectorization
How to represent texts and words as vectors?
Vector Space Model
        DOC1  DOC2  DOC3
home      14    19    45
stop       9     0     0
event      0    32     4
Documents are represented as vectors (the columns); each dimension of the vector space corresponds to a word (the rows).
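The vector space model above can be sketched in a few lines of plain Python. This is only an illustration: the function name `term_count_vectors`, the whitespace tokenization and the three toy documents are assumptions made for the example, not part of the original deck.

```python
from collections import Counter


def term_count_vectors(documents):
    """Return (vocabulary, vectors) with one raw term-count vector per document."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # One dimension per distinct word, in a fixed (sorted) order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical toy documents.
docs = ["home stop home", "home event event", "event home"]
vocabulary, vectors = term_count_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, "->", dict(zip(vocabulary, vector)))
```

Each document becomes one vector over the shared vocabulary, which is the representation the later clustering steps operate on.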
Term Frequency – Inverse Document Frequency
• Widely used in text mining, search and classification tasks.
• Adds weights to the document vectors that reflect the importance of each term.
D1: I like red.
D2: I do not like red.
W      D1  D2  IDF        TF-IDF D1  TF-IDF D2
I       1   1  log(3/3)   0          0
like    1   1  log(3/3)   0          0
red     1   1  log(3/3)   0          0
do      0   1  log(3/2)   0          0.176
not     0   1  log(3/2)   0          0.176
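The table values can be reproduced with a short sketch. It assumes the smoothed formula IDF = log10((N + 1) / (df + 1)), which is inferred from the table entries (log(3/3) and log(3/2) = 0.176 for N = 2 documents) rather than stated in the deck; the tokenization and the name `tf_idf` are likewise illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (idf, weights): IDF per term and one TF-IDF dict per document."""
    # Naive tokenization (assumption): lowercase, drop periods, split on spaces.
    tokenized = [doc.lower().replace(".", "").split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF with a base-10 logarithm (assumed from the table values).
    idf = {w: math.log10((n_docs + 1) / (df + 1)) for w, df in doc_freq.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * idf[w] for w in idf})
    return idf, weights


idf, weights = tf_idf(["I like red.", "I do not like red."])
for term in ["i", "like", "red", "do", "not"]:
    print(term, round(weights[0][term], 3), round(weights[1][term], 3))
```

Terms that occur in every document get weight 0, so only "do" and "not" distinguish D2 from D1.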
Word2vec
• Developed by Mikolov et al. (Google 2013)*
• Word Embedding – produces vector
representation of words.
• Which words occurred near other words?
• Spark ML implementation of word2vec
supports cosine distance as similarity
measure – produced better results in our use
case.
• Provides dimensionality reduction – more
suitable for Spark k-means.
(Figure: example words placed in the embedding space: apple, banana, orange, bicycle, book, notepad.)
*Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space"
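Cosine similarity between word vectors can be illustrated with a minimal sketch. The two-dimensional embeddings below are made-up values chosen for the example; real word2vec vectors are learned from data and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not produced by a trained model.
embeddings = {
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "bicycle": [0.1, 0.9],
}

fruit_pair = cosine_similarity(embeddings["apple"], embeddings["banana"])
mixed_pair = cosine_similarity(embeddings["apple"], embeddings["bicycle"])
print(fruit_pair, mixed_pair)
```

With these toy values the two fruit words score much higher than the fruit and bicycle pair, which is the behaviour a similarity measure over embeddings is meant to capture.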
k-means Algorithm
• A clustering algorithm that groups n objects (points) into k groups, where k is a predefined parameter.
• Each object is assigned to the group whose centroid is closest (most similar) to it.
• k-means has given rise to a whole family of algorithms, such as k-medians, k-medoids and c-means.
k-means Problem Definition
Given a dataset of $n$ objects $X = \{x_1, x_2, x_3, \dots, x_n\}$, where each object is a $d$-dimensional vector ($x_i \in \mathbb{R}^d$).
The k-means problem is then defined as: divide the dataset $X$ into $k$ groups (clusters) $C = \{C_1, C_2, C_3, \dots, C_k\}$ in such a way that the within-set sum of squares is minimized:
$$E(C) = \sum_{i=1}^{k} \sum_{x \in C_i} d^2(x, \bar{C}_i)$$
where $\bar{C}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the centroid of a given group and $d(x, y)$ is a distance function between $x$ and $y$.
This is an NP-hard problem.
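The objective can be evaluated directly for a given clustering. The sketch below assumes Euclidean distance and uses hypothetical helper names and toy points; it computes the cost of a fixed partition and does not search for the optimal one.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def squared_distance(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def within_set_sum_of_squares(clusters):
    """Sum over clusters of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(squared_distance(p, center) for p in points)
    return total


# Hypothetical clustering of four 2D points into two groups.
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],      # centroid (1, 0), contributes 1 + 1
    [[10.0, 10.0], [10.0, 14.0]],  # centroid (10, 12), contributes 4 + 4
]
cost = within_set_sum_of_squares(clusters)
print(cost)
```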
Heuristic Solution – k-means Algorithm
Input: X (dataset), k (expected number of clusters)
Output: C = {C_1, C_2, C_3, …, C_k} (set of clusters)
begin
    initialization*: divide dataset X into k random, exclusive groups;
    do
        foreach group, compute its L2-norm* centroid;
        foreach object in dataset, assign it to the closest* group (using centroids);
    while any C_i group assignment changed;
end
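A minimal plain-Python sketch of this iteration follows. It differs from the pseudocode in one labeled respect: it seeds the centroids with k randomly chosen data points (a Forgy-style start) instead of a random partition. The function names, the `max_iter` cap and the toy dataset are assumptions made for illustration.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its group
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty group keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, assignment


# Hypothetical dataset: two well-separated groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(points, k=2, seed=0)
print(assignment, centroids)
```

Because this is a heuristic for an NP-hard problem, the result on less clearly separated data depends on the starting centroids.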
k-means Algorithm – Example (1)
Let's consider the following 2D points dataset.
Visualization based on: https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
k-means Algorithm – Example (2)
The dataset will be divided into 4 clusters (k=4); Euclidean distance will be used as the similarity measure.
k-means Algorithm – Example (3)
The algorithm is initialized by choosing 4 center points at random, so the initial clustering is random as well.
k-means Algorithm – Example (4)
Each point is assigned to the closest center using Euclidean distance. This determines
the initial group assignment.
k-means Algorithm – Example (5)
The new group assignment determines new centroid positions.
Each point is assigned to the closest centroid again.
k-means Algorithm – Example (6)
These steps are repeated until the algorithm converges,
i.e. no points are assigned to a different cluster.
k-means Algorithm – Example (7)
Some points have been reassigned again...
k-means Algorithm – Example (8)
And again...
k-means Algorithm – Example (9)
In the final iteration, the new positions of centroids have been calculated
and no point has been assigned to a different group. DONE!
k-means Algorithm – Initialization
• The most important part of the k-means algorithm.
• Initialization determines how the algorithm will converge.
• Different initializations give different clusterings.
• Simple approaches: Random, Forgy, MacQueen, Kaufman.
• Advanced approaches*: k-means++ and k-means||
*Source: David Arthur and Sergei Vassilvitskii, "k-means++: The Advantages of Careful Seeding" (2007)
(Figure panels: Random method, Forgy method.)
k-means++ Initialization
1. Choose the first center ("seed") $c_1$ uniformly at random from the dataset $X$.
2. Choose the next center $c_i$ by selecting $x \in X$ with probability
$$\frac{d(x, c(x))^2}{\sum_{x' \in X} d(x', c(x'))^2}$$
where $c(x)$ is the closest already chosen center to $x$ ("$d^2$-weighting").
3. Repeat step 2 until $k$ initial centers are selected.
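The seeding steps above can be sketched with the standard library alone. The function name `k_means_pp_init`, the fixed seed and the toy dataset are illustrative assumptions; the sketch also assumes the dataset contains at least k distinct points, because `random.choices` rejects weights that are all zero.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means_pp_init(points, k, seed=0):
    """Pick k initial centers from `points` using squared-distance weighting."""
    rng = random.Random(seed)
    # Step 1: the first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from every point to its closest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # Step 2: draw the next center with probability proportional to that weight.
        # Points already chosen have weight 0 and are not drawn again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical dataset of distinct 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0)]
centers = k_means_pp_init(points, k=3, seed=1)
print(centers)
```

Far-away points are much more likely to be picked than points near an existing center, which spreads the initial centers across the dataset.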
Example (PySpark)
import pprint

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

k = 25  # Number of clusters
# Input DataFrame; the 'content' column of each row holds a single document.
# sqlContext is assumed to be provided by the Spark shell or Zeppelin session.
df = sqlContext.read.parquet("data.parquet")
# Word tokenization: split on non-word characters (regex \W)
tokenizer = RegexTokenizer(inputCol="content", outputCol="words", gaps=True, pattern="\\W")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered_words")  # Standard stopwords removal
word2vec = Word2Vec(vectorSize=100, minCount=5, windowSize=10, maxIter=2, inputCol=remover.getOutputCol(), outputCol="features")
kmeans = KMeans(k=k, predictionCol="prediction", initMode="k-means||", initSteps=10, tol=1e-7, maxIter=600)
pipeline = Pipeline(stages=[tokenizer, remover, word2vec, kmeans])
print("Starting k-means... (k={0})".format(k))
model = pipeline.fit(df)
clustering_df = model.transform(df)  # Column 'prediction' contains the cluster number for each document
kmeans_model = model.stages[-1]
print("Cluster centers: {0}".format(pprint.pformat(kmeans_model.clusterCenters())))
# computeCost is the Spark 2.x API; it was deprecated in Spark 2.4 and removed in 3.0
print("Within set sum of squared errors ({0}) = {1}".format(k, kmeans_model.computeCost(clustering_df)))
Cluster Validation - WSSSE
• k-means minimizes the within-set sum of squared errors (WSSSE).
• The simplest internal cluster validation index (no external labelling needed).
• Decreases monotonically as the number of clusters grows.
• Can be used to choose the number of clusters (the k parameter) with the "elbow method".
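One simple way to locate the elbow programmatically is to look for the value of k where the WSSSE curve bends most sharply. The sketch below uses the largest second difference as that criterion, which is only one of several possible heuristics and is not taken from the deck; the WSSSE values are hypothetical numbers invented for the example.

```python
def elbow_k(costs):
    """Return the k at which the WSSSE curve bends most sharply.

    `costs` maps consecutive values of k to the WSSSE obtained for that k.
    """
    ks = sorted(costs)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Improvement gained by reaching k, minus improvement gained by going past it.
        bend = (costs[prev_k] - costs[k]) - (costs[k] - costs[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical WSSSE values for k = 1..6 (not measured data).
costs = {1: 1000.0, 2: 520.0, 3: 180.0, 4: 150.0, 5: 135.0, 6: 128.0}
print(elbow_k(costs))
```

In practice the curve is often inspected visually as well, since real WSSSE curves rarely have a single unambiguous bend.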
Interactive Datacenter Log Clustering Visualization
Timeframe: 72 h
Workload servers: 71
Raw log data: 127 GB
Log messages: 172 million
Clusters: 56
Lessons learned
Examples of inconsistent log messages:
INFO: Error happened. Reboot platform
ERROR: This is a debug message
WARN: Critical error occurred
• Efficient logging requires consistency.
• The Spark ML k-means implementation supports only Euclidean distance; there is currently no support for cosine similarity.
• Document clustering is very sensitive to data preprocessing quality.
Log Analytics in Datacenter with Apache Spark and Machine Learning
Backup
Cluster Analysis – Use Cases
• Data exploration
• Statistical data analysis
• Recommender systems
• Text mining
• Pattern recognition
• Image segmentation and analysis
• Bioinformatics
• Medicine
• Market research
Clustering Validation – ℱ1-score
• External clustering validation index (requires external labelling).
• Set-overlap-based measure.
• The $\mathcal{F}_1$-score for a cluster $C_j$ with respect to an external class $V_i$ is defined as the harmonic mean of precision and recall:
$$\mathcal{F}_1(V_i, C_j) = \frac{2}{\frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}}}$$
• The micro-averaged $\mathcal{F}_1$-score for a clustering $C = \{C_1, C_2, C_3, \dots, C_k\}$ and a classification $V = \{V_1, V_2, V_3, \dots, V_m\}$ is then defined as:
$$\mathcal{F}_1(V, C) = \sum_{i=1}^{m} \frac{|V_i|}{n} \max_{1 \le j \le k} \mathcal{F}_1(V_i, C_j)$$
Source: https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg (CC BY-SA 4.0)
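The score can be computed with a few lines of Python when classes and clusters are given as sets of object identifiers. The sketch assumes the usual set-overlap definitions, precision = |V_i ∩ C_j| / |C_j| and recall = |V_i ∩ C_j| / |V_i|, which the slide does not spell out; the function names and the toy labelling are hypothetical.

```python
def f1_score(true_class, cluster):
    """F1 of one cluster with respect to one external class (both are sets)."""
    overlap = len(true_class & cluster)
    if overlap == 0:
        return 0.0  # no shared objects: treat the score as zero
    precision = overlap / len(cluster)   # assumed definition, see lead-in
    recall = overlap / len(true_class)   # assumed definition, see lead-in
    return 2 / (1 / precision + 1 / recall)


def clustering_f1(classes, clusters):
    """Class-size-weighted sum of each class's best-matching cluster F1."""
    n = sum(len(v) for v in classes)
    return sum(
        len(v) / n * max(f1_score(v, c) for c in clusters)
        for v in classes
    )


# Hypothetical labelling of six objects into two classes and two clusters.
classes = [{1, 2, 3, 4}, {5, 6}]
clusters = [{1, 2, 3}, {4, 5, 6}]
score = clustering_f1(classes, clusters)
print(round(score, 4))
```

Here the first class is best matched by the first cluster (F1 = 6/7) and the second class by the second cluster (F1 = 0.8), giving 4/6 · 6/7 + 2/6 · 0.8.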