Log Analytics in Datacenter with Apache Spark and Machine Learning

DataMass Summit 2017
Agnieszka Potulska
Intel Technology Poland
agnieszka.potulska@intel.com
Piotr Tylenda
Intel Technology Poland
piotr.tylenda@intel.com

© 2017 Intel Corporation
Agenda
1. What problem would we like to solve?
2. Data pipeline and key components
3. Cluster analysis
4. TF-IDF, word2vec
5. k-means algorithm
6. PySpark example
7. Clustering visualization sneak-peek
8. Lessons learned
Source: https://www.iconsfind.com/20151011/check-checklist-document-list-menu-todo-todo-list-icon/

Problem statement
Workload: a stimulus applied to the observed target, with predefined actions and observable parameters.
In other words, the actions that we execute on the specified server.
1. Need for management of workload log failure information
2. Duplication of work: engineers analyze similar problems independently

Standard workflow: Workload execution → Logs collection → Expert analysis → Decision
Machine Learning workflow: Workload execution → Logs collection pipeline → Machine learning analysis → Automated decision

Log Collection & Analysis
Machine Learning
Full Text Search
Workload scheduler
Metadata
Clusters
* Other names and brands may be claimed as the property of others.

Key Components
• Apache Kafka*
– Enables publishing data from numerous producers
– Logs are streamed as small messages in real time
• Apache Spark Streaming*
– Feeds data to HDFS* in micro-batches
• ELK* stack
– Full-text search and visualization of data in cluster
• Apache Spark*
– Machine learning batch processing
• Apache Zeppelin*
– Web-based workbook repository for data scientists
* Other names and brands may be claimed as the property of others.

Cluster Analysis
• Workload logs can be treated as text documents, and suitable clustering algorithms exist for text.
• The objective is to group the dataset into clusters.
• Objects assigned to the same cluster are more similar to each other (under a predefined
similarity measure) than to objects in other clusters.
• Major technique used in exploratory data mining.
• Unsupervised machine learning.

Cluster Analysis – s1 Dataset Example
Dataset: P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006

Cluster Analysis Algorithms
• Hierarchical clustering (e.g. single-linkage clustering)
• Density-based clustering (e.g. DBSCAN)
• Distribution-based clustering (e.g. EM algorithm)
• Centroid-based clustering (methods derived from the k-means algorithm)
Source: https://upload.wikimedia.org/wikipedia/commons/1/12/Iris_dendrogram.png

Log Data Machine Learning Steps
Filtering and Stopwords Removal
Tokenization
TF-IDF / word2vec Conversion
Normalization (optional)
k-means Clustering

Feature Vectorization
How to represent texts and words as vectors?
Vector Space Model
Documents are represented as vectors; each dimension of the vector space corresponds to a word.

         DOC1   DOC2   DOC3
home     14     19     45
stop     9      0      0
event    0      32     4
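
The vector space model above can be sketched in a few lines of plain Python. This is only an illustration: the function name `term_count_vectors` and the three sample documents are made up for this sketch and do not come from the talk.

```python
from collections import Counter


def term_count_vectors(docs):
    """Return (vocabulary, vectors) with one raw term-count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per distinct word, in a fixed (sorted) order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical documents, not taken from the slides.
docs = ["stop event stop", "event home", "home home event"]
vocabulary, vectors = term_count_vectors(docs)
```

Each row of `vectors` is one document, and position `i` in every row counts occurrences of `vocabulary[i]`.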

Term Frequency – Inverse Document Frequency
• Widely used in text mining, search and classification tasks.
• Adds weights to the document vectors that reflect the importance of each term.
W      TF(D1)  TF(D2)  IDF        TF-IDF(D1)  TF-IDF(D2)
I      1       1       log(3/3)   0           0
like   1       1       log(3/3)   0           0
red    1       1       log(3/3)   0           0
do     0       1       log(3/2)   0           0.176
not    0       1       log(3/2)   0           0.176
D1: I like red.
D2: I do not like red.
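
The numbers in the table are consistent with a smoothed IDF of the form log10((N + 1) / (df + 1)), where N is the number of documents and df is the number of documents containing the term. That formula is inferred from the table values rather than stated on the slide, so treat the sketch below as one plausible reading. The helper name `tf_idf` is hypothetical.

```python
import math


def tf_idf(docs):
    """Return (idf, weights); weights[i][w] is the TF-IDF of word w in document i."""
    tokenized = [doc.lower().replace(".", "").split() for doc in docs]
    n_docs = len(tokenized)
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    # Document frequency: in how many documents does each word occur?
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}
    # Assumed smoothed IDF with a base-10 logarithm (matches log(3/2) = 0.176).
    idf = {w: math.log10((n_docs + 1) / (df[w] + 1)) for w in vocabulary}
    weights = [
        {w: tokens.count(w) * idf[w] for w in vocabulary} for tokens in tokenized
    ]
    return idf, weights


idf, weights = tf_idf(["I like red.", "I do not like red."])
```

Words that occur in every document ("I", "like", "red") get a weight of zero, while "do" and "not" are the only terms that distinguish D2 from D1.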

Word2vec
• Developed by Mikolov et al. (Google 2013)*
• Word Embedding – produces vector
representation of words.
• Which words occurred near other words?
• Spark ML implementation of word2vec
supports cosine distance as similarity
measure – produced better results in our use
case.
• Provides dimensionality reduction – more
suitable for Spark k-means.
Figure (example words in vector space): apple, banana, orange, bicycle, book, notepad
*Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space"
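
Cosine similarity, mentioned above as the similarity measure for word vectors, compares the direction of two vectors and ignores their length. A minimal sketch follows; the three-dimensional "embeddings" are invented for illustration and are far smaller than real word2vec vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not produced by any trained model.
embeddings = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "bicycle": [0.1, 0.9, 0.3],
}
sim_fruit = cosine_similarity(embeddings["apple"], embeddings["banana"])
sim_mixed = cosine_similarity(embeddings["apple"], embeddings["bicycle"])
```

With these toy values, the two fruit vectors point in nearly the same direction, so `sim_fruit` is much larger than `sim_mixed`.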

k-means Algorithm
• Clustering algorithm which groups n objects
(points) into k groups, where k is a predefined
parameter.
• Each object is assigned to the group whose centroid is
closest (most similar) to it.
• k-means defines a whole family of algorithms
such as k-medians, k-medoids or c-means.

k-means Problem Definition
Given a dataset of $n$ objects $X = \{x_1, x_2, x_3, \dots, x_n\}$, where each object is a $d$-dimensional vector ($x_i \in \mathbb{R}^d$).
The k-means problem is then defined as: divide the dataset $X$ into $k$ groups
(clusters) $C = \{C_1, C_2, C_3, \dots, C_k\}$ in such a way that the within-set sum of squares is
minimized:
$$E(C) = \sum_{i=1}^{k} \sum_{x \in C_i} d^2(x, \bar{C}_i)$$
where $\bar{C}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the centroid of a given group and $d(x, y)$ is a distance
function between $x$ and $y$.
This is an NP-hard problem.
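
To make the objective concrete, the sketch below evaluates the within-set sum of squares for a fixed clustering, assuming Euclidean distance. The function names and the sample points are illustrative only.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def squared_distance(x, y):
    """Squared Euclidean distance between x and y."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def within_set_sum_of_squares(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(squared_distance(x, c) for x in points)
    return total


# Two hypothetical clusters of two 2D points each.
clusters = [[[0.0, 0.0], [2.0, 0.0]], [[10.0, 10.0], [10.0, 12.0]]]
cost = within_set_sum_of_squares(clusters)
```

Every point here lies at distance 1 from its centroid, so each cluster contributes 2 and the total cost is 4.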

Heuristic Solution – k-means Algorithm
begin
    initialization*: divide dataset X into k random, exclusive groups;
    do
        foreach group, compute its L2-norm* centroid;
        foreach object in dataset, assign it to the closest* group (using centroids);
    while any C_i group assignment changed;
end

Input: X (dataset), k (expected number of clusters)
Output: C = {C_1, C_2, C_3, ..., C_k} (set of clusters)
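
The pseudocode above can be turned into a small runnable sketch. This version assumes Euclidean distance, a random-partition initialization, and a `max_iter` safety limit; the empty-group handling and all names are choices made for this illustration, not details from the talk.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, centroids) for len(points) >= k."""
    rng = random.Random(seed)
    # Initialization: random partition into k exclusive, non-empty groups.
    order = list(range(len(points)))
    rng.shuffle(order)
    assignment = [0] * len(points)
    for position, index in enumerate(order):
        assignment[index] = position % k
    centroids = []
    for _ in range(max_iter):
        # Step 1: recompute the centroid of every group.
        new_centroids = []
        for g in range(k):
            members = [p for p, a in zip(points, assignment) if a == g]
            # If a group became empty, keep its previous centroid.
            new_centroids.append(centroid(members) if members else centroids[g])
        centroids = new_centroids
        # Step 2: reassign every point to the group with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda g: squared_distance(p, centroids[g]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Converged: no point changed its group.
        assignment = new_assignment
    return assignment, centroids


# Two well-separated hypothetical groups of 2D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = k_means(points, k=2)
```

On this toy dataset the loop separates the points near the origin from the points near (10, 10); which group receives label 0 depends on the random initial partition.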

k-means Algorithm – Example (1)
Let's consider the following 2D points dataset.
Visualization based on: https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

The dataset will be divided into 4 clusters (k=4); Euclidean distance will be used as the similarity measure.

The algorithm is initialized by choosing 4 center points at random, so the initial clustering is random.

Each point is assigned to the closest center using Euclidean distance. This determines
the initial group assignment.

The new group assignment determines new centroid positions.
Each point is assigned to the closest centroid again.

These steps are repeated until the algorithm converges,
i.e. no points are assigned to a different cluster.

Some points have been reassigned again...

And again...

In the final iteration, the new positions of centroids have been calculated
and no point has been assigned to a different group. DONE!

k-means Algorithm – Initialization
• The most important part of the k-means algorithm.
• Initialization predetermines how the algorithm will
converge.
• Different initializations can give different
clusterings.
• Simple approaches: Random, Forgy, MacQueen, Kaufman.
• Advanced approaches*: k-means++ and k-means||
*Source: David Arthur and Sergei Vassilvitskii: "The Advantages of Careful Seeding" [2007]
Figure panels: Random method, Forgy method

k-means++ Initialization
1. Choose the first center ("seed") $c_1$ from dataset $X$
(uniform distribution).
2. Choose the next center $c_i$ by choosing $x \in X$
with probability
$$\frac{d(x, c(x))^2}{\sum_{x' \in X} d(x', c(x'))^2}$$
where $c(x)$ is the closest center for $x$
("$d^2$-weighting").
3. Repeat step 2 until $k$ initial centers are
selected.
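
The seeding steps above can be sketched as follows, assuming Euclidean distance and at least k distinct points in the dataset. The function name `k_means_pp_seeds` is hypothetical, and this is the sequential k-means++ procedure, not the parallel k-means|| variant.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means_pp_seeds(points, k, seed=0):
    """Pick k initial centers with d^2-weighting (needs >= k distinct points)."""
    rng = random.Random(seed)
    # Step 1: the first center is drawn uniformly from the dataset.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its closest already chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # Step 2: draw the next center with probability proportional to the weights.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical 2D points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [20.0, 0.0]]
centers = k_means_pp_seeds(points, k=3)
```

A point that is already a center has weight zero, so it is not drawn again; points far from all current centers are the most likely to be picked next.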

Example (PySpark)
import pprint
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

k = 25 # Number of clusters
# sqlContext is assumed to be provided by the PySpark shell or Zeppelin notebook
df = sqlContext.read.parquet("data.parquet") # Input dataframe, 'content' column defines a single document
tokenizer = RegexTokenizer(inputCol="content", outputCol="words", gaps=True, pattern="\\W") # Word tokenization: split on non-word characters
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered_words") # Standard stopwords removal
word2vec = Word2Vec(vectorSize=100, minCount=5, windowSize=10, maxIter=2, inputCol=remover.getOutputCol(), outputCol="features")
kmeans = KMeans(k=k, predictionCol="prediction", initMode="k-means||", initSteps=10, tol=1e-7, maxIter=600)
pipeline = Pipeline(stages=[tokenizer, remover, word2vec, kmeans])
print("Starting k-means... (k={0})".format(k))
model = pipeline.fit(df)
clustering_df = model.transform(df) # Column 'prediction' contains cluster number for each document
kmeans_model = model.stages[-1]
print("Cluster centers: {0}".format(pprint.pformat(kmeans_model.clusterCenters())))
# computeCost is available in Spark 2.x; it was removed in Spark 3.0 (use kmeans_model.summary.trainingCost there)
print("Within set sum of squared errors ({0}) = {1}".format(k, kmeans_model.computeCost(clustering_df)))

Cluster Validation - WSSSE
• k-means minimizes the within-set sum of squared
errors (WSSSE).
• The simplest internal cluster validation index
(no external labelling needed).
• Decreases monotonically with the number of
detected clusters.
• Can be used to detect the number of clusters
(𝑘 parameter) with the "elbow method".
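
One simple way to automate the elbow method is to pick the k at which the improvement in WSSSE drops off most sharply, measured by the largest second difference. This is only one of several possible heuristics and is not necessarily what was used in the talk; the WSSSE values below are invented, and the sketch assumes evenly spaced values of k.

```python
def elbow_k(wssse_by_k):
    """Return the k with the largest second difference of WSSSE (the 'bend')."""
    ks = sorted(wssse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = wssse_by_k[prev_k] - wssse_by_k[k]
        gain_after = wssse_by_k[k] - wssse_by_k[next_k]
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical WSSSE values for k = 1..5, not measurements from the talk.
wssse_by_k = {1: 1000.0, 2: 600.0, 3: 150.0, 4: 120.0, 5: 100.0}
chosen_k = elbow_k(wssse_by_k)
```

In this made-up series the cost falls steeply up to k = 3 and flattens afterwards, so the heuristic selects 3.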

Interactive Datacenter Log Clustering Visualization
Timeframe: 72 h
Workload servers: 71
Raw log data: 127 GB
Log messages: 172 million
Clusters: 56

Lessons learned
INFO: Error happened. Reboot platform
ERROR: This is a debug message
WARN: Critical error occurred
• Efficient logging requires consistency.
• Spark ML k-means implementation supports only Euclidean distance –
currently no support for cosine similarity.
• Document clustering is very sensitive to data preprocessing quality.

Cluster Analysis – Use Cases
• Data exploration
• Statistical data analysis
• Recommender systems
• Text mining
• Pattern recognition
• Image segmentation and analysis
• Bioinformatics
• Medicine
• Market research

Clustering Validation – ℱ1-score
• External clustering validation index (requires external labelling).
• Set-overlapping based measure.
• ℱ1-score for a cluster $C_j$ with respect to an external classification
$V_i$ is defined as the harmonic mean of precision and recall:
$$\mathcal{F}_1(V_i, C_j) = \frac{2}{\frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}}}$$
• Micro-averaged ℱ1-score for clustering $C = \{C_1, C_2, C_3, \dots, C_k\}$ and
classification $V = \{V_1, V_2, V_3, \dots, V_m\}$ is then defined as:
$$\mathcal{F}_1(V, C) = \sum_{i=1}^{m} \frac{|V_i|}{n} \max_{1 \le j \le k} \mathcal{F}_1(V_i, C_j)$$
Source: https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg
(CC BY-SA 4.0)
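
The two definitions above translate directly into code when classes and clusters are represented as sets of object identifiers. In this sketch, precision is taken as |V_i ∩ C_j| / |C_j| and recall as |V_i ∩ C_j| / |V_i|, which is the usual set-overlap reading; the function names and the example sets are hypothetical.

```python
def f1_score(true_class, cluster):
    """Harmonic mean of precision and recall for one class/cluster pair."""
    overlap = len(true_class & cluster)
    if overlap == 0:
        return 0.0  # No shared objects: precision and recall are both zero.
    precision = overlap / len(cluster)
    recall = overlap / len(true_class)
    return 2 / (1 / precision + 1 / recall)


def micro_averaged_f1(classes, clusters):
    """Weight each class by its size and match it with its best cluster."""
    n = sum(len(v) for v in classes)
    return sum(
        len(v) / n * max(f1_score(v, c) for c in clusters) for v in classes
    )


# Hypothetical labelling: six objects, two true classes, two detected clusters.
classes = [{0, 1, 2}, {3, 4, 5}]
clusters = [{0, 1}, {2, 3, 4, 5}]
score = micro_averaged_f1(classes, clusters)
```

Object 2 is placed in the wrong cluster here, so the score is below 1; a clustering that reproduces the classes exactly would score 1.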
