Mool - Automated Log Analysis using Data Science and ML

© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Mool – Automated Root Cause Analysis using ML
Rohit Choudhary & Gaurav Nagar, Hortonworks

Introduction
 HDP
– Cumulative Big Data Package with 25+ Certified Open Source Apache Projects
– Source code arrives from both the community and internal engineering
 QE and Certification Process
– Every change goes through Git and Gerrit
– System tests are written for each component; 100s of new tests are added every release
 Release Stability
– Determined by System Test failure and pass percentages
– Once new features and System Tests are at 100%, we call the release done!
 Releases
– On-premise Releases
– Cloud Releases – HDI and HDC

Problem Statement
 Test Suite Size
– System Tests are organized as Suites, also called Splits – 700
– Several 1000s of test cases, executed in every run
 Infrastructure
– YarnCloud Infrastructure & OpenStack Infrastructure
– 700 x 5-node+ HDP clusters – creation and teardown
– Test Suites are run on each cluster and logs are collected
– Tests produce 1-1.5 TB of system logs across our stack every day
 Failure Assessments and Subsequent Process
– Component owners undertake the responsibilities of identifying failures
– Time-consuming and repetitive, without increasing system knowledge
– Restrictive (reduces our ability to release faster)

Mool – Automated Log Analysis
 Root Cause across components in one click
– Identify common failure causes across components
 Recommend actions, instead of assisted search, using:
– Systemic Knowledge/Repository of Errors and their associations
– Recency of occurrence
– Source modifications as data features
– Current and past reported issues in ticketing systems
 Integrate with downstream process lifecycle
– Test Analysis
– Ticketing system integration
Mool – Sanskrit for "root"

Past Industry Efforts – AALA @Siemens
 Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet
“The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms
do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true
answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the
test system verdicts.”

RCA Analysis Process
1 Feature Extraction
– Log Message Feature Extraction
– Test Failure Feature Extraction
2 Enrichment
– Enriched with Test Execution Time
– Origin Components
3 Learning
– Error Categorization
– RCA Analysis
– Error Repository Upgrades

Algebra
Figure: test cases TC1, TC2, TC3, TC4 linked to errors E1, E2, E3, E4; dimensions are Run ID, Component, Suite
Test Case – Error Correlation
TC1 = {E1, E2, E4}
TC2 = {E1, E3}
TC3 = {E3, E4}
TC4 = {E1, E4}
Error – Test Case Correlation (Conversely)
E1 = {TC1, TC2, TC4}
E4 = {TC1, TC3, TC4}
Where Components = {C1, C2, C3, C4}
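The converse relation above can be derived mechanically from the test case sets. A minimal sketch, assuming the correlation is held as plain in-memory sets; the names `test_case_errors`, `invert`, and `ranked` are illustrative and not taken from Mool itself:

```python
from collections import defaultdict

# Test case -> errors observed during that test case (values from the slide).
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(mapping):
    """Build the converse relation: error -> test cases in which it appeared."""
    inverse = defaultdict(set)
    for test_case, errors in mapping.items():
        for error in errors:
            inverse[error].add(test_case)
    return dict(inverse)


error_test_cases = invert(test_case_errors)

# Errors shared by the most failing test cases are natural candidates
# for a common cause; ties are broken by error name for determinism.
ranked = sorted(error_test_cases, key=lambda e: (-len(error_test_cases[e]), e))
```

Ranking by the size of each error's test case set is only one possible heuristic for surfacing a shared cause; the talk does not state which ordering Mool uses.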

Algebra - Explained
Single-cluster run (figure): Suites i, j, k, l, … n run on one cluster between T = 0 and T = f. At a given time T = t, each suite contributes its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…).

Algebra - Explained
Multi-cluster run (figure): Suite i appears in four panels, each with its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…):
– T = 0 to T = f, marked at T = t
– T = t1 to T = f2, marked at T = t2
– T = t2 to T = f3, marked at T = t3
– T = t3 to T = f4, marked at T = t4

Error Paths and Feature Extraction
Test Suites (figure): Hive Suite, Spark, Oozie Workflow
Stack Call columns (figure):
– Hive Server2, Yarn, ATS, HDFS
– Livy, Yarn, HDFS
– Pig, Hive, Yarn, HDFS
Errors: E1, E2, E3; E1, E2, E3, E4, E5, E6; En…
Test Case Features = {name, suite_name, start_time, end_time, status}
Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
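The two feature sets can be modeled as simple records, and the timestamps give one way to tie errors to a test case. A minimal sketch, assuming that an error belongs to a test case when its occurrence_time falls inside the test's execution window; the class names and the `errors_for` helper are hypothetical, only the field names come from the slide:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TestCaseRecord:
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass(frozen=True)
class ErrorRecord:
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str
    category: str
    file_name: str


def errors_for(test_case, errors):
    """Return the errors logged while the given test case was executing."""
    return [
        e for e in errors
        if test_case.start_time <= e.occurrence_time <= test_case.end_time
    ]


# Illustrative data only; none of these values come from the talk.
tc = TestCaseRecord("test_insert", "hive", datetime(2017, 1, 1, 10, 0),
                    datetime(2017, 1, 1, 10, 5), "FAILED")
inside = ErrorRecord("java.io.IOException ...", "Connection refused",
                     datetime(2017, 1, 1, 10, 2), "HDFS", "network",
                     "namenode.log")
outside = ErrorRecord("java.lang.NullPointerException ...", "NPE",
                      datetime(2017, 1, 1, 11, 0), "Yarn", "code",
                      "resourcemanager.log")
matched = errors_for(tc, [inside, outside])
```

A pure time-window join will also pick up unrelated errors logged by concurrent activity on the same cluster, which is one reason later filtering steps are needed.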

Salient Points: Failure Sample & Error Samples
Test Case Failures

System Interactions
Components shown (figure):
– Data Pipeline
– Source Code
– Historical Error DB
– Ticket Systems
– Metadata Store
– Ensemble Modeling & Learning
– Recommendations
– Automated Actions
– Customer Reports

Application Architecture
Ingestion
– Log Accumulation/release branch
– Grok parsers for HDP/Ambari components
Unsupervised Learning
– Identical Match (Stacktrace)
– Nearest Match (Levenshtein Adaptation)
– RCA/Associative Analysis
– Error Hierarchy Association
Outcome
– Automated Ticket Processing
– Recommendation Based on Recency
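The identical-match and nearest-match steps can be illustrated as follows. This is a minimal sketch using a hash of the whitespace-normalized stack trace and the textbook Levenshtein distance; the talk mentions a "Levenshtein Adaptation" without describing it, so the `max_ratio` threshold and all function names here are assumptions:

```python
import hashlib


def fingerprint(stacktrace: str) -> str:
    """Identical match: hash the stack trace after trimming each line."""
    normalized = "\n".join(
        line.strip() for line in stacktrace.strip().splitlines()
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nearest_match(message, known_messages, max_ratio=0.2):
    """Return the closest known message, or None if none is within max_ratio.

    The distance is divided by the longer string's length so that the
    threshold behaves the same for short and long messages.
    """
    best, best_ratio = None, max_ratio
    for candidate in known_messages:
        longest = max(len(message), len(candidate)) or 1
        ratio = levenshtein(message, candidate) / longest
        if ratio <= best_ratio:
            best, best_ratio = candidate, ratio
    return best
```

For example, `nearest_match("Connection refused to host nn1:8020", known)` would match a known entry that differs only in the host name, while an unrelated message returns `None` and can be stored as a new unique error.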

Split Processor
Test Clusters
Storage
Deployment Architecture
Livy (Job Server)
HDFS
Spark Jobs
MetaData
Store
Log Daemon
Log daemons
Push Logs into HDFS
Trigger Analysis at End of Run
Web Application
Manual Input for Selection/Rejection of Outcome
Data Processing Data SourceApplication

RCA Versus Error Graph Creation
 Error Graphs Creation Failed
– The FP-Growth algorithm did not yield the desired results
– Too many closed loops, cyclic dependencies
– Time as a split dimension was not enough
 Moved towards RCAs
– Origin of the error chain was easier to find out
– Accuracy was higher
– Enough data supporting multiple code-flows
 Easier to validate through our system analysts
– Unsupervised Learning is hard to validate without manual intervention

RCA Rejections
 False Positives are very prevalent
– Dominating Exceptions because of frequent code path execution
– They are repetitive and need to be ignored statistically, based on decile values
 Priority versus Ignored versus Historical
– Historical RCAs, based on source code changes and recency, allow the final decision
– If corresponding tickets are open, then those issues take priority
 Common Exceptions or Common RCAs
– Prioritize the ones that are causing cross-component failures
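The rejection and prioritization rules above might be combined as in the following minimal sketch. It assumes that "dominating" means an occurrence count above the ninth decile of all counts, and that open tickets outrank cross-component reach; both choices, and every name in the code, are illustrative rather than Mool's actual rules:

```python
from statistics import quantiles


def dominating_errors(error_counts):
    """Errors whose occurrence count lies above the ninth decile."""
    if len(error_counts) < 2:
        return set()  # too few data points to compute deciles
    ninth_decile = quantiles(error_counts.values(), n=10)[-1]
    return {e for e, count in error_counts.items() if count > ninth_decile}


def prioritize(error_counts, open_tickets, components_hit):
    """Drop dominating errors, then rank the remaining RCA candidates.

    Candidates with an open ticket come first, then those that affect
    the most components; the error name breaks ties deterministically.
    """
    ignored = dominating_errors(error_counts)
    kept = [e for e in error_counts if e not in ignored]
    return sorted(
        kept,
        key=lambda e: (e not in open_tickets,
                       -len(components_hit.get(e, ())),
                       e),
    )


# Illustrative data only; none of these values come from the talk.
counts = {"E1": 1000, "E2": 3, "E3": 5, "E4": 2, "E5": 4,
          "E6": 1, "E7": 6, "E8": 2, "E9": 3, "E10": 4}
order = prioritize(counts,
                   open_tickets={"E3"},
                   components_hit={"E2": {"Hive", "Yarn"}, "E3": {"HDFS"}})
```

Here E1 is dropped as a dominating exception, E3 ranks first because it has an open ticket, and E2 follows because it fails across two components.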

RCA Graph

Quick Stats
Item                                           Data
Total Run IDs analyzed                         14410
Total Splits across components                 115 K
Raw errors parsed from logs                    120 M
Unique Errors                                  45025
Total Test Case failures                       170 K
Errors related to Failed Test Cases            592 K
Unique Errors related to Failed Test Cases     30570

Adoption Challenges
 Great for fast-changing code bases
– Individual component owners have reported up to 99% accuracy
– Multi-component use case scenarios need improvement
 Log collection required multiple iterations
– Order of logs being written and collected
– Central Log server issues
 Stable releases are harder to instrument
– Our internal team has been unable to use it
– Source code changes are minimal/recency parameters are harder to provide
 Unsupervised learning verification is harder
– Very hard to judge the performance of models effectively without manual intervention

Future Work
 Unsupervised learning validation using automated techniques
 Online processing using Spark Streaming
 Event based error detection on live production clusters
 Correlation with other log events/customer use cases

Thank You
Rohit Choudhary & Gaurav Nagar
