1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Mool – Automated Root Cause Analysis using ML
Rohit Choudhary & Gaurav Nagar, Hortonworks
Introduction
• HDP
– Cumulative Big Data Package with 25+ Certified Open Source Apache Projects
– Source code arrives from both the community and internal engineering
• QE and Certification Process
– Every change goes through Git and Gerrit
– System tests are written for each component; hundreds of new tests are added every release
• Release Stability
– Determined by System Test failure and pass percentages
– Once new features and System Tests are at 100%, we call the release done!
• Releases
– On-premise Releases
– Cloud Releases: HDI and HDC
Problem Statement
• Test Suite Size
– System Tests are organized into 700 Suites, also called Splits
– Several thousand test cases, executed in every run
• Infrastructure
– YarnCloud Infrastructure & OpenStack Infrastructure
– 700 HDP clusters of 5+ nodes each, with creation and teardown
– Test Suites are run on each cluster and logs are collected
– Tests produce 1–1.5 TB of system logs across our stack every day
• Failure Assessments and Subsequent Process
– Component owners take responsibility for identifying failures
– Time-consuming and repetitive, without increasing system knowledge
– Restrictive (reduces our ability to release faster)
Mool – Automated Log Analysis
• Root cause across components in one click
– Identify common failure causes across components
• Recommend actions instead of assisted search, using:
– Systemic knowledge: a repository of errors and their associations
– Recency of occurrence
– Source modifications as data features
– Current and past reported issues in ticketing systems
• Integrate with the downstream process lifecycle
– Test Analysis
– Ticketing system integration
Mool is Sanskrit for "root".
Past Industry Efforts – AALA @Siemens
• Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet
“The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms
do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true
answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the
test system verdicts.”
RCA Analysis Process
1. Feature Extraction
– Log Message Feature Extraction
– Test Failure Feature Extraction
2. Enrichment
– Enriched with Test Execution Time
– Origin Components
3. Learning
– Error Categorization
– RCA Analysis
– Error Repository Upgrades
Algebra
[Diagram: for a given Run ID, Component, and Suite, test cases TC1 to TC4 are linked to the errors E1 to E4 observed during them]
Test Case – Error Correlation
TC1 = {E1, E2, E4}
TC2 = {E1, E3}
TC3 = {E3, E4}
TC4 = {E1, E4}
Error – Test Case Correlation (Conversely)
E1 = {TC1, TC2, TC4}
E4 = {TC1, TC3, TC4}
Where Components = {C1, C2, C3, C4}
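The two correlations are inverses of each other, so one can be derived from the other. A minimal sketch, assuming the sets are held as plain Python dictionaries (the function names `invert` and `shared_errors` are illustrative, not taken from Mool). Inverting the four test case sets as listed puts TC3 in the set for E4, because TC3 contains E4.

```python
from collections import defaultdict

# Test case -> errors observed during that test case (values from the slide).
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(mapping):
    """Build the converse relation: error -> test cases in which it occurred."""
    inverse = defaultdict(set)
    for test_case, errors in mapping.items():
        for error in errors:
            inverse[error].add(test_case)
    return dict(inverse)


def shared_errors(mapping, *test_cases):
    """Errors common to all given test cases (candidate common causes)."""
    return set.intersection(*(mapping[tc] for tc in test_cases))


error_test_cases = invert(test_case_errors)
print(sorted(error_test_cases["E1"]))
print(sorted(shared_errors(test_case_errors, "TC1", "TC4")))
```

Errors shared by several failing test cases are natural starting points when looking for a common cause across components.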
Algebra - Explained
[Diagram: SingleClusterRun. Suites i, j, k, l, …, n run on one cluster along a timeline from T = 0 to T = f. At any time T = t, each suite has its own errors (E1, E2, E3, …) and test cases (TC1, TC2, TC3, …).]
Algebra - Explained
[Diagram: Multi-clusterRun. Suite i is run several times, each run in its own time window (T = 0 to T = f, T = t1 to T = f2, T = t2 to T = f3, T = t3 to T = f4). Each run has its own errors (E1, E2, E3, …) and test cases (TC1, TC2, TC3, …).]
Error Paths and Feature Extraction
[Diagram: each test suite exercises a stack of components, and every component in the stack can emit errors]
Test Suites: Hive Suite, Spark, Oozie Workflow
Stack Calls: (Hive Server2, Yarn, ATS, HDFS), (Livy, Yarn, HDFS), (Pig, Hive, Yarn, HDFS)
Errors: E1, E2, E3; E1, E2, E3, E4, E5, E6; … En
Test Case Features = {name, suite_name, start_time, end_time, status}
Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
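The two feature sets can be sketched as records, with the test execution window used to attach errors to a test case. This is a minimal sketch under stated assumptions: the field names come from the slide, while the class names, the `errors_during` helper, the field types, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CaseFeatures:
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass(frozen=True)
class ErrorFeatures:
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str      # component that logged the error
    category: str
    file_name: str


def errors_during(case, errors):
    """Keep errors whose occurrence_time falls inside the test's execution window."""
    return [e for e in errors
            if case.start_time <= e.occurrence_time <= case.end_time]


# Hypothetical sample data, for illustration only.
start = datetime(2017, 1, 1, 10, 0, 0)
case = CaseFeatures("test_a", "hive", start, start + timedelta(minutes=5), "FAILED")
inside = ErrorFeatures("trace-1", "msg", start + timedelta(minutes=1), "Yarn", "unknown", "yarn.log")
outside = ErrorFeatures("trace-2", "msg", start + timedelta(minutes=9), "HDFS", "unknown", "hdfs.log")
attributed = errors_during(case, [inside, outside])
```

Only the error logged inside the five-minute window is attributed to the test case; the later one is left out.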
Salient Points: Failure Sample & Error Samples
Test Case Failures
System Interactions
[Diagram labels: Ensemble Modeling & Learning; Customer Reports; Data Pipeline; Source Code; Historical Error DB; Ticket Systems; Recommendations; Automated Actions; Metadata Store]
Application Architecture
Ingestion
– Log Accumulation/release branch
– Grok parsers for HDP/Ambari components
Unsupervised Learning
– Identical Match (Stacktrace)
– Nearest Match (Levenshtein Adaptation)
– RCA/Associative Analysis
– Error Hierarchy Association
Outcome
– Automated Ticket Processing
– Recommendation Based on Recency
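The identical-match and nearest-match steps can be sketched as a lookup against a repository of known stack traces. The slide does not say how Levenshtein distance was adapted, so this sketch uses the plain edit distance, normalized by length. The 0.8 threshold, the function names, and the sample traces are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, falling towards 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_error(stacktrace, known, threshold=0.8):
    """Return (error_id, kind), where kind is 'identical', 'nearest' or 'new'."""
    if stacktrace in known:
        return known[stacktrace], "identical"
    best_id, best_score = None, 0.0
    for known_trace, error_id in known.items():
        score = similarity(stacktrace, known_trace)
        if score > best_score:
            best_id, best_score = error_id, score
    if best_score >= threshold:
        return best_id, "nearest"
    return None, "new"


# Hypothetical repository entry, for illustration only.
known = {"java.io.IOException: Connection refused at host-1": "E1"}
```

A trace that differs only in the host name would fall back to the nearest match, while an unrelated message would be reported as new. Comparing against every known trace is quadratic in practice, so a real system would likely bucket traces first.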
Deployment Architecture
Layers: Data Source, Data Processing, Application
– Test Clusters
– Log daemons: push logs into HDFS; trigger analysis at end of run
– Storage: HDFS, MetaData Store
– Livy (Job Server)
– Spark Jobs
– Split Processor
– Web Application: manual input for selection/rejection of outcome
RCA Versus Error Graph Creation
• Error Graph Creation Failed
– The FP-Growth algorithm did not yield the desired results
– Too many closed loops and cyclic dependencies
– Time as a split dimension was not enough
• Moved towards RCAs
– Origin of the error chain was easier to find
– Accuracy was higher
– Enough data supporting multiple code-flows
• Easier to validate through our system analysts
– Unsupervised learning is hard to validate without manual intervention
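One way to read the "closed loops" problem is that frequent co-occurrence gives rules in both directions, so a graph built from them has no clear origin. This is an interpretation, not something the slides confirm. The sketch below is not FP-Growth: it counts itemsets by brute-force enumeration, which finds the same frequent itemsets on a small input without building an FP-tree. The transactions reuse the four test case sets from the Algebra slide; the thresholds are arbitrary.

```python
from collections import Counter
from itertools import combinations

# One transaction per test case: the set of errors seen during it.
transactions = [
    {"E1", "E2", "E4"},  # TC1
    {"E1", "E3"},        # TC2
    {"E3", "E4"},        # TC3
    {"E1", "E4"},        # TC4
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen min_support times."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def pair_rules(freq, min_confidence):
    """Directed rules A -> B with confidence = support(A, B) / support(A)."""
    found = []
    for itemset, support in freq.items():
        if len(itemset) != 2:
            continue
        a, b = itemset
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / freq[(lhs,)]
            if confidence >= min_confidence:
                found.append((lhs, rhs, confidence))
    return found


freq = frequent_itemsets(transactions, min_support=2)
rules = pair_rules(freq, min_confidence=0.6)
```

With these inputs both E1 -> E4 and E4 -> E1 pass the confidence threshold, so the edge between them forms a two-node cycle and neither error stands out as the origin.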
RCA Rejections
• False positives are very prevalent
– Some exceptions dominate because their code paths execute frequently
– They are repetitive and need to be ignored, using a statistical cutoff based on decile values
• Priority versus Ignored versus Historical
– Historical RCAs, based on source code changes and recency, allow the final decision
– If corresponding tickets are open, then those issues take priority
• Common Exceptions or Common RCAs
– Prioritize the ones that are causing cross-component failures
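The rejection and prioritization rules can be sketched as a decile cutoff followed by a sort. The slides do not give the actual cutoff or ordering, so everything specific here is an assumption: the top-decile threshold, the dictionary keys, the tie-break order (open ticket, then cross-component reach, then recency), and the sample counts.

```python
from statistics import quantiles


def dominating_errors(occurrences, decile=9):
    """Error ids whose occurrence count exceeds the chosen decile cut point.

    quantiles(..., n=10) returns 9 cut points and needs at least 2 values.
    """
    cut_points = quantiles(occurrences.values(), n=10)
    threshold = cut_points[decile - 1]
    return {error for error, count in occurrences.items() if count > threshold}


def rank_candidates(candidates):
    """Open tickets first, then wider cross-component reach, then most recent."""
    return sorted(
        candidates,
        key=lambda c: (not c["ticket_open"],
                       -c["components_affected"],
                       c["days_since_last_seen"]),
    )


# Hypothetical occurrence counts: E10 is logged on a very hot code path.
occurrences = {"E1": 5, "E2": 3, "E3": 4, "E4": 6, "E5": 2,
               "E6": 7, "E7": 1, "E8": 8, "E9": 9, "E10": 5000}
ignored = dominating_errors(occurrences)

candidates = [
    {"id": "E3", "ticket_open": False, "components_affected": 3, "days_since_last_seen": 1},
    {"id": "E1", "ticket_open": True, "components_affected": 1, "days_since_last_seen": 10},
    {"id": "E4", "ticket_open": False, "components_affected": 1, "days_since_last_seen": 0},
]
ranked = [c["id"] for c in rank_candidates(candidates)]
```

Errors in `ignored` would be dropped from RCA candidates before ranking, which keeps a single noisy exception from crowding out rarer, more informative ones.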
RCA Graph
Quick Stats
Item                                          Data
Total Run IDs analyzed                        14410
Total Splits across components                115 K
Raw errors parsed from logs                   120 M
Unique errors                                 45025
Total Test Case failures                      170 K
Errors related to failed Test Cases           592 K
Unique errors related to failed Test Cases    30570
Adoption Challenges
• Great for fast-changing code bases
– Individual component owners have reported up to 99% accuracy
– Multi-component use case scenarios need improvement
• Log collection required multiple iterations
– Order of logs being written and collected
– Central log server issues
• Stable releases are harder to instrument
– Our internal team has been unable to use it
– Source code changes are minimal, so recency parameters are harder to provide
• Unsupervised learning verification is harder
– Very hard to judge the performance of models effectively without manual intervention
Future Work
• Unsupervised learning validation using automated techniques
• Online processing using Spark Streaming
• Event-based error detection on live production clusters
• Correlation with other log events/customer use cases
Thank You
Rohit Choudhary & Gaurav Nagar


Mool - Automated Log Analysis using Data Science and ML


Editor's Notes

  • #2: TALK TRACK Mool is the application th [NEXT SLIDE]