RV College of Engineering
Go, change the world
Improving Efficiency of Machine Learning Algorithms
Using HPCC Systems Platform
Dr. G. Shobha
Professor, CSE Department
RV College of Engineering, Bengaluru - 59
PRESENTATION CONTENTS
Introduction and Motivation
HPCC Systems Architecture
Parallel DBSCAN Algorithm
Experimental Results & Conclusions
Introduction and Motivation
Key Factors of Machine Learning
1. Large Data Sets
• Millions of labelled images, thousands of hours of speech
2. Improved Models and Algorithms
• Deep Neural Networks: hundreds of layers, millions of parameters
3. Efficient Computation for Machine Learning
• Computational power for ML has increased by ~100x since 2010
• Gains (GPU, CPU) are almost stagnant in the latest generations
• Computation times remain very long regardless (days to weeks to months)
Go-to Solution: Distribute Machine Learning Applications to Multiple Processors and Nodes
Machine Learning in One Node
Distributed Machine Learning
Parallel Processing Architectures for Distributed Machine Learning
1. MapReduce
• Examples: Hadoop, Spark, DataTorrent
• Limitations of Hadoop
2. Data Flow
• Example: HPCC Systems
Go-to Solution: HPCC Systems Architecture by LexisNexis Risk Solutions
HPCC Systems Architecture
THOR:
• Data refinery engine
• Gives the user control over data transformations
• Facilitates optimal operational capacity on mixed-schema data
ROXIE:
• Search engine
• Speeds up real-time queries through interfaces such as REST, SOAP and XML
• Reduces the latency associated with querying
ECL (Enterprise Control Language):
• High-level language for parallel data processing
• Dataflow architecture
• Implicitly parallel and declarative in nature
• Provides several constructs to simplify parallel compute operations
Advantages of HPCC Systems Architecture for Distributed Machine Learning
• Highly integrated system environment
- Capabilities from raw data processing to high-performance queries and data analysis using a common language
• Optimized cluster approach
- Provides high performance at a much lower system cost than other system alternatives
• Stable and reliable processing environment, proven in production applications for varied organizations over a 15-year period
• Innovative data-centric programming language (ECL)
• High level of fault resilience and capabilities
• Suitable for a wide range of data-intensive applications
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Clusters are dense regions in the data space, separated by regions of lower object density
• A cluster is defined as a maximal set of density-connected points
Major features:
• Discovers clusters of arbitrary shape
• Handles noise
• One scan
• Needs density parameters as a termination condition
Two parameters:
• Eps: maximum radius of the neighborhood
• MinPts: minimum number of points in the Eps-neighborhood of a point
Eps-neighborhood: N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}, where D is the data set
Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if
1) p belongs to N_Eps(q)
2) q satisfies the core point condition: |N_Eps(q)| ≥ MinPts
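The two definitions above can be written out directly. The following is a minimal sketch, assuming Euclidean distance and a small in-memory list of points. The function names and the sample points are illustrative and are not taken from the presentation or from the HPCC Systems code.

```python
import math


def eps_neighborhood(points, p, eps):
    """Indices q with dist(points[p], points[q]) <= eps (p itself is included)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]


def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts


def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in N_Eps(q) and q is a core point."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)


# Hypothetical sample data: a tight group, one point on its edge, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.25, 0.0), (1.0, 1.0)]

print(eps_neighborhood(points, 1, eps=0.2))
print(directly_density_reachable(points, 3, 1, eps=0.2, min_pts=3))
print(directly_density_reachable(points, 1, 3, eps=0.2, min_pts=3))
```

With these sample values, point 3 is directly density-reachable from point 1, but not the other way round, because point 3 has too few neighbors to be a core point. The relation is therefore not symmetric.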
DBSCAN
• Computationally inefficient when applied to large amounts of data, especially on big data platforms
Sequential DBSCAN Algorithm
Drawback: Computationally inefficient when applied to large amounts of data, especially on big data platforms
Go-to Solution: Parallel DBSCAN Algorithm on HPCC Systems Big Data Platform
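For reference, a sequential DBSCAN can be sketched as below. This is the generic textbook form, not the code used in the experiments. It assumes Euclidean distance and a linear-scan neighborhood query with no spatial index, and under that assumption the number of distance computations grows quadratically with the number of points, which is one way to read the drawback stated above. The names `dbscan`, `region_query` and `NOISE`, and the sample points, are illustrative.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def region_query(i):
        # Linear scan: O(n) per query, so O(n^2) distance computations overall.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Hypothetical sample data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
          (5.0, 5.0)]
labels = dbscan(points, eps=0.2, min_pts=3)
print(labels)
```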
Processor Specification for Each Node
Specification     Value
Architecture      x86_64
CPU op-mode(s)    32-bit, 64-bit
Byte Order        Little Endian
Model Name        Intel Xeon
CPU GHz           2.4
Core(s)           6
RAM               6 GB
Hard Disk         128 GB
Data Set: Frogs, MFCC
Dimension: 20
Parallel DBSCAN Algorithm on HPCC Systems Platform
1. Spraying the Data
• The Thor engine assigns each data point a globally unique ID and distributes the points evenly across the nodes in the cluster
• Each local node then sorts its data points by their unique IDs
• The data is sent to the local clustering stage
2. Local Clustering
• The DBSCAN algorithm is executed on each local node in the HPCC cluster, using two operations:
- Union: merges trees so that the final cluster is represented by its highest core point
- Find: identifies the parent, i.e., the highest core point, for each point (node) in the tree
3. Global Merge
• Trees are merged to form global clusters, since a point can belong to more than one tree on different nodes
• The final clusters are obtained, each represented by its highest core point across all nodes
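The Union and Find operations and the global merge can be illustrated with a small disjoint-set forest. This is a sketch under stated assumptions, not the HPCC Systems implementation: "highest core point" is taken to mean the largest point ID in a tree, each node's local result is modelled as a parent map, and the class and function names (`HighestIdForest`, `global_merge`) are hypothetical.

```python
class HighestIdForest:
    """Disjoint-set forest whose root is always the largest point ID in its tree."""

    def __init__(self):
        self.parent = {}

    def find(self, p):
        # Register unseen points as single-node trees.
        self.parent.setdefault(p, p)
        root = p
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[p] != root:
            self.parent[p], p = root, self.parent[p]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            low, high = sorted((ra, rb))
            self.parent[low] = high  # the higher ID stays the representative


def global_merge(local_forests):
    """Merge per-node forests; shared point IDs link trees across nodes."""
    merged = HighestIdForest()
    for forest in local_forests:
        for point, parent in forest.parent.items():
            merged.union(point, parent)
    return merged


# Hypothetical local results: point 3 appears in a tree on both nodes.
node_a = HighestIdForest()
node_a.union(1, 2)
node_a.union(2, 3)

node_b = HighestIdForest()
node_b.union(3, 4)
node_b.union(4, 5)
node_b.union(8, 9)

merged = global_merge([node_a, node_b])
clusters = {p: merged.find(p) for p in sorted(merged.parent)}
print(clusters)
```

Because point 3 belongs to a tree on both nodes, the merge joins points 1 to 5 into one global cluster represented by ID 5, while points 8 and 9 remain a separate cluster represented by ID 9.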
Parallel DBSCAN Algorithm on HPCC Systems Big Data Platform
Source code: https://github.com/hpcc-systems/dbscan
Contributors: Yathish & Team
Experimental Results & Conclusions
Size     Eps distance   MinPts in a cluster   Time on single node (s)   Time on two nodes (s)   Time on three nodes (s)
4800     0.2            2                     16.35                     14.5                    15.86
6000     0.3            9                     35.24                     22.246                  23.471
7200     0.3            10                    53.48                     44.426                  45.63
9000     0.35           10                    112.80                    50.57                   53.642
14300    0.4            20                    535.74                    213.922                 203.184
30000    0.4            20                    3924.7                    964.616                 727.33
50000    0.5            30                    24948.6                   5124.3                  3266.462
[Figure: Serial vs Parallel Execution Time. x-axis: Size; y-axis: Execution Time (seconds); series: Serial, Parallel (2 Nodes), Parallel (3 Nodes)]
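The reported timings can be turned into speedup factors, computed as single-node time divided by multi-node time. The sketch below copies the timings from the results table on this slide. The helper name `speedups` is illustrative, and the calculation is not part of the original presentation.

```python
# Timings in seconds from the results table:
# size -> (single node, two nodes, three nodes)
timings = {
    4800: (16.35, 14.5, 15.86),
    6000: (35.24, 22.246, 23.471),
    7200: (53.48, 44.426, 45.63),
    9000: (112.80, 50.57, 53.642),
    14300: (535.74, 213.922, 203.184),
    30000: (3924.7, 964.616, 727.33),
    50000: (24948.6, 5124.3, 3266.462),
}


def speedups(timings):
    """Return size -> (speedup on two nodes, speedup on three nodes)."""
    return {size: (t1 / t2, t1 / t3) for size, (t1, t2, t3) in timings.items()}


result = speedups(timings)
for size, (s2, s3) in result.items():
    print(f"{size:>6}  2 nodes: {s2:5.2f}x  3 nodes: {s3:5.2f}x")
```

Every ratio is above 1, and the ratios grow with data set size. For the four smallest data sets, three nodes are slightly slower than two.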
Conclusions
• The multi-node setup outperforms the single-node setup in all cases
• As the number of data points increases, the parallel algorithm's advantage over its serial counterpart grows
• The HPCC Platform supports cross-platform development in languages such as C++ and Python, which makes it possible to develop applications at a faster pace
• The Thor and Roxie components of the HPCC Platform enable faster data ingestion and data querying across multiple nodes, which makes the platform efficient for implementing machine learning algorithms
• The platform parallelizes sequential algorithms across multiple nodes efficiently
References
• https://researchcollaborations.elsevier.com/en/organisations/httpswwwrvceeduin
• MQTT protocol support for ROXIE, https://github.com/hpcc-systems/mqtt-for-roxie
• Automated Data Skew Profiler, https://github.com/notharsh/DataSkewProfiler
• Extending current ML library with LexisNexis HPCC Systems, https://github.com/lilyclemson/DBSCAN/tree/project
• Image Processing Library in HPCC, https://github.com/TanmayH/HPCC-OPENCV
• Fraud detection in value based cards, https://github.com/aksharprasad/HPCC
• Evaluation of machine learning algorithms, https://github.com/suryanarayanan21/ML_Core
• Interfacing Octave with ECL, https://github.com/Sathvik10/Octave-Plugin
• Continuous integration of Roxie query / data deployments using Jenkins, https://github.com/JUJayashree/jenkin_JOB_xml
Acknowledgements
Prof. Jyothi, Asst. Prof. CSE Dept., RVCE
Vasanth, Instructor, CSE Dept., RVCE
Students of RVCE
1. Jayant Suresh
2. Harsh Mishra
3. Amogh Vardhan Kashi
4. Manjunath Jakkaraddi
5. Shubham Phal
6. Tanmay Hukkeri
7. Yathish H R
8. Akshar Prasad
9. Sathvik K R
10. A Suryanarayanan
Students Currently Working
1. Varsha R Jenni
2. Akhil Dua
3. Atreya Bain
4. Anurag Singh Bhadauria
5. Ambu Karthik
6. Rohit Sachin