Drew Conway’s
Venn Diagram.
Multivariate algorithms in distributed data processing/
computing
Most popular distributed system concepts
• Distributed data processing algorithms (Spark, MapReduce): algorithms that solve
computational problems by dividing the work across multiple nodes in a
network, e.g. processing large data sets in parallel across a cluster of
computers.
• Distributed consensus algorithms (Paxos, Raft): used for achieving
consensus in a distributed system. Raft is a consensus algorithm for
managing a replicated log.
• Distributed lock management (ZooKeeper, etcd): used for
coordinating access to shared resources in a distributed system.
• Leader election (Bully Algorithm, Ring Algorithm): In the Bully Algorithm, a process that
notices the coordinator has failed sends an election message to every process with a higher
priority (ID). If none of them answers, the initiator becomes the coordinator and announces
this to all other processes. If a higher-priority process answers, the initiator steps down
and the responder runs its own election, so the live process with the highest priority ends
up as the new coordinator.
• The Ring Algorithm commences when any process within the ring detects the failure of the
current coordinator. Upon detection, the initiating process prepares an election message
comprising its own process number and transmits it to its immediate successor in the ring. If the
successor process is also deemed non-functional, the initiating process bypasses it and forwards
the message to the subsequent process in the ring. This cycle continues until the message
circulates back to the initiating process, accumulating the process numbers of all functional
processes encountered along the way. Subsequently, the process with the highest ID among
those listed in the message is elected as the new coordinator.
• Following the election, the initiating process disseminates another message throughout the ring,
informing all processes of the newly elected coordinator. This process ensures the decentralized
and resilient election of a coordinator within the distributed system organized in a ring topology.
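The two election schemes described above can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not a networked implementation: the function names, the `alive` dictionary and the list-based ring are all hypothetical, and the bully version follows the usual textbook rule that only higher-ID processes are challenged.

```python
from typing import Dict, List, Optional


def ring_election(ring: List[int], alive: Dict[int, bool], initiator: int) -> Optional[int]:
    """Pass an election message once around the ring and return the winner."""
    if not alive.get(initiator, False):
        return None  # a crashed process cannot start an election
    start = ring.index(initiator)
    message = [initiator]  # the message accumulates the IDs of live processes
    for step in range(1, len(ring)):
        successor = ring[(start + step) % len(ring)]
        if alive.get(successor, False):
            message.append(successor)  # a live successor adds its own ID
        # a dead successor is bypassed and the message moves on
    return max(message)  # highest collected ID becomes coordinator


def bully_election(ids: List[int], alive: Dict[int, bool], initiator: int) -> int:
    """The initiator wins only if no live process with a higher ID answers."""
    higher_alive = [p for p in ids if p > initiator and alive.get(p, False)]
    if not higher_alive:
        return initiator  # nobody answered, so the initiator takes over
    # a responder with a higher ID takes over and runs its own election
    return bully_election(ids, alive, min(higher_alive))


ring = [1, 2, 3, 4, 5]
alive = {1: True, 2: True, 3: True, 4: True, 5: False}  # old coordinator 5 has failed
ring_winner = ring_election(ring, alive, initiator=2)
bully_winner = bully_election(ring, alive, initiator=2)
```

In both cases the live process with the highest ID (4 here) ends up as coordinator; the schemes differ in how many messages they need and in who talks to whom.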
Distributed Storage Concepts
• Distributed Hash Table (DHT): used for storing key-value pairs in a distributed system. Example:
Chord, a DHT protocol that maps keys to nodes in a
decentralized network.
• Distributed File Systems (HDFS, GlusterFS): used for storing and managing large amounts
of data across a cluster of computers.
• Distributed databases (Cassandra, MongoDB): used for storing and retrieving data across
a distributed system.
• Distributed caching (Memcached, Redis): used for improving the performance of
applications by caching frequently accessed data in memory across a cluster of machines.
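To make the "map keys to nodes" idea behind a DHT concrete, here is a minimal sketch of the successor rule used by Chord-style rings: a key belongs to the first node whose identifier is equal to or follows the key's identifier, wrapping around at the end. The helper names, the 16-bit identifier space and the node names are assumptions for illustration; a real Chord node does not hold the whole ring and routes lookups through finger tables instead.

```python
import hashlib
from bisect import bisect_left
from typing import List, Tuple

RING_BITS = 16  # small identifier space, chosen only for readability


def ring_id(name: str) -> int:
    """Hash a node name or a key onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def build_ring(nodes: List[str]) -> List[Tuple[int, str]]:
    """Sort nodes by their position on the ring."""
    return sorted((ring_id(node), node) for node in nodes)


def lookup(ring: List[Tuple[int, str]], key: str) -> str:
    """Return the first node at or after the key's position, wrapping around."""
    ids = [node_id for node_id, _ in ring]
    index = bisect_left(ids, ring_id(key))
    return ring[index % len(ring)][1]


nodes = ["node-a", "node-b", "node-c"]
ring = build_ring(nodes)
owner = lookup(ring, "user:42")
```

A useful property of this layout is that adding a node only moves the keys that now fall to the new node; every other key keeps its owner.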
Others:
• Distributed tracing (Zipkin, Jaeger) — used for monitoring and debugging
distributed systems by tracking requests as they flow through the system. Jaeger
traces requests as they propagate through a distributed system and collects data
about the request latency, the components involved in processing the request,
and any errors that may have occurred. The collected data is then presented in a
user-friendly graphical interface that provides a clear and comprehensive view of
the request flow.
• Distributed task scheduling (Apache Mesos, Kubernetes) — used for scheduling
and managing tasks in a distributed system.
• Gossip Protocol — a protocol for efficiently spreading information in a large
network.
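As a rough illustration of why gossip spreads information efficiently, the sketch below simulates a simple push-style variant: in every round each informed node forwards the rumor to a few randomly chosen peers. The function name, the `fanout` parameter and the fixed random seed are assumptions made for this example; real gossip protocols differ in peer selection, message content and stopping rules.

```python
import random
from typing import Set


def gossip_rounds(num_nodes: int, fanout: int = 2, seed: int = 0) -> int:
    """Simulate push gossip and return the rounds needed to inform every node."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    informed: Set[int] = {0}   # node 0 starts with the rumor
    rounds = 0
    while len(informed) < num_nodes:
        newly_informed: Set[int] = set()
        for _node in informed:
            # each informed node pushes the rumor to `fanout` random peers
            newly_informed.update(rng.sample(range(num_nodes), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


rounds_needed = gossip_rounds(64)
```

Because the informed set can grow by a constant factor each round, the number of rounds grows far more slowly than the number of nodes.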
Spark Architecture
• Spark follows a master-slave architecture: a cluster consists of a
single master and multiple slaves (workers).
• The Spark architecture depends on two abstractions:
▫ Resilient Distributed Dataset (RDD)
▫ Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
• A Resilient Distributed Dataset is a collection of data items, split into
partitions, that can be stored in memory on worker nodes. Here,
• Resilient: the data can be restored on failure.
• Distributed: the data is distributed among different nodes.
• Dataset: a group of data.
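The "resilient" part can be illustrated with a toy model in plain Python. This is not the Spark API: `ToyRDD` and its methods are hypothetical names, and the in-object cache merely stands in for data held in memory on a worker. The sketch shows the idea that a derived dataset remembers how it was computed, so a lost partition can be rebuilt from its parent instead of being restored from a copy.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """A partitioned dataset that remembers how each partition was derived."""

    def __init__(self, partitions: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable] = None) -> None:
        self._partitions = partitions      # source data (None for derived datasets)
        self._parent = parent              # where this dataset came from
        self._fn = fn                      # transformation applied to the parent
        self._cache: Dict[int, list] = {}  # stands in for in-memory storage

    @classmethod
    def parallelize(cls, data: list, num_partitions: int) -> "ToyRDD":
        """Distribute the data round-robin over the partitions."""
        return cls(partitions=[list(data[i::num_partitions]) for i in range(num_partitions)])

    def map(self, fn: Callable) -> "ToyRDD":
        """Record the transformation; nothing is computed yet."""
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._partitions)
        return self._parent.num_partitions()

    def compute(self, index: int) -> list:
        """Return one partition, recomputing it from the parent if needed."""
        if index not in self._cache:
            if self._parent is None:
                self._cache[index] = self._partitions[index]
            else:
                self._cache[index] = [self._fn(x) for x in self._parent.compute(index)]
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        """Simulate a failure that wipes one in-memory partition."""
        self._cache.pop(index, None)

    def collect(self) -> list:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD.parallelize([0, 1, 2, 3, 4, 5], num_partitions=3)
squared = base.map(lambda x: x * x)
first_result = squared.collect()
squared.lose_partition(1)          # pretend the worker holding partition 1 died
recovered = squared.compute(1)     # rebuilt from the parent partition
```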
Directed Acyclic Graph (DAG)
• A Directed Acyclic Graph is a finite directed graph that describes a sequence
of computations on data. Each node is an RDD partition, and each edge is
a transformation applied to the data. Here, "graph" refers to the navigation,
whereas "directed" and "acyclic" refer to how it is done: edges point one way,
from input to result, and never loop back.
• Let's understand the Spark architecture.
Driver Program
• The Driver Program is the process that runs the main() function of the
application and creates the SparkContext object. The SparkContext
coordinates Spark applications, which run as
independent sets of processes on a cluster.
• To run on a cluster, the SparkContext connects to one of several types of
cluster manager and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• Then, it sends your application code to the executors. Here, the
application code is defined by JAR or Python files passed to the
SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
Cluster Manager
• The role of the cluster manager is to allocate resources across
applications. Spark can run on several kinds of cluster manager.
• These include Hadoop YARN, Apache Mesos and the Standalone Scheduler.
• The Standalone Scheduler is Spark's own built-in cluster manager,
which makes it possible to install Spark on an empty set of machines.
• Worker Node
▫ The worker node is a slave node.
▫ Its role is to run the application code in the cluster.
• Executor
▫ An executor is a process launched for an application on a worker node.
▫ It runs tasks and keeps data in memory or disk storage across them.
▫ It reads data from and writes data to external sources.
▫ Each application has its own executors.
• Task
▫ A unit of work that is sent to one executor.
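The relationship between driver, executors and tasks can be mimicked on a single machine. The sketch below is an analogy in plain Python, not Spark code: a thread pool plays the role of the executors, `run_job` plays the driver, and each partition of the input becomes one task. All names and the sample data are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_job(partitions: List[list], task_fn: Callable, num_executors: int = 2) -> list:
    """Toy driver: create one task per partition and run them on the 'executors'."""
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # each submitted call is one task, handled by exactly one worker thread
        futures = [executors.submit(task_fn, part) for part in partitions]
        # the driver collects the per-task results in partition order
        return [future.result() for future in futures]


def count_words(lines: List[str]) -> int:
    """The 'application code' shipped to the executors."""
    return sum(len(line.split()) for line in lines)


partitions = [["a b c", "d e"], ["f"], ["g h", "i j k l"]]
partial_counts = run_job(partitions, count_words)
total_words = sum(partial_counts)
```

The driver never processes the data itself; it only hands out tasks and combines what comes back.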
Spark Components
1. Spark Core
• Spark Core is the heart of Spark and provides its core functionality.
• It holds the components for task scheduling, fault recovery, interacting
with storage systems and memory management.
2. Spark SQL
• Spark SQL is built on top of Spark Core. It provides support for
structured data.
• It allows querying the data via SQL (Structured Query Language) as well as the
Apache Hive variant of SQL, called HQL (Hive Query Language).
• It supports JDBC and ODBC connections, so existing databases, data
warehouses and business intelligence tools can work with it.
• It also supports various sources of data like Hive tables, Parquet, and JSON.
3. Spark Streaming
• Spark Streaming is a Spark component that supports scalable and
fault-tolerant processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform streaming
analytics.
• It accepts data in mini-batches and performs RDD transformations on that
data.
• Its design ensures that applications written for streaming data can be
reused to analyze batches of historical data with little modification.
• The log files generated by web servers are a real-world
example of a data stream.
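The mini-batch idea, and the claim that batch logic can be reused on a stream, can be shown with a small pure-Python sketch. It is not Spark Streaming code: the helper names and the log format are hypothetical, and batches are cut by item count here only to keep the example short, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def mini_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut an incoming stream into small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def status_counts(log_lines: List[str]) -> Counter:
    """Batch logic: count HTTP status codes (last field of each log line)."""
    return Counter(line.split()[-1] for line in log_lines)


web_log = ["GET /a 200", "GET /b 404", "GET /a 200", "POST /c 500", "GET /a 200"]

# Streaming use: apply the batch function to each mini-batch and merge the results.
running_totals: Counter = Counter()
for batch in mini_batches(web_log, batch_size=2):
    running_totals.update(status_counts(batch))

# Historical use: the same function applied to the whole log at once.
historical_totals = status_counts(web_log)
```

Both paths produce the same counts, which is the sense in which streaming code can be reused for historical data.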
4. MLlib
• MLlib is a machine learning library that contains various machine
learning algorithms.
• These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
• In benchmarks reported by its developers, it runs about nine times faster
than the disk-based implementation used by Apache Mahout.
5. GraphX
• GraphX is a library used to manipulate graphs and perform
graph-parallel computations.
• It makes it possible to create a directed graph with arbitrary properties
attached to each vertex and edge.
• To manipulate graphs, it supports various fundamental operators such as
subgraph, joinVertices, and aggregateMessages.
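To give a feel for what an operator like aggregateMessages does, here is a plain-Python sketch of the same pattern: every edge may send messages to its endpoints, and messages arriving at the same vertex are merged. This is an analogy, not the GraphX API (which is Scala and works on distributed graphs); the function signature, the tuple-based edges and the sample graph are assumptions.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable, str]  # (source vertex, destination vertex, edge property)


def aggregate_messages(edges: Iterable[Edge],
                       send: Callable[[Hashable, Hashable, str], List[tuple]],
                       merge: Callable) -> Dict[Hashable, object]:
    """Let each edge emit (vertex, message) pairs and merge messages per vertex."""
    inbox: Dict[Hashable, object] = {}
    for src, dst, prop in edges:
        for target, message in send(src, dst, prop):
            if target in inbox:
                inbox[target] = merge(inbox[target], message)
            else:
                inbox[target] = message
    return inbox


edges = [("a", "b", "follows"), ("c", "b", "follows"), ("b", "a", "follows")]

# In-degree: every edge sends the message 1 to its destination; messages are summed.
in_degree = aggregate_messages(
    edges,
    send=lambda src, dst, prop: [(dst, 1)],
    merge=lambda x, y: x + y,
)
```

Vertices that receive no message (here "c") simply do not appear in the result.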

Editor's Notes

  • #1: https://towardsdatascience.com/the-essential-data-science-venn-diagram-35800c3bef40
  • #2: https://medium.com/@vinciabhinav7/commonly-used-distributed-algorithms-8215156f0f18
  • #4: https://www.geeksforgeeks.org/difference-between-ring-and-bully-algorithm/