3. Most popular distributed system concepts
• Distributed data processing algorithms (Spark, MapReduce): algorithms designed to solve computational problems by dividing the work across multiple nodes in a network. Example: processing large data sets in parallel across a cluster of computers (a word-count sketch follows this list).
• Distributed consensus algorithms (Paxos, Raft): used for reaching agreement among the nodes of a distributed system. Raft is a consensus algorithm for managing a replicated log.
• Distributed lock management (ZooKeeper, Etcd) — used for
coordinating access to shared resources in a distributed system.
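To make the data-processing idea concrete, here is a minimal sketch of the map/shuffle/reduce pattern in plain Python, with no framework involved; the sample documents and variable names are made up for illustration.

```python
from collections import defaultdict

# Toy MapReduce-style word count. The input documents are made-up examples;
# in a real framework the map and reduce phases would run on different nodes.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each document is turned into (word, 1) pairs independently,
# so this step can run in parallel across the cluster.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: pairs are grouped by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each group is collapsed into a single (word, total) result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```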
4. Leader election (Bully Algorithm, Ring Algorithm)
• In the Bully Algorithm, a process that detects the failure of the coordinator sends an election message to every process with a higher ID than its own. Any live higher-ID process responds and takes over the election; if no higher-ID process responds, the initiating process declares itself the new coordinator and announces this to all other processes. The highest-ID live process therefore always ends up as coordinator.
• The Ring Algorithm commences when any process within the ring detects the failure of the
current coordinator. Upon detection, the initiating process prepares an election message
comprising its own process number and transmits it to its immediate successor in the ring. If the
successor process is also deemed non-functional, the initiating process bypasses it and forwards
the message to the subsequent process in the ring. This cycle continues until the message
circulates back to the initiating process, accumulating the process numbers of all functional
processes encountered along the way. Subsequently, the process with the highest ID among
those listed in the message is elected as the new coordinator.
• Following the election, the initiating process disseminates another message throughout the ring, informing all processes of the newly elected coordinator. This ensures a decentralized and resilient election of a coordinator in a distributed system organized in a ring topology. A minimal simulation of this election is sketched below.
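A minimal, single-process simulation of the Ring Algorithm described above; the ring layout and the set of live processes are made-up example inputs, and a real implementation would exchange network messages instead of walking a list.

```python
# Toy simulation of the Ring Algorithm. The ring order and the set of live
# processes are made-up inputs; message passing is replaced by a list walk.
def ring_election(ring, alive, initiator):
    """ring: process IDs in ring order; alive: live IDs; initiator: starts the election."""
    n = len(ring)
    collected = [initiator]                 # the election message accumulates live IDs
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:             # circulate until the message returns
        if ring[i] in alive:                # failed successors are simply skipped
            collected.append(ring[i])
        i = (i + 1) % n
    # Highest ID among the collected live processes becomes coordinator; in a real
    # system the initiator would now send a COORDINATOR message around the ring.
    return max(collected)

ring = [3, 7, 1, 9, 4]
print(ring_election(ring, alive={3, 7, 1, 4}, initiator=3))  # 7 (process 9 has failed)
```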
5. Distributed Storage Concepts
• Distributed Hash Table (DHT): used for storing key-value pairs in a distributed system. Example: Chord, a DHT protocol that maps keys to nodes in a decentralized network (a toy key-placement sketch follows this list).
• Distributed File Systems (HDFS, GlusterFS ) — used for storing and managing large amounts
of data across a cluster of computers.
• Distributed databases (Cassandra, MongoDB) — used for storing and retrieving data across
a distributed system.
• Distributed caching (Memcached, Redis) — used for improving the performance of
applications by caching frequently accessed data in memory across a cluster of machines.
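As an illustration of the DHT idea, here is a toy sketch of Chord-style key placement on a hash ring: each key is stored on the first node clockwise from its hash. The node names, the choice of MD5, and the ring size are illustrative assumptions, not part of the Chord specification.

```python
import hashlib
from bisect import bisect_right

# Toy DHT key placement on a hash ring (Chord-like successor rule).
# Node names, the hash function, and the ring size are illustrative choices.
RING_SIZE = 2 ** 32

def ring_hash(value: str) -> int:
    """Hash a string onto the identifier ring [0, RING_SIZE)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_hash(name), name) for name in nodes)
points = [point for point, _ in ring]

def lookup(key: str) -> str:
    """A key lives on the first node clockwise from its hash (its successor)."""
    idx = bisect_right(points, ring_hash(key)) % len(ring)
    return ring[idx][1]

for key in ["user:42", "session:abc", "cart:7"]:
    print(key, "->", lookup(key))
```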
6. Others:
• Distributed tracing (Zipkin, Jaeger) — used for monitoring and debugging
distributed systems by tracking requests as they flow through the system. Jaeger
traces requests as they propagate through a distributed system and collects data
about the request latency, the components involved in processing the request,
and any errors that may have occurred. The collected data is then presented in a
user-friendly graphical interface that provides a clear and comprehensive view of
the request flow.
• Distributed task scheduling (Apache Mesos, Kubernetes) — used for scheduling
and managing tasks in a distributed system.
• Gossip Protocol: a protocol for efficiently spreading information in a large network, in which each node periodically passes on what it knows to a few randomly chosen peers (a toy simulation follows this list).
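As a rough illustration of gossip, the sketch below simulates push-style rumor spreading: every round, each informed node forwards the information to a few random peers. The node count and fanout are made-up parameters.

```python
import random

# Toy push-gossip simulation. NUM_NODES and FANOUT are made-up parameters.
NUM_NODES, FANOUT = 100, 3
informed = {0}                    # node 0 starts with the information

rounds = 0
while len(informed) < NUM_NODES:
    rounds += 1
    newly_informed = set()
    for node in informed:         # every informed node gossips to FANOUT random peers
        newly_informed.update(random.sample(range(NUM_NODES), FANOUT))
    informed |= newly_informed

print(f"all {NUM_NODES} nodes informed after {rounds} gossip rounds")
```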
7. Spark Architecture
• Spark follows a master-slave architecture: its cluster consists of a single master and multiple slaves.
• The Spark architecture is built on two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
8. Resilient Distributed Datasets (RDD)
• Resilient Distributed Datasets are groups of data items that can be stored in memory on the worker nodes (a minimal PySpark sketch follows this slide). Here,
• Resilient: the data can be restored on failure.
• Distributed: the data is distributed among different nodes.
• Dataset: a group of data items.
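A minimal PySpark sketch of the three properties above, assuming a local Spark installation; the numbers and partition count are arbitrary.

```python
from pyspark import SparkContext

# Minimal RDD sketch (assumes a local PySpark installation).
sc = SparkContext("local[*]", "rdd-demo")

# Dataset: a group of data items, split into partitions across the workers (Distributed).
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Resilient: the recorded lineage (parallelize -> map) lets Spark rebuild lost partitions.
squares = numbers.map(lambda x: x * x).cache()    # keep the partitions in memory

print(squares.reduce(lambda a, b: a + b))         # sum of squares, computed in parallel
sc.stop()
```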
9. Directed Acyclic Graph (DAG)
• A Directed Acyclic Graph is a finite directed graph that represents a sequence of computations on the data. Each node is an RDD partition, and each edge is a transformation applied on top of the data.
• Here, the graph describes how the computations are connected, directed means every edge points from an input RDD to the RDD it produces, and acyclic means the sequence of transformations never loops back on itself (a minimal sketch follows this slide).
• Let's understand the Spark architecture.
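Before moving on to the architecture components, here is a minimal PySpark sketch of how a DAG comes about: chained transformations only record lineage, and the whole graph runs when an action is called. It assumes a local Spark installation; the sample data is made up.

```python
from pyspark import SparkContext

# DAG sketch: transformations only record lineage; an action triggers execution.
sc = SparkContext("local[*]", "dag-demo")

lines = sc.parallelize(["spark builds a dag", "of lazy transformations"])

# Each transformation adds nodes and edges to the DAG; nothing executes yet.
words = lines.flatMap(lambda line: line.split())
longer = words.map(len).filter(lambda n: n > 3)

debug = longer.toDebugString()                       # the recorded lineage, i.e. the DAG
print(debug.decode() if isinstance(debug, bytes) else debug)
print(longer.collect())                              # the action: Spark now runs the DAG as stages
sc.stop()
```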
11. Driver Program
• The Driver Program is a process that runs the main() function of the application and creates the SparkContext object. The purpose of SparkContext is to coordinate the Spark applications, which run as independent sets of processes on a cluster.
• To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• Then, it sends your application code to the executors. Here, the
application code can be defined by JAR or Python files passed to the
SparkContext.
• Finally, the SparkContext sends tasks to the executors to run (a driver-side sketch follows).
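A driver-side sketch of the steps above, assuming a local master URL as a placeholder; on a real cluster the master would point at the chosen cluster manager.

```python
from pyspark import SparkConf, SparkContext

# Driver Program sketch: main() creates the SparkContext, which connects to a
# cluster manager, acquires executors, ships this code to them, and sends tasks.
def main():
    conf = SparkConf().setAppName("driver-demo").setMaster("local[*]")  # placeholder master
    sc = SparkContext(conf=conf)

    # The lambda below is part of the application code shipped to the executors.
    print(sc.parallelize(range(10)).map(lambda x: x * 2).collect())

    sc.stop()

if __name__ == "__main__":
    main()
```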
12. Cluster Manager
• The role of the cluster manager is to allocate resources across applications. Spark can run on top of several different cluster managers.
• The supported cluster managers include Hadoop YARN, Apache Mesos, and the Standalone Scheduler.
• Here, the Standalone Scheduler is Spark's own built-in cluster manager, which makes it possible to run Spark on an otherwise empty set of machines (the master URL choices are sketched below).
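The choice of cluster manager is expressed through the master URL. The sketch below lists the usual forms; host names and ports are placeholders.

```python
from pyspark import SparkConf

# The master URL tells SparkContext which cluster manager to use.
# Host names and ports are placeholders, not real endpoints.
conf = SparkConf().setAppName("cluster-manager-demo")

conf.setMaster("local[*]")                    # no cluster manager: threads inside one process
# conf.setMaster("spark://master-host:7077")  # Standalone Scheduler
# conf.setMaster("yarn")                      # Hadoop YARN (reads HADOOP_CONF_DIR for the cluster)
# conf.setMaster("mesos://mesos-host:5050")   # Apache Mesos
```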
13. • Worker Node
▫ The worker node is a slave node
▫ Its role is to run the application code in the cluster.
• Executor
▫ An executor is a process launched for an application on a worker node.
▫ It runs tasks and keeps data in memory or disk storage across them.
▫ It reads and writes data to and from external sources.
▫ Every application has its own executors.
• Task
▫ A unit of work that will be sent to one executor (a small sketch follows this slide).
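A small sketch of how work maps onto executors and tasks, assuming a local Spark installation; the partition count and executor memory setting are arbitrary examples.

```python
from pyspark import SparkConf, SparkContext

# Sketch of the worker/executor/task relationship. Settings are example values.
conf = (SparkConf().setAppName("executor-demo").setMaster("local[4]")
        .set("spark.executor.memory", "1g"))     # memory requested per executor
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100), numSlices=8)    # 8 partitions
# Each partition becomes one task, and tasks are sent to the executors to run.
print(rdd.getNumPartitions())                    # 8 tasks per stage for this RDD
print(rdd.map(lambda x: x + 1).count())          # 100
sc.stop()
```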
16. 1. Spark Core
• The Spark Core is the heart of Spark and performs the core functionality.
• It holds the components for task scheduling, fault recovery, interacting
with storage systems and memory management.
17. 2. Spark SQL
• Spark SQL is built on top of Spark Core. It provides support for structured data.
• It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
• It provides JDBC and ODBC connectivity, so that existing databases, data warehouses, and business intelligence tools can connect to it.
• It also supports various data sources such as Hive tables, Parquet, and JSON (a minimal sketch follows this slide).
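A minimal Spark SQL sketch, assuming a local Spark installation; the table name, columns, and rows are made up.

```python
from pyspark.sql import SparkSession

# Minimal Spark SQL sketch with a made-up in-memory table.
spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

df.createOrReplaceTempView("people")                  # expose the DataFrame to SQL queries
spark.sql("SELECT name FROM people WHERE age > 30").show()

# Other structured sources are read the same way, e.g.:
# spark.read.json("people.json") or spark.read.parquet("people.parquet")
spark.stop()
```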
18. 3. Spark Streaming
• Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform streaming
analytics.
• It accepts data in mini-batches and performs RDD transformations on that
data.
• Its design ensures that the applications written for streaming data can be
reused to analyze batches of historical data with little modification.
• The log files generated by web servers are a real-time example of such a data stream (a minimal DStream sketch follows this slide).
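A minimal DStream sketch of the mini-batch model, assuming a local Spark installation; the socket host and port are placeholders (for a quick test one might feed them with `nc -lk 9999`).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Word counts over 5-second mini-batches read from a placeholder socket source.
sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=5)           # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)       # the incoming data stream
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))       # RDD-style transformations per batch
counts.pprint()

ssc.start()
ssc.awaitTermination()
```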
19. 4. MLlib
• MLlib is a machine learning library that contains various machine learning algorithms (a minimal sketch follows this slide).
• These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
• It is nine times faster than the disk-based implementation used by
Apache Mahout.
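A minimal MLlib sketch using the DataFrame-based API, assuming a local Spark installation; the tiny training set is made up.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

# Logistic regression on a tiny made-up dataset, trained through MLlib.
spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([2.2, 0.9]))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)          # fitted in parallel on the cluster
model.transform(train).select("label", "prediction").show()
spark.stop()
```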
20. 5. GraphX
• GraphX is a library for manipulating graphs and performing graph-parallel computations.
• It makes it possible to create a directed graph with arbitrary properties attached to each vertex and edge.
• To manipulate graphs, it provides fundamental operators such as subgraph, joinVertices, and aggregateMessages (a rough PySpark analogue is sketched below).
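GraphX itself exposes a Scala/Java API rather than a Python one, so the sketch below is only a rough PySpark analogue of its aggregateMessages operator: it sends a message of 1 along each edge and sums the messages at every vertex, which yields vertex degrees. The edge list is made up.

```python
from pyspark import SparkContext

# Rough analogue of GraphX's aggregateMessages, written with plain RDDs.
sc = SparkContext("local[*]", "graph-demo")

edges = sc.parallelize([(1, 2), (2, 3), (3, 1), (1, 3)])       # made-up (src, dst) pairs

messages = edges.flatMap(lambda e: [(e[0], 1), (e[1], 1)])     # send a message to both endpoints
degrees = messages.reduceByKey(lambda a, b: a + b)             # aggregate messages per vertex

print(sorted(degrees.collect()))   # [(1, 3), (2, 2), (3, 3)]
sc.stop()
```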