Understanding Distributed Databases Scalability

Scalability & Performance
Scalability
An overloaded term that has been perverted by technical marketing
The ability of a database to improve performance when adding more resources
Performance
Also an overloaded term, with different meanings depending on the context
See the definition next

Database Performance Metrics
Throughput:
The number of operations per time unit (e.g., transactions per second, operations per second, queries per second)
Response time:
The time from submitting an operation (e.g., transaction, query, individual row operation) until receiving the answer
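As a minimal sketch of the two metrics defined on this slide, the following Python function derives throughput and mean response time from a list of (submit time, completion time) pairs. The function name `summarize` and the sample timestamps are hypothetical, chosen only for illustration; a real benchmark would also look at percentiles, not just the mean.

```python
from typing import List, Tuple


def summarize(operations: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (throughput in ops/s, mean response time in s).

    Each operation is a (submit_time, completion_time) pair in seconds.
    """
    if not operations:
        raise ValueError("at least one operation is required")

    # Response time: from submitting an operation until receiving the answer.
    response_times = [done - start for start, done in operations]

    # Throughput: number of operations divided by the observed time window.
    window = max(done for _, done in operations) - min(start for start, _ in operations)
    if window <= 0:
        raise ValueError("the measurement window must be positive")

    throughput = len(operations) / window
    mean_response_time = sum(response_times) / len(response_times)
    return throughput, mean_response_time


# Hypothetical sample: four operations observed over a two-second window.
sample = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (1.0, 2.0)]
throughput, mean_rt = summarize(sample)
print(f"throughput={throughput} ops/s, mean response time={mean_rt} s")
```

Note that the two metrics are independent: the last two operations overlap in time, which raises throughput without shortening any individual response time.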

Scalability
The ability to deliver better performance by adding more resources

Speed Up
The ability to reduce response time by adding more resources

Vertical vs Horizontal Scalability
Vertical: adding more resources (CPU, memory, and disk) to a centralized database yields more throughput
Horizontal: adding more nodes to a distributed database (in a cluster) yields more throughput

Scalability Factor
Do all databases scale the same?
Can we measure and compare scalabilities?
The scalability factor measures scalability: the scale up factor for vertical scalability and the scale out factor for horizontal scalability

Scalability Factor
The scale out factor is the throughput of a cluster of a given size, normalized to the throughput of a single node
Equivalently, it is the ratio of the throughput of a database with n cluster nodes to the throughput of the same database with one node
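A minimal sketch of the scale out factor, taken here as cluster throughput divided by single-node throughput. The function name `scale_out_factor` and the throughput figures are hypothetical and serve only to illustrate the arithmetic.

```python
def scale_out_factor(cluster_throughput: float, single_node_throughput: float) -> float:
    """Throughput of an n-node cluster normalized to the throughput of one node."""
    if single_node_throughput <= 0:
        raise ValueError("single-node throughput must be positive")
    return cluster_throughput / single_node_throughput


# Hypothetical measurements, in transactions per second.
one_node_tps = 1_000
ten_node_tps = 6_500
print(scale_out_factor(ten_node_tps, one_node_tps))
```

A factor equal to the number of nodes means each node contributes a full node's worth of throughput; a factor below 1 means the cluster is slower than a single node.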

Types of Scalability
What is the optimal scalability?
What is the worst scalability?
Scalability can be logarithmic or linear, but it can also be null or even negative

Types of Scalability
Some databases have negative scalability, as adding more nodes to the system yields
a throughput lower than with a single node
Many databases have sublinear scalability
Often, scalability is null for write workloads and logarithmic for read/write workloads
Linear scalability is the optimal case: with a cluster of n nodes, you get n times the
throughput of a single node
For instance, if a single node delivers 1,000 transactions per second, a cluster of 100
nodes delivers a throughput of 100,000 transactions per second
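The kinds of scalability listed above can be told apart by comparing a measured cluster throughput with the single-node throughput. The sketch below is one possible way to do so; the function name `classify_scalability` and the 5% tolerance are assumptions made for illustration, not thresholds taken from this series.

```python
def classify_scalability(single_node_tps: float, cluster_tps: float,
                         nodes: int, tolerance: float = 0.05) -> str:
    """Label a measurement as negative, null, sublinear, or linear.

    `tolerance` is an assumed margin for measurement noise.
    """
    if nodes < 2:
        raise ValueError("a cluster needs at least two nodes")
    if single_node_tps <= 0:
        raise ValueError("single-node throughput must be positive")

    factor = cluster_tps / single_node_tps  # throughput relative to one node

    if factor < 1 - tolerance:
        return "negative"   # the cluster is slower than a single node
    if factor <= 1 + tolerance:
        return "null"       # extra nodes add no throughput
    if factor >= nodes * (1 - tolerance):
        return "linear"     # about n times the single-node throughput
    return "sublinear"      # some gain, but less than n times


# The example from the slide: 1,000 tps on one node, 100,000 tps on 100 nodes.
print(classify_scalability(1_000, 100_000, 100))
```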

Logarithmic Scale Out
Results from wasting capacity due to redundant work and/or contention
Open source databases such as MariaDB rely on cluster replication (see our blog post on Cluster Replication)
Cluster replication yields logarithmic scalability: since the writes are executed by all nodes, only the read fraction of
the workload can provide scalability
Shared disk databases also have logarithmic scalability: the need for a concurrency control protocol that locks disk
pages to be written results in substantial contention that increases with the cluster size
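To see why cluster replication limits scale out, consider a deliberately simplified capacity model. It is an assumption of this sketch, not a model given in the series: every node has the same capacity, reads and writes cost the same, each read runs on one node, and each write runs on all n nodes. Under those assumptions the scale out factor is n / (1 + w(n - 1)), where w is the write fraction of the workload.

```python
def replicated_scale_out_factor(nodes: int, write_fraction: float) -> float:
    """Scale out factor of a fully replicated cluster under the simplified model.

    Each operation consumes (1 - w) units of work for the read part, done once,
    plus w * n units for the write part, repeated on every node.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    return nodes / (1 + write_fraction * (nodes - 1))


# Hypothetical workloads: the gain flattens quickly as the write share grows.
for w in (0.0, 0.2, 0.5, 1.0):
    factors = [round(replicated_scale_out_factor(n, w), 2) for n in (1, 2, 4, 8, 16)]
    print(f"write fraction {w}: {factors}")
```

In this model a write-only workload gains nothing from extra nodes, a read-only workload scales linearly, and a mixed workload flattens out as nodes are added (the factor never exceeds 1/w), which matches the qualitative picture on this slide even though the exact curve depends on the system.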

Linear Scalability
Key-value stores (see our blog post on NoSQL) typically provide linear scalability because they are very simple and do not address the hard problem of scaling transaction management (the so-called ACID properties)
Very few transactional databases exhibit linear scalability (since this blog series is vendor agnostic, we do not discuss them)

Types of Speed Up
Speed up can also show different behaviors, from null to linear
Linear speed up means that the response time obtained with one node is divided by n with n nodes
Null speed up means, for instance, that a given query always exhibits the same response time, whether it runs on one node or on more
Null speed up happens in a database without a parallel/OLAP query engine (i.e., without intra-query parallelism): with inter-query parallelism only, each node can process a subset of the queries, but each query can only be executed by a single node
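Speed up can be quantified in the same spirit as the scale out factor. The sketch below takes it to be the response time on one node divided by the response time on n nodes, so linear speed up gives n and null speed up gives 1; the function name `speed_up` and the timings are hypothetical.

```python
def speed_up(response_time_one_node: float, response_time_n_nodes: float) -> float:
    """Ratio of the single-node response time to the n-node response time."""
    if response_time_one_node <= 0 or response_time_n_nodes <= 0:
        raise ValueError("response times must be positive")
    return response_time_one_node / response_time_n_nodes


# Hypothetical query taking 8 seconds on one node, measured on a 4-node cluster.
print(speed_up(8.0, 2.0))  # response time divided by 4: linear speed up on 4 nodes
print(speed_up(8.0, 8.0))  # same response time: null speed up
```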

Main Takeaways
1. The two main metrics for measuring the performance of a database are throughput and response time
Throughput measures the number of operations (transactions, queries, inserts) per unit of time
Response time measures how long it takes to execute a particular operation

Main Takeaways
2. Scalability is the ability of the database to handle bigger loads with more resources
In a distributed database, we talk about horizontal scalability, where more resources mean more nodes
In a centralized database, we talk about vertical scalability, where more resources mean more CPU, memory, and disk

Main Takeaways
3. Speed up is related to scalability but is a different concept
It refers to the ability to reduce response time by adding more resources
Again, it can be horizontal for a distributed database or vertical for a centralized database

Main Takeaways
4. Scalability and speed up can be of different kinds
Negative and null scalability are of no interest
Logarithmic scalability is better, but it only pays off with a few nodes and a high proportion of reads
Linear scalability is optimal, since each new node contributes the same amount of additional load that can be handled

References
[Özsu & Valduriez 2020] Tamer Özsu, Patrick Valduriez. Principles of Distributed Database Systems, 4th Edition. Springer, 2020.

Relevant Posts from the Blog
How To Measure Scalability and Performance
Cluster Replication
Shared Nothing Architectures
NoSQL

About
About the authors:
Dr. Ricardo Jimenez-Peris is the CEO and
founder of LeanXcale. Before founding
LeanXcale, he was a researcher in distributed
database systems for over 25 years. He has
published over 100 scientific publications and
has been director of the Distributed Systems
Lab and a university professor of distributed
systems.
Dr. Patrick Valduriez is a researcher at INRIA,
co-author of the book “Principles of Distributed
Database Systems”, which has educated legions
of students and engineers in this field, and, more
recently, Scientific Advisor of LeanXcale.
About this blog series:
This blog series aims at educating database
practitioners in topics commonly not well
understood, often due to false or confusing
marketing messages. The blog provides the
foundations and tools to let the reader
actually evaluate database systems, learn
their real capabilities and be able to compare
the performance of the different alternatives
for their target workloads. The blog is vendor
agnostic and does not mention specific
vendors, although open source databases
are sometimes mentioned to illustrate concepts.
About LeanXcale:
LeanXcale is a startup making a NewSQL
database. Since the blog is vendor
agnostic, we do not talk about LeanXcale
itself. Readers interested in LeanXcale
can visit the LeanXcale website.
