Software and database in data communication network

Fundamental software – Chapter 24
NOSQL Databases and Big Data Storage Systems
Prepared by: Ayoub S. Al Sahafi - Ref. No.: CL186
Supervisor: Assistant Professor Dr. Mahmoud Alkhasawneh
Al-Madinah International University
Faculty of Computer and Information Technology

 24.1 Introduction to NOSQL Systems
 24.1.1 Emergence of NOSQL Systems
 24.1.2 Characteristics of NOSQL Systems
 24.1.3 Categories of NOSQL Systems
 24.2 The CAP Theorem
 24.3 Document-Based NOSQL Systems and MongoDB
 24.3.1 MongoDB Data Model
 24.3.2 MongoDB CRUD Operations
 24.3.3 MongoDB Distributed Systems Characteristics
 24.4 NOSQL Key-Value Stores
 24.4.1 DynamoDB Overview
 24.4.2 Voldemort Key-Value Distributed Data Store
 24.4.3 Examples of Other Key-Value Stores
 24.5 Column-Based or Wide Column NOSQL Systems
 24.5.1 Hbase Data Model and Versioning
 24.5.2 Hbase CRUD Operations
 24.5.3 Hbase Storage and Distributed System Concepts
 24.6 NOSQL Graph Databases and Neo4j
 24.6.1 Neo4j Data Model
 24.6.2 The Cypher Query Language of Neo4j
 24.6.3 Neo4j Interfaces and Distributed System Characteristics
 24.7 Summary

Introduction
 In this chapter we will describe the following:
 An introduction to NOSQL systems, their characteristics, and how they
differ from SQL systems.
 Four general categories of NOSQL systems.
 How NOSQL systems approach the issue of consistency among multiple
replicas (copies) by using the paradigm known as eventual consistency.
 The CAP theorem.
 An overview of each category of NOSQL systems.

Introduction
 Systems have been developed to manage large amounts of data in
organizations such as Google, Amazon, Facebook, and Twitter and in
applications such as social media, Web links, and e-mail.
 NoSQL is an approach to database design that can accommodate a wide
variety of data models, including key-value, document, columnar, and
graph formats.

Introduction
 NoSQL, which stands for "Not only SQL," is an alternative to traditional
relational databases, in which data is placed in tables and the data schema is
carefully designed before the database is built. NoSQL databases are
especially useful for working with large sets of distributed data.
 Most NOSQL systems are distributed databases or distributed storage
systems, with a focus on semistructured data storage, high performance,
availability, data replication, and scalability as opposed to an emphasis on
immediate data consistency, powerful query languages, and structured
data storage.

Terminology
No Term Definition
1 NOSQL: Not Only SQL
2 CAP: Consistency, Availability, Partition tolerance
3 CRUD: Create, Read, Update, Delete
4 BigTable: A compressed, high-performance, proprietary data storage system built on the Google File System
5 DynamoDB: A hosted NoSQL database offered by Amazon Web Services (AWS)
6 Cassandra: A wide-column store based on ideas from BigTable and DynamoDB
7 Availability: The condition wherein a given resource can be accessed by its consumers
8 Scalability: The ability of a database to handle changing demands by adding or removing resources
9 Graph-based: A graph database (GDB) is a database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data
10 Hbase: A column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS)

Terminology
No Term Definition
1 Object databases: A database management system in which information is represented in the form of objects, as used in object-oriented programming
2 XML databases: A data persistence software system that allows data to be specified, and sometimes stored, in XML format
3 Document-based (document-oriented): A document-oriented database, or document store, is a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data
4 Object-relational: An object-relational database (ORD), or object-relational database management system (ORDBMS), is a database management system (DBMS) similar to a relational database, but with an object-oriented database model
5 Hybrid NOSQL systems: Systems that combine relational and NoSQL features

Terminology
No Term Definition
1 JSON: JavaScript Object Notation
2 Range partitioning: A type of relational database partitioning wherein the partition is based on a predefined range for a specific data field, such as uniquely numbered IDs, dates, or simple values like currency
3 Hash partitioning: A partitioning technique where a hash key is used to distribute rows evenly across the different partitions (sub-tables)
4 Primary key: A field in a table that uniquely identifies each row/record in a database table

24.1.1 Emergence of NOSQL Systems
 Many companies and organizations are faced with applications that store
vast amounts of data.
 Millions of web application users submit posts, many with
images and videos.
 User profiles, user relationships, and posts must all be stored in a huge
collection of data stores.
 Some data for this type of application is not suitable for a traditional
relational system and it needs multiple types of databases and data storage
systems.

24.1.1 Emergence of NOSQL Systems
 For this reason, some organizations decided to develop their own
systems:
 Google developed a proprietary NOSQL system known as BigTable.
 Amazon developed a NOSQL system called DynamoDB.
 Facebook developed a NOSQL system called Cassandra.

24.1.2 Characteristics of NOSQL Systems
 We divide the characteristics into two categories:
a) those related to distributed databases and distributed systems:
1- Scalability.
2- Availability.
3- Replication Models.
4- Sharding of Files.
5- High-Performance Data Access.
b) and those related to data models and query languages.
As we show in the following diagram:

24.1.2 Characteristics of NOSQL Systems
[Diagram: NOSQL characteristics fall into two categories. (a) Characteristics related to distributed databases and distributed systems: scalability (horizontal or vertical), availability, replication models (master-slave or master-master), sharding of files, and high-performance data access (achieved by techniques such as hashing on object keys). (b) Characteristics related to data models and query languages.]

24.1.3 Categories of NOSQL Systems
 NOSQL systems have been characterized into four major categories:
1. Document-based NOSQL systems.
2. NOSQL key-value stores.
3. Column-based or wide column NOSQL systems.
4. Graph-based NOSQL systems.
Additional categories can be added as follows to include some systems that are not easily
categorized into these four categories:
5. Hybrid NOSQL systems.
6. Object databases.
7. XML databases.

24.2 The CAP Theorem
 The CAP theorem can be used to explain some competing requirements
in a distributed system with replication.
The three letters in CAP refer to three desirable properties of
distributed systems with replicated data:
C Consistency (among replicated copies)
A Availability (of the system for read and write operations)
P Partition tolerance (in the face of the nodes in the system
being partitioned by a network fault).

24.2 The CAP Theorem
From the point of view of the clients, the three properties mean:
Consistency: all clients see the same view of the data, even right
after an update or delete.
Availability: all clients can find a replica of the data, even in the case
of partial node failures.
Partition tolerance: the system continues to work, even in the presence
of a partial network failure.
The CAP theorem states that a distributed system with data replication
cannot guarantee all three properties at the same time.

24.3 Document-Based NoSQL Systems and MongoDB
 Document-based or document-oriented NoSQL systems store data as
collections of similar documents. These types of systems are also
sometimes known as document stores.
 A major difference between document-based systems and object,
object-relational, and XML systems is that there is no requirement
to specify a schema; rather, the documents are specified as
self-describing data.

24.3.1 MongoDB Data Model
 MongoDB documents are stored in BSON (Binary JSON) format, which
is a variation of JSON with some additional data types and is more
efficient for storage than JSON.
 Individual documents are stored in a collection.
As a simple example: COMPANY database.
The following command can be used to create a collection called project to hold
PROJECT objects from the COMPANY database:
db.createCollection("project", { capped : true, size : 1310720, max : 500 } )
* The collection is capped; this means it has upper limits on its storage space (size) and
number of documents (max).
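
The effect of the max limit can be pictured with a small Python sketch. This is only an analogy, not MongoDB code: it assumes a fixed-size buffer that discards the oldest documents first, it ignores the byte limit (size), and the limit of 3 and the document contents are hypothetical values chosen to keep the example short.

```python
from collections import deque

# MAX_DOCS stands in for the `max` option (500 in the slide's example).
MAX_DOCS = 3

# A deque with maxlen drops the oldest entry when a new one is appended
# to a full buffer, which mimics a capped collection's document limit.
project = deque(maxlen=MAX_DOCS)

for n in range(1, 6):
    project.append({"Pname": f"Product{n}"})  # hypothetical documents

# Only the three most recently inserted documents remain.
names = [doc["Pname"] for doc in project]
```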

24.3.2 MongoDB CRUD Operations
 MongoDB has several CRUD operations, where CRUD stands for
(create, read, update, delete).
 Documents can be created and inserted into their collections using
the insert operation, whose format is:
db.<collection_name>.insert(<document(s)>)
 The delete operation is called remove, and the format is:
db.<collection_name>.remove(<condition>)
 For read queries, the main command is called find, and the format is:
db.<collection_name>.find(<condition>)
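
To make the three formats concrete, here is a minimal in-memory sketch in Python. It is not the MongoDB API: the Collection class is hypothetical, a condition is simplified to a dictionary of field values that must all match, and the sample documents are illustrative.

```python
class Collection:
    """Hypothetical in-memory stand-in for a document collection."""

    def __init__(self):
        self.docs = []

    def insert(self, *documents):
        # Mirrors db.<collection_name>.insert(<document(s)>).
        self.docs.extend(documents)

    @staticmethod
    def _matches(doc, condition):
        # Simplified condition: every listed field must equal the given value.
        return all(doc.get(field) == value for field, value in condition.items())

    def find(self, condition=None):
        # Mirrors db.<collection_name>.find(<condition>); no condition returns all.
        condition = condition or {}
        return [d for d in self.docs if self._matches(d, condition)]

    def remove(self, condition):
        # Mirrors db.<collection_name>.remove(<condition>).
        self.docs = [d for d in self.docs if not self._matches(d, condition)]


project = Collection()
project.insert(
    {"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"},
    {"_id": "P2", "Pname": "ProductY", "Plocation": "Sugarland"},
)
found = project.find({"Plocation": "Bellaire"})
project.remove({"_id": "P1"})
remaining = project.find()
```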

24.3.3 MongoDB Distributed Systems Characteristics
 The concept of a replica set is used in MongoDB to create multiple
copies of the same data set on different nodes in the distributed
system, and it uses a variation of the master-slave approach for
replication.
 For example, suppose that we want to replicate a particular
document collection C. A replica set will have one primary copy of
the collection C stored in one node N1, and at least one secondary
copy (replica) of C stored at another node N2. Additional copies can
be stored in nodes N3, N4, etc.
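
The primary and secondary roles can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not MongoDB's replication protocol: the Node and ReplicaSet classes are hypothetical, all writes go to the primary, and replication is an explicit copy step.

```python
class Node:
    """Hypothetical node holding one copy of collection C."""

    def __init__(self, name):
        self.name = name
        self.collection = {}


class ReplicaSet:
    """One primary copy plus any number of secondary copies."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries

    def write(self, key, doc):
        # Writes are applied at the primary copy only.
        self.primary.collection[key] = doc

    def replicate(self):
        # Secondaries catch up by copying the primary's current state.
        for node in self.secondaries:
            node.collection = dict(self.primary.collection)


n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
rs = ReplicaSet(primary=n1, secondaries=[n2, n3])

rs.write("P1", {"Pname": "ProductX"})
before = "P1" in n2.collection   # secondary has not yet received the write
rs.replicate()
after = "P1" in n2.collection    # secondary now holds the same data
```

The gap between the write and the copy step is where a read from a secondary could return older data.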

24.3.3 MongoDB Distributed Systems Characteristics
 There are two ways to partition a collection into shards in MongoDB:
1. range partitioning
2. hash partitioning
 The partitioning field, known as the shard key in MongoDB, must
have two characteristics:
1. it must exist in every document in the collection,
2. and it must have an index.
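
The two partitioning methods can be contrasted with a short Python sketch. The split points, the number of shards, and the use of MD5 are assumptions made for illustration; MongoDB uses its own chunk ranges and hash function.

```python
import bisect
import hashlib

NUM_SHARDS = 3
# Hypothetical split points: keys < "h" go to shard 0,
# keys from "h" up to (not including) "q" go to shard 1, the rest to shard 2.
RANGE_SPLIT_POINTS = ["h", "q"]


def range_shard(shard_key):
    """Range partitioning: nearby key values land on the same shard."""
    return bisect.bisect_right(RANGE_SPLIT_POINTS, shard_key)


def hash_shard(shard_key):
    """Hash partitioning: a hash of the key spreads documents across shards."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


range_result = [range_shard(k) for k in ("alice", "mona", "zed")]
hash_result = hash_shard("alice")
```

Range partitioning suits range queries on the shard key, while hash partitioning tends to spread load more evenly.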

24.4 NoSQL Key-Value Stores
 Key-value stores focus on high performance, availability, and scalability
by storing data in a distributed storage system.
 The main characteristic of key-value stores is the fact that every value
(data item) must be associated with a unique key, and that retrieving
the value by supplying the key must be very fast.

24.4.1 DynamoDB Overview
 The DynamoDB system is an Amazon product and is available as part
of Amazon’s AWS/SDK platforms (Amazon Web Services/Software
Development Kit).
 The basic data model in DynamoDB uses the concepts of tables, items,
and attributes.
• When a table is created, it is required to specify a table name and
a primary key.
• The primary key can be one of the following two types:
1. A single attribute.
2. A pair of attributes.

24.4.2 Voldemort Key-Value Distributed Data Store
 Voldemort is an open source system available through Apache 2.0
open source licensing rules. It is based on Amazon’s DynamoDB.
 Some of the features of Voldemort are as follows:
1. Simple basic operations.
2. High-level formatted data values.
Example :
o s.put(k, v) inserts an item as a key-value pair with key k and value v.
o s.delete(k) deletes the item whose key is k from the store.
o v = s.get(k) retrieves the value v associated with key k.
The values v in the (k, v) items can be specified in JSON (JavaScript Object Notation), and
the system will convert between JSON and the internal storage format.
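
The three operations and the JSON conversion can be sketched in Python as follows. This is not Voldemort's client API: the KeyValueStore class is hypothetical, it runs in memory on one machine, and the JSON text stands in for the internal storage format.

```python
import json


class KeyValueStore:
    """Hypothetical single-node store with put, get, and delete."""

    def __init__(self):
        self._data = {}

    def put(self, k, v):
        # Convert the value to JSON text, standing in for the internal format.
        self._data[k] = json.dumps(v)

    def get(self, k):
        # Convert back from the stored form; a missing key yields None.
        raw = self._data.get(k)
        return None if raw is None else json.loads(raw)

    def delete(self, k):
        self._data.pop(k, None)


s = KeyValueStore()
s.put("emp:1", {"Fname": "John", "Lname": "Smith"})  # illustrative key and value
v = s.get("emp:1")
s.delete("emp:1")
gone = s.get("emp:1")
```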

24.4.3 Examples of Other Key-Value Stores
 Oracle key-value store. Oracle has one of the well-known SQL relational database systems, and
Oracle also offers a system based on the key-value store concept; this system is called the Oracle
NoSQL Database.
24.5 Column-Based or Wide Column NOSQL Systems
 Another category of NOSQL systems is known as column-based or wide column systems.
 The Google distributed storage system for big data, known as BigTable, is a well-known example of this
class of NOSQL systems, and it is used in many Google applications that require large amounts of data
storage, such as Gmail.
 BigTable uses the Google File System (GFS) for data storage and distribution. An open source system
known as Apache Hbase is somewhat similar to Google BigTable, but it typically uses HDFS (Hadoop
Distributed File System) for data storage.

24.5.1 Hbase Data Model and Versioning
 The data model in Hbase organizes data using the concepts of namespaces, tables, column
families, column qualifiers, columns, rows, and data cells.
 A column is identified by a combination of (column family:column qualifier).
 Data is stored in a self-describing form by associating columns with data values, where data
values are strings.
examples :
Creating a table called EMPLOYEE with three column families: Name, Address, and Details:
create 'EMPLOYEE', 'Name', 'Address', 'Details'
Some Hbase basic CRUD operations:
Creating a table: create <tablename>, <column family>, <column family>, …
Inserting Data: put <tablename>, <rowid>, <column family>:<column qualifier>, <value>
Reading Data (all data in a table): scan <tablename>
Retrieve Data (one item): get <tablename>,<rowid>

24.5.1 Hbase Data Model and Versioning
examples :
inserting some row data in the EMPLOYEE table:
put 'EMPLOYEE', 'row1', 'Name:Fname', 'Ahmad'
put 'EMPLOYEE', 'row1', 'Name:Lname', 'Ali'
put 'EMPLOYEE', 'row1', 'Name:Nickname', 'Khalidi'
put 'EMPLOYEE', 'row1', 'Details:Job', 'Engineer'
put 'EMPLOYEE', 'row1', 'Details:Review', 'Good'
put 'EMPLOYEE', 'row2', 'Name:Fname', 'Sara'
put 'EMPLOYEE', 'row2', 'Name:Lname', 'Rami'
put 'EMPLOYEE', 'row2', 'Name:MName', 'S'
put 'EMPLOYEE', 'row2', 'Details:Job', 'IT'
put 'EMPLOYEE', 'row2', 'Details:Supervisor', 'Sead Noor'
put 'EMPLOYEE', 'row3', 'Name:Fname', 'Hasan'
put 'EMPLOYEE', 'row3', 'Name:Minit', 'E'
put 'EMPLOYEE', 'row3', 'Name:Lname', 'Mohammad'
put 'EMPLOYEE', 'row3', 'Name:Suffix', 'Mr.'
put 'EMPLOYEE', 'row3', 'Details:Job', 'CEO'
put 'EMPLOYEE', 'row3', 'Details:Salary', '1,000,000'
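
The way rows, column families, column qualifiers, and versions fit together can be sketched in Python. This is an in-memory illustration, not the Hbase API: the Table class and the logical timestamps are hypothetical, and the second value written to Details:Job ('Manager') is invented to show versioning.

```python
from collections import defaultdict
import itertools

# Hypothetical logical clock used as the version timestamp.
_clock = itertools.count(1)


class Table:
    def __init__(self, name, *families):
        self.name = name
        self.families = set(families)  # column families are fixed at creation
        # rows[rowid][family:qualifier] -> list of (timestamp, value) versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, rowid, column, value):
        family, _qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are not declared in advance; each put adds a new version.
        self.rows[rowid][column].append((next(_clock), value))

    def get(self, rowid):
        # Return the most recent version of every cell in the row.
        return {col: versions[-1][1] for col, versions in self.rows[rowid].items()}


employee = Table("EMPLOYEE", "Name", "Address", "Details")
employee.put("row1", "Name:Fname", "Ahmad")
employee.put("row1", "Details:Job", "Engineer")
employee.put("row1", "Details:Job", "Manager")  # hypothetical newer version
latest = employee.get("row1")
job_versions = len(employee.rows["row1"]["Details:Job"])
```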

24.5.2 Hbase CRUD Operations
 Hbase has low-level CRUD (create, read, update, delete) operations, as in many of the NoSQL
systems.
24.5.3 Hbase Storage and Distributed System Concepts
 Each Hbase table is divided into a number of regions, where each region will hold a range of the
row keys in the table; this is why the row keys must be lexicographically ordered.
 Hbase uses the Apache Zookeeper open source system for services related to managing the
naming, distribution, and synchronization of the Hbase data on the distributed Hbase server
nodes, as well as for coordination and replication services.
 Hbase also uses Apache HDFS (Hadoop Distributed File System) for distributed file services.

24.6 NOSQL Graph Databases and Neo4j
 Another category of NOSQL systems is known as graph databases or graph-oriented
NOSQL systems.
 The data is represented as a graph, which is a collection of vertices (nodes) and edges.
 Both nodes and edges can be labeled to indicate the types of entities and relationships they
represent, and it is generally possible to store data associated with both individual nodes and
individual edges.

24.6.1 Neo4j Data Model
 The data model in Neo4j organizes data using the concepts of nodes and relationships.
 Both nodes and relationships can have properties, which store the data items associated with
nodes and relationships.
 Nodes can have labels; the nodes that have the same label are grouped into a collection that
identifies a subset of the nodes in the database graph for querying purposes.
 Each relationship has a start node and end node as well as a relationship type, which serves a
similar role to a node label by identifying similar relationships that have the same relationship
type.
 In conventional graph theory, nodes and relationships are generally called vertices and edges.
The Neo4j graph data model somewhat resembles how data is represented in the ER and EER
models.
 Properties can be specified via a map pattern, which is made of one or more "name : value" pairs
enclosed in curly brackets; for example {Lname : 'Smith', Fname : 'John', Minit : 'B'}.

24.6.1 Neo4j Data Model
Labels and properties:
 When a node is created, the node label can be specified. It is also possible to create nodes
without any labels.
Indexing and node identifiers:
 When a node is created, the Neo4j system creates an internal unique system-defined identifier for
each node.
 To retrieve individual nodes efficiently using other properties, the user can create indexes on the
nodes that share a particular label.
 For example, Empid can be used to index nodes with the EMPLOYEE label, Dno to index the
nodes with the DEPARTMENT label, and Pno to index the nodes with the PROJECT label.

24.6.2 The Cypher Query Language of Neo4j
 Neo4j has a high-level query language, Cypher.
 There are declarative commands for creating nodes and relationships, as well as for finding nodes
and relationships based on specifying patterns.
 Deletion and modification of data is also possible in Cypher.
Examples in Neo4j using the Cypher language
creating some nodes for the COMPANY data
CREATE (e1:EMPLOYEE {Empid: '1', Lname: 'Smith', Fname: 'John', Minit: 'B'})
CREATE (e2:EMPLOYEE {Empid: '2', Lname: 'Wong', Fname: 'Franklin'})
CREATE (e3:EMPLOYEE {Empid: '3', Lname: 'Zelaya', Fname: 'Alicia'})
CREATE (e4:EMPLOYEE {Empid: '4', Lname: 'Wallace', Fname: 'Jennifer', Minit: 'S'}) ...
CREATE (d1:DEPARTMENT {Dno: '5', Dname: 'Research'})
CREATE (d2:DEPARTMENT {Dno: '4', Dname: 'Administration'}) ...
CREATE (p1:PROJECT {Pno: '1', Pname: 'ProductX'})
CREATE (p2:PROJECT {Pno: '2', Pname: 'ProductY'})
CREATE (p3:PROJECT {Pno: '10', Pname: 'Computerization'})
CREATE (p4:PROJECT {Pno: '20', Pname: 'Reorganization'})
...
CREATE (loc1:LOCATION {Lname: 'Houston'})
CREATE (loc2:LOCATION {Lname: 'Stafford'})
CREATE (loc3:LOCATION {Lname: 'Bellaire'})
CREATE (loc4:LOCATION {Lname: 'Sugarland'})
...

24.6.2 The Cypher Query Language of Neo4j
Examples in Neo4j using the Cypher language
creating some relationships for the COMPANY data :
CREATE (e1)-[:WorksFor]->(d1) CREATE (e3)-[:WorksFor]->(d2) ...
CREATE (d1)-[:Manager]->(e2) CREATE (d2)-[:Manager]->(e4) ...
CREATE (d1)-[:LocatedIn]->(loc1) CREATE (d1)-[:LocatedIn]->(loc3) CREATE (d1)-[:LocatedIn]->(loc4)
CREATE (d2)-[:LocatedIn]->(loc2) ...
CREATE (e1)-[:WorksOn {Hours: '32.5'}]->(p1) CREATE (e1)-[:WorksOn {Hours: '7.5'}]->(p2)
CREATE (e2)-[:WorksOn {Hours: '10.0'}]->(p1) CREATE (e2)-[:WorksOn {Hours: 10.0}]->(p2)
CREATE (e2)-[:WorksOn {Hours: '10.0'}]->(p3) CREATE (e2)-[:WorksOn {Hours: 10.0}]->(p4) ...
Basic simplified syntax of some common Cypher clauses:
Finding nodes and relationships that match a pattern: MATCH <pattern>
Specifying aggregates and other query variables: WITH <specifications>
Specifying conditions on the data to be retrieved: WHERE <condition>
Specifying the data to be returned: RETURN <data>
Ordering the data to be returned: ORDER BY <data>
Limiting the number of returned data items: LIMIT <max number>
Creating nodes: CREATE <node, optional labels and properties>
Creating relationships: CREATE <relationship, relationship type and optional properties>
Deletion: DELETE <nodes or relationships>
Specifying property values and labels: SET <property values and labels>
Removing property values and labels: REMOVE <property values and labels>
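
What a MATCH pattern does can be illustrated with a small Python sketch over a few of the COMPANY nodes and relationships. This is not Cypher and not a Neo4j driver call: the dictionaries and the match function are hypothetical, and the pattern is limited to one relationship between two labeled nodes, roughly in the spirit of matching employees who work for the department with Dno '5'.

```python
# A subset of the COMPANY nodes, keyed by the variable names used in the slides.
nodes = {
    "e1": {"label": "EMPLOYEE", "props": {"Empid": "1", "Lname": "Smith"}},
    "e3": {"label": "EMPLOYEE", "props": {"Empid": "3", "Lname": "Zelaya"}},
    "d1": {"label": "DEPARTMENT", "props": {"Dno": "5", "Dname": "Research"}},
    "d2": {"label": "DEPARTMENT", "props": {"Dno": "4", "Dname": "Administration"}},
}

# Each relationship has a start node, a relationship type, and an end node.
relationships = [("e1", "WorksFor", "d1"), ("e3", "WorksFor", "d2")]


def match(start_label, rel_type, end_label, end_props=None):
    """Find (start, end) node pairs that fit a one-relationship pattern."""
    end_props = end_props or {}
    results = []
    for start, rtype, end in relationships:
        s, e = nodes[start], nodes[end]
        if (
            rtype == rel_type
            and s["label"] == start_label
            and e["label"] == end_label
            and all(e["props"].get(k) == v for k, v in end_props.items())
        ):
            results.append((s["props"], e["props"]))
    return results


research_staff = match("EMPLOYEE", "WorksFor", "DEPARTMENT", {"Dno": "5"})
```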

24.6.3 Neo4j Interfaces and Distributed System Characteristics
 Neo4j has other interfaces that can be used to create, retrieve, and update nodes and
relationships in a graph database.
 Neo4j has two main editions:
1. Enterprise edition.
2. Community edition.
 Both editions support the Neo4j graph data model and storage system, as well as the Cypher graph query
language and several other interfaces, including a high-performance native API and language drivers for
several popular programming languages, such as Java, Python, and PHP.
 In addition, both editions support ACID properties.

24.7 Summary
 In this chapter, we discussed the class of database systems known as NOSQL systems, which focus on efficient
storage and retrieval of large amounts of “big data.” Applications that use these types of systems include social
media, Web links, user profiles, marketing and sales, posts and tweets, road maps and spatial data, and e-mail.
 The term NOSQL is generally interpreted as Not Only SQL—rather than NO to SQL—and is meant to convey
that many applications need systems other than traditional relational SQL systems to augment their data
management needs.
 These systems are distributed databases or distributed storage systems, with a focus on semistructured data
storage, high performance, availability, data replication, and scalability rather than an emphasis on immediate
data consistency, powerful query languages, and structured data storage.
 We started with an introduction to NOSQL systems, their characteristics, and how they differ from SQL
systems. The four general categories of NOSQL systems are document-based, key-value stores,
column-based, and graph-based.
 We discussed how NOSQL systems approach the issue of consistency among multiple replicas (copies) by using the
paradigm known as eventual consistency. We also discussed the CAP theorem, which can be used to understand the
emphasis of NOSQL systems on availability.

24.7 Summary
The four main categories of NOSQL systems:
1. document-based systems
2. key-value stores
3. column-based systems
4. graph-based systems
 We also noted that some NOSQL systems may not fall
neatly into a single category but rather use techniques
that span two or more categories.

References
1. Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems, Seventh Edition.
2. Peter W. Resnick. "Internet Message Format". tools.ietf.org. Retrieved 2018-10-02.
3. "JSON Objects". www.w3schools.com. Retrieved 2018-10-02.
