MongoDB
https://www.mongodb.com/
Prutha Date (dprutha1@umbc.edu)
Siraj Memon (siraj1@umbc.edu)
Outline
• Introduction to MongoDB
• Storage Layout
• Data Management Features
• Performance Analysis
• Limitations
• Conclusion
• Demo
• References
What is MongoDB?
• MongoDB is a NoSQL Document-Oriented database.
• It provides semi-structured flexible schema.
• It provides high performance, high availability, and easy scalability.
• MongoDB is free and open source software.
• License: GNU Affero General Public License (AGPL) and Apache License
• MongoDB is a server process that runs on Linux, Windows and OS X. It can
be run as either a 32-bit or a 64-bit application.
When to use MongoDB?
“Knowing when to use a hammer, and when to use a screwdriver.”
• Account and user profiles: can store arrays of addresses with ease (MetLife)
• Content Management Systems (CMS): the flexible schema of MongoDB is great for heterogeneous
collections of content types (MongoPress)
• Form data: MongoDB makes it easy to evolve the structure of form data over time (ADP)
• Blogs / user-generated content: can keep data with complex relationships together in one object (Forbes,
AOL)
• Messaging: vary message meta-data easily per message or message type without needing to maintain
separate collections or schemas (Viber)
• System configuration: just a nice object graph of configuration values, which is very natural in MongoDB
(Cisco)
• Log data of any kind: structured log data is the future (eBay)
• Location-based systems: make use of geospatial indexes (Foursquare, City government of Chicago)
Terminologies – RDBMS vs MongoDB
*JSON – JavaScript Object Notation
Storage Internals - Directory Layout
Data Directory is found at /data/db
Internal File Format
Extent Structure
Extents and Records
To Sum Up: Internal File Format
• Files on disk are broken into extents which contain the documents.
• A collection has one or more extents.
• Extents grow exponentially in size, up to 2 GB.
• Namespace entries in the ns (namespace) file point to the first extent
for that collection.
Virtual Address Space
Storage Engine - MMAP (Memory Mapped)
• All data files are memory mapped to Virtual Memory by the
OS.
• MongoDB just reads / writes to RAM in the filesystem cache
• OS takes care of the rest!
• Virtual process size = total file size + overhead (connections,
heap)
• Uses memory-mapped files via the mmap() system call.
Storage Engine - WiredTiger
• Designed especially for Write-Intensive applications
• Document-level (record-level) locking
• Compression
• Multi-version concurrency control (MVCC)
• Multi-document transactions
• Support for Log Structured Merge (LSM) trees for very high
insert workloads
What makes MongoDB cool?
• Sharding
• Aggregation Framework and Map-Reduce
• Capped Collection
• GridFS
• Geo-Spatial Indexing
Sharding
• Horizontal scaling - divides the data set and distributes the data over
multiple servers, or shards.
• Used to support deployments with very large data sets and high
throughput operations.
• Sharded Cluster Components –
  • Shards – a mongod instance or a replica set
  • Config Servers – multiple mongod instances
  • Routing Instances – multiple mongos instances
• Data on the shards is divided into chunks with a configurable maximum
size, using ranges of shard key values.
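As a rough illustration of range-based chunks, the sketch below routes a shard key value to the shard that owns the matching chunk, much as a mongos router would. The chunk table, the shard names and the routeByShardKey function are made up for this example; a real cluster keeps this metadata on the config servers.

```javascript
// Hypothetical chunk table: each chunk covers [min, max) of the shard key.
const chunks = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardA" },
];

// Return the name of the shard whose chunk range contains the key.
function routeByShardKey(key) {
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  if (!chunk) throw new Error(`no chunk covers key ${key}`);
  return chunk.shard;
}

console.log(routeByShardKey(50));  // shardA
console.log(routeByShardKey(100)); // shardB (lower bound is inclusive)
console.log(routeByShardKey(250)); // shardA
```

A query that carries the shard key can be sent to a single shard this way; a query without it has to be broadcast to every shard.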
Sharding Internals
Choosing a Shard key
The choice of shard key affects:
• Distribution of reads and writes
  • Uneven distribution of reads/writes across shards.
  • Solution – Hashed ids
• Size of chunks
  • Jumbo chunks cause uneven distribution of data.
  • Moving data between shards becomes difficult.
  • Solution – Multi-tenant compound index
• The number of shards each query hits
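To see why hashed ids help with write distribution, the sketch below places 1000 ever-increasing ids on four shards, once by key range and once by hash. The range boundaries, the shard count and the use of MD5 are assumptions for illustration only; this is not MongoDB's actual hashing function.

```javascript
const crypto = require("crypto");

const SHARD_COUNT = 4;
// Hypothetical upper bounds of the key ranges owned by shards 0..3.
const RANGE_UPPER_BOUNDS = [250, 500, 750, Infinity];

// Range placement: the first range whose upper bound exceeds the id.
function rangedShard(id) {
  return RANGE_UPPER_BOUNDS.findIndex((upper) => id < upper);
}

// Hashed placement: a stand-in hash (MD5) reduced modulo the shard count.
function hashedShard(id) {
  const digest = crypto.createHash("md5").update(String(id)).digest();
  return digest.readUInt32BE(0) % SHARD_COUNT;
}

// Count how many ids land on each shard under a placement function.
function countPerShard(ids, pickShard) {
  const counts = new Array(SHARD_COUNT).fill(0);
  for (const id of ids) counts[pickShard(id)] += 1;
  return counts;
}

// New documents get ids 1000..1999, all above the existing ranges.
const newIds = Array.from({ length: 1000 }, (_, i) => 1000 + i);
const rangedCounts = countPerShard(newIds, rangedShard);
const hashedCounts = countPerShard(newIds, hashedShard);

console.log("ranged:", rangedCounts); // every insert hits the last shard
console.log("hashed:", hashedCounts); // inserts spread over all shards
```

The trade-off is that hashing scatters neighbouring ids, so range queries on the key are likely to touch every shard.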
Aggregation Framework
• Aggregation Pipeline
• Map-Reduce
• Single Purpose Aggregation Operations (deprecated in latest version)
Aggregation Pipeline
• The aggregation pipeline is a framework for performing aggregation
tasks, modeled on the concept of data processing pipelines.
• Using this framework, MongoDB passes the documents of a single
collection through a pipeline.
• The pipeline transforms the documents into aggregated results, and is
accessed through the aggregate database command.
• Stages include $match, $project, $unwind, $sort and $limit.
• The user chooses which stages to apply and in what order.
Aggregation Pipeline - Example
Continued…
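A pipeline of this kind could look like the following sketch. To keep it runnable without a server, it applies the stages to an in-memory array with a tiny interpreter; the posts data, the stages table and runPipeline are hypothetical, and only simple forms of each stage are imitated. Against a real server the same pipeline array would be passed to the aggregate command.

```javascript
// Minimal in-memory imitation of a few stages (not MongoDB's implementation).
const stages = {
  // Keep documents whose fields equal the given values (equality only).
  $match: (docs, cond) =>
    docs.filter((d) => Object.keys(cond).every((k) => d[k] === cond[k])),
  // Emit one document per element of the named array field.
  $unwind: (docs, path) => {
    const field = path.replace(/^\$/, "");
    return docs.flatMap((d) =>
      (d[field] || []).map((v) => ({ ...d, [field]: v }))
    );
  },
  // Keep only the fields flagged with 1 (no implicit _id handling here).
  $project: (docs, spec) =>
    docs.map((d) =>
      Object.fromEntries(
        Object.keys(spec)
          .filter((k) => spec[k] === 1 && k in d)
          .map((k) => [k, d[k]])
      )
    ),
  // Sort by a single field; 1 = ascending, -1 = descending.
  $sort: (docs, spec) => {
    const [field, dir] = Object.entries(spec)[0];
    return [...docs].sort((a, b) =>
      a[field] < b[field] ? -dir : a[field] > b[field] ? dir : 0
    );
  },
  // Keep the first n documents.
  $limit: (docs, n) => docs.slice(0, n),
};

// Feed the output of each stage into the next one.
function runPipeline(docs, pipeline) {
  return pipeline.reduce((current, stage) => {
    const [name, arg] = Object.entries(stage)[0];
    if (!(name in stages)) throw new Error(`unsupported stage: ${name}`);
    return stages[name](current, arg);
  }, docs);
}

const posts = [
  { author: "ann", views: 10, tags: ["db", "nosql"] },
  { author: "bob", views: 30, tags: ["db"] },
  { author: "ann", views: 25, tags: ["mongo"] },
];

const pipeline = [
  { $match: { author: "ann" } },
  { $unwind: "$tags" },
  { $project: { tags: 1, views: 1 } },
  { $sort: { views: -1 } },
  { $limit: 2 },
];

const result = runPipeline(posts, pipeline);
console.log(result); // ann's two most-viewed (tag, views) pairs
```

Placing $match first keeps later stages working on fewer documents, which is the usual reason for ordering a pipeline this way.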
Map-Reduce
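The map-reduce flow can be sketched in memory as follows: a map function emits key/value pairs per document, the values are grouped by key, and a reduce function folds each group into one result. The mapReduce helper and the orders data are hypothetical; in the mongo shell emit() is a global, whereas here it is passed to the map function as an argument.

```javascript
// In-memory imitation of the map-reduce flow (not MongoDB's engine).
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  // Collect every emitted value under its key.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // As in MongoDB, the current document is available as `this` in map().
  for (const doc of docs) map.call(doc, emit);
  // Like MongoDB, skip reduce() for keys that received a single value.
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

const orders = [
  { cust: "a", amount: 5 },
  { cust: "b", amount: 7 },
  { cust: "a", amount: 3 },
];

const totals = mapReduce(
  orders,
  function (emit) {
    emit(this.cust, this.amount);
  },
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);

console.log(totals); // totals per customer: a -> 8, b -> 7
```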
Capped Collection
• A capped collection is a fixed-size collection.
• Create one with db.createCollection, marking it as capped.
• e.g. db.createCollection('logs', {capped: true, size: 2097152})
• When it reaches the size limit, the oldest documents are automatically
removed
• Guarantees preservation of the insertion order
• Maintains insertion order identical to the order on disk by prohibiting
updates that increase document size
• Allows the use of a tailable cursor to retrieve documents
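The behaviour described above resembles a size-bounded queue. The sketch below imitates it in memory; the CappedLog class is hypothetical, and the length of the JSON text stands in for the BSON size that MongoDB actually measures.

```javascript
// In-memory imitation of a capped collection (not MongoDB's storage code).
class CappedLog {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.entries = []; // kept in insertion order
    this.bytes = 0;
  }

  // Append a document, then evict the oldest ones until the cap is respected.
  insert(doc) {
    const size = Buffer.byteLength(JSON.stringify(doc));
    if (size > this.maxBytes) throw new Error("document larger than the cap");
    this.entries.push({ doc, size });
    this.bytes += size;
    while (this.bytes > this.maxBytes) {
      this.bytes -= this.entries.shift().size;
    }
  }

  // Documents come back in insertion order.
  find() {
    return this.entries.map((e) => e.doc);
  }
}

// Each {"n":x} document below is 7 bytes as JSON, so 21 bytes hold three.
const logs = new CappedLog(21);
for (let n = 1; n <= 5; n++) logs.insert({ n });
console.log(logs.find()); // the three newest documents, oldest first
```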
GridFS
• GridFS is a specification for storing and retrieving files that exceed
the BSON (binary JSON) document size limit of 16MB.
• Instead of storing a file in a single document, GridFS divides a file into
parts, or chunks, and stores each of those chunks as a separate
document.
• By default GridFS limits chunk size to 255 KB.
• GridFS uses two collections to store files. One collection stores the file
chunks, and the other stores file metadata.
• GridFS is useful not only for storing files that exceed 16MB but also
for storing any files for which you want access without having to load
the entire file into memory.
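The chunking step can be sketched as below. The splitIntoChunks helper and the file id are hypothetical; the files_id, n and data field names follow the GridFS chunk document layout, and 255 KB is taken to mean 255 * 1024 bytes.

```javascript
const CHUNK_SIZE = 255 * 1024; // default chunk size in bytes

// Split a file's bytes into chunk documents numbered from 0.
function splitIntoChunks(fileId, buffer, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < buffer.length; offset += chunkSize, n++) {
    chunks.push({
      files_id: fileId, // links the chunk to its metadata document
      n, // position of the chunk within the file
      data: buffer.subarray(offset, offset + chunkSize),
    });
  }
  return chunks;
}

// A 600 KB dummy file needs three chunks: two full ones and a remainder.
const file = Buffer.alloc(600 * 1024, 1);
const fileChunks = splitIntoChunks("demo-file", file);
console.log(fileChunks.map((c) => c.data.length));
```

Because each chunk is its own document, a reader can fetch only the chunks covering the byte range it needs instead of loading the whole file.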
GeoSpatial Indexing
• To support efficient queries of geospatial coordinate data, MongoDB
provides two special indexes:
• 2d indexes, which use planar geometry when returning results.
• 2dsphere indexes, which use spherical geometry when returning results.
• Store location data as GeoJSON objects with this coordinate-axis
order: longitude, latitude.
• GeoJSON objects supported: Point, LineString, Polygon, etc.
• Query Operations: Inclusion, Intersection, Proximity.
• You cannot use a geospatial index as the shard key index.
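For a feel of what a proximity query on spherical geometry has to compute, the sketch below builds two GeoJSON Points (longitude first, then latitude) and measures the great-circle distance between them with the haversine formula. The function names, the sample points and the Earth radius constant are assumptions for illustration, not MongoDB internals.

```javascript
const EARTH_RADIUS_M = 6371000; // assumed mean Earth radius in metres

const toRadians = (deg) => (deg * Math.PI) / 180;

// Great-circle (haversine) distance in metres between two GeoJSON Points.
function sphericalDistance(a, b) {
  const [lng1, lat1] = a.coordinates;
  const [lng2, lat2] = b.coordinates;
  const dLat = toRadians(lat2 - lat1);
  const dLng = toRadians(lng2 - lng1);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// Proximity test: is the point within maxMetres of the centre?
function isNear(point, centre, maxMetres) {
  return sphericalDistance(point, centre) <= maxMetres;
}

const origin = { type: "Point", coordinates: [0, 0] };
const oneDegreeNorth = { type: "Point", coordinates: [0, 1] };

const metres = sphericalDistance(origin, oneDegreeNorth);
console.log(Math.round(metres / 1000), "km"); // about 111 km
console.log(isNear(oneDegreeNorth, origin, 120000)); // true
```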
Performance Analysis
• Yahoo! Cloud Serving Benchmark (YCSB)
• Throughput (ops/second)
Workload                                      Cassandra   Couchbase   MongoDB
50% read, 50% update                            134,839     106,638   160,719
95% read, 5% update                             144,455     187,798   196,498
50% read, 50% update (Durability Optimized)       6,289       1,236    31,864
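The relative gaps are easier to read as ratios. The sketch below divides MongoDB's throughput by each competitor's for every workload; the numbers are copied from the figures above, and the mongoSpeedup helper is a hypothetical name.

```javascript
// Throughput in ops/second, copied from the benchmark figures above.
const throughput = {
  "50% read, 50% update": { Cassandra: 134839, Couchbase: 106638, MongoDB: 160719 },
  "95% read, 5% update": { Cassandra: 144455, Couchbase: 187798, MongoDB: 196498 },
  "50% read, 50% update (Durability Optimized)": { Cassandra: 6289, Couchbase: 1236, MongoDB: 31864 },
};

// Ratio of MongoDB's throughput to another system's, per workload.
function mongoSpeedup(other) {
  return Object.fromEntries(
    Object.entries(throughput).map(([workload, row]) => [
      workload,
      row.MongoDB / row[other],
    ])
  );
}

for (const other of ["Cassandra", "Couchbase"]) {
  for (const [workload, ratio] of Object.entries(mongoSpeedup(other))) {
    console.log(`${workload}: MongoDB / ${other} = ${ratio.toFixed(2)}x`);
  }
}
```

The gap is widest in the durability-optimized run, where the ratio against Cassandra is a little over 5.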
Limitations
• Needs enough RAM to hold your working set, otherwise performance
might suffer.
• MapReduce and Aggregation are single-threaded. To be more specific,
they use one thread per mongod.
• No joins across collections.
• On 32-bit builds, data size is limited to about 2.5 GB.
• Sharding has some unique exceptions. If you plan to shard your data,
you need to shard early as some things that are feasible on a single
server are not feasible on a sharded collection.
Conclusion
• MongoDB is a semi-structured document-oriented NoSQL Database.
• It has two storage engines: MMAP and WiredTiger
• Multiple aggregation frameworks: Aggregation Pipeline and Map-Reduce
• Support for GridFS, GeoSpatial Indexing, Capped Collection
• Higher throughput than Cassandra and Couchbase in the YCSB results shown earlier.
• On-going work – In-memory and HDFS support
DEMO
References
• https://www.mongodb.com/presentations/storage-engine-internals
• http://docs.mongodb.org/manual/core/data-modeling-introduction/
• http://docs.mongodb.org/manual/core/aggregation-introduction/
• https://2013.nosql-matters.org/bcn/wp-content/uploads/2013/12/storage-talk-mongodb.pdf
• http://info-mongodb-com.s3.amazonaws.com/High Performance Benchmark White Paper final.pdf
• https://www.mongodb.com/collateral/mongodb-architecture-guide
• Book - MongoDB: The Definitive Guide by Kristina Chodorow and Michael Dirolf
Questions?
Thank you!
