NY Times +
MongoDB
Lessons Learnt
- Deep Kapadia (NYT R&D)
Why MongoDB?
● Quick Prototyping
● Flexible schema
  ○ Easy to dump data from 3rd party data sources
    ■ bit.ly
    ■ Twitter
  ○ Schema may change depending on the need of the
    day or what metrics we are interested in
● Quick scaling if needed
6 Months ago
● Started out with two weeks of data
● Single MongoDB instance
  ○ No replication
  ○ No backups
  ○ No monitoring
● Data was stored locally on ephemeral
  storage on an EC2 instance
● Logs were stored locally
Technology stack
●   Amazon EC2
●   MongoDB 2.0.x (started with 1.8)
●   Python 2.7.2
●   pymongo 2.1.x
●   Tornado 2.2 load balanced over Nginx
●   Little bit of Ruby on Rails (going away soon)
●   Custom WebGL based framework for
    visualization
Monitoring
● Monitoring tools
  ○   db.serverStatus() + cron + email
  ○   10gen's Mongo Monitoring Service
  ○   M/Monit - very basic
  ○   Nagios
● At the least monitor
  ○   The mongod process
  ○   Memory usage
  ○   CPU usage
  ○   Disk usage
● Desirable to monitor EVERYTHING
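A minimal sketch of the "db.serverStatus() + cron + email" idea: a function that turns a serverStatus-style document into warning lines a cron job could mail out. The thresholds and the function name are assumptions for illustration; `mem.resident` (in MB) and `connections.available` are fields of the serverStatus output.

```python
def check_status(status, max_resident_mb=6000, min_free_connections=100):
    """Return a list of warning strings for a serverStatus-like dict."""
    warnings = []

    # mem.resident is reported in megabytes.
    mem = status.get("mem", {})
    if mem.get("resident", 0) > max_resident_mb:
        warnings.append(
            "resident memory %d MB exceeds %d MB" % (mem["resident"], max_resident_mb)
        )

    # connections.available is the number of unused incoming connections.
    conns = status.get("connections", {})
    if "available" in conns and conns["available"] < min_free_connections:
        warnings.append("only %d connections available" % conns["available"])

    return warnings
```

With a driver, the input would be the result of the `serverStatus` command; an empty list means nothing needs to be mailed.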
Replication
● REALLY EASY to set up
● 4 node replica set
  ○   Primary
  ○   Secondary
  ○   Arbiter
  ○   Delayed secondary - Priority 0
      ■ Never gets elected as Primary


All instances are m1.large except the arbiter, which
is a t1.micro
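The four-member layout above could be written as a replica set configuration document along these lines. This is a sketch only: the set name, host names and the one-hour delay are hypothetical, and `slaveDelay` is the field name used by the MongoDB 2.x line.

```python
def build_replica_config(set_name, node_a, node_b, arbiter, delayed, delay_secs=3600):
    """Build a config document: two electable members, an arbiter, a delayed member."""
    return {
        "_id": set_name,
        "members": [
            # Either of the first two members can be elected primary.
            {"_id": 0, "host": node_a},
            {"_id": 1, "host": node_b},
            # The arbiter votes in elections but holds no data.
            {"_id": 2, "host": arbiter, "arbiterOnly": True},
            # Priority 0 keeps the delayed member from ever becoming primary.
            {"_id": 3, "host": delayed, "priority": 0, "slaveDelay": delay_secs},
        ],
    }


def electable_hosts(config):
    """Hosts that may become primary: not arbiters, and priority above 0."""
    return [
        m["host"]
        for m in config["members"]
        if not m.get("arbiterOnly", False) and m.get("priority", 1) > 0
    ]
```

`electable_hosts` is a small sanity check that the delayed secondary and the arbiter are excluded from elections before the config is applied.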
Replication
● Be aware of how your drivers handle failover
  ○ pymongo raises AutoReconnect (pymongo.errors.AutoReconnect)
● Decide up front how you want to handle
  failover
  ○ Lose data
  ○ idempotent writes (keep trying until write is
    successful...up to a certain number of times)
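The "keep trying, up to a certain number of times" option can be sketched as a small retry helper. The helper name, attempt count and backoff are assumptions; with pymongo the exception type passed in would be `pymongo.errors.AutoReconnect`.

```python
import time


def retry_write(write, retry_on, attempts=5, delay=0.5):
    """Call write() until it succeeds, retrying only on the given exception type(s).

    Only safe when write() is idempotent, e.g. an upsert keyed on a unique id.
    """
    for attempt in range(1, attempts + 1):
        try:
            return write()
        except retry_on:
            if attempt == attempts:
                raise  # give up; the caller decides whether losing the write is acceptable
            time.sleep(delay * attempt)  # linear backoff while a new primary is elected
```

Because the same write may be applied more than once when an acknowledgement is lost, this pattern only fits writes that produce the same result on every application.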
Storage
● Do not use local storage on EC2
  ○ ephemeral - data is lost when the instance is stopped, terminated, or fails
● If using EC2 use EBS
  ○   Persistent
  ○   Easy to snapshot
  ○   can be detached and attached to a different server
  ○   Can be RAID'ed for reliability and performance


● Note: EBS is known to have inconsistent
  performance characteristics
● Limited by 1GB/s
Storage
● We started with RAID 10 on EBS
  ○ difficult to image
  ○ slightly steep learning curve if you are not used to
    tinkering with RAID/LVM
● When using RAID, you need to freeze
  the filesystem before taking a snapshot
  ○ xfs_freeze
● If not on EC2, just use file system snapshots
Logs
● Store logs on an EBS block
  ○ Logs can still be viewed in case your server goes
    down and cannot be restarted
● Rotate your logs - please!
  ○   db.runCommand("logRotate");
  ○   command line
       ■ kill -SIGUSR1 <mongod process id>
       ■ killall -SIGUSR1 mongod
  ○   logrotate
       ■ still requires a post-rotate kill command
Backups and Restore
● Snapshot EBS blocks
  ○ journaling (--journal) is your friend
  ○ need to use fsync + lock if journaling is disabled
● Restoring from snapshots is easy
  ○ create a new volume from the snapshot
  ○ mount volume to EC2 instance


Caveat: When using RAID, you would still need
to use fsync + lock even if journaling is
enabled.
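The RAID caveat comes down to ordering: lock, snapshot every member volume, then unlock, even if a snapshot call fails. A sketch of that ordering follows; the `lock`, `unlock` and `snapshot` callables are hypothetical placeholders for whatever the driver and the EC2 tooling provide.

```python
from contextlib import contextmanager


@contextmanager
def writes_locked(lock, unlock):
    """Hold the database write lock for the duration of the block."""
    lock()  # e.g. flush to disk and block writes (fsync + lock)
    try:
        yield
    finally:
        unlock()  # always release, even if a snapshot call raises


def snapshot_raid_volumes(volume_ids, lock, unlock, snapshot):
    """Snapshot every member volume of a RAID set while writes are blocked."""
    with writes_locked(lock, unlock):
        return [snapshot(volume_id) for volume_id in volume_ids]
```

All volumes are snapshotted inside a single lock window so that the members of the array describe the same point in time.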
Backups and Restore
● mongodump/mongorestore
  ○ can be run while the DB is still running
  ○ no need to lock the DB
  ○ can backup and restore individual collections or
    even partial collections
  ○ rebuilds indexes


● Automate your backups
  ○   https://github.com/micahwedemeyer/automongobackup
Backups and Restore
● Use --oplog with mongodump and
  --oplogReplay with mongorestore
● Backups and restores can be slow if your
  data is a few hundred GB.
  ○ Plan for it
● Use incremental backups
  ○ possible with mongodump/mongorestore
    ■ mongodump -q
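One way to read "incremental backups with mongodump -q" is to dump only documents newer than the last backup. The sketch below builds such a command line; the database, collection and the numeric `created_at` field are hypothetical, and `--query` is the long form of `-q`.

```python
import json


def incremental_dump_command(db, collection, field, since, out_dir):
    """Build a mongodump argv that exports only documents with field > since."""
    query = {field: {"$gt": since}}
    return [
        "mongodump",
        "--db", db,
        "--collection", collection,
        "--query", json.dumps(query),  # filter applied on the server side
        "--out", out_dir,
    ]
```

The returned list can be handed to `subprocess.check_call`. This assumes documents carry a monotonically increasing field and are not updated after insertion; updates to older documents would be missed.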
Understand your Data
● Know whether your application is write
  heavy or read heavy
● Separate write heavy collections from read
  heavy collections
● Minimize indexes on write heavy collections
● Separate operational data from data used for
  mining/analytics if possible
Querying
● db.<collection>.find({x:123})
  ○ returns entire documents matching the criteria
  ○ will be slow if you have large documents
● $exists, $nin & $ne are not very efficient
  ○ try setting default values for keys instead of using
    $exists
  ○ try using $gt and $lt instead of $ne if possible
    (numerics)
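The default-values advice can be sketched as a helper applied before every insert, so that later queries match on a value instead of on a missing key. The field names below are hypothetical.

```python
def with_defaults(doc, defaults):
    """Return a copy of doc with any missing keys filled in from defaults."""
    merged = dict(defaults)
    merged.update(doc)  # values already present in doc win over the defaults
    return merged


# Hypothetical metric fields for a stored link.
DEFAULTS = {"clicks": 0, "retweets": 0}

# Without defaults, "never clicked" has to be expressed with $exists.
query_with_exists = {"clicks": {"$exists": False}}

# With defaults written at insert time, it becomes a plain equality match.
query_with_default = {"clicks": 0}
```

Either query dict would be passed to `find()`; only the second depends on every document having been written through `with_defaults`.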
Querying
● Limit the data you return to only what you need
   ○ Use range queries
   ○ limit the number of results
   ○ limit the number of keys returned
● Increase or decrease the batch size for a
  cursor based on your needs
  ○ returning a batch is a network operation
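The points above combine into a single query: a range condition, a restricted set of keys, a result limit and an explicit batch size. In this sketch the `ts` and `title` fields and the numbers are assumptions, and `collection` is expected to behave like a pymongo collection.

```python
def recent_titles(collection, since, max_results=50, batch=25):
    """Fetch only the title key of documents with ts >= since."""
    cursor = collection.find(
        {"ts": {"$gte": since}},  # range query
        {"title": 1, "_id": 0},   # second positional argument: keys to return
    )
    # Cap the number of results, and fetch them in batches of `batch` per round trip.
    return cursor.limit(max_results).batch_size(batch)
```

A larger batch size means fewer network round trips at the cost of more memory per batch; a smaller one suits consumers that stop reading early.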
Indexes
● Use indexes judiciously
● Create indexes to match your query keys
● Understand what you get with a compound
  index
  ○ db.collection.ensureIndex({a:1,b:1})
    ■ gives you an index on a and a&b but not on b
    ■ ascending/descending may sometimes matter
       when using composite indexes
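The "index on a and a&b but not on b" rule is a prefix rule. The function below is a deliberately simplified model of it for equality queries, not the query planner's actual logic: it returns the leading index keys a query can make use of.

```python
def usable_prefix(index_keys, query_keys):
    """Return the leading keys of a compound index that a query on query_keys can use."""
    used = []
    for key in index_keys:
        if key not in query_keys:
            break  # the first missing key ends the usable prefix
        used.append(key)
    return used


index = ["a", "b"]  # corresponds to ensureIndex({a: 1, b: 1})
print(usable_prefix(index, {"a"}))       # ['a']
print(usable_prefix(index, {"a", "b"}))  # ['a', 'b']
print(usable_prefix(index, {"b"}))       # []
```

An empty result means the index does not help that query, which is the case for a query on b alone.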
Indexes (continued)
● One index per query rule:
  ○ Queries on multiple keys cannot use multiple
       indexes. Use a compound index instead
      ■ $or is an exception
●   Make sure that all your indexes fit in memory
    ○ db.<collection>.totalIndexSize()
●   Sometimes indexes may not be helpful
    ○ Low selectivity indexes
●   Use explain()
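A sketch of what to look for in explain() output. It assumes the explain format of the MongoDB 2.x line, where `cursor` is "BasicCursor" for a full collection scan and `nscanned` / `n` give entries scanned and documents returned; the ratio threshold is an arbitrary assumption, and later versions report plans in a different shape.

```python
def explain_warnings(plan, max_scan_ratio=10):
    """Return warning strings for a 2.x-style explain() document."""
    warnings = []

    # BasicCursor means no index was used at all.
    if plan.get("cursor") == "BasicCursor":
        warnings.append("full collection scan: no index used")

    # Scanning far more entries than are returned suggests a low selectivity index.
    returned = plan.get("n", 0)
    scanned = plan.get("nscanned", 0)
    if scanned > max_scan_ratio * max(returned, 1):
        warnings.append("scanned %d entries to return %d documents" % (scanned, returned))

    return warnings
```

With pymongo the input would come from calling `explain()` on a cursor.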
Other details
● Pay attention to the limitations of the
  MongoDB version and the driver you are
  using
  ○ e.g. $exists does not use indexes prior to 2.0
  ○ e.g. $and is not supported in 1.8
● Design for performance
  ○ iterate over schema design if it does not perform
  ○ sometimes it is better to normalize than store
    everything in one large document
  ○ archive historical data to a warehouse
     ■ mining/analytics
Administration tools
● We use RockMongo
● But there are many other tools available
   ○ http://www.mongodb.org/display/DOCS/Admin+UIs


Create read only users for developers if
needed.
Questions?
          Deep Kapadia
             @durple
    deep.kapadia@nytimes.com

         NYT R&D Labs
             @nytlabs
        http://nytlabs.com
