Mongo nyc nyt + mongodb

NY Times +
MongoDB
Lessons Learnt
- Deep Kapadia (NYT R&D)

Why MongoDB?
● Quick Prototyping
● Flexible schema
○ Easy to dump data from 3rd party data sources
■ bit.ly
■ Twitter
○ Schema may change depending on the need of the
day or what metrics we are interested in
● Quick scaling if needed

6 Months ago
● Started out with two weeks of data
● Single MongoDB instance
○ No replication
○ No backups
○ No monitoring
● Data was stored locally on ephemeral
storage on an EC2 instance
● Logs were stored locally

Technology stack
● Amazon EC2
● MongoDB 2.0.x (started with 1.8)
● Python 2.7.2
● pymongo 2.1.x
● Tornado 2.2 load balanced over Nginx
● Little bit of Ruby on Rails (going away soon)
● Custom WebGL based framework for
visualization

Monitoring
● Monitoring tools
○ db.serverStatus() + cron + email
○ 10gen's Mongo Monitoring Service
○ M/Monit - very basic
○ Nagios
● At the least monitor
○ The mongod process
○ Memory usage
○ CPU usage
○ Disk usage
● Desirable to monitor EVERYTHING
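
The "db.serverStatus() + cron + email" approach could look roughly like the sketch below. It only covers the checking step, on a dict shaped like serverStatus output. The function name and the threshold values are assumptions for illustration, not what the NYT setup used; cron and the mailer are left out.

```python
# Sketch of a cron-driven health check over serverStatus output.
# Threshold names and values are illustrative assumptions.

def check_server_status(status, max_resident_mb=6000, min_free_connections=100):
    """Return a list of warning strings for a serverStatus-style dict."""
    warnings = []

    # mem.resident is reported by mongod in megabytes.
    resident = status.get("mem", {}).get("resident", 0)
    if resident > max_resident_mb:
        warnings.append(
            "resident memory %d MB exceeds %d MB" % (resident, max_resident_mb)
        )

    # connections.available is the number of unused incoming connections.
    available = status.get("connections", {}).get("available")
    if available is not None and available < min_free_connections:
        warnings.append("only %d connections available" % available)

    return warnings


if __name__ == "__main__":
    # Hand-written sample; a real job would fetch the dict from the
    # serverStatus command and email any warnings.
    sample = {"mem": {"resident": 7000}, "connections": {"available": 779}}
    for line in check_server_status(sample):
        print(line)
```

An empty list means nothing crossed a threshold, so the cron job can stay silent in that case.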

Replication
● REALLY EASY to set up
● 4 node replica set
○ Primary
○ Secondary
○ Arbiter
○ Delayed secondary - Priority 0
■ Never gets elected as Primary

All instances are m1.large except the arbiter, which
is a t1.micro
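
A minimal sketch of the configuration document for such a 4-node set, built as a Python dict. The host names, the set name and the one-hour delay are hypothetical. The `slaveDelay` field name is the one used by MongoDB 2.0-era servers; treat it as an assumption if you run a different version.

```python
# Sketch: replica set config with a primary-eligible pair, an arbiter,
# and a delayed secondary that can never be elected (priority 0).
# Host names, set name and delay are hypothetical.

def build_replica_set_config(name, hosts, delay_seconds=3600):
    """hosts is a 4-item sequence: node 1, node 2, arbiter, delayed node."""
    first, second, arbiter, delayed = hosts
    return {
        "_id": name,
        "members": [
            {"_id": 0, "host": first},
            {"_id": 1, "host": second},
            # Arbiters vote in elections but hold no data.
            {"_id": 2, "host": arbiter, "arbiterOnly": True},
            # Delayed members must have priority 0.
            {"_id": 3, "host": delayed, "priority": 0,
             "slaveDelay": delay_seconds},
        ],
    }


if __name__ == "__main__":
    config = build_replica_set_config(
        "rs0",
        ["db1.example.com:27017", "db2.example.com:27017",
         "arb.example.com:27017", "delayed.example.com:27017"],
    )
    print(config)
```

The resulting document is what would be passed to the replSetInitiate command (rs.initiate() in the shell).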

Replication (continued)
● Be aware of how your drivers handle failover
○ pymongo raises pymongo.errors.AutoReconnect
● Decide up front how you want to handle
failover
○ Lose data
○ idempotent writes (keep trying until write is
successful...up to a certain number of times)
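
The "keep trying up to a certain number of times" option can be sketched as a small retry helper. The helper name, the attempt count and the back-off are assumptions. The exception class is passed in so the sketch runs without a database; with pymongo it would be pymongo.errors.AutoReconnect.

```python
import time

# Sketch: retry an idempotent write a bounded number of times.
# Only safe if repeating the write gives the same result, e.g. an
# upsert keyed on a fixed _id.

def write_with_retry(write, retry_on, attempts=5, delay=0.5):
    """Call write() until it succeeds or `attempts` (>= 1) are used up."""
    last_error = None
    for attempt in range(attempts):
        try:
            return write()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off a little longer each time while a new
                # primary is being elected.
                time.sleep(delay * (attempt + 1))
    raise last_error


# Usage with pymongo 2.x style calls (collection and key are hypothetical):
#   from pymongo.errors import AutoReconnect
#   write_with_retry(
#       lambda: coll.update({"_id": key}, {"$set": doc}, upsert=True),
#       AutoReconnect)
```

If the last attempt still fails, the original exception is re-raised, so the caller decides whether to drop the data or queue it elsewhere.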

Storage
● Do not use local storage on EC2
○ ephemeral - data is lost when the instance is stopped, terminated or fails
● If using EC2 use EBS
○ Persistent
○ Easy to snapshot
○ can be detached and attached to a different server
○ Can be RAID'ed for reliability and performance

● Note: EBS is known to have inconsistent
performance characteristics
● Limited by 1GB/s

Storage (continued)
● We started with RAID 10 on EBS
○ difficult to image
○ slightly steep learning curve if you are not used to
tinkering with RAID/LVM
● When snapshotting a RAID array, you need to freeze
the filesystem first
○ xfs_freeze
● If not on EC2, just use filesystem snapshots

Logs
● Store logs on an EBS volume
○ Logs can still be viewed in case your server goes
down and cannot be restarted
● Rotate your logs - please!
○ db.runCommand("logRotate");
○ command line
○ kill -SIGUSR1 <mongod process id>
○ killall -SIGUSR1 mongod
○ logrotate
■ still requires a post-rotate kill command

Backups and Restore
● Snapshot EBS blocks
○ journaling (mongod --journal) is your friend
○ need to use fsync + lock if journaling is disabled
● Restoring from snapshots is easy
○ create a new volume from the snapshot
○ mount volume to EC2 instance

Caveat: When using RAID, you would still need
to use fsync + lock even if journaling is
enabled.
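
One way to make sure the lock is always released after a snapshot is a context manager, sketched below. It assumes the fsync(lock=True) and unlock() methods that pymongo 2.x and 3.x connection objects expose; newer drivers dropped them in favor of running the fsync command directly, so check your driver version. The snapshot step itself is left to the caller.

```python
from contextlib import contextmanager

# Sketch: hold fsync + lock for the duration of a snapshot.
# Assumes a pymongo 2.x/3.x style connection with fsync() and unlock().

@contextmanager
def locked_for_snapshot(connection):
    """Flush writes to disk and block new ones until the block exits."""
    connection.fsync(lock=True)
    try:
        yield connection
    finally:
        # Always unlock, even if the snapshot step raised.
        connection.unlock()


# Usage (take_ebs_snapshots is a hypothetical helper):
#   with locked_for_snapshot(conn):
#       take_ebs_snapshots(volume_ids)
```

Keep the locked section short: writes queue up behind the lock for as long as it is held.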

Backups and Restore (continued)
● mongodump/mongorestore
○ can be run while the DB is still running
○ no need to lock the DB
○ can backup and restore individual collections or
even partial collections
○ rebuilds indexes

● Automate your backups
○ https://github.com/micahwedemeyer/automongobackup

Backups and Restore (continued)
● Use --oplog with mongodump and
--oplogReplay with mongorestore
● Backups and restore can be slow if your
data is a few 100 GB.
○ Plan for it
● Use incremental backups
○ possible with mongodump/mongorestore
■ mongodump -q

Understand your Data
● Know whether your application is write
heavy or read heavy
● Separate write heavy collections from read
heavy collections
● Minimize indexes on write heavy collections
● Separate operational data from data used for
mining/analytics if possible

Querying
● db.<collection>.find({x:123})
○ returns entire documents matching the criteria
○ will be slow if you have large documents
● $exists, $nin & $ne are not very efficient
○ try setting default values for keys instead of using
$exists
○ try using $gt and $lt instead of $ne if possible
(numerics)

Querying (continued)
● Limit the data you return to only what you need
○ Use range queries
○ limit the number of results
○ limit the number of keys returned
● Increase or decrease the batch size for a
cursor based on your needs
○ returning a batch is a network operation
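
The points above can be combined in a single pymongo query, sketched below: a range query, a restricted set of returned keys, a result limit and an explicit batch size. The field names (`ts`, `url`) and the default numbers are hypothetical.

```python
# Sketch: return only what is needed from a pymongo collection.
# Field names and default sizes are illustrative assumptions.

def recent_clicks(collection, start, end, max_results=100, batch_size=50):
    """Return a cursor over a bounded, trimmed-down slice of the collection."""
    cursor = collection.find(
        {"ts": {"$gte": start, "$lt": end}},  # range query
        {"url": 1, "ts": 1},                   # only these keys (plus _id)
    )
    # limit() caps the total number of documents; batch_size() controls
    # how many come back per network round trip.
    return cursor.limit(max_results).batch_size(batch_size)


# Usage (hypothetical collection):
#   for doc in recent_clicks(db.clicks, 1338508800, 1338595200):
#       print(doc["url"])
```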

Indexes
● Use indexes judiciously
● Create indexes to match your query keys
● Understand what you get with a compound
index
○ db.collection.ensureIndex({a:1,b:1})
■ gives you an index on a and on a & b, but not on b alone
■ ascending/descending may sometimes matter
when using composite indexes
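
The prefix rule for compound indexes can be illustrated with a toy model in plain Python. This is a simplification for intuition only: it looks just at whether the query includes the index's leading key and ignores sort order and everything else the real query optimizer considers. Both function names are made up for this sketch.

```python
# Toy model of the compound index prefix rule (not MongoDB's optimizer).

def index_prefixes(index_keys):
    """List the key combinations a compound index covers as prefixes."""
    names = [name for name, _direction in index_keys]
    return [tuple(names[:i]) for i in range(1, len(names) + 1)]


def index_can_serve(index_keys, query_fields):
    """True if the query includes the leading key of the index."""
    leading_key = index_keys[0][0]
    return leading_key in set(query_fields)


if __name__ == "__main__":
    index = [("a", 1), ("b", 1)]
    print(index_prefixes(index))               # [('a',), ('a', 'b')]
    print(index_can_serve(index, ["a"]))       # True
    print(index_can_serve(index, ["a", "b"]))  # True
    print(index_can_serve(index, ["b"]))       # False
```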

Indexes (continued)
● One index per query rule:
○ Queries on multiple keys cannot use multiple
indexes. Use a compound index instead
■ $or is an exception
● Make sure that all your indexes fit in memory
○ db.<collection>.totalIndexSize()
● Sometimes indexes may not be helpful
○ Low selectivity indexes
● Use explain()

Other details
● Pay attention to the limitations of the
MongoDB version and the driver you are
using
○ e.g. $exists does not use indexes prior to 2.0
○ e.g. $and is not supported in 1.8
● Design for performance
○ iterate over schema design if it does not perform
○ sometimes it is better to normalize than to store
everything in one large document
○ archive historical data to a warehouse
■ mining/analytics

Administration tools
● We use Rockmongo
● But there are many other tools available
○ http://www.mongodb.org/display/DOCS/Admin+UIs

Create read only users for developers if
needed.

Questions?
Deep Kapadia
@durple
deep.kapadia@nytimes.com

NYT R&D Labs
@nytlabs
http://nytlabs.com
