Joanna Cheng
Becoming an Ops Manager Backup Superhero!
MongoDB joannac-
About Me
Joanna Cheng
Team Lead, Technical Services, MongoDB
Who has used Ops Manager before?
• On-prem software – comes with Enterprise Advanced
• Monitoring MongoDB deployments
• Backup
• Management of MongoDB
• Upgrades
• Index builds
• Cluster administration
Who has used Ops Manager backup before?
Uh oh!
My backups are stuck
My backups are too slow
I don’t know what’s happening
What’s wrong?
MongoDB World 2019: Becoming an Ops Manager Backup Superhero!
Technical Services can save the day!
Agenda
• How Ops Manager Backup works
• How to diagnose some common errors
• What’s changed for backup in Ops Manager 4.2
How OM Backup works
1. Initial sync
2. Oplog apply
3. Snapshot
How OM Backup works – initial sync
[Diagram: initial sync. Labels: Source, Backup agent, HTTP service, Ops Manager, Sync store, Oplog store, Backup daemon, HEAD DB; data moves as sync slices and oplog slices.]
How OM Backup works – oplog apply
[Diagram: oplog apply. Labels: Source, Backup agent, HTTP service, Oplog store, Backup daemon, HEAD DB; data moves as oplog slices.]
How OM Backup works – snapshot
[Diagram: snapshot. Labels: Oplog store, Oplog slices, Backup daemon, HEAD DB, HEAD DB files, "Break up files into blocks", MongoDB Blockstore, Filesystem store, S3.]
Insufficient oplog size
How OM Backup works – initial sync
[Diagram: initial sync flow (Source, Backup agent, HTTP service, Ops Manager, Sync store, Oplog store, Backup daemon, HEAD DB; sync slices and oplog slices).]
Insufficient oplog size – how to diagnose
☐ Look at metrics for Oplog Window
☐ Wait for the oplog window to increase naturally
☐ Increase size of oplog
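The oplog window in the checklist above is the time span between the oldest and newest entries still held in the source's oplog. A minimal sketch of that arithmetic, assuming the two timestamps have already been read from the source as Unix seconds. The function names and the `required_hours` threshold are illustrative, not Ops Manager settings.

```python
def oplog_window_hours(oldest_ts: int, newest_ts: int) -> float:
    """Return the oplog window in hours, given the oldest and newest
    oplog entry timestamps as Unix seconds."""
    if newest_ts < oldest_ts:
        raise ValueError("newest_ts must not be earlier than oldest_ts")
    return (newest_ts - oldest_ts) / 3600.0


def window_is_sufficient(oldest_ts: int, newest_ts: int, required_hours: float) -> bool:
    """Compare the window against a caller-chosen threshold.

    required_hours is an assumption for illustration: pick a value that
    comfortably covers the work the backup has to complete."""
    return oplog_window_hours(oldest_ts, newest_ts) >= required_hours
```

If the check fails, the two remedies on the slide apply: wait for the window to grow, or increase the oplog size.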
Starting… (forever)
Initial sync process
[Diagram: initial sync flow (Source, Backup agent, HTTP service, Sync store, Oplog store, Backup daemon, HEAD DB; sync slices and oplog slices).]
Starting… (forever) – backup agent down
Starting… (forever) – alerts
Starting… (forever) – backup agent logs
Starting… (forever) – connectivity issue to source
[2019/05/05 21:23:35.245] [agent.error] [components/agent.go:manageReplicaSets:202] Error starting syncs. RsId: myReplicaSetError getting a valid session. Err: ip-172-31-9-252.ap-southeast-2.compute.internal:27020 error dialing: Failed to make a direct connection. Address: ip-172-31-9-252.ap-southeast-2.compute.internal:27020 Err: no reachable servers
…
error dialing primary: Error dialing to replica set. RsId: myReplicaSet Hosts: [ip-172-31-9-252.ap-southeast-2.compute.internal:27020] Err: no reachable servers
Starting… (forever) – large number of collections
Sending collection information. Sync ID: ObjectIdHex("5cd4d1bb0bb1317ecc9d0a17") Namespace: admin.system.version
Sending collection information. Sync ID: ObjectIdHex("5cd4d1bb0bb1317ecc9d0a17") Namespace: foo.bar
Sending numStreamingNamespaces information. Sync ID: ObjectIdHex("5cd4d1bb0bb1317ecc9d0a17") number: 2
Finished namespace collectionInfo.
Starting… (forever) – how to diagnose
☐ Check backup agent is functional
☐ Connectivity issues to source – check backup agent logs
☐ Still gathering collection info – do you have a lot of collections?
Transferring… (forever)
Initial sync process
[Diagram: initial sync flow (Source, Backup agent, HTTP service, Sync store, Oplog store, Backup daemon, HEAD DB; sync slices and oplog slices).]
Transferring… (forever) – connectivity issue to source
[2019/05/19 04:13:15.191] [agent.sync.myReplicaSet.5ce0d7a80bb131032173e503.error] [components/sync.go:Run:246] Error sending sync slices. Error syncing a collection. Namespace: `database.collection` Err: operation was interrupted
Initial sync process
[Diagram: initial sync flow (Source, Backup agent, HTTP service, Sync store, Oplog store, Backup daemon, HEAD DB; sync slices and oplog slices).]
Transferring… (forever) – slow transfer to Sync Store
[2019/05/13 12:18:08.993] Pushing sync slice. Sync ID: ObjectIdHex("5cd958960bb131032172269e") Namespace: test.foo Slice #356
[2019/05/13 12:18:27.993] Finished pushing sync slice. Sync ID: ObjectIdHex("5cd958960bb131032172269e") Namespace: test.foo Slice #356 Request Time 18589ms
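The slow transfer shows up in the `Request Time` field of the agent log. As a rough sketch, the helper below scans agent log lines for that field and returns the pushes that took longer than a cut-off. The name `slow_pushes` and the 5000 ms default are assumptions for illustration, not agent settings.

```python
import re

# Matches the "Request Time <n>ms" suffix seen in the agent log excerpts.
REQUEST_TIME = re.compile(r"Request Time (\d+)ms")


def slow_pushes(lines, threshold_ms=5000):
    """Return the request times (in ms) that exceed threshold_ms.

    threshold_ms is an illustrative cut-off chosen by the caller."""
    times = []
    for line in lines:
        match = REQUEST_TIME.search(line)
        if match and int(match.group(1)) > threshold_ms:
            times.append(int(match.group(1)))
    return times


sample = [
    'Finished pushing sync slice. Sync ID: ObjectIdHex("5cd958960bb131032172269e") '
    "Namespace: test.foo Slice #356 Request Time 18589ms",
    "Successfully finished pushing oplog slice. {ts: 1557740046:1} -> "
    "{ts: 1557740106:1} Num slices: 1 Num docs: 5. Request Time 11ms",
]
slow = slow_pushes(sample)
```

Run against the two excerpts, only the 18589 ms sync slice push is reported.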
Transferring… (forever) – issues writing to Sync Store
[2019/05/02 12:37:21.357] [agent.sync.production.5cd4d1bb0bb1317ecc9d0a17.info] [components/mothership.go:PushSyncSlice:320] Total Slice #1003 - server syncStore is full. 11th attempt. Will resend this slice again soon.
[Diagram: Sync Store, Backup agent, Backup daemon.]
Initial sync process
[Diagram: initial sync flow (Source, Backup agent, HTTP service, Sync store, Oplog store, Backup daemon, HEAD DB; sync slices and oplog slices).]
Starting… (forever) – backup daemon logs
On Linux: /opt/mongodb/mms/logs/*
On Windows: C:\MMSData\Server\Log
Transferring… (forever) – no suitable daemon
2018-02-10T20:22:06.780-0600 [class com.xgen.svc.brs.svc.AssignmentThread => Space used: 2,252,138,352,640 bytes, Space free: 324,684,738,560 bytes
2018-02-10T20:22:06.782-0600 [class com.xgen.svc.brs.svc.AssignmentThread => backup.assignment.5ccbd7d20bb1317ecc8256ee.rs.production] ERROR backup.assignment.5ccbd7d20bb1317ecc8256ee.rs.production [assignmentFailed:527] Assignment failed: No Daemon found with suitable conditions. Filesize: 402.7 GB Oplog Churn: 0.0 GB/hr RequireSSD: false, rsId=production, groupId=5ccbd7d20bb1317ecc8256ee
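The two log lines can be read together: the daemon reports about 325 GB free, while the job needs 402.7 GB. A minimal sketch of that comparison follows. It assumes decimal gigabytes and that free space is the deciding condition here; the daemon's real assignment rules are not shown in the log and may consider more than this. The helper names are hypothetical.

```python
def parse_bytes(text: str) -> int:
    """Parse a comma-grouped byte count such as '324,684,738,560'."""
    return int(text.replace(",", ""))


def fits_on_daemon(space_free_bytes: int, filesize_gb: float) -> bool:
    """Rough check: is the free space at least the reported file size?

    Assumes 1 GB = 10**9 bytes, which is a guess about the log's units."""
    return space_free_bytes >= filesize_gb * 10**9


free = parse_bytes("324,684,738,560")  # "Space free" from the first log line
fits = fits_on_daemon(free, 402.7)     # "Filesize" from the second log line
```

Here `fits` is false under either decimal or binary gigabytes, which is consistent with the "No Daemon found with suitable conditions" error.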
Transferring… (forever) – missing binaries
Failed to decompress mongodb-linux-x86_64-rhel70-4.0.9.tgz to /opt/mongodb/mms/mongodb-releases java.io.IOException: No space left on device
Could not find appropriate mongod in /opt/mongodb/mongodb-releases/, versions available to MMS: 3.4.3, 3.4.4, 3.4.2, 3.4.9. Expecting version 3.6.12 or greater, module preference: enterprisePreferred
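The second message lists the versions the daemon has and the minimum it expects. The sketch below reproduces that "or greater" check with plain tuple comparison. It is an illustration of the message, not the daemon's actual matching logic, and the function names are made up for this example.

```python
def parse_version(text: str) -> tuple:
    """Turn '3.6.12' into (3, 6, 12) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def has_suitable_version(available, minimum: str) -> bool:
    """True if any available version is at least the minimum."""
    floor = parse_version(minimum)
    return any(parse_version(version) >= floor for version in available)


# Versions and minimum copied from the log excerpt.
available = ["3.4.3", "3.4.4", "3.4.2", "3.4.9"]
ok = has_suitable_version(available, "3.6.12")
```

With only 3.4.x binaries on disk the check fails, so the fix is to make a suitable binary available to the daemon, for example by freeing the disk space the first message complains about.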
Transferring… (forever) – other issues on the daemon
• Missing prerequisites for MongoDB Enterprise
• Problems with the HEAD database
  • Corruption
  • Many collections (which slows mongod startup)
• Resource contention
Transferring… (forever) – how to diagnose
☐ Check backup agent logs
  ☐ Problems connecting to source?
  ☐ Problems sending to sync store?
☐ Is the job assigned to a daemon?
☐ Check backup daemon logs
  ☐ Is the daemon working on the job?
Oplog behind
How OM Backup works – oplog apply
[Diagram: oplog apply. Labels: Source, Backup agent, HTTP service, Oplog store, Backup daemon, HEAD DB; data moves as oplog slices.]
Oplog behind – agent down
Oplog behind – alerts
Oplog behind – backup agent too slow
[2019/05/13 12:20:12.513] [agent.oplog.myReplicaSet.debug] [components/agent.go:func1:359] Successfully finished pushing oplog slice. {ts: 1557740046:1} -> {ts: 1557740106:1} Num slices: 1 Num docs: 5. Request Time 11ms
[2019/05/13 13:12:39.277] [agent.oplog.myReplicaSet.debug] [components/agent.go:func1:359] Successfully finished pushing oplog slice. {ts: 1557752912:192} -> {ts: 1557752912:3292} Num slices: 1 Num docs: 3100. Request Time 1004ms
Oplog behind – failing to push slice
[2019/05/02 12:37:21.208] [agent.oplog.production.error] [components/mothership.go:doChunkedPushRequest:1070] Failed doing a chunked request.
[Diagram: oplog apply flow (Source, Backup agent, HTTP service, Oplog store, Backup daemon, HEAD DB; oplog slices).]
Oplog behind – how to diagnose
☐ Check agents page
☐ Check backup agent logs
☐ Check network to oplog store
☐ Check health of oplog store
Snapshot behind
How OM Backup works – snapshot
[Diagram: snapshot. Labels: Oplog store, Oplog slices, Backup daemon, HEAD DB, HEAD DB files, "Break up files into blocks", MongoDB Blockstore, Filesystem store, S3.]
Snapshot behind – daemon can’t keep up
2019-05-18T11:38:29.170-0500 [Daemon #1: class com.xgen.svc.brs.job.ApplyOpsJob] DEBUG backup.jobs.5cd958960bb131032172269e.production [OplogSliceCompiler.java.work:270] - OplogSlice: Range: 1557217382:2778 -> 1557217385:6790; NumDocs: 22930
2019-05-18T11:38:38.817-0500 [Daemon #1: class com.xgen.svc.brs.job.ApplyOpsJob] DEBUG backup.jobs.5cd958960bb131032172269e.production [OplogSliceCompiler.java.work:421] - Oplogs to apply: 22930; Skipped before: 0; Skipped from overlap: 0, Skipped after: 0
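One way to read these two daemon log lines is to compare the wall-clock time between them with the amount of oplog time the slice covers. The sketch below does that arithmetic. It assumes the gap between the two messages reflects work on that one slice, which the log does not state outright, and the function names are illustrative.

```python
from datetime import datetime

# Timestamp layout used by the daemon log lines in the excerpt.
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def seconds_between(start: str, end: str) -> float:
    """Wall-clock seconds between two daemon log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


def catch_up_ratio(wall_seconds: float, range_start_ts: int, range_end_ts: int) -> float:
    """Wall-clock seconds spent per second of oplog covered by the slice.

    A value above 1 suggests the daemon is processing oplog more slowly
    than the source generated it."""
    covered = range_end_ts - range_start_ts
    if covered <= 0:
        raise ValueError("slice must cover a positive time range")
    return wall_seconds / covered


elapsed = seconds_between("2019-05-18T11:38:29.170-0500",
                          "2019-05-18T11:38:38.817-0500")
ratio = catch_up_ratio(elapsed, 1557217382, 1557217385)
```

For this excerpt the slice spans 3 seconds of oplog and the two messages are about 9.6 seconds apart, so the ratio is a little over 3.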
Snapshot behind – snapshot is taking too long to complete
2019-05-18T13:00:54.809-0500 [Daemon #1: class com.xgen.svc.brs.job.SnapshotJob] INFO backup.jobs.5cd958960bb131032172269e.production [SnapshotJob.java.initiateSnapshot:123] - Starting a snapshot job.
2019-05-18T15:01:07.072-0500 [Daemon #1: class com.xgen.svc.brs.job.SnapshotJob => 5cd958960bb131032172269e/production ] DEBUG com.xgen.svc.brs.grid.Daemon [Daemon.java.iterate:138] - Job: class com.xgen.svc.brs.job.SnapshotJob finished. JobResult: OK.
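The snapshot duration is simply the gap between the "Starting a snapshot job" and "finished" lines. A small sketch of that subtraction, using the two timestamps from the excerpt. The helper names are hypothetical, and the limit passed to `took_longer_than` is whatever the reader considers too long for their own snapshot schedule.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the daemon log lines in the excerpt.
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def snapshot_duration(started: str, finished: str) -> timedelta:
    """Time between the snapshot start and finish log lines."""
    return (datetime.strptime(finished, LOG_TIME_FORMAT)
            - datetime.strptime(started, LOG_TIME_FORMAT))


def took_longer_than(duration: timedelta, limit_hours: float) -> bool:
    """Compare a snapshot duration with a caller-chosen limit in hours."""
    return duration > timedelta(hours=limit_hours)


duration = snapshot_duration("2019-05-18T13:00:54.809-0500",
                             "2019-05-18T15:01:07.072-0500")
```

For the excerpt the snapshot took just over two hours.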
Snapshot behind – snapshot is taking too long to complete
• Change in data
  • Many updates - less deduplication (if applicable)
  • Many inserts - more data to save
• Network issue
• Storage speed issue
Snapshot behind – how to diagnose
☐ Did the daemon start the snapshot?
  ☐ Check the logs to see why it’s falling behind
☐ Is the snapshot taking a long time?
  ☐ Storage slowness
  ☐ Network slowness
Needs resync
Oplog apply process
[Diagram: oplog apply flow (Source, Backup agent, HTTP service, Oplog store, Backup daemon, HEAD DB; oplog slices).]
Needs resync – lost oplog tail
[2019/05/10 06:39:51.135] [agent.oplog.myReplicaSet.warn] [components/oplog.go:TailOplog:253] Bad match. Expected: {ts: 1557469781:1 h: -786638763670375692, t: 1} Received: {ts: 1557470350:1 h: -8704482524044120259, t: -1}
…
[2019/05/10 06:40:51.152] [agent.commonPoints.myReplicaSet.warn] [components/rollback.go:Run:127] Failed to find a common point.
Needs resync – error in applyOps
In the daemon logs:
Error applying ops. Requesting resync.
In the HEAD DB logs:
2019-05-21T03:13:53.140+0000 [conn3] replication update of non-mod failed: { ts: Timestamp 1558162386103|12, h: 361936184013300100, v: 2, op: "u", ns: "database.collection", o2: {actual update here}
Needs resync – HEAD DB logs
In the daemon logs:
2019-05-24T00:01:17.522+0000 [Daemon #3: class com.xgen.svc.brs.job.ApplyOpsJob => 5ccbd7d20bb1317ecc8256ee/myReplicaSet] DEBUG backup.jobs.5ccbd7d20bb1317ecc8256ee.myReplicaSet [ReplicaSetJob.java.startMongo:127] - MongodManager - Requested Version: 4.0.9, Matching Version: 4.0.9, Matching Path: /opt/mongodb/mms/mongodb-releases/mongodb-linux-x86_64-amazon-4.0.9/bin/mongod, HEAD Path: /backup/5ccbd7d20bb1317ecc8256ee/myReplicaSet/head/
Needs resync – how to diagnose
☐ Resync the backup to get it running again
☐ Backup agent initiated
  ☐ Why did it fail? Slowness? Rollback?
☐ Backup daemon initiated
  ☐ What operation did it fail on?
Summary
Investigation – where in the process?
☐ Where in the process am I stuck?
  ☐ Initial sync
  ☐ Oplog apply
  ☐ Snapshot
Investigation – which component?
☐ Which components are involved?
  ☐ Source replica set
  ☐ Backup agent
  ☐ Sync store / oplog store
  ☐ Backup daemon
  ☐ HEAD DB
  ☐ Snapshot storage
Investigation – what to look at?
☐ Backup agent logs
☐ Backup daemon logs
☐ HEAD DB logs
☐ Monitoring metrics
☐ Storage
☐ Network
☐ General utilisation
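The log excerpts in this deck can be collected into a simple lookup for a first pass over agent and daemon logs. The sketch below is only an illustration: the message fragments are copied from the excerpts shown earlier, the labels paraphrase the slide titles they appeared under, and the list is not an exhaustive catalogue of Ops Manager errors.

```python
# (message fragment, label) pairs taken from the log excerpts in this deck.
SIGNATURES = [
    ("no reachable servers", "Starting: connectivity issue to source"),
    ("operation was interrupted", "Transferring: connectivity issue to source"),
    ("server syncStore is full", "Transferring: issues writing to sync store"),
    ("No Daemon found with suitable conditions", "Transferring: no suitable daemon"),
    ("Could not find appropriate mongod", "Transferring: missing binaries"),
    ("Failed doing a chunked request", "Oplog behind: failing to push slice"),
    ("Failed to find a common point", "Needs resync: lost oplog tail"),
    ("Error applying ops", "Needs resync: error in applyOps"),
]


def classify(line: str):
    """Return the label for the first known fragment found in a log line,
    or None when nothing matches."""
    for fragment, label in SIGNATURES:
        if fragment in line:
            return label
    return None


def triage(lines):
    """Return the distinct labels found, in order of first appearance."""
    seen = []
    for line in lines:
        label = classify(line)
        if label is not None and label not in seen:
            seen.append(label)
    return seen
```

A match only points at where to look next; the checklists above still apply for confirming the cause.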
Investigation – I’m still stuck!
• Open a support ticket and we’ll help you out!
• Make sure you include the relevant data
  • Logs
    • Backup daemon
    • Backup agent
  • Screenshots
  • The diagnostic archive
Investigation – the diagnostic archive
What’s new in Ops Manager 4.2
How OM Backup works – currently
[Diagram: current backup flow. Labels: Source, Backup agent, HTTP service, Ops Manager, Sync store, Oplog store, Backup daemon, HEAD DB, Snapshot storage.]
How OM Backup works – 4.2 and above
[Diagram: backup flow in 4.2 and above. Labels: Source (v4.2+), Backup agent, HTTP service, Snapshot storage.]
New backup process in OM 4.2
• No more HEAD DBs
• Backup directly from source (via WiredTiger snapshots) to snapshot storage
• For more information, attend Ben Cefalo’s talk at 1:45pm today
Next Steps
• Put this into practice!
• (don’t break your backups)
• Other talks
• Today, 1:45pm - Modern Data Backup and Recovery from On-Premises to the Public Cloud, Ben Cefalo, MongoDB
• Tomorrow – attend Builder’s Fest (Atlas, Charts, Stitch, Games, and more!)
• Come chat to me in the Leaf Lounge
• Reach out to me on LinkedIn: joannac-
Joanna Cheng - Team Lead, Technical Services
Any feedback would be greatly appreciated!
Thank You!