Working with huge datasets - optimizing aggregates with date fields

Hello guys,

I am working with a very big collection: 38m entries in total.

Those entries include a date field which I have indexed in my Mongoose model.

Selecting the last 30 days of data returns 4.7m documents - the request takes 2 minutes to load.

I have tried to optimize my aggregate but can't get it any faster.

The aggregate only includes:

        {
            $match: {
                $and: [
                    { date: { $lte: endDate } },
                    { date: { $gte: startDate } }
                ]
            }
        },

The date field is indexed.

What can I do to improve the performance of queries on that collection?

Best regards,
Damian

Your query is equivalent to
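
        {
            $match: {
                date: { $gte: startDate, $lte: endDate }
            }
        },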

Maybe, just maybe, the fact that you are using the explicit $and rather than the implicit version above causes the optimizer to ignore the index.

Share your indices and the explain plan for further help.


Recreating the aggregate in Compass, the performance is way faster.

Running the aggregate in my Node backend always takes 2 minutes.


Then the bottleneck seems to be data transfer, which is understandable since you transfer 4.7m documents.

You have 2 choices from here:

  1. use $project to weed out fields you do not need in your backend (see the sketch after this list)
  2. leverage the power of aggregation by moving what you do in the backend into a more complex pipeline
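
For option 1, a sketch of what that could look like, assuming the backend only needs the date and a hypothetical amount field:

        db.collection.aggregate([
            { $match: { date: { $gte: startDate, $lte: endDate } } },
            // keep only what the backend actually uses; everything else stays on the server
            { $project: { _id: 0, date: 1, amount: 1 } }
        ])

For option 2, the idea is to push the grouping or summing itself into the pipeline (for example with $group), so only the aggregated result crosses the wire.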

You are right - my mistake.

I have changed the date index to descending to try and squeeze more speed out of the query, but it didn't do anything.
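
For reference, the index is now essentially this (collection name assumed):

        db.collection.createIndex({ date: -1 })

Here is the explain output for the date-range query with that index: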

        "executionStats": {
            "executionSuccess": true,
            "nReturned": 5376753,
            "executionTimeMillis": 13878,
            "totalKeysExamined": 5376753,
            "totalDocsExamined": 5376753,
            "executionStages": {
                "isCached": false,
                "stage": "FETCH",
                "nReturned": 5376753,
                "executionTimeMillisEstimate": 13428,
                "works": 5376754,
                "advanced": 5376753,
                "needTime": 0,
                "needYield": 0,
                "saveState": 741,
                "restoreState": 741,
                "isEOF": 1,
                "docsExamined": 5376753,
                "alreadyHasObj": 0,
                "inputStage": {
                    "stage": "IXSCAN",
                    "nReturned": 5376753,
                    "executionTimeMillisEstimate": 3647,
                    "works": 5376754,
                    "advanced": 5376753,
                    "needTime": 0,
                    "needYield": 0,
                    "saveState": 741,
                    "restoreState": 741,
                    "isEOF": 1,
                    "keyPattern": {
                        "date": -1
                    },
                    "indexName": "date_-1",
                    "isMultiKey": false,
                    "multiKeyPaths": {
                        "date": []
                    },
                    "isUnique": false,
                    "isSparse": false,
                    "isPartial": false,
                    "indexVersion": 2,
                    "direction": "forward",
                    "indexBounds": {
                        "date": [
                            "[new Date(1748303999999), new Date(1745837512346)]"
                        ]
                    },
                    "keysExamined": 5376753,
                    "seeks": 1,
                    "dupsTested": 0,
                    "dupsDropped": 0
                }
            },
            "allPlansExecution": []
        },

The FETCH stage reports about 13.4s in total, of which the IXSCAN itself accounts for about 3.6s.

Is there any chance to improve the speed of queries selecting data by date ranges?

If FETCH takes 13 seconds, it means your working set does not fit in RAM.

It means you are consuming too much data for your use-case.

For the $project approach suggested above to really pay off, you would need an index that makes your query a covered query.
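
A sketch of what that could look like, again assuming a hypothetical amount field is all the backend needs: the compound index contains every field the $project returns, and _id is excluded, so the plan can skip the FETCH stage entirely.

        db.collection.createIndex({ date: -1, amount: 1 })

        db.collection.aggregate([
            { $match: { date: { $gte: startDate, $lte: endDate } } },
            // _id must be excluded, otherwise the query is no longer covered by the index
            { $project: { _id: 0, date: 1, amount: 1 } }
        ])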

For date-based use-cases, especially if past data is not mutable, the Computed Pattern is often a good solution.
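
As a sketch of that pattern, assuming again a hypothetical amount field and a summary collection named daily_totals: a job runs once per finished day and pre-aggregates it, so the 30-day report reads roughly 30 small documents instead of millions ($dateTrunc requires MongoDB 5.0+).

        db.collection.aggregate([
            { $match: { date: { $gte: dayStart, $lt: dayEnd } } },
            // one rollup document per day
            { $group: {
                _id: { $dateTrunc: { date: "$date", unit: "day" } },
                total: { $sum: "$amount" },
                count: { $sum: 1 }
            } },
            // upsert the rollup into the summary collection
            { $merge: { into: "daily_totals", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
        ])

The 30-day report then becomes a simple find on daily_totals rather than a scan over millions of raw documents.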

I am removing this thread from my bookmarks due to the lack of follow-ups.