This document discusses how Datadog uses Apache Spark to process trillions of records daily. It describes the team's initial Spark setup on AWS EMR with large clusters, then covers common out-of-memory errors, how to measure memory usage, how to handle spot instances, and lessons learned about monitoring jobs and ensuring resilience.