Using Apache Spark for processing
trillions of records each day at
Datadog
Vadim Semenov
Data Engineer @ Datadog
vadim@datadoghq.com
Initial setup
AWS EMR
100-200x r3.8xlarge: 32 cores, 244 GiB RAM, 640 GB SSD, 10 GbE
3200-6400 cores
23.5-47 TiB memory (only 240.23 GiB of each node's 244 GiB is available, because of Xen)
spot instances
spark 1.6 in yarn-cluster mode
scala + RDD API
Some initial settings
yarn.nodemanager.resource.memory-mb 240g
yarn.scheduler.maximum-allocation-mb 240g
spark.driver.memory 8g
spark.yarn.driver.memoryOverhead 3g
spark.executor.memory 201g
spark.yarn.executor.memoryOverhead 28g
spark.driver.cores 4
spark.executor.cores 32
spark.executor.extraJavaOptions -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
-XX:ErrorFile=/tmp/hs_err_pid%p.log
(201g executor heap + 28g overhead = 229g, which fits within the 240g YARN container limit)
Trillion
How big is a trillion?
2^40 = 1,099,511,627,776
2^31 = 2,147,483,648 = Int.MaxValue
a trillion 4-byte Ints = 2^42 bytes = 4 TiB ≈ 4.4 TB
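The arithmetic, spelled out (a quick sketch, assuming 4-byte primitive Ints):

val count = 1L << 40                   // "a trillion" ≈ 2^40 = 1,099,511,627,776
val bytes = count * 4                  // 4 bytes per primitive Int = 2^42 bytes
val tib = bytes.toDouble / (1L << 40)  // 4.0 TiB
val tb  = bytes / 1e12                 // ≈ 4.4 TB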
OOMs
- java.lang.OutOfMemoryError: Java heap space (Increase heap size)
- java.lang.OutOfMemoryError: GC Overhead limit exceeded (Too much
garbage)
- java.lang.OutOfMemoryError: Direct buffer memory (NIO)
- YARN Container is running beyond physical memory limits. Killing
container. (Increase memory overhead)
- There is insufficient memory for the Java Runtime Environment to
continue (Add more memory, reduce memory consumption)
The driver must survive
spark.driver.memory 8g → 83g
spark.yarn.driver.memoryOverhead 3g → 32g
spark.driver.cores 4 → 15
spark.executor.memory 201g → 166g
spark.yarn.executor.memoryOverhead 28g → 64g
spark.executor.cores 32 → 30
Measure memory usage
https://github.com/etsy/statsd-jvm-profiler
spark.files = /tmp/statsd-jvm-profiler.jar
spark.executor.extraJavaOptions += -javaagent:statsd-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler
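Wired into a job submission this might look like the following (a sketch; the jar location, job class, and statsd endpoint are assumptions):

# com.example.MyJob / my-job.jar are placeholders for your own job;
# assumes a statsd listener (e.g. the Datadog agent) on localhost:8125
spark-submit \
  --conf spark.files=/tmp/statsd-jvm-profiler.jar \
  --conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler" \
  --class com.example.MyJob my-job.jar

The agent then reports JVM heap and non-heap usage from each executor to the local statsd.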
Measure memory usage
Measure GC
Off-heap OOMs
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
…
at parquet.hadoop.codec.…
Off-heap memory
Direct Allocated Buffers (NIO): Parquet, MessagePack, …
Java Native Interface (JNI): dynamically-linked native libraries like libhadoop.so, GZIP, ZLIB, LZ4
sun.misc.Unsafe: org.apache.hadoop.io.nativeio, org.apache.spark.unsafe
Process memory
$ cat /proc/<spark driver/executor pid>/status
VmPeak: 190317312 kB
VmSize: 190268160 kB
VmHWM: 187586408 kB
VmRSS: 187586408 kB
VmData: 190044492 kB
Process memory
Solution: have the Java agent read its own process's memory usage directly from procfs
https://github.com/DataDog/spark-jvm-profiler
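A minimal sketch of the idea (Linux only; the helper name is hypothetical):

import scala.io.Source

// Read the resident set size of the current JVM straight from procfs.
// Unlike JMX heap beans, this captures everything the OS has actually
// given the process: heap, direct buffers, JNI allocations, and so on.
def vmRssKb(): Option[Long] =
  Source.fromFile("/proc/self/status").getLines()
    .find(_.startsWith("VmRSS:"))
    .map(_.split("\\s+")(1).toLong)  // "VmRSS:  187586408 kB" -> 187586408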
Measure memory usage
Measure each executor
Lessons
- Give more resources than you think you
would need, and then reduce
- Measure memory usage of each executor
- Keep an eye on your GC metrics
Measure slow parts
val timer = new MaxAndTotalTimeAccumulator // custom accumulator, sketched below
rdd.map(key => {
  val startTime = System.nanoTime()
  val result = … // the actual work for this key
  val endTime = System.nanoTime()
  val millisecondsPassed = ((endTime - startTime) / 1000000).toInt
  timer.add(millisecondsPassed)
  result
})
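MaxAndTotalTimeAccumulator is not a built-in Spark class; a minimal sketch of one on Spark 2's AccumulatorV2 API (the name and exact semantics here are assumptions) could look like:

import org.apache.spark.util.AccumulatorV2

// Hypothetical implementation: tracks the max and the sum of all
// durations reported by tasks.
class MaxAndTotalTimeAccumulator extends AccumulatorV2[Int, (Int, Long)] {
  private var max = 0
  private var total = 0L
  override def isZero: Boolean = max == 0 && total == 0L
  override def copy(): MaxAndTotalTimeAccumulator = {
    val acc = new MaxAndTotalTimeAccumulator
    acc.max = max
    acc.total = total
    acc
  }
  override def reset(): Unit = { max = 0; total = 0L }
  override def add(v: Int): Unit = { max = math.max(max, v); total += v }
  override def merge(other: AccumulatorV2[Int, (Int, Long)]): Unit = other match {
    case o: MaxAndTotalTimeAccumulator =>
      max = math.max(max, o.max)
      total += o.total
    case _ => throw new IllegalStateException("unexpected accumulator type")
  }
  override def value: (Int, Long) = (max, total)
}

// Register it so per-task values get merged back on the driver:
// sc.register(timer, "map-time")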
Watch skewed parts
.groupByKey().flatMap({ case (key, iter) =>
  val size = iter.size
  maxAccumulator.add(key, size)
  if (size >= 100000000) { // 100M values: drop the hot key instead of dying on it
    log.info(s"Key $key has $size values")
    None
  } else {
    Some((key, iter))
  }
})
Report accumulators per partition
sc.addSparkListener(new SparkListener {
override def onTaskEnd(
taskEnd: SparkListenerTaskEnd
): Unit =
Option(taskEnd.taskMetrics)
.foreach(taskMetrics => … )
})
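A sketch of what the elided listener body might report (the statsd client and metric names are assumptions; the TaskMetrics fields are real Spark 2 API):

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    Option(taskEnd.taskMetrics).foreach { m =>
      // statsd here is a stand-in for your metrics client
      statsd.histogram("task.run_time_ms", m.executorRunTime)
      statsd.histogram("task.gc_time_ms", m.jvmGCTime)
      statsd.histogram("task.shuffle_read_bytes", m.shuffleReadMetrics.totalBytesRead)
    }
})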
Collect executor metrics
Lessons
- Measure slowest parts of your job
- Count records in the most skewed parts
- Keep track of how much CPU time your job
actually consumes
- Have alerting on these metrics, so you know when your job gets slower
Spot instances
Spot instances mitigation
- Break the job into smaller survivable pieces
- Use `rdd.checkpoint` instead of `rdd.persist` to save data to HDFS (see the sketch after this list)
- Helps dynamic allocation: since executors don't hold any data, they can leave the job and join other jobs
- Losing multiple executors won't result in recomputing partitions
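A minimal sketch of the checkpoint approach (the paths and RDD names are assumptions):

// Checkpointed data survives executor loss because it lives in HDFS,
// not in executor memory or on executor-local disks.
// rawRdd / expensiveTransform are placeholders for your own pipeline.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val intermediate = rawRdd.map(expensiveTransform)
intermediate.checkpoint() // marks the RDD; the next action writes it to HDFS
intermediate.count()      // force evaluation so the checkpoint actually happens

Downstream stages then read from HDFS, and the lineage above the checkpoint is truncated.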
ExternalShuffleService
[animated diagram: executors Ex1-Ex3 plus the driver; Ex1 serves shuffle blocks 1-4 through its external shuffle service. When Ex1's node is lost, fetches of its blocks fail and the blocks have to be regenerated on Ex2/Ex3, one failure at a time.]
ERROR o.a.s.n.shuffle.RetryingBlockFetcher:
Exception while beginning fetch of 13 outstanding blocks
java.io.IOException: Failed to connect to
ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
ExternalShuffleService
SPARK-19753 Remove all shuffle files on a host in case
of slave lost or fetch failure
SPARK-20832 Standalone master should explicitly inform
drivers of worker deaths and invalidate external shuffle
service outputs
Other FetchFailures
SPARK-20178 Improve Scheduler fetch failures
Keep an eye on failed tasks
Lessons
- Keep all logs
- Spark isn't super-resilient even when one
node dies
- Monitor the number of failed
tasks/stages/lost nodes
Late arriving partitions
rddA.cogroup(rddB, rddC).map({ case (k, (iterA, iterB, iterC)) =>
  // We should always have a one-to-one join, but who knows …
  if (iterA.toSet.size > 1)
    throw new RuntimeException(s"Key $k received more than 1 A record")
  if (iterB.toSet.size > 1)
    throw new RuntimeException(…)
  if (iterC.toSet.size > 1) …
Late arriving partitions
.map({ case (key, values: Iterable[(Long, Int)]) =>
  values.toList.sortBy(_._1)
  // run 1: (1L, 10), (1L, 1), (2L, 1)
  // run 2: (1L, 1), (1L, 10), (2L, 1)
})
SPARK-19263 DAGScheduler should avoid sending
conflicting task set
Late arriving partitions
.map({ case (key, values: Iterable[(Long, Int)]) =>
  values.toList.sorted
  // always: (1L, 1), (1L, 10), (2L, 1)
})
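Why this helps: sortBy(_._1) is a stable sort, so ties keep whatever order the shuffle delivered the values in, which can differ between runs; .sorted orders by the full tuple, giving one canonical result. A quick illustration:

val runA = List((1L, 10), (1L, 1), (2L, 1)) // one possible arrival order
val runB = List((1L, 1), (1L, 10), (2L, 1)) // another arrival order

runA.sortBy(_._1) == runB.sortBy(_._1) // false: ties keep arrival order
runA.sorted == runB.sorted             // true: (1L,1), (1L,10), (2L,1)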
Lessons
- Trust but put extra checks and log everything
- Add extra idempotency even if it should be there
- Fail the job when an unexpected situation is
encountered, but also think ahead about whether
such situations can be recovered from
- Have retries on the pipeline scheduler level
Migration to Spark 2
SPARK-13850 TimSort Comparison method violates its general contract
SPARK-14560 Cooperative Memory Management for Spillables
SPARK-14363 Executor OOM due to a memory leak in Sorter
SPARK-14560 Cooperative Memory Management for Spillables
SPARK-22033 BufferHolder, other size checks should account for the specific VM
array size limitations
Lessons
- Check the bug tracker periodically
- Subscribe to mailing lists
- Participate in discussing issues
In conclusion
- Log everything (driver/executors, NodeManagers, GC)
- Measure everything (heap/off-heap, GC, executors' CPU, failed tasks/stages, slow parts, skewed parts)
- Trust but be ready
- Smaller survivable pieces
Thanks!
Want to work with us on Spark, Kafka, ES, and
more? Come to our booth!
jobs.datadoghq.com
twitter.com/@databuryat
_@databuryat.com
vadim@datadoghq.com
