Using Apache Spark for processing
trillions of records each day at
Datadog
Vadim Semenov
Data Engineer @ Datadog
vadim@datadoghq.com
Initial setup
AWS EMR
100-200x r3.8xlarge: 32 cores, 244 GiB RAM, 640 GB SSD, 10 GbE
3200-6400 cores
23.5-47 TiB memory (only 240.23 GiB of each node's 244 GiB is available, because of Xen)
spot instances
spark 1.6 in yarn-cluster mode
scala + RDD API
Some initial settings
yarn.nodemanager.resource.memory-mb 240g
yarn.scheduler.maximum-allocation-mb 240g
spark.driver.memory 8g
spark.yarn.driver.memoryOverhead 3g
spark.executor.memory 201g
spark.yarn.executor.memoryOverhead 28g
spark.driver.cores 4
spark.executor.cores 32
spark.executor.extraJavaOptions -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
-XX:ErrorFile=/tmp/hs_err_pid%p.log
(201g executor heap + 28g overhead = 229g, which fits within the 240g YARN container limit)
Trillion
How big is a trillion?
2^40 = 1,099,511,627,776
2^31 = 2,147,483,648 = Int.MaxValue
a trillion 4-byte Ints = 2^42 bytes = 4 TiB ≈ 4.4 TB
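The arithmetic, spelled out (a quick sketch, assuming 4-byte primitive Ints):

val count = 1L << 40                   // "a trillion" ≈ 2^40 = 1,099,511,627,776
val bytes = count * 4                  // 4 bytes per primitive Int = 2^42 bytes
val tib = bytes.toDouble / (1L << 40)  // 4.0 TiB
val tb  = bytes / 1e12                 // ≈ 4.4 TB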
OOMs
- java.lang.OutOfMemoryError: Java heap space (Increase heap size)
- java.lang.OutOfMemoryError: GC Overhead limit exceeded (Too much
garbage)
- java.lang.OutOfMemoryError: Direct buffer memory (NIO)
- YARN Container is running beyond physical memory limits. Killing
container. (Increase memory overhead)
- There is insufficient memory for the Java Runtime Environment to
continue (Add more memory, reduce memory consumption)
The driver must survive
spark.driver.memory 8g → 83g
spark.yarn.driver.memoryOverhead 3g → 32g
spark.driver.cores 4 → 15
spark.executor.memory 201g → 166g
spark.yarn.executor.memoryOverhead 28g → 64g
spark.executor.cores 32 → 30
Measure memory usage
https://github.com/etsy/statsd-jvm-profiler
spark.files = /tmp/statsd-jvm-profiler.jar
spark.executor.extraJavaOptions += -javaagent:statsd-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler
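Wired into a job submission this might look like the following (a sketch; the jar location, job class, and statsd endpoint are assumptions):

# com.example.MyJob / my-job.jar are placeholders for your own job;
# assumes a statsd listener (e.g. the Datadog agent) on localhost:8125
spark-submit \
  --conf spark.files=/tmp/statsd-jvm-profiler.jar \
  --conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler" \
  --class com.example.MyJob my-job.jar

The agent then reports JVM heap and non-heap usage from each executor to the local statsd.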
Measure memory usage
Measure GC
Off-heap OOMs
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
…
at parquet.hadoop.codec.…
Off-heap memory
Direct Allocated Buffers (NIO): Parquet, MessagePack, …
Java Native Interface (JNI): dynamically-linked native libraries like libhadoop.so, GZIP, ZLIB, LZ4
sun.misc.Unsafe: org.apache.hadoop.io.nativeio, org.apache.spark.unsafe
Process memory
$ cat /proc/<spark driver/executor pid>/status
VmPeak: 190317312 kB
VmSize: 190268160 kB
VmHWM: 187586408 kB
VmRSS: 187586408 kB
VmData: 190044492 kB
Process memory
Solution: have the Java agent read its own process's memory usage directly from procfs
https://github.com/DataDog/spark-jvm-profiler
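A minimal sketch of the idea (Linux only; the helper name is hypothetical):

import scala.io.Source

// Read the resident set size of the current JVM straight from procfs.
// Unlike JMX heap beans, this captures everything the OS has actually
// given the process: heap, direct buffers, JNI allocations, and so on.
def vmRssKb(): Option[Long] =
  Source.fromFile("/proc/self/status").getLines()
    .find(_.startsWith("VmRSS:"))
    .map(_.split("\\s+")(1).toLong)  // "VmRSS:  187586408 kB" -> 187586408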
Measure memory usage
Measure each executor
Lessons
- Give more resources than you think you
would need, and then reduce
- Measure memory usage of each executor
- Keep an eye on your GC metrics
Measure slow parts
val timer = new MaxAndTotalTimeAccumulator // custom accumulator, sketched below
rdd.map(key => {
  val startTime = System.nanoTime()
  val result = … // the actual work for this key
  val endTime = System.nanoTime()
  val millisecondsPassed = ((endTime - startTime) / 1000000).toInt
  timer.add(millisecondsPassed)
  result
})
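MaxAndTotalTimeAccumulator is not a built-in Spark class; a minimal sketch of one on Spark 2's AccumulatorV2 API (the name and exact semantics here are assumptions) could look like:

import org.apache.spark.util.AccumulatorV2

// Hypothetical implementation: tracks the max and the sum of all
// durations reported by tasks.
class MaxAndTotalTimeAccumulator extends AccumulatorV2[Int, (Int, Long)] {
  private var max = 0
  private var total = 0L
  override def isZero: Boolean = max == 0 && total == 0L
  override def copy(): MaxAndTotalTimeAccumulator = {
    val acc = new MaxAndTotalTimeAccumulator
    acc.max = max
    acc.total = total
    acc
  }
  override def reset(): Unit = { max = 0; total = 0L }
  override def add(v: Int): Unit = { max = math.max(max, v); total += v }
  override def merge(other: AccumulatorV2[Int, (Int, Long)]): Unit = other match {
    case o: MaxAndTotalTimeAccumulator =>
      max = math.max(max, o.max)
      total += o.total
    case _ => throw new IllegalStateException("unexpected accumulator type")
  }
  override def value: (Int, Long) = (max, total)
}

// Register it so per-task values get merged back on the driver:
// sc.register(timer, "map-time")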
Watch skewed parts
.groupByKey().flatMap({ case (key, iter) =>
  val size = iter.size
  maxAccumulator.add(key, size)
  if (size >= 100000000) { // 100M values: drop the hot key instead of dying on it
    log.info(s"Key $key has $size values")
    None
  } else {
    Some((key, iter))
  }
})
Report accumulators per partition
sc.addSparkListener(new SparkListener {
override def onTaskEnd(
taskEnd: SparkListenerTaskEnd
): Unit =
Option(taskEnd.taskMetrics)
.foreach(taskMetrics => … )
})
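A sketch of what the elided listener body might report (the statsd client and metric names are assumptions; the TaskMetrics fields are real Spark 2 API):

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    Option(taskEnd.taskMetrics).foreach { m =>
      // statsd here is a stand-in for your metrics client
      statsd.histogram("task.run_time_ms", m.executorRunTime)
      statsd.histogram("task.gc_time_ms", m.jvmGCTime)
      statsd.histogram("task.shuffle_read_bytes", m.shuffleReadMetrics.totalBytesRead)
    }
})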
Collect executor metrics
Lessons
- Measure slowest parts of your job
- Count records in the most skewed parts
- Keep track of how much CPU time your job
actually consumes
- Have alerting on these metrics, so you know when your job gets slower
Spot instances
Spot instances mitigation
- Break the job into smaller survivable pieces
- Use `rdd.checkpoint` instead of `rdd.persist` to save data to HDFS (see the sketch after this list)
- Helps dynamic allocation: since executors don't hold any data, they can leave the job and join other jobs
- Losing multiple executors won't result in recomputing partitions
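A minimal sketch of the checkpoint approach (the paths and RDD names are assumptions):

// Checkpointed data survives executor loss because it lives in HDFS,
// not in executor memory or on executor-local disks.
// rawRdd / expensiveTransform are placeholders for your own pipeline.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val intermediate = rawRdd.map(expensiveTransform)
intermediate.checkpoint() // marks the RDD; the next action writes it to HDFS
intermediate.count()      // force evaluation so the checkpoint actually happens

Downstream stages then read from HDFS, and the lineage above the checkpoint is truncated.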
ExternalShuffleService
[animated diagram: executors Ex1-Ex3 plus the driver; Ex1 serves shuffle blocks 1-4 through its external shuffle service. When Ex1's node is lost, fetches of its blocks fail and the blocks have to be regenerated on Ex2/Ex3, one failure at a time.]
ERROR o.a.s.n.shuffle.RetryingBlockFetcher:
Exception while beginning fetch of 13 outstanding blocks
java.io.IOException: Failed to connect to
ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
ExternalShuffleService
SPARK-19753 Remove all shuffle files on a host in case
of slave lost or fetch failure
SPARK-20832 Standalone master should explicitly inform
drivers of worker deaths and invalidate external shuffle
service outputs
Other FetchFailures
SPARK-20178 Improve Scheduler fetch failures
Keep an eye on failed tasks
Lessons
- Keep all logs
- Spark isn't super-resilient even when one
node dies
- Monitor the number of failed
tasks/stages/lost nodes
Late arriving partitions
rddA.cogroup(rddB, rddC).map({ case (k, (iterA, iterB, iterC)) =>
  // We should always have a one-to-one join, but who knows …
  if (iterA.toSet.size > 1)
    throw new RuntimeException(s"Key $k received more than 1 A record")
  if (iterB.toSet.size > 1)
    throw new RuntimeException(…)
  if (iterC.toSet.size > 1) …
Late arriving partitions
.map({ case (key, values: Iterable[(Long, Int)]) =>
  values.toList.sortBy(_._1)
  // run 1: (1L, 10), (1L, 1), (2L, 1)
  // run 2: (1L, 1), (1L, 10), (2L, 1)
})
SPARK-19263 DAGScheduler should avoid sending
conflicting task set
Late arriving partitions
.map({ case (key, values: Iterable[(Long, Int)]) =>
  values.toList.sorted
  // always: (1L, 1), (1L, 10), (2L, 1)
})
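Why this helps: sortBy(_._1) is a stable sort, so ties keep whatever order the shuffle delivered the values in, which can differ between runs; .sorted orders by the full tuple, giving one canonical result. A quick illustration:

val runA = List((1L, 10), (1L, 1), (2L, 1)) // one possible arrival order
val runB = List((1L, 1), (1L, 10), (2L, 1)) // another arrival order

runA.sortBy(_._1) == runB.sortBy(_._1) // false: ties keep arrival order
runA.sorted == runB.sorted             // true: (1L,1), (1L,10), (2L,1)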
Lessons
- Trust but put extra checks and log everything
- Add extra idempotency even if it should be there
- Fail the job when an unexpected situation is
encountered, but also think ahead about whether
such situations can be recovered from
- Have retries on the pipeline scheduler level
Migration to Spark 2
SPARK-13850 TimSort Comparison method violates its general contract
SPARK-14560 Cooperative Memory Management for Spillables
SPARK-14363 Executor OOM due to a memory leak in Sorter
SPARK-14560 Cooperative Memory Management for Spillables
SPARK-22033 BufferHolder, other size checks should account for the specific VM
array size limitations
Lessons
- Check the bug tracker periodically
- Subscribe to mailing lists
- Participate in discussing issues
In conclusion
- Log everything (driver/executors, NodeManagers, GC)
- Measure everything (heap/off-heap, GC, executors' CPU, failed tasks/stages, slow parts, skewed parts)
- Trust but be ready
- Smaller survivable pieces
Thanks!
Want to work with us on Spark, Kafka, ES, and
more? Come to our booth!
jobs.datadoghq.com
twitter.com/@databuryat
_@databuryat.com
vadim@datadoghq.com
