WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Christopher Crosbie, Google
Ben Sidhom, Google
Improving Spark Downscaling
#UnifiedDataAnalytics #SparkAISummit
Long History of Solving Data Problems
[Timeline, 2000-2020, spanning open source, Google Cloud products, and Google research: GFS, MapReduce, BigTable, Dremel, FlumeJava, Millwheel, PubSub, Tensorflow; BigQuery, Pub/Sub, Dataflow, Bigtable, ML, Dataproc]
Apache Airflow
Cloud ML Engine
Cloud Dataflow
Cloud Data Fusion
Cloud Composer
Who are we and what is Cloud Dataproc?
Google Cloud Platform’s fully-managed Apache Spark and Apache Hadoop service
Rapid cluster creation
Familiar open source tools
Customizable machines
Ephemeral clusters on-demand
Tightly integrated with other Google Cloud Platform services
Cloud Dataproc: Open source solutions with GCP
Taking the best of open source and opening up access to the best of GCP
WebHCat, BigQuery, Cloud Datastore, Cloud Bigtable, Compute Engine, Kubernetes Engine, Cloud Dataflow, Cloud Dataproc, Cloud Functions, Cloud Machine Learning Engine, Cloud Pub/Sub, Key Management Service, Cloud Spanner, Cloud SQL, BQ Transfer Service, Cloud Translation API, Cloud Vision API, Cloud Storage
Jobs are “fire and forget”
No need to manually intervene when a cluster is over or under capacity
Choose balance between standard and preemptible workers
Save resources (quota & cost) at any point in time
Dataproc Autoscaling GA
Complicating Spark Downscaling
Without autoscaling: submit job, monitor resource usage, adjust cluster size
With autoscaling: submit jobs
Based on the difference between YARN pending and available memory:
If more memory is needed, then scale up
If there is excess memory, then scale down
Obey VM limits and scale based on the scale factor
Autoscaling policies: fine-grained control
Is there too much or too little YARN memory?
  No: do nothing
  Yes: is the cluster at the maximum # of nodes?
    Yes: do not autoscale
    No: determine the type and scale of nodes to modify, then autoscale the cluster
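As a rough sketch of this decision loop (hypothetical names and logic, not the Dataproc autoscaler's actual implementation), the memory-driven recommendation could look like the following Scala:

// Hedged sketch of a memory-based autoscaling decision.
object AutoscalePolicySketch {
  /** Positive delta = add workers, negative = remove workers, zero = do nothing. */
  def recommend(pendingMemMb: Long,
                availableMemMb: Long,
                memPerWorkerMb: Long,
                scaleFactor: Double,
                currentWorkers: Int,
                minWorkers: Int,
                maxWorkers: Int): Int = {
    // Positive when YARN has pending demand, negative when memory sits idle.
    val memoryGapMb = pendingMemMb - availableMemMb
    // Translate the gap into a worker count and dampen it by the scale factor.
    val rawDelta = math.round(scaleFactor * memoryGapMb.toDouble / memPerWorkerMb).toInt
    // Obey the policy's VM limits (min/max workers).
    val target = math.max(minWorkers, math.min(maxWorkers, currentWorkers + rawDelta))
    target - currentWorkers
  }
}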
Spark Autoscaling Challenges
YARN infrastructure complexities
Finding processed data (shuffle files, cached RDDs, etc.)
Optimizing costs
YARN
YARN-based managed Spark
[Architecture diagram: clients reach the Dataproc cluster through the Cloud Dataproc API (clusters, jobs) or over SSH; the cluster runs Compute Engine nodes with the Dataproc image (Apache Spark, Apache Hadoop, Apache Hive, ...), the Dataproc agent, and HDFS on persistent disk; the cluster bucket and user data live in Cloud Storage]
YARN pain points
Management is difficult
Clusters are complicated and have to use more components than are required for a job or model. This also requires hard-to-find experts.
Complicated OSS software stack
Version and dependency management is hard. Have to understand how to tune multiple components for efficiency.
Isolation is hard
I have to think about my jobs to size clusters, and isolating jobs requires additional steps.
Multiple k8s options
Moving the OSS ecosystem to Kubernetes offers customers a range of options depending on their needs and core expertise.
                            DIY k8s              k8s Dataproc                k8s Dataproc + Vendor components
Runs OSS on k8s?            Yes - self-managed   Yes - managed k8s clusters  Yes - managed k8s clusters
SLAs                        GKE only             Dataproc cluster            Dataproc cluster and component
OSS components              Community only       Google optimized            Google optimized + vendor optimized
In-depth component support  No                   No                          Yes
Integrated management       No                   Yes                         Yes
Integrated security         No                   Yes                         Yes
Hybrid/cross-cloud support  No                   Yes                         Yes
How we are making this happen
• Kubernetes Operators - an application control plane for complex applications
  – The language of Kubernetes allows extending its vocabulary through Custom Resource Definitions (CRDs)
  – A Kubernetes Operator is an app-specific control plane running in the cluster
    • CRD: app-specific vocabulary
    • CR: instance of a CRD
    • CR Controller: interpreter and reconciliation loop for CRs
  – The cluster can now speak the app-specific words through the Kubernetes API
[Diagram: a MyApp control plane and MyApp API sit alongside the Kubernetes control plane (master) and data plane (nodes); CRUD operations on MyApp resources flow through the Kubernetes API]
● Integrates with BigQuery, Google’s serverless data warehouse
● Provides Google Cloud Storage as a replacement for HDFS
● Ships logs to Stackdriver Monitoring
  ○ via a Prometheus server with the Stackdriver sidecar
● Contains sparkctl, a command-line tool that simplifies client-local application dependencies in a Kubernetes environment
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Deployment options
1. Deploy unified resource management
Get away from two separate cluster management interfaces for managing open source components. Offers one central view for easy management.
2. Isolate Spark jobs and resources
Remove the headaches of version and dependency management; instead, move models and ETL pipelines from dev to production without added work.
3. Build resilient infrastructure
Don’t worry about sizing and building clusters, manipulating Dockerfiles, or messing around with Kubernetes networking configurations. It just works.
Key benefits for autoscaling
Helpful, but does not solve our core problem...
Finding the processed data
What exactly is a shuffle & why do we care? Rob Wynne
A Brief History of Spark Shuffle
● Shuffle files to local storage on the executors
● Executors responsible for serving the files
● Loss of an executor meant loss of the shuffle files
● Result: poor auto-scaling
○ Pathological loop: scale down, lose work, re-compute, trigger scale up…
● Depended on driver GC event to clean up shuffle files
22#UnifiedDataAnalytics #SparkAISummit
Today: Dynamic allocation and “external” shuffle
● Executors no longer need to serve data
● “External” shuffle is not exactly external
○ Only executors can be released
○ Can scale up & down executors but not the machines
● Still depends on driver GC event to clean up shuffle files
23#UnifiedDataAnalytics #SparkAISummit
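For reference, dynamic allocation with the YARN external shuffle service is turned on through standard Spark settings; a minimal sketch (values illustrative, not recommendations):

import org.apache.spark.SparkConf

// With the external shuffle service, NodeManagers keep serving shuffle files,
// so idle executors can be released, but the machines themselves cannot.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "100")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")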
Spark’s shuffle code today
private[spark] trait ShuffleManager {
  def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]

  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]

  def unregisterShuffle(shuffleId: Int): Boolean

  def shuffleBlockResolver: ShuffleBlockResolver

  def stop(): Unit
}
24#UnifiedDataAnalytics #SparkAISummit
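Spark picks the ShuffleManager implementation from the spark.shuffle.manager setting, which is also how a custom (for example, disaggregated) shuffle implementation gets wired in; a minimal sketch with a hypothetical class name:

import org.apache.spark.SparkConf

// "com.example.shuffle.ExternalStoreShuffleManager" is hypothetical; it would
// need to implement the ShuffleManager trait shown above and be on the classpath.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "com.example.shuffle.ExternalStoreShuffleManager")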
Continued..
/**
 * Obtained inside a map task to write out records to the shuffle system.
 */
private[spark] abstract class ShuffleWriter[K, V] {
  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}
25#UnifiedDataAnalytics #SparkAISummit
Continued..
/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)
  ...
26#UnifiedDataAnalytics #SparkAISummit
Continued..
  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val tmp = Utils.tempFileWith(output)
  try {
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}
27#UnifiedDataAnalytics #SparkAISummit
Continued..
// Note: Changes to the format in this file should be kept in sync with
// org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getSortBasedShuffleBlockData().
private[spark] class IndexShuffleBlockResolver(
    conf: SparkConf,
    _blockManager: BlockManager = null)
  extends ShuffleBlockResolver
...
28#UnifiedDataAnalytics #SparkAISummit
Problems with This
● Rapid downscaling infeasible
  ○ Scaling down entire nodes is hard
● Preemptible VMs & Spot Instances
29#UnifiedDataAnalytics #SparkAISummit
Optimizing Costs
Preemptible VMs and Spot instances
PVMs: up to 80% cheaper for short-lived instances. Can be pulled at any time and are guaranteed to be removed within 24 hours.
Spot is based on a Vickrey auction.
Stage 1 → Shuffle → Stage 2
How can we fix this?
Make intermediate shuffle data external to both the executor and the
machine itself
33#UnifiedDataAnalytics #SparkAISummit
Where we started
class HcfsShuffleWriter[K, V, C] extends ShuffleWriter[K, V] {
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    val sorter = new ExternalSorter[K, V, C/V](...)
    sorter.insertAll(records)
    val partitionIter = sorter.partitionedIter
    val hcfsStream = …
    val countingStream = new CountingOutputStream(hcfsStream)
    val framedOutput = new FramingOutputStream(countingStream)
    try {
      for ((partition, iter) <- partitionIter) {
        // Write partition to external storage
      }
    } finally {
      framedOutput.closeUnderlying()
    }
  }
}
34#UnifiedDataAnalytics #SparkAISummit
Alpha: HDFS not quite ready for prime time
● RPC overhead to HDFS or persistent storage
● Especially poor performance with misaligned partition/block sizes
○ HDFS/GCS/etc different expectations of block size
● Loss of implicit in-memory page cache
● Possible slowness in cleaning up shuffle files
● Namenode contention when reading shuffle files (HDFS)
○ Added index caching layer to mitigate this
● Additional metadata tracking
36#UnifiedDataAnalytics #SparkAISummit
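To make the block-size mismatch concrete (illustrative numbers only): a shuffle with M map tasks and R reduce partitions produces on the order of M × R blocks, so M = R = 2,000 yields about 4,000,000 shuffle blocks; a 100 GB shuffle spread across them averages roughly 25 KB per block, orders of magnitude below the 128 MB block size HDFS is tuned for.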
Object Storage?
Apache Crail (Incubating) is a high-performance distributed data store designed for fast sharing of ephemeral data in distributed data processing workloads
● Fast
● Heterogeneous
● Modular
What about Google Cloud Bigtable?
Consistent low latency, high throughput, and scalable wide-column database service.
Back to basics - NFS
● Shuffle to Elastifile
  ○ Cloud-based NFS service (scales horizontally)
  ○ Tailored to random access patterns, small files
  ○ NFS looks like a local FS, but is not. Must be careful when dealing with commit semantics and speculative execution.
● Still a performance hit, but factors better than HDFS
41#UnifiedDataAnalytics #SparkAISummit
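One way to handle the commit-semantics concern on a shared NFS-style filesystem is the write-to-temp-then-rename pattern Spark already uses for local shuffle files; a rough sketch with hypothetical paths and names, not the actual Elastifile writer:

import java.nio.file.{Files, Paths, StandardCopyOption}

// Write each partition to an attempt-specific temp file, then atomically move it
// into place so duplicate speculative attempts cannot interleave output.
def commitPartition(finalPath: String, attemptId: Long, bytes: Array[Byte]): Unit = {
  val tmp = Paths.get(s"$finalPath.attempt-$attemptId.tmp")
  val dst = Paths.get(finalPath)
  Files.write(tmp, bytes)
  if (Files.exists(dst)) {
    // Another (speculative) attempt already committed; discard this attempt's output.
    Files.delete(tmp)
  } else {
    // ATOMIC_MOVE may not be supported on every NFS mount; real code must handle that.
    Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE)
  }
}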
Goal: OSS Disaggregated Shuffle
Architecture
[Architecture diagram: a Kubernetes cluster runs the Spark driver pod and executors with a shuffle-offload component (WIP); shuffle data is offloaded from the virtual machine group to Elastifile and cloud (object) storage]
Use the cloud to
fix the cloud?
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT