Improving Apache Spark for Dynamic Allocation and Spot Instances

Holden Karau discusses improvements to Apache Spark for dynamic allocation and spot instances, highlighting challenges related to data resilience and the impact of cloud technology and Kubernetes. The talk reflects on past experiences and personal anecdotes, including recovering from an accident that affected development work. Key features and future improvements for Spark's decommissioning capabilities are also outlined.

Who am I?
• Holden Karau
• She / her
• Apache Spark PMC
• Contributor to a lot of other projects
• Co-author of High Performance Spark, Learning Spark, and Kubeflow for Machine Learning
• http://bit.ly/holdenSparkVideos
• https://youtube.com/user/holdenkarau

Let us start at the beginning
• Spark achieves resilience through re-computation, which is part of how we go fast
• This poses challenges with removing executors that may contain data
• We "solved" it for YARN/Mesos back in the day
• I drank waaaay too much coffee and came up with an alternative
• But no one really liked it because we didn't need it, so I closed the Google doc and forgot about it

• Don’t worry, we’ll get to the code soon :)

But then….
• The "cloud" became really popular
• Kubernetes became popular

• Everything caught on fire :/

Our Protagonist Remembers
• I started drinking a lot of coffee

• We dusted off that old design and wrote some code
• And then I got hit by a car
• More people wrote more code
• We had a VOTE
• We wrote waaaaay more code

• Everyone lived happily ever after?
Photo by Lukas from Pexels

How did DA work on YARN?
• Scale up is "easy" (add more resources)
• Scale down required a stay-resident program on each YARN node to serve any files
• Spark stored its shuffle data as files
• Persisted in-memory data was still lost when scaling down an executor
Photo by Markus Spiske from Pexels

Why did the cloud impact this?
• If you wanted the ~50% cost saving from spot/preemptible instances you might lose entire machines
• Yes, Spark can "handle" this, but does so by recomputing data (expensive)
• You can't depend on leaving a program around to serve files when the server is just gone
• So we need to find a way to migrate the data

Ok sure the cloud, but K8s?
• Kubernetes doesn't like the idea of scheduling a stay-resident program on every node
• Also most people don't like the idea of shared disk here either (across jobs/users)
• So we need to find a way to migrate the data

SPARK-20624
• Yee-haw!
• Ok, but more seriously, how does it work? Great question, let's open up the code
• BlockManagerDecommissioner.scala is where most of the magic happens
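
To make "migrate the data" concrete before opening the real code, here is a toy Python model of an executor handing its blocks to peers before it goes away. It is only a sketch: the `Executor` class, the `decommission` function, and the round-robin choice of peers are all made up for illustration and are not the logic in the Scala file named above.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Executor:
    """Toy stand-in for an executor: a name plus the ids of the blocks it holds."""
    name: str
    blocks: set = field(default_factory=set)


def decommission(leaving, peers):
    """Move every block off `leaving` onto `peers`, round-robin (an assumption).

    Returns a dict mapping block id -> name of the peer that received it.
    """
    if not peers:
        raise ValueError("no peers left to receive blocks")
    placement = {}
    targets = cycle(peers)
    # Sort so the toy example is deterministic.
    for block in sorted(leaving.blocks):
        peer = next(targets)
        peer.blocks.add(block)
        placement[block] = peer.name
    leaving.blocks.clear()
    return placement


# Hypothetical block ids: one cached RDD block and two shuffle blocks.
exec1 = Executor("exec-1", {"rdd_0_0", "shuffle_0_0_0", "shuffle_0_1_0"})
exec2 = Executor("exec-2")
exec3 = Executor("exec-3")

moved = decommission(exec1, [exec2, exec3])
print(moved)
```

The point of the sketch is the ordering: blocks are copied to surviving peers first and the leaving executor is emptied last, so nothing has to be recomputed once the machine is gone.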

Collaboration
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Decommissioning-SPIP-td29701.html

https://github.com/apache/spark/pulls?q=is%3Apr+decommission+is%3Aclosed+

Ok what about the car?
Getting hit by a car sucks a lot
Slowed down dev work while I did rehab to be able
to walk & type again
Shout out to everyone who helped me recover
(from my wife, girlfriend, partners, my friends, to
the hospital staff, nursing home, PT, OT,
Ambulance, my employer for giving me time off,
the Spark community for understanding I needed
time off <3)

It’s early though so please be careful
On a Happy Note: You can try this now
• Enable the following:
- spark.decommission.enabled
- spark.storage.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
- spark.storage.decommission.shuffleBlocks.enabled
• Want to get fancy? Optionally enable:
- spark.shuffle.externalStorage.enabled
- And configure a storage backend (spark.shuffle.externalStorage.backend)
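
One way to try the four required settings listed above is to pass them as `--conf` flags to `spark-submit`. The Python sketch below just builds that command line; the helper name `to_submit_args` and the job file `my_job.py` are hypothetical, and it assumes each setting is a boolean turned on with `true`. The optional external-storage settings are left out because their values depend on the backend you pick.

```python
# The four settings from the slide, assumed here to be boolean flags.
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(DECOMMISSION_CONF)

# my_job.py is a placeholder for your own application.
print(" ".join(["spark-submit", *submit_args, "my_job.py"]))
```

Keeping the settings in one dict makes it easy to switch the whole feature off again if it misbehaves, which fits the "please be careful" caveat.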

Future work
• Heuristics to migrate data
• Improve container pre-emption selection
• Better heuristics around when to scale up and down containers

TM and © 2021 Apple Inc. All rights reserved.

The document discusses best practices for enabling speculative execution in large-scale platforms, particularly in the context of Apache Spark at LinkedIn. It outlines configuration parameters, motivation for improvements, and metrics for analyzing speculative execution's impact on task performance and resource utilization. The findings indicate that tailored speculative execution parameters can enhance performance, reduce job completion times, and lead to more predictable system behavior.

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks

The document discusses tuning Apache Spark at Facebook for large-scale workloads, addressing key areas such as driver and executor scaling, memory configuration, and fetch failure handling. It emphasizes optimizing performance through dynamic resource allocation, effective memory management, and configuring disk I/O settings. Additionally, it introduces tools for monitoring and analyzing task metrics to improve usability and automate tuning for job performance.

MLflow with DatabricksLiangjun Jiang

The document provides a comprehensive tutorial on using MLflow with Databricks for managing machine learning workflows, addressing challenges like tracking experiments and model deployment. It covers MLflow components such as tracking, projects, and models, which facilitate reproducibility, code organization, and diverse deployment options. Additionally, the document discusses CI/CD processes in Databricks, emphasizing integration with version control and testing methodologies.

Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks

This document discusses strategies for fine-tuning and enhancing the performance of Apache Spark jobs, focusing on optimizing data ingestion, handling skew, and effective partitioning. Key recommendations include experimenting with cluster configurations, using caching techniques, and selecting appropriate join optimization strategies. The importance of iterative performance tuning and utilizing monitoring tools such as the Spark UI to address major slowdowns is emphasized.

Running Apache NiFi with Apache Spark : Integration OptionsTimothy Spann

The document outlines the integration of Apache NiFi with Apache Spark and Kafka, focusing on data flow management, streaming analytics, and security features. It highlights various integration options and architectures that enhance real-time data processing and event management. Apache Livy is introduced as a REST interface for interacting with Spark, emphasizing its role in executing batch and interactive jobs securely.

Flink vs. SparkSlim Baltagi

The document compares Apache Flink and Apache Spark, discussing their capabilities and key differences, particularly in handling stream processing. It highlights the common misconceptions and marketing claims about both technologies, aiming to assist in evaluating their use, specifically within Capital One. A structured framework comprising over 100 criteria is presented to assess the strengths and weaknesses of both frameworks for data processing tasks.

Vectorized Query Execution in Apache Spark at FacebookDatabricks

This document summarizes Chen Yang's presentation on vectorized query execution in Apache Spark at Facebook. The key points are: 1) Spark is the largest SQL query engine at Facebook and uses columnar formats like ORC to improve storage efficiency. 2) Vectorized processing can improve performance over row-at-a-time processing by reducing per-row overhead and improving cache locality. 3) Facebook has implemented a vectorized ORC reader and writer in Spark that shows up to 8x speedup on microbenchmarks compared to the row-at-a-time approach.

Real-time Analytics with Presto and Apache PinotXiang Fu

The document discusses real-time analytics using Presto and Apache Pinot, highlighting their capabilities for user-facing applications and business metrics. It mentions the ingestion of millions of events per second, the handling of thousands of queries per second, and the balance between latency and flexibility. Additionally, it provides links for getting started with tutorials and community support.

Building Reliable Data Lakes at Scale with Delta LakeDatabricks

This document provides a tutorial for building scalable Delta Lakes using Databricks, including steps such as account creation, cluster setup, and data storage strategies. It addresses challenges in data lakes such as historical queries, messy data, and the need for updates and consistency. The document emphasizes the benefits of Delta Lakes such as data quality improvement, support for ACID transactions, and integration with various data formats and sources.

Apache Spark Core – Practical OptimizationDatabricks

The document contains speaker notes from Daniel Tomes' talk on optimizing Spark core at the 2019 AI Summit, covering key topics like Spark hierarchy, UI navigation, partition management, and data scanning minimization. It emphasizes the importance of understanding hardware capabilities, managing partitions effectively, and optimizing data processing to reduce job spills and maximize performance. Supplemental slides on specific topics like Spark UI and Delta optimization will be added based on audience familiarity with the presentation material.

Improving Apache Spark DownscalingDatabricks

The document discusses Google's Cloud Dataproc, a fully-managed service for Apache Spark and Hadoop, detailing its integration with various Google Cloud products and the management of cluster resources through autoscaling. It addresses the complexities and challenges of optimizing Spark jobs, particularly focusing on shuffle processing, data management, and the transition to Kubernetes for enhanced control and performance. Additionally, it highlights advancements in external shuffling mechanisms to improve efficiency and reduce the impact of scaling operations.

Introduction to Structured StreamingKnoldus Inc.

Structured Streaming is a scalable and fault-tolerant stream processing engine introduced in Spark 2.0, allowing for unified stream and batch processing through SQL-like queries. It represents live data streams as unbounded tables with three output modes: complete, append, and update, facilitating various computational approaches. The document covers essential operations like selection, aggregation, and window operations with examples, highlighting its ease of use and performance advantages through the Catalyst optimizer.

The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks

The document discusses the implementation and performance of the Zstandard compression algorithm in Apache Spark and related projects, highlighting various issues such as compatibility, buffer management, and memory consumption. It emphasizes the improvements brought by Zstandard in terms of compression ratios and speed, while also detailing limitations and specific use cases for efficient data handling. Additionally, it provides historical context for the integration of Zstandard across different data processing frameworks like Apache Parquet, ORC, and Avro.

Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue

The document discusses Iceberg, a modern table format for big data, highlighting its advantages over traditional Hive tables, particularly in terms of performance and usability for large datasets like those used by Netflix. It details the technical advantages of Iceberg's snapshot isolation, atomic operations, and enhanced schema evolution capabilities, contrasting these features with limitations of Hive. The document also provides practical guidance on getting started with Iceberg, including supported engines and community resources.

Druid deep diveKashif Khan

This document provides an overview of Druid, an open-source distributed real-time analytics database. Druid is designed to ingest and query large amounts of data quickly. It can combine both historical and real-time data streams. Druid uses a column-oriented data structure and supports features like streaming data ingestion, sub-second queries, and approximate computation. The document describes the various components of Druid including indexing, serving, and coordination nodes and how they work together. It also discusses querying, integration with Hive, and compares Druid to other real-time analytics solutions.

Data Lakehouse Symposium | Day 4Databricks

The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.

How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin

This document summarizes a USF Spark workshop that covers Spark internals and how to optimize Spark jobs. It discusses how Spark works with partitions, caching, serialization and shuffling data. It provides lessons on using less memory by partitioning wisely, avoiding shuffles, using the driver carefully, and caching strategically to speed up jobs. The workshop emphasizes understanding Spark and tuning configurations to improve performance and stability.

Hive Bucketing in Apache Spark with Tejas PatilDatabricks

The document discusses hive bucketing in Apache Spark, including the reasons for using bucketing, potential performance improvements, and the differences between Spark's and Hive's bucketing semantics. It highlights the inefficiencies of shuffling during joins and provides guidelines for when to implement bucketing to enhance performance. Additionally, it outlines Spark's support for creating and managing bucketed tables and the associated SQL planner improvements.

[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)Seongyun Byeon

[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유Hyojun Jeon

The Hidden Value of Hadoop MigrationDatabricks

The document discusses the advantages of migrating from Hadoop to Databricks, highlighting benefits such as significant cost savings, improved performance, and increased productivity. It showcases successful migration stories, including enhanced data analytics and machine learning capabilities that drive business value. The content emphasizes a structured migration approach through automation and expert support to minimize risks and cut costs.

Apache Flink and Apache Hudi.pdfdogma28

This document summarizes a presentation on building a streaming Lakehouse with Apache Flink and Apache Hudi. The presentation introduces Hudi as a way to unify batch and streaming workloads in a centralized data lake platform. It discusses how Hudi enables features like efficient upserts/deletes, incremental processing for change streams, and automatic catalog synchronization. The presentation demonstrates using Flink and Hudi on Amazon EMR and outlines several ongoing Hudi projects, including a new metaserver and lake cache, to further optimize query performance and metadata handling for streaming data lakes.

Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi

The document discusses Apache Flink, emphasizing its transition from batch to streaming analytics, which is driven by the need for real-time data processing in various industries. It highlights key differentiators of Flink, such as low latency, high throughput, and its ability to ensure accurate results even during failures. Additionally, it explores real-world use cases across sectors like finance, healthcare, and retail, illustrating the growing adoption of streaming analytics for competitive advantage.

Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks

The document discusses the challenges and strategies related to downscaling Apache Spark clusters in a cloud environment, particularly emphasizing the difficulty of removing nodes compared to adding them. It outlines optimization techniques, including executor packing, external shuffle services, and the disaggregation of compute and storage to improve resource management and cost efficiency. Future work is proposed to enhance downscaling processes without significantly impacting performance.

Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward

The document discusses the implementation of a near real-time, exactly-once financial data processing pipeline using Kafka, Flink, and Pinot at Stripe, addressing the challenges of processing large volumes of transactions without missing or duplicating them. Key requirements include low latency and operational efficiency, with focus on deduplication strategies and handling Kafka transactions carefully to prevent data loss. Additionally, it outlines lessons learned and offers best practices for managing the architecture effectively.

Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.

The document discusses 5 common mistakes people make when writing Spark applications: 1) Not properly sizing executors for memory and cores. 2) Having shuffle blocks larger than 2GB which can cause jobs to fail. 3) Not addressing data skew which can cause joins and shuffles to be very slow. 4) Not properly managing the DAG to minimize shuffles and stages. 5) Classpath conflicts from mismatched dependencies causing errors.

Modernizing to a Cloud Data ArchitectureDatabricks

The document discusses the urgency for enterprises to transition from traditional Hadoop architectures to cloud-based solutions like Databricks due to rising costs and inefficiencies. It highlights significant business benefits, including increased revenue and productivity, as well as the advantages of a unified data platform for analytics and AI workloads. The content emphasizes the importance of modernization in achieving innovation and competitive advantage in an era of accelerated digital transformation.

3D: DBT using Databricks and DeltaDatabricks

The document discusses dbt (data build tool) and its integration with Delta Lake and Azure Databricks, focusing on building data pipelines using principles of DataOps. It highlights the capabilities of dbt in transforming data, managing tables and views, and enabling incremental data ingestion while detailing its features such as testing, documentation, and metrics tracking. Additionally, it emphasizes the importance of observability and provides links to resources for further exploration.

Leveraging Databricks for Spark PipelinesRose Toomey

Coatue Management migrated its Spark pipelines to Databricks, achieving significant reductions in operational overhead, running times, and costs. Key changes included consolidating multiple jobs into single executions and utilizing Databricks' runtime optimizations, leading to notable performance improvements, such as reducing pipeline completion times from hours to minutes. The transition resulted in more reliable cloud storage interactions and a simplified management through a unified API.

Leveraging Databricks for Spark pipelinesRose Toomey

Coatue Management improved the efficiency of their Spark pipelines by migrating to Databricks, which enhanced performance and reduced operational overhead. This transition enabled the consolidation of multiple job submissions into a single job, resulting in significant speed improvements for large and medium-sized data pipelines. The overall outcome included faster processing times, more reliable cloud storage operations, and simplified management through a single API.

More Related Content

What's hot (20)

Building Reliable Data Lakes at Scale with Delta LakeDatabricks

Apache Spark Core – Practical OptimizationDatabricks

Improving Apache Spark DownscalingDatabricks

Introduction to Structured StreamingKnoldus Inc.

The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks

Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue

Druid deep diveKashif Khan

Data Lakehouse Symposium | Day 4Databricks

How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin

Hive Bucketing in Apache Spark with Tejas PatilDatabricks

[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)Seongyun Byeon

[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유Hyojun Jeon

The Hidden Value of Hadoop MigrationDatabricks

Apache Flink and Apache Hudi.pdfdogma28

Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi

Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks

Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward

Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.

Modernizing to a Cloud Data ArchitectureDatabricks

3D: DBT using Databricks and DeltaDatabricks

Building Reliable Data Lakes at Scale with Delta LakeDatabricks

Apache Spark Core – Practical OptimizationDatabricks

Improving Apache Spark DownscalingDatabricks

Introduction to Structured StreamingKnoldus Inc.

The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks

Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue

Druid deep diveKashif Khan

Data Lakehouse Symposium | Day 4Databricks

How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin

Hive Bucketing in Apache Spark with Tejas PatilDatabricks

[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)Seongyun Byeon

[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유Hyojun Jeon

The Hidden Value of Hadoop MigrationDatabricks

Apache Flink and Apache Hudi.pdfdogma28

Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi

Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks

Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward

Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.

Modernizing to a Cloud Data ArchitectureDatabricks

3D: DBT using Databricks and DeltaDatabricks

Similar to Improving Apache Spark for Dynamic Allocation and Spot Instances (20)

Leveraging Databricks for Spark PipelinesRose Toomey

Leveraging Databricks for Spark pipelinesRose Toomey

Deploying Apache Spark Jobs on Kubernetes with Helm and Spark OperatorDatabricks

Stackato v4Jonas Brømsø

Stackato is a Platform as a Service (PaaS) cloud computing product from ActiveState that allows developers to easily deploy applications and services written in languages like Perl, Ruby, and JavaScript to public and private clouds. The presenter evaluates Stackato based on their experience, demonstrating how to deploy a simple "Hello World" Perl application using Mojolicious and exploring Stackato's management console, application updating process, and built-in app store. They conclude that Stackato provides benefits like easy access to platforms and frameworks with minimal differences between development and production.

Sharing (or stealing) the jewels of python with big data & the jvm (1)Holden Karau

The document discusses the integration of Python with Apache Spark, focusing on using PySpark and the performance challenges it faces. It introduces techniques for optimizing Python user-defined functions (UDFs) and highlights the potential of Apache Arrow for improving data serialization speeds. Additionally, the speaker encourages collaboration and benchmarking of Python UDFs within Spark for better performance insights.

Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent

The document discusses stream processing with Python and options to avoid summoning Cuthulu when doing so. It summarizes Apache Spark's capabilities for stream processing with Python, current limitations, and potential future improvements. It also discusses alternative approaches like using pure Python or Spark Structured Streaming. The document recommends Spark Streaming for Python stream processing needs today while noting potential performance improvements in the future.

Stackato v6Jonas Brømsø

Stackato is a PaaS cloud platform from ActiveState that allows developers to easily deploy applications to the cloud. It supports multiple languages including Perl, Ruby, and JavaScript. The presentation demonstrated deploying simple Perl apps to Stackato using the Mojolicious framework. Key benefits of Stackato include minimal differences between development and production environments, one-click deployments, and allowing developers to manage infrastructure. ActiveState is very open and provides documentation, examples, and a community forum to support Stackato users.

Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido

The document is a presentation by Sarah Guido on using Apache Spark for data science at Bitly, focusing on big data analysis, workflow setup, and live demonstrations of Spark's capabilities. It highlights the advantages of Spark over Hadoop, including speed and functionality, and discusses data processing, exploratory data analysis, and topic modeling. The talk concludes with current and future projects involving Spark, emphasizing its role in research and development.

Machine learning in real-time - the next frontierSnowplow Analytics

Alex Dean, CEO of Snowplow Analytics, discusses the shift from batch-based to real-time data pipelines in machine learning, emphasizing the challenges of operationalizing decision-making from static datasets. He critiques the 'lambda architecture' for real-time analytics as problematic and suggests the need for unified data models and decision-making. Dean invites collaboration and discussion with those experimenting with real-time machine learning.

Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari

This document outlines a workshop on Apache Spark, detailing its features, such as fast cluster computing, in-memory processing, and support for multiple programming languages. It emphasizes the importance of Spark for data science and machine learning applications, describing its capabilities for data querying and real-time processing. Additionally, it includes practical information on using Spark with various programming environments, such as Python and R, and offers resources for further learning.

sparkBen Liu

Spark is a big data analytic engine and cluster computing framework that was created by UC Berkeley AMP Lab in 2009 and donated to the Apache Software Foundation in 2014. It is 100x faster than Hadoop for certain applications because it temporarily stores data in RAM rather than on disk. Spark supports Scala, Java, Python and has four main APIs for SQL queries, streaming data, machine learning, and graph processing. It can be run locally for testing or in standalone, Mesos, or YARN cluster modes.

StackatoJonas Brømsø

The document presents an evaluation of Stackato, a Platform as a Service (PaaS) solution by ActiveState, focusing on its utility for developers in terms of deployment and management of applications using Perl and other languages. It highlights features such as ease of deployment, management console capabilities, and the minimal differences between development and production environments while noting existing issues with service integrations and custom components. The author advocates for Stackato as a significant part of future infrastructure due to its benefits in accessibility and reproducibility.

Best Practice in Accelerating Data Applications with Spark+AlluxioAlluxio, Inc.

This document provides an overview of integrating Alluxio with Apache Spark for enhanced data analytics, focusing on improved input/output performance through better data locality and enabling data sharing between Spark jobs. It discusses the history of Alluxio as a data orchestration tool, its setup with Spark, and various use cases demonstrating its benefits, including faster data access and reduced computing costs. The document also outlines examples of coding practices for using Alluxio with Spark and mentions a growing community and potential career opportunities.

Stackato v3Jonas Brømsø

The document provides a personal evaluation of Stackato, a micro-cloud platform by ActiveState, and discusses its benefits for developers, including easy access to platforms and reduced gaps between development and production. It highlights features like deployment, logging, and management tools while acknowledging the proprietary nature of Stackato despite its support for open-source development. Overall, the author advocates for adopting Stackato as part of future infrastructure due to its reproducibility and transparency.

Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiDatabricks

The document discusses the use of Apache Spark at Apple, highlighting the transition from traditional Hadoop and MapReduce to Spark, specifically in terms of scalability, resource optimization, and fault tolerance. It outlines the growth of Spark usage from 2016 to 2018, the challenges faced with streaming jobs, and the implementation of an elastic self-service infrastructure to enhance developer productivity and resource utilization. Additionally, it covers security measures, telemetry, and the integration of a multi-tenant history server for job management.

Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupHyderabad Scalability Meetup

This document provides an introduction and overview of Apache Spark, a lightning-fast cluster computing framework. It discusses Spark's ecosystem, how it differs from Hadoop MapReduce, where it shines well, how easy it is to install and start learning, includes some small code demos, and provides additional resources for information. The presentation introduces Spark and its core concepts, compares it to Hadoop MapReduce in areas like speed, usability, tools, and deployment, demonstrates how to use Spark SQL with an example, and shows a visualization demo. It aims to provide attendees with a high-level understanding of Spark without being a training class or workshop.

Improving Apache Spark for Dynamic Allocation and Spot Instances

1. Improving Spark for Dynamic Allocation & Spot Instances
Holden Karau | Data / AI Summit | @holdenkarau
Apple logo is a trademark of Apple Inc.

2. Who am I?
• Holden Karau
• She / her
• Apache Spark PMC
• Contributor to a lot of other projects
• Co-author of High Performance Spark, Learning Spark, and Kubeflow for Machine Learning
• http://bit.ly/holdenSparkVideos
• https://youtube.com/user/holdenkarau

3. Apple logo is a trademark of Apple Inc.

4. Let us start at the beginning
• Spark achieves resilience through re-computation, which is part of how we go fast
• This poses challenges with removing executors that may contain data
• We "solved" it for YARN/Mesos back in the day
• I drank waaaay too much coffee and came up with an alternative
• But no one really liked it because we didn't need it, so I closed the Google doc and forgot about it
• Don’t worry, we’ll get to the code soon :)

5. But then….
• The "cloud" became really popular
• Kubernetes became popular
• Everything caught on fire :/

6. Our Protagonist Remembers
• I started drinking a lot of coffee
• We dusted off that old design and wrote some code
• And then I got hit by a car
• More people wrote more code
• We had a VOTE
• We wrote waaaaay more code
• Everyone lived happily ever after?
Photo by Lukas from Pexels

7. How did DA work on YARN?
• Scale up is "easy" (add more resources)
• Scale down required a stay-resident program on each YARN node to serve any files
• Spark stored its shuffle data as files
• Persisted in-memory data was still lost when scaling down an executor
Photo by Markus Spiske from Pexels

8. Why did the cloud impact this?
• If you wanted the ~50% cost saving of spot/preemptible instances, you might lose entire machines
• Yes, Spark can "handle" this, but does so by recomputing data (expensive)
• You can't depend on leaving a program around to serve files when the server is just gone
• So we need to find a way to migrate the data

9. Ok sure the cloud, but K8s?
• Kubernetes doesn't like the idea of scheduling a stay-resident program on every node
• Also, most people don't like the idea of shared disk here either (across jobs/users)
• So we need to find a way to migrate the data
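The "migrate the data" idea from slides 8 and 9 can be illustrated with a toy model. This is only a sketch and not Spark's actual implementation: the function name `migrate_blocks`, the executor ids, the block ids, and the least-loaded-peer placement rule are all assumptions made for illustration.

```python
def migrate_blocks(blocks_by_executor, leaving):
    """Toy model: move every block off `leaving` onto the least-loaded peer.

    blocks_by_executor maps a (hypothetical) executor id to a list of block ids.
    Returns a new mapping that no longer contains the leaving executor.
    """
    # Copy the peers so the caller's mapping is not mutated.
    peers = {e: list(b) for e, b in blocks_by_executor.items() if e != leaving}
    if not peers:
        raise ValueError("no peers left to receive blocks")
    for block in blocks_by_executor.get(leaving, []):
        # Pick the peer holding the fewest blocks; break ties by executor id.
        target = min(peers, key=lambda e: (len(peers[e]), e))
        peers[target].append(block)
    return peers


if __name__ == "__main__":
    cluster = {
        "exec-1": ["rdd_0_0", "shuffle_0_0_0"],  # about to be decommissioned
        "exec-2": ["rdd_0_1"],
        "exec-3": [],
    }
    print(migrate_blocks(cluster, "exec-1"))
```

The point of the sketch is the contrast with recomputation: the blocks held by the departing executor end up on surviving peers instead of being rebuilt from lineage.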

10. SPARK-20624
• Yee-haw!
• Ok, but more seriously, how does it work? Great question, let's open up the code
• BlockManagerDecommissioner.scala is where most of the magic happens

11. Collaboration
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Decommissioning-SPIP-td29701.html
https://github.com/apache/spark/pulls?q=is%3Apr+decommission+is%3Aclosed+

12. Ok what about the car?
Getting hit by a car sucks a lot.
Slowed down dev work while I did rehab to be able to walk & type again.
Shout out to everyone who helped me recover (from my wife, girlfriend, partners, my friends, to the hospital staff, nursing home, PT, OT, ambulance, my employer for giving me time off, the Spark community for understanding I needed time off <3)

13. On a Happy Note: You can try this now
It’s early though, so please be careful
• Enable the following:
- spark.decommission.enabled
- spark.storage.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
- spark.storage.decommission.shuffleBlocks.enabled
• Want to get fancy? Optionally enable:
- spark.shuffle.externalStorage.enabled
- And configure a storage backend (spark.shuffle.externalStorage.backend)
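One way to try the four required settings from slide 13 is to pass them with `spark-submit --conf`. The sketch below only builds that command line; the helper name `to_submit_args` and the job file `my_job.py` are hypothetical, and the values assume each setting is a boolean flag set to `true`. The optional external-storage settings are left out because the slide does not say what values the backend expects.

```python
# Decommissioning settings listed on the slide, assumed to be boolean flags.
DECOMMISSION_CONFS = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
}


def to_submit_args(confs):
    """Render a dict of settings as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(confs.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # my_job.py is a placeholder for your own application.
    command = ["spark-submit"] + to_submit_args(DECOMMISSION_CONFS) + ["my_job.py"]
    print(" ".join(command))
```

Keeping the settings in one dictionary makes it easy to switch the whole group on or off while the feature is still new.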

14. Future work
• Heuristics to migrate data
• Improve container pre-emption selection
• Better heuristics around when to scale up and down containers

15. Please review this talk :)
