Optimizations in
Apache Spark
Presented By: Sarfaraz Hussain
Software Consultant
Knoldus Inc.
About Knoldus
Knoldus is a technology consulting firm with focus on modernizing the digital systems
at the pace your business demands.
DevOps
Functional. Reactive. Cloud Native
01 Spark Execution Model
02 Optimizing Shuffle Operations
03 Optimizing Functions
04 SQL vs RDD
05 Logical & Physical Plan
Agenda
06 Optimizing Joins
Apple
Banana
Orange
Apple
Cat
Dog
Cow
Orange
Cow
Banana
RDD
Optimizations in Spark; RDD, DataFrame
● Two kinds of operations:
1. Transformation
2. Action
● Dependencies are divided into two types:
1. Narrow Dependency
2. Wide Dependency
● Stages
Spark Execution Model
DAG
Stage Details
Narrow Transformation: map, mapValues, flatMap, filter, mapPartitions
Wide Transformation: cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce
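The cost difference between narrow and wide transformations is easiest to see with a classic shuffle optimization: reduceByKey pre-aggregates on the map side, while groupByKey ships every record. The following is a plain-Python sketch of that idea (it is not Spark code; the partition lists and counts are illustrative assumptions):

```python
from collections import defaultdict

# Plain-Python sketch (not actual Spark) of why reduceByKey is cheaper than
# groupByKey: each "partition" pre-aggregates locally (map-side combine), so
# far less data crosses the shuffle boundary.

partitions = [
    [("apple", 1), ("banana", 1), ("apple", 1)],
    [("apple", 1), ("orange", 1), ("banana", 1)],
]

def local_combine(partition):
    """Map-side combine: merge values for each key within one partition."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return list(acc.items())

# groupByKey would ship all 6 records; reduceByKey ships one record per key
# per partition after the local combine.
shuffled = [rec for part in partitions for rec in local_combine(part)]

final = defaultdict(int)
for key, value in shuffled:
    final[key] += value

print(dict(final))   # {'apple': 3, 'banana': 2, 'orange': 1}
print(len(shuffled)) # 5 records shuffled instead of 6
```

The larger each partition is relative to its number of distinct keys, the bigger the saving from the local combine.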
Shuffle Operations
What is Shuffle?
- Shuffles are data transfers between different executors of a Spark cluster.
Shuffle Operations
1. To which executors does the data need to be sent?
2. How should the data be sent?
GroupByKey
Shuffle Operations
Where to send data?
- Partitioner - The partitioner defines how records are distributed across
partitions, and thus which records are processed by each task.
Partitioner
Types of partitioner:
- Hash Partitioner: Uses Java’s Object.hashCode method to determine the partition as:
partition = key.hashCode() % numPartitions.
- Range Partitioner: Partitions data based on a set of sorted ranges of keys;
tuples whose keys fall in the same range will be on the same machine. This method
is suitable when there is a natural ordering in the keys and the keys are
non-negative.
Example:
Hash Partitioner - GroupByKey, ReduceByKey
Range Partitioner - SortByKey
Further reading: https://www.edureka.co/blog/demystifying-partitioning-in-spark
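The hash-partitioning formula above can be sketched in plain Python (not Spark code; Python's `hash()` stands in for Java's `Object.hashCode`, and the records are made up for illustration):

```python
# Plain-Python sketch of a hash partitioner:
#   partition = key.hashCode() % numPartitions
# Python's % already yields a non-negative result for a positive modulus,
# matching Spark's behavior of keeping partition ids non-negative.

num_partitions = 3

def hash_partition(key, num_partitions):
    return hash(key) % num_partitions

records = [("apple", 1), ("banana", 2), ("apple", 3), ("orange", 4)]
partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[hash_partition(key, num_partitions)].append((key, value))

# All records with the same key land in the same partition, which is what
# makes per-key operations like groupByKey/reduceByKey possible.
pa = hash_partition("apple", num_partitions)
assert ("apple", 1) in partitions[pa] and ("apple", 3) in partitions[pa]
```

This per-key co-location is exactly the property the shuffle is paying for.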
Co-partitioned RDD
RDDs are co-partitioned if they are partitioned by the same partitioner.
Co-located RDD
Partitions are co-located if they are both loaded into the memory of the same machine
(executor).
Shuffle Operations
How to send data?
- Serialization - The mechanism of representing an object as a stream of bytes,
transferring it over the network, and then reconstructing the same object and its
state on another computer.
Serializer in Spark
- Types of Serializer in Spark -
- Java : slow, but robust
- Kryo : fast, but has a few limitations (e.g. custom classes should be
registered for best results)
Further Reading: https://spark.apache.org/docs/latest/tuning.html#data-serialization
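The serialize-and-reconstruct round trip can be sketched with Python's standard `pickle` module (an illustration of the concept only; Spark's Java and Kryo serializers play this role in the cluster):

```python
import pickle

# Plain-Python sketch of serialization: an object becomes a stream of bytes
# that can cross the network and be reconstructed, state intact, on another
# machine.

record = {"key": "apple", "counts": [1, 2, 3]}

payload = pickle.dumps(record)   # object -> bytes (what would go on the wire)
assert isinstance(payload, bytes)

restored = pickle.loads(payload) # bytes -> an equal object with the same state
assert restored == record
```

Every shuffled record pays this encode/decode cost, which is why a faster serializer like Kryo matters.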
Optimizing Functions In Transformation
map vs mapPartitions
- map applies the supplied function at a per-element level, while mapPartitions
applies it at the partition level.
- map: Applies a transformation function to each item of the RDD and returns the
result as a new RDD.
- mapPartitions: The function is called only once for each partition. The entire
content of the respective partition is available as a sequential stream of values
via the input argument (Iterator[T]).
Further reading: https://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions
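The practical payoff can be sketched in plain Python (not Spark code; the "expensive setup" stands in for something like opening a database connection, and the partition data is made up):

```python
# Plain-Python sketch of map vs mapPartitions. The point: with a
# per-partition function, expensive setup runs once per partition
# instead of once per element.

setup_calls = 0

def expensive_setup():
    global setup_calls
    setup_calls += 1
    return lambda x: x * 2   # stand-in for "use the connection"

partitions = [[1, 2, 3], [4, 5, 6]]

# map-style: setup happens per element -> 6 setups
map_result, setup_calls = [], 0
for part in partitions:
    for x in part:
        f = expensive_setup()
        map_result.append(f(x))
assert setup_calls == 6

# mapPartitions-style: setup once per partition, then stream the values -> 2 setups
mp_result, setup_calls = [], 0
for part in partitions:
    f = expensive_setup()            # once per partition
    mp_result.extend(f(x) for x in part)
assert setup_calls == 2

assert map_result == mp_result == [2, 4, 6, 8, 10, 12]
```

Same result either way; the per-partition variant simply amortizes the setup cost.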
SQL vs RDD
- SQL is high-level; RDD is a low-level API.
- SQL focuses on "WHAT"; RDD focuses on "HOW".
- Spark takes care of optimizing most SQL queries; optimizing an RDD is the
developer's responsibility.
- SQL is declarative; RDD is imperative, i.e. we need to specify each step of
the computation.
- SQL knows about your data; RDD doesn't know anything about your data.
- SQL does not involve much serialization/deserialization, as the Catalyst
Optimizer takes care of optimizing it; RDD involves a lot of
serialization/deserialization.
SQL
RDD
Logical & Physical Plan
● Logical Plan
- Unresolved Logical Plan OR Parsed Logical Plan
- Resolved Logical Plan OR Logical Plan OR Analyzed Logical Plan
- Optimized Logical Plan
● Catalog
● Catalyst Optimizer
● Tungsten
● Physical Plan
Logical & Physical Plan
https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/
Catalyst Optimizer and Tungsten
Codegen
Once the best Physical Plan is selected, it is time to generate the executable
code (a DAG of RDDs) for the query, to be executed in the cluster in a distributed
fashion. This process is called Codegen, and it is the job of Spark's Tungsten
Execution Engine.
Let’s see them in action!
Unresolved Logical Plan
Resolved Logical Plan
Optimized Logical Plan
Physical Plan
Optimizing Joins
Types of Joins -
a. Shuffle hash Join
b. Sort-merge Join
c. Broadcast Join
Shuffle hash Join
- Used when the join keys are not sortable.
- Used when Sort-merge Join is disabled, i.e.
spark.sql.join.preferSortMergeJoin is false.
- One side is much smaller (at least 3 times) than the other.
- The smaller side must be small enough to build an in-memory hash map from it.
Sort-merge Join
- spark.sql.join.preferSortMergeJoin is true by default.
- Default Join implementation.
- Join keys must be sortable.
- In our previous example, Sort-merge Join took place.
- Use Bucketing : pre-shuffle + pre-sort the data based on the join key.
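The merge step of a sort-merge join can be sketched in plain Python (a concept illustration, not Spark's implementation; the two record lists are made up):

```python
# Plain-Python sketch of a sort-merge join: both sides are sorted by the
# join key, then merged with two cursors. This is why the join keys must
# be sortable, and why pre-sorted (bucketed) inputs can skip the sort step.

left = sorted([("b", 1), ("a", 2), ("c", 3)])
right = sorted([("a", "x"), ("b", "y"), ("b", "z"), ("d", "w")])

def sort_merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all right-side matches for this left key.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

print(sort_merge_join(left, right))
# [('a', 2, 'x'), ('b', 1, 'y'), ('b', 1, 'z')]
```

Each side is scanned once, which is what makes this the default join for large, sortable keys.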
Bucketing
- Bucketing helps to pre-compute the shuffle and store the data as a bucketed
input table, thus avoiding a shuffle in each job that reads it.
- SET spark.sql.sources.bucketing.enabled = TRUE
Broadcast Join
- Broadcasts the smaller DataFrame to all worker nodes.
- Performs a map-side join.
- No shuffle operations take place.
- spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)
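The map-side mechanics can be sketched in plain Python (a concept illustration, not Spark code; the small lookup table and partitions are made up):

```python
# Plain-Python sketch of a broadcast (map-side) join: the small side is
# turned into a hash map and shipped whole to every worker, so the big
# side is joined in place with no shuffle.

small = {"apple": "fruit", "cow": "animal"}   # broadcast to every worker

big_partitions = [                             # big side stays where it is
    [("apple", 1), ("dog", 2)],
    [("cow", 3), ("apple", 4)],
]

def map_side_join(partition, broadcast):
    # Each worker joins its own partition against its local broadcast copy.
    return [(k, v, broadcast[k]) for k, v in partition if k in broadcast]

joined = [row for part in big_partitions for row in map_side_join(part, small)]
print(joined)
# [('apple', 1, 'fruit'), ('cow', 3, 'animal'), ('apple', 4, 'fruit')]
```

No record of the big side ever moves between partitions, which is the whole point of the broadcast strategy.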
Caching/Persisting
a. It keeps the lineage intact.
b. Data is cached in the executors' memory and fetched from the cache on reuse.
c. If some cached partitions are lost, they can be recomputed from scratch via
the lineage; the surviving cached partitions are not recomputed. (Handled
automatically by Spark.)
d. Subsequent uses of an RDD will not recompute anything before the point where
it is cached.
e. The cache is cleared when the SparkContext is destroyed.
f. Persisting is unreliable (cached data can be evicted).
g. data.persist() OR data.cache()
Checkpointing
a. It breaks the lineage.
b. Data is written and fetched from HDFS or local file system.
c. Data cannot be recomputed from scratch if some partitions are lost,
as the lineage chain has been discarded.
d. Checkpointed data can be used in subsequent job run.
e. Checkpointed data is persistent and not removed after SparkContext is
destroyed.
f. Checkpointing is reliable.
Checkpointing
spark.sparkContext.setCheckpointDir("/hdfs_directory/")
myRdd.checkpoint()
df.rdd.checkpoint()
Why make a checkpoint?
- Busy cluster.
- Expensive and long computations.
Thank You!
https://www.linkedin.com/in/sarfaraz-hussain-8123b4132/
sarfaraz.hussain@knoldus.com