Optimizations in
Apache Spark
Presented By: Sarfaraz Hussain
Software Consultant
Knoldus Inc.
About Knoldus
Knoldus is a technology consulting firm with focus on modernizing the digital systems
at the pace your business demands.
DevOps
Functional. Reactive. Cloud Native
01 Spark Execution Model
02 Optimizing Shuffle Operations
03 Optimizing Functions
04 SQL vs RDD
05 Logical & Physical Plan
Agenda
06 Optimizing Joins
Apple
Banana
Orange
Apple
Cat
Dog
Cow
Orange
Cow
Banana
RDD
Optimizations in Spark; RDD, DataFrame
● Two kinds of operations:
1. Transformation
2. Action
● Dependencies are divided into two types:
1. Narrow Dependency
2. Wide Dependency
● Stages
Spark Execution Model
DAG
Stage Details
Narrow Transformation: map, mapValues, flatMap, filter, mapPartitions
Wide Transformation: cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce
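The cost difference between narrow and wide transformations is easiest to see with a classic shuffle optimization: reduceByKey pre-aggregates on the map side, while groupByKey ships every record. The following is a plain-Python sketch of that idea (it is not Spark code; the partition lists and counts are illustrative assumptions):

```python
from collections import defaultdict

# Plain-Python sketch (not actual Spark) of why reduceByKey is cheaper than
# groupByKey: each "partition" pre-aggregates locally (map-side combine), so
# far less data crosses the shuffle boundary.

partitions = [
    [("apple", 1), ("banana", 1), ("apple", 1)],
    [("apple", 1), ("orange", 1), ("banana", 1)],
]

def local_combine(partition):
    """Map-side combine: merge values for each key within one partition."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return list(acc.items())

# groupByKey would ship all 6 records; reduceByKey ships one record per key
# per partition after the local combine.
shuffled = [rec for part in partitions for rec in local_combine(part)]

final = defaultdict(int)
for key, value in shuffled:
    final[key] += value

print(dict(final))   # {'apple': 3, 'banana': 2, 'orange': 1}
print(len(shuffled)) # 5 records shuffled instead of 6
```

The larger each partition is relative to its number of distinct keys, the bigger the saving from the local combine.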
Shuffle Operations
What is Shuffle?
- Shuffles are data transfers between different executors of a Spark cluster.
Shuffle Operations
1. To which executors does the data need to be sent?
2. How should the data be sent?
GroupByKey
Shuffle Operations
Where to send data?
- Partitioner - The partitioner defines how records are distributed across
partitions, and thus which records are processed by each task.
Partitioner
Types of partitioner:
- Hash Partitioner: Uses Java’s Object.hashCode method to determine the partition as:
partition = key.hashCode() % numPartitions.
- Range Partitioner: Partitions data based on a set of sorted ranges of keys;
tuples whose keys fall in the same range will be on the same machine. This method
is suitable when there is a natural ordering in the keys and the keys are
non-negative.
Example:
Hash Partitioner - GroupByKey, ReduceByKey
Range Partitioner - SortByKey
Further reading: https://www.edureka.co/blog/demystifying-partitioning-in-spark
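The hash-partitioning formula above can be sketched in plain Python (not Spark code; Python's `hash()` stands in for Java's `Object.hashCode`, and the records are made up for illustration):

```python
# Plain-Python sketch of a hash partitioner:
#   partition = key.hashCode() % numPartitions
# Python's % already yields a non-negative result for a positive modulus,
# matching Spark's behavior of keeping partition ids non-negative.

num_partitions = 3

def hash_partition(key, num_partitions):
    return hash(key) % num_partitions

records = [("apple", 1), ("banana", 2), ("apple", 3), ("orange", 4)]
partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[hash_partition(key, num_partitions)].append((key, value))

# All records with the same key land in the same partition, which is what
# makes per-key operations like groupByKey/reduceByKey possible.
pa = hash_partition("apple", num_partitions)
assert ("apple", 1) in partitions[pa] and ("apple", 3) in partitions[pa]
```

This per-key co-location is exactly the property the shuffle is paying for.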
Co-partitioned RDD
RDDs are co-partitioned if they are partitioned by the same partitioner.
Co-located RDD
Partitions are co-located if they are both loaded into the memory of the same machine
(executor).
Shuffle Operations
How to send data?
- Serialization - The mechanism of representing an object as a stream of bytes,
transferring it over the network, and then reconstructing the same object and its
state on another computer.
Serializer in Spark
- Types of Serializer in Spark -
- Java : slow, but robust
- Kryo : fast, but has a few limitations (e.g. custom classes should be
registered for best results)
Further Reading: https://spark.apache.org/docs/latest/tuning.html#data-serialization
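The serialize-and-reconstruct round trip can be sketched with Python's standard `pickle` module (an illustration of the concept only; Spark's Java and Kryo serializers play this role in the cluster):

```python
import pickle

# Plain-Python sketch of serialization: an object becomes a stream of bytes
# that can cross the network and be reconstructed, state intact, on another
# machine.

record = {"key": "apple", "counts": [1, 2, 3]}

payload = pickle.dumps(record)   # object -> bytes (what would go on the wire)
assert isinstance(payload, bytes)

restored = pickle.loads(payload) # bytes -> an equal object with the same state
assert restored == record
```

Every shuffled record pays this encode/decode cost, which is why a faster serializer like Kryo matters.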
Optimizing Functions In Transformation
map vs mapPartitions
- map applies the supplied function at a per-element level, while mapPartitions
applies it at the partition level.
- map: Applies a transformation function to each item of the RDD and returns the
result as a new RDD.
- mapPartitions: The function is called only once for each partition. The entire
content of the respective partition is available as a sequential stream of values
via the input argument (Iterator[T]).
Further reading: https://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions
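The practical payoff can be sketched in plain Python (not Spark code; the "expensive setup" stands in for something like opening a database connection, and the partition data is made up):

```python
# Plain-Python sketch of map vs mapPartitions. The point: with a
# per-partition function, expensive setup runs once per partition
# instead of once per element.

setup_calls = 0

def expensive_setup():
    global setup_calls
    setup_calls += 1
    return lambda x: x * 2   # stand-in for "use the connection"

partitions = [[1, 2, 3], [4, 5, 6]]

# map-style: setup happens per element -> 6 setups
map_result, setup_calls = [], 0
for part in partitions:
    for x in part:
        f = expensive_setup()
        map_result.append(f(x))
assert setup_calls == 6

# mapPartitions-style: setup once per partition, then stream the values -> 2 setups
mp_result, setup_calls = [], 0
for part in partitions:
    f = expensive_setup()            # once per partition
    mp_result.extend(f(x) for x in part)
assert setup_calls == 2

assert map_result == mp_result == [2, 4, 6, 8, 10, 12]
```

Same result either way; the per-partition variant simply amortizes the setup cost.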
SQL vs RDD
- SQL is high-level; RDD is a low-level API.
- SQL focuses on "WHAT"; RDD focuses on "HOW".
- Spark takes care of optimizing most SQL queries; optimizing an RDD is the
developer's responsibility.
- SQL is declarative; RDD is imperative, i.e. we need to specify each step of
the computation.
- SQL knows about your data; RDD doesn't know anything about your data.
- SQL does not involve much serialization/deserialization, as the Catalyst
Optimizer takes care of optimizing it; RDD involves a lot of
serialization/deserialization.
SQL
RDD
Logical & Physical Plan
● Logical Plan
- Unresolved Logical Plan OR Parsed Logical Plan
- Resolved Logical Plan OR Logical Plan OR Analyzed Logical Plan
- Optimized Logical Plan
● Catalog
● Catalyst Optimizer
● Tungsten
● Physical Plan
Logical & Physical Plan
https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/
Catalyst Optimizer and Tungsten
Codegen
Once the best Physical Plan is selected, it is time to generate the executable
code (a DAG of RDDs) for the query, to be executed in the cluster in a distributed
fashion. This process is called Codegen, and it is the job of Spark's Tungsten
Execution Engine.
Let’s see them in action!
Unresolved Logical Plan
Resolved Logical Plan
Optimized Logical Plan
Physical Plan
Optimizing Joins
Types of Joins -
a. Shuffle hash Join
b. Sort-merge Join
c. Broadcast Join
Shuffle hash Join
- Used when the join keys are not sortable.
- Used when Sort-merge Join is disabled, i.e.
spark.sql.join.preferSortMergeJoin is false.
- One side is much smaller (at least 3 times) than the other.
- The smaller side must be small enough to build an in-memory hash map from it.
Sort-merge Join
- spark.sql.join.preferSortMergeJoin is true by default.
- Default Join implementation.
- Join keys must be sortable.
- In our previous example, Sort-merge Join took place.
- Use Bucketing : pre-shuffle + pre-sort the data based on the join key.
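The merge step of a sort-merge join can be sketched in plain Python (a concept illustration, not Spark's implementation; the two record lists are made up):

```python
# Plain-Python sketch of a sort-merge join: both sides are sorted by the
# join key, then merged with two cursors. This is why the join keys must
# be sortable, and why pre-sorted (bucketed) inputs can skip the sort step.

left = sorted([("b", 1), ("a", 2), ("c", 3)])
right = sorted([("a", "x"), ("b", "y"), ("b", "z"), ("d", "w")])

def sort_merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all right-side matches for this left key.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

print(sort_merge_join(left, right))
# [('a', 2, 'x'), ('b', 1, 'y'), ('b', 1, 'z')]
```

Each side is scanned once, which is what makes this the default join for large, sortable keys.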
Bucketing
- Bucketing helps to pre-compute the shuffle and store the data as a bucketed
input table, thus avoiding a shuffle in each job that reads it.
- SET spark.sql.sources.bucketing.enabled = TRUE
Broadcast Join
- Broadcasts the smaller DataFrame to all worker nodes.
- Performs a map-side join.
- No shuffle operations take place.
- spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)
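The map-side mechanics can be sketched in plain Python (a concept illustration, not Spark code; the small lookup table and partitions are made up):

```python
# Plain-Python sketch of a broadcast (map-side) join: the small side is
# turned into a hash map and shipped whole to every worker, so the big
# side is joined in place with no shuffle.

small = {"apple": "fruit", "cow": "animal"}   # broadcast to every worker

big_partitions = [                             # big side stays where it is
    [("apple", 1), ("dog", 2)],
    [("cow", 3), ("apple", 4)],
]

def map_side_join(partition, broadcast):
    # Each worker joins its own partition against its local broadcast copy.
    return [(k, v, broadcast[k]) for k, v in partition if k in broadcast]

joined = [row for part in big_partitions for row in map_side_join(part, small)]
print(joined)
# [('apple', 1, 'fruit'), ('cow', 3, 'animal'), ('apple', 4, 'fruit')]
```

No record of the big side ever moves between partitions, which is the whole point of the broadcast strategy.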
Caching/Persisting
a. It keeps the lineage intact.
b. Data is cached in the executors' memory and fetched from the cache on reuse.
c. If some cached partitions are lost, they can be recomputed from scratch via
the lineage; the surviving cached partitions are not recomputed. (Handled
automatically by Spark.)
d. Subsequent uses of an RDD will not recompute anything before the point where
it is cached.
e. The cache is cleared when the SparkContext is destroyed.
f. Persisting is unreliable (cached data can be evicted).
g. data.persist() OR data.cache()
Checkpointing
a. It breaks the lineage.
b. Data is written and fetched from HDFS or local file system.
c. Data cannot be recomputed from scratch if some partitions are lost,
as the lineage chain has been discarded.
d. Checkpointed data can be used in subsequent job run.
e. Checkpointed data is persistent and not removed after SparkContext is
destroyed.
f. Checkpointing is reliable.
Checkpointing
spark.sparkContext.setCheckpointDir("/hdfs_directory/")
myRdd.checkpoint()
df.rdd.checkpoint()
Why make a checkpoint?
- Busy cluster.
- Expensive and long computations.
Thank You!
https://www.linkedin.com/in/sarfaraz-hussain-8123b4132/
sarfaraz.hussain@knoldus.com