Advanced Spark Programming (2)
Advanced Spark Programming
Shared variables:
● When we pass functions such as map(), every node gets its own copy of the variables they reference
● Changes made to these copies on the workers are not communicated back to the driver
● Once map() has started, changes to the variable on the driver do not affect the workers
Two kinds:
1. Accumulators, to aggregate information
2. Broadcast variables, to efficiently distribute large values
Advanced Spark Programming
SHARED MEMORY - Accumulators
[Diagram: two worker tasks update a shared accumulator with += 10 and += 20. Accumulators are only “added” to through an associative operation, e.g. 2+3+4 = 2+4+3 = 9.]
Advanced Spark Programming
● Accumulators are variables that are only “added” to through an associative operation
● They can therefore be efficiently supported in parallel
● They can be used to implement counters (as in MapReduce) or sums
Accumulators
Advanced Spark Programming
Accumulator : Empty line count
https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
sc.setLogLevel("ERROR")
var file = sc.textFile("/data/mr/wordcount/input/")
var numBlankLines = sc.accumulator(0)
def toWords(line:String): Array[String] = {
if(line.length == 0) {numBlankLines += 1}
return line.split(" ");
}
var words = file.flatMap(toWords)
words.saveAsTextFile("words3")
printf("Blank lines: %d", numBlankLines.value)
//Blank lines: 24857
Advanced Spark Programming
● Spark re-executes failed or slow tasks.
● It may preemptively launch a “speculative” copy of a slow worker task.
The net result: the same function may run multiple times on the same data.
Does this mean accumulators will give the wrong result?
Yes, for accumulators updated in transformations.
No, for accumulators updated in actions.
Accumulators and Fault Tolerance
Advanced Spark Programming
○ For accumulators used in actions, each task’s update is applied exactly once.
○ For a reliable counter, update the accumulator inside an action (e.g. foreach).
○ In transformations, this guarantee doesn't exist.
○ In transformations, use accumulators for debugging only.
Accumulators and Fault Tolerance
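The guarantee is easier to see side by side. The sketch below is not from the linked gist; it assumes a running SparkContext sc and uses sc.longAccumulator, the Spark 2.x+ equivalent of the sc.accumulator calls shown earlier.
// Minimal sketch (Scala, RDD API); assumes a running SparkContext `sc`.
val data = sc.parallelize(1 to 1000)

// Reliable: updated inside an action (foreach); each task's update
// is applied exactly once, even if the task is retried.
val evenCount = sc.longAccumulator("evenCount")
data.foreach(x => if (x % 2 == 0) evenCount.add(1))
println(s"Even numbers: ${evenCount.value}")   // 500

// Debug-only: updated inside a transformation (map); speculative or
// re-executed tasks can make this over-count.
val seen = sc.longAccumulator("seen")
val doubled = data.map { x => seen.add(1); x * 2 }
doubled.count()   // the count itself is correct
println(s"Elements seen (may over-count): ${seen.value}")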
Advanced Spark Programming
Custom Accumulators
● Out of the box, Spark supports accumulators of type Double, Long,
and Float.
● Spark also includes an API to define custom accumulator types
and custom aggregation operations
○ (e.g., finding the maximum of the accumulated values instead of
adding them).
● Custom accumulators need to extend AccumulatorV2.
Advanced Spark Programming
Custom Accumulators - version 1.x
Advanced Spark Programming
class MyComplex(var x: Int, var y: Int) extends Serializable{
def reset(): Unit = {
x = 0
y = 0
}
def add(p:MyComplex): MyComplex = {
x = x + p.x
y = y + p.y
return this
}
}
Custom Accumulators - version 1.x
https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
Advanced Spark Programming
import org.apache.spark.AccumulatorParam
class ComplexAccumulatorV1 extends AccumulatorParam[MyComplex] {
def zero(initialVal: MyComplex): MyComplex = {
return initialVal
}
def addInPlace(v1: MyComplex, v2: MyComplex): MyComplex = {
v1.add(v2)
return v1;
}
}
Custom Accumulators - version 1.x
https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
Advanced Spark Programming
val vecAccum = sc.accumulator(new MyComplex(0,0))(new ComplexAccumulatorV1)
var myrdd = sc.parallelize(Array(1,2,3))
def myfunc(x:Int):Int = {
vecAccum += new MyComplex(x, x)
return x * 3
}
var myrdd1 = myrdd.map(myfunc)
myrdd1.collect()
vecAccum.value.x
vecAccum.value.y
Custom Accumulators - version 1.x
https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
Advanced Spark Programming
import org.apache.spark.util.AccumulatorV2
object ComplexAccumulatorV2 extends AccumulatorV2[MyComplex, MyComplex] {
private val myc:MyComplex = new MyComplex(0,0)
def reset(): Unit = {
myc.reset()
}
def add(v: MyComplex): Unit = {
myc.add(v)
}
def value():MyComplex = {
return myc
}
def isZero(): Boolean = {
return (myc.x == 0 && myc.y == 0)
}
def copy():AccumulatorV2[MyComplex, MyComplex] = {
return ComplexAccumulatorV2
}
def merge(other:AccumulatorV2[MyComplex, MyComplex]) = {
myc.add(other.value)
}
}
sc.register(ComplexAccumulatorV2, "mycomplexacc")
Custom Accumulators - version 2.x
https://gist.github.com/girisandeep/35b21cca890157afe0084a9e400e2e70
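A minimal usage sketch (not part of the slide above), mirroring the 1.x example; it assumes MyComplex and ComplexAccumulatorV2 are already defined and registered as shown.
// Drive the registered AccumulatorV2 the same way as the 1.x accumulator.
var myrdd = sc.parallelize(Array(1, 2, 3))
def myfunc(x: Int): Int = {
  ComplexAccumulatorV2.add(new MyComplex(x, x))
  x * 3
}
var myrdd1 = myrdd.map(myfunc)
myrdd1.collect()
// Read the merged result on the driver after the action.
ComplexAccumulatorV2.value.x
ComplexAccumulatorV2.value.y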
Advanced Spark Programming
commonWords = List("a", "an", "the", "of", "at",
"is", "am", "are", "this", "that", "", "at")
If we need to remove the common words from our
wordcount, what do we need to do?
> We can create a local variable and use it
> Is it inefficient?
Broadcast Variables : Introduction
Advanced Spark Programming
Yes, because
1. Spark sends all referenced variables to all worker nodes with every task.
2. The default task-launching mechanism is optimised for small task sizes.
3. If the variable is used in multiple operations, Spark sends it again to all nodes each time.
So, we use a broadcast variable instead.
Broadcast Variables : Introduction
Advanced Spark Programming
[Diagram: SHARED MEMORY - Broadcast Variables. The driver ships the value once with sc.broadcast(); each Spark application/executor keeps it in a local cache and reads it with broadcast.value(), on top of RDDs backed by the Hadoop Distributed File System (HDFS).]
Advanced Spark Programming
● Efficiently send a large, read-only value to workers
● For example:
○ Send a large, read-only lookup table to all the nodes
○ Large feature vector in a machine learning algorithm
● It is like Hadoop's distributed cache
● Spark distributes broadcast variables efficiently to reduce communication
cost.
● Useful when
○ Tasks across multiple stages need the same data
○ Caching the data in deserialized form is important.
Broadcast Variables
Advanced Spark Programming
Broadcast Variables : Example
Removing Common Words using Broadcast.
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
var commonWords = Array("a", "an", "the", "of", "at", "is", "am","are","this","that","at",
"in", "or", "and", "or", "not", "be", "for", "to", "it")
val commonWordsMap = collection.mutable.Map[String, Int]()
for(word <- commonWords){
commonWordsMap(word) = 1
}
var commonWordsBC = sc.broadcast(commonWordsMap)
var file = sc.textFile("/data/mr/wordcount/input/big.txt")
def toWords(line:String):Array[String] = {
var words = line.split(" ")
var output = Array[String]();
for(word <- words){
if(! (commonWordsBC.value contains word.toLowerCase.trim.replaceAll("[^a-z]","")))
output = output :+ word;
}
return output;
}
var uncommonWords = file.flatMap(toWords)
uncommonWords.take(100)
Advanced Spark Programming
Key Performance Considerations
1. Level of Parallelism
2. Serialization Format
3. Memory Management
4. Hardware Provisioning
Advanced Spark Programming
Level of Parallelism
Too few ⇒ might leave resources idle
Too many ⇒ the small overhead of each partition adds up
By default
● One task per partition,
● Each task uses a single core in the cluster.
● The default number of partitions is based on the underlying storage or the CPU count
● HDFS-backed RDDs - one partition per block
Advanced Spark Programming
Key Performance Considerations
1. Level of Parallelism - How many default partitions?
Advanced Spark Programming
Key Performance Considerations - Partitions
/data/msprojects/in_table.csv should theoretically have 62 blocks. Let's check.
$ hadoop fs -ls /data/msprojects/in_table.csv
-rw-r--r-- 3 sandeep sandeep 8303338297 2017-04-18 02:26 /data/msprojects/in_table.csv
$ python
>>> 8303338297.0/128.0/1024.0/1024.0
61.86469120532274
>>>
$ hdfs fsck /data/msprojects/in_table.csv
…..
Total blocks (validated): 62 (avg. block size 133924811 B)
Yes, it actually has 62 blocks.
Advanced Spark Programming
$ spark-shell --master yarn
scala> var myrdd = sc.textFile("/data/msprojects/in_table.csv")
scala> myrdd.partitions.length
res1: Int = 62
Key Performance Considerations - Partitions
So, for sc.textFile, the number of partitions is a function of the number of underlying data blocks.
Advanced Spark Programming
Key Performance Considerations - Partitions
// In the local mode
spark-shell
scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res1: Int = 4
[sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo|grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Since my machine has 4 cores, it has created 4 partitions.
Advanced Spark Programming
$ spark-shell --master yarn
scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res6: Int = 2
When running in YARN mode, the default number of partitions for sc.parallelize is a function of the total cores available to the executors (spark.default.parallelism); here it is 2.
Key Performance Considerations - Partitions
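To inspect or override that default, check sc.defaultParallelism in the shell or set spark.default.parallelism at launch time; the output below is illustrative.
scala> sc.defaultParallelism
res7: Int = 2
$ spark-shell --master yarn --conf spark.default.parallelism=8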
Advanced Spark Programming
Level of Parallelism
1. Specify the number of partitions in sc.parallelize and sc.textFile (see the sketch below)
2. Shuffle operations such as reduceByKey() accept a degree-of-parallelism parameter
3. repartition() or partitionBy()
4. To efficiently shrink an RDD, prefer coalesce() over repartition()
How to control parallelism?
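A minimal sketch of each control, assuming a running SparkContext sc; the paths and partition counts are illustrative.
// 1. Ask for a specific number of partitions at creation time.
val nums  = sc.parallelize(1 to 100000, 16)
val lines = sc.textFile("/data/mr/wordcount/input/", 16)   // minimum partitions

// 2. Pass a degree of parallelism to shuffle operations.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 32)

// 3. Reshuffle an existing RDD across more partitions, or by key.
val repartitioned = counts.repartition(64)
val byKey = counts.partitionBy(new org.apache.spark.HashPartitioner(64))

// 4. Shrink without a full shuffle.
val fewer = repartitioned.coalesce(8)
println(fewer.partitions.length)   // 8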
Advanced Spark Programming
Level of Parallelism
1. We are reading a large amount of data from S3.
2. A filter() operation is likely to keep only a tiny fraction of it.
3. The result of filter() has the same number of partitions as the parent RDD, but many of them are now empty or small.
4. Improve the application’s performance by coalescing to fewer partitions (see the sketch below).
Example
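A sketch of this scenario; the S3 path and partition counts are illustrative only.
// Read from S3, then filter down to a tiny fraction of the data.
val logs = sc.textFile("s3a://my-bucket/logs/2017/*")   // e.g. thousands of partitions
val errors = logs.filter(_.contains("ERROR"))           // same partition count, mostly empty

// Shrink to a sensible number of partitions before further work;
// coalesce() avoids a full shuffle when only reducing the count.
val compact = errors.coalesce(100)
compact.cache()
println(compact.count())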
Advanced Spark Programming
Serialization Format
● Objects need to be serialized whenever they are transferred over the network or saved to disk
● The choice of format mainly matters during large transfers
● By default, Spark uses Java’s built-in serializer.
Advanced Spark Programming
Serialization Format
Benchmarks
https://github.com/eishay/jvm-serializers/wiki
Advanced Spark Programming
Serialization Format
Kryo
● Spark also supports the use of Kryo
● Faster and more compact
● But cannot serialize all types of objects “out of the box.”
● Almost all applications will benefit from shifting to Kryo
● To use,
○ set spark.serializer to org.apache.spark.serializer.KryoSerializer on the SparkConf before the SparkContext is created
● For best performance, register classes with Kryo
○ conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
○ Classes shipped to executors inside closures still need to implement Java’s Serializable interface
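A minimal sketch of wiring this up when building the context yourself (outside spark-shell); MyClass1 and MyClass2 are placeholder types.
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder types standing in for your own classes.
case class MyClass1(id: Int, name: String)
case class MyClass2(features: Array[Double])

// Configure Kryo on the SparkConf before the SparkContext is created
// (setting it on sc.getConf after startup has no effect).
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

// Or equivalently at launch time:
// spark-submit --conf spark.serializer=org.apache.spark.serializer.KryoSerializer ...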
Advanced Spark Programming
RDD storage
● Memory used by persist()'ed / cached RDD partitions
● spark.storage.memoryFraction - Default: 60%
● If the limit is exceeded, older partitions are dropped
○ and recomputed on demand when needed again
● For huge data, use persist() with MEMORY_AND_DISK (see the sketch below)
Memory Management
Shuffle and aggregation buffers
● For storing shuffle output data
● spark.shuffle.memoryFraction - Default: 20%
User code
● The remainder - Default: 20% of memory
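A short sketch of the MEMORY_AND_DISK suggestion, reusing the input path from the earlier word-count example.
import org.apache.spark.storage.StorageLevel

// Spill partitions that don't fit in memory to local disk instead of
// dropping them and recomputing later.
val bigRdd = sc.textFile("/data/mr/wordcount/input/")
bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
bigRdd.count()   // the first action materializes and caches the partitions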
Advanced Spark Programming
● Main parameters
○ Executor memory (spark.executor.memory)
○ Number of cores per executor
○ Total number of executors
○ Number of local disks
● Application speed is driven by memory and cores, but
○ Huge heaps -> long GC pauses
○ Keep executors to 64 GB or less
● Scaling is roughly linear
○ 2 x hardware ≈ 2 x speed
Hardware Provisioning
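The usual place to set these parameters is the SparkConf or the spark-submit command line; the sizes below are illustrative only, not recommendations.
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sizing; tune for your own cluster.
// Equivalent spark-submit flags:
//   spark-submit --master yarn --executor-memory 8g --executor-cores 4 --num-executors 10 ...
val conf = new SparkConf()
  .setAppName("sizing-example")
  .set("spark.executor.memory", "8g")
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "10")
val sc = new SparkContext(conf)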
Thank you!
Advanced Spark Programming