Spark tutorial, developing
locally and deploying on EMR
Use cases (my biased opinion)
• Interactive and Expressive Data Analysis
• If you feel limited when trying to express yourself in "group by", "join" and "where"
• Only if it is not possible to work with datasets locally
• Entering Danger Zone:
• Spark SQL engine, like Impala/Hive
• Speed up ETLs if your data can fit in memory (speculation)
• Machine learning
• Graph analytics
• Streaming (not mature yet)
Possible working styles
• Develop in an IDE
• Develop as you go in the Spark shell

IDE:
• Easier to manipulate objects, inheritance and package management
• Requires some hacking to get programs to run on both Windows and prod environments

Spark-shell:
• Easier to debug code with production-scale data
• Will only run on Windows if you have the correct line endings in the spark-shell launcher scripts, or if you use Cygwin
IntelliJ IDEA
• Basic set up: https://gitz.adform.com/dspr/audience-extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spark-skeleton (a generic sketch follows below)
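Since that link is Adform-internal, here is a hedged sketch of what a minimal Spark 1.2 entry point for IDE development can look like (the class name and the word-count logic are illustrative assumptions, not the actual spark-skeleton contents):

import org.apache.spark.{SparkConf, SparkContext}

object SkeletonJob {
  def main(args: Array[String]): Unit = {
    // local[*] for IDE runs; spark-submit overrides the master on the cluster
    val conf = new SparkConf().setAppName("skeleton-job").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A toy word count: read args(0), count words, write results to args(1)
    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}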
Hacks
• 99% chance that on Windows you won't be able to use the function `saveAsTextFile()`
• Download the exe file from http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
• Place it in a bin folder somewhere on your PC (C:\somewhere\bin\winutils.exe) and set the property in your code before using the save function (see the sketch below):
System.setProperty("hadoop.home.dir", "C:\\somewhere") // note the escaped backslash in the Scala string literal
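A hedged sketch of how this fits together (the paths are placeholders; the property must be set before Spark first touches Hadoop's filesystem code):

import org.apache.spark.{SparkConf, SparkContext}

// Set hadoop.home.dir before creating the SparkContext
System.setProperty("hadoop.home.dir", "C:\\somewhere") // folder that contains bin\winutils.exe
val sc = new SparkContext(new SparkConf().setAppName("save-demo").setMaster("local[*]"))
sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("C:\\tmp\\out") // now works on Windows
sc.stop()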
When you are done with your code…
• It is time to package everything into a fat jar with sbt assembly
• Add "provided" to the Spark library dependencies, since the Spark libs are already on the classpath if you run the job on EMR with Spark already set up
• Find more info in the Audience Extension project's Spark branch build.sbt file; a minimal sketch also follows the snippet below.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
Running on EMR
• build.sbt can be configured (e.g. with an S3 plugin) to upload the fat jar to S3 when assembly finishes; if you don't have that, just upload it manually
• Run the bootstrap action s3://support.elasticmapreduce/spark/install-spark with arguments -v 1.2.0.a -x -g (some documentation at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark)
• Also install Ganglia for monitoring cluster load (run this before the Spark bootstrap step)
• If you don't install Ganglia, SSH tunnels to the Spark UI won't work (a sketch of a full cluster launch follows this list)
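For illustration, a hedged sketch of launching a cluster with both bootstrap actions via the AWS CLI (the AMI version, instance settings and exact flags are assumptions from the 2015-era tooling; check the emr-bootstrap-actions docs for the authoritative invocation):

aws emr create-cluster \
  --name "spark-cluster" \
  --ami-version 3.3.1 \
  --instance-type r3.4xlarge --instance-count 8 \
  --ec2-attributes KeyName=my-key \
  --bootstrap-actions \
    Path=s3://elasticmapreduce/bootstrap-actions/install-ganglia \
    Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-v,1.2.0.a,-x,-g]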
Start with local mode first
Use only one instance in the cluster, and submit your jar like this:
/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master local[16] \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  SimilarityJob.jar \
  --remote \
  --input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* \
  --output s3://dev-adform-data-engineers/tmp/spark/2days \
  --similarity-threshold 300
Run on multiple machines with yarn master
# --deploy-mode can be client or cluster
/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master yarn \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 116736M \
  --executor-cores 16 \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4 \
  SimilarityJob.jar \
  --remote \
  … … …
The executor parameters are optional; the bootstrap script automatically tries to maximize the Spark configuration options. Note that the scripts are not aware of the tasks you are running; they only read the EMR cluster specifications.
Spark UI
• Need to set up an SSH tunnel to access it from your PC (see the sketch after this list)
• An alternative is to use the command-line browser lynx
• When you submit an app with the local master, the UI will be at ip:4040
• When you submit with the YARN master, go to the Hadoop UI on port 9026; it will show the Spark task running. Click on ApplicationMaster in the Tracking UI column, or take the UI URL from the command-line output when you submit the task
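A hedged sketch of the tunnels (the key file and master hostname are placeholders):

# Forward the local-master UI (port 4040) to your PC
ssh -i ~/my-key.pem -N -L 4040:localhost:4040 hadoop@<emr-master-public-dns>

# Or open a SOCKS proxy (e.g. port 8157) and point your browser at it
# to reach the Hadoop UI on port 9026
ssh -i ~/my-key.pem -N -D 8157 hadoop@<emr-master-public-dns>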
Spark UI
In Spark 1.2.0 the Executors tab shows wrong numbers and the Storage tab is always empty; the only useful tabs are Jobs, Stages and Environment.
Some useful settings
• spark.hadoop.validateOutputSpecs: useful when developing; set it to false so that you can overwrite output files (see the sketch after this list)
• spark.default.parallelism (number of output files / number of cores), automatically configured when you run the bootstrap action with the -x option
• spark.shuffle.consolidateFiles (default false)
• spark.rdd.compress (default false)
• spark.akka.timeout, spark.akka.frameSize, spark.speculation, …
• http://spark.apache.org/docs/1.2.0/configuration.html
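A hedged sketch of applying these settings in code instead of on the command line (the values are examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SimilarityJob")
  .set("spark.hadoop.validateOutputSpecs", "false") // allow overwriting output while developing
  .set("spark.default.parallelism", "112")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)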
Spark shell
/home/hadoop/spark/bin/spark-shell \
  --master <yarn|local[*]> \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 4G \
  --executor-cores 16 \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4
Spark shell
• In the Spark shell you don't need to instantiate a Spark context; one is already instantiated as sc, but you can create another one if you like
• Type Scala expressions and see what happens
• Note the lazy evaluation: to force expression evaluation for debugging, use action functions like [expression].take(n) or [expression].count to see if your statements are OK (see the sketch below)
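A hedged example session (the path is a placeholder):

val lines = sc.textFile("s3://some-bucket/some/path/*") // lazy: nothing is read yet
val longLines = lines.filter(_.length > 100)            // still lazy
longLines.take(5)  // action: evaluates just enough partitions to return 5 lines
longLines.count    // action: evaluates the whole expression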
Summary
• Spark is better suited for development on Linux
• Don't trust the Amazon bootstrap scripts; check with Ganglia whether your application is actually utilizing the cluster resources
• Try to write Scala code in a way that makes it possible to run parts of it in spark-shell; otherwise it is hard to debug problems which occur only at production dataset scale.