By @przemur from
HTTP://ABOUT.ME/PRZEMEK.MACIOLEK/
• Data Scientist, Hadoop user since 2009
• Did research for academia, data mined for the oil & gas exploration industry, cofounded a Data Science startup, built the Big Data team at Base CRM, …
• Used a lot of different tools along the way (Mahout, HBase, Cassandra, Redis, Pig, Storm, …)
• Dreaming about something powerful and concise for Big Data…
• AD 2014: Head of Analytics & Data @ Toptal - researching new ways of doing Big Data Analytics, rediscovered Storm.
P.S. Ever considered doing Analytics & Data Science for a
very cool startup? Drop me a note at: prze@toptal.com
HADOOP IS COOL…
HADOOP IS COOL
(BUT SOMETIMES IT’S NOT)
• High latency (interactive, anyone?)
• Challenging expressibility of business logic
• Iterative algorithms? (think: PageRank)
SOLUTION?
[Diagram: General batch processing (MapReduce) alongside Specialized systems; systems named on the slide: Giraph, Pig, S4, Hive, Pregel, Storm, Drill, Impala, …]
[Diagram: Map and Reduce stages, each surrounded by Data blocks]
MAYBE MAP REDUCE IS NOT
ALWAYS THE BEST SOLUTION?
GENERALIZE FTW!

[Diagram: Spark generalizes MapReduce with a Task DAG and Data Sharing, covering Batch processing and Specialized systems (…)]
RESILIENT DISTRIBUTED DATASET (RDD)
• A collection of elements that can be operated on in parallel
  • Parallel Collection, e.g. sc.parallelize(Array(1,2,3))
  • Hadoop Dataset
• Lazily evaluated, able to rebuild lost data at any time
• Can be stored in memory without replication
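The "rebuild lost data" and "no replication" points can be sketched with a toy model in plain Scala. This is not Spark's API: ToyRdd, parallelize, read and cache are hypothetical names chosen for illustration. The assumed idea is that each dataset remembers how to compute a partition from its parent, so a lost in-memory block can be recomputed rather than restored from a replica.

```scala
// Toy model of lineage-based recovery. NOT Spark's API; all names are hypothetical.
object LineageSketch {
  trait ToyRdd[A] {
    def numPartitions: Int
    def compute(partition: Int): Seq[A]            // recompute a partition from lineage

    // A transformation only records the parent and the function to apply.
    def map[B](f: A => B): ToyRdd[B] = {
      val parent = this
      new ToyRdd[B] {
        def numPartitions: Int = parent.numPartitions
        def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
      }
    }
  }

  // Source dataset: splits a local collection into roughly equal slices.
  def parallelize[A](data: Seq[A], slices: Int): ToyRdd[A] = new ToyRdd[A] {
    private val sliceSize = math.max(1, math.ceil(data.size.toDouble / slices).toInt)
    private val parts = data.grouped(sliceSize).toVector
    def numPartitions: Int = parts.size
    def compute(partition: Int): Seq[A] = parts(partition)
  }

  val doubled: ToyRdd[Int] = parallelize(Seq(1, 2, 3, 4, 5, 6), 3).map(_ * 2)

  // In-memory store of computed partitions: a single copy, no replication.
  val cache = scala.collection.mutable.Map[Int, Seq[Int]]()

  def read(partition: Int): Seq[Int] =
    cache.getOrElseUpdate(partition, doubled.compute(partition))

  def main(args: Array[String]): Unit = {
    println(read(1))   // computes and caches partition 1
    cache.remove(1)    // simulate losing the cached block
    println(read(1))   // rebuilt from lineage with the same content
  }
}
```

The trade-off in this sketch is recomputation time instead of storage for a second copy.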
TRANSFORMATIONS
• Create a new dataset from an existing one
• Lazily evaluated
• Recomputed each time an action runs on it, but might be persisted (in memory or on disk)

ACTIONS
• Return a value to the driver after the computation finishes
• Run all required transformations

• Broadcast Variables and Accumulators for cluster-level sharing
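The split between lazy transformations and eager actions can be imitated locally with a Scala collection view. This is only an analogy, not Spark itself. A view runs its mapping function again on every traversal, which mirrors "recomputed each time an action runs on it", and materialising the view once plays the role of persisting. The counter name evaluations is made up for the sketch.

```scala
// Local analogy for transformations vs. actions using a lazy view. NOT Spark.
object LazySketch {
  var evaluations = 0                          // how many times the mapping function ran

  val source = Vector(1, 2, 3, 4)

  // "Transformation": defining it computes nothing.
  val squared = source.view.map { x => evaluations += 1; x * x }
  val afterDefinition = evaluations

  // "Action": forces the computation and returns a plain value.
  val firstSum = squared.foldLeft(0)(_ + _)
  val afterFirstAction = evaluations

  // A second action recomputes every element from scratch.
  val secondSum = squared.foldLeft(0)(_ + _)
  val afterSecondAction = evaluations

  // "Persisting": materialise once, then reuse the stored result.
  val persisted = squared.toVector
  val thirdSum = persisted.foldLeft(0)(_ + _)  // reads stored values only
  val afterPersistedAction = evaluations

  def main(args: Array[String]): Unit = {
    println(s"sums: $firstSum, $secondSum, $thirdSum")
    println(s"evaluations: $afterDefinition, $afterFirstAction, $afterSecondAction, $afterPersistedAction")
  }
}
```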
Scala, Java, Python!
HOW TO USE IT?
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 74

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) // How many words are in the longest line
res2: Int = 16

scala> textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
res3: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
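To see what the last command in the session computes, here is the same word count on plain Scala collections with no Spark involved. The two sample lines are made up, and groupBy followed by a per-group reduce is only a local stand-in for reduceByKey, not how Spark executes it.

```scala
// Word count on local collections, shaped like the Spark pipeline above.
object WordCountSketch {
  val lines = Seq("to be or not", "to be")     // made-up sample input

  // flatMap + map: one (word, 1) pair per word.
  val pairs: Seq[(String, Int)] =
    lines.flatMap(line => line.split(" ").toSeq).map(word => (word, 1))

  // Local stand-in for reduceByKey((a, b) => a + b):
  // group the pairs by key, then reduce the values within each group.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) =>
      word -> ps.map(_._2).reduce((a, b) => a + b)
    }

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy(_._1).foreach(println)
}
```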
WHAT HAPPENS UNDERNEATH?
RDD Objects: rdd.filter().map(…).groupBy(…).filter(…)
→ DAG Scheduler: Split graph into stages of tasks. Submit each one when ready. (hands over a TaskSet)
→ Task Scheduler: Launch tasks via cluster manager. Retry. (hands over a Task)
→ Worker: Execute tasks. Store and serve blocks.
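"Split graph into stages" can be pictured with a toy splitter over the operator chain from the slide. This is a simplified sketch, not Spark's scheduler. Op and splitIntoStages are hypothetical names, and the only rule assumed is that a shuffle operation (groupBy here) starts a new stage while the other operations are pipelined into the current one.

```scala
// Toy stage splitter: a shuffle operation opens a new stage. NOT Spark's scheduler.
object StageSketch {
  final case class Op(name: String, wide: Boolean)   // wide = needs a shuffle

  def splitIntoStages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List(List.empty[String])) { (stages, op) =>
      if (op.wide) List(op.name) :: stages            // start a new stage
      else (op.name :: stages.head) :: stages.tail    // pipeline into current stage
    }.map(_.reverse).reverse

  // The chain from the slide: rdd.filter().map(…).groupBy(…).filter(…)
  val chain = List(
    Op("filter", wide = false),
    Op("map", wide = false),
    Op("groupBy", wide = true),
    Op("filter", wide = false)
  )

  val stages: List[List[String]] = splitIntoStages(chain)

  def main(args: Array[String]): Unit =
    stages.zipWithIndex.foreach { case (stage, i) =>
      println(s"stage $i: ${stage.mkString(" -> ")}")
    }
}
```

In this simplification the shuffle operation sits at the head of the second stage; a real scheduler divides the shuffle work between both sides of the boundary.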
NARROW DEPENDENCIES
• map, filter
• union

WIDE (SHUFFLE) DEPENDENCIES
• groupByKey
• join (inputs not co-partitioned)

* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
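The difference between the two kinds of dependency can be sketched by modelling partitions as nested Vectors in plain Scala. The data, the helper names mapValues and groupByKey, and the hash-based placement are assumptions for illustration, not Spark code. A narrow operation touches one input partition per output partition, while a wide one has to read every input partition.

```scala
// Partitions modelled as Vector[Vector[...]]; illustrative only, NOT Spark.
object DependencySketch {
  type Partitions[A] = Vector[Vector[A]]

  val input: Partitions[(String, Int)] = Vector(
    Vector("a" -> 1, "b" -> 2),
    Vector("a" -> 3, "c" -> 4)
  )

  // Narrow: each output partition is built from exactly one input partition,
  // so partitions can be processed independently.
  def mapValues(ps: Partitions[(String, Int)])(f: Int => Int): Partitions[(String, Int)] =
    ps.map(_.map { case (k, v) => (k, f(v)) })

  // Wide: records with the same key may sit in any input partition, so every
  // output partition reads from all input partitions (a shuffle).
  def groupByKey(ps: Partitions[(String, Int)], numOut: Int): Partitions[(String, Vector[Int])] =
    Vector.tabulate(numOut) { out =>
      ps.flatten
        .filter { case (k, _) => math.abs(k.hashCode % numOut) == out }
        .groupBy(_._1)
        .map { case (k, kvs) => (k, kvs.map(_._2)) }
        .toVector
        .sortBy(_._1)
    }

  val scaled  = mapValues(input)(_ * 10)
  val grouped = groupByKey(input, numOut = 2)

  def main(args: Array[String]): Unit = {
    println(scaled)
    println(grouped)
  }
}
```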
How much code is needed to implement Big Data Page Rank?
* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
* http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
* http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
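As a rough answer to the PageRank question above, here is a sketch on local Scala collections. It follows the shape commonly used for Spark PageRank examples (combine links with ranks, spread contributions, sum them per page), but it is not taken from the cited papers. The three-page graph, the damping factor 0.85 and the iteration count are assumptions.

```scala
// PageRank on local collections; graph and parameters are illustrative assumptions.
object PageRankSketch {
  val links: Map[String, Seq[String]] = Map(
    "a" -> Seq("b", "c"),
    "b" -> Seq("c"),
    "c" -> Seq("a")
  )

  def pageRank(links: Map[String, Seq[String]],
               iterations: Int,
               damping: Double = 0.85): Map[String, Double] = {
    var ranks: Map[String, Double] = links.keys.map(p => p -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Each page sends rank / outDegree to every page it links to.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dest => dest -> (ranks(page) / outs.size))
      }
      // Sum contributions per destination, then apply damping.
      val summed: Map[String, Double] =
        contribs.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }
      ranks = links.keys.map { p =>
        p -> ((1 - damping) + damping * summed.getOrElse(p, 0.0))
      }.toMap
    }
    ranks
  }

  val ranks: Map[String, Double] = pageRank(links, iterations = 20)

  def main(args: Array[String]): Unit =
    ranks.toSeq.sortBy(-_._2).foreach { case (p, r) => println(f"$p: $r%.4f") }
}
```

The core loop is about ten lines, which is the point of the slide. On a cluster the groupBy step would become a shuffle.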
BERKELEY DATA
ANALYTICS STACK

* https://amplab.cs.berkeley.edu/software/
SPARK LIVE
REFERENCES
• http://spark.incubator.apache.org/
• https://amplab.cs.berkeley.edu/software/
• http://ampcamp.berkeley.edu/3/exercises/index.html
• http://www.mlbase.org/
• https://amplab.cs.berkeley.edu/benchmark/
• http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx
• http://spark-summit.org/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf
• http://spark-summit.org/wp-content/uploads/2013/10/Kay_Sparrow_Spark_Summit.pdf
• http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
• http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf
• http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

Spark!