Apache Spark
The next Generation Cluster Computing
Ivan Lozić, 04/25/2017
Ivan Lozić, software engineer & entrepreneur
Scala & Spark, C#, Node.js, Swift
Web page: www.deegloo.com
E-Mail: ilozic@gmail.com
LinkedIn: https://www.linkedin.com/in/ilozic/
Zagreb, Croatia
Contents
● Apache Spark and its relation to Hadoop MapReduce
● What makes Apache Spark run fast
● How to use Spark rich API to build batch ETL jobs
● Streaming capabilities
● Structured streaming
3
Apache Hadoop
4
Apache Hadoop
● Open Source framework for distributed storage and processing
● Origins are in the project “Nutch” back in 2002 (Cutting, Cafarella)
● In 2006, Yahoo! created Hadoop, based on Google's GFS and MapReduce papers
● Based on MapReduce programming model
● Fundamental assumption - all the modules are built to handle
hardware failures automatically
● Clusters built of commodity hardware
5
6
Apache Spark
7
Motivation
● Hardware - CPU compute bottleneck
● Users - democratise access to data and improve usability
● Applications - necessity to build near real time big data applications
8
Apache Spark
● Open source, fast and expressive cluster computing framework designed
for big data analytics
● Compatible with Apache Hadoop
● Developed at UC Berkeley's AMPLab in 2009 and donated to the Apache
Software Foundation in 2013
● Original author - Matei Zaharia
● Databricks Inc. - the company behind Apache Spark
9
Apache Spark
● General distributed computing engine which unifies:
○ SQL and DataFrames
○ Real-time streaming (Spark streaming)
○ Machine learning (Spark ML/MLlib)
○ Graph processing (GraphX)
10
Apache Spark
● Runs everywhere - standalone, EC2, Hadoop YARN, Apache Mesos
● Reads and writes from/to (see the sketch below):
○ File/Directory
○ HDFS/S3
○ JDBC
○ JSON
○ CSV
○ Parquet
○ Cassandra, HBase, ...
11
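A minimal sketch of this unified read/write API, assuming a local SparkSession; the CSV path is the one used later in the deck, the Parquet path is hypothetical:

import org.apache.spark.sql.SparkSession

object IoSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("io-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read a CSV file with a header row...
    val incidents = spark.read
      .option("header", "true")
      .csv("/data/SFPD_Incidents_2003.csv")

    // ...and write the same data back out as Parquet.
    incidents.write
      .mode("overwrite")
      .parquet("/data/warehouse/incidents.parquet")

    spark.stop()
  }
}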
Apache Spark - architecture
12
source: Databricks
Word count - MapReduce vs Spark
13
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Hadoop ecosystem
14
Who uses Apache Spark?
15
Core data
abstractions
16
Resilient Distributed Dataset
● RDDs are partitioned collections of objects - building blocks of Spark
● Immutable and provide fault tolerant computation
● Two types of operations:
1. Transformations - map, reduce, sort, filter, groupBy, ...
2. Actions - collect, count, take, first, foreach, saveToCassandra, ...
17
RDD
● Types of operations are based on the Scala collection API
● Transformations are lazily evaluated - they only build up the DAG (Directed
Acyclic Graph)
● Actions trigger evaluation of the DAG and the actual computation (see the
sketch below)
18
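A minimal sketch of this laziness, assuming a local master and the SFPD CSV used elsewhere in the deck - nothing runs until the action on the last transformation line:

import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val lines   = sc.textFile("/data/SFPD_Incidents_2003.csv") // base RDD - nothing is read yet
    val lengths = lines.map(_.length)                          // transformation - only extends the DAG
    val total   = lengths.reduce(_ + _)                        // action - triggers the actual computation

    println(s"Total characters: $total")
    sc.stop()
  }
}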
RDD
19
Data shuffling
● Sending data over the network
● Slow - should be minimized as much as possible!
● Typical example - groupByKey (slow) vs reduceByKey (faster), as sketched below
20
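A minimal sketch contrasting the two, reusing a SparkContext named sc as in the RDD sketch above (the input path is hypothetical): reduceByKey pre-aggregates on each partition before the shuffle, while groupByKey ships every single value across the network first.

val pairs = sc.textFile("/data/words.txt")       // hypothetical input
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Shuffles every (word, 1) pair, then sums on the receiving side - slow.
val slow = pairs.groupByKey().mapValues(_.sum)

// Combines counts per partition first, so far less data crosses the network - faster.
val fast = pairs.reduceByKey(_ + _)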
RDD - the problems
● They express the how rather than the what
● Operations and data types inside closures are a black box for Spark - Spark
cannot apply optimizations
21
val category = spark.sparkContext.textFile("/data/SFPD_Incidents_2003.csv")
.map(line => line.split(byCommaButNotUnderQuotes)(1))
.filter(cat => cat != "Category")
Structure
(Structured APIs)
22
SparkSQL
23
● Originally named “Shark” - to enable HiveQL queries
● As of Spark 2.0 - SQL 2003 support
category.toDF("categoryName").createOrReplaceTempView("category")
spark.sql("""
SELECT categoryName, count(*) AS Count
FROM category
GROUP BY categoryName
ORDER BY 2 DESC
""").show(5)
DataFrame
● Higher-level abstraction (DSL) for manipulating data
● Distributed collection of rows organized into named columns
● Modeled after Pandas DataFrame
● DataFrame has schema (something RDD is missing)
24
val categoryDF = category.toDF("categoryName")
categoryDF
.groupBy("categoryName")
.count()
.orderBy($"Count".desc)
.show(5)
DataFrame
25
Structured APIs error-check comparison
26
source: Databricks
Dataset
● Extension to DataFrame
● Type-safe
● DataFrame = Dataset[Row]
27
case class Incident(Category: String, DayOfWeek: String)
val incidents = spark
.read
.option("header", "true")
.csv("/data/SFPD_Incidents_2003.csv")
.select("Category", "DayOfWeek")
.as[Incident]
val days = Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
val histogram = incidents.groupByKey(_.Category).mapGroups {
case (category, daysOfWeek) => {
val buckets = new Array[Int](7)
daysOfWeek.map(_.DayOfWeek).foreach { dow =>
buckets(days.indexOf(dow)) += 1
}
(category, buckets)
}
}
What makes
Spark fast?
28
In memory computation
● Hadoop MapReduce achieves fault tolerance by writing to HDFS between jobs
● It is easy to spend 90% of the time on disk I/O alone
29
[diagram: input → iter. 1 → iter. 2 → ... with an HDFS read and write between every iteration]
● Spark provides fault tolerance by building a lineage of transformations
● Data is not replicated
[diagram: input → iter. 1 → iter. 2 → ... with intermediate data kept in memory]
Catalyst - query optimizer
30
source: Databricks
● Applies transformations to convert an unoptimized query plan into an
optimized one
Project Tungsten
● Improve Spark execution memory and CPU efficiency by:
○ Performing explicit memory management instead of relying on JVM objects (Dataset
encoders)
○ Generating code on the fly to fuse multiple operators into one (Whole stage codegen)
○ Introducing cache-aware computation
○ In-memory columnar format
● Bringing Spark closer to the bare metal
31
Dataset encoders
● Encoders translate between domain objects and Spark's internal
format
32
source: Databricks
Dataset encoders
● Encoders bridge objects with data sources (see the sketch below)
33
{
"Category": "THEFT",
"IncidntNum": "150060275",
"DayOfWeek": "Saturday"
}
case class Incident(IncidntNum: Int,
Category: String,
DayOfWeek: String)
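A minimal sketch of that bridge (hypothetical file path): the JSON stores IncidntNum as a string, so it is cast to int before the encoder maps each row onto the case class.

import org.apache.spark.sql.SparkSession

object EncoderSketch {
  case class Incident(IncidntNum: Int, Category: String, DayOfWeek: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("encoder-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides the implicit Encoder[Incident]

    val incidents = spark.read
      .json("/data/incidents.json")                        // hypothetical path
      .withColumn("IncidntNum", $"IncidntNum".cast("int")) // JSON delivers it as a string
      .as[Incident]                                        // encoder: rows <-> Incident objects

    incidents.show(5)
    spark.stop()
  }
}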
Dataset benchmark
Space efficiency
34
source: Databricks
Dataset benchmark
Serialization/deserialization performance
35
source: Databricks
Whole stage codegen
● Fuse the operators together
● Generate code on the fly
● The idea: generate specialized code, as if it had been written by hand, to
make it fast
Result: Spark 2.0 is 10x faster than Spark 1.6
36
Whole stage codegen
37
SELECT COUNT(*) FROM store_sales
WHERE ss_item_sk=1000
Whole stage codegen
Volcano iterator model
38
Whole stage codegen
What if we asked an intern to write this in C#?
39
long count = 0;
foreach (var ss_item_sk in store_sales) {
if (ss_item_sk == 1000)
count++;
}
Volcano vs Intern
40
Volcano
Intern
source: Databricks
Volcano vs Intern
41
Developing ETL
with Spark
42
Choose your favorite IDE
43
Define Spark job entry point
44
object IncidentsJob {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Incidents processing job")
      .config("spark.sql.shuffle.partitions", "16")
      .master("local[4]")
      .getOrCreate()

    // spark transformations and actions...

    System.exit(0)
  }
}
Create build.sbt file
45
lazy val root = (project in file(".")).
settings(
organization := "com.mycompany",
name := "spark.job.incidents",
version := "1.0.0",
scalaVersion := "2.11.8",
mainClass in Compile := Some("com.mycompany.spark.job.incidents.main")
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.1" % "provided",
"org.apache.spark" %% "spark-sql" % "2.0.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.0.1" % "provided",
"com.microsoft.sqlserver" % "sqljdbc4" % "4.0"
)
Create application (fat) jar file
$ sbt compile
$ sbt test
$ sbt assembly (requires the sbt-assembly plugin - see the sketch below)
46
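sbt assembly is not built into sbt itself - it comes from the sbt-assembly plugin, enabled in project/plugins.sbt. A minimal sketch; the version shown is an assumption, use the release that matches your sbt version:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")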
Submit job via spark-submit command
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
47
Example workflow
48
1. pull content (code)
2. take build number (331)
3. build & test - produce job artifact (job331.jar)
4. copy job331.jar to cluster
5. create/schedule job job331 (http) - notification
6. spark-submit job331
Spark Streaming
49
Apache Spark streaming
● Scalable fault tolerant streaming system
● Receivers receive data streams and chop them into batches
● Spark processes batches and pushes out the result
50
● Input: Files, Socket, Kafka, Flume, Kinesis...
Apache Spark streaming
51
def main(args: Array[String]) {
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("Incidents processing job - Stream")
val ssc = new StreamingContext(conf, Seconds(1))
val topics = Set(
  Topics.Incident)
val directKafkaStream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
DefaultDecoder, DefaultDecoder](
ssc,
kafkaParams,
topics)
// process batches
directKafkaStream.map(_._2).flatMap(_.split(" "))...
// Start the computation
ssc.start()
ssc.awaitTermination()
System.exit(0)
}
Apache Spark streaming
● Integrates with the rest of the ecosystem
○ Combine batch and stream processing
○ Combine machine learning with streaming
○ Combine SQL with streaming (see the sketch below)
52
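A minimal sketch of the last point, continuing the Kafka example from the previous slide and assuming a SparkSession named spark is in scope: each micro-batch RDD is turned into a DataFrame and queried with plain SQL.

directKafkaStream.map(_._2).foreachRDD { rdd =>
  import spark.implicits._

  // Decode the Kafka payload and split it into words for this micro-batch.
  val words = rdd.flatMap(bytes => new String(bytes).split(" ")).toDF("word")
  words.createOrReplaceTempView("words")

  spark.sql("SELECT word, count(*) AS cnt FROM words GROUP BY word ORDER BY cnt DESC")
    .show(5)
}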
Structured
streaming
53
[Alpha version in Spark 2.1]
Structured streaming (continuous apps)
● High-level streaming API built on DataFrames
● Catalyst optimizer creates incremental execution plan
● Unifies streaming, interactive and batch queries
● Supports multiple sources and sinks
● E.g. aggregate data in a stream, then serve using JDBC
54
Structured streaming key idea
The simplest way to perform streaming analytics is not having to reason
about streaming.
55
Structured streaming
56
Structured streaming
● Reusing the same API
57
// finite (batch)
val categories = spark
  .read
  .option("header", "true")
  .schema(schema)
  .csv("/data/source")
  .select("Category")

// infinite (stream)
val categories = spark
  .readStream
  .option("header", "true")
  .schema(schema)
  .csv("/data/source")
  .select("Category")
Structured streaming
● Reusing the same API
58
// finite (batch)
categories
  .write
  .format("parquet")
  .save("/data/warehouse/categories.parquet")

// infinite (stream)
categories
  .writeStream
  .format("parquet")
  .start("/data/warehouse/categories.parquet")
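The same API also covers the earlier point about aggregating data in a stream - a minimal sketch, reusing the assumed schema and source path from above, that keeps a running count per category and writes it to the console sink:

val runningCounts = spark
  .readStream
  .option("header", "true")
  .schema(schema)       // same schema value as in the read examples above
  .csv("/data/source")
  .groupBy("Category")
  .count()

runningCounts
  .writeStream
  .outputMode("complete") // emit the full updated aggregate on every trigger
  .format("console")
  .start()
  .awaitTermination()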
Structured streaming
59
Useful resources
● Spark home page: https://spark.apache.org/
● Spark summit page: https://spark-summit.org/
● Apache Spark Docker image: https://github.com/dylanmei/docker-zeppelin
● SFPD Incidents: https://data.sfgov.org/Public-Safety/Police-Department-Incidents/tmnf-yvry
60
Thank you for your attention!
61
References
62
● Michael Armbrust - Structuring Spark: DataFrames, Datasets and Streaming -
https://spark-summit.org/2016/events/structuring-spark-dataframes-datasets-and-streaming/
● Apache Parquet - https://parquet.apache.org/
● Spark Performance: What's Next -
https://spark-summit.org/east-2016/events/spark-performance-whats-next/
● Avoid groupByKey -
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html