Apache Spark
uweseiler
24.11.2014
About me
Big Data Nerd 
Hadoop Trainer NoSQL Fan Boy 
Photography Enthusiast Travelpirate
About us
codecentric specializes in...
Big Data Nerds Agile Ninjas Continuous Delivery Gurus 
Enterprise Java Specialists Performance Geeks 
Join us!
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Spark: In a tweet
“Spark … is what you might
call a Swiss Army knife of Big
Data analytics tools”
– Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
– in-memory storage
– data locality
– (micro) batch & streaming support
• Improves usability via
– Rich APIs in Scala, Java, Python
– Interactive shell
• Runs Standalone, on YARN, on Mesos, and on Amazon EC2
Spark is also…
• Came out of AMPLab at UCB in 2009 
• A top-level Apache project as of 2014 
– http://spark.apache.org 
• Backed by a commercial entity: Databricks 
• A toolset for Data Scientists / Analysts 
• Implementation of Resilient Distributed Dataset 
(RDD) in Scala 
• Hadoop Compatible
Spark: Trends
Apache Drill Apache Storm Apache Spark Apache YARN Apache Tez 
Generated using http://www.google.com/trends/
Spark: Community
https://github.com/apache/spark/pulse
Spark: Performance
3X faster using 10X fewer machines 
http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html 
http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
Spark: Ecosystem
• HDFS – redundant, reliable storage
• MapReduce – cluster resource mgmt. + data processing
• Spark Core
• On top of Spark Core: Spark SQL (SQL), Spark Streaming (Streaming), MLlib (Machine Learning), SparkR (R on Spark), GraphX (Graph Computation), BlinkDB
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Spark: Core Concept
• Resilient Distributed Dataset (RDD)
Conceptually, RDDs can be roughly viewed as partitioned,
locality-aware distributed vectors
(e.g. one RDD split into partitions A11, A12, A13)
• Read-only collection of objects spread across a cluster
• Built through parallel transformations & actions
• Computation can be represented by lazily evaluated
lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence (see the sketch below)
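A minimal sketch of these properties, assuming a running SparkContext named sc and made-up data:
val nums = sc.parallelize(1 to 1000, numSlices = 4)  // partitioned, distributed collection
val squares = nums.map(x => x * x)                   // lazy transformation: only the lineage is recorded
squares.persist()                                    // controllable persistence
println(squares.reduce(_ + _))                       // the action triggers evaluation; lost partitions are rebuilt from the lineage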
Spark: RDD Example
Base RDD from HDFS
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("Error"))
val messages = errors.map(_.split('\t')(2))
messages.cache()
RDD in memory
Iterative Processing
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
Spark: Transformations
Transformations - 
Create new datasets from existing ones 
map
Spark: Transformations
Transformations - 
Create new datasets from existing ones 
map(func) 
filter(func) 
flatMap(func) 
mapPartitions(func) 
mapPartitionsWithIndex(func) 
union(otherDataset) 
intersection(otherDataset) 
distinct([numTasks])) 
groupByKey([numTasks]) 
sortByKey([ascending], [numTasks]) 
reduceByKey(func, [numTasks]) 
aggregateByKey(zeroValue)(seqOp, 
combOp, [numTasks]) 
join(otherDataset, [numTasks]) 
cogroup(otherDataset, [numTasks]) 
cartesian(otherDataset) 
pipe(command, [envVars]) 
coalesce(numPartitions) 
sample(withReplacement,fraction, seed) 
repartition(numPartitions)
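A small sketch chaining a few of these transformations (hypothetical data; assumes a SparkContext named sc):
val lines  = sc.parallelize(Seq("a b", "b c", "c"))
val words  = lines.flatMap(_.split(" "))    // flatMap
val pairs  = words.map(word => (word, 1))   // map
val counts = pairs.reduceByKey(_ + _)       // reduceByKey
val unique = words.distinct()               // distinct
// nothing has executed yet – these calls only build up the lineage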
Spark: Actions
Actions - 
Return a value to the driver program after running a 
computation on the dataset 
reduce
Spark: Actions
Actions - 
Return a value to the driver program after running a 
computation on the dataset 
reduce(func) 
collect() 
count() 
first() 
countByKey() 
foreach(func) 
take(n) 
takeSample(withReplacement,num, [seed]) 
takeOrdered(n, [ordering]) 
saveAsTextFile(path) 
saveAsSequenceFile(path) 
(Only Java and Scala) 
saveAsObjectFile(path) 
(Only Java and Scala)
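A short sketch of a few actions, continuing the hypothetical counts RDD from the transformations sketch above:
counts.count()                        // number of distinct words
counts.take(2)                        // first two (word, count) pairs to the driver
counts.collect().foreach(println)     // whole dataset to the driver – beware of size
counts.saveAsTextFile("hdfs://out/")  // write results back to HDFS (path is illustrative)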
Spark: Dataflow
All transformations in Spark are lazy and are only 
computed when an action requires it.
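A two-line sketch of this behaviour (the path is illustrative):
val errors = sc.textFile("hdfs://logs/").filter(_.contains("ERROR"))  // returns immediately, no job is run
errors.count()  // only this action triggers the actual computation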
Spark: Persistence
One of the most important capabilities in Spark is 
caching a dataset in memory across operations 
• cache() – MEMORY_ONLY 
• persist() – defaults to MEMORY_ONLY
Spark: Storage Levels
• persist(StorageLevel) 
Storage Level – Meaning 
MEMORY_ONLY – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. 
MEMORY_AND_DISK – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. 
MEMORY_ONLY_SER – Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. 
MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. 
DISK_ONLY – Store the RDD partitions only on disk. 
MEMORY_ONLY_2, MEMORY_AND_DISK_2, … – Same as the levels above, but replicate each partition on two cluster nodes.
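A small sketch of choosing a level explicitly (assumes an existing RDD, e.g. the messages RDD from the earlier example):
import org.apache.spark.storage.StorageLevel
messages.persist(StorageLevel.MEMORY_AND_DISK_SER)  // spill serialized partitions to disk if they don't fit
messages.count()                                    // first action materializes the cache
messages.unpersist()                                // drop it again when no longer needed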
Spark: Parallelism
Can be specified in a number of different ways 
• RDD partition number 
• sc.textFile(input, minSplits = 10) 
• sc.parallelize(1 to 10000, numSlices = 10) 
• Mapper side parallelism 
• Usually inherited from parent RDD(s) 
• Reducer side parallelism 
• rdd.reduceByKey(_ + _, numPartitions = 10) 
• rdd.reduceByKey(partitioner = p, _ + _) 
• “Zoom in/out” 
• rdd.repartition(numPartitions: Int) 
• rdd.coalesce(numPartitions: Int, shuffle: Boolean)
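A brief sketch of how these settings show up in the partition count (paths and numbers are illustrative):
val rdd = sc.textFile("hdfs://docs/", 10)  // ask for at least 10 input partitions
println(rdd.partitions.size)
val wide   = rdd.repartition(100)          // full shuffle up to 100 partitions
val narrow = wide.coalesce(10)             // merge back down, avoiding a shuffle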
Spark: Example
Text Processing Example 
Top words by frequency
Spark: Frequency Example
Create RDD from external data
Data sources supported by Hadoop: HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, …
I/O via Hadoop optional
// Step 1. Create RDD from Hadoop text files 
val docs = spark.textFile(“hdfs://docs/“)
Spark: Frequency Example
Function map 
Hello World 
This is 
Spark 
Spark 
The end 
hello world 
this is 
spark 
spark 
the end 
RDD[String] 
.map(line => line.toLowerCase) 
RDD[String]
Spark: Frequency Example
Function map 
Hello World 
This is 
Spark 
Spark 
The end 
hello world 
this is 
spark 
spark 
the end 
RDD[String] 
.map(line => line.toLowerCase) 
RDD[String] 
= 
.map(_.toLowerCase)
Spark: Frequency Example
Function map 
Hello World 
This is 
Spark 
Spark 
The end 
= 
// Step 2. Convert lines to lower case 
val lower = docs.map(line => line.toLowerCase) 
hello world 
this is 
spark 
spark 
the end 
RDD[String] 
.map(line => line.toLowerCase) 
RDD[String] 
.map(_.toLowerCase)
Spark: Frequency Example
map vs. flatMap 
RDD[String] 
hello world 
this is 
spark 
spark 
the end 
.map(…) 
RDD[Array[String]] 
hello 
spark 
_.split("\\s+") 
world 
this is spark 
the end
Spark: Frequency Example
map vs. flatMap 
RDD[String] 
hello world 
this is 
spark 
spark 
the end 
.map(…) 
RDD[String] 
RDD[Array[String]] 
hello 
spark 
.flatten* 
_.split("\\s+") 
world 
this is spark 
hello 
world 
this 
the end 
end
Spark: Frequency Example
map vs. flatMap 
RDD[String] 
hello world 
this is 
spark 
spark 
the end 
.map(…) 
RDD[String] 
RDD[Array[String]] 
hello 
world 
this is spark 
spark 
.flatten* 
_.split("\\s+") 
the end 
.flatMap(line => line.split("\\s+")) 
hello 
world 
this 
end
Spark: Frequency Example
map vs. flatMap 
RDD[String] 
hello world 
this is 
spark 
spark 
the end 
.map(…) 
RDD[String] 
RDD[Array[String]] 
hello 
world 
this is spark 
spark 
.flatten* 
_.split("\\s+") 
hello 
world 
this 
the end 
end 
.flatMap(line => line.split("\\s+")) 
// Step 3. Split lines into words 
val words = lower.flatMap(line => line.split("\\s+"))
Spark: Frequency Example
Key-Value Pairs 
RDD[String] 
hello 
world 
spark 
end 
.map(word => Tuple2(word, 1)) 
RDD[(String, Int)] 
hello 
world 
spark 
end 
spark 
1 
1 
spark 
1 
1 
1
Spark: Frequency Example
Key-Value Pairs 
RDD[String] 
hello 
world 
spark 
end 
.map(word => Tuple2(word, 1)) 
= 
.map(word => (word, 1)) 
RDD[(String, Int)] 
hello 
world 
spark 
end 
spark 
1 
1 
spark 
1 
1 
1
Spark: Frequency Example
Key-Value Pairs 
RDD[String] 
hello 
world 
spark 
end 
.map(word => Tuple2(word, 1)) 
= 
.map(word => (word, 1)) 
// Step 4. Convert into tuples 
val counts = words.map(word => (word, 1)) 
RDD[(String, Int)] 
hello 
world 
spark 
end 
spark 
1 
1 
spark 
1 
1 
1
Spark: Frequency Example
Shuffling 
RDD[(String, Int)] 
hello 
world 
spark 
end 
1 
1 
spark 
1 
1 
1 
RDD[(String, Iterator(Int))] 
.groupByKey 
end 1 
hello 1 
spark 1 1 
world 1
Spark: Frequency Example
Shuffling 
RDD[(String, Int)] 
hello 
world 
spark 
end 
1 
1 
spark 
1 
1 
1 
RDD[(String, Iterator(Int))] RDD[(String, Int)] 
.groupByKey 
end 1 
hello 1 
spark 1 1 
world 1 
end 1 
hello 1 
spark 2 
world 1 
.mapValues 
_.reduce… 
(a,b) => a+b
Spark: Frequency Example
Shuffling 
RDD[(String, Int)] 
hello 
world 
spark 
end 
1 
1 
spark 
1 
1 
1 
RDD[(String, Iterator(Int))] RDD[(String, Int)] 
.groupByKey 
end 1 
hello 1 
spark 1 1 
world 1 
end 1 
hello 1 
spark 2 
world 1 
.mapValues 
_.reduce… 
(a,b) => a+b 
.reduceByKey((a,b) => a+b)
Spark: Frequency Example
Shuffling 
RDD[(String, Int)] 
hello 
world 
spark 
spark 
end 
1 
1 
1 
1 
1 
RDD[(String, Iterator(Int))] RDD[(String, Int)] 
.groupByKey 
end 1 
hello 1 
spark 1 1 
world 1 
// Step 5. Count all words 
val freq = counts.reduceByKey(_ + _) 
end 1 
hello 1 
spark 2 
world 1 
.mapValues 
_.reduce… 
(a,b) => a+b
Spark: Frequency Example
Top N (Prepare data) 
RDD[(String, Int)] 
end 1 
hello 1 
spark 2 
world 1 
// Step 6. Swap tuples (Partial code) 
freq.map(_.swap) 
RDD[(Int, String)] 
1 end 
1 hello 
2 spark 
1 world 
.map(_.swap)
Spark: Frequency Example
Top N (First Attempt) 
RDD[(Int, String)] 
1 end 
1 hello 
2 spark 
1 world 
RDD[(Int, String)] 
2 spark 
1 end 
1 hello 
1 world 
.sortByKey
Spark: Frequency Example
Top N 
RDD[(Int, String)] 
1 end 
1 hello 
2 spark 
1 world 
RDD[(Int, String)] 
2 spark 
1 end 
1 hello 
1 world 
local top N 
.top(N) 
local top N
Spark: Frequency Example
Top N 
RDD[(Int, String)] 
1 end 
1 hello 
2 spark 
1 world 
RDD[(Int, String)] 
2 spark 
1 end 
1 hello 
1 world 
.top(N) 
Array[(Int, String)] 
2 spark 
1 end 
local top N 
local top N 
reduction
Spark: Frequency Example
Top N 
RDD[(Int, String)] 
1 end 
1 hello 
2 spark 
1 world 
RDD[(Int, String)] 
2 spark 
1 end 
1 hello 
1 world 
.top(N) 
Array[(Int, String)] 
2 spark 
1 end 
local top N 
local top N 
reduction 
// Step 6. Swap tuples (Complete code) 
val top = freq.map(_.swap).top(N)
Spark: Frequency Example
val spark = new SparkContext() 
// Create RDD from Hadoop text file 
val docs = spark.textFile("hdfs://docs/") 
// Split lines into words and process 
val lower = docs.map(line => line.toLowerCase) 
val words = lower.flatMap(line => line.split("\\s+")) 
val counts = words.map(word => (word, 1)) 
// Count all words 
val freq = counts.reduceByKey(_ + _) 
// Swap tuples and get top results 
val top = freq.map(_.swap).top(N) 
top.foreach(println)
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Spark: Streaming
• Real-time computation 
• Similar to Apache Storm… 
• Streaming input split into sliding windows of RDDs 
• Input replicated in memory for fault tolerance 
• Supports input from Kafka, Flume, ZeroMQ, 
HDFS, S3, Kinesis, Twitter, …
Spark: Streaming
Discretized Stream 
Windowed Computations
Spark: Streaming
TwitterUtils.createStream() 
.filter(_.getText.contains("Spark")) 
.countByWindow(Seconds(5))
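For context, a minimal sketch of the surrounding StreamingContext setup (Spark 1.x API; assumes the spark-streaming-twitter artifact on the classpath and Twitter credentials configured via system properties):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("SparkTweets")
val ssc  = new StreamingContext(conf, Seconds(1))         // 1-second micro-batches
val tweets = TwitterUtils.createStream(ssc, None)         // None = default OAuth credentials
val counts = tweets.filter(_.getText.contains("Spark"))
                   .countByWindow(Seconds(5), Seconds(5)) // window length, slide interval
counts.print()
ssc.start()
ssc.awaitTermination()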
Spark: SQL
• Spark SQL allows relational queries 
expressed in SQL, HiveQL or Scala 
• Uses SchemaRDDs composed of Row objects 
(= a table in a traditional RDBMS) 
• A SchemaRDD can be created from an 
• Existing RDD (see the sketch below) 
• Parquet file 
• JSON dataset 
• By running HiveQL against data stored in Apache Hive 
• Supports a domain specific language for 
writing queries
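A minimal sketch of turning the word-count RDD (freq) from the earlier example into a SchemaRDD that the query on the next slide can use (Spark 1.x API; the case class and table name are assumptions):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._                      // sql(), registerFunction and implicit RDD-to-SchemaRDD conversion

case class WordCount(word: String, total: Int)
val countsTable = freq.map { case (word, total) => WordCount(word, total) }
countsTable.registerTempTable("counts")  // now queryable as "counts"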
Spark: SQL
registerFunction("LEN", (_: String).length) 
val queryRdd = sql(""" 
  SELECT * FROM counts 
  WHERE LEN(word) = 10 
  ORDER BY total DESC 
  LIMIT 10 
""") 
queryRdd 
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}") 
  .collect() 
  .foreach(println)
Spark: GraphX
• GraphX is the Spark API for graphs 
and graph-parallel computation 
• APIs to join and traverse graphs 
• Optimally partitions and indexes 
vertices & edges (represented as RDDs) 
• Supports PageRank, connected 
components, triangle counting, …
Spark: GraphX
val graph = Graph(userIdRDD, assocRDD) 
val ranks = graph.pageRank(0.0001).vertices 
val userRDD = sc.textFile("graphx/data/users.txt") 
val users = userRDD.map { line => 
  val fields = line.split(",") 
  (fields(0).toLong, fields(1)) 
} 
val ranksByUsername = users.join(ranks).map { 
  case (id, (username, rank)) => (username, rank) 
}
Spark: MLlib
• Machine learning library similar to 
Apache Mahout 
• Supports statistics, regression, decision 
trees, clustering, PCA, gradient 
descent, … 
• Iterative algorithms much faster due to 
in-memory processing
Spark: MLlib
val data = sc.textFile("data.txt") 
val parsedData = data.map { line => 
  val parts = line.split(',') 
  LabeledPoint( 
    parts(0).toDouble, 
    Vectors.dense(parts(1).split(' ').map(_.toDouble))) 
} 
val model = LinearRegressionWithSGD.train(parsedData, 100) 
val valuesAndPreds = parsedData.map { point => 
  val prediction = model.predict(point.features) 
  (point.label, prediction) 
} 
val MSE = valuesAndPreds 
  .map { case (v, p) => math.pow((v - p), 2) }.mean()
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Use Case: Yahoo Native Ads
Logistic regression 
algorithm 
• 120 LOC in Spark/Scala 
• 30 min. on model creation for 
100M samples and 13K 
features 
Initial version launched 
within 2 hours after Spark-on- 
YARN announcement 
• Compared: Several days on 
hardware acquisition, system 
setup and data movement 
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Use Case: Yahoo Mobile Ads
Learn from mobile search 
ads clicks data 
• 600M labeled examples on 
HDFS 
• 100M sparse features 
Spark programs for 
Gradient Boosting Decision 
Trees 
• 6 hours for model training 
with 100 workers 
• Model with accuracy very 
close to heavily-manually-tuned 
Logistic Regression 
models 
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Spark-on-YARN (Current)
Hadoop 2 – Spark as YARN App
• Pig, Hive, … on the Tez execution engine
• MapReduce execution engine
• Spark (in-memory)
• Storm (streaming), …
• YARN – cluster resource management
• HDFS – redundant, reliable storage
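To make this concrete, submitting an application to YARN with Spark 1.x looks roughly like this (class name, jar path and resource sizes are illustrative only):
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  lib/spark-examples-*.jar 10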
Spark-on-YARN (Future)
Hadoop 2 – Spark as Execution Engine
• Pig, Hive, Mahout, … running on interchangeable execution engines: MapReduce, Tez or Spark
• Storm (streaming), … as long-running services via Slider
• YARN – cluster resource management
• HDFS – redundant, reliable storage
Spark: Future work
• Spark Core 
• Focus on maturity, optimization & pluggability 
• Enable long-running services (Slider) 
• Give resources back to cluster when idle 
• Integrate with Hadoop enhancements 
• Timeline server 
• ORC File Format 
• Spark Eco System 
• Focus on adding capabilities
One more thing…
Let’s get started with 
Spark!
Hortonworks Sandbox 2.2
http://hortonworks.com/hdp/downloads/
Hortonworks Sandbox 2.2
// 1. Download 
wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz 
// 2. Untar 
tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz 
// 3. Start Spark Shell 
./bin/spark-shell
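Once the shell is up, a couple of first commands to verify the installation (sc is provided by the shell; the data is arbitrary):
val nums = sc.parallelize(1 to 1000)
nums.filter(_ % 2 == 0).count()  // should return 500
nums.map(_ * 2).take(5)          // Array(2, 4, 6, 8, 10)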
Thanks for listening
Twitter: 
@uweseiler 
Mail: 
uwe.seiler@codecentric.de 
XING: 
https://www.xing.com/profile/Uwe_Seiler
