Tachyon and Apache Spark

Tachyon and Apache Spark:
heralds of the in-memory computing era.
Roman Shaposhnik
Director of Open Source @Pivotal
(Twitter: @rhatr)

Who’s this guy?
• Director of Open Source @Pivotal
• Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc.)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)

20 minutes to figure out
Hadoop vs. Spark

Hadoop++ == Spark

Hadoop + Spark

But wait! There’s more!
Tachyon

Long, long time ago…
[Diagram: HDFS and MapReduce]
Legend: ASF Projects, FLOSS Projects, Pivotal Products

In a blink of an eye
[Diagram of components]
Components: HDFS, YARN, MapReduce, Tez, Hive, Pig, Crunch, Mahout, Giraph, Sqoop, Flume, Oozie, Zookeeper, Hue, SolrCloud, Phoenix, HBase, Impala, HAWQ, MADlib, Hamster, PivotalR, GemFire XD, SpringXD, Command Center, Spark, Shark, MLlib, GraphX, Streaming, Tachyon
Labels: Coordination and workflow management, Hadoop UI
Legend: ASF Projects, FLOSS Projects, Pivotal Products

A Spark view?
[Diagram of components]
Components: HDFS, YARN, Tachyon, Spark, Shark, MLlib, GraphX, Streaming, Sqoop, Flume, Oozie, Zookeeper, Hue, SolrCloud, Phoenix, HBase, GemFire XD, SpringXD, Command Center
Labels: Coordination and workflow management, Hadoop UI
Legend: ASF Projects, FLOSS Projects, Pivotal Products

Your datacenter
server 1 … server N

Hadoop’s view
[Diagram: MapReduce and HDFS on server 1 … server N]

HDFS: decoupled storage
[Diagram: MR, …, MR, HDFS]

Anatomy of MapReduce
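[Diagram: HDFS -> mappers -> reducers -> HDFS]
HDFS (input): "a b c", "d a c"
mappers emit: (a 1, b 1, c 1), (a 1, c 1), (a 1)
reducers receive: a: 1 1 1; b: 1; c: 1 1
HDFS (output): a 3, b 1, c 2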
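
The map, shuffle and reduce stages on this slide can be sketched in plain Scala collections. This is only an illustration of the data flow, not Hadoop's API: the object and function names are made up for this sketch, and the three input splits are an assumption chosen so that the result matches the counts shown on the slide.

```scala
object WordCountSketch {
  // Map phase: each input split is turned into (word, 1) pairs independently.
  def mapPhase(split: String): Seq[(String, Int)] =
    split.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle: group the emitted 1s by word, as happens between the two phases.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: sum the grouped values for each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  def wordCount(splits: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(splits.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    // Assumed input splits (not taken from the slide).
    val counts = wordCount(Seq("a b c", "a c", "a"))
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w $n") }
  }
}
```

In real MapReduce every stage boundary goes through disk (HDFS in, HDFS out), which is the cost the following slides are about.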

What’s wrong with MR?
Source: UC Berkeley Spark project (just the image)

This looks familiar…
$ grep -R | awk | sort …

Spark innovations
• Resilient Distributed Datasets (RDDs)
  • Distributed on a cluster
  • Manipulated via parallel operators (map, etc.)
  • Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps

RDDs
warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))
[Lineage diagram: HadoopRDD (path = hdfs://) -> FilteredRDD (contains…) -> MappedRDD (split…)]
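
The lineage idea behind this slide can be sketched without Spark. The toy `Dataset` trait below is hypothetical (it is not Spark's RDD class): each node stores only its parent and the function to apply, so the contents can be recomputed from the source at any time, which is roughly what "automatically rebuilt on failure" relies on. The log lines are made-up sample data.

```scala
object LineageSketch {
  // A toy stand-in for an RDD: it stores no data, only how to compute it.
  sealed trait Dataset[A] {
    def compute(): Seq[A]       // re-derive the contents by walking the lineage
    def lineage: List[String]   // chain of steps, source first
    def filter(label: String)(p: A => Boolean): Dataset[A] = Filtered(this, label, p)
    def map[B](label: String)(f: A => B): Dataset[B] = Mapped(this, label, f)
  }

  final case class Source[A](name: String, load: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = load()
    def lineage: List[String] = List(s"Source($name)")
  }

  final case class Filtered[A](parent: Dataset[A], label: String, p: A => Boolean)
      extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
    def lineage: List[String] = parent.lineage :+ s"Filtered($label)"
  }

  final case class Mapped[A, B](parent: Dataset[A], label: String, f: A => B)
      extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
    def lineage: List[String] = parent.lineage :+ s"Mapped($label)"
  }

  // Same shape as the slide's example, over assumed sample log lines.
  def demo(): Dataset[String] = {
    val lines = Seq("warning disk full", "info ok", "warning cpu hot")
    Source("log", () => lines)
      .filter("contains")(_.contains("warning"))
      .map("split")(_.split(' ')(1))
  }

  def main(args: Array[String]): Unit = {
    val warnings = demo()
    println(warnings.lineage.mkString(" -> "))
    println(warnings.compute().mkString(", "))
  }
}
```

Nothing is evaluated until `compute()` is called, and calling it again simply replays the chain from the source.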

Parallel operators
• map, reduce
• sample, filter
• groupBy, reduceByKey
• join, leftOuterJoin, rightOuterJoin
• union, cross
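
For readers who have not used these operators, here is a small local sketch of what three of them compute. It uses ordinary Scala collections, not Spark; the helper functions are written for this illustration and only mimic the semantics of `reduceByKey`, `join` and `leftOuterJoin` on a single machine.

```scala
object OperatorsSketch {
  // reduceByKey: combine all values that share a key with the given function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // join: inner join of two keyed collections on equal keys.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    left.flatMap { case (k, a) =>
      right.collect { case (k2, b) if k2 == k => (k, (a, b)) }
    }

  // leftOuterJoin: keep every left record; the right side is None when unmatched.
  def leftOuterJoin[K, A, B](left: Seq[(K, A)],
                             right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] =
    left.flatMap { case (k, a) =>
      val matches: Seq[B] = right.collect { case (k2, b) if k2 == k => b }
      val padded: Seq[Option[B]] =
        if (matches.isEmpty) Seq(Option.empty[B])
        else matches.map(b => (Some(b): Option[B]))
      padded.map(ob => (k, (a, ob)))
    }

  def main(args: Array[String]): Unit = {
    val hits  = Seq(("a", 1), ("b", 1), ("a", 1))
    val names = Seq(("a", "alpha"))
    println(reduceByKey(hits)(_ + _))
    println(join(hits, names))
    println(leftOuterJoin(hits, names))
  }
}
```

In Spark the same operators run per partition across the cluster; the semantics per key are the part this sketch tries to show.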

What is really happening?
[Diagram of components]
Components: HDFS, YARN, MapReduce, Tez, Hive, Pig, Crunch, Mahout, Giraph, Sqoop, Flume, Oozie, Zookeeper, Hue, SolrCloud, Phoenix, HBase, Impala, HAWQ, MADlib, Hamster, PivotalR, GemFire XD, SpringXD, Command Center, Spark, Shark, MLlib, GraphX, Streaming, Tachyon
Labels: Coordination and workflow management, Hadoop UI
Legend: ASF Projects, FLOSS Projects, Pivotal Products

Maybe it’s not so bad
server 1
server N

But HDFS/YARN are safe?
HDFS, Ceph, S3, NAS, etc.
New HDFS
New YARN

Tachyon
• In-memory data-exchange layer
• A set of evolving APIs:
• filesystem
• caching
• RDDs
• Materialized views
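
One way to picture the "caching" role of such a layer is a read-through cache sitting between jobs and slower durable storage. The sketch below is purely illustrative: `BackingStore` and `InMemoryLayer` are hypothetical names invented here, and nothing in it is Tachyon's actual API.

```scala
import scala.collection.mutable

object MemoryLayerSketch {
  // Hypothetical stand-in for a durable store such as HDFS or S3.
  final class BackingStore(files: Map[String, Seq[String]]) {
    var reads: Int = 0  // counts how often the slow store is actually hit
    def read(path: String): Seq[String] = { reads += 1; files(path) }
  }

  // Hypothetical in-memory layer: the first read goes to the backing store,
  // later reads of the same path are served from memory.
  final class InMemoryLayer(store: BackingStore) {
    private val cache = mutable.Map.empty[String, Seq[String]]
    def read(path: String): Seq[String] =
      cache.getOrElseUpdate(path, store.read(path))
  }

  def main(args: Array[String]): Unit = {
    val store = new BackingStore(Map("/logs/part-0" -> Seq("warning disk full")))
    val layer = new InMemoryLayer(store)
    layer.read("/logs/part-0") // first job: loads from the backing store
    layer.read("/logs/part-0") // second job: served from memory
    println(s"backing store reads: ${store.reads}")
  }
}
```

Two jobs reading the same path through the shared layer touch the backing store only once, which is the data-exchange benefit the slide points at.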

It will be called Hadoop
[Diagram of components]
Components: HDFS, YARN, MapReduce, Tez, Hive, Pig, Crunch, Mahout, Giraph, Sqoop, Flume, Oozie, Zookeeper, Hue, SolrCloud, Phoenix, HBase, Impala, HAWQ, MADlib, Hamster, PivotalR, GemFire with Tachyon, SpringXD, Command Center, Spark, Shark, MLlib, GraphX, Streaming
Labels: Coordination and workflow management, Hadoop UI
Legend: ASF Projects, FLOSS Projects, Pivotal Products

Spark/Tachyon recap
• Is it “Big Data”? (Yes)
• Is it “Hadoop”? (No)
• It’s one of those “in memory” things, right? (Yes)
• JVM, Java, Scala? (All)
• Is it real or just another shiny technology with a long, but ultimately small tail? (Yes and ?)
