Tachyon and Apache Spark:
heralds of the in-memory computing era
Roman Shaposhnik 
Director of Open Source @Pivotal 
(Twitter: @rhatr)
Who’s this guy? 
• Director of Open Source @Pivotal 
• Apache Software Foundation guy (Member, VP of Apache 
Incubator, committer on Hadoop, Giraph, Sqoop, etc.)
• Used to be root@Cloudera 
• Used to be PHB@Yahoo! (original Hadoop team)
Dearly beloved…
20 minutes to figure out
Hadoop vs. Spark
20 minutes to figure out
Hadoop++ == Spark
20 minutes to figure out
Hadoop + Spark
But wait! There’s more! 
Tachyon
Long, long time ago… 
HDFS 
ASF Projects 
FLOSS Projects 
Pivotal Products 
MapReduce
In the blink of an eye
MLlib
Shark 
GraphX 
Streaming 
HDFS 
Crunch Mahout 
Pig 
Sqoop Flume 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire XD 
Oozie 
MapReduce 
Hive 
Tez 
Giraph 
Hadoop UI 
Hue 
SolrCloud 
Phoenix 
HBase 
Spark 
Impala 
HAWQ 
SpringXD 
MADlib 
Hamster 
PivotalR 
YARN 
Tachyon
A Spark view? 
HDFS 
MLlib
Shark 
YARN 
GraphX 
Streaming 
Tachyon 
Sqoop Flume 
Hadoop UI 
Hue 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire XD 
Oozie 
SolrCloud 
Phoenix 
HBase Spark 
SpringXD
BDAS
Long, long time ago…
This is 2014
What changed?
Your datacenter 
… 
server 1 
server N
Hadoop’s view 
MapReduce 
server 1 
server N 
HDFS
HDFS: decoupled storage 
… 
MR 
HDFS 
MR
Anatomy of MapReduce 
HDFS → mappers → reducers → HDFS
HDFS input: a b c / d a c
mappers emit: a 1, b 1, c 1 / a 1, c 1 / a 1
reducers receive: a 1 1 1 / b 1 / c 1 1
HDFS output: a 3, b 1, c 2
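The word-count flow on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's API. The function names and the three input lines are assumptions, chosen so that the result matches the counts on the slide (a 3, b 1, c 2).

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(key, values):
    """Reducer: collapse the list of counts for one key into a total."""
    return key, sum(values)

def word_count(lines):
    emitted = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input, one line per mapper (not taken from the slide).
lines = ["a b c", "a c", "a"]
print(word_count(lines))  # {'a': 3, 'b': 1, 'c': 2}
```

In real MapReduce each phase reads from and writes to disk (HDFS at both ends, local spill files in between), which is the cost the following slides question.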
What’s wrong with MR? 
Source: UC Berkeley Spark project (just the image)
This looks familiar… 
$ grep -R | awk | sort …
Spark innovations 
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster 
• Manipulated via parallel operators (map, etc.) 
• Automatically rebuilt on failure 
• A parallel ecosystem 
• A solution to iterative and multi-stage apps
RDDs 
warnings = textFile(…).filter(_.contains("warning"))
.map(_.split(' ')(1))
HadoopRDD 
path = hdfs:// 
FilteredRDD 
contains… 
MappedRDD 
split…
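The chain HadoopRDD → FilteredRDD → MappedRDD is a lineage: each dataset records how to derive itself from its parent rather than holding the data. The toy model below sketches that idea in plain Python. `ToyRDD`, `from_lines`, and the sample log lines are all hypothetical names invented for illustration; this is not Spark's API or its implementation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: stores how to compute its records, not the records."""

    def __init__(self, compute, parent=None, label="source"):
        self._compute = compute  # function: parent's records -> this dataset's records
        self.parent = parent
        self.label = label

    def filter(self, predicate):
        return ToyRDD(lambda rows: [r for r in rows if predicate(r)], self, "filter")

    def map(self, fn):
        return ToyRDD(lambda rows: [fn(r) for r in rows], self, "map")

    def collect(self):
        """Walk the lineage back to the source and recompute every step."""
        rows = self.parent.collect() if self.parent else None
        return self._compute(rows)

    def lineage(self):
        chain = self.parent.lineage() if self.parent else []
        return chain + [self.label]


def from_lines(lines):
    """Hypothetical source dataset backed by an in-memory list of lines."""
    return ToyRDD(lambda _: list(lines))


# Hypothetical log lines; the pipeline mirrors the filter/map on the slide.
log_lines = ["INFO boot ok", "warning disk nearly full", "warning fan slow"]
warnings_rdd = (
    from_lines(log_lines)
    .filter(lambda line: "warning" in line)
    .map(lambda line: line.split(" ")[1])
)

print(warnings_rdd.lineage())  # ['source', 'filter', 'map']
print(warnings_rdd.collect())  # ['disk', 'fan']
```

Because nothing is materialized until `collect()` is called, a lost result can be rebuilt by replaying the recorded steps from the source, which is the intuition behind "automatically rebuilt on failure".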
Parallel operators 
• map, reduce 
• sample, filter 
• groupBy, reduceByKey 
• join, leftOuterJoin, rightOuterJoin 
• union, cross
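To make a few of these operators concrete, here is a rough single-machine model of what reduceByKey, join, leftOuterJoin, and union compute on key/value pairs. The helper names and sample data are assumptions for illustration. Spark runs the same logic partitioned across a cluster, and this sketch uses `None` where the right side of an outer join is missing.

```python
from collections import defaultdict
from functools import reduce

def _index(pairs):
    """Group values by key: {key: [values...]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_by_key(pairs, fn):
    """Combine all values that share a key using a binary function."""
    return {key: reduce(fn, values) for key, values in _index(pairs).items()}

def join(left, right):
    """Inner join: one output pair per matching (left value, right value)."""
    right_index = _index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [])]

def left_outer_join(left, right):
    """Like join, but keeps left keys with no match, pairing them with None."""
    right_index = _index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [None])]

def union(left, right):
    """Concatenate two datasets; duplicates are kept."""
    return list(left) + list(right)

# Hypothetical sample data.
visits = [("alice", 3), ("bob", 1), ("alice", 2)]
emails = [("alice", "alice@example.com")]

totals = reduce_by_key(visits, lambda x, y: x + y)
print(totals)                                        # {'alice': 5, 'bob': 1}
print(join(list(totals.items()), emails))            # [('alice', (5, 'alice@example.com'))]
print(left_outer_join(list(totals.items()), emails))
# [('alice', (5, 'alice@example.com')), ('bob', (1, None))]
```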
What is really happening? 
MLlib
Shark 
GraphX 
Streaming 
HDFS 
Crunch Mahout 
Pig 
Sqoop Flume 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire XD 
Oozie 
MapReduce 
Hive 
Tez 
Giraph 
Hadoop UI 
Hue 
SolrCloud 
Phoenix 
HBase 
Spark 
Impala 
HAWQ 
SpringXD 
MADlib 
Hamster 
PivotalR 
YARN 
Tachyon
Maybe it’s not so bad
server 1 
server N
But HDFS/YARN are safe? 
HDFS, Ceph, S3, NAS, etc. 
New 
HDFS 
New 
YARN
Tachyon 
• In-memory data-exchange layer 
• A set of evolving APIs: 
• filesystem 
• caching 
• RDDs 
• Materialized views
Tachyon
Spark is best for cloud
It will be called Hadoop 
MLlib
Shark 
GraphX 
Streaming 
HDFS 
Crunch Mahout 
Pig 
Sqoop Flume 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire with Tachyon 
Oozie 
MapReduce 
Hive 
Tez 
Giraph 
Hadoop UI 
Hue 
SolrCloud 
Phoenix 
HBase 
Spark 
Impala 
HAWQ 
SpringXD 
MADlib 
Hamster 
PivotalR 
YARN
Spark/Tachyon recap 
• Is it “Big Data”? (Yes)
• Is it “Hadoop”? (No)
• It’s one of those “in memory” things, right? (Yes)
• JVM, Java, Scala? (All)
• Is it real, or just another shiny technology with
a long but ultimately small tail? (Yes and ?)
A NEW PLATFORM FOR A NEW 
ERA
Questions?
