Achieving Separation of Compute and Storage
in a Cloud World
Dipti Borkar | Vice President, Product | Alluxio
Attendee Poll
Dipti Borkar,
VP of Product at Alluxio
Dipti has over 15 years of experience in data and database
technologies across relational and non-relational data. Prior
to Alluxio, Dipti was VP of Product Marketing at Kinetica and
Couchbase. At Couchbase she held several leadership
positions, including Head of Global Technical Sales and
Head of Product Management.
Earlier in her career Dipti managed development teams at
IBM DB2, where she started her career as a database software
engineer.
Dipti holds an M.S. in Computer Science from UC San Diego
and an MBA from the Haas School of Business at UC Berkeley.
Today’s Speaker
Agenda
Why storage-independent compute?
Alluxio Technology Overview
Real-world Use Cases
From mainframes to Big Data
Moving from tightly integrated to loosely integrated architectures
Application, processing, data
storage and hardware -
All-in-one tightly coupled
Client server architecture
drives application separation.
Processing and data storage
still tightly coupled
Data growth drives
distributed MPP architectures
but processing and data
storage still tightly coupled
Further data growth drives
distributed file system
architecture. Processing and
data storage co-located but
loosely coupled
The Big Data Ecosystem
Co-located compute and storage for big data workloads
§ More defined and loosely coupled
compute layer compared with relational
databases
§ But compute / data processing still runs
on the same nodes where the data is
stored. MapReduce runs on HDFS across
the cluster
§ Compute layer and storage layer must be
scaled out by the same factor
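The cost of scaling both layers by the same factor can be illustrated with a little arithmetic. This is a minimal sketch under assumed numbers: the function names, node sizes, and workload figures are hypothetical and do not come from the slides.

```python
import math

def coupled_nodes(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """Nodes required when every node carries both compute and storage."""
    for_compute = math.ceil(cores_needed / cores_per_node)
    for_storage = math.ceil(tb_needed / tb_per_node)
    # The cluster must satisfy the larger of the two demands.
    return max(for_compute, for_storage)

def decoupled_nodes(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """(compute nodes, storage nodes) when each layer is sized on its own."""
    return (math.ceil(cores_needed / cores_per_node),
            math.ceil(tb_needed / tb_per_node))

# Hypothetical workload: modest compute demand, large data volume.
nodes = coupled_nodes(cores_needed=160, tb_needed=2000,
                      cores_per_node=16, tb_per_node=20)
idle_cores = nodes * 16 - 160  # cores bought only to get the disks

print(nodes, idle_cores)
print(decoupled_nodes(160, 2000, 16, 20))
```

In the coupled case the storage demand dictates the node count, so most of the purchased cores sit idle; sized separately, the compute layer needs only a fraction of the nodes.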
CLOUD DATA
Mega trends driving the need for a new architecture
The Big Data Ecosystem Explodes
Moving from tightly integrated to loosely integrated architectures
STORAGE
COMPUTE
Why independently scale compute and storage for data-driven
applications?
Flexible compute scaling based
on application demands
Flexible storage scaling based
on data growth patterns
Compute is CPU bound
Storage is I/O bound
Why independently scale compute and storage for data-driven
applications?
Reduced data duplication by
using the same storage for
multiple compute frameworks
Leverage cheaper and newer
storage like object stores for
big data / AI workloads
Orchestrate & automate
compute for greater
operational efficiency
Protect & control your data on
premises and leverage public
cloud for compute
STORAGE
COMPUTE
An independently scaling Big Data Stack?
The challenges of independent scaling for data-driven workloads
Data Locality
Data Accessibility
Data Abstraction
Data is no longer local to compute, so
workload processing time increases,
particularly in hybrid cloud deployments
Data is in multiple storage systems in multiple
locations. Highly complex when all compute
frameworks talk to all storage systems
Data can still only be accessed using the
specific storage system APIs
STORAGE
COMPUTE
Truly independent scaling of the data stack
Data Locality | Data Accessibility | Data Abstraction
A new layer emerges between Compute & Storage
Attendee Poll
Alluxio Technology Overview
The Alluxio Story
Project started as Tachyon at UC Berkeley's AMPLab by
then Ph.D. student & now Alluxio CEO, Haoyuan (H.Y.) Li.
2014
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Unify Data at Memory Speed for data-driven
applications such as Big Data Analytics, ML and AI.
2018
Top 10 Hottest Data Storage Startup
Virtual Unified File System
Java File API | HDFS Interface | S3 Interface | REST API | FUSE Interface
HDFS Driver | Swift Driver | S3 Driver | NFS Driver
Unified
Namespace
Bring all files into a
single interface
Interact with data
using any API
Accelerate & tier
data transparently
API
Translation
Intelligent
Multi-tiering
Key Innovations of the Virtual Unified File System
Unified Namespace
Enables effective data management across different Under Stores
- Uses mounting with transparent naming
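Mounting with transparent naming can be pictured as a table that maps logical path prefixes to under-store locations. The following is a toy sketch only: `MountTable`, its methods, and the example URIs are hypothetical illustrations, not Alluxio's API.

```python
class MountTable:
    """Toy unified namespace: logical mount points -> under-store base URIs."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, under_store_uri):
        # Normalize so "/" stays "/" and "/sales/" becomes "/sales".
        point = mount_point.rstrip("/") or "/"
        self._mounts[point] = under_store_uri.rstrip("/")

    def resolve(self, logical_path):
        """Translate a logical path using the longest matching mount point."""
        best = None
        for point in self._mounts:
            prefix = point if point.endswith("/") else point + "/"
            if logical_path == point or logical_path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        rest = logical_path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{rest}" if rest else base


# Hypothetical mounts: an HDFS cluster at the root, an object store under /sales.
table = MountTable()
table.mount("/", "hdfs://namenode:8020/data")
table.mount("/sales", "s3://example-bucket/sales")

print(table.resolve("/sales/2019/q1.parquet"))
print(table.resolve("/logs/app.log"))
```

Applications see one directory tree; which storage system actually holds a path is decided by the mount table rather than by the application.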
Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available
locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio
by central IT
• Security in Alluxio mirrors
source data
• Authentication through
LDAP/AD
• Wireline encryption
HDFS #1
Object Store
NFS
HDFS #2
Server-side API Translation: From legacy to modern
Convert from Client-side Interface to native Storage Interface
Java File API | HDFS Interface | S3 Interface | REST API | FUSE Interface
HDFS Driver | Swift Driver | S3 Driver | NFS Driver
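The idea of translating a client-side interface into a storage-native one can be sketched with a small adapter. This is an illustrative toy, assuming a flat key/value object store behind a file-style facade; `ObjectStoreDriver` and `FileApi` are hypothetical names, not real Alluxio classes.

```python
class ObjectStoreDriver:
    """Toy driver for a flat object store: keys only, no real directories."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]

    def keys_with_prefix(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))


class FileApi:
    """File-style facade whose calls are translated into driver calls."""

    def __init__(self, driver):
        self.driver = driver

    def write_file(self, path, data):
        self.driver.put(path.lstrip("/"), data)

    def read_file(self, path):
        return self.driver.get(path.lstrip("/"))

    def list_dir(self, path):
        # A directory listing becomes a prefix scan over flat keys.
        prefix = path.strip("/")
        prefix = prefix + "/" if prefix else ""
        children = set()
        for key in self.driver.keys_with_prefix(prefix):
            children.add(key[len(prefix):].split("/", 1)[0])
        return sorted(children)


fs = FileApi(ObjectStoreDriver())
fs.write_file("/logs/2019/a.txt", b"alpha")
fs.write_file("/logs/2019/b.txt", b"beta")
fs.write_file("/logs/readme", b"notes")

print(fs.list_dir("/logs"))
print(fs.read_file("/logs/2019/a.txt"))
```

The application keeps using directory-style calls while the driver only ever sees object keys, which is the separation the slide describes.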
Intelligent Multi-tiering: Get high-value data faster
Local performance from remote data using multi-tier storage
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
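Pinning, promotion, and demotion across tiers can be sketched with a small in-memory model. This is a toy under stated assumptions: capacities are counted in blocks, eviction is least-recently-used, TTL is left out, and `TieredCache` is a hypothetical class rather than Alluxio's implementation.

```python
from collections import OrderedDict

class TieredCache:
    """Toy multi-tier store: reads promote a block, overflow demotes one."""

    def __init__(self, capacities):
        # capacities: ordered list of (tier_name, max_blocks), fastest first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def pin(self, block_id):
        self.pinned.add(block_id)

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None

    def _remove(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                return blocks.pop(block_id)
        return None

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: evicted entirely
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        while len(blocks) > cap:
            # Demote the least recently used block that is not pinned.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything here is pinned; this toy tolerates overflow
            self._insert(level + 1, victim, blocks.pop(victim))

    def put(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)

    def get(self, block_id):
        data = self._remove(block_id)
        if data is not None:
            self._insert(0, block_id, data)  # promote to the fastest tier
        return data


cache = TieredCache([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    cache.put(block, block.upper())
cache.get("a")  # a cold block is read and promoted

print({b: cache.tier_of(b) for b in ("a", "b", "c")})
```

The application only calls `put` and `get`; where a block lives is decided by the policy, which is what "transparent to app" means on the slide.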
Alluxio Reference Architecture
Alluxio
Master
Zookeeper /
RAFT
Standby
Master
Alluxio
Worker
Alluxio
Worker
Alluxio
Client
RAM / SSD / HDD
RAM / SSD / HDD
Under Store 1
Under Store 2
Application
WAN
Alluxio
Client
Application
Alluxio Data Path
Data Flow In Alluxio
1. Applications Read/Write data via the Alluxio Client
2. Read Scenarios
• Data not in Alluxio (i.e. first time, or no cache)
• Data on same node as client
• Data on different node from client
3. Write Scenarios
• Write only to Alluxio
• Write only to Under Store
• Write synchronously to Alluxio and Under Store
• Write to Alluxio and asynchronously write to Under Store
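The four write scenarios and the cached/uncached read scenarios can be modelled in a few lines. This is a simplified sketch: `WriteMode` and `ToyDataLayer` are hypothetical names, the distinction between local and remote workers is omitted, and caching a block on its first read is an assumption of the sketch.

```python
from enum import Enum

class WriteMode(Enum):
    CACHE_ONLY = "write only to the cache layer"
    UNDER_STORE_ONLY = "write only to the under store"
    SYNC_BOTH = "write to both before returning"
    ASYNC_BOTH = "write to the cache now, persist later"

class ToyDataLayer:
    """Toy model of a cache layer sitting in front of an under store."""

    def __init__(self):
        self.cache = {}
        self.under_store = {}
        self.pending = []  # paths waiting for asynchronous persistence

    def write(self, path, data, mode):
        if mode is not WriteMode.UNDER_STORE_ONLY:
            self.cache[path] = data
        if mode in (WriteMode.UNDER_STORE_ONLY, WriteMode.SYNC_BOTH):
            self.under_store[path] = data
        if mode is WriteMode.ASYNC_BOTH:
            self.pending.append(path)

    def persist_pending(self):
        """Stand-in for the background job that drains async writes."""
        while self.pending:
            path = self.pending.pop(0)
            self.under_store[path] = self.cache[path]

    def read(self, path):
        """Return (data, source); a miss is fetched and cached (assumption)."""
        if path in self.cache:
            return self.cache[path], "cache"
        data = self.under_store[path]
        self.cache[path] = data
        return data, "under store"


layer = ToyDataLayer()
layer.write("/tmp/scratch", b"s", WriteMode.CACHE_ONLY)
layer.write("/archive/old", b"o", WriteMode.UNDER_STORE_ONLY)
layer.write("/reports/q1", b"r", WriteMode.ASYNC_BOTH)

print(layer.read("/archive/old"))  # first read comes from the under store
print(layer.read("/archive/old"))  # repeat read is served from the cache
```

The trade-off is durability against latency: cache-only writes are fastest but exist nowhere else, synchronous writes pay the under-store cost up front, and asynchronous writes defer it.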
Read data in Alluxio, on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
Read data not in Alluxio
RAM / SSD / HDD
Network / Disk Speed Read of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
Write data only to Alluxio on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
Write data to Alluxio and Under Store synchronously
RAM / SSD / HDD
Network / Disk Speed Write of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
Metadata Path
Metadata Path: Familiar Semantics
• Alluxio also provides local metadata to the compute layer
• Listing / renaming on object store can be expensive
• Alluxio speeds up these operations
• Alluxio loads and manages metadata in master
• Apps can continue assuming HDFS-like semantics
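Why listing and renaming are expensive on an object store, and how serving metadata from memory helps with repeated listings, can be shown with a counting toy. The classes below are hypothetical; the sketch ignores cache invalidation and does not model how a real metadata service propagates renames to the under store.

```python
class CountingObjectStore:
    """Toy flat object store that counts how often it is asked to list keys."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.list_calls = 0

    def list_prefix(self, prefix):
        self.list_calls += 1
        return sorted(k for k in self.keys if k.startswith(prefix))

    def rename_prefix(self, old, new):
        """No native rename: every object is copied, then deleted.

        Returns the number of per-object operations performed.
        """
        ops = 0
        for key in self.list_prefix(old):
            self.keys.add(new + key[len(old):])  # copy to the new key
            self.keys.remove(key)                # delete the old key
            ops += 2
        return ops


class MetadataCache:
    """Serves repeated listings from memory after the first load."""

    def __init__(self, store):
        self.store = store
        self.listings = {}

    def list_prefix(self, prefix):
        if prefix not in self.listings:
            self.listings[prefix] = self.store.list_prefix(prefix)
        return self.listings[prefix]


store = CountingObjectStore(["logs/a", "logs/b", "logs/c", "tmp/x"])
meta = MetadataCache(store)
for _ in range(3):
    meta.list_prefix("logs/")

print(store.list_calls)  # listings that actually reached the object store
print(store.rename_prefix("logs/", "archive/"))  # operations for one "rename"
```

Three listings cost one round trip to the object store, while renaming a three-object "directory" costs two operations per object, which is why a file-system-style metadata layer matters for frameworks that assume HDFS-like semantics.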
Real-World Use Cases
Virtual
Data Lake
§ Accelerate batch, micro-
batch & streaming jobs
§ Slowly transition to
lower cost object stores
§ Run in hybrid cloud
environment with
compute in the cloud
§ Accelerate ML jobs
running on object stores
or file systems
§ Provide consistent
performance to data
scientists
§ Provide unified interface
to access all data
§ Accelerate & tier data
transparently across
storage tiers
§ Co-locate remote data
with compute for
performance
Machine Learning
Productivity
Self-service data
across hybrid cloud
Popular Technical Use Cases
China Unicom
Challenge
Desired a central view of business
data across multiple systems for big
data workloads
Solution
Alluxio integrates data across multiple storage systems to be
accessed by Spark in a hybrid environment
Impact
Significantly faster workloads and faster innovation
Machine Learning Case Study
Challenge –
Gain end to end view of business
with large volume of data while
complying with regional data
regulations
Solution –
ETL Data from Teradata to Alluxio
Impact –
Faster Time to Market – “Now we
don’t have to work Sundays”
Use Case: http://bit.ly/2oMx95W
[Diagram labels: SPARK, TERADATA]
Analytics Use Case – Top Retailer
Challenge –
Bottleneck in Trend Analysis of
mission critical daily sales and
inventory management
Queries were slow / not interactive,
resulting in operational inefficiency
Solution –
With Alluxio, data queries are 10X
faster
Impact –
Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
[Diagram labels: SPARK, HDFS]
Incredible Open Source Momentum with growing community
900+ contributors &
growing
3760+ Git Stars
Apache 2.0 Licensed
Hundreds of thousands
of downloads
Thank You
Questions? Email me: dipti@alluxio.com
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | Twitter: @Alluxio | Slack





