SeaweedFS Intro
2019.3
chris.lu@gmail.com
SeaweedFS Intro
● Overview
● Internal Architecture
○ Object/Blob store
○ Filer Store
○ S3/Hadoop
○ Notification/Cross-Region Replication
Overview: What is special?
● Distributed
● Handles large and small files
● Optimized for a large number of small files
● Random access to any file
● Low-latency access to any file
● Parallel processing
Overview: APIs
● REST API for object storage
● REST/gRPC API for file system storage
● Hadoop Compatible
● FUSE client to mount file system locally
● S3 API
Architecture
● Object Storage
● File Storage
● Interface/Client Layer
Volume Store
● Based on Facebook's Haystack paper
Object Storage: Client Write
[Diagram: a client, a master, and three volume servers]
1. HTTP request for a file id
2. Get the file id
3. HTTP upload of the file with the file id
Example file id: 3,01637037d6
● 3: volume id
● 01: file key
● 637037d6: file cookie
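The file id layout above can be sketched as a small parser. This is only an illustration: the `FileId` class and `parse_file_id` function are hypothetical names, and the split assumes the cookie is always the last 8 hex digits, as in the slide's example.

```python
from dataclasses import dataclass

# Assumption (matches the slide's example): the cookie is the last 8 hex digits.
COOKIE_HEX_DIGITS = 8


@dataclass(frozen=True)
class FileId:
    volume_id: int
    file_key: int
    cookie: int


def parse_file_id(fid: str) -> FileId:
    """Split '<volume id>,<file key hex><cookie hex>' into its three parts."""
    volume_part, sep, rest = fid.partition(",")
    if not sep or len(rest) <= COOKIE_HEX_DIGITS:
        raise ValueError(f"malformed file id: {fid!r}")
    key_hex = rest[:-COOKIE_HEX_DIGITS]
    cookie_hex = rest[-COOKIE_HEX_DIGITS:]
    return FileId(int(volume_part), int(key_hex, 16), int(cookie_hex, 16))


example = parse_file_id("3,01637037d6")
print(example.volume_id, hex(example.file_key), hex(example.cookie))
```

The volume id is all a client needs to find a server; the key and cookie are only interpreted by the volume server that holds the volume.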
Object Storage: Client Read
[Diagram: a client, a master, and three volume servers]
1. Look up the volume id
2. Get the volume location
3. HTTP get of the file with the file id
● Volume locations can be cached.
● Clients can also subscribe to volume location changes.
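A rough sketch of the client-side caching described above. The `VolumeLocationCache` class, its method names, and the server addresses are all hypothetical; the `lookup` callable stands in for whatever call a real client makes to the master.

```python
from typing import Callable, Dict, List


class VolumeLocationCache:
    """Caches volume id -> server addresses and asks the master only on a miss."""

    def __init__(self, lookup: Callable[[int], List[str]]):
        self._lookup = lookup  # hypothetical stand-in for a master lookup call
        self._locations: Dict[int, List[str]] = {}
        self.master_calls = 0

    def locate(self, volume_id: int) -> List[str]:
        if volume_id not in self._locations:
            self.master_calls += 1
            self._locations[volume_id] = self._lookup(volume_id)
        return self._locations[volume_id]

    def on_location_change(self, volume_id: int, servers: List[str]) -> None:
        """Apply a pushed update, as a subscribed client might."""
        self._locations[volume_id] = list(servers)

    def url_for(self, fid: str) -> str:
        """Build a read URL on the first known server for the file's volume."""
        volume_id = int(fid.split(",", 1)[0])
        return f"http://{self.locate(volume_id)[0]}/{fid}"


# Toy master table; the addresses are made up for the example.
master_table = {3: ["10.0.0.1:8080"]}
cache = VolumeLocationCache(lambda vid: list(master_table[vid]))
print(cache.url_for("3,01637037d6"))
print(cache.url_for("3,01637037d6"), "master calls:", cache.master_calls)
```

Because every file in a volume shares one location entry, a single lookup serves all later reads from that volume.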
File Storage
[Diagram: a filer client uploads a file to a directory through the filer. The filer keeps metadata in the filer store (Local, MySql, Postgres, Redis, Cassandra) and blobs in object storage (master and three volume servers). S3 clients connect through an S3 API gateway.]
Filer Store Data Layout
/a/b/c/ Attr
/a/b/c/def.txt Attr FileChunks
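One way to picture this layout is a key-value map from full path to an entry, where a directory carries only attributes and a file also carries its chunk list. The field names below (`mode`, `offset`, `size`, and so on) are assumptions for illustration, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileChunk:
    file_id: str  # blob location in object storage, e.g. "3,01637037d6"
    offset: int   # where this chunk starts within the file
    size: int     # chunk length in bytes


@dataclass
class Attr:
    mode: int  # hypothetical attribute; real entries carry more


@dataclass
class Entry:
    attr: Attr
    chunks: List[FileChunk] = field(default_factory=list)  # empty for directories


store: Dict[str, Entry] = {
    "/a/b/c/": Entry(Attr(mode=0o755)),
    "/a/b/c/def.txt": Entry(Attr(mode=0o644), [FileChunk("3,01637037d6", 0, 1024)]),
}


def file_size(path: str) -> int:
    """A file's size is the sum of its chunk sizes."""
    return sum(chunk.size for chunk in store[path].chunks)


def list_dir(directory: str) -> List[str]:
    """Direct children of a directory, found by key prefix."""
    children = []
    for path in store:
        rest = path[len(directory):]
        if path.startswith(directory) and rest and "/" not in rest.rstrip("/"):
            children.append(path)
    return sorted(children)


print(list_dir("/a/b/c/"), file_size("/a/b/c/def.txt"))
```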
Volume-Aware Clients
[Diagram: the Hadoop client, the mounted FUSE client, and other SeaweedFS volume-aware clients get metadata from the filer (backed by a filer store: Local, MySql, Postgres, Redis, Cassandra) and access blobs in object storage (master and three volume servers).]
Volume-based data placement
● Volumes are organized with different settings:
○ Collection
■ TTL
■ Replication
● Master randomly assigns a write request to one of the writable volumes.
● Strongly consistent writes to all replicas.
● If one replica misses its heartbeat, the master marks the volume id as read-only.
● New writes are then assigned to other writable volumes.
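The placement rules above can be modeled in a few lines. This is a toy sketch, not the master's real logic: `ToyMaster`, `Volume`, and the server names are hypothetical, and heartbeat detection is reduced to a single method call.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Volume:
    volume_id: int
    replicas: List[str]  # servers holding a copy of this volume
    writable: bool = True


class ToyMaster:
    def __init__(self, volumes: List[Volume], rng: Optional[random.Random] = None):
        self.volumes: Dict[int, Volume] = {v.volume_id: v for v in volumes}
        self.rng = rng or random.Random()

    def assign(self) -> Volume:
        """Pick one writable volume at random for the next write."""
        writable = [v for v in self.volumes.values() if v.writable]
        if not writable:
            raise RuntimeError("no writable volumes")
        return self.rng.choice(writable)

    def heartbeat_missed(self, server: str) -> None:
        """Mark every volume with a replica on the silent server as read-only."""
        for volume in self.volumes.values():
            if server in volume.replicas:
                volume.writable = False


master = ToyMaster([Volume(1, ["srv-a", "srv-b"]), Volume(2, ["srv-c", "srv-d"])])
master.heartbeat_missed("srv-b")
print({master.assign().volume_id for _ in range(20)})
```

Marking the whole volume read-only avoids writing to only a subset of replicas, so reads from any surviving replica still see the same data.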
Security: per-object access control with JWT
[Diagram: a client, a master, and three volume servers in object storage]
1. Request FileId
2. Get FileId + JWT
3. Upload File with FileId + JWT
● A JSON Web Token (JWT) has permission to create/update/delete a file.
● Expires after 10 seconds.
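A minimal sketch of a short-lived, per-file token, using only the standard library. The HS256 algorithm, the `fid` claim name, and the shared secret are assumptions for illustration; the slides only state that the token is scoped to a file and expires after 10 seconds.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    return _b64url(hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest())


def sign_token(secret: bytes, fid: str, now: float, ttl: int = 10) -> str:
    """Issue a token bound to one file id (hypothetical 'fid' claim)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"fid": fid, "exp": int(now) + ttl}).encode())
    return f"{header}.{payload}.{_sign(secret, header + '.' + payload)}"


def verify_token(secret: bytes, token: str, fid: str, now: float) -> bool:
    """Accept only an unexpired token whose signature and file id both match."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    header, payload, signature = parts
    expected = _sign(secret, header + "." + payload)
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    claims = json.loads(_b64url_decode(payload))
    return claims.get("fid") == fid and now < claims.get("exp", 0)


secret = b"shared-between-master-and-volume-servers"  # made-up secret
token = sign_token(secret, "3,01637037d6", now=1000)
print(verify_token(secret, token, "3,01637037d6", now=1005))
print(verify_token(secret, token, "3,01637037d6", now=1011))
```

With a shared secret, the volume server can check the token on its own, without calling back to the master for each upload.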
Secure Volume Server
● Mutual TLS
○ Secure master to volume server admin operations
● JWT
○ Secure object changes
Volume servers can be placed anywhere. Any server with some free space can be a volume server.
[Diagram: a master and three volume servers, with mutual TLS gRPC calls and JWT-authorized changes]
High Availability: Master Server
[Diagram: three masters and three volume servers in object storage]
● Multi-master cluster
● Leader election with the Raft consensus algorithm
High Availability: Filer Server
● Multiple stateless filer servers
● The shared filer store can be any HA storage solution.
[Diagram: three filers in file storage sharing one filer store (MySql, Postgres, Redis, Cassandra)]
Scalability: Filer
● Direct blob access.
● Filer store can be any proven store, and it is simple to add a new store:
○ Redis
○ MySql/Postgres
○ Cassandra
○ Interface for any key-value store
● Unlimited files under one directory.
● Blob storage supports multiple filers.
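To illustrate the "interface for any key-value store" idea, here is a hedged sketch of what such an interface might look like, with an in-memory implementation. The `FilerStore` method names and signatures are hypothetical and do not reproduce the project's actual interface. Paged listing is one way a directory can hold an unbounded number of files.

```python
import bisect
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class FilerStore(ABC):
    """Hypothetical minimal contract a backing store would need to satisfy."""

    @abstractmethod
    def insert(self, directory: str, name: str, meta: bytes) -> None: ...

    @abstractmethod
    def find(self, directory: str, name: str) -> Optional[bytes]: ...

    @abstractmethod
    def delete(self, directory: str, name: str) -> None: ...

    @abstractmethod
    def list_names(self, directory: str, start_after: str = "", limit: int = 100) -> List[str]: ...


class MemoryStore(FilerStore):
    def __init__(self) -> None:
        self._keys: List[Tuple[str, str]] = []  # kept sorted by (directory, name)
        self._meta: Dict[Tuple[str, str], bytes] = {}

    def insert(self, directory: str, name: str, meta: bytes) -> None:
        key = (directory, name)
        if key not in self._meta:
            bisect.insort(self._keys, key)
        self._meta[key] = meta

    def find(self, directory: str, name: str) -> Optional[bytes]:
        return self._meta.get((directory, name))

    def delete(self, directory: str, name: str) -> None:
        key = (directory, name)
        if self._meta.pop(key, None) is not None:
            self._keys.remove(key)

    def list_names(self, directory: str, start_after: str = "", limit: int = 100) -> List[str]:
        """Return one page of names, resuming after the last name already seen."""
        i = bisect.bisect_right(self._keys, (directory, start_after))
        page: List[str] = []
        while i < len(self._keys) and self._keys[i][0] == directory and len(page) < limit:
            page.append(self._keys[i][1])
            i += 1
        return page


demo = MemoryStore()
for n in range(5):
    demo.insert("/d", f"f{n}", b"meta")
print(demo.list_names("/d", limit=2), demo.list_names("/d", start_after="f1", limit=2))
```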
File Change Notification
● All filer change notifications can be sent to a message queue.
● Protobuf-encoded notifications.
● Cross-region replication is built on top of this.
[Diagram: the filer, backed by a filer store (Local, MySql, Postgres, Redis, Cassandra) for metadata, sends notifications to a message queue: Kafka, AWS SNS/SQS, Azure Service Bus, Google Pub/Sub, NATS, and RabbitMQ]
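The notification-driven replication above can be sketched with an in-process queue. Two substitutions are made to keep the example self-contained: JSON stands in for protobuf encoding, and `queue.Queue` stands in for a real message queue. The event shape (`path`, `old`, `new`) is an assumption.

```python
import json
import queue
from typing import Dict, Optional


def notify(q: "queue.Queue[str]", path: str, old: Optional[dict], new: Optional[dict]) -> None:
    """Publish one change event; new=None means the entry was deleted."""
    q.put(json.dumps({"path": path, "old": old, "new": new}))


def replicate(q: "queue.Queue[str]", remote: Dict[str, dict]) -> int:
    """Drain the queue and replay each change onto the remote metadata copy."""
    applied = 0
    while not q.empty():
        event = json.loads(q.get())
        if event["new"] is None:
            remote.pop(event["path"], None)
        else:
            remote[event["path"]] = event["new"]
        applied += 1
    return applied


events: "queue.Queue[str]" = queue.Queue()
remote_region: Dict[str, dict] = {}
notify(events, "/a/b/c/def.txt", None, {"size": 1024})
notify(events, "/a/b/c/old.txt", {"size": 10}, None)
print(replicate(events, remote_region), remote_region)
```

The filer only publishes events, so a replicator can be added, removed, or pointed at another region without changing the filer itself.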
Atomicity
Operation | Atomicity | Note
Creating a file | Yes |
Deleting a file | Yes |
Renaming a file | Yes with mysql/postgres. No with redis/leveldb/cassandra. | Implemented via database transactions.
Renaming a directory | Yes with mysql/postgres. No with redis/leveldb/cassandra. | Implemented via database transactions.
Creating a single directory with mkdir() | Yes |
Recursive directory deletion | No |
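A sketch of why a transactional store makes rename atomic: the delete of the old name and the insert of the new name either both commit or both roll back. SQLite is used here only as a runnable stand-in for mysql/postgres, and the `filemeta` table layout is an assumption.

```python
import sqlite3
from typing import List


def make_store() -> sqlite3.Connection:
    """Create an in-memory metadata table with two sample entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filemeta ("
        " directory TEXT, name TEXT, meta TEXT,"
        " PRIMARY KEY (directory, name))"
    )
    conn.executemany(
        "INSERT INTO filemeta VALUES (?, ?, ?)",
        [("/a/b", "old.txt", "meta-1"), ("/a/b", "taken.txt", "meta-2")],
    )
    conn.commit()
    return conn


def rename(conn: sqlite3.Connection, directory: str, old: str, new: str) -> bool:
    """Move an entry to a new name inside one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            row = conn.execute(
                "SELECT meta FROM filemeta WHERE directory = ? AND name = ?",
                (directory, old),
            ).fetchone()
            if row is None:
                return False
            conn.execute(
                "DELETE FROM filemeta WHERE directory = ? AND name = ?",
                (directory, old),
            )
            conn.execute(
                "INSERT INTO filemeta VALUES (?, ?, ?)", (directory, new, row[0])
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the new name already exists; the delete was rolled back


def names(conn: sqlite3.Connection, directory: str) -> List[str]:
    rows = conn.execute(
        "SELECT name FROM filemeta WHERE directory = ? ORDER BY name", (directory,)
    )
    return [row[0] for row in rows]


db = make_store()
print(rename(db, "/a/b", "old.txt", "taken.txt"), names(db, "/a/b"))
print(rename(db, "/a/b", "old.txt", "new.txt"), names(db, "/a/b"))
```

In this toy version a rename onto an existing name fails instead of overwriting, which makes the rollback visible: the old entry is still present after the failed attempt.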
Comparing to HDFS
Aspect | HDFS | SeaweedFS
File Metadata Storage | Single namenode | Multiple stateless filers with proven scalable filer store, redis/cassandra/etc.
Storing small files | Not recommended. | Optimized for small files.
Parallel data access | Yes | Yes
Hadoop Compatible | Yes | Yes. (Atomic rename via database transactions.)
Comparing to CEPH
Aspect | CEPH | SeaweedFS
Data Placement | CRUSH maps of the whole cluster, rather complicated, especially when adding storage. Calculated for each object. | Volume-level placement, amortized for each object.
Storing small files | Not optimized. | Optimized for small files.
Scaling file system metadata | MDS dynamically partitions subtrees | Flat and linearly scalable.
Easy to set up | Mixed reviews | Yes
Design Philosophy
● Scale up each layer independently.
● Batch small files
○ Data placement (CEPH file-level, SeaweedFS volume-level)
○ Tracking (HDFS namenode tracks blocks, SeaweedFS tracks volume locations)
○ Easy move/delete/replicate operation.
Open APIs
● gRPC APIs for admin operations
● HTTP APIs for uploading and serving blobs
● gRPC for filer metadata operations
● Protocol buffer defined metadata
Future Plan
● Volume Server
○ Async Replica
○ Erasure Coding
○ Tiered Storage
● Integration
○ CSI, docker volume plugin
○ Kerberos
● Tools
○ Auto Balance
Open APIs for possible extensions
● Build a different filer with striping.
● Build a different replication
● Admin tools
● Custom Encryption
● Async Operations
○ Search
○ Secondary index
● Local cache for cloud files
● CDN
