Using ZFS file system with MySQL

MySQL on ZFS
Bajrang Panigrahi
August, 2019

ZFS Principles
● Pooled storage: completely eliminates the antique notion of volumes; does for storage what VM did for memory
● Transactional object system: always consistent on disk (no fsck, ever)
● Provable end-to-end data integrity: detects and corrects silent data corruption
● Simple administration: concisely express your intent

FS/Volume Model vs Pooled Storage
Traditional Volumes
● Abstraction: virtual disk
● Partition/volume for each FS
● Grow/shrink by hand
● Each FS has limited bandwidth
● Storage is fragmented, stranded
ZFS Pooled Storage
● Abstraction: malloc/free
● No partitions to manage
● Grow/shrink automatically
● All bandwidth always available
● All storage in the pool is shared
[Figure: three file systems, each on its own volume, versus three ZFS file systems sharing one storage pool]

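[Figure: traditional storage stack versus ZFS stack]
Traditional stack
● NFS, SMB, local files
● VFS
● Filesystem (e.g. UFS, ext3), reached through a file interface
● Volume Manager (e.g. LVM, SVM), reached through a block interface
ZFS stack
● NFS, SMB and local files go through VFS to the ZPL (ZFS POSIX Layer); iSCSI and FC go through the SCSI target to a ZVOL (ZFS Volume)
● DMU (Data Management Unit): atomic transactions on objects
● SPA (Storage Pool Allocator): block allocate+write, read, free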

Benefits of ZFS
● Copy-on-Write (CoW) File System.
● Throttles writes.
● Data integrity and resiliency.
● Self Healing of Data on ZFS.
● Block size matching (allows variable block sizes).
● Snapshots & Clones
● Active development community

Copy-On-Write Transactions
1. Initial block tree
2. COW some blocks
3. COW indirect blocks
4. Rewrite uberblock (atomic)
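
The four steps above can be sketched in a few lines of Python. This is only an illustrative model, not how ZFS is implemented: the `Leaf`, `Indirect`, and `Pool` classes and the `cow_write` helper are hypothetical names, and the "uberblock" is reduced to a single reference to the current root.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """A data block (immutable once written)."""
    data: bytes


@dataclass(frozen=True)
class Indirect:
    """An indirect block holding pointers to child blocks."""
    children: Tuple["Node", ...]


Node = Union[Leaf, Indirect]


def cow_write(node: Node, path: Tuple[int, ...], data: bytes) -> Node:
    """Return a new tree with the leaf at `path` replaced.

    Existing blocks are never modified: a new leaf is allocated (step 2)
    and every indirect block on the path is copied (step 3).
    """
    if not path:
        return Leaf(data)
    idx = path[0]
    new_child = cow_write(node.children[idx], path[1:], data)
    children = node.children[:idx] + (new_child,) + node.children[idx + 1:]
    return Indirect(children)


class Pool:
    """Toy pool whose 'uberblock' is just a reference to the current root."""

    def __init__(self, root: Node):
        self.uberblock = root

    def write(self, path: Tuple[int, ...], data: bytes) -> None:
        new_root = cow_write(self.uberblock, path, data)
        # Step 4: a single reference swap commits the whole change.
        self.uberblock = new_root


# Step 1: initial block tree with four data blocks.
tree = Indirect((
    Indirect((Leaf(b"a"), Leaf(b"b"))),
    Indirect((Leaf(b"c"), Leaf(b"d"))),
))
pool = Pool(tree)
snapshot = pool.uberblock      # keeping the old root is a free snapshot
pool.write((0, 1), b"B")
```

Because the old root is still intact after the write, holding on to it costs nothing, which is the intuition behind cheap snapshots and clones.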

Block Pointer Structure in ZFS
[Figure: layout of a ZFS block pointer]
● Three (vdev, offset, ASIZE) entries: vdev1/offset1 is the first copy of data, vdev2/offset2 the second copy (for metadata), vdev3/offset3 the third copy (pool-wide metadata)
● Property fields: B D X, lvl, type, cksum, comp, PSIZE, LSIZE
● padding
● physical birth txg
● logical birth txg: when the block was written
● fill count
● 256-bit checksum: checksum of the data this block points to

End-to-End Data Integrity in ZFS
Disk Block Checksums
● Checksum stored with data block
● Any self-consistent block will pass
● Can't detect stray writes
● Inherent FS/volume interface limitation
Disk checksum only validates media
✓ Bit rot
✗ Phantom writes
✗ Misdirected reads and writes
✗ DMA parity errors
✗ Driver bugs
✗ Accidental overwrite
ZFS Data Authentication
● Checksum stored in parent block pointer
● Fault isolation between data and checksum
● Checksum hierarchy forms self-validating Merkle tree
ZFS validates the entire I/O path
✓ Bit rot
✓ Phantom writes
✓ Misdirected reads and writes
✓ DMA parity errors
✓ Driver bugs
✓ Accidental overwrite
[Figure: disk blocks store each checksum next to its data; ZFS block pointers store the address and checksum of the child block]
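
The difference between the two checksum placements can be shown with a small Python model. It is a sketch under stated assumptions: the `Disk` class and the four helper functions are hypothetical, and SHA-256 stands in for whichever checksum algorithm the pool is configured with.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Disk:
    """Hypothetical block device: address -> raw bytes."""

    def __init__(self):
        self.blocks = {}


# Disk-style checksum: stored in the same block as the data.
def write_inline(disk: Disk, addr: int, data: bytes) -> None:
    disk.blocks[addr] = data + sha256(data)


def read_inline(disk: Disk, addr: int) -> bytes:
    raw = disk.blocks[addr]
    data, cksum = raw[:-32], raw[-32:]
    if sha256(data) != cksum:
        raise IOError("checksum mismatch")
    return data


# ZFS-style: the parent pointer holds (address, checksum of the child).
def write_with_pointer(disk: Disk, addr: int, data: bytes):
    disk.blocks[addr] = data
    return (addr, sha256(data))


def read_with_pointer(disk: Disk, pointer) -> bytes:
    addr, expected = pointer
    data = disk.blocks[addr]
    if sha256(data) != expected:
        raise IOError("checksum mismatch")
    return data


# Simulate a misdirected write: block 2's contents land on address 1.
inline = Disk()
write_inline(inline, 1, b"accounts")
write_inline(inline, 2, b"orders")
inline.blocks[1] = inline.blocks[2]
silently_wrong = read_inline(inline, 1)   # self-consistent, so it passes

zdisk = Disk()
ptr1 = write_with_pointer(zdisk, 1, b"accounts")
write_with_pointer(zdisk, 2, b"orders")
zdisk.blocks[1] = zdisk.blocks[2]
try:
    read_with_pointer(zdisk, ptr1)
    detected = False
except IOError:
    detected = True                        # parent checksum catches it
```

The inline checksum accepts the stray block because the block is internally consistent, while the parent-held checksum rejects it because it describes what should be at that address.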

Self Healing of Data in ZFS
[Figure: an application reading through a ZFS mirror, shown in three steps]
1. Application issues a read. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the next disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.
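
The read path described in these three steps might look roughly like the following Python model. The `Mirror` class is hypothetical and keeps each disk as a plain dictionary; SHA-256 is an assumed checksum, and real ZFS does considerably more (for example, reporting the error in `zpool status`).

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy mirror: every disk is a dict of address -> bytes."""

    def __init__(self, n_disks: int = 2):
        self.disks = [dict() for _ in range(n_disks)]

    def write(self, addr: int, data: bytes):
        for disk in self.disks:
            disk[addr] = data
        # The caller (the parent block) keeps the address and checksum.
        return (addr, checksum(data))

    def read(self, pointer) -> bytes:
        addr, expected = pointer
        damaged = []
        for disk in self.disks:
            data = disk.get(addr)
            if data is not None and checksum(data) == expected:
                # Step 3: repair every copy that failed, then return.
                for bad_disk in damaged:
                    bad_disk[addr] = data
                return data
            # Steps 1 and 2: this copy is corrupt, try the next disk.
            damaged.append(disk)
        raise IOError("no valid copy of the block on any disk")


mirror = Mirror()
ptr = mirror.write(7, b"row data")
mirror.disks[0][7] = b"row dat\x00"      # silent corruption on disk 0
result = mirror.read(ptr)
```

After the read, the application has the good data and the first disk holds a repaired copy.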

Initial Use Case at Zenefits
We use AWS snapshots to rebuild a new DB for dev/ops. The first access to the data is slow because “New volumes created from existing EBS snapshots load lazily in the background”.
Data from multiple DB clusters is needed to generate the DB for dev/ops, so we use Multi-Source Replication.

Alternatives
● Attach multiple EBS volumes to a MySQL slave and rotate them on each fresh snapshot request.
Con: Needs additional EBS volumes, and queries still suffer from the slow initial load (a snapshot is taken every 15 minutes).
● Use Percona XtraBackup to copy data incrementally to the Spoof instance.
Con: Requires an additional EBS volume, and the MySQL service must be shut down for the entire time the backup is being restored.
● Use the ZFS file system to take snapshots at the file system level.

Setting up ZFS on MySQL
● Create a pool named “zp1”
zpool create -f -o ashift=12 -o autoexpand=on -O compression=gzip zp1 mirror /dev/xvdm /dev/xvdn
● Create a new filesystem named “mysql” in pool “zp1”
# Create the ZFS filesystems
- name: Create a new file system called mysql in pool zp1
  zfs:
    name: zp1/mysql
    state: present
    extra_zfs_properties:
      setuid: 'off'
      compression: gzip
      recordsize: 128k
      atime: 'off'
      primarycache: metadata

Setting up ZFS on MySQL
● Create the required datasets to run MySQL
NAME            USED   AVAIL  REFER  MOUNTPOINT
zp1/mysql       1.19T  4.92T  100K   /zp1/mysql
zp1/mysql/data  1.18T  4.92T  1.17T  /data2/data
zp1/mysql/logs  9.97G  4.92T  8.84G  /data2/logs
zp1/mysql/tmp   216K   4.92T  152K   /data2/tmp
● Configurations on MySQL
innodb_doublewrite = 0
innodb_checksum_algorithm = none
innodb_use_native_aio = 0
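
A quick way to confirm that a configuration file carries these three settings is sketched below. The `missing_settings` helper and the sample text are hypothetical, and the check assumes the options live in a `[mysqld]` section written with underscores.

```python
import configparser

# The three settings listed on the slide.
ZFS_SETTINGS = {
    "innodb_doublewrite": "0",
    "innodb_checksum_algorithm": "none",
    "innodb_use_native_aio": "0",
}


def missing_settings(cnf_text: str) -> dict:
    """Return the settings that are absent or differ in [mysqld]."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    mysqld = parser["mysqld"] if parser.has_section("mysqld") else {}
    return {
        key: wanted
        for key, wanted in ZFS_SETTINGS.items()
        if mysqld.get(key) != wanted
    }


SAMPLE = """
[mysqld]
innodb_doublewrite = 0
innodb_checksum_algorithm = none
"""

problems = missing_settings(SAMPLE)
```

Here `problems` lists `innodb_use_native_aio` because the sample omits it; an empty dictionary would mean all three settings are in place.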

ZPOOL Status
● ZPOOL status
zpool status
  pool: zp1
 state: ONLINE
  scan: none requested
config:
    NAME        STATE   READ  WRITE  CKSUM
    zp1         ONLINE     0      0      0
      mirror-0  ONLINE     0      0      0
        xvdm    ONLINE     0      0      0
        xvdn    ONLINE     0      0      0
errors: No known data errors

ZFS List
NAME               USED   AVAIL  REFER  MOUNTPOINT
zp1                1.20T  4.92T  104K   /zp1
zp1/mslave03       1.11G  4.92T  100K   /zp1/mslave03
zp1/mslave03/data  1.11G  4.92T  1.17T  /data3/data
zp1/mslave03/logs  308K   4.92T  340K   /data3/logs
zp1/mslave03/tmp   96K    4.92T  128K   /data3/tmp
zp1/mslave04       686M   4.92T  100K   /zp1/mslave04
zp1/mslave04/data  686M   4.92T  1.17T  /data4/data
zp1/mslave04/logs  300K   4.92T  332K   /data4/logs
zp1/mslave04/tmp   96K    4.92T  128K   /data4/tmp
zp1/mysql          1.19T  4.92T  100K   /zp1/mysql
zp1/mysql/data     1.18T  4.92T  1.17T  /data2/data
zp1/mysql/logs     10.2G  4.92T  8.78G  /data2/logs
zp1/mysql/tmp      216K   4.92T  152K   /data2/tmp
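
In this listing, the `mslave03` and `mslave04` data sets refer to 1.17T of data while using only about 1G of their own space, which is what one would expect from clones that share blocks with their origin. The Python sketch below picks out such data sets from listing text. The `shared_datasets` helper and its 10x threshold are assumptions for illustration; checking the `origin` property with `zfs get origin` is the reliable way to identify a clone.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text: str) -> float:
    """Convert a human-readable size such as '1.17T' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def shared_datasets(listing: str, ratio: float = 10.0) -> list:
    """Names of data sets whose REFER exceeds `ratio` times their USED."""
    names = []
    for line in listing.strip().splitlines()[1:]:   # skip the header row
        name, used, _avail, refer, _mountpoint = line.split()
        if parse_size(refer) > ratio * parse_size(used):
            names.append(name)
    return names


# Three rows copied from the listing above.
LISTING = """
NAME USED AVAIL REFER MOUNTPOINT
zp1/mslave03/data 1.11G 4.92T 1.17T /data3/data
zp1/mslave04/data 686M 4.92T 1.17T /data4/data
zp1/mysql/data 1.18T 4.92T 1.17T /data2/data
"""

candidates = shared_datasets(LISTING)
```

Only the two `mslave` data sets are flagged; `zp1/mysql/data` owns nearly all the space it refers to.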

Incremental Send and Receive
Full send of a snapshot:
zfs send zp1/mysql/data@monday | ssh host zfs receive zp1/recvd/fs
Incremental send from @monday (“FromSnap”) to @tuesday (“ToSnap”):
zfs send -i @monday zp1/mysql/data@tuesday | ssh ..
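
When the send and receive steps are scripted, the two command lines can be assembled programmatically. The sketch below is one possible shape, not the deck's tooling: `send_cmd`, `receive_cmd`, and `replicate` are hypothetical helpers, and `replicate` is only defined here, since running it needs ZFS on both ends and SSH access to the target host.

```python
import subprocess
from typing import List, Optional


def send_cmd(dataset: str, to_snap: str,
             from_snap: Optional[str] = None) -> List[str]:
    """Build a `zfs send` command; incremental when from_snap is given."""
    cmd = ["zfs", "send"]
    if from_snap is not None:
        cmd += ["-i", f"@{from_snap}"]
    cmd.append(f"{dataset}@{to_snap}")
    return cmd


def receive_cmd(host: str, target: str) -> List[str]:
    """Build the remote `zfs receive` command, run over ssh."""
    return ["ssh", host, "zfs", "receive", target]


def replicate(dataset: str, to_snap: str, host: str, target: str,
              from_snap: Optional[str] = None) -> None:
    """Pipe `zfs send` into a remote `zfs receive` (not run in this sketch)."""
    sender = subprocess.Popen(send_cmd(dataset, to_snap, from_snap),
                              stdout=subprocess.PIPE)
    receiver = subprocess.Popen(receive_cmd(host, target),
                                stdin=sender.stdout)
    sender.stdout.close()   # let the sender see a broken pipe if ssh exits
    if receiver.wait() != 0 or sender.wait() != 0:
        raise RuntimeError("zfs send/receive failed")


full = send_cmd("zp1/mysql/data", "monday")
incremental = send_cmd("zp1/mysql/data", "tuesday", from_snap="monday")
remote = receive_cmd("host", "zp1/recvd/fs")
```

The incremental form transfers only the blocks that changed between the two snapshots, so the target must already hold the "FromSnap" snapshot.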

ZFS - usage metrics
KEY                       Old_ENV              New_ENV
Performance - Page Load   2-3 minutes          ~15 secs
Faster Data Snapshots     15 minutes           ~2 - 4 secs
Cloning / EBS attachment  > 20 minutes         ~3 - 5 secs
Costs                     Higher*              Lower
Monitoring / Alerting     only Slack messages  Jenkins + PagerDuty

ZFS - Performance Benchmarking

ZFS - Challenges
● Fragmentation.
● Complex to tweak and tune.
● Requires extra free space or pool performance can suffer.

Further ...
● High read throughput (>= 83.88 million)
● MySQL / sec up to 76.2 K
● InnoDB file I/O writes up to 150K
● Enterprise-grade transactional file system.
● Automatically reconstructs data after detecting an error.
● Combines multiple physical media devices into one logical volume using a zpool.
● Snapshot and mirroring capabilities, plus fast data compression (LZ4).
Enjoy a user-friendly, high-volume storage system.
