ZFS¶
Note
This article focuses on ZFS on Linux (ZoL) as iBug only uses Linux.
Basic tuning¶
xattr¶
By default, xattr=on, so ZFS stores extended attributes in a separate file (which is not exposed), thus requiring multiple disk seeks and reads to access them. xattr=sa makes ZFS store them in the file's inode instead, significantly improving performance.
Do note that this setting is specific to ZoL. Other implementations (OpenZFS, FreeBSD etc.) may allow you to set xattr=sa but actually behave as if it's xattr=on. Consequently, extended attributes saved from ZoL with xattr=sa are not readable elsewhere. Make sure you're not planning to import the pool on non-Linux systems.
On workloads that do not use extended attributes, consider using xattr=off instead.
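A quick sketch of applying these settings (the dataset name tank/data is an example):

```sh
# Store extended attributes in inodes (Linux-specific behavior, see above):
zfs set xattr=sa tank/data

# Or turn xattr support off entirely for workloads that never use them:
zfs set xattr=off tank/data
```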
Module parameters¶
Most parameters can be changed dynamically by writing to /sys/module/zfs/parameters/<parameter>. However, some parameters either:

- can only be set at module load time (via /etc/modprobe.d/zfs.conf with the appropriate syntax), or
- can be set at any time, but only take effect on newly imported pools.
Common parameters:
zfs_arc_min, zfs_arc_max: as obvious as they sound, the minimum and maximum size of the ARC, in bytes.
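For example, both ways of setting zfs_arc_max (the 16 GiB value is just an illustration):

```sh
# Runtime change, takes effect immediately:
echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Persistent change, applied at module load time;
# put this line in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=17179869184
```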
Blocks¶
Logical blocks¶
ZFS handles data in logical blocks. A logical block is the smallest unit of data that ZFS can read or write.
For ZFS filesystems, the recordsize property defaults to 128 KiB, which is the maximum block size. The actual block size is determined by the file size and is always a multiple of 512 bytes (ashift matters too, but at a later stage).
A file smaller than recordsize is stored as a single block just large enough to hold it. Files larger than recordsize are broken up into multiple blocks of recordsize each, and the last block is no exception: it is padded up to a full recordsize (a prime source of read/write amplification in ZFS). For example, a 1 MiB file on a filesystem with recordsize=128K will be stored as 8 blocks of 128 KiB each, and a 129 KiB file will be stored as 2 blocks of 128 KiB each.
The logical block is also the unit of compression. If compression is enabled, each block is compressed individually; this allows the last block of a large file to take up only as much physical space as needed (its logical size, however, remains recordsize).
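For illustration, recordsize can be tuned per dataset; it only affects newly written files (dataset names and values below are examples):

```sh
# Larger records suit sequential workloads such as media files:
zfs set recordsize=1M tank/media

# Smaller records can match a database's page size (e.g. 16 KiB for InnoDB):
zfs set recordsize=16K tank/db
```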
Physical blocks¶
Physical blocks are what is actually stored on the disk. The physical block size is determined by the ashift property of the vdev, which is set when the vdev is created and cannot be changed afterwards. Usually ZFS is smart enough to determine the most appropriate ashift for your disks (via ioctl(BLKSSZGET) / ioctl(BLKPBSZGET)), but sometimes (notably with SSDs) you may need to set it manually according to some obscure specifications.
The physical block is also the unit of redundancy in RAIDZ vdevs. For example, in RAIDZ1, each physical block of a stripe is stored on a different disk, and the parity block (the same size as the data blocks) is stored on yet another disk.
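A sketch of setting ashift manually (pool layout and device names are examples; ashift=12 means 2^12 = 4096-byte physical blocks):

```sh
# Create a pool with 4 KiB physical blocks:
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# The same option applies when adding a new vdev later:
zpool add -o ashift=12 tank mirror /dev/sdc /dev/sdd

# Verify the pool-wide value:
zpool get ashift tank
```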
RAIDZ¶
See the first article in the References section below.
Accounting¶
TBD
References¶
- ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ
- ZFS Features & Concepts TOI (Slides) by Andreas Dilger
Caching¶
ZFS has a built-in caching mechanism called the "Adaptive Replacement Cache" (ARC), which uses lots of memory to provide high performance and to offset some inherent issues like fragmentation. ZFS benefits a lot from memory caching. By default, zfs_arc_max is half of the total RAM. If you have a dedicated storage server or otherwise have spare RAM, be sure to increase it as needed.
ARC and L2ARC hold blocks in their on-disk form, so blocks from compressed datasets remain compressed in the cache as well.
L2ARC¶
L2ARC can be added with an SSD via zpool add <pool> cache /dev/nvme0n1, and helps offload the main disk array under high load. L2ARC works by storing data evicted from the main ARC, so it needs warm-up time and sufficient disk load over a large working set to work efficiently. Workloads like databases are more likely to benefit from a pure-SSD array instead.
L2ARC is a read-only cache, so any failure of the cache device will not result in data loss, although brief service degradation is to be anticipated. The write-side counterpart is the "ZFS Intent Log" (ZIL); a dedicated device holding it is called an SLOG. Conversely, failure of an SLOG device can result in the loss of recent synchronous writes, which is why the usual advice is to mirror SLOG devices and to avoid using the same device for both L2ARC and SLOG.
In general, L2ARC is ineffective unless your working set exceeds the size of ARC while still being too hot for the rotational disks to handle, as on high-traffic file download servers. Considering that ZFS needs around 4 GB of RAM per TB of L2ARC to function efficiently, you probably don't need it for small- to mid-scale storage servers; repurpose your SSD budget for some extra RAM instead.
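A sketch of attaching both kinds of auxiliary devices (pool and device names are examples):

```sh
# Attach an L2ARC device:
zpool add tank cache /dev/nvme0n1

# Attach a mirrored SLOG, per the advice above:
zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1

# Both kinds can be removed online if needed:
zpool remove tank /dev/nvme0n1
```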
Encryption¶
An encrypted dataset can be created with encryption=on (or any of the encryption modes listed in zfsprops.7), but encryption cannot be changed after creation. The keylocation property determines where the encryption key comes from, and defaults to prompt (i.e. ask for the passphrase on import).
The key of an encrypted dataset can be loaded/unloaded at any time with zfs load-key and zfs unload-key. Key availability can be checked with zfs get keystatus, which shows either available or unavailable. If the key is unavailable, the dataset cannot be mounted.
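A minimal sketch of both key formats (dataset names and the key file path are examples):

```sh
# Passphrase-based encryption; the passphrase is prompted for on creation:
zfs create -o encryption=on -o keyformat=passphrase tank/secret

# Alternatively, use a raw 32-byte key stored in a file:
dd if=/dev/urandom of=/root/zfs.key bs=32 count=1
zfs create -o encryption=on -o keyformat=raw \
    -o keylocation=file:///root/zfs.key tank/secret2

# Manage key state later:
zfs load-key tank/secret
zfs get keystatus tank/secret
zfs unmount tank/secret && zfs unload-key tank/secret
```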
Snapshots¶
ZFS is a copy-on-write filesystem, so snapshots are cheap and instantaneous.
To list snapshots, use zfs list -t snapshot. Snapshots of the same dataset are listed in "ZFS-chronological" order, i.e. in order of creation, as determined by createtxg (creation transaction group).
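For instance (dataset and snapshot names are examples):

```sh
# Take a snapshot, then list all snapshots of the dataset:
zfs snapshot tank/data@before-upgrade
zfs list -r -t snapshot -o name,creation,used tank/data

# Sort explicitly by creation transaction group:
zfs list -r -t snapshot -s createtxg tank/data
```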
Compression¶
When any setting except off is selected, compression will explicitly check for blocks consisting of only zeroes (the NUL byte). When a zero-filled block is detected, it is stored as a hole and not compressed using the indicated compression algorithm.
You should never use compression=off. At a bare minimum, use zle (zero-length encoding), which is effectively "no compression except for zeroes". With modern CPUs you should prefer lz4 or zstd.
Note that compressing zeroes and omitting entire blocks are two different things: compressed data still takes up (negligible) space, while all-zero blocks are not stored at all. Zero blocks are implied by missing L0 direct blocks in the file's block table.
A block is only stored compressed if it shrinks to no more than 7/8 of its original size; otherwise the compression is not considered worthwhile and the block is saved uncompressed.
Combined with vdev ashift, small recordsize and volblocksize benefit less from compression.
For example, 8 KiB blocks1 on disks with 4 KiB disk sectors must compress to 1/2 or less of their original size.
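A quick sketch of enabling compression (dataset name is an example; only newly written data is affected):

```sh
# Prefer lz4 or zstd on modern CPUs:
zfs set compression=zstd tank/data

# Check the achieved ratio on existing data:
zfs get compression,compressratio tank/data
```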
Extreme compression¶
For highly compressible data (compressing to 112 bytes or less), ZFS may store the data inside the block pointer itself, in the dnode, with absolutely no data blocks; this is called an embedded block pointer2. It shows up as EMBEDDED in zdb output, meaning this 2018 blog from Chris Siebenmann is no longer reproducible.
Permission management¶
zfs allow / zfs unallow can be used to delegate permissions to non-privileged users and groups.
For example, zfs allow -u alice create,mount,destroy tank allows Alice to create, mount and destroy datasets under tank. zfs allow -u alice snapshot tank allows Alice to create snapshots of tank.
Permission sets are in the form of @name. These sets can then be delegated to users and even included in other sets. Sets are evaluated dynamically, so changes to a set propagate to all delegations immediately.
Due to Linux's limitations on mounting, the commands and properties mount, unmount, mountpoint, canmount, rename, and share can be delegated to normal users, but will not have any effect.
To list delegated permissions and permission sets, use zfs allow <dataset>.
See zfs-allow.8 for details.
On Linux, non-privileged users cannot mount datasets even with zfs allow
This is a known issue with ZoL. See openzfs/zfs#10648 for details.
iBug's permission sets
Base command: zfs allow -s @setname perm,perm,... dataset
| Set | Permissions |
|---|---|
| @commands | clone,create,destroy,diff,hold,load-key,change-key,mount,promote,receive,release,rename,rollback,send,share,snapshot |
| @userquota | userquota,userobjquota,userused,userobjused |
| @groupquota | groupquota,groupobjquota,groupused,groupobjused |
| @projectquota | projectquota,projectobjquota,projectused,projectobjused |
| @quota | @userquota,@groupquota,@projectquota |
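Below is a sketch of defining and delegating these sets (the dataset tank and user alice are examples, and the @commands list is abbreviated from the table above):

```sh
# Define sets (one per table row):
zfs allow -s @commands create,destroy,mount,snapshot,rollback,send,receive tank
zfs allow -s @userquota userquota,userobjquota,userused,userobjused tank
zfs allow -s @quota @userquota,@groupquota,@projectquota tank

# Delegate the sets to a user:
zfs allow -u alice @commands,@quota tank

# Review the result:
zfs allow tank
```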
Known issues¶
Compression with very low quota yields terrible write performance and I/O blocking
When writing large files onto filesystems with a low quota, each block must be processed sequentially to verify that it fits within the quota, causing very long processing times. The typical symptom is txg_sync stuck in D state while still showing non-trivial CPU utilization, along with one or two z_wr_iss processes in a similar situation.
Increasing (or removing) the quota solves the problem, and you may reinstate the quota after the write completes should you wish.
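A sketch of that workaround (dataset name and quota value are examples):

```sh
# Lift the quota before the bulk write:
zfs set quota=none tank/project

# ... perform the large write, then reinstate it:
zfs set quota=100G tank/project
```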
Extra reading¶
- Chris Siebenmann's blog: This guy writes a lot of detailed Linux sysadmin posts, notably on storage and ZFS.
- OpenZFS documentation including man pages.
- recordsize is the maximum block size for filesystems, while volblocksize is the (only) block size for volumes. Files smaller than recordsize are stored as a single block, while larger files are broken up into multiple recordsize-sized blocks. This information is not given explicitly in any single man page, but inferred and synthesized from several. ↩
- Chris Siebenmann (2018) What ZFS block pointers are and what's in them ↩