
Why I Use ZFS, Possibly the Best Filesystem Ever Devised

ZFS is a next-generation filesystem that offers many advantages over traditional filesystems like NTFS or ext4.

Data Integrity

ZFS's first mission was to preserve the integrity of your data. Unlike a traditional filesystem, which only checksums the file metadata, ZFS stores checksums of all your data. If, while serving files to your applications, it reads data that doesn't match its checksum, and you have redundant copies (whether outright through mirroring, or through computation via parity), it will automatically find a copy of the data that does match its checksum, serve that data to the application, and repair the incorrect copy.
Additionally, you can initiate scrub operations: ZFS scans all the data and verifies the checksums. If any data fails its checksum, it will again try to restore the correct data from the redundant copies, and where it can, it will tell you which files have errors.
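Starting a scrub is a single command, and the status output reports progress and any errors found; assuming the pool is named tank:
zpool scrub tank
zpool status -v tank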

Volume Manager

No more LVM. To create a RAID5-equivalent array (ZFS calls it RAIDZ) called tank, all you have to do is
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
(though you generally don't want to use /dev/sdX - use /dev/disk/by-id/ instead)
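The same command with by-id paths looks like this (the device IDs here are made up; list yours with ls -l /dev/disk/by-id/):
zpool create tank raidz /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4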

No RAID5 Write Hole

These guys explain this better than I can. But in short, it's a low-probability problem in traditional RAID5 that can result in data loss.

Copy on Write (CoW)

When you modify a block of a file, ZFS doesn't modify that block in place. Instead, it copies the block to a new location, modifies the new copy, and only once the write is complete does it switch the pointer from the old block to the new one.

In a traditional filesystem, if there is a power outage or system failure during a file write, you're left with a file that is half new and half old. If the file is plain text, you'll be able to recover something. If it's binary, the file is corrupted.
But with a CoW filesystem like ZFS, you'll always have coherent data: you get either the old block or the new one, never a half-written mix.

Efficient Snapshots

ZFS can take snapshots, a read-only copy of the filesystem at the time of the snapshot. Initially the snapshot takes no space whatsoever. As files are modified via CoW, the old blocks, instead of being freed, are kept around and assigned to the snapshot. This way only the differences between snapshots are stored.
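Taking one is a single command; a minimal sketch with a made-up dataset and snapshot name:
zfs snapshot tank/Photos@2018-01-01
zfs list -t snapshot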

I take snapshots of my filesystems daily, and have snapshots going back several months (I grandfather them out, e.g. I keep 30 daily snapshots, 12 weekly snapshots, etc).

These snapshots are also fairly effective in preventing ransomware since they truly are read only (but more on this in a separate post).

Clones

Clones are writeable snapshots. This is great for testing purposes. You do have to be careful though - a clone is made from a snapshot, and that snapshot can never be deleted until the clone is deleted, even if the clone has fully diverged from the snapshot.

This is much easier to explain with a purposely contrived example.
Suppose you write 100 MB of random data to a ZFS filesystem. You take a snapshot, make a clone from it, and overwrite the clone's data with 100 MB of new random data. The clone now takes 100 MB of space. Finally, you overwrite the filesystem itself with 100 MB of random data. The snapshot now takes 100 MB of space (to store the original data), the filesystem itself takes 100 MB, and the clone takes 100 MB: 300 MB on disk for 200 MB of live data.

Don't create clones if the files will persist and change for a long period of time.
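For reference, the commands look like this (dataset and snapshot names are made up):
zfs snapshot tank/data@base
zfs clone tank/data@base tank/data-clone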

Forget Partitions, it's all Dynamic

With ZFS you don't create partitions. You create datasets. So for example, if you have a pool named tank, you might organize it into

  • tank/My Documents
  • tank/Photos
  • tank/Movies

Unlike partitions, you don't have to preallocate space to each dataset. All the space in the zpool is available to all datasets. However, if you wish, you can set reservations and quotas to guarantee a dataset a certain amount of space, or to prevent a dataset from taking too much space, respectively.
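Both are one-liners; a sketch with made-up sizes:
zfs create tank/Photos
zfs create tank/Movies
zfs set reservation=100G tank/Photos
zfs set quota=2T tank/Movies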

Optimize Dataset Properties to Your Application

The above explanation of datasets might have seemed a little strange: why go through the process of creating datasets when you could just create folders? Mostly because each dataset can have its own properties.

A property that can have a large impact on performance is recordsize, which controls the largest block size ZFS will use for writing. Suppose you have a database that uses 8k records and you use ZFS for storing the database files. If you leave ZFS at the default 128k recordsize, then when your database writes a record, it will most likely have to do a read-modify-write ("most likely" because it is possible for the write to go to an entirely new block): to write 8k, ZFS needs to read in a 128k block, modify 8k of it, and write the whole block back out. This read-modify-write can kill performance.
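The fix is to match recordsize to the database's record size; a sketch with a made-up dataset for the database files (note that changing recordsize on an existing dataset only affects files written afterwards):
zfs create -o recordsize=8k tank/db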

(Incremental) Send Receive

ZFS makes it incredibly easy to make backups. As mentioned previously, you can take snapshots, so you can always get back an old version of a file. ZFS has a built-in feature to send snapshots of your datasets through standard operating system pipes (so you can pipe them through SSH, for example). But most critically, it can send incrementals: the difference in the dataset between two snapshots. And unlike rsync, which needs to compare the files between the source and destination, a ZFS incremental send needs no back-and-forth between the two systems, because ZFS knows exactly which blocks have changed between the two snapshots. This leads to two important properties:

  • ZFS send/receive is insensitive to network latency
  • Backups to tape are possible

ZFS is also capable of recognizing renamed files and just renaming the file on the destination, as opposed to deleting the file under the old name and re-sending it under the new name. ZFS send/receive is the only backup system I've found that is able to do this.
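A sketch of a full send followed by an incremental, with made-up snapshot names and a backup host I'm calling backuphost:
zfs send tank/Photos@monday | ssh backuphost zfs receive backuppool/Photos
zfs send -i tank/Photos@monday tank/Photos@tuesday | ssh backuphost zfs receive backuppool/Photos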

Compression

ZFS can compress your files before writing them to disk. It's a little counterintuitive, but this will actually increase your disk write performance, as long as your CPU can compress faster than your disk can write. For example, if your CPU can compress data at a 2:1 ratio at 700 MB/s, and your disk can write data at 100 MB/s, then you effectively end up writing data at 200 MB/s.

ZFS implements a fast compression algorithm called LZ4 that is capable of compressing at about 700 MB/s per core on a modern CPU. And surprisingly, it was able to compress my movie files by about 5%. This may not sound like much, but even 5% adds up when you've got terabytes of files.
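Turning it on is a single property, and the compressratio property tells you what you're getting; note that compression only applies to data written after the property is set:
zfs set compression=lz4 tank
zfs get compressratio tank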

Encryption

Encryption was recently added to ZFS on Linux (and it will make its way to the other operating systems eventually). It's only available in the git version right now (as opposed to the numbered releases); it will land in the mainline in the next release.
With ZFS handling the encryption itself, it can compress before encrypting (if you compress after encrypting and actually observe compression, then your encryption is done poorly, because it's supposed to output what appears to be a stream of random data).

ZFS can still do all pool-level maintenance, such as scrub, send, and receive, on encrypted datasets even if the encryption keys are not loaded.
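Creating an encrypted dataset looks roughly like this (a sketch assuming a passphrase key and a made-up dataset name; after a reboot, the key must be loaded again before the dataset can be mounted):
zfs create -o encryption=on -o keyformat=passphrase tank/secure
zfs load-key tank/secure
zfs mount tank/secure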

But Windows is Not Supported

While ZFS is supported on FreeBSD, illumos, Linux, and OS X, it is not supported on Windows. I dual boot Windows, so my solution is to have a second computer running Linux and ZFS, and to serve the files over the network via a network share.
