
Optimizing Offsite Backups with ZFS + Bookmarks

My preferred backup solution is to rsync systems to a large ZFS pool. The ZFS filesystems get snapshotted once a day, and you can keep whatever retention you want. I have some systems that keep daily snaps for 2 years, others that keep dailies for a month, then monthlies for 6 months, and so on.
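
The daily snapshot and pruning pass is just a small cron job. As a rough sketch (the filesystem name and the keep-30 policy here are invented for illustration):

# Take today's snapshot, named by date:
zfs snapshot tank/clientbox@$(date +%Y-%m-%d)
# Prune: keep the newest 30 snapshots, destroy the rest.
zfs list -H -o name -S creation -t snapshot -d 1 tank/clientbox |
    tail -n +31 |
    while read -r snap; do
        zfs destroy "$snap"
    done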

I find this solution gives near-optimal space utilization, which allows longer retention periods. Customizing retention periods is very easy. Restoring files is fast and simple: there’s no database or backup system app to tangle with, you just go to the directory where the files used to be and copy them back to the client. If the destruction happened in the past, just cd to the snapshot date you want first. And if you need to forensically reconstruct something, it’s easy to compare files/directories over time with standard shell tools.
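
For example, pulling one file out of a given day’s snapshot is just a copy out of the filesystem’s hidden .zfs directory (the host and paths here are invented):

# Every ZFS filesystem exposes its snapshots under .zfs/snapshot:
cd /tank/clientbox/.zfs/snapshot/2021-03-15/home/alice
scp -p report.txt clientbox:/home/alice/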

After a while, as we started using ZFS on more of the client systems, I started using ZFS replication to back them up to the backup server (which I will call behemoth from now on). The addition of ZFS replication to behemoth will become significant later.

In most installations, we pair this strategy with a rotating set of encrypted off-site disks (currently using GELI on FreeBSD for encryption). We used to rsync the current backups from behemoth to the offsite disk. That works reliably but can be very slow, especially in edge cases like systems with 10 million+ files. So I went looking for a way to use ZFS replication for the copies to the offsite disks.
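
Attaching an offsite disk is routine: decrypt the GELI layer, then import its pool. A rough sketch, with the device name and key path invented for the example:

geli attach -k /root/keys/offsite01.key /dev/da0    # prompts for the passphrase
zpool import offsite01
# ... run the backup ...
zpool export offsite01
geli detach da0.eli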

The Wrinkle

At first this seemed simple. For each ZFS filesystem, take a snapshot and zfs send it to the offsite disk. Keep a few “offsite” snaps on behemoth so we can do an incremental replication to the offsite disk. The problem comes from this ZFS requirement:

If an incremental stream is received, then the destination file
system must already exist, and its most recent snapshot must match
the incremental stream's source. […]

The issue is that behemoth is both a source and a destination for ZFS replication. Our nightly backup process replicates from a client’s ZFS filesystem to a matching ZFS filesystem on behemoth. If we come along and add a new snapshot to that filesystem on behemoth, then the next night the replication from the client to behemoth will fail, because the newest snapshot on behemoth is now one that the client doesn’t know about.
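
To make that concrete, the nightly job on a client does roughly the following (the client’s pool name is invented here):

# On the client:
zfs snapshot zroot/mailman@2021-03-19
zfs send -i zroot/mailman@2021-03-18 zroot/mailman@2021-03-19 |
    ssh behemoth zfs recv tank/mailman
# If tank/mailman on behemoth has meanwhile gained a snapshot the client
# doesn't have, behemoth's most recent snapshot no longer matches the
# incremental source, and zfs recv rejects the stream, per the
# requirement quoted above.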

The solution is to convert that new snapshot we made on behemoth to a bookmark. A bookmark is kind of like a snapshot in that it pinpoints the state of the filesystem at a specific point in time, and ZFS can send new blocks that were added since that point. You can’t mount or access a bookmark in any way, but you CAN do an incremental zfs send using the bookmark as the starting point. AND, the existence of a bookmark does not violate the rule about the latest snapshot having to match on the sender and receiver, so we are able to continue replicating from client to behemoth while still keeping other “pseudo-snapshots” (bookmarks) of the filesystem state to use with the offsite disks.
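
In command terms, the conversion and its later use look like this (all names here are generic placeholders):

zfs bookmark pool/fs@snap pool/fs#snap    # snapshot becomes a bookmark
zfs destroy pool/fs@snap                  # the bookmark survives on its own
# A bookmark is a valid origin for an incremental send:
zfs send -i pool/fs#snap pool/fs@later | zfs recv -F offsite/pool/fs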

There is still a window of vulnerability:

  • Bookmarks are created from snapshots. So we have to first create the snapshot on behemoth, which could cause replication TO behemoth to fail.
  • zfs send can only use a bookmark in place of the “origin” snapshot, not the “destination” snapshot. So, after we create the snapshot, we then have to complete the zfs send operation before we can convert it to a bookmark. If a client machine tries to replicate to behemoth during that window, it will fail.

In practice, this has never been a problem. The replications from clients happen overnight, and we tend to write to offsite disks during the day, so they don’t overlap much. Also, the copy to offsite disks with this method is SO FAST compared to the old rsync method that it reduces the overlap even further.

And, There’s More

There is a secondary benefit to using bookmarks for this. Depending on how many offsite disks are in rotation, weeks or months may pass before you see the same offsite disk again. In order to do an incremental replication, we would need to have kept the snapshot that was made the last time that disk was used. That means holding on to data blocks that are referenced only by that one snapshot, for longer than you want, which wastes space. A bookmark, by contrast, pins no data blocks at all, so those blocks can be freed as soon as the normal snapshot retention lets go of them.
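
(To see how much space a lone snapshot is pinning, look at its USED column, which for a snapshot is the space that would be freed by destroying it:)

zfs list -t snapshot -o name,used -d 1 tank/mailman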

Every now and then, I have a “crisis of faith” where I can’t convince myself that I really understand how bookmarks work and why what I’m doing is OK. I mean, if bookmarks don’t hold on to those old data blocks, how can they let me do the replication I need to make the offsite disks “current”?

The central fact to remember is that the bookmark is only used in place of the origin snapshot of the incremental zfs send. That means we are looking for all NEW data SINCE that snapshot/bookmark was created. The data blocks that existed at the time the snapshot was made are from the past: they were previously replicated to the offsite disk. The fact that they may since have been freed is of no consequence, as we only want to replicate data added after that point.
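
When the crisis hits, a throwaway file-backed pool settles it in a couple of minutes. This is purely a demonstration, with every name made up, and not part of any production setup:

truncate -s 256m /tmp/demo.img
zpool create demo /tmp/demo.img
zfs create demo/src
zfs snapshot demo/src@base
zfs send demo/src@base | zfs recv demo/dst    # full copy, like a fresh offsite disk
zfs bookmark demo/src@base demo/src#base
zfs destroy demo/src@base                     # blocks unique to @base may now be freed
echo hello > /demo/src/newfile
zfs snapshot demo/src@now
zfs send -i demo/src#base demo/src@now | zfs recv demo/dst    # still works
zpool destroy demo && rm /tmp/demo.img

The last send succeeds because demo/dst still holds @base; the sender only needs the bookmark to know which blocks are newer than that point.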

Some Practical Concerns

All the theory is nice, but I find a few concrete examples to be helpful. Here is the process of making an offsite backup with this method.

The offsite disks are named “offsiteNN” where NN is 01, 02, 03, etc. The snapshots we create on behemoth and the offsite disk are each specific to that disk. They include the disk name and a datestamp, e.g. “offsite01.2019-07-21”, “offsite02.2019-07-27”, etc.
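
(In a script, deriving that name is one line; $DISK here is a hypothetical variable holding the current disk’s name:)

SNAP="${DISK}.$(date +%Y-%m-%d)"    # e.g. offsite01.2021-03-19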

The script I use is given a ZFS pool name on behemoth, e.g. tank, and it transmits all filesystems under that root to the offsite disk.

If a filesystem doesn’t exist on the offsite disk yet, the script makes a snapshot on the source and uses zfs send to copy that snapshot to the offsite disk. Then it converts that snapshot to a bookmark on behemoth. (It remains a snapshot on the offsite disk.)

If a filesystem already exists on the offsite disk, the script looks for the latest snapshot there. Then it looks for a bookmark on behemoth which matches that snapshot on the destination. If found, it makes a new snapshot on behemoth and does an incremental zfs send, from that bookmark to the new snapshot, to the offsite disk.
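
Stripped of error handling, the core logic looks something like this. This is a sketch, not the production script; the $POOL and $DISK values and the skip-on-missing-bookmark behavior are assumptions for illustration:

#!/bin/sh
# Replicate every filesystem under $POOL to the attached offsite pool $DISK.
POOL=tank
DISK=offsite01                          # hypothetical: today's offsite pool
SNAP="${DISK}.$(date +%Y-%m-%d)"        # e.g. offsite01.2021-03-19

zfs list -H -o name -r "$POOL" | while read -r fs; do
    if ! zfs list "$DISK/$fs" >/dev/null 2>&1; then
        # New to this disk: full replication.
        zfs snapshot "$fs@$SNAP"
        zfs send "$fs@$SNAP" | zfs recv "$DISK/$fs"
    else
        # Newest snapshot on the offsite copy...
        last=$(zfs list -H -o name -S creation -t snapshot -d 1 "$DISK/$fs" | head -n 1)
        last=${last##*@}
        # ...must have a matching bookmark on behemoth.
        if ! zfs list -t bookmark "$fs#$last" >/dev/null 2>&1; then
            echo "no bookmark $fs#$last, skipping" >&2
            continue
        fi
        zfs snapshot "$fs@$SNAP"
        zfs send -i "$fs#$last" "$fs@$SNAP" | zfs recv -F "$DISK/$fs"
    fi
    # Either way: leave a bookmark behind, drop the snapshot.
    zfs bookmark "$fs@$SNAP" "$fs#$SNAP"
    zfs destroy "$fs@$SNAP"
done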

Here is the situation for one filesystem on behemoth right before making an offsite backup. This is on March 19.

behemoth:/root# zfs list -o name -s name -t all | grep mailman
[...]
tank/mailman
tank/mailman#offsite01.2021-02-05
tank/mailman#offsite01.2021-02-26
tank/mailman#offsite02.2021-01-22
tank/mailman#offsite02.2021-02-12
[...]
tank/mailman@2021-03-14
tank/mailman@2021-03-15
tank/mailman@2021-03-16
tank/mailman@2021-03-17
tank/mailman@2021-03-18
[...]
offsite01/tank/mailman
offsite01/tank/mailman@offsite01.2021-02-05
offsite01/tank/mailman@offsite01.2021-02-26

Note that bookmarks are displayed just like snapshots but with a “#” instead of “@”. Also note that behemoth’s tank has bookmarks from offsite01 and offsite02, different disks in the rotation. Here are the steps for this filesystem:

  1. Latest snap on offsite01 is offsite01.2021-02-26, and we have a matching bookmark tank/mailman#offsite01.2021-02-26 on behemoth. Good!
  2. Make a new snapshot: zfs snapshot tank/mailman@offsite01.2021-03-19
  3. Do an incremental send: zfs send -i tank/mailman#offsite01.2021-02-26 \
    tank/mailman@offsite01.2021-03-19 |\
    zfs recv -F offsite01/tank/mailman
  4. Convert the snapshot to a bookmark: zfs bookmark \
    tank/mailman@offsite01.2021-03-19 \
    tank/mailman#offsite01.2021-03-19
  5. Destroy the snapshot: zfs destroy tank/mailman@offsite01.2021-03-19

Now offsite01 has a new 2021-03-19 snapshot with current data for the mailman filesystem. behemoth has the 2021-03-19 bookmark left behind to be used as the replication origin the next time offsite01 comes our way.