ZFS Replication


As I’ve been investigating ZFS for use on production systems, I’ve been making a great many notes and jotting down little "cookbook recipes" for various tasks. One of the coolest systems I’ve created recently uses the zfs send and receive commands, along with incremental snapshots, to create a replicated ZFS environment across two different systems. True, all this is in the zfs manual page, but sometimes a quick demonstration makes things easier to understand and follow.


While this isn’t true filesystem replication (you’d have to look at something like StorageTek AVS for that), it does provide periodic snapshots and incremental updates. These can be run every minute if you’re driving this from cron, or at even shorter intervals if you write your own daemon. Either way, this suffices for disaster recovery and redundancy if you don’t need up-to-the-second replication between systems.

I’ve typed up my notes in blog format so you can follow along with this example yourself; all you’ll need is a Solaris system running ZFS.

First, as with my last walkthrough, I’ll create a couple of files to use for testing purposes. In a real-life scenario, these would most likely be pools of disks in a RAIDZ configuration, and the two pools would also be on physically separate systems. I’m only using 100MB files for each, as that’s all I need for this proof of concept.

[[email protected]]$ mkfile 100m master
[[email protected]]$ mkfile 100m slave
[[email protected]]$ zpool create master $PWD/master
[[email protected]]$ zpool create slave $PWD/slave
[[email protected]]$ zpool list
master 95.5M 84.5K 95.4M 0% ONLINE -
slave 95.5M 52.5K 95.4M 0% ONLINE -
[[email protected]]$ zfs list
master 77K 63.4M 24.5K /master
slave 52.5K 63.4M 1.50K /slave

There we go. The naming should be pretty self-explanatory : The "master" is the primary storage pool, which will replicate and push data through to the backup "slave" pool. Now, I’ll create a ZFS filesystem and add something to it. I had a few source tarballs knocking around, so I just unpacked one (GNU grep) to give me a set of files to use as a test :

[[email protected]]$ zfs create master/data
[[email protected]]$ cd /master/data/
[[email protected]]$ gtar xzf ~/grep-2.5.1.tar.gz
[[email protected]]$ ls
grep-2.5.1

We can also see from "zfs list" we’ve now taken up some space :

[[email protected]]$ zfs list
master 3.24M 60.3M 25.5K /master
master/data 3.15M 60.3M 3.15M /master/data
slave 75.5K 63.4M 24.5K /slave

Now, we’ll transfer all this over to the "slave", and start the replication going. We first need to take an initial snapshot of the filesystem, as that’s what "zfs send" works on. It’s also worth noting here that in order to transfer the data to the slave, I simply piped it to "zfs receive". If you’re doing this between two physically separate systems, you’d most likely just pipe this through SSH between the systems and set up keys to avoid the need for passwords. Anyway, enough talk :

[[email protected]]$ zfs snapshot master/data@1
[[email protected]]$ zfs send master/data@1 | zfs receive slave/data

That sent the snapshot through to the slave. It’s also worth pointing out that I didn’t have to recreate the exact same pool or ZFS structure on the slave (which may be useful if you are replicating between dissimilar systems), but I chose to keep the filesystem layout the same for the sake of legibility in this example. I also simply used a numeric identifier for each snapshot; in a production system, timestamps may be more appropriate. Anyway, let’s take a quick look at "zfs list", where we’ll see the slave has now gained a snapshot utilising exactly the same amount of space as the master :

[[email protected]]$ zfs list
master 3.25M 60.3M 25.5K /master
master/data 3.15M 60.3M 3.15M /master/data
master/data@1 0 - 3.15M -
slave 3.25M 60.3M 24.5K /slave
slave/data 3.15M 60.3M 3.15M /slave/data
slave/data@1 0 - 3.15M -
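
For the two-host case mentioned above, the same pipe can be run through SSH. What follows is only a rough sketch under assumptions of mine: "backuphost" is a hypothetical hostname, key-based SSH login is already set up, the remote user is allowed to run "zfs receive", and the function name send_full is made up for this example rather than anything ZFS provides.

```shell
# Send a full snapshot stream to a pool on another host over SSH.
# Assumptions (not from the walkthrough): the host name is hypothetical
# and the remote account has permission to run "zfs receive".
send_full() {
    snap=$1      # snapshot to send, in dataset@name form
    host=$2      # remote host, e.g. the hypothetical "backuphost"
    target=$3    # dataset to create on the remote side, e.g. slave/data

    # Same pipe as the local example, with ssh carrying the stream.
    zfs send "$snap" | ssh "$host" zfs receive "$target"
}

# Usage (hypothetical host name):
#   send_full "$SNAPSHOT" backuphost slave/data
```

Nothing changes on the sending side; the only difference from the local demonstration is where "zfs receive" runs.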

Now, here comes a big "gotcha". You now have to set the "readonly" attribute on the slave. I discovered that if this was not set, even just cd-ing into the slave’s mountpoints would cause things to break in subsequent replication operations; presumably down to metadata (access times and the like) being altered.

[[email protected]]$ zfs set readonly=on slave/data
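
Since a forgotten readonly setting only shows up later as a failed replication, a script might want to check it before each transfer. A minimal sketch, assuming a POSIX-style shell; the helper name is_readonly is hypothetical, while "zfs get -H -o value" is the standard way to print a bare property value.

```shell
# Succeed (exit status 0) only if the dataset's readonly property is "on".
# is_readonly is a made-up helper name for this example.
is_readonly() {
    [ "$(zfs get -H -o value readonly "$1")" = "on" ]
}

# Usage: set the property only when it is not already on.
#   is_readonly slave/data || zfs set readonly=on slave/data
```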

So, let’s look in the slave to see if our files are there :

[[email protected]]$ ls /slave/data
grep-2.5.1

Excellent stuff! However, the real coolness starts with the incremental transfers: instead of transferring the whole lot again, we can send only the data that actually changed. This drastically reduces the bandwidth and time needed to replicate data, making a "cron" based system of periodic snapshots and transfers feasible. To demonstrate this, I’ll unpack another tarball (this time, GNU bison) on the master so I have some more data to send :

[[email protected]]$ cd /master/data
[[email protected]]$ gtar xzf ~/bison-2.3.tar.gz

And we’ll now make a second snapshot, and transfer differences between this one and the last :

[[email protected]]$ zfs snapshot master/data@2
[[email protected]]$ zfs send -i master/data@1 master/data@2 | zfs receive slave/data

Checking to see what’s happened, we see the slave has gained another snapshot:

[[email protected]]$ zfs list
master 10.2M 53.3M 25.5K /master
master/data 10.1M 53.3M 10.1M /master/data
master/data@1 32.5K - 3.15M -
master/data@2 0 - 10.1M -
slave 10.2M 53.3M 25.5K /slave
slave/data 10.1M 53.3M 10.1M /slave/data
slave/data@1 32.5K - 3.15M -
slave/data@2 0 - 10.1M -

And our new data is now there as well :

[[email protected]]$ ls /slave/data/
bison-2.3 grep-2.5.1
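
Typing the previous snapshot name by hand is fine for a demonstration, but a script has to work it out for itself. Here is one possible sketch, assuming a POSIX-style shell such as ksh or bash; the function names latest_snapshot and replicate_incremental are my own invention, and the dataset name is used as a plain grep pattern, so names containing regular expression characters would need escaping.

```shell
# Print the full name of the newest snapshot of a dataset, or nothing
# if it has none. Snapshots of child datasets are filtered out.
latest_snapshot() {
    zfs list -H -t snapshot -o name -s creation -r "$1" |
        grep "^$1@" | sed -n '$p'
}

# Take a new snapshot of $1 named $3, then send only the changes since
# the previous snapshot to the dataset $2.
replicate_incremental() {
    src=$1; dst=$2; tag=$3
    prev=$(latest_snapshot "$src")
    [ -n "$prev" ] || { echo "no earlier snapshot of $src" >&2; return 1; }
    zfs snapshot "$src@$tag" || return 1
    zfs send -i "$prev" "$src@$tag" | zfs receive "$dst"
}

# Usage, with a timestamp as the snapshot name:
#   replicate_incremental master/data slave/data "$(date +%Y%m%d%H%M%S)"
```

This relies on the newest snapshot on the master also being present on the slave, which holds as long as every snapshot taken is also sent.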

And that’s it. All that remains to turn this into a production system between two hosts is for a periodic cron job to be written that runs at the appropriate intervals (daily, or even every minute if need be) and snapshots the filesystem before transferring it. You’ll also likely want to have another job that clears out old snapshots, or maybe archives them off somewhere.
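
As a starting point for the clean-up job, here is a rough sketch of timestamped snapshot names and a pruning helper. It assumes a POSIX-style shell such as ksh or bash; the function names, the script path in the crontab comment and the retention count are all hypothetical choices of mine.

```shell
# Timestamp to use as a snapshot name, e.g. 20240131-235900.
snapshot_tag() {
    date +%Y%m%d-%H%M%S
}

# Destroy all but the newest $2 snapshots of dataset $1.
# Always keep at least the newest snapshot that both sides share,
# because the next incremental send needs it as its starting point.
prune_snapshots() {
    dataset=$1; keep=$2
    snaps=$(zfs list -H -t snapshot -o name -s creation -r "$dataset" |
        grep "^$dataset@")
    total=$(printf '%s\n' "$snaps" | grep -c .)
    drop=$((total - keep))
    [ "$drop" -gt 0 ] || return 0
    # The list is sorted oldest first, so the first $drop lines go.
    printf '%s\n' "$snaps" | sed "${drop}q" |
        while read -r snap; do
            zfs destroy "$snap"
        done
}

# Hypothetical crontab entry running a wrapper script once an hour:
#   0 * * * * /usr/local/bin/zfs-replicate.sh
```

Running the same pruning on both hosts with the same retention count keeps the two sets of snapshots in step.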