While browsing the ZFS man page recently, I made an interesting discovery: ZFS can export block devices from a zpool, which means you can separate "ZFS the volume manager" from "ZFS the filesystem". This may well be old news to many; however, I haven't seen many references to it on the web, so I thought I'd post a quick blog update.
The example used in this post is the creation of a mirrored zpool which is then used to create a block device, on top of which I'll create a UFS filesystem. The reasons for doing this are many and varied: you may have an application that needs UFS (particularly forcedirectio); you may need to create a block device for some reason but all your storage is currently tied up in zpools; or you may just need a quick block device to use for testing.
Using ZFS as a volume manager also has its advantages over something like SVM (formerly "DiskSuite"). The management features are much improved (along with a browser-based GUI, if that's your thing), and you also gain access to ZFS features that operate at the volume manager layer and aren't dependent on the filesystem parts of ZFS. These include end-to-end error checking and recovery, along with snapshots.
Read on for the full update...
First, as I don't have any spare disks available, I'll create some files to use as pseudo disks. One of the many cool things about ZFS is that you can use it on anything from files like this for testing, to USB keyfobs, to honking great storage arrays. As a proof of concept, all I need is a couple of 100MB files, which I'll then use to create a ZFS mirrored pool. I don't actually gain any redundancy here, as both files live on the same disk, but you can see how this could be used in a real-life scenario:
[mark@solaris:~] # mkfile 100m disk1
[mark@solaris:~] # mkfile 100m disk2
[mark@solaris:~] # sudo zpool create test mirror $PWD/disk1 $PWD/disk2
Now we can see the "test" pool is online and view its status:
[mark@solaris:~] # zpool list
NAME    SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
test   95.5M   52.5K   95.4M     0%  ONLINE  -
[mark@solaris:~] # zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:
        NAME                         STATE     READ WRITE CKSUM
        test                         ONLINE       0     0     0
          mirror                     ONLINE       0     0     0
            /export/home/mark/disk1  ONLINE       0     0     0
            /export/home/mark/disk2  ONLINE       0     0     0
errors: No known data errors
[mark@solaris:~] # zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
test  75.5K  63.4M  24.5K  /test
So I have around 63MB to play with. Right now, I could just proceed as normal and create a few ZFS filesystems, but instead I'll create a block volume. This is done by using the "-V" flag with "zfs create", and specifying a size for our block device:
[mark@solaris:~] # sudo zfs create -V 60m test/testvol
This creates an entry under /dev/zvol, much like Veritas creates things under /dev/vx/. We can then create a UFS filesystem on the raw device (rdsk) and mount the block device (dsk) under /mnt/testvol:
[mark@solaris:~] # sudo newfs /dev/zvol/rdsk/test/testvol
newfs: construct a new file system /dev/zvol/rdsk/test/testvol: (y/n)? y
/dev/zvol/rdsk/test/testvol: 122880 sectors in 20 cylinders of 48 tracks, 128 sectors
60.0MB in 2 cyl groups (14 c/g, 42.00MB/g, 20160 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 86176,
[mark@solaris:~] # sudo mkdir /mnt/testvol
[mark@solaris:~] # sudo mount -F ufs /dev/zvol/dsk/test/testvol /mnt/testvol
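As a quick sanity check on the newfs output above, the sector count it reports follows directly from the volume size: 60MB (counted in units of 1024 x 1024 bytes) divided into 512-byte sectors. The helper below is only a sketch to show the arithmetic; the function name zvol_sectors is made up for this example, and the 512-byte sector size is an assumption that happens to match the figures newfs printed.

```shell
#!/bin/sh
# Hypothetical helper: convert a volume size in MB (1024*1024 bytes)
# into a count of sectors. The sector size defaults to 512 bytes,
# which is an assumption, not something read from the device.
zvol_sectors() {
    mib=$1
    sector_bytes=${2:-512}
    echo $(( mib * 1024 * 1024 / sector_bytes ))
}

# Size used in this post: zfs create -V 60m test/testvol
zvol_sectors 60    # prints 122880
```

The same number drops out of the geometry newfs reports: 20 cylinders x 48 tracks x 128 sectors = 122880.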
Let's check it's really there:
[mark@solaris:~] # df -h /mnt/testvol
Filesystem             size   used  avail capacity  Mounted on
/dev/zvol/dsk/test/testvol
                        55M   1.0M    49M     3%    /mnt/testvol
Now let's try snapshotting it. I'll create a file, snapshot the volume, and then delete the file to prove we could get it back.
[mark@solaris:~] # sudo touch /mnt/testvol/testfile
[mark@solaris:~] # ls -lh /mnt/testvol/testfile
-rw-r--r-- 1 root root 0 2007-07-06 22:57 /mnt/testvol/testfile
[mark@solaris:~] # sudo zfs snapshot test/testvol@1
Now that we've got our snapshot, we can delete that "testfile", as it's safely on the snapshot. The process of taking the snapshot created another block device under /dev/zvol, which can be mounted read-only:
[mark@solaris:~] # sudo rm /mnt/testvol/testfile
[mark@solaris:~] # sudo mkdir /mnt/testsnapshot
[mark@solaris:~] # sudo mount -F ufs -o ro /dev/zvol/dsk/test/testvol@1 /mnt/testsnapshot
[mark@solaris:~] # ls /mnt/testsnapshot/
lost+found/ testfile
And just to confirm, zfs now shows our snapshotted volume:
[mark@solaris:~] # zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
test            60.1M  3.39M  24.5K  /test
test/testvol    5.14M  58.2M  5.06M  -
test/testvol@1  74.5K      -  5.06M  -
How cool is that?
Tuesday, September 11. 2007 at 12:04
The main issue is that some file data may still be waiting to be flushed from the page cache to the parent zvol (so some file contents may be randomly stale or zero in the snapshot).
In your example, if you do the following ...
# date >/mnt/testvol/testfile; zfs snapshot test/testvol@1
... the snapshot will probably contain the file, but will probably NOT contain the expected data (i.e. the date). This is because the metadata describing the file gets written immediately, but the file data has to wait for fsflush.
By default, fsflush aims to scan the entire page cache every 30 seconds. This means that unless you sync the file contents explicitly, you could be waiting 30 seconds or more for new file data to make it to non-volatile storage.
On systems with a large amount of RAM, or lots of dirty pages, or both, fsflush may not manage to clean the entire page cache within the default 30 seconds. Indeed, on systems with lots of RAM fsflush is often configured (via the /etc/system parameters autoup and tune_t_fsflushr) to take much longer (typically 5 minutes) so that it doesn't use too much CPU.
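To put rough numbers on this: as I understand the tunables, fsflush wakes up every tune_t_fsflushr seconds and aims to cover all of memory once per autoup seconds, so each wake-up examines roughly tune_t_fsflushr/autoup of memory. The sketch below just does that division. The function name fsflush_wakeups is invented for this example, and the values passed in (autoup=30 with tune_t_fsflushr=1 as defaults, autoup=300 with tune_t_fsflushr=5 as a tuned system) are assumptions; check the tunable parameters reference for your release.

```shell
#!/bin/sh
# Hypothetical helper: number of fsflush wake-ups needed for one full
# pass over memory, given autoup and tune_t_fsflushr (both in seconds).
# Each wake-up then examines about 1/N of memory, where N is the result.
fsflush_wakeups() {
    autoup=$1
    fsflushr=$2
    echo $(( autoup / fsflushr ))
}

# Assumed defaults: autoup=30, tune_t_fsflushr=1
fsflush_wakeups 30 1     # prints 30

# Assumed "large RAM" tuning: autoup=300 (5 minutes), tune_t_fsflushr=5
fsflush_wakeups 300 5    # prints 60
```

In the second case a dirty page may sit in the page cache for around five minutes before fsflush gets to it, which is why an explicit sync or lock matters before snapshotting.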
If your file data hasn't made it to storage at the point the snapshot was taken, you just won't see it in the snapshot (or ever again in the original filesystem if the system loses power or crashes before the flush occurs).
However, if you do something like this ...
# date >/mnt/testvol/testfile; lockfs -wf /mnt/testvol; zfs snapshot test/testvol@1; lockfs -u /mnt/testvol
... all should be well. The downside is that the snapshot operation now takes much longer, such that it may disrupt other users of the source filesystem.
Monday, March 10. 2008 at 04:36
I've just set up a raidz zpool and then created a volume as described for UFS.
This way I get the required UFS filesystem over 4 disks without the RAID5 write-hole problem I would have faced using Solaris Volume Manager.
Plus it was a snap to configure thanks to the zfs commands.
Thursday, October 16. 2008 at 19:52
Thursday, October 16. 2008 at 22:06
This was when I tried with Solaris 10 Update 4.
Wednesday, January 7. 2009 at 19:08
From lockfs(1M):
The options are mutually exclusive: wndheuf. If you do
specify more than one of these options on a lockfs command
line, the utility does not protest and invokes only the last
option specified. In particular, you cannot specify a flush
(-f) and a lock (for example, -w) on the same command line.
However, all locking operations implicitly perform a flush,
so the -f is superfluous when specifying a lock.
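Taking that excerpt at face value, the "lockfs -wf" in the earlier comment would invoke only the flush and leave the filesystem unlocked while the snapshot is taken. A minimal sketch of the lock, snapshot, unlock sequence using a plain write lock might look like the following. The function name snapshot_ufs_zvol and the RUN variable are hypothetical, added only for this example; setting RUN=echo prints the commands instead of running them, which is handy away from a Solaris host.

```shell
#!/bin/sh
# Hypothetical dry-run hook: leave empty to execute the commands,
# or set RUN=echo to print them instead.
RUN=${RUN:-}

# Sketch: write-lock a UFS filesystem, snapshot its backing zvol,
# then unlock. Relies on the lockfs(1M) statement that locking
# operations implicitly flush, so -f is not passed.
snapshot_ufs_zvol() {
    mnt=$1     # UFS mount point, e.g. /mnt/testvol
    snap=$2    # snapshot name, e.g. test/testvol@1

    # Write-lock (and implicitly flush) the filesystem.
    $RUN lockfs -w "$mnt" || return 1

    # Snapshot the zvol while writes are blocked; keep the exit status.
    $RUN zfs snapshot "$snap"
    rc=$?

    # Release the lock whether or not the snapshot succeeded.
    $RUN lockfs -u "$mnt"
    return $rc
}
```

With RUN=echo, calling snapshot_ufs_zvol /mnt/testvol test/testvol@1 prints the three commands in order. The caveat from the earlier comment still applies: writers to the filesystem are blocked for as long as the lock is held.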
Wednesday, December 29. 2010 at 01:41
I have come across a case where I need to use UFS for an application that has been migrated from Solaris 8 to Solaris 10 using ZFS, but the application doesn't work on ZFS. There is a later version of this app that works on ZFS, but upgrading to that is part of a future project.