Building a redundant iSCSI and NFS cluster with Debian - Part 2


Note : This page may contain outdated information and/or broken links; some of the formatting may be mangled due to the many different code-bases this site has been through in over 20 years; my opinions may have changed etc. etc.

This is part 2 of a series on building a redundant iSCSI and NFS SAN with Debian.

Part 1 - Overview, network layout and DRBD installation
Part 2 - DRBD and LVM
Part 3 - Heartbeat and automated failover
Part 4 - iSCSI and IP failover
Part 5 - Multipathing and client configuration
Part 6 - Anything left over!


Configuring DRBD

Following on from part one, where we covered the basic architecture and got DRBD installed, we’ll now configure and then initialise the shared storage across both nodes. The configuration file for DRBD (/etc/drbd.conf) is very simple, and is the same on both hosts. The full configuration file is below - you can copy and paste this in; I’ll go through each line afterwards and explain what it all means. Many of these sections and commands can be fine-tuned - see the man pages for drbd.conf and drbdsetup for more details.


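global {
}

resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout  0;
  }

  disk {
    on-io-error   detach;
  }

  net {
    on-disconnect reconnect;
  }

  syncer {
    rate 30M;
  }

  on weasel {
    device     /dev/drbd0;
    disk       /dev/md3;
    # replace x with this node's replication address from part one
    address    10.0.0.x:7788;
    meta-disk  internal;
  }

  on otter {
    device     /dev/drbd0;
    disk       /dev/md3;
    # replace x with this node's replication address from part one
    address    10.0.0.x:7788;
    meta-disk  internal;
  }
}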

The structure of this file should be pretty obvious - sections are surrounded by curly braces, and there are two main sections - a global one, in which nothing is defined, and a resource section, where a shared resource named “r0” is defined.

The global section only has a few options available to it - see the DRBD website for more information - though it’s pretty safe to say you can ignore this part of the configuration file when you’re getting started. The next section is the one where we define our shared resource, r0. You can define more than one resource within the configuration file, and you don’t have to call them “r0”, “r1” etc. They can have any arbitrary name - but the name MUST only contain alphanumeric characters. No whitespace, no punctuation, and no characters with accents.
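If you generate resource names from a script, the naming rule above is easy to check up front. This is only a sketch: valid_resource_name is a hypothetical helper of mine, not part of DRBD, and it simply enforces "ASCII letters and digits only" as described above.

```shell
#!/bin/sh
# Force byte-wise matching so accented characters never fall inside A-Z / a-z.
LC_ALL=C
export LC_ALL

# Hypothetical helper (not a DRBD tool): succeeds only if the name is
# non-empty and made up purely of ASCII letters and digits.
valid_resource_name() {
    case "$1" in
        "" | *[!A-Za-z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, "valid_resource_name r0 && echo ok" prints "ok", while a name such as "my-res" is rejected.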

So, r0 is our shared storage resource. The next line says that we’ll be using protocol “C” - and this is what you should be using in pretty much any scenario. DRBD can operate in one of three modes, which provide various levels of replication integrity. Consider what happens in the event of some data being written to disk on the primary :

  • Protocol A considers writes on the primary to be complete if the write request has reached the local disk, and the local TCP network buffer. Obviously, this provides no guarantee whatsoever it will reach the secondary node: any hardware or software failure could result in you losing data. This option should only be chosen if you don’t care about the integrity of your data, or are replicating over a high-latency link.
  • Protocol B goes one step further, and considers the write to be complete if it reaches the secondary node. This is a little safer, but you could still lose data in the event of power or equipment failure as there is no guarantee the data will have reached the disk on the secondary.
  • Protocol C is the safest of the three, as it will only consider the write request completed once the secondary node has safely written it to disk. This is therefore the most common mode of operation, but it does mean that you will have to take into account network latency when considering the performance characteristics of your storage. If you have the fastest drives in the world, it won’t help you if you have to wait for a slow secondary node to complete its write over a high-latency connection.

The next somewhat cryptic-looking line (“incon-degr-cmd…”) in the configuration simply defines what to do in the event of the primary node starting up, and discovering its copy of the data is inconsistent. We obviously wouldn’t want this propagated over to the secondary, so this statement tells it to display an error message, wait 60 seconds and then shut down.
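To get a feel for how the three protocols trade safety against write latency, here is a rough back-of-the-envelope model. It is my own simplification, not anything taken from DRBD itself: it assumes the local write and the network send happen in parallel, and every number you feed it is a hypothetical latency in microseconds rather than a measurement.

```shell
#!/bin/sh
# Print the larger of two integers.
max() {
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# write_latency PROTOCOL LOCAL_DISK_US NETWORK_RTT_US REMOTE_DISK_US
# Simplified model (an assumption): local and remote work overlap, so the
# slower of the two paths decides when the write is reported complete.
write_latency() {
    case "$1" in
        A) echo "$2" ;;                      # local disk only; the send is just buffered
        B) max "$2" "$3" ;;                  # local disk, or a round trip to the peer
        C) max "$2" "$(( $3 + $4 ))" ;;      # local disk, or round trip plus remote disk write
        *) echo "unknown protocol: $1" >&2; return 1 ;;
    esac
}
```

With a made-up 200us local disk, 500us round trip and 300us remote disk, "write_latency C 200 500 300" prints 800, against 200 for protocol A. That gap is the price of knowing the data is on both disks.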

We then come to the startup section - here we simply state (“wfc-timeout 0”) that this node should wait forever for the other node to connect.

In the disk section, we tell DRBD that if there is an I/O error or some other problem with the underlying storage, it should detach from it. The net section states that if we experience a communication problem with the other node, instead of going into standalone mode we should try to reconnect.

The syncer section controls various aspects of the actual replication - the main parameter to tune is the rate at which we’ll replicate data between the nodes. While it may be tempting at first glance to enter a rate of 100M for a Gigabit network, it is advisable to reserve bandwidth. The documentation says that a good rule of thumb is to use about 30% of the available replication bandwidth, taking into account local IO systems, and not just the network.
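The 30% rule of thumb is simple enough to work out by hand, but as a sketch it looks like this. The helper names are mine, and the throughput figures you pass in are assumptions you would need to measure on your own hardware (roughly 110 MB/s is a common ballpark for usable Gigabit throughput, but treat that as a guess).

```shell
#!/bin/sh
# Print the smaller of two integers.
min() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# syncer_rate NETWORK_MB_PER_SEC DISK_MB_PER_SEC
# Prints 30% of whichever is slower, the network or the local IO system,
# rounded down to a whole number of MB/s.
syncer_rate() {
    echo $(( $(min "$1" "$2") * 30 / 100 ))
}
```

For example, "echo "rate $(syncer_rate 110 200)M;"" prints "rate 33M;", which is in the same region as the 30M used in the configuration above.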

We now come to the “meat” of this example - we define the two storage nodes - “otter” and “weasel”. Refer back to part one and the diagram there, if you need a quick reminder on the network layout and host names. For each node, we define what the device name is (/dev/drbd0); the underlying storage (/dev/md3); the address and port that DRBD will be listening on (10.0.0.x:7788); and what to use to store the metadata for the device. “meta-disk internal” just stores the metadata in the last 128MB of the underlying storage - you could use another disk for this, but it makes sense to use the same device unless it’s causing performance issues.

Initialising DRBD

Now we’ve got DRBD configured, we need to start it up - but first, we need to let the two nodes know which is going to be the primary. If you start DRBD on both nodes :

/etc/init.d/drbd start

You will notice that the command appears to hang for a while as DRBD initialises the volumes, but if you check the output of “dmesg” on another terminal, you’ll see something similar to : 

drbd0: Creating state block
drbd0: resync bitmap: bits=33155488 words=1036110
drbd0: size = 126 GB (132621952 KB)
drbd0: 132621952 KB now marked out-of-sync by on disk bit-map.
drbd0: Assuming that all blocks are out of sync (aka FullSync)
drbd0: 132621952 KB now marked out-of-sync by on disk bit-map.
drbd0: drbdsetup [4087]: cstate Unconfigured --> StandAlone
drbd0: drbdsetup [4113]: cstate StandAlone --> Unconnected
drbd0: drbd0_receiver [4114]: cstate Unconnected --> WFConnection
drbd0: drbd0_receiver [4114]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 0:00000001:00000001:00000001:00000001:00
drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:00
drbd0: drbd0_receiver [4114]: cstate WFReportParams --> Connected
drbd0: I am inconsistent, but there is no sync? BOTH nodes inconsistent!
drbd0: Secondary/Unknown --> Secondary/Secondary

And if you also check the contents of /proc/drbd on both nodes, you’ll see something similar to :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@weasel, 2008-02-19 10:10:14
 0: cs:Connected st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:24523 lo:0 pe:0 ua:0 ap:0

So you can see, both nodes have started up and have connected to each other - but have reached deadlock, as neither knows which one should be the primary. Pick one of the nodes to be the primary (in my case, I chose “otter”) and issue the following command on it:

drbdadm -- --do-what-I-say primary all

And then recheck /proc/drbd (or use “/etc/init.d/drbd status”). On the primary you’ll see it change to something like this :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@otter, 2008-02-19 10:07:52
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:169984 nr:0 dw:0 dr:169984 al:0 bm:16200 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.2% (129347/129513)M
        finish: 3:15:56 speed: 11,260 (10,624) K/sec 

Whilst the secondary will show :

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@weasel, 2008-02-19 10:10:14
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:476216 dw:476216 dr:0 al:0 bm:24552 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.4% (129048/129513)M
        finish: 3:32:40 speed: 10,352 (9,716) K/sec
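As a sanity check on the "finish:" estimate, you can redo the arithmetic yourself from the remaining megabytes and the current speed. This sketch assumes "K/sec" means kilobytes per second and that the first number in the brackets is the amount still to sync in MB; the helper names are my own.

```shell
#!/bin/sh
# eta_seconds REMAINING_MB SPEED_KB_PER_SEC
# Converts the remaining MB to KB and divides by the current speed.
eta_seconds() {
    echo $(( $1 * 1024 / $2 ))
}

# format_hms SECONDS -> H:MM:SS
format_hms() {
    printf '%d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Figures copied from the primary's output above: 129347 MB left at 11,260 K/sec.
format_hms "$(eta_seconds 129347 11260)"
```

This prints 3:16:02, within a few seconds of the 3:15:56 that DRBD reported on the primary.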

If you have a large underlying pool of storage, you’ll find that this will take a fair amount of time to complete. Once it’s done, we can try manually failing over between the two nodes.
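Rather than re-running cat by hand while you wait, you could poll /proc/drbd from a small script. The sketch below is written against the 0.7-style output shown above (the cs:, st: and ld: fields for device 0) and will need adjusting for other versions. The function names and the DRBD_PROC override are my own inventions so it can be tried against a saved copy of the file.

```shell
#!/bin/sh
# Where to read status from; override to point at a saved copy for testing.
DRBD_PROC="${DRBD_PROC:-/proc/drbd}"

# drbd_field NAME -> prints the value of the cs, st or ld field for device 0.
drbd_field() {
    sed -n "s/^ *0: .*$1:\([^ ]*\).*/\1/p" "$DRBD_PROC"
}

# Succeeds once the nodes are connected and the local data is consistent,
# i.e. no resync is in progress.
drbd_synced() {
    [ "$(drbd_field cs)" = "Connected" ] && [ "$(drbd_field ld)" = "Consistent" ]
}

# wait_for_sync [INTERVAL_SECONDS] -> blocks until drbd_synced succeeds.
wait_for_sync() {
    while ! drbd_synced; do
        echo "cs=$(drbd_field cs) st=$(drbd_field st) ld=$(drbd_field ld)"
        sleep "${1:-30}"
    done
    echo "sync complete"
}
```

Source the file and run "wait_for_sync 60" on either node; it prints the current state once a minute and returns when the resync has finished.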

Manual failover and testing

On the primary node, create a filesystem on the DRBD volume, mount it and create some files on it for testing :

mke2fs -j /dev/drbd0
mount /dev/drbd0 /mnt

Your replicated storage should now be mounted under /mnt. Copy some data into it, make some directories, unpack a few tarballs - and check to see how /proc/drbd updates on both nodes. Once you’re done, unmount the volume, and then demote the primary node to secondary :

umount /mnt
drbdadm secondary r0

Now, both nodes are marked as being "secondary". If we then log into "weasel", we can promote that node :

drbdadm primary r0

Check /proc/drbd to see that the nodes have in fact swapped over ("otter" should be the secondary now, and "weasel" should be the new primary). You can then mount the volume on weasel and see the data there from earlier.

That's it for DRBD - you have a functioning replicated pool of storage shared between both nodes. At this point, I'd recommend you take a breather, experiment with the two systems, read the manpages, and see what happens when you reboot each one. It's also an interesting experiment to see what happens when you try to mount the volume on a non-primary node - you should see that DRBD does the "right thing" in all of these cases, and prevents you from corrupting your data.

LVM

As we'll be using LVM for greater flexibility in defining, managing and backing up our replicated storage, a few words may be needed at this point regarding managing LVM on top of DRBD, as it introduces a few extra complexities. I assume you're familiar with LVM - if not, the LVM HOWTO is a very good place to start. To begin with, we'll set up the DRBD volume as a "physical volume" for use with LVM, so install the LVM tools :

apt-get install lvm2

We now need to tell the LVM subsystem to ignore the underlying device (/dev/md3) - otherwise on the primary node, LVM would see two identical physical volumes - one on the underlying disk, and the DRBD volume itself. Edit /etc/lvm/lvm.conf, and inside the "devices" section, add a couple of lines that read :

# Exclude DRBD underlying device
filter = [ "r|/dev/md3|" ]

Now, create an LVM "physical volume" on the DRBD device :

pvcreate /dev/drbd0

Then the volume group, called "storage" :

vgcreate storage /dev/drbd0

And then finally, a 10GB test volume called "test", which we'll then format and mount :

lvcreate -L10G -ntest storage
mke2fs -j /dev/storage/test
mount /dev/storage/test /mnt

And again, test the volume by creating a few files under /mnt. Now, when you want to fail over, you will need to add a couple of steps. First, on the primary node, unmount your logical volume, and change the volume group state to unavailable :

umount /mnt
vgchange -a n storage
  0 logical volume(s) in volume group "storage" now active

Now, demote the primary node to secondary :

drbdadm secondary r0

Now, you can promote the other node to primary, and you should be able to activate the volume group and logical volume :

drbdadm primary r0
vgchange -a y storage

And you should then be able to mount /dev/storage/test on the new primary node. Obviously, this becomes a major hassle each time you want to fail over between nodes, and doesn't provide any automatic monitoring or failover in the event of hardware failure on one node. We'll cover that next in Part 3, where we'll add the Heartbeat package from the Linux-HA project to provide automatic node failover and redundancy.
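Until then, the manual steps above can at least be wrapped up so they always run in the right order. This is a sketch rather than a finished tool: the function names and the DRY_RUN switch are my own, and the defaults simply mirror the names used in this article (r0, storage, test, /mnt). With DRY_RUN=1, the default, it only prints the commands it would run.

```shell
#!/bin/sh
# Defaults mirror the names used in this article; override via the environment.
RESOURCE="${RESOURCE:-r0}"
VG="${VG:-storage}"
LV="${LV:-test}"
MOUNTPOINT="${MOUNTPOINT:-/mnt}"
# Hypothetical safety switch: 1 = print commands only, 0 = really run them.
DRY_RUN="${DRY_RUN:-1}"

# Either echo the command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the current primary: unmount, deactivate the volume group, demote.
release_storage() {
    run umount "$MOUNTPOINT" &&
    run vgchange -a n "$VG" &&
    run drbdadm secondary "$RESOURCE"
}

# Run on the other node afterwards: promote, activate the volume group, mount.
take_over_storage() {
    run drbdadm primary "$RESOURCE" &&
    run vgchange -a y "$VG" &&
    run mount "/dev/$VG/$LV" "$MOUNTPOINT"
}
```

Source it on both nodes, call release_storage on the old primary and then take_over_storage on the new one, and set DRY_RUN=0 once you're happy with what it prints. Each step only runs if the previous one succeeded, so a busy mount point stops the failover before DRBD is demoted.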