Dual Primary DRBD + OCFS2
In this HOWTO I try to describe how to set up a dual-primary DRBD with the cluster filesystem OCFS2 so that it can be used as shared storage.
As I am very limited in time I will try to add parts as I find spare time. Please note that this is still a kind of experimental setup (at least for me). Please feel free to correct errors you find in this HOWTO or to mail me. Thanks for your careful review!
Architecture
In this setup the cluster filesystem (OCFS2) uses the distributed lock manager (DLM). Please see [1] for a sketch of how the lock manager is used by the various cluster storage solutions. As you will see from the figure, the setup only works with OpenAIS, not with heartbeat, as the cluster stack.
Base System
In this setup I use openSUSE 11.1 for the installation. Since the development of the cluster software is driven by SUSE/Novell, the latest developments show up in this distribution first. So this distribution provides all the working components my setup needs and I do not have to compile too much myself. Just do a base install of the software.
Please be sure that you end up with a fully patched system.
DRBD
Here comes the first problem. We need DRBD version 8.3.2 because of the included resource agent that can deal with two primaries. As far as I know there is no precompiled package of this version for openSUSE 11.1, so you will have to compile it yourself. Get the latest sources from linbit [2] and unpack them:
# tar xvzf drbd-8.3.2.tar.gz
For the compilation there is an excellent documentation from linbit [3]. I just want to describe what worked for me:
- Get the kernel sources, make, gcc, automake, autoconf and flex
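On openSUSE these prerequisites can be installed with zypper, for example as shown below; the package names are the usual openSUSE ones and may differ slightly on your system:
# zypper install kernel-source make gcc automake autoconf flex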
# cd drbd-8.3.2
# make clean all
# make install
After you have built and installed the DRBD kernel module you can start to configure the DRBD devices on the two servers. Please see linbit's documentation for this [4]. When you have got your Primary/Secondary configuration running you can start configuring the dual-primary mode. All necessary steps are described again in the docs of linbit [5]. Basically it means uncommenting the options allow-two-primaries and become-primary-on both in both config files.
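For orientation only, the relevant parts of the resource definition in drbd.conf could look roughly like this; the resource name is a placeholder and the rest of the configuration (disks, addresses, syncer options) is of course specific to your setup:
resource <your_resource_here> {
  startup {
    become-primary-on both;
  }
  net {
    allow-two-primaries;
  }
  ...
}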
Please test this setup manually! Please only go further when it really works.
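A minimal manual check, assuming the initial sync has already finished, is to promote the resource on both nodes and look at /proc/drbd:
# drbdadm primary <your_resource_here>
# cat /proc/drbd
After running the first command on both nodes, /proc/drbd should show the roles as Primary/Primary and the disk states as UpToDate/UpToDate.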
During your tests and setup you will run into a lot of split-brain situations, so you will want to consider optimizing the split-brain behavior of DRBD. See [6] for more information.
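As one possible starting point, the automatic recovery policies described in [6] go into the net section of the resource; the values below are just a conservative example and have to match your own data-safety requirements:
net {
  allow-two-primaries;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
}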
Software Repository
After the DRBD really works (does it really work?) you can start to install the cluster components. I used Lars' repository [7] but the openSUSE Build Service should do as well. With YaST add a new software repository by entering the URL http://download.opensuse.org/repositories/server:/ha-clustering/openSUSE_11.1 . From the new repository you can now install:
- pacemaker: The cluster resource manager.
- pacemaker-mgmt: Server that the GUI connects to.
- pacemaker-mgmt-client: Optional. Better install this on your laptop or use the command line.
- cluster-glue: Useful remains of the heartbeat project.
- libdlm2: Distributed Lock Manager
- ocfs2-tools: Userland tools for the cluster filesystem. The kernel modules are installed automatically.
- openais: The cluster communication stack.
- resource-agents: All other resource agents of the heartbeat project.
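If you prefer the command line over YaST, the repository and the packages can also be added with zypper; the repository alias ha-clustering is just an example, and the package names are the ones listed above (adjust them if your repository names them differently):
# zypper ar http://download.opensuse.org/repositories/server:/ha-clustering/openSUSE_11.1 ha-clustering
# zypper install pacemaker pacemaker-mgmt cluster-glue libdlm2 ocfs2-tools openais resource-agents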
Setup the Cluster
Well, if you really want to set up a dual-primary DRBD with a cluster filesystem you should have some experience in building clusters, so this step should be no problem for you. But you can also have a look at http://www.clusterlabs.org/wiki/Initial_Configuration#OpenAIS.
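As a rough orientation only, the important parts of the OpenAIS configuration (typically /etc/ais/openais.conf on openSUSE/SLES) look something like the sketch below; bindnetaddr, mcastaddr and mcastport are placeholders you have to adapt to your network, and the service section starts Pacemaker together with the management daemon for the GUI:
totem {
  version: 2
  secauth: off
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
    mcastaddr: 226.94.1.1
    mcastport: 5405
  }
}
service {
  ver: 0
  name: pacemaker
  use_mgmtd: yes
}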
Configure the Resources
Now for the real challenge. Most of this section is described in the HAE documentation for SLES 11 [8].
DRBD
First you need the dual primary DRBD. In the crm notation the resource would be:
crm(live)configure# primitive resDRBD ocf:linbit:drbd \
crm(live)configure# params drbd_resource="<your_resource_here>" \
crm(live)configure# operations $id="resDRBD-operations" \
crm(live)configure# op monitor interval="20" role="Master" timeout="20" \
crm(live)configure# op monitor interval="30" role="Slave" timeout="20"
crm(live)configure# ms msDRBD resDRBD \
crm(live)configure# meta resource-stickiness="100" notify="true" master-max="2" interleave="true"
Please note that the monitor operation of the primitive resource is not only nice to have but essential to make the setup work. Otherwise the second instance will not get promoted to Master after a failure.
The Distributed Lock Manager
The DLM is used by the cluster filesystem to lock files in use throughout the cluster. So this is the next essential building block in our setup.
crm(live)configure# primitive resDLM ocf:pacemaker:controld op monitor interval="120s"
crm(live)configure# clone cloneDLM resDLM meta globally-unique="false" interleave="true"
I also create two constraints that tie the DLM to an active DRBD instance:
crm(live)configure# colocation colDLMDRBD inf: cloneDLM msDRBD:Master
crm(live)configure# order ordDRBDDLM 0: msDRBD:promote cloneDLM
The o2cb Service
The next step is to configure the o2cb service of OCFS2. Again, two constraints are created that tie this service to the nodes where the DLM is running.
crm(live)configure# primitive resO2CB ocf:ocfs2:o2cb op monitor interval="120s"
crm(live)configure# clone cloneO2CB resO2CB meta globally-unique="false" interleave="true"
crm(live)configure# colocation colO2CBDLM inf: cloneO2CB cloneDLM
crm(live)configure# order ordDLMO2CB 0: cloneDLM cloneO2CB
The Filesystem
The last element is the filesystem resource which mounts the block devices so applications can use the data. First you need to format the devices. A quick
# mkfs.ocfs2 /dev/<your_drbd_device_here>
does this job. Of course this command only needs to be entered on one node, since the changes are automatically replicated to the second DRBD device. Now the resource can be created in the cluster. I also use constraints to place the filesystem on nodes where the o2cb service is running.
crm(live)configure# primitive resFS ocf:heartbeat:Filesystem \
crm(live)configure# params device="/dev/<your_drbd_device_here>" directory="<your_mountpoint_here>" fstype="ocfs2" \
crm(live)configure# op monitor interval="120s"
crm(live)configure# clone cloneFS resFS meta interleave="true" ordered="true"
crm(live)configure# colocation colFSO2CB inf: cloneFS cloneO2CB
crm(live)configure# order ordO2CBFS 0: cloneO2CB cloneFS
Please note that you have to insert the correct values for your DRBD device and your mount point.
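Once the clone is running you can quickly check on both nodes that the filesystem is really mounted; the mount point is the one you configured above:
# mount | grep ocfs2
# df -h <your_mountpoint_here>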
Tests
The output of crm_mon should now look like:
Online: [ suse1 suse2 ]
Master/Slave Set: msDRBD
    Masters: [ suse1 suse2 ]
Clone Set: cloneDLM
    Started: [ suse2 suse1 ]
Clone Set: cloneO2CB
    Started: [ suse2 suse1 ]
Clone Set: cloneFS
    Started: [ suse1 suse2 ]
You really have to test this setup if you want to use it in a production environment. First of all, be sure that your STONITH system works. Shared data without a reliable fencing system tends to get messed up.
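If you only need a placeholder while experimenting in a lab (never for production data!), a simple ssh-based STONITH clone could look roughly like this; the external/ssh plugin and its hostlist parameter come with cluster-glue, and suse1/suse2 are the example node names used above:
crm(live)configure# primitive resSTONITH stonith:external/ssh params hostlist="suse1 suse2"
crm(live)configure# clone cloneSTONITH resSTONITH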
As a first test, set one node to standby and bring it online again. All resources should be stopped and started again. If this simple test does not work, go and search for the real cause of the problem.
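With the crm shell this can be done for example like this, suse1 being one of the example nodes from above:
# crm node standby suse1
# crm node online suse1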
Then you could test your setup by concurrent write access to the same data on the clustered filesystem.
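A simple sketch of such a test, assuming the example mount point from above, is to append to the same file from both nodes at the same time and then to check that all lines from both nodes really ended up in the file:
suse1:# for i in $(seq 1 100); do echo "suse1 $i" >> <your_mountpoint_here>/test.txt; done
suse2:# for i in $(seq 1 100); do echo "suse2 $i" >> <your_mountpoint_here>/test.txt; done
# wc -l <your_mountpoint_here>/test.txt
The file should contain all 200 lines when read from either node.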
Test the operation of your cluster by pulling plugs or killing processes on the nodes. In short, test everything that can happen in real life. Do not only test realistic scenarios but also esoteric causes of errors. If your cluster does not survive: Improve the configuration! Please also mail me about improvements of the setup, so I can keep this HOWTO up to date.
Have fun!
Michael Schwartzkopff