Nagios3 on Pacemaker DRBD

What is this about

This howto covers configuring and customizing Nagios to fit into a Pacemaker/Corosync/DRBD active/passive cluster. I ran into some problems while trying to use Nagios on that setup, and I share my experiences here.

I used Pacemaker version 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75 and Nagios 3.0.6 with a compiled DRBD version 8.3.7 (api:88) on a basic Debian Lenny.

I assume that you already have Pacemaker up and running with a shared cluster IP and DRBD/filesystem constraints.

Making the Nagios3 Init Script LSB Compatible

The start script fails at test no. 5 (the status check on a stopped service) as described in "Is This init Script LSB Compatible?": it returns code 6 instead of the expected code 3. The reason is that Nagios doesn't seem to delete its PID file, which causes the script to return an error. The solution is simple: delete the PID file when stopping Nagios.

This is how the new stop section should look (I just added the rm -f $THEPIDFILE command).

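 stop () {
   killproc -p $THEPIDFILE
   ret=$?    # remember killproc's exit status for the return below
   if [ `pidof nagios3 | wc -l ` -gt 0 ]; then
       echo -n "Waiting for $NAME daemon to die.."
       cnt=0
       while [ `pidof nagios3 | wc -l ` -gt 0 ]; do
           cnt=`expr "$cnt" + 1`
           # after 15 seconds, stop waiting and kill the daemon hard
           if [ "$cnt" -gt 15 ]; then
               kill -9 `pidof nagios3`
               break
           fi
           sleep 1
           echo -n "."
       done
   fi
   # added: remove the stale PID file so that "status" reports a stopped service
   rm -f $THEPIDFILE
   if ! check_named_pipe; then
     rm -f $nagiospipe
   fi
   if [ -n "$ret" ]; then
     return $ret
   else
     return $?
   fi
 }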


Now our Init Script is prepared.
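
To confirm the fix, the stopped-service check can be repeated by hand. The following is only a minimal sketch: the function name lsb_check_stopped is hypothetical and not part of the Nagios package, and it relies only on the LSB rule that "status" returns 3 for a service that is not running.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original howto): stop the service,
# then verify that "status" returns 3, as LSB requires for a stopped service.
lsb_check_stopped() {
    script="$1"
    "$script" stop >/dev/null 2>&1
    "$script" status >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 3 ]; then
        echo "OK: status returned 3 for a stopped service"
        return 0
    fi
    echo "FAIL: status returned $rc, expected 3"
    return 1
}

# Usage (as root, on a node where Nagios may be stopped):
#   lsb_check_stopped /etc/init.d/nagios3
```

Note that the helper really stops the service, so it should only be run on a node where that is acceptable.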

Preparing the Config Files and Directories

We need to deploy our Nagios configs on all nodes and link the Nagios folders to our shared storage (/mnt/cluster). On the passive node we also copy the configs into /mnt/cluster, because Nagios looks for its configs even when it is not running. For example, when Pacemaker issues a status command on the passive node, the command fails if the symlink points to a nonexistent folder.

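 # stop Nagios first so that retention.dat is not rewritten during the copy
 /etc/init.d/nagios3 stop
 # make sure the parent directories exist below the mount point
 mkdir -p /mnt/cluster/etc /mnt/cluster/var/lib
 cp -pRv /etc/nagios3 /mnt/cluster/etc/nagios3
 cp -pRv /var/lib/nagios3 /mnt/cluster/var/lib/nagios3
 cd /etc
 mv nagios3 nagios3_bak
 ln -s /mnt/cluster/etc/nagios3 /etc/nagios3
 cd /var/lib
 mv nagios3 nagios3_bak
 ln -s /mnt/cluster/var/lib/nagios3 /var/lib/nagios3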

Now the folder structure should look like this:

ll /etc/nagios3* /var/lib/nagios3*
lrwxrwxrwx 1 root   root    25 23. Jun 13:54 /etc/nagios3 -> /mnt/cluster/etc/nagios3/
lrwxrwxrwx 1 root   root    29 23. Jun 14:04 /var/lib/nagios3 -> /mnt/cluster/var/lib/nagios3/

total 88K
drwxr-xr-x  4 root root    146 23. Jun 13:54 .
drwxr-xr-x 75 root root   4,0K 23. Jun 15:32 ..
-rw-r--r--  1 root root   1,9K 30. Jun 2009  apache2.conf
-rw-r--r--  1 root root    11K 23. Jun 13:49 cgi.cfg
-rw-r--r--  1 root root   2,4K  2. Jul 2009  commands.cfg
drwxr-xr-x  2 root root   4,0K  7. Jun 19:16 conf.d
-rw-r--r--  1 root root     20 23. Jun 13:49 htpasswd.users
-rw-r--r--  1 root root    42K  2. Jul 2009  nagios.cfg
-rw-r-----  1 root nagios 1,3K 30. Jun 2009  resource.cfg
drwxr-xr-x  2 root root   4,0K  7. Jun 19:16 stylesheets

total 20K
drwxr-x---  4 nagios nagios     47 23. Jun 14:02 .
drwxr-xr-x 33 root   root     4,0K 23. Jun 14:04 ..
-rw-------  1 nagios www-data  14K 23. Jun 14:02 retention.dat
drwx------  2 nagios www-data    6  2. Jul 2009  rw
drwxr-x---  3 nagios nagios     25  7. Jun 19:16 spool

Now try to start Nagios on each node. If it does not fail, we can proceed.
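
If the start fails on one node, a dangling symlink is a likely cause. A small check can be sketched as follows; check_link is a hypothetical helper name, and the paths in the usage comment are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path is a symlink whose target exists.
check_link() {
    path="$1"
    if [ ! -L "$path" ]; then
        echo "$path: not a symlink"
        return 1
    fi
    # -e follows the link, so it is false for a dangling symlink
    if [ ! -e "$path" ]; then
        echo "$path: dangling, target $(readlink "$path") is missing"
        return 1
    fi
    echo "$path: ok -> $(readlink "$path")"
}

# Usage:
#   check_link /etc/nagios3
#   check_link /var/lib/nagios3
```

Running it on the passive node, where the DRBD device is not mounted, shows whether the local copies below /mnt/cluster are in place.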

Configuring the Resource

This is easy and straightforward. Open the cluster configuration in the editor and add the primitive:

 crm configure edit
primitive res_Nagios lsb:nagios3 \
        operations $id="res_Nagios-operations" \
        op monitor interval="15s" timeout="20s"
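
The primitive on its own does not tie Nagios to the node where /mnt/cluster is mounted. One possible way to express that, assuming your existing filesystem resource has an ID such as res_Filesystem (a hypothetical name, as are the constraint IDs col_Nagios_fs and ord_fs_Nagios), is a colocation plus an order constraint. The sketch below only prints the crm commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) crm commands that keep res_Nagios on
# the node holding the shared filesystem and start it after the filesystem.
# $1 = ID of your existing filesystem resource (an assumption, adjust it).
print_nagios_constraints() {
    fs_rsc="$1"
    echo "crm configure colocation col_Nagios_fs inf: res_Nagios $fs_rsc"
    echo "crm configure order ord_fs_Nagios inf: $fs_rsc res_Nagios"
}

# Usage: review the output, then run the printed commands by hand.
#   print_nagios_constraints res_Filesystem
```

If your filesystem and cluster IP are already in a resource group, adding res_Nagios to that group may be a simpler alternative.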

Now check the configuration

 crm_verify -LV

and watch the syslog

 tail -fn 1000 /var/log/syslog | egrep 'res_Nagios|ERROR|WARN'

and you shouldn't see any errors.