Nagios3 on Pacemaker DRBD

What is this about
This howto covers configuring and customizing Nagios so that it fits into an active/passive Pacemaker/Corosync/DRBD cluster. I ran into some problems while setting Nagios up this way, and I will share my experiences here.

I used Pacemaker 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75, Nagios 3.0.6 and a self-compiled DRBD 8.3.7 (api:88) on a basic Debian Lenny installation.

I assume that you already have Pacemaker up and running with a shared cluster IP and DRBD/filesystem constraints.

Making the Nagios3 Init Script LSB Compatible
The stock init script fails test no. 5 described in "Is This init Script LSB Compatible?" (status on a stopped service): it returns exit code 6 instead of the expected 3. The reason is that Nagios apparently does not delete its PID file when it stops, and the leftover file makes the script report an error. The solution is simple: delete the PID file when stopping Nagios.

This is how the new stop function should look (I only added the rm command that deletes $THEPIDFILE):

stop () {
    killproc -p $THEPIDFILE
    ret=$?
    # Wait up to about 15 seconds for nagios3 to exit, then kill it hard
    if [ `pidof nagios3 | wc -l` -gt 0 ]; then
        echo -n "Waiting for $NAME daemon to die.."
        cnt=0
        while [ `pidof nagios3 | wc -l` -gt 0 ]; do
            cnt=`expr "$cnt" + 1`
            if [ "$cnt" -gt 15 ]; then
                kill -9 `pidof nagios3`
                break
            fi
            sleep 1
            echo -n "."
        done
    fi
    # Added: remove the stale PID file so that "status" afterwards reports "not running"
    rm -f "$THEPIDFILE"
    echo
    if ! check_named_pipe; then
        rm -f $nagiospipe
    fi
    if [ -n "$ret" ]; then
        return $ret
    else
        return $?
    fi
}

Now our Init Script is prepared.
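To rerun the relevant checks after patching the script, the test sequence from the LSB compatibility guide can be scripted. This is a minimal sketch, assuming a POSIX shell, root privileges, and that Pacemaker is not managing Nagios at that moment (for example before the resource is configured), because it really starts and stops the service. The helpers `check_step` and `lsb_check` are hypothetical names, and test 7 (status of a failed service) is left out since it requires killing the daemon by hand.

```shell
#!/bin/sh
# Sketch of LSB compliance checks 1-6 for an init script.
# check_step and lsb_check are hypothetical helpers, not part of any package.
# WARNING: this really starts and stops the service.

# check_step DESCRIPTION EXPECTED_EXIT_CODE COMMAND [ARGS...]
# Runs COMMAND silently and compares its exit code with the expected one.
check_step() {
    desc=$1
    expected=$2
    shift 2
    actual=0
    "$@" >/dev/null 2>&1 || actual=$?
    if [ "$actual" -eq "$expected" ]; then
        echo "PASS: $desc (exit $actual)"
        return 0
    fi
    echo "FAIL: $desc (expected $expected, got $actual)"
    return 1
}

# lsb_check INITSCRIPT
# Returns the number of failed checks (0 means all passed).
lsb_check() {
    script=$1
    failed=0
    # Begin from a stopped service; the result of this call is not checked.
    "$script" stop >/dev/null 2>&1 || true
    check_step "1. start (stopped)"  0 "$script" start  || failed=$((failed + 1))
    check_step "2. status (running)" 0 "$script" status || failed=$((failed + 1))
    check_step "3. start (running)"  0 "$script" start  || failed=$((failed + 1))
    check_step "4. stop (running)"   0 "$script" stop   || failed=$((failed + 1))
    check_step "5. status (stopped)" 3 "$script" status || failed=$((failed + 1))
    check_step "6. stop (stopped)"   0 "$script" stop   || failed=$((failed + 1))
    return "$failed"
}
```

For example, source the file and run `lsb_check /etc/init.d/nagios3; echo "failed checks: $?"`. With the patched stop function, step 5 is the one expected to change from FAIL to PASS.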

Preparing the Config Files and Directories
We need to deploy the Nagios configuration on all nodes and replace the Nagios directories with symlinks to the shared storage (/mnt/cluster). On the passive node we also copy the configs into /mnt/cluster; since the DRBD filesystem is not mounted there, the copies end up on the node's local disk underneath the mount point. This is necessary because Nagios looks for its configuration even when it is not running: for example, when Pacemaker issues a status command on the passive node, the command fails if the symlink points to a non-existent directory.

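# Stop Nagios first so the state files are not copied while they are being written
/etc/init.d/nagios3 stop
mkdir -p /mnt/cluster/etc /mnt/cluster/var/lib
cp -pRv /etc/nagios3 /mnt/cluster/etc/nagios3
cp -pRv /var/lib/nagios3 /mnt/cluster/var/lib/nagios3

# Keep the original directories as backups and replace them with symlinks
mv /etc/nagios3 /etc/nagios3_bak
ln -s /mnt/cluster/etc/nagios3 /etc/nagios3
mv /var/lib/nagios3 /var/lib/nagios3_bak
ln -s /mnt/cluster/var/lib/nagios3 /var/lib/nagios3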

Now the directory structure should look like this:

ll /etc/nagios3* /var/lib/nagios3*
lrwxrwxrwx 1 root root 25 23. Jun 13:54 /etc/nagios3 -> /mnt/cluster/etc/nagios3/
lrwxrwxrwx 1 root root 29 23. Jun 14:04 /var/lib/nagios3 -> /mnt/cluster/var/lib/nagios3/

/etc/nagios3_bak:
total 88K
drwxr-xr-x  4 root root    146 23. Jun 13:54 .
drwxr-xr-x 75 root root   4,0K 23. Jun 15:32 ..
-rw-r--r--  1 root root   1,9K 30. Jun 2009  apache2.conf
-rw-r--r--  1 root root    11K 23. Jun 13:49 cgi.cfg
-rw-r--r--  1 root root   2,4K  2. Jul 2009  commands.cfg
drwxr-xr-x  2 root root   4,0K  7. Jun 19:16 conf.d
-rw-r--r--  1 root root     20 23. Jun 13:49 htpasswd.users
-rw-r--r--  1 root root    42K  2. Jul 2009  nagios.cfg
-rw-r-----  1 root nagios 1,3K 30. Jun 2009  resource.cfg
drwxr-xr-x  2 root root   4,0K  7. Jun 19:16 stylesheets

/var/lib/nagios3_bak:
total 20K
drwxr-x---  4 nagios nagios     47 23. Jun 14:02 .
drwxr-xr-x 33 root   root     4,0K 23. Jun 14:04 ..
-rw-------  1 nagios www-data  14K 23. Jun 14:02 retention.dat
drwx------  2 nagios www-data    6  2. Jul 2009  rw
drwxr-x---  3 nagios nagios     25  7. Jun 19:16 spool
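Because a dangling symlink is exactly what makes the status check fail on the passive node, it can help to verify both links on every node before going on. This is a minimal sketch, assuming GNU coreutils `readlink`; `check_link` is a hypothetical helper name.

```shell
#!/bin/sh
# check_link LINK EXPECTED_TARGET
# Hypothetical helper: succeeds only if LINK is a symlink to EXPECTED_TARGET
# (a trailing slash on either path is ignored) and the target directory exists.
check_link() {
    link=$1
    expected=$2
    if [ ! -L "$link" ]; then
        echo "FAIL: $link is not a symlink"
        return 1
    fi
    target=$(readlink "$link")
    if [ "${target%/}" != "${expected%/}" ]; then
        echo "FAIL: $link points to $target, expected $expected"
        return 1
    fi
    # [ -d ] follows the symlink, so a dangling link fails here.
    if [ ! -d "$link" ]; then
        echo "FAIL: $link -> $target does not exist on this node"
        return 1
    fi
    echo "OK: $link -> $target"
}
```

On each node you would then run `check_link /etc/nagios3 /mnt/cluster/etc/nagios3` and `check_link /var/lib/nagios3 /mnt/cluster/var/lib/nagios3`; both should print OK.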

Now try to start Nagios on each node; if it starts without errors, we can proceed.

Configuring the Resource
This is easy and straightforward.

crm configure edit

primitive res_Nagios lsb:nagios3 \
    operations $id="res_Nagios-operations" \
    op monitor interval="15s" timeout="20s"
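The primitive on its own does not tie Nagios to the node that holds the DRBD filesystem. If your existing configuration does not already take care of this, a colocation and an order constraint along the following lines can be added. This is a sketch: `res_Filesystem` is a placeholder for the name of your existing Filesystem primitive (or the group containing it), and the constraint IDs are made up.

```shell
# Hypothetical names: replace res_Filesystem with your own Filesystem resource.
# Run Nagios only on the node where the shared filesystem is mounted ...
crm configure colocation col_Nagios_with_Filesystem inf: res_Nagios res_Filesystem
# ... and start it only after the filesystem has been mounted.
crm configure order ord_Filesystem_before_Nagios inf: res_Filesystem res_Nagios
```

The same two lines, without the `crm configure` prefix, can also be entered in the editor opened by `crm configure edit`.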

Now just issue

crm_verify -LV

and

tail -fn 1000 /var/log/syslog | egrep 'res_Nagios|ERROR|WARN'

and you shouldn't see any errors.
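To see at a glance which node currently runs Nagios, the one-shot output of `crm_mon -1` can be filtered. This is a minimal sketch, assuming the resource line looks like `res_Nagios (lsb:nagios3): Started node1`; other Pacemaker versions may format this line differently, and `nagios_node` is a hypothetical helper name.

```shell
#!/bin/sh
# nagios_node: read crm_mon output on stdin and print the node running res_Nagios.
# Assumes a resource line of the form "res_Nagios (lsb:nagios3): Started NODE".
# Exits with status 1 if no "Started" entry for res_Nagios is found.
nagios_node() {
    awk '$1 == "res_Nagios" {
        for (i = 2; i < NF; i++)
            if ($i == "Started") { print $(i + 1); found = 1 }
    }
    END { if (!found) exit 1 }'
}
```

Usage: `crm_mon -1 | nagios_node`.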