Nagios3 on Pacemaker DRBD
What is this about
This is about configuring and customizing Nagios to fit into a Pacemaker/Corosync/DRBD active/passive cluster. I ran into some problems while using Nagios in that setup, and I will share my experiences in this howto.
I used Pacemaker version 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75 and Nagios 3.0.6 with a compiled DRBD 8.3.7 (api:88) on a basic Debian Lenny install.
I assume that you already have Pacemaker up and running with a shared cluster IP and DRBD/filesystem constraints.
Making the Nagios3 Init Script LSB Compatible
The init script fails test no. 5 (status on a stopped service, which must return 3) described in "Is This init Script LSB Compatible?": it returns exit code 6 instead of 3. The reason is that Nagios does not seem to delete its PID file, which causes the script to return an error. The solution is simple: delete the PID file when stopping Nagios.
This is how the new stop function should look (I only added the rm -f $THEPIDFILE line after the wait loop).
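For context, the LSB specification defines the exit codes of an init script's status action: 0 means running, 1 means the program is dead but a PID file exists, and 3 means not running. The sketch below uses a hypothetical helper, lsb_pidfile_status, which is not part of the Debian nagios3 script; it only illustrates how a PID-file based status check typically arrives at those codes, and why a leftover PID file prevents a clean 3 after a stop. The Debian script's own logic may differ (the code observed there was 6).

```shell
# Hypothetical helper illustrating LSB "status" exit codes from a PID file.
# Not the Debian nagios3 implementation.
#   0 = program is running
#   1 = program is dead, but the PID file still exists
#   3 = program is not running (no PID file)
lsb_pidfile_status() {
    pidfile=$1
    if [ ! -f "$pidfile" ]; then
        return 3                      # clean stop: no PID file left behind
    fi
    pid=$(cat "$pidfile")
    # kill -0 sends no signal, it only tests whether the PID exists
    # (run as root, as init scripts are, to avoid permission errors)
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return 0                      # process is alive
    fi
    return 1                          # stale PID file: dead but file exists
}

# Usage: lsb_pidfile_status "$THEPIDFILE"; echo $?
```

With this convention, the status check can only report 3 once stop has removed the PID file.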
stop () {
    killproc -p $THEPIDFILE
    ret=$?
    if [ `pidof nagios3 | wc -l ` -gt 0 ]; then
        echo -n "Waiting for $NAME daemon to die.."
        cnt=0
        while [ `pidof nagios3 | wc -l ` -gt 0 ]; do
            cnt=`expr "$cnt" + 1`
            if [ "$cnt" -gt 15 ]; then
                kill -9 `pidof nagios3`
                break
            fi
            sleep 1
            echo -n "."
        done
    fi
    # added: remove the leftover PID file so "status" reports "not running" (3)
    rm -f $THEPIDFILE
    echo
    if ! check_named_pipe; then
        rm -f $nagiospipe
    fi
    if [ -n "$ret" ]; then
        return $ret
    else
        return $?
    fi
}
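To re-run test 5 after patching the script, something like the following sketch can be used. lsb_test_stopped_status is a hypothetical helper written for this howto; it stops the service and then checks that status returns 3, so only run it on a node where Nagios is not yet managed by Pacemaker.

```shell
# Hypothetical helper: repeat the "status while stopped" check (test 5)
# against any init script path.
lsb_test_stopped_status() {
    script=$1
    # make sure the service is stopped; a failing stop is tolerated here
    # because the status result below is what we actually test
    "$script" stop >/dev/null 2>&1 || true
    rc=0
    "$script" status >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 3 ]; then
        echo "PASS: status after stop returned 3"
        return 0
    fi
    echo "FAIL: status after stop returned $rc (expected 3)"
    return 1
}

# Usage: lsb_test_stopped_status /etc/init.d/nagios3
```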
Now our init script is prepared.
Preparing the Config Files and Directories
We need to deploy our Nagios configs on all nodes and link the Nagios directories to our shared storage (/mnt/cluster). We also copy the configs into /mnt/cluster on the passive node. There the DRBD filesystem is not mounted, so the copy ends up in the local mount-point directory. This is needed because Nagios looks for its configs even when it is not running: for example, when Pacemaker issues a status command on the passive node, it would fail because the symlink would point to a non-existent directory.
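# create the parent directories first; cp -R does not create them
mkdir -p /mnt/cluster/etc /mnt/cluster/var/lib
cp -pRv /etc/nagios3/ /mnt/cluster/etc/nagios3
cp -pRv /var/lib/nagios3 /mnt/cluster/var/lib/nagios3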
/etc/init.d/nagios3 stop
cd /etc
mv nagios3 nagios3_bak
ln -s /mnt/cluster/etc/nagios3 /etc/nagios3
cd /var/lib
mv nagios3 nagios3_bak
ln -s /mnt/cluster/var/lib/nagios3 /var/lib/nagios3
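The same copy, backup and symlink steps can be wrapped in a small function so they are easy to repeat on every node. This is a minimal sketch: relocate_to_shared is a hypothetical name, and it assumes Nagios has already been stopped and that the shared root is /mnt/cluster as above. On the passive node, the copy lands in the local mount-point directory, which is the fallback copy described earlier.

```shell
# Hypothetical helper: move a local directory onto shared storage and leave
# a symlink behind, mirroring the manual cp/mv/ln steps above.
#   $1 = local directory, e.g. /etc/nagios3
#   $2 = shared root,     e.g. /mnt/cluster
relocate_to_shared() {
    src=$1
    shared=$2
    target="$shared$src"                  # /etc/nagios3 -> /mnt/cluster/etc/nagios3
    if [ -L "$src" ]; then
        echo "$src is already a symlink, skipping"
        return 0
    fi
    if [ -e "${src}_bak" ]; then
        echo "${src}_bak already exists, refusing to overwrite" >&2
        return 1
    fi
    mkdir -p "$(dirname "$target")"       # make sure the parent exists
    if [ ! -e "$target" ]; then
        cp -pR "$src" "$target"           # copy, preserving mode/owner/times
    fi
    mv "$src" "${src}_bak"                # keep the original as a backup
    ln -s "$target" "$src"
}

# Usage (with Nagios stopped):
#   relocate_to_shared /etc/nagios3 /mnt/cluster
#   relocate_to_shared /var/lib/nagios3 /mnt/cluster
```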
The directory structure should now look like this:
ll /etc/nagios3* /var/lib/nagios3*
lrwxrwxrwx 1 root root 25 23. Jun 13:54 /etc/nagios3 -> /mnt/cluster/etc/nagios3/
lrwxrwxrwx 1 root root 29 23. Jun 14:04 /var/lib/nagios3 -> /mnt/cluster/var/lib/nagios3/

/etc/nagios3_bak:
total 88K
drwxr-xr-x 4 root root 146 23. Jun 13:54 .
drwxr-xr-x 75 root root 4,0K 23. Jun 15:32 ..
-rw-r--r-- 1 root root 1,9K 30. Jun 2009 apache2.conf
-rw-r--r-- 1 root root 11K 23. Jun 13:49 cgi.cfg
-rw-r--r-- 1 root root 2,4K 2. Jul 2009 commands.cfg
drwxr-xr-x 2 root root 4,0K 7. Jun 19:16 conf.d
-rw-r--r-- 1 root root 20 23. Jun 13:49 htpasswd.users
-rw-r--r-- 1 root root 42K 2. Jul 2009 nagios.cfg
-rw-r----- 1 root nagios 1,3K 30. Jun 2009 resource.cfg
drwxr-xr-x 2 root root 4,0K 7. Jun 19:16 stylesheets

/var/lib/nagios3_bak:
total 20K
drwxr-x--- 4 nagios nagios 47 23. Jun 14:02 .
drwxr-xr-x 33 root root 4,0K 23. Jun 14:04 ..
-rw------- 1 nagios www-data 14K 23. Jun 14:02 retention.dat
drwx------ 2 nagios www-data 6 2. Jul 2009 rw
drwxr-x--- 3 nagios nagios 25 7. Jun 19:16 spool
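Instead of reading the listing by eye, a quick check can confirm that both paths are symlinks into the shared root and that their targets exist on the current node (the mounted DRBD filesystem on the active node, the local copy on the passive one). check_shared_links is a hypothetical helper, shown here as a sketch.

```shell
# Hypothetical check: each given path must be a symlink whose target lies
# under the shared root and resolves to an existing directory.
#   $1 = shared root, remaining arguments = paths to check
check_shared_links() {
    shared=$1; shift
    rc=0
    for link in "$@"; do
        if [ ! -L "$link" ]; then
            echo "NOT A SYMLINK: $link"; rc=1; continue
        fi
        dest=$(readlink "$link")          # raw link target, not resolved
        case "$dest" in
            "$shared"/*) ;;
            *) echo "WRONG TARGET: $link -> $dest"; rc=1; continue ;;
        esac
        if [ ! -d "$link" ]; then         # -d follows the symlink
            echo "DANGLING: $link -> $dest"; rc=1; continue
        fi
        echo "OK: $link -> $dest"
    done
    return $rc
}

# Usage: check_shared_links /mnt/cluster /etc/nagios3 /var/lib/nagios3
```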
Now try starting Nagios on each node; if it starts without errors, we can proceed.
Configuring the Resource
This is easy and straightforward.
crm configure edit
primitive res_Nagios lsb:nagios3 \
    operations $id="res_Nagios-operations" \
    op monitor interval="15s" timeout="20s"
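The primitive alone does not tie Nagios to the node that holds the DRBD filesystem. Assuming your existing filesystem and cluster IP primitives are called res_Filesystem and res_ClusterIP (hypothetical names; substitute your own), constraints along these lines, in the same crm configure syntax, would keep Nagios on the same node and start it only after the filesystem is mounted. This is a sketch, not a tested configuration:

```
# hypothetical resource names: res_Filesystem, res_ClusterIP
colocation col_Nagios_with_Filesystem inf: res_Nagios res_Filesystem
colocation col_Nagios_with_ClusterIP inf: res_Nagios res_ClusterIP
order ord_Filesystem_before_Nagios inf: res_Filesystem res_Nagios
```

If your filesystem and IP are already in a group, appending res_Nagios to that group achieves the same colocation and ordering.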
Now run
crm_verify -LV
and watch the log with
tail -fn 1000 /var/log/syslog | egrep 'res_Nagios|ERROR|WARN'
Neither should report any errors.
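For a one-shot check instead of following the log, a small sketch like the one below scans the last lines of a log for ERROR or WARN entries and returns non-zero if it finds any. check_cluster_log is a hypothetical helper; crm_mon -1 additionally prints the current cluster status once, including whether res_Nagios is started.

```shell
# Hypothetical one-shot variant of the tail/egrep check above.
#   $1 = log file (e.g. /var/log/syslog), $2 = lines to scan (default 1000)
check_cluster_log() {
    log=$1
    lines=${2:-1000}
    # "|| true" keeps a no-match grep (exit 1) from aborting under set -e
    hits=$(tail -n "$lines" "$log" | grep -E 'ERROR|WARN' || true)
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"             # show the offending lines
        return 1
    fi
    echo "no ERROR/WARN lines in the last $lines lines of $log"
    return 0
}

# Usage: check_cluster_log /var/log/syslog 1000 && crm_mon -1
```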