Debugging Resource Failures

From ClusterLabs

If you have a resource that fails to start, and there's nothing obvious in the logs (look for "lrmd", "LRM operation", etc.), you can try starting it manually to diagnose the problem further. The same approach works for failed stop and monitor ops. Here's how:

  • Unmanage the resource, so Pacemaker won't try to do anything with it:
# crm resource unmanage <RESOURCE>
  • Configure the environment:
# export OCF_ROOT=/usr/lib/ocf
# export OCF_RESKEY_<param>=<value>
# ... (likewise for all other resource parameters, run
       "crm configure show <RESOURCE>" to verify what
       params you need to set here)
  • Run the op:
# /usr/lib/ocf/resource.d/heartbeat/<RA> start ; echo $?
  • Look for helpful error messages, and check the return code (0 means success; a non-zero value is one of the OCF return codes, e.g. 7 for "not running")
  • If that doesn't help, try using sh -x or bash -x to see exactly what the RA is doing. Do a stop first just in case, then try the start again:
# /usr/lib/ocf/resource.d/heartbeat/<RA> stop
# sh -x /usr/lib/ocf/resource.d/heartbeat/<RA> start ; echo $?
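
The manual steps above can be wrapped in a small helper so that the environment is set up the same way on every attempt. This is a minimal sketch, not part of resource-agents: the function names run_ra_op and ocf_rc_name are hypothetical, and the numeric-to-name mapping assumes the standard OCF return codes.

```shell
#!/bin/sh
# Sketch only: run_ra_op and ocf_rc_name are hypothetical helper names.

# Translate an OCF return code into its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "other ($1)" ;;
    esac
}

# Run one op of an agent by hand and report the result.
# Usage: run_ra_op <path-to-RA> <op> [param=value ...]
run_ra_op() {
    agent=$1 op=$2
    shift 2
    (
        # The subshell keeps the exported variables out of the calling shell.
        export OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}"
        for kv in "$@"; do
            # "ip=10.0.0.1" becomes OCF_RESKEY_ip=10.0.0.1
            export "OCF_RESKEY_${kv%%=*}=${kv#*=}"
        done
        "$agent" "$op"
    )
    rc=$?
    echo "$op returned $rc ($(ocf_rc_name "$rc"))"
    return "$rc"
}
```

It would be called as run_ra_op /usr/lib/ocf/resource.d/heartbeat/<RA> start <param>=<value>, with one <param>=<value> pair for each parameter shown by "crm configure show <RESOURCE>".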

Note that for the standard agents shipped with resource-agents, the designated way to enable tracing is to set

 # export OCF_TRACE_RA=1

before running the agent directly.
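
With the sh -x approach, the trace is written to standard error and can be long enough to scroll the interesting part off the screen. A minimal sketch for keeping it in a file follows; the function name trace_ra_op and the trace file path are hypothetical, and the agent's own error messages end up in the same file because they share standard error with the trace.

```shell
#!/bin/sh
# Sketch only: trace_ra_op is a hypothetical helper name.

# Run an agent op under "sh -x", saving the trace to a file.
# Usage: trace_ra_op <trace-file> <path-to-RA> <op>
# OCF_ROOT and the OCF_RESKEY_* variables must already be exported.
trace_ra_op() {
    trace_file=$1 agent=$2 op=$3
    sh -x "$agent" "$op" 2>"$trace_file"
    rc=$?
    # The last trace lines usually show the command that led to the exit.
    tail -n 20 "$trace_file" >&2
    echo "$op returned $rc, full trace in $trace_file"
    return "$rc"
}
```

For example: trace_ra_op /tmp/ra-start.trace /usr/lib/ocf/resource.d/heartbeat/<RA> start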

  • Once you've figured out what the problem is and solved it, give the resource back to Pacemaker:
# crm resource manage <RESOURCE>
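
Before handing the resource back, it can be worth confirming that the agent's own monitor op reports the state you expect, since the monitor result is what Pacemaker will see once the resource is managed again. The sketch below is an assumption-laden illustration: ra_state is a hypothetical helper name, it relies on the standard OCF convention that monitor returns 0 for running and 7 for cleanly stopped, and it does not handle promotable (master/slave) resources.

```shell
#!/bin/sh
# Sketch only: ra_state is a hypothetical helper name.

# Classify the resource state from the agent's monitor op.
# Usage: ra_state <path-to-RA>
# OCF_ROOT and the OCF_RESKEY_* variables must already be exported.
ra_state() {
    "$1" monitor >/dev/null 2>&1
    case $? in
        0) echo running ;;            # OCF_SUCCESS
        7) echo stopped ;;            # OCF_NOT_RUNNING
        *) echo failed; return 1 ;;   # anything else needs another look
    esac
}
```

If ra_state /usr/lib/ocf/resource.d/heartbeat/<RA> prints "failed", the problem is probably not fully solved yet and the resource is better left unmanaged for now.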