Release Testing
Ensuring Quality
Each release undergoes a series of automated and manual tests to ensure the quality of the finished product. To maximize our ability to find bugs before users do, we run a battery of tests designed to exercise as much functionality as possible. These tests emphasize variety over raw quantity; however, with each feature release requiring approximately a week of round-the-clock testing, there is plenty of quantity too.
Types of variety introduced by Pacemaker testing
- Two different cluster stacks
- Under/over-powered cluster nodes
- Virtual and non-virtual machines
- Small and large clusters
- Order of test-cases chosen at random
- Manual and automatic testing
Regression Testing
The main suite of regression tests is for the Policy Engine, which was designed to be highly suited to this type of testing. Known inputs, representing past problems or tests for specific features, are fed to the PE and its outputs are compared to previously recorded ones. To facilitate this, the running PE saves the inputs it operated on so that they can later be used for analysis. If the analysis indicates a bug, the test case is added to the regression set to ensure the problem does not recur. A minimal sketch of this golden-output approach follows the table below.
Test | Description |
---|---|
Policy Engine | Simulates various conditions and configurations and ensures the cluster reacts correctly |
CLI Tools | Ensures the tools continue to produce the same output, return codes and results |
Shell | Ensures the crm shell continues to produce the same output and results |
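All three suites rely on the same golden-output technique: feed in a known input, capture the tool's output, and diff it against a previously recorded expected result. Purely as an illustration (the real scripts are pengine/regression.sh, tools/regression.sh and the shell's regression.sh; the `run_testcase` helper and the directory layout below are hypothetical), the core loop looks something like this:

```bash
#!/bin/bash
# Sketch of a golden-output regression loop. run_testcase and the
# testcases/expected/results layout are hypothetical, not Pacemaker's.
passed=0
failed=0
mkdir -p results
for input in testcases/*.xml; do
    name=$(basename "$input" .xml)
    run_testcase "$input" > "results/$name.out"
    if diff -u "expected/$name.exp" "results/$name.out" > /dev/null 2>&1; then
        echo " * Passed: $name"
        passed=$((passed + 1))
    else
        echo " * Failed: $name"
        failed=$((failed + 1))
    fi
done
echo "$passed passed, $failed failed"
```

Adding a regression test is then just a matter of dropping a new input into place and recording the expected output once it has been verified by hand.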
Setup
For now, you'll need Pacemaker installed and access to the matching source code. The intention is to eventually include the necessary pieces as part of the pacemaker-devel package.
The details for Yum-based systems are below:
```bash
# Install Pacemaker itself
yum install -y pacemaker yum-utils

# Download and install the matching sources
yumdownloader --source pacemaker
rpm -qlp pacemaker-*.src.rpm    # Look for the tarball name
rpm -Uvh pacemaker-*.src.rpm
cd ~/rpmbuild/SOURCES/

# Decompress the tarball
tar jxvf *.bz2
```
Running
Go to the top of the source tree and then run the tests:
```bash
# PEngine
pengine/regression.sh

# CLI
tools/regression.sh

# Shell (must be run as root)
/usr/share/pacemaker/shelltest/regression.sh
```
Sample Output
Command Line Tools:
```
[03:23 PM] beekhof@mobile ~/Development/pacemaker/devel # tools/regression.sh
* Passed: cibadmin - Require --force for CIB erasure
* Passed: cibadmin - Allow CIB erasure with --force
* Passed: cibadmin - Query CIB
* Passed: crm_attribute - Set cluster option
* Passed: cibadmin - Query new cluster option
* Passed: cibadmin - Query cluster options
...
* Passed: crm_resource - Set a resource's fail-count
* Passed: crm_resource - Require a destination when migrating a resource that is stopped
* Passed: crm_resource - Don't support migration to non-existant locations
* Passed: crm_resource - Migrate a resource
* Passed: crm_resource - Un-migrate a resource
--- tools/regression.exp	2009-11-21 10:07:54.000000000 +0100
+++ tools/regression.out	2010-01-26 15:23:52.000000000 +0100
@@ -585,7 +585,7 @@
 </status>
 </cib>
 * Passed: crm_resource - Create a resource attribute
-dummy (ocf::pacemaker:Dummy) Stopped
+ dummy (ocf::pacemaker:Dummy) Stopped
 <cib epoch="16" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.0" >
 <configuration>
 <crm_config>
@@ -652,7 +652,7 @@
 </status>
 </cib>
 * Passed: crm_resource - Set a resource's fail-count
-Resource dummy not migrated: not-active and no prefered location specified.
+Resource dummy not moved: not-active and no prefered location specified.
 Error performing operation: cib object missing
 <cib epoch="16" num_updates="2" admin_epoch="0" validate-with="pacemaker-1.0" >
 <configuration>
Tests passed but diff failed
```
Policy Engine:
```
[03:23 PM] beekhof@mobile ~/Development/pacemaker/devel # pengine/regression.sh
Generating test outputs for these tests...
Done.
Performing the following tests...
Test simple1 : Offline
Test simple2 : Start
Test simple3 : Start 2
Test simple4 : Start Failed
Test simple6 : Stop Start
Test simple7 : Shutdown
Test simple11 : Priority (ne)
Test simple12 : Priority (eq)
...
Test 1494 : OSDL #1494 - Clone stability
Test unrunnable-1 : Unrunnable
Test stonith-0 : Stonith loop - 1
* Failed (PE : raw)
Test stonith-1 : Stonith loop - 2
* Failed (PE : raw)
Test stonith-2 : Stonith loop - 3
* Failed (PE : raw)
Test stonith-3 : Stonith startup
Test bug-1572-1 : Recovery of groups depending on master/slave
Test bug-1572-2 : Recovery of groups depending on master/slave when the master is never re-promoted
...
Test utilization : Placement Strategy - utilization
Test minimal : Placement Strategy - minimal
Test balanced : Placement Strategy - balanced
Results of 10 failed tests are in /Users/beekhof/Development/pacemaker/devel/pengine/.regression.failed.diff....
Use pengine/regression.sh -v to display them automatically.
```
Shell
```
[04:16 PM] root@f12 ~ # /usr/share/pacemaker/shelltest/regression.sh
starting lrmd
starting stonithd
confbasic. checking... PASS
confbasic-xml. checking... PASS
node........... checking... FAIL
resource................... checking... FAIL
file..... checking... PASS
shadow. checking... PASS
ra. checking... FAIL
seems like some tests failed or else something not expected
check crmtestout/regression.out and diff files in crmtestout
in case you wonder what lrmd was doing, read crmtestout/crm.log and crmtestout/crm.debug
stopping lrmd
stopping stonithd
```
Automated Testing
Automated testing is done with CTS, a Python-based Cluster Test Suite originally written to test the Heartbeat 2-node cluster manager. CTS defines a series of test cases, which are performed in a random order, and a number of audits that are run after each test. Most tests look for known patterns in a centralized log file (typically syslog-ng is used to send logs from cluster nodes to the test master) and, at the conclusion of each test, CTS also scans the logs for (and reports) matches against pre-defined BadNews patterns that may indicate a problem. A sketch of this log scan follows the tables below.
Test | Description |
---|---|
Stop | Stop a node if it is running |
Start | Start a node if it is stopped |
FlipTest | Stop a node if it is running, start it if it was stopped |
Restart | Stop a node then start it again |
StartOnebyOne | Make sure all nodes are stopped then start them in order |
SimulStart | Make sure all nodes are stopped then start them all at once |
SimulStop | Make sure all nodes are running then stop them all at once |
StopOnebyOne | Make sure all nodes are running then stop them in order |
RestartOnebyOne | Make sure all nodes are running then execute the RestartTest on them in order |
StandbyTest | Place a node in standby mode (check that resources are migrated away), and then take it out of standby (check that resources migrate back) |
ResourceRecover | Kill a resource and make sure the cluster recovers it |
ComponentFail | Kill a cluster component (TE, PE, CIB, CRMd...) and make sure the cluster recovers |
PartialStart | Start a node and then, before it finishes starting up, tell it to shutdown |
Stonithd | Request a node be fenced, make sure it gets shot and watch the cluster recover |
SpecialTest1 | A particular combination of tests that proved troublesome in the past: SimulStop + Start + Start all remaining nodes simultaneously |
Reattach | Simulate a popular upgrade strategy. Tell the cluster to stop managing services, stop all nodes, start them up again and re-enable resource management. Ensure resources are re-detected correctly. |
NearQuorumPointTest | Randomly decide to stop, start or leave each node. Results in approximately half the nodes being up and the rest down. |
Audit | Description |
---|---|
LogAudit | Check that centralized logging is functional |
DiskAudit | Check that no node is out of disk space |
ResourceAudit | Try to verify the location of cluster resources |
CrmdStateAudit | Check that there is only one DC per partition |
CIBAudit | Verify that the CIB is synchronized between nodes |
PartitionAudit | Check that cluster membership is consistent (and that only one partition exists) |
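To make the BadNews mechanism concrete, here is a rough shell-flavoured sketch of the idea; the real patterns live inside CTS and are far more extensive than the examples used here, and the log path is illustrative:

```bash
# Illustrative BadNews scan: remember how far through the centralized
# log we were before the test, then report any new lines that match a
# known problem pattern. Pattern list and log path are examples only.
BADNEWS='ERROR:|CRIT:|core dump'
LOG=/var/log/cluster-test.log

mark=$(wc -l < "$LOG")

# ... a test case would run here ...

tail -n "+$((mark + 1))" "$LOG" | grep -E "$BADNEWS" | sed 's/^/BadNews: /'
```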
Setup
A new tool has been written to simplify the process of setting up CTS and verifying existing CTS installations.
It can be found at:
http://hg.clusterlabs.org/pacemaker/devel/file/tip/cts/cluster_test
Please send feedback via the mailing list.
Essentially, it:
- Sets up remote logging from the cluster nodes to the test master using syslog-ng
- Sets up password-less ssh access from the test master to the cluster nodes
- Asks if you want to use a sample or existing resource configuration
- Asks for the details of your fencing device(s)
- Displays the command to initiate testing
It assumes the rest of the cluster software (corosync+pacemaker) is installed, configured and functional on the cluster nodes.
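If you prefer to perform (or verify) the first two of those steps by hand, the commands involved look roughly like the following; the node names, test-master address and syslog-ng source name are illustrative, not defaults:

```bash
# 1. Remote logging: on each node, point syslog-ng at the test master
# by appending something like the following to /etc/syslog-ng/syslog-ng.conf
# and restarting syslog-ng:
#
#   destination d_master { udp("test-master.example.com" port(514)); };
#   log { source(src); destination(d_master); };

# 2. Password-less ssh from the test master to every cluster node
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for node in pcmk-1 pcmk-2 pcmk-3 pcmk-4; do
    ssh-copy-id root@$node
done
```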
Running
Options
```
usage: ./CTSlab.py [options] number-of-iterations

Common options:
	[--at-boot (1|0)],         does the cluster software start at boot time
	[--nodes 'node list'],     list of cluster nodes separated by whitespace
	[--limit-nodes max],       only use the first 'max' cluster nodes supplied with --nodes
	[--stack (heartbeat|ais)], which cluster stack is installed
	[--logfile path],          where should the test software look for logs from cluster nodes
	[--syslog-facility name],  which syslog facility should the test software log to
	[--choose testcase-name],  run only the named test
	[--list-tests],            list the valid tests
	[--benchmark],             add the timing information

Options for release testing:
	[--clobber-cib | -c ]        Erase any existing configuration
	[--populate-resources | -r]  Generate a sample configuration
	[--test-ip-base ip]          Offset for generated IP address resources
	[--schema (pacemaker-0.6|pacemaker-1.0|hae)]  Which configuration version to use

Additional (less common) options:
	[--trunc (truncate logfile before starting)]
	[--xmit-loss lost-rate(0.0-1.0)]
	[--recv-loss lost-rate(0.0-1.0)]
	[--standby (1 | 0 | yes | no)]
	[--fencing (1 | 0 | yes | no)]
	[--stonith (1 | 0 | yes | no)]
	[--stonith-type type]
	[--stonith-args name=value]
	[--bsc]
	[--once],                 run all valid tests once
	[--no-loop-tests],        dont run looping/time-based tests
	[--no-unsafe-tests],      dont run tests that are unsafe for use with ocfs2/drbd
	[--valgrind-tests],       include tests using valgrind
	[--experimental-tests],   include experimental tests
	[--oprofile 'node list'], list of cluster nodes to run oprofile on
	[--seed random_seed]
	[--set option=value]
```
Helpers
These are two Bash functions I use regularly for kicking off test runs from a known, sane state. The cleanup function assumes pdsh is installed and that syslog-ng is used.
```bash
function cts-run() {
    python ./CTSlab.py --trunc -L $cluster_log --facility daemon \
        --nodes "$cluster_hosts" --at-boot 0 --schema pacemaker-1.0 $*
}

function cts-cleanup() {
    echo `date` ": Cleaning up: $cluster_hosts"
    target="-l root -w `echo $cluster_hosts | tr ' ' ','`"

    # Optionally kill any cluster processes and wipe the CIB first
    if [ "x$1" = "x--kill" ]; then
        echo "Cleaning processes"
        pdsh $target "killall -q -9 corosync aisexec heartbeat ccm stonithd ha_logd lrmd crmd pengine attrd pingd mgmtd cib" &> /dev/null
        pdsh $target "rm -rf /var/lib/heartbeat/crm/cib*"
    fi

    # Reset the centralized log on the test master
    cat /dev/null > $cluster_log
    /etc/init.d/syslog-ng restart

    # Remove stale state, cores, profiling data and old logs from the nodes
    pdsh $target "rm -rf /var/lib/heartbeat/crm/cib-* /var/lib/heartbeat/cores/*/core.* /var/lib/heartbeat/hostcache /var/lib/openais/core.*"
    pdsh $target "rm -rf /var/lib/oprofile/samples/cts.*"
    pdsh $target "find /var/lib/pengine -name '*.bz2' -exec rm -f \{\} \;"
    pdsh $target "rm -f /var/log/messages* /var/log/localmessages* /var/log/cluster*.log"
    pdsh $target "/etc/init.d/syslog-ng restart" 2>&1 > /dev/null
    pdsh $target "logger -i -p daemon.info __clean_logs__"
    echo `date` ": Clean complete"
}
```
Invoking
To kick off 500 CTS iterations for a cluster of four virtual machines running Pacemaker with Corosync's FlatIron branch, using fence_xvm for fencing and a sample (generated) configuration:
Define some variables needed by the helper scripts:
```bash
cluster_log=/var/log/messages
cluster_hosts="pcmk-1 pcmk-2 pcmk-3 pcmk-4"
```
Start the tests:
```bash
cts-cleanup --kill
cts-run --test-ip-base 192.168.100.180 --clobber-cib --populate-resources \
        --stack flatiron --stonith-type fence_xvm \
        --stonith-args pcmk_host_check=dynamic-list,pcmk_arg_map=domain:uname 500
```
Sample Output
```
Jan 21 23:20:43 f12 CTS: >>>>>>>>>>>>>>>> BEGINNING 500 TESTS
Jan 21 23:20:43 f12 CTS: System log files: /var/log/cluster-virt1.log
Jan 21 23:20:43 f12 CTS: Stack: corosync (flatiron)
Jan 21 23:20:43 f12 CTS: Schema: pacemaker-1.0
Jan 21 23:20:43 f12 CTS: Random Seed: 1264112443
Jan 21 23:20:43 f12 CTS: Enable Stonith: 1
Jan 21 23:20:43 f12 CTS: Enable Fencing: 1
Jan 21 23:20:43 f12 CTS: Enable Standby: 1
Jan 21 23:20:43 f12 CTS: Enable Resources: 1
Jan 21 23:20:43 f12 CTS: Cluster nodes:
Jan 21 23:20:43 f12 CTS:  * pcmk-1
Jan 21 23:20:43 f12 CTS:  * pcmk-2
Jan 21 23:20:43 f12 CTS:  * pcmk-3
Jan 21 23:20:43 f12 CTS:  * pcmk-4
Jan 21 23:20:45 f12 CTS: Audit LogAudit passed.
Jan 21 23:20:46 f12 CTS: Audit DiskspaceAudit passed.
Jan 21 23:20:48 f12 CTS: Stopping Cluster Manager on all nodes
Jan 21 23:20:48 f12 CTS: Starting Cluster Manager on all nodes.
Jan 21 23:24:32 f12 CTS: Executing tests at random
Jan 21 23:24:32 f12 CTS: Running test ComponentFail (pcmk-4) [ 1]
Jan 21 23:25:40 f12 CTS: Running test Reattach (pcmk-4) [ 2]
Jan 21 23:27:05 f12 CTS: Running test SimulStart (pcmk-3) [ 3]
Jan 21 23:28:34 f12 CTS: Running test SimulStart (pcmk-2) [ 4]
Jan 21 23:29:59 f12 CTS: Running test Flip (pcmk-3) [ 5]
Jan 21 23:30:37 f12 CTS: Running test ResourceRecover (pcmk-1) [ 6]
Jan 21 23:30:54 f12 CTS: Running test ResourceRecover (pcmk-1) [ 7]
Jan 21 23:31:00 f12 CTS: Running test SimulStart (pcmk-2) [ 8]
Jan 21 23:32:31 f12 CTS: Running test ComponentFail (pcmk-2) [ 9]
Jan 21 23:33:38 f12 CTS: Running test StopOnebyOne (pcmk-4) [ 10]
Jan 21 23:33:59 f12 CTS: Running test Stonithd (pcmk-2) [ 11]
Jan 21 23:36:37 f12 CTS: Running test ComponentFail (pcmk-1) [ 12]
Jan 21 23:39:01 f12 CTS: Running test SimulStart (pcmk-1) [ 13]
Jan 21 23:40:25 f12 CTS: Running test RestartOnebyOne (pcmk-2) [ 14]
Jan 21 23:41:35 f12 CTS: Running test SpecialTest1 (pcmk-4) [ 15]
...
Jan 22 01:55:55 f12 CTS: Running test PartialStart (pcmk-3) [138]
Jan 22 01:56:06 f12 CTS: Running test Restart (pcmk-2) [139]
Jan 22 01:57:13 f12 CTS: Running test SimulStart (pcmk-4) [140]
Jan 22 02:03:37 f12 CTS: Node status for pcmk-2 is down but we think it should be up
Jan 22 02:03:37 f12 CTS: Warn: Node pcmk-2 not stable
Jan 22 02:03:38 f12 CTS: Test SimulStartLite FAILED: Unstable cluster nodes exist: ['pcmk-2']
Jan 22 02:03:38 f12 CTS: Test SimulStart FAILED: Startall failed
Jan 22 02:03:42 f12 CTS: Running test SimulStop (pcmk-1) [141]
Jan 22 02:04:06 f12 CTS: Running test SimulStart (pcmk-2) [142]
Jan 22 02:05:24 f12 CTS: Running test StartOnebyOne (pcmk-1) [143]
Jan 22 02:08:09 f12 CTS: Running test RestartOnebyOne (pcmk-4) [144]
...
Jan 22 02:30:27 f12 CTS: Running test Reattach (pcmk-3) [169]
Jan 22 02:32:06 f12 CTS: Running test ComponentFail (pcmk-2) [170]
Jan 22 02:33:20 f12 CTS: Running test SpecialTest1 (pcmk-1) [171]
Jan 22 02:34:51 f12 CTS: BadNews: Jan 22 02:33:22 pcmk-2 crmd: [8397]: ERROR: verify_stopped: Resource stateful-1:0 active at shutdown. You may ignore this error if it is unmanaged.
Jan 22 02:34:51 f12 CTS: BadNews: Jan 22 02:33:22 pcmk-2 crmd: [8397]: ERROR: verify_stopped: Resource ping-1:0 was at shutdown. You may ignore this error if it is unmanaged.
Jan 22 02:34:55 f12 CTS: Running test StartOnebyOne (pcmk-3) [172]
Jan 22 02:37:38 f12 CTS: Running test RestartOnebyOne (pcmk-4) [173]
Jan 22 02:38:52 f12 CTS: Running test SimulStart (pcmk-3) [174]
...
Jan 22 10:36:30 f12 CTS: Running test ResourceRecover (pcmk-3) [495]
Jan 22 10:36:55 f12 CTS: Running test Reattach (pcmk-3) [496]
Jan 22 10:38:22 f12 CTS: Running test Flip (pcmk-3) [497]
Jan 22 10:38:59 f12 CTS: Running test SimulStop (pcmk-1) [498]
Jan 22 10:39:21 f12 CTS: Running test PartialStart (pcmk-2) [499]
Jan 22 10:39:26 f12 CTS: Running test RestartOnebyOne (pcmk-2) [500]
Jan 22 10:41:55 f12 CTS: Stopping Cluster Manager on all nodes
Jan 22 10:42:15 f12 CTS: ****************
Jan 22 10:42:15 f12 CTS: Overall Results:{'failure': 2, 'skipped': 0, 'success': 498, 'BadNews': 2}
Jan 22 10:42:15 f12 CTS: ****************
Jan 22 10:42:15 f12 CTS: Test Summary
Jan 22 10:42:15 f12 CTS: Test Flip: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 36}
Jan 22 10:42:15 f12 CTS: Test Restart: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 31}
Jan 22 10:42:15 f12 CTS: Test Stonithd: {'auditfail': 0, 'failure': 1, 'skipped': 0, 'calls': 29}
Jan 22 10:42:15 f12 CTS: Test StartOnebyOne: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 38}
Jan 22 10:42:15 f12 CTS: Test SimulStart: {'auditfail': 0, 'failure': 1, 'skipped': 0, 'calls': 46}
Jan 22 10:42:15 f12 CTS: Test SimulStop: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 35}
Jan 22 10:42:15 f12 CTS: Test StopOnebyOne: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 26}
Jan 22 10:42:15 f12 CTS: Test RestartOnebyOne: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 43}
Jan 22 10:42:15 f12 CTS: Test PartialStart: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 29}
Jan 22 10:42:15 f12 CTS: Test Standby: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 27}
Jan 22 10:42:15 f12 CTS: Test ResourceRecover: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 29}
Jan 22 10:42:15 f12 CTS: Test ComponentFail: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 29}
Jan 22 10:42:15 f12 CTS: Test Reattach: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 35}
Jan 22 10:42:15 f12 CTS: Test SpecialTest1: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 40}
Jan 22 10:42:15 f12 CTS: Test NearQuorumPoint: {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 27}
Jan 22 10:42:15 f12 CTS: <<<<<<<<<<<<<<<< TESTS COMPLETE
```
Interpreting Results
Failures
Individual test failures are reported inline as the run progresses (e.g. `Test SimulStart FAILED: Startall failed` in the sample output above) and are totalled in the final Overall Results line.
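When chasing a specific failure, the --choose option shown in the usage output above lets you repeat just the offending test case. For example, using the cts-run helper defined earlier to re-run the SimulStart failure from the sample output (the iteration count is arbitrary):

```bash
# Repeat only the failing test case; extra arguments are passed
# straight through to CTSlab.py by the cts-run helper.
cts-run --choose SimulStart 50
```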
Audit Failures
Audits run after every test; per-test 'auditfail' counts appear in the final Test Summary shown above.
BadNews
BadNews entries are log messages that matched a known problem pattern. As the sample output shows, some (such as errors about resources still active at shutdown) may be ignorable when the resource is unmanaged; anything unexplained should be investigated or reported.
Reporting Problems
If in doubt, file a report for any bugs or suspicious results in Bugzilla:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Please attach a hb_report archive to the bug, as well as the relevant part of the test output.
Creating an archive is simple: just use the -f option to give hb_report the test number, and invoke it on the test master. E.g.:
```bash
hb_report -l /var/log/messages -f cts:140 /tmp/cts-test-140
```
Feature Releases
Automated Testing
For the Heartbeat cluster stack
- 2-nodes : 1000 Iterations
- 4-nodes : 1000 Iterations
- 6-nodes : 1000 Iterations
For the OpenAIS cluster stack
- 2-nodes : 1000 Iterations
- 4-nodes : 1000 Iterations
- 8-nodes : 1000 Iterations
Total Iterations: 6,000
Total Estimated Cluster Transitions: 23,000
Manual Testing
- crm_mon
Regression Testing
- Perform Policy Engine Regression tests
- Perform CLI Regression tests
- cibadmin
- crm_standby
- crm_attribute
- crm_failcount
- crm_resource
Maintenance Releases
Automated Testing
For the Heartbeat cluster stack
- 2-nodes : 500 Iterations
- 4-nodes : 500 Iterations
For the OpenAIS cluster stack
- 2-nodes : 500 Iterations
- 4-nodes : 500 Iterations
- 8-nodes : 500 Iterations
Total Iterations: 2,500
Total Estimated Cluster Transitions: 10,000
Manual Testing
- TBA
Regression Testing
- Perform Policy Engine Regression tests
- Perform CLI Regression tests
- cibadmin
- crm_standby
- crm_attribute
- crm_failcount
- crm_resource