Release Testing

From ClusterLabs

Ensuring Quality

Each release undergoes a series of automated and manual tests to ensure the quality of the finished product. To maximize our ability to find bugs before users do, we conduct a battery of tests designed to exercise as much functionality as possible. These tests emphasize variety over raw quantity; however, with each feature release requiring approximately a week of round-the-clock testing, there is plenty of quantity too.

Types of variety introduced by Pacemaker testing

  • Two different cluster stacks
  • Under/over-powered cluster nodes
  • Virtual and non-virtual machines
  • Small and large clusters
  • Order of test-cases chosen at random
  • Manual and automatic testing

Regression Testing

The main suite of regression tests is for the Policy Engine, which was designed to be highly suited to this type of testing. Known inputs, representing past problems or tests for specific features, are fed to the PE and its outputs are compared to previously recorded ones. To facilitate this, the running PE saves the inputs it operated on so that they can later be used for analysis. If the analysis indicates a bug, the testcase is added to the regression set to ensure the problem does not recur.
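The idea can be sketched in shell: each saved input is paired with a recorded expected output, and the suite diffs fresh output against the recording. The file names and the `process_input` stand-in below are illustrative, not the actual suite layout:

```shell
#!/bin/sh
# Sketch of the regression-suite idea: each saved input has a recorded
# expected output; fresh output is diffed against the recording.
# File names and process_input are illustrative, not the real suite layout.

process_input() {
    # Stand-in for the real tool invocation (e.g. the PE transition calculator)
    cat "$1"
}

run_testcase() {
    input="$1"
    expected="${input%.xml}.exp"
    actual="${input%.xml}.out"

    process_input "$input" > "$actual"

    # diff prints nothing when output matches the recording, and shows the
    # discrepancy (as in the sample output below) when it does not
    if diff -u "$expected" "$actual"; then
        echo "* Passed: $input"
    else
        echo "* Failed: $input"
        return 1
    fi
}
```

A failing case leaves both the diff and the `.out` file behind, so the discrepancy can be inspected and, if legitimate, promoted to the new recording.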

List of Regression Tests

Test          Description
Policy Engine Simulates various conditions and configurations and ensures the cluster reacts correctly
CLI Tools     Ensures the tools continue to produce the same output, return codes and results
Shell         Exercises the crm shell's commands and compares the results against recorded output

Setup

For now, you'll need Pacemaker installed and access to the matching source code. The intention is to eventually include the necessary pieces as part of the pacemaker-devel package.

The details for Yum-based systems are below:

 # Install Pacemaker itself
 yum install -y pacemaker yum-utils
 
 # Download and install the matching sources
 yumdownloader --source pacemaker
 rpm -qlp pacemaker-*.src.rpm # Look for the tarball name
 rpm -Uvh pacemaker-*.src.rpm
 cd ~/rpmbuild/SOURCES/
 
 # Decompress the tarball
 tar jxvf *.bz2

Running

Go to the top of the source tree and then run the tests:

 # PEngine
 pengine/regression.sh
 
 # CLI
 tools/regression.sh
 # Shell (must be run as root)
 /usr/share/pacemaker/shelltest/regression.sh

Sample Output

Command Line Tools:

 [03:23 PM] beekhof@mobile ~/Development/pacemaker/devel # tools/regression.sh
 * Passed: cibadmin       - Require --force for CIB erasure
 * Passed: cibadmin       - Allow CIB erasure with --force
 * Passed: cibadmin       - Query CIB
 * Passed: crm_attribute  - Set cluster option
 * Passed: cibadmin       - Query new cluster option
 * Passed: cibadmin       - Query cluster options
 ...
 * Passed: crm_resource   - Set a resource's fail-count
 * Passed: crm_resource   - Require a destination when migrating a resource that is stopped
 * Passed: crm_resource   - Don't support migration to non-existant locations
 * Passed: crm_resource   - Migrate a resource
 * Passed: crm_resource   - Un-migrate a resource
 --- tools/regression.exp	2009-11-21 10:07:54.000000000 +0100
 +++ tools/regression.out	2010-01-26 15:23:52.000000000 +0100
 @@ -585,7 +585,7 @@
    </status>
  </cib>
  * Passed: crm_resource   - Create a resource attribute
 -dummy	(ocf::pacemaker:Dummy) Stopped 
 + dummy	(ocf::pacemaker:Dummy) Stopped 
  <cib epoch="16" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.0" >
    <configuration>
      <crm_config>
 @@ -652,7 +652,7 @@
    </status>
  </cib>
  * Passed: crm_resource   - Set a resource's fail-count
 -Resource dummy not migrated: not-active and no prefered location specified.
 +Resource dummy not moved: not-active and no prefered location specified.
  Error performing operation: cib object missing
  <cib epoch="16" num_updates="2" admin_epoch="0" validate-with="pacemaker-1.0" >
    <configuration>
 Tests passed but diff failed

Policy Engine:

 [03:23 PM] beekhof@mobile ~/Development/pacemaker/devel # pengine/regression.sh
 Generating test outputs for these tests...
 Done.
 
 Performing the following tests...  
 
 Test simple1	:	Offline     
 Test simple2	:	Start       
 Test simple3	:	Start 2     
 Test simple4	:	Start Failed
 Test simple6	:	Stop Start  
 Test simple7	:	Shutdown    
 Test simple11	:	Priority (ne)
 Test simple12	:	Priority (eq)
 ...
 Test 1494	:	OSDL #1494 - Clone stability
 Test unrunnable-1	:	Unrunnable
 Test stonith-0	:	Stonith loop - 1
 	* Failed (PE : raw)
 Test stonith-1	:	Stonith loop - 2  
 	* Failed (PE : raw)
 Test stonith-2	:	Stonith loop - 3
 	* Failed (PE : raw)
 Test stonith-3	:	Stonith startup
 Test bug-1572-1	:	Recovery of groups depending on master/slave
 Test bug-1572-2	:	Recovery of groups depending on master/slave when the master is never re-promoted
 ...
 Test utilization	:	Placement Strategy - utilization
 Test minimal	:	Placement Strategy - minimal
 Test balanced	:	Placement Strategy - balanced
 
 Results of 10 failed tests are in /Users/beekhof/Development/pacemaker/devel/pengine/.regression.failed.diff....
 Use pengine/regression.sh -v to display them automatically.

Shell

 [04:16 PM] root@f12 ~ # /usr/share/pacemaker/shelltest/regression.sh
 starting lrmd
 starting stonithd
 confbasic. checking... PASS
 confbasic-xml. checking... PASS
 node........... checking... FAIL
 resource................... checking... FAIL
 file..... checking... PASS
 shadow. checking... PASS
 ra. checking... FAIL
 seems like some tests failed or else something not expected
 check crmtestout/regression.out and diff files in crmtestout
 in case you wonder what lrmd was doing, read crmtestout/crm.log and crmtestout/crm.debug
 stopping lrmd
 stopping stonithd

Automated Testing

Automated testing is done with CTS, a Python-based Cluster Test Suite originally written to test the Heartbeat 2-node cluster manager. CTS defines a series of testcases, which are performed in random order, and a number of audits that are performed after each test is executed. Most tests look for known patterns in a centralized log file (typically syslog-ng is used to send logs from cluster nodes to the test master) and, at the conclusion of each test, CTS also scans the logs for (and reports) lines matching pre-defined BadNews patterns which may indicate a problem.
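The BadNews pass amounts to grepping the collected log for a list of worrying patterns after each test. A minimal sketch, with an abridged, illustrative pattern list rather than CTS's actual set:

```shell
#!/bin/sh
# Sketch of the post-test BadNews scan: report any log line matching a
# known-worrying pattern. The pattern list is abridged and illustrative.

scan_badnews() {
    logfile="$1"
    grep -E 'ERROR:|CRIT:|Shutting down|core dump' "$logfile" |
        while read -r line; do
            echo "BadNews: $line"
        done
}
```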

List of Automated Test Cases

Test Description
Stop Stop a node if it is running
Start Start a node if it is stopped
FlipTest Stop a node if it is running, start it if it was stopped
Restart Stop a node then start it again
StartOnebyOne Make sure all nodes are stopped then start them in order
SimulStart Make sure all nodes are stopped then start them all at once
SimulStop Make sure all nodes are running then stop them all at once
StopOnebyOne Make sure all nodes are running then stop them in order
RestartOnebyOne Make sure all nodes are running then execute the RestartTest on them in order
StandbyTest Place a node in standby mode (check that resources are migrated away), and then take it out of standby (check that resources migrate back)
ResourceRecover Kill a resource and make sure the cluster recovers it
ComponentFail Kill a cluster component (TE, PE, CIB, CRMd...) and make sure the cluster recovers
PartialStart Start a node and then, before it finishes starting up, tell it to shutdown
Stonithd Request a node be fenced, make sure it gets shot and watch the cluster recover
SpecialTest1 A particular combination of tests that proved troublesome in the past: SimulStop + Start + Start all remaining nodes simultaneously
Reattach Simulate a popular upgrade strategy. Tell the cluster to stop managing services, stop all nodes, start them up again and re-enable resource management. Ensure resources are re-detected correctly.
NearQuorumPointTest Randomly decide to stop, start or leave each node. Results in approximately half the nodes being up and the rest down.
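Most of these tests follow the same shape: inspect a node's state, drive it to the target state, then let the post-test audits verify the result. A FlipTest-style sketch, where `node_is_up`, `start_node` and `stop_node` are hypothetical stand-ins for the real CTS machinery:

```shell
#!/bin/sh
# FlipTest sketch: stop the node if it is running, start it if it is stopped.
# node_is_up, start_node and stop_node stand in for the real CTS machinery.

flip_test() {
    node="$1"
    if node_is_up "$node"; then
        stop_node "$node"
    else
        start_node "$node"
    fi
}
```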


List of Automated Post-Test Audits

Audit Description
LogAudit Check that centralized logging is functional
DiskAudit Check that no node is out of disk space
ResourceAudit Try to verify the location of cluster resources
CrmdStateAudit Check that there is only one DC per partition
CIBAudit Verify that the CIB is synchronized between nodes
PartitionAudit Check that cluster membership is consistent (and that only one partition exists)
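As an illustration of what such an audit involves, a DiskAudit-style check can be approximated with df; the 95% default threshold here is an assumption, not necessarily the value CTS uses:

```shell
#!/bin/sh
# DiskAudit sketch: fail if any local filesystem is above a usage threshold.
# The 95% default is illustrative; CTS's real audit may differ.

disk_audit() {
    threshold="${1:-95}"
    df -P | awk -v max="$threshold" '
        NR > 1 {
            sub(/%/, "", $5)                       # strip the % from Capacity
            if ($5 + 0 > max) { print "Low on space: " $6; bad = 1 }
        }
        END { exit bad }
    '
}
```

In CTS the equivalent check runs on every cluster node after each test, not just locally.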

Setup

A new tool has been written to simplify the process of setting up CTS and verifying existing CTS installations.

It can be found at:

 http://hg.clusterlabs.org/pacemaker/devel/file/tip/cts/cluster_test

Please send feedback via the mailing list.

Essentially, it:

  • Sets up remote logging from the cluster nodes to the test master using syslog-ng
  • Sets up password-less ssh access from the test master to the cluster nodes
  • Asks if you want to use a sample or existing resource configuration
  • Asks for the details of your fencing device(s)
  • Displays the command to initiate testing

It assumes the rest of the cluster software (corosync+pacemaker) is installed, configured and functional on the cluster nodes.
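For reference, the remote-logging arrangement cluster_test sets up looks roughly like this in syslog-ng terms; the host name, port and source/destination names are illustrative, so consult what the tool actually writes:

```
# On each cluster node: forward log entries to the test master.
# "cts-master" and UDP port 514 are illustrative values.
destination d_master { udp("cts-master" port(514)); };
log { source(src); destination(d_master); };

# On the test master: collect forwarded entries into one file
# (the file CTS is pointed at with --logfile).
source s_remote { udp(ip("0.0.0.0") port(514)); };
destination d_cluster { file("/var/log/cluster.log"); };
log { source(s_remote); destination(d_cluster); };
```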

Running

Options

 usage: ./CTSlab.py [options] number-of-iterations
 
 Common options: 
        [--at-boot (1|0)],         does the cluster software start at boot time
        [--nodes 'node list'],     list of cluster nodes separated by whitespace
        [--limit-nodes max],       only use the first 'max' cluster nodes supplied with --nodes
        [--stack (heartbeat|ais)], which cluster stack is installed
        [--logfile path],          where should the test software look for logs from cluster nodes
        [--syslog-facility name],  which syslog facility should the test software log to
        [--choose testcase-name],  run only the named test
        [--list-tests],            list the valid tests
        [--benchmark],             add the timing information
        
 Options for release testing: 
        [--clobber-cib | -c ]       Erase any existing configuration
        [--populate-resources | -r] Generate a sample configuration
        [--test-ip-base ip]         Offset for generated IP address resources 
        [--schema (pacemaker-0.6|pacemaker-1.0|hae)] Which configuration version to use
          
 Additional (less common) options: 
        [--trunc (truncate logfile before starting)]
        [--xmit-loss lost-rate(0.0-1.0)]
        [--recv-loss lost-rate(0.0-1.0)]
        [--standby (1 | 0 | yes | no)]
        [--fencing (1 | 0 | yes | no)]
        [--stonith (1 | 0 | yes | no)]
        [--stonith-type type]
        [--stonith-args name=value]
        [--bsc]
        [--once],                 run all valid tests once
        [--no-loop-tests],        dont run looping/time-based tests
        [--no-unsafe-tests],      dont run tests that are unsafe for use with ocfs2/drbd
        [--valgrind-tests],       include tests using valgrind
        [--experimental-tests],   include experimental tests
        [--oprofile 'node list'], list of cluster nodes to run oprofile on]
        [--seed random_seed]
        [--set option=value]

Helpers

These are two Bash functions that I use regularly for kicking off test runs from a known, sane state. The cleanup function assumes pdsh is installed and that syslog-ng is used.

 function cts-run() {
   python ./CTSlab.py --trunc -L $cluster_log --facility daemon --nodes "$cluster_hosts" --at-boot 0 --schema pacemaker-1.0 "$@"
 }
 
 function cts-cleanup() {
   echo `date` ": Cleaning up: $cluster_hosts"
   target="-l root -w `echo $cluster_hosts | tr ' ' ','`"
 
   if [ "x$1" = "x--kill" ]; then
 	echo "Cleaning processes"
 	pdsh $target "killall -q -9 corosync aisexec heartbeat ccm stonithd ha_logd lrmd crmd pengine attrd pingd mgmtd cib" &> /dev/null 
 	pdsh $target "rm -rf /var/lib/heartbeat/crm/cib*"
   fi
 
   cat /dev/null > $cluster_log
   /etc/init.d/syslog-ng restart
 
   pdsh $target "rm -rf /var/lib/heartbeat/crm/cib-*  /var/lib/heartbeat/cores/*/core.* /var/lib/heartbeat/hostcache /var/lib/openais/core.*"
   pdsh $target "rm -rf /var/lib/oprofile/samples/cts.*"
   pdsh $target "find /var/lib/pengine -name '*.bz2' -exec rm -f \{\} \;"
   pdsh $target "rm -f /var/log/messages* /var/log/localmessages* /var/log/cluster*.log"
   pdsh $target "/etc/init.d/syslog-ng restart" > /dev/null 2>&1
   pdsh $target "logger -i -p daemon.info __clean_logs__"
 
   echo `date` ": Clean complete"
 }

Invoking

To then kick off 500 CTS iterations for a cluster of four virtual machines running Pacemaker with Corosync's FlatIron branch, using fence_xvm for fencing, and a sample (generated) configuration...

Define some variables needed by the helper scripts:

 cluster_log=/var/log/messages
 cluster_hosts="pcmk-1 pcmk-2 pcmk-3 pcmk-4"

Start the tests:

 cts-cleanup --kill
 cts-run --test-ip-base 192.168.100.180 --clobber-cib --populate-resources  --stack flatiron --stonith-type fence_xvm --stonith-args pcmk_host_check=dynamic-list,pcmk_arg_map=domain:uname 500

Sample Output

 Jan 21 23:20:43 f12 CTS: >>>>>>>>>>>>>>>> BEGINNING 500 TESTS
 Jan 21 23:20:43 f12 CTS: System log files: /var/log/cluster-virt1.log
 Jan 21 23:20:43 f12 CTS: Stack:            corosync (flatiron)
 Jan 21 23:20:43 f12 CTS: Schema:           pacemaker-1.0
 Jan 21 23:20:43 f12 CTS: Random Seed:      1264112443
 Jan 21 23:20:43 f12 CTS: Enable Stonith:   1
 Jan 21 23:20:43 f12 CTS: Enable Fencing:   1
 Jan 21 23:20:43 f12 CTS: Enable Standby:   1
 Jan 21 23:20:43 f12 CTS: Enable Resources: 1
 Jan 21 23:20:43 f12 CTS: Cluster nodes:
 Jan 21 23:20:43 f12 CTS: * pcmk-1
 Jan 21 23:20:43 f12 CTS: * pcmk-2
 Jan 21 23:20:43 f12 CTS: * pcmk-3
 Jan 21 23:20:43 f12 CTS: * pcmk-4
 Jan 21 23:20:45 f12 CTS: Audit LogAudit passed.
 Jan 21 23:20:46 f12 CTS: Audit DiskspaceAudit passed.
 Jan 21 23:20:48 f12 CTS: Stopping Cluster Manager on all nodes
 Jan 21 23:20:48 f12 CTS: Starting Cluster Manager on all nodes.
 Jan 21 23:24:32 f12 CTS: Executing tests at random
 Jan 21 23:24:32 f12 CTS: Running test ComponentFail          (pcmk-4)      [  1]
 Jan 21 23:25:40 f12 CTS: Running test Reattach               (pcmk-4)      [  2]
 Jan 21 23:27:05 f12 CTS: Running test SimulStart             (pcmk-3)      [  3]
 Jan 21 23:28:34 f12 CTS: Running test SimulStart             (pcmk-2)      [  4]
 Jan 21 23:29:59 f12 CTS: Running test Flip                   (pcmk-3)      [  5]
 Jan 21 23:30:37 f12 CTS: Running test ResourceRecover        (pcmk-1)      [  6]
 Jan 21 23:30:54 f12 CTS: Running test ResourceRecover        (pcmk-1)      [  7]
 Jan 21 23:31:00 f12 CTS: Running test SimulStart             (pcmk-2)      [  8]
 Jan 21 23:32:31 f12 CTS: Running test ComponentFail          (pcmk-2)      [  9]
 Jan 21 23:33:38 f12 CTS: Running test StopOnebyOne           (pcmk-4)      [ 10]
 Jan 21 23:33:59 f12 CTS: Running test Stonithd               (pcmk-2)      [ 11]
 Jan 21 23:36:37 f12 CTS: Running test ComponentFail          (pcmk-1)      [ 12]
 Jan 21 23:39:01 f12 CTS: Running test SimulStart             (pcmk-1)      [ 13]
 Jan 21 23:40:25 f12 CTS: Running test RestartOnebyOne        (pcmk-2)      [ 14]
 Jan 21 23:41:35 f12 CTS: Running test SpecialTest1           (pcmk-4)      [ 15]
 ...
 Jan 22 01:55:55 f12 CTS: Running test PartialStart           (pcmk-3)      [138]
 Jan 22 01:56:06 f12 CTS: Running test Restart                (pcmk-2)      [139]
 Jan 22 01:57:13 f12 CTS: Running test SimulStart             (pcmk-4)      [140]
 Jan 22 02:03:37 f12 CTS: Node status for pcmk-2 is down but we think it should be up
 Jan 22 02:03:37 f12 CTS: Warn: Node pcmk-2 not stable
 Jan 22 02:03:38 f12 CTS: Test SimulStartLite                 FAILED: Unstable cluster nodes exist: ['pcmk-2']
 Jan 22 02:03:38 f12 CTS: Test SimulStart                     FAILED: Startall failed
 Jan 22 02:03:42 f12 CTS: Running test SimulStop              (pcmk-1)      [141]
 Jan 22 02:04:06 f12 CTS: Running test SimulStart             (pcmk-2)      [142]
 Jan 22 02:05:24 f12 CTS: Running test StartOnebyOne          (pcmk-1)      [143]
 Jan 22 02:08:09 f12 CTS: Running test RestartOnebyOne        (pcmk-4)      [144]
 ...
 Jan 22 02:30:27 f12 CTS: Running test Reattach               (pcmk-3)      [169]
 Jan 22 02:32:06 f12 CTS: Running test ComponentFail          (pcmk-2)      [170]
 Jan 22 02:33:20 f12 CTS: Running test SpecialTest1           (pcmk-1)      [171]
 Jan 22 02:34:51 f12 CTS: BadNews: Jan 22 02:33:22 pcmk-2 crmd: [8397]: ERROR: verify_stopped: Resource stateful-1:0 active at shutdown.  You may ignore this error if it is unmanaged.
 Jan 22 02:34:51 f12 CTS: BadNews: Jan 22 02:33:22 pcmk-2 crmd: [8397]: ERROR: verify_stopped: Resource ping-1:0 was at shutdown.  You may ignore this error if it is unmanaged.
 Jan 22 02:34:55 f12 CTS: Running test StartOnebyOne          (pcmk-3)      [172]
 Jan 22 02:37:38 f12 CTS: Running test RestartOnebyOne        (pcmk-4)      [173]
 Jan 22 02:38:52 f12 CTS: Running test SimulStart             (pcmk-3)      [174]
 ...
 Jan 22 10:36:30 f12 CTS: Running test ResourceRecover        (pcmk-3)      [495]
 Jan 22 10:36:55 f12 CTS: Running test Reattach               (pcmk-3)      [496]
 Jan 22 10:38:22 f12 CTS: Running test Flip                   (pcmk-3)      [497]
 Jan 22 10:38:59 f12 CTS: Running test SimulStop              (pcmk-1)      [498]
 Jan 22 10:39:21 f12 CTS: Running test PartialStart           (pcmk-2)      [499]
 Jan 22 10:39:26 f12 CTS: Running test RestartOnebyOne        (pcmk-2)      [500]
 Jan 22 10:41:55 f12 CTS: Stopping Cluster Manager on all nodes
 Jan 22 10:42:15 f12 CTS: ****************
 Jan 22 10:42:15 f12 CTS: Overall Results:{'failure': 2, 'skipped': 0, 'success': 498, 'BadNews': 2}
 Jan 22 10:42:15 f12 CTS: ****************
 Jan 22 10:42:15 f12 CTS: Test Summary
 Jan 22 10:42:15 f12 CTS: Test Flip:                {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 36}
 Jan 22 10:42:15 f12 CTS: Test Restart:             {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 31}
 Jan 22 10:42:15 f12 CTS: Test Stonithd:            {'auditfail': 0, 'failure': 1, 'skipped': 0, 'calls': 29}
 Jan 22 10:42:15 f12 CTS: Test StartOnebyOne:       {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 38}
 Jan 22 10:42:15 f12 CTS: Test SimulStart:          {'auditfail': 0, 'failure': 1, 'skipped': 0, 'calls': 46}
 Jan 22 10:42:15 f12 CTS: Test SimulStop:           {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 35}
 Jan 22 10:42:15 f12 CTS: Test StopOnebyOne:        {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 26}
 Jan 22 10:42:15 f12 CTS: Test RestartOnebyOne:     {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 43}
 Jan 22 10:42:15 f12 CTS: Test PartialStart:        {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 29}
 Jan 22 10:42:15 f12 CTS: Test Standby:             {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 27}
 Jan 22 10:42:15 f12 CTS: Test ResourceRecover:     {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 29}
 Jan 22 10:42:15 f12 CTS: Test ComponentFail:       {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 29}
 Jan 22 10:42:15 f12 CTS: Test Reattach:            {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 35}
 Jan 22 10:42:15 f12 CTS: Test SpecialTest1:        {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 40}
 Jan 22 10:42:15 f12 CTS: Test NearQuorumPoint:     {'auditfail': 0, 'failure': 0, 'skipped': 0, 'calls': 27}
 Jan 22 10:42:15 f12 CTS: <<<<<<<<<<<<<<<< TESTS COMPLETE

Interpreting Results

Failures

Audit Failures

BadNews

Reporting Problems

If in doubt, report any bugs or suspicious results in Bugzilla:

 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Please add a hb_report archive to the bug as well as the relevant part of the test output.

Creating an archive is simple: just use the -f option to pass the test number to hb_report and invoke it on the test master, e.g.:

  hb_report -l /var/log/messages -f cts:140 /tmp/cts-test-140

Feature Releases

Automated Testing

For the Heartbeat cluster stack

  • 2-nodes : 1000 Iterations
  • 4-nodes : 1000 Iterations
  • 6-nodes : 1000 Iterations

For the OpenAIS cluster stack

  • 2-nodes : 1000 Iterations
  • 4-nodes : 1000 Iterations
  • 8-nodes : 1000 Iterations

Total Iterations: 6,000

Total Estimated Cluster Transitions: 23,000

Manual Testing

    • crm_mon

Regression Testing

  • Perform Policy Engine Regression tests
  • Perform CLI Regression tests
    • cibadmin
    • crm_standby
    • crm_attribute
    • crm_failcount
    • crm_resource

Maintenance Releases

Automated Testing

For the Heartbeat cluster stack

  • 2-nodes : 500 Iterations
  • 4-nodes : 500 Iterations

For the OpenAIS cluster stack

  • 2-nodes : 500 Iterations
  • 4-nodes : 500 Iterations
  • 8-nodes : 500 Iterations

Total Iterations: 2,500

Total Estimated Cluster Transitions: 10,000

Manual Testing

  • TBA

Regression Testing

  • Perform Policy Engine Regression tests
  • Perform CLI Regression tests
    • cibadmin
    • crm_standby
    • crm_attribute
    • crm_failcount
    • crm_resource