Release Testing

Ensuring Quality
Each release undergoes a series of automated and manual tests to ensure the quality of the finished product. To maximize our ability to find bugs before users do, we conduct a battery of tests designed to exercise as much functionality as possible. These tests emphasize variety over raw quantity; however, with each feature release requiring approximately a week of round-the-clock testing, there is plenty of quantity too.

Types of variety introduced by Pacemaker testing
 * Two different cluster stacks
 * Under/over-powered cluster nodes
 * Virtual and non-virtual machines
 * Small and large clusters
 * Order of test-cases chosen at random
 * Manual and automatic testing

Manual Testing
Manual testing consists of performing the regression suite tests and exercising hard-to-script tests by hand.

The main suite of regression tests is for the Policy Engine, which was designed to be highly suited to this type of testing. Known inputs, representing past problems or tests for specific features, are fed to the PE, and its outputs are compared to previously recorded ones. To facilitate this, the running PE saves the inputs it operated on so that they can later be used for analysis. If the analysis indicates a bug, the test case is added to the regression set to ensure the problem does not recur.
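The compare-against-recorded-outputs loop described above can be sketched as follows. This is illustrative only, not the actual Pacemaker harness: the `ptest` replay command, the `.xml`/`.expected` file naming, and the directory layout are all assumptions.

```python
#!/usr/bin/env python
"""Minimal sketch of a PE-style regression harness (illustrative only).

Assumptions: each test case has a saved CIB input (<name>.xml) and a
previously recorded expected output (<name>.expected); "ptest" stands
in for whatever command replays an input through the Policy Engine.
"""
import glob
import os
import subprocess


def run_case(input_xml):
    # Replay a saved input through the Policy Engine and capture its output.
    # The "ptest -x <input>" invocation is an assumption for this sketch.
    result = subprocess.run(["ptest", "-x", input_xml],
                            capture_output=True, text=True)
    return result.stdout


def regression_suite(case_dir):
    failures = []
    for input_xml in sorted(glob.glob(os.path.join(case_dir, "*.xml"))):
        expected_file = input_xml[:-4] + ".expected"
        with open(expected_file) as f:
            expected = f.read()
        if run_case(input_xml) != expected:
            failures.append(input_xml)  # diverged from the recorded output
    return failures
```

When a new bug is diagnosed, its saved input and the corrected output simply become one more `.xml`/`.expected` pair in the case directory, which is how the regression set grows over time.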

Automated Testing
Automated testing is done with CTS, a Python cluster test suite originally written to test the Heartbeat 2-node cluster manager. CTS defines a series of test cases, which are performed in random order, and a number of audits that are performed after each test is executed. Most tests look for known patterns in a centralized log file (typically syslog-ng is used to send logs to the test master) and, at the conclusion of each test, CTS also scans the logs and reports any entries matching BadNews patterns, which may indicate a problem.
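The overall shape of that loop — random test selection, audits after every test, and a BadNews scan of the new log lines — can be sketched like this. This is not actual CTS code; the callable-based interfaces and the example patterns in `BAD_NEWS` are assumptions.

```python
"""Sketch of a CTS-style test loop (illustrative, not actual CTS code).

Assumptions: each test is a callable, each audit is a callable that
returns True on success, and get_new_log_lines() yields the centralized
log lines produced since the previous test.
"""
import random
import re

# Example patterns only; the real BadNews list is much more specific.
BAD_NEWS = [r"ERROR:", r"CRIT:", r"core dump"]


def run_iterations(tests, audits, iterations, get_new_log_lines, rng=random):
    problems = []
    for i in range(iterations):
        test = rng.choice(tests)          # tests run in random order
        test()
        # Every audit runs after every test.
        for audit in audits:
            if not audit():
                problems.append((i, audit.__name__))
        # Scan the new log lines for BadNews patterns and report matches.
        for line in get_new_log_lines():
            if any(re.search(p, line) for p in BAD_NEWS):
                problems.append((i, line))
    return problems
```

Running the audits after every single test, rather than once at the end, is what lets a failure be tied back to the specific test that triggered it.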

{| cellpadding=1
|+ List of Automated Test Cases
! Test !! Description
|-
| StopTest || Stop a node if it is running
|-
| StartTest || Start a node if it is stopped
|-
| FlipTest || Stop a node if it is running, start it if it was stopped
|-
| RestartTest || Stop a node then start it again
|-
| StonithTest ||
|-
| StonithdTest ||
|-
| StartOnebyOne || Make sure all nodes are stopped, then start them in order
|-
| SimulStart || Make sure all nodes are stopped, then start them all at once
|-
| SimulStop || Make sure all nodes are running, then stop them all at once
|-
| StopOnebyOne || Make sure all nodes are running, then stop them in order
|-
| RestartOnebyOne || Make sure all nodes are running, then execute the RestartTest on each in order
|-
| PartialStart || Start a node and then, before it finishes starting up, tell it to shut down
|-
| StandbyTest || Place a node in standby mode (check that resources are migrated away), then take it out of standby (check that resources migrate back)
|-
| ResourceRecover || Kill a resource and make sure the cluster recovers it
|-
| ComponentFail || Kill a cluster component (TE, PE, CIB, CRMd, ...) and make sure the cluster recovers
|-
| SpecialTest1 ||
|-
| NearQuorumPointTest ||
|}

{| cellpadding=1
|+ List of Automated Post-Test Audits
! Audit !! Description
|-
| LogAudit || Check that centralized logging is functional
|-
| DiskAudit || Check that each node is not out of disk space
|-
| ResourceAudit || Try to verify the location of cluster resources
|-
| CrmdStateAudit || Check that there is only one DC per partition
|-
| CIBAudit || Verify that the CIB is synchronized between nodes
|-
| PartitionAudit || Check that the cluster membership is consistent (and that only one partition exists)
|}
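As a concrete illustration of what one of these audits checks, the CrmdStateAudit invariant (exactly one Designated Controller per partition) can be sketched as below. The data structure is an assumption for the sketch; the real audit derives this information from cluster status output.

```python
"""Sketch of a CrmdStateAudit-style check (illustrative only).

Assumption: `partitions` maps a partition id to a list of
(node, is_dc) pairs observed in that partition.
"""


def crmd_state_audit(partitions):
    # Each partition must have exactly one Designated Controller (DC):
    # zero DCs means the partition is headless, two or more is a conflict.
    for pid, members in partitions.items():
        dcs = [node for node, is_dc in members if is_dc]
        if len(dcs) != 1:
            return False
    return True
```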

Setting Up CTS
A new tool has been written to simplify the process of setting up CTS and verifying existing CTS installations.

It can be found at: http://hg.clusterlabs.org/pacemaker/dev/file/tip/cts/cluster_test

Please send feedback via the mailing list.

Essentially, it:
 * Sets up remote logging from the cluster nodes to the test master using syslog-ng
 * Sets up password-less ssh access from the test master to the cluster nodes
 * Asks for the details of your fencing device(s)
 * Gives you the command to initiate testing

Automated Testing
For the Heartbeat cluster stack
 * 2-nodes : 1000 Iterations
 * 4-nodes : 1000 Iterations
 * 6-nodes : 1000 Iterations

For the OpenAIS cluster stack
 * 2-nodes : 1000 Iterations
 * 4-nodes : 1000 Iterations
 * 8-nodes : 1000 Iterations

Total Iterations: 6,000

Total Estimated Cluster Transitions: 23,000

Manual Testing

 * crm_mon

Regression Testing

 * Perform Policy Engine Regression tests
 * Perform CLI Regression tests
 * cibadmin
 * crm_standby
 * crm_attribute
 * crm_failcount
 * crm_resource

Automated Testing
For the Heartbeat cluster stack
 * 2-nodes : 500 Iterations
 * 4-nodes : 500 Iterations

For the OpenAIS cluster stack
 * 2-nodes : 500 Iterations
 * 4-nodes : 500 Iterations
 * 8-nodes : 500 Iterations

Total Iterations: 2,500

Total Estimated Cluster Transitions: 10,000

Manual Testing

 * TBA

Regression Testing

 * Perform Policy Engine Regression tests
 * Perform CLI Regression tests
 * cibadmin
 * crm_standby
 * crm_attribute
 * crm_failcount
 * crm_resource