Pacemaker redesign possibilities

From ClusterLabs

Developers who work with the Pacemaker code base may use this page as a forum for discussing possible changes to Pacemaker's design choices, including pinpointing existing, and sometimes long-standing, design flaws.

Validate version compatibility before respawning failed components

poki: The Pacemaker main process, pacemakerd, will try to restart any auxiliary daemon once it detects that the daemon failed in a way that does not call for bailing out. What is not accounted for, however, is the situation where the daemon's binary has been updated in the meantime, meaning version consistency is possibly broken -- the inter-daemon API and assumptions are not guaranteed to be stable; for the most part, they are private to a given release.

Possible solutions, either:

  • ensure "forever" compatibility, backed by extensive testing
  • have the initial protocol exchange carry version numbers, and conditionalize behavior on them
  • make pacemakerd first check the version string and/or a build-unique seed value/cookie of the daemon in question
  • remember the checksum of the binary when it is run for the first time, then recheck it before respawning (checksums can also be hardcoded, strictly requiring binaries from the very same build, but that'd be problematic for a couple of reasons: debug symbol stripping, prelink, ...); alternatively, just file metadata (file size, mtime) can be taken into account, or the digest from .bss section, which might be constant
  • run the full circle: stop all other daemons, reexec pacemakerd, bring the daemons back up, at which point they should assuredly understand each other since they should all be from the same release/build; graceful marshalling of the daemons' internal state may be desired, e.g. to skip reprobing etc. and in turn to speed such a holistic recovery up
  • RPM only: make installation touch /run/pacemaker/.updated (generally, this must reside on volatile storage), then pacemakerd will check for that file prior to re-executing any daemon; at startup, it would ensure that the file is not present
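To make the "remember, then recheck" option more concrete, here is a minimal sketch in Python (chosen for brevity only; Pacemaker itself is C). The names Fingerprint, fingerprint and RespawnGuard are hypothetical and do not exist in Pacemaker. It combines the file metadata variant (size, mtime) with a content digest taken at first launch, so it sidesteps the stripping/prelink concern that applies to checksums hardcoded at build time.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Identity of a daemon binary as observed when it was first launched."""
    size: int
    mtime_ns: int
    sha256: str


def fingerprint(path):
    """Collect cheap file metadata plus a content digest for the given file."""
    st = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as binary:
        for chunk in iter(lambda: binary.read(65536), b""):
            digest.update(chunk)
    return Fingerprint(st.st_size, st.st_mtime_ns, digest.hexdigest())


class RespawnGuard:
    """Remembers fingerprints at first start and rechecks them before a respawn."""

    def __init__(self):
        self._known = {}  # path -> Fingerprint recorded at first start

    def record_first_start(self, path):
        self._known[path] = fingerprint(path)

    def safe_to_respawn(self, path):
        # Binaries never recorded and binaries that changed are both refused.
        known = self._known.get(path)
        return known is not None and fingerprint(path) == known
```

A supervisor would call record_first_start() when it first launches a daemon and consult safe_to_respawn() before restarting it; on a mismatch it could fall back to the "full circle" option instead of respawning a single daemon.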

Truly asynchronous (i.e. out-of-mainloop) processing of certain events

poki: Pacemaker daemons are built on "main loops" as the bottom-most building blocks. That's fine except for cases when there are really urgent inputs to process; here I mean delivered signals in particular. The problem arises when one of the loop's callbacks gets stuck in a tight cycle, never returning to the loop's control logic. Even if it's not an outright infinite loop, this breaks any notion of assuredly timely, fair handling of inputs (non-starvation). Some daemons are naturally cross-connected with their peers, so for them, remote liveness attestation is possible (if not performed already, not sure now). The "local-only" terminal processors (e.g. pacemaker-execd) are more troublesome. For them, only the following solutions occur to me:

  • promote pacemakerd to a proper supervisor (finally!), attesting liveness of its surrogates directly, through IPC or signals(?), and restarting them when they appear stuck
  • use alarm(2) and have the respective SIGALRM handled outside of the mainloop (i.e. fully asynchronously); the handler would check whether any priority signals (SIGTERM, or perhaps any signal) are pending (->trigger == TRUE?), and when that happens twice (or so) in a row, it would trigger the signal's handler directly (possibly integrated directly into the customized main loop)
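The second option can be sketched roughly as follows, again in Python purely for illustration. StallWatch and install are hypothetical names, the "pending" flag stands in for the ->trigger field mentioned above, and the threshold of two ticks follows the "twice (or so)" wording. In C, the SIGALRM handler would additionally be limited to async-signal-safe operations, which this sketch does not model.

```python
import signal


class StallWatch:
    """Fires an urgent handler out of band when the main loop fails to do so."""

    def __init__(self, on_urgent, max_pending_ticks=2):
        self.on_urgent = on_urgent              # e.g. the SIGTERM handler proper
        self.max_pending_ticks = max_pending_ticks
        self.pending = False                    # set on signal delivery
        self._pending_ticks = 0                 # consecutive ticks seen pending

    def note_signal(self, signum=None, frame=None):
        """Ordinary signal handler: only marks the signal as pending."""
        self.pending = True

    def loop_iterated(self):
        """Called once per main-loop iteration: the normal dispatch path."""
        if self.pending:
            self.pending = False
            self.on_urgent()

    def tick(self, signum=None, frame=None):
        """SIGALRM handler: runs no matter what the loop callbacks are doing."""
        self._pending_ticks = self._pending_ticks + 1 if self.pending else 0
        if self._pending_ticks >= self.max_pending_ticks:
            # The loop had at least one full interval to react and did not.
            self._pending_ticks = 0
            self.pending = False
            self.on_urgent()


def install(watch, interval=1.0):
    """Wire the watch to real signals (POSIX only, main thread only)."""
    signal.signal(signal.SIGTERM, watch.note_signal)
    signal.signal(signal.SIGALRM, watch.tick)
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
```

A healthy loop calls loop_iterated() often enough that tick() never sees the flag set twice in a row; only a stuck callback lets the counter reach the threshold. The interval therefore bounds how long a priority signal can be starved.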

Formalizing and exposing inter-component IPC XML schema

poki: Given that a messaging protocol is the lowest-level interface to pacemaker, external "programmatic" users of pacemaker should not have to rely on the (to this day unstable) pacemaker libraries or, conversely, on the overhead-imposing CLI (which is not too stable either), since a subset of these messages is necessarily stable in order to serve various versions of pacemaker-remote. With properly documented messages (actionable formats incl. RelaxNG schemas or custom purpose-specific declarative forms strongly preferred!), people could write their own bindings, just as happens with HTTP/REST, XML-RPC, etc., without relying on a single binary bit from pacemaker itself.

Triggering and related discussion (that message and on): GH PR #1603 comment

Note that the first topic above (validating version compatibility before respawning) is also highly relevant: with stable enough messages, daemon-wise granular updates would be (more of) a non-issue.
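As an illustration of what a "purpose-specific declarative form" could enable, the sketch below describes two messages as plain data and uses only that description to build and check them. Everything named here is hypothetical: example_request, example_reply and their attributes are invented for the example and are not taken from Pacemaker's actual protocol. A RelaxNG schema would be the preferred form; a dictionary stands in only because the Python standard library has no RelaxNG validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical declarative description: element name -> required attributes.
# None of these names come from Pacemaker's real messages.
MESSAGE_FORMS = {
    "example_request": {"version", "operation", "call_id"},
    "example_reply": {"version", "call_id", "rc"},
}


def build_message(name, **attrs):
    """Serialize one message (as bytes) from the documented form alone."""
    return ET.tostring(ET.Element(name, {k: str(v) for k, v in attrs.items()}))


def validate_message(raw):
    """Return a list of problems; an empty list means the message conforms."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    required = MESSAGE_FORMS.get(root.tag)
    if required is None:
        return [f"unknown message type: {root.tag}"]
    missing = sorted(required - set(root.attrib))
    return [f"missing attribute: {name}" for name in missing]
```

The point is that neither function links against anything from pacemaker: the published description of the messages is the only shared artifact between the two sides.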

Provide more support to "natively" clustered applications

poki: Seemingly, more and more software components will rather implement their own clustering layer, slowly obviating the need for heavy-weight cluster stacks. Sadly, this "niche" has been completely ignored so far, with the exception of corosync, which allows such interested external projects to progress pretty far. Still, they would need to deliver/reinvent the high-level primitives of distributed systems:

  • coordinator election
  • convenient wrappers for 2-phase/3-phase commits
  • unified monitoring
  • fencing

Pacemaker could accommodate these needs, forming together with corosync an attractive suite to power what I call "private clustering".

This would be a bit akin to systemd exposing the sd_notify library function, which makes quite a difference for programs run under systemd's supervision (this itself is something that would have been really useful had it originally been pioneered by OCF, since a reliable readiness protocol is an everlasting problem, even more so when ordering of resources needs to be as *reliable* as possible, as otherwise it's hardly HA...).

Triggering and related discussion (that message and on): clusterlabs users ML post

Relatively large codebase without any unit test confinements

While it is welcome that there are many component-level tests, the lack of finer-grained unit tests makes the code base suffer from:

  • no thinking about the corner cases ahead of time, and subsequently exercising them systematically (rather a fantasy with component-level tests, since they, at best, do not cover possible future contexts of use for particular functions)
  • no controlled check that the assumptions set forth originally remain unviolated
  • no more-trustworthy-than-any-documentation-ever way of grasping the external impact (and in turn, the intentions) of the function at hand

The net result is suboptimal maintenance of the code, with less confidence about changes and their fallout. Especially since maintenance and development are shared amongst multiple people, agreement between projected and actual effects at a rather atomic level becomes important.

General lack of dedicated "node health checking" concept

  • external: while many intentions can be realized using resources and/or with setting #health prefixed node attributes, finer state change triggers are not natively available as to whether to reboot/fence the node for instance, without resorting to rule-conditionalizing new resources solely meant as one-off action triggers
  • internal: it'd be vital to check some conditions implicitly, e.g., when supporting systemd, that systemd itself hasn't failed, which is currently only recoverable with a machine reboot (see the example of a reliably provoked systemd crash until fixed, which won't necessarily manifest immediately, but silently, with things like being blocked on PAM-related callouts forever)

Very superficial systemd integration

  • see the healthchecking point above (there's also a question of using watchdog: real one and its "multiplexing" by systemd)
  • it perhaps makes sense to actively block units that are not supposed to be running in the state of affairs predetermined by pacemaker, see related ML post
  • what about a priori deadlock/circular dependency detection? e.g., see the criticism of Restart=on-failure for taking over pacemaker's responsibilities -- does it impact any assumptions on pacemaker's side?

Suboptimal compression method choice

Bzip2 is not best-in-class in any aspect: compression ratio, speed, or resource usage. There are ways to do substantially better. Also to be noted, many compression methods allow for pre-populated dictionaries, which could be utilized easily, since the tag names are known at build time at the latest (possibly gaining substantially better compression at the low-end settings of the algorithms?).

  • one of many benchmarks
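The pre-populated dictionary idea can be tried out with a few lines of Python. This is only a sketch under assumptions: zlib is used as a stand-in because its preset dictionary support ships with the Python standard library, not because it is the proposed replacement, and PRESET_DICT is an illustrative handful of fragments rather than an inventory of Pacemaker's XML vocabulary.

```python
import bz2
import zlib

# Assumed dictionary: fragments expected to recur in messages. Illustrative
# only; a real one would be generated from the tag names known at build time.
PRESET_DICT = (
    b'<cib><configuration><resources>'
    b'<primitive id="" class="ocf" provider="heartbeat" type=""/>'
    b'</resources></configuration></cib>'
)


def deflate(data, zdict=None):
    """Compress with zlib, optionally primed with a preset dictionary."""
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(data) + comp.flush()


def inflate(blob, zdict=None):
    """Inverse of deflate(); the very same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def compare(message):
    """Return the compressed size in bytes per method for one message."""
    return {
        "raw": len(message),
        "bzip2": len(bz2.compress(message)),
        "zlib": len(deflate(message)),
        "zlib+dict": len(deflate(message, PRESET_DICT)),
    }
```

Calling compare() on a short message made of the same tags shows the effect: the dictionary pays off most on small payloads, where a general-purpose compressor has not yet seen enough input to find repetitions. Both sides must hold an identical dictionary, which ties this back to the version consistency topic above.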

Design-wise, there needs to be:

  • a clear path for quick recognition of the compression used
  • some sort of hand-shake to negotiate optimum for both communicating sides
  • consideration of fast compression/decompression also for big local IPC chunks?
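The first two requirements could look roughly like this. The one-byte tags, the CODECS table and the function names are all hypothetical; the sketch only shows the shape of "tag first, then payload" framing and of a preference-ordered hand-shake.

```python
import bz2
import zlib

# Hypothetical one-byte method tags mapped to (compress, decompress) pairs.
CODECS = {
    b"Z": (zlib.compress, zlib.decompress),
    b"B": (bz2.compress, bz2.decompress),
    b"N": (bytes, bytes),  # no compression at all
}


def negotiate(ours, theirs):
    """Pick the first method, in our order of preference, the peer supports."""
    for tag in ours:
        if tag in theirs:
            return tag
    raise ValueError("no common compression method")


def pack(tag, payload):
    """Prefix the compressed payload with the tag of the method used."""
    compress, _ = CODECS[tag]
    return tag + compress(payload)


def unpack(frame):
    """Read the tag first, then decompress with the matching method."""
    tag, body = frame[:1], frame[1:]
    if tag not in CODECS:
        raise ValueError(f"unknown compression tag: {tag!r}")
    _, decompress = CODECS[tag]
    return decompress(body)
```

With the tag in front, the receiver never has to sniff the payload to find out how it was compressed, and a new method can be introduced by adding a table entry that older peers simply never agree to during negotiation.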

Not anywhere close to conceptually/formally verified design

Pacemaker fails to assuredly meet basic distributed computing reliability criteria (both inner, for its own governance, and outer, for the robust distributed systems built atop) that seem to be otherwise implicitly assumed:

  • lack of fail-safe considerations throughout the design(!)
    • in some circumstances, pacemaker may leave a running resource behind: ML post + more generic one
    • problems with "random" pacemaker process getting suspended: ML post
  • corosync provides just relatively weak guarantees regarding extended virtual synchrony, which threatens to break the current state of pacemaker's integration fully apart (universal guarantees, or generally none at all), see also ML discussion

These are next to impossible to check in practice, hence the established practice is to have the concept/design verified/formally proved first, and only then to transform it into the final executable form. Pacemaker lacks any evidence that the former was ever performed, leaving the current implementation encumbered with significant technical debt and without any guarantees in the strict sense of the word.

Ad-hoc design decisions

  • it is suspicious that there are cycles/redundant channels in the data flow amongst the local daemons (e.g. that fenced attaches back to cib, whereas the natural flow is that the scheduler feeds/instructs fenced as appropriate), since there's no ordering guaranteed between the redundant messaging channels -- that would be eliminated with due attention to the design

[flaw class] Preoccupation regarding properties of abstract data types

Use of the hash table implementation from glib2 is ubiquitous in pacemaker, yet some properties (e.g. lack of any sort of stability beyond store-fetch) were apparently neglected, or the opposite was naively assumed, for design purposes. While such unstated properties used to hold (by mere luck), headaches emerged once they stopped holding. The biggest problem is that it is generally undecidable what the actual impact is and what may go wrong because of this. Getting that picture is a tonne of extra work. Also, doubly-linked lists are used where the extra reverse chain tracking is plainly superfluous and singly-linked lists would be a better choice (getting back to the theme of uninformed use of data types).

Sadly, it doesn't end with true abstract types; the same uninformed approach is also used when dealing with strings. For instance, for the purpose of compressing interchanged XML, the in-memory XML tree is first serialized, then passed to the BZip2 library. Instead of having libxml return the effective size of the buffer, a redundant and inefficient strlen is applied to the returned buffer.

Bright points

poki: Enough of the dark mood; there are some decisions that time can only confirm:

  • using XML format as opposed to:
    • human-unreadable binary struct sequences (no external tooling, no algebra-like formalisms and hence no full predictability, cf. RelaxNG, XSLT, ...) requiring one to build (and maintain!) a tree-like hierarchy of atomic parts in memory anyway (unlike with the conveniently available DOM tree built by an XML parser)
    • fancy relaxed/downright overkill structured encodings like YAML (see also discussion, and mysterious issue fix)
    • OTOH: XML is plain overkill for internal data exchange, but this use is understandable given that it considerably predates binary formalisms like Protocol Buffers; this may change in the future outside of CIB itself, see discussion wrt. external data exchange format