Pacemaker redesign possibilities
Developers who work with the Pacemaker code base may use this as a forum for discussion of possible changes in Pacemaker code design choices, including pinpointing existing and sometimes long-standing design flaws.
- 1 Validate version compatibility before respawning failed components
- 2 Truly asynchronous (i.e. out-of-mainloop) processing of certain events
- 3 Formalizing and exposing inter-component IPC XML schema
- 4 Provide more support to "natively" clustered applications
- 5 Relatively large codebase without any unit test confinements
- 6 General lack of dedicated "node health checking" concept
- 7 Very superficial systemd integration
- 8 Suboptimal compression method choice
- 9 Not anywhere close to conceptually/formally verified design
- 10 Deliberate bending of the standards meant to be supported
- 11 Ad-hoc design decisions
- 12 [flaw class] Preoccupation regarding properties of abstract data types
- 13 Bright points
Validate version compatibility before respawning failed components
poki: The Pacemaker main process, pacemakerd, will try to restart any auxiliary daemon when a non-bailing-out failure of that daemon is detected. What's not accounted for, however, is the situation where the daemon's binary has been updated in the meantime, meaning that version consistency is possibly broken -- the inter-daemon API and its assumptions are not guaranteed to be stable; for the most part, they are private to a given release.
Possible solutions include:
- ensure "forever" compatibility, backed by extensive testing
- make the initial protocol exchange carry version numbers and conditionalize behaviour on them; have pacemakerd check the version string and/or a build-unique seed value/cookie for the daemon in question first
- remember a checksum of the binary when it is run for the first time, then recheck it before any respawn (checksums can also be hardcoded, strictly requiring binaries from the very same build, but that'd be problematic for a couple of reasons: debug symbol stripping, prelink, ...); alternatively, just file metadata (file size, mtime) can be taken into account (as sketched below), or the digest of the .bss section, which might be constant
- run the full circle: stop all other daemons, reexec pacemakerd, and bring the daemons back up, at which point they should reliably understand each other, since they are all from the same release/build; graceful marshalling of the daemons' internal state may be desired, e.g. to skip reprobing and in turn to speed such a holistic recovery up
- RPM only: make the installation touch /run/pacemaker/.updated (generally, this must reside on volatile memory storage), and have pacemakerd check for that file prior to re-executing any daemon; at startup, it would ensure the file is not present
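To make the file-metadata variant concrete, here is a minimal sketch (helper names are hypothetical, not Pacemaker API): the supervisor records the subdaemon binary's size and mtime at first spawn and compares them before any respawn; on mismatch, it would prefer the heavier full-circle recovery over a blind restart.

```c
/* Minimal sketch, hypothetical helpers: detect that a subdaemon binary
 * changed on disk between its first spawn and a respawn attempt. */
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

struct binary_id {
    off_t size;     /* st_size at first spawn */
    time_t mtime;   /* st_mtime at first spawn */
};

/* capture file metadata; false if the binary cannot be stat'ed */
static bool binary_id_get(const char *path, struct binary_id *id)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return false;
    }
    id->size = st.st_size;
    id->mtime = st.st_mtime;
    return true;
}

/* true if a plain respawn is safe, i.e. the binary looks unchanged */
static bool respawn_is_safe(const char *path, const struct binary_id *first)
{
    struct binary_id now;

    return binary_id_get(path, &now)
           && now.size == first->size
           && now.mtime == first->mtime;
}
```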
Truly asynchronous (i.e. out-of-mainloop) processing of certain events
poki: Pacemaker daemons are built on "main loops" as their bottom-most building blocks. That's fine except for cases when there are really urgent inputs to process -- delivered signals in particular. The problem arises when one of the loop's callbacks gets stuck in tight cycling, never returning back to the loop's control logic. Even if it's not an outright infinite loop, this breaks down any notion of assuredly timely fairness in input handling.
Some daemons are naturally cross-connected with their peers, so for them, remote liveness attestation is possible (if not performed already, not sure now). More trouble is with the "local-only" terminal processors (e.g. pacemaker-execd). For them, only the following solutions occur to me:
- promote pacemakerd to a proper supervisor (finally!), attesting liveness of its surrogates directly, through IPC or signals(?), and restarting them when they appear stuck
- use alarm(2) and have the respective SIGALRM handled outside of the mainloop (i.e. fully asynchronously); the handler would check whether any priority signals (SIGTERM, or perhaps whichever signal) are pending (->trigger == TRUE?), and when that happens twice (or so) in a row, it would trigger the signal's handler directly (possibly integrated directly into the customized main loop); a sketch follows below
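A minimal sketch of the alarm(2) variant (flag names and the strike threshold are hypothetical): since SIGALRM keeps being delivered even while a mainloop callback spins, its handler can observe that a previously received SIGTERM was never serviced and act on its own.

```c
/* Minimal sketch: SIGALRM as an out-of-mainloop watchdog for a stuck loop. */
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t term_pending = 0; /* set on SIGTERM */
static volatile sig_atomic_t term_strikes = 0; /* alarm ticks it stayed pending */

static void on_term(int sig)
{
    (void) sig;
    term_pending = 1;   /* the mainloop is expected to notice this and exit */
}

static void on_alarm(int sig)
{
    (void) sig;
    if (term_pending && ++term_strikes >= 2) {
        _exit(1);       /* mainloop never got to it: handle it directly */
    }
    alarm(5);           /* re-arm the watchdog tick */
}

int main(void)
{
    signal(SIGTERM, on_term);
    signal(SIGALRM, on_alarm);
    alarm(5);
    for (;;) {
        pause();        /* stand-in for the real (possibly stuck) mainloop */
    }
}
```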
Formalizing and exposing inter-component IPC XML schema
poki: Given the presence of a messaging protocol as the lowest-level interface to pacemaker, external "programmatic" users of pacemaker shall not be reliant on the (to this day unstable) pacemaker libraries or, conversely, the overhead-imposing CLI interface (which is not too stable either), since a subset of these messages is forcibly stable anyway, so as to serve various versions of pacemaker-remote. With properly documented messages (actionable formats, incl. RelaxNG schemas or custom purpose-specific declarative forms, strongly preferred!), people could write their own custom bindings, just as happens with HTTP/REST, XML-RPC, etc., without relying on a single binary bit from pacemaker itself (see the validation sketch below).
Triggering and related discussion (from that message onward): GH PR #1603 comment
Note that point 1. above is also highly relevant: with stable enough messages, a daemon-wise granular update would be (more of) a non-issue.
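To illustrate what published schemas would buy external consumers, a minimal sketch using only stock libxml2 (the schema and message file names are hypothetical): any binding author could validate a message without touching a single pacemaker library.

```c
/* Minimal sketch: validate an IPC message against a published RelaxNG
 * schema using nothing but libxml2. File names are hypothetical. */
#include <libxml/parser.h>
#include <libxml/relaxng.h>

int validate_message(const char *msg_file, const char *rng_file)
{
    int rc = -1;
    xmlDocPtr doc = xmlReadFile(msg_file, NULL, 0);
    xmlRelaxNGParserCtxtPtr pctxt = xmlRelaxNGNewParserCtxt(rng_file);
    xmlRelaxNGPtr schema = (pctxt != NULL) ? xmlRelaxNGParse(pctxt) : NULL;
    xmlRelaxNGValidCtxtPtr vctxt =
        (schema != NULL) ? xmlRelaxNGNewValidCtxt(schema) : NULL;

    if (doc != NULL && vctxt != NULL) {
        rc = xmlRelaxNGValidateDoc(vctxt, doc);  /* 0 == valid */
    }

    if (vctxt != NULL) xmlRelaxNGFreeValidCtxt(vctxt);
    if (schema != NULL) xmlRelaxNGFree(schema);
    if (pctxt != NULL) xmlRelaxNGFreeParserCtxt(pctxt);
    if (doc != NULL) xmlFreeDoc(doc);
    return rc;
}
```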
Provide more support to "natively" clustered applications
poki: Seemingly, more and more software components will rather implement their own clustering layer, slowly obviating the need for heavy-weight cluster stacks. Sadly, this "niche" has been completely ignored so far, with the exception of corosync, which will allow such interested external projects to progress pretty far. Still, they would need to deliver/reinvent the high-level primitives of distributed systems:
- coordinator election
- convenient wrappers for 2-phase/3-phase commits
- unified monitoring
Pacemaker could accommodate these needs, together with corosync making for an attractive suite to power what I call "private clustering".
This would be a bit akin to the sd_notify library function, which makes quite a difference for programs run under the supervision of systemd (this itself is something that would be really useful had it originally been pioneered by OCF, since a reliable readiness protocol is an everlasting problem, even more so when the ordering of resources needs to be as *reliable* as possible, as otherwise it's hardly HA...); a sketch follows the discussion link below.
Triggering and related discussion (from that message onward): clusterlabs users ML post
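For reference, a minimal sketch of that readiness pattern (link against libsystemd; only meaningful when a NOTIFY_SOCKET-providing supervisor is present): the service itself declares the exact moment it is ready, which is precisely the kind of protocol reliable resource ordering would need.

```c
/* Minimal sketch: explicit readiness notification via sd_notify(3). */
#include <systemd/sd-daemon.h>

int main(void)
{
    /* ... perform all initialization first ... */

    /* unset_environment == 0 keeps NOTIFY_SOCKET visible to children */
    sd_notify(0, "READY=1\nSTATUS=Serving requests");

    /* ... main service loop ... */
    return 0;
}
```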
Relatively large codebase without any unit test confinements
While it's highly desirable that there are many component-level tests, the lack of finer-grained unit tests makes the code base suffer from:
- no thinking about corner cases ahead of time, and no subsequent systematic exercising of them (rather a fantasy with component-level tests, since they, at best, do not cover possible future contexts of use for particular functions)
- no controlled verification that assumptions set forth originally are not violated later
- no more-trustworthy-than-any-documentation-ever way to grasp the external impact (and in turn, the intentions) of the function at hand
The net result is suboptimal maintenance of the code, with less confidence about changes and their related fallout. Especially since maintenance and development are shared amongst multiple people, the integrity of projected vs. actual effects at a rather atomic level becomes important. A minimal example of the intended granularity follows.
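A minimal sketch of the kind of corner-case unit test meant here, using plain assert; the function under test (parse_interval_ms) is a hypothetical stand-in for any small pacemaker utility function whose edge cases deserve pinning down.

```c
/* Minimal sketch: corner-case unit test with plain assert(); the function
 * under test is a hypothetical toy, not pacemaker code. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* toy: parse "<non-negative number>s" into milliseconds, -1 on error */
static long parse_interval_ms(const char *spec)
{
    char *end = NULL;
    long val;

    if (spec == NULL || *spec == '\0') {
        return -1;
    }
    val = strtol(spec, &end, 10);
    if (val < 0 || strcmp(end, "s") != 0) {
        return -1;
    }
    return val * 1000;
}

int main(void)
{
    assert(parse_interval_ms("10s") == 10000);
    assert(parse_interval_ms("0s") == 0);
    /* the corner cases that component-level tests rarely reach: */
    assert(parse_interval_ms(NULL) == -1);
    assert(parse_interval_ms("") == -1);
    assert(parse_interval_ms("-5s") == -1);
    assert(parse_interval_ms("10") == -1);
    return 0;
}
```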
General lack of dedicated "node health checking" concept
- external: while many intentions can be realized using resources and/or by setting #health-prefixed node attributes, finer state-change triggers are not natively available, e.g. for deciding whether to reboot/fence the node, without resorting to rule-conditionalizing new resources solely meant as one-off action triggers
- internal: it'd be vital to check some conditions implicitly, e.g., when supporting systemd, that systemd itself hasn't failed, which is currently only recoverable with a machine reboot (see the example of a reliably provoked systemd crash until it was fixed, which won't necessarily manifest immediately, but silently, with things like being blocked forever on PAM-related callouts)
Very superficial systemd integration
- see the health-checking point above (there's also the question of using a watchdog: the real one and its "multiplexing" by systemd)
- it perhaps makes sense to actively block units that are not supposed to be running in pacemaker's predestined state of affairs, see the related ML post
- what about a priori deadlock/circular-dependency detection? e.g., see the Restart=on-failure impeachment for taking over pacemaker's responsibilities -- does it impact any assumptions on pacemaker's side?
Suboptimal compression method choice
Bzip2 is not entirely best-in-class in any aspect: compression ratio, speed, resource usage. Moreover, there are ways to do substantially better, and it is also worth noting that many compression methods allow for pre-populated dictionaries, which could be utilized easily, since the tag names are known at build time at the latest (possibly gaining substantially better compression at low-end settings of the algorithms? see the sketch after the list below).
- respective ML post
- Linux kernel possibly ditching bzip2
- one of many benchmarks
Design-wise, there needs to be:
- a clear path for quick recognition of the compression used
- some sort of handshake to negotiate the optimum for both communicating sides
- consider using fast compression/decompression also for big local IPC chunks?
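A minimal sketch of the pre-populated dictionary idea, assuming zstd purely as one plausible candidate (the dictionary content is illustrative, not a tuned sample); note that both communicating sides must agree on the exact same dictionary, which ties back to the recognition/handshake points above.

```c
/* Minimal sketch: zstd compression with a build-time dictionary of known
 * XML tag names; illustrative only, not a tuned dictionary. */
#include <stdio.h>
#include <zstd.h>

static const char dict[] =
    "<cib><configuration><nodes><resources><constraints><status>";

int main(void)
{
    const char msg[] =
        "<cib><configuration><nodes/><resources/></configuration></cib>";
    char dst[256];
    ZSTD_CCtx *cctx = ZSTD_createCCtx();

    size_t n = ZSTD_compress_usingDict(cctx, dst, sizeof(dst),
                                       msg, sizeof(msg) - 1,
                                       dict, sizeof(dict) - 1,
                                       1 /* fast, low-end level */);
    if (!ZSTD_isError(n)) {
        printf("compressed %zu -> %zu bytes\n", sizeof(msg) - 1, n);
    }
    ZSTD_freeCCtx(cctx);
    return 0;
}
```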
Not anywhere close to conceptually/formally verified design
Pacemaker fails to assurably meet basic distributed-computing reliability criteria (both inner, for its own governance, and outer, for the robust distributed systems built atop it) that otherwise seem to be implicitly assumed:
- lack of fail-safe considerations throughout the design(!)
- corosync provides only relatively weak guarantees regarding extended virtual synchrony, which threatens to break the current state of pacemaker's integration fully apart (universal guarantees, or generally none at all), see also the ML discussion
These properties are next to impossible to check in practice; hence the established practice is to have the concept/design verified/formally proved first, and only then to metamorphose that into the final executable form. Pacemaker lacks any evidence that the former was ever performed, making the current implementation encumbered with significant technical debt and devoid of any guarantees in the strict sense of the word.
Deliberate bending of the standards meant to be supported
- unique annotation for the OCF agent parameters (per its meta-data):
Ad-hoc design decisions
- it is suspicious that there are cycles/redundant channels in the data flow amongst the local daemons (e.g., fenced attaches back to cib, whereas the natural flow would be for such updates to be relayed to fenced as appropriate); since there's no ordering guaranteed between the redundant messaging channels, this is a hazard that would be eliminated with due attention to the design
[flaw class] Preoccupation regarding properties of abstract data types
Use of the hash table implementation from glib2 is ubiquitous in pacemaker, yet some of its properties (e.g. the lack of any sort of ordering stability beyond store-fetch) were apparently neglected/naively assumed for design purposes. While such unpronounced properties used to hold (by mere luck), headaches emerged once they stopped holding. The biggest problem is the general undecidability of the actual impact -- of what all may go wrong because of this; getting that picture is a tonne of extra work. Also, doubly-linked lists are used where the extra reverse-chain tracking is plain superfluous and singly-linked lists would be a better choice (getting back to the theme of uninformed use of data types).
Sadly, it doesn't end with true abstract types; the same uninformed approach is also applied when dealing with strings. For instance, for the purpose of compressing interchanged XMLs, the in-memory XML tree is first serialized, then passed into the BZip2 library. Instead of having libxml return the size of the effective buffer, a redundant and inefficient strlen is applied to the returned buffer instead (see the sketch below).
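A minimal sketch of the point, using the stock libxml2 and bzip2 APIs (the wrapper function itself is hypothetical): xmlDocDumpMemory already reports the serialized length, so re-scanning the buffer with strlen is pure waste.

```c
/* Minimal sketch, hypothetical wrapper: serialize an XML doc and compress
 * it with bzip2, reusing the length libxml2 reports instead of strlen. */
#include <bzlib.h>
#include <libxml/tree.h>

unsigned int compress_doc(xmlDocPtr doc, char *dst, unsigned int dst_len)
{
    xmlChar *buf = NULL;
    int buf_len = 0;

    xmlDocDumpMemory(doc, &buf, &buf_len);  /* length comes back for free */
    /* wasteful alternative: buf_len = strlen((const char *) buf); */

    if (buf == NULL
        || BZ2_bzBuffToBuffCompress(dst, &dst_len, (char *) buf,
                                    (unsigned int) buf_len,
                                    9, 0, 0) != BZ_OK) {
        dst_len = 0;
    }
    xmlFree(buf);
    return dst_len;
}
```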
poki: Enough of the dark mood; there are some decisions that time can only confirm:
- using XML format as opposed to:
- human-unreadable binary struct sequences (no external toolings, no algebra-like formalisms and hence no full predictability, cf. RelaxNG, XSLT, ...), which would require one to build (and maintain!) a tree-like hierarchy of atomic parts in memory anyway (unlike the conveniently available DOM tree built by an XML parser)
- fancy relaxed/downright overkill structured encodings like YAML (see also lobste.rs discussion, and mysterious issue fix)
- OTOH: XML is plain overkill for internal data exchange, but this use is understandable since it substantially predates binary formalisms like Protocol Buffers; this may change in the future outside of the CIB itself, see the discussion wrt. the external data exchange format