Complex systems are prone to failure and, knowing that, we build into them every manner of defence against such outcomes. If we consider complex systems as networks, with the number and variety of nodes of any type serving as a rough measure of overall complexity, then the nodes, edges and paths all hold a binary potential: to push the system towards correct/expected behaviour, or towards failure. In the absence of design, each of these points or paths through the system would be equally likely to express either outcome, giving failure and success equal opportunity. In designed systems, we stack the odds to lower the overall probability of actual failure.
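To make the odds-stacking picture concrete, here is a minimal sketch (the node count and probabilities are invented for illustration): treat the system as a path of nodes, each of which independently resolves to success or failure, and compare an undesigned system with a designed one.

```python
import random

def run_system(node_success_probs):
    """Traverse a path of nodes; the run succeeds only if every node does."""
    return all(random.random() < p for p in node_success_probs)

def estimate_success(node_success_probs, trials=100_000):
    """Monte Carlo estimate of the whole system's success probability."""
    return sum(run_system(node_success_probs) for _ in range(trials)) / trials

N = 10
undesigned = [0.5] * N   # no design: each node as likely to fail as to succeed
designed   = [0.99] * N  # design stacks the odds at every node

print(f"undesigned: {estimate_success(undesigned):.4f}")  # ~0.5**10, about 0.001
print(f"designed:   {estimate_success(designed):.4f}")    # ~0.99**10, about 0.904
```

Ten coin-flip nodes succeed together about 0.1% of the time; stack each node to 99% and the whole path succeeds roughly 90% of the time. That gap is the work design does.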
I think we tend to forget about this ‘alter network’ and its ‘failure processing’; it bites us when we least expect it. It can be thought of as a sort of inverse of the understood, ‘as-designed’ system. It anti-operates / anti-runs alongside the latter, and is inextricably tangled with it. The best we can do is siphon power away from it.
A non-dualistic approach to system design & failure analysis
This way of seeing a system suggests that we might benefit from modelling incorrect behaviour in pretty much the same way that we model correct/required behaviour. We have a tendency to view correct behaviour as a synergy of positive, stable interactions between functioning subsystems. When it comes to faults, however, we like to think of them as somehow singular, restricted to a specific source or cause. In other words, failure is not seen as an equally concerted effort of different moving parts. We assume that failure has a fixed locus, and that faults are simply to be located and rooted out. And naturally, we look inside our system boundary first, as if the failure network ever had any interest in, or knowledge of, that boundary. Corollary: since the system boundary is for all intents and purposes a designed artifact, it follows that it is the designed (required) system that is aware of it. The failure system itself “couldn’t care less” about the big box around your system diagram.
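Taken literally (and this is only a hypothetical sketch, with an invented topology and node names), modelling failure non-dualistically might mean running two traversals over the same graph: one that honours the declared boundary, and one that does not.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    inside_boundary: bool                       # only the *designed* model respects this
    downstream: list = field(default_factory=list)

def designed_reach(start):
    """Model of correct behaviour: propagation stops at the system boundary."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n.name in seen or not n.inside_boundary:
            continue
        seen.add(n.name)
        stack.extend(n.downstream)
    return seen

def failure_reach(start):
    """Same graph, same traversal -- but failure ignores the boundary."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n.name in seen:
            continue
        seen.add(n.name)
        stack.extend(n.downstream)
    return seen

# Hypothetical topology: the power supply sits outside the drawn boundary.
power = Node("power-supply", inside_boundary=False)
cpu   = Node("cpu", inside_boundary=True, downstream=[power])
ui    = Node("ui",  inside_boundary=True, downstream=[cpu])

print(designed_reach(ui))  # {'ui', 'cpu'} -- analysis stops at the big box
print(failure_reach(ui))   # {'ui', 'cpu', 'power-supply'} -- failure does not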
Failures have their own synergy
I think that failures have a certain synergy: faults interact and accumulate, the same way that the slightly-more-fortuitous operations within the system do. And here I’m not just referring to the simplistic notion of ‘chain reactions’ or domino effects that lead to catastrophe. It’s true that it never rains but it pours, yet far more mundane interactions attend malfunctions in general. It follows that faults can also ‘accidentally’ dampen their own systemic ‘failure signal’, or cancel each other out. In this way, lesser failures can go un-manifested (and unnoticed), leading to surprise and shock when they finally reveal themselves as part of a larger fiasco.
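A toy illustration of faults cancelling (the sensor, the bias and the numbers are all invented): a sensor with a +5-unit bias feeds a conversion routine that, through an unrelated bug, subtracts 5 units. End to end, the system looks healthy; ‘fixing’ either fault exposes the other.

```python
TRUE_TEMPERATURE = 20.0

def faulty_sensor(true_value):
    return true_value + 5.0      # fault 1: a +5 unit bias

def buggy_convert(raw):
    return raw - 5.0             # fault 2: spurious offset 'correction'

# The two faults cancel: the system looks healthy end to end.
print(buggy_convert(faulty_sensor(TRUE_TEMPERATURE)))       # 20.0 -- no fault found

# Excise one 'faulty node' (swap in a good sensor) and the hidden
# partner fault finally manifests.
def replacement_sensor(true_value):
    return true_value            # unbiased
print(buggy_convert(replacement_sensor(TRUE_TEMPERATURE)))  # 15.0 -- surprise
```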
Even then, they might remain invisible. They can hide under cover of transient environmental conditions (again, the irreverence toward the system boundary)… and cause the failure analyst to return a ‘no fault found’ (NFF) verdict. At best, the failure network allows one or two ‘faulty nodes’ to be sacrificed (found and excised) during failure analysis, while the network itself remains intact.
Failures at the human scale
Engineers may nibble at the edges of failure, trying to get a grip on it… but up and down the country, organisations made up of humans and machines, along with myriad processes that unfold at human speed, fail spectacularly every day, and the failure analyses – where they take place at all – are laughable.
In the business world, failure analysis takes the form of a self-cannibalising two-pronged attack:
- Divergence analysis (“this was the target, we missed it by this much; what gives?!”)
- Firing of staff / resignation of staff (“somebody’s gotta take the fall”).
Very rarely are the subtleties of personality types, lines of command, control and communication, team dynamics, speed of growth, competitor activity, recent events, timing of events, spread of talent and so on taken into consideration. The ‘alter network’ of divergent aspirations, watercooler talk, gossip, egos, shortcuts taken by adapting (and adaptable) humans when presented with unclear tasks, lack of direction, mismatched skillsets and so on combines synergistically to create a failure network / system that chugs along happily: the expected organism, only in flipmode.
Without the kind of planning ahead that we bring to bear on engineering projects, without stacking the odds in favour of the required system behaviour, we can end up with a failure signal propagating through an organisation. Human intuition detects this, but it is also human to do nothing, or not know what to do, until that magical coincidence of badly-done tasks yields a sufficiently botched outcome.
One thing is for sure: we refuse to pay attention to, or study, failure as a system / a holistic force in its own right. Perhaps we fear that to do so might give it too much power.
[Image: a tweaked tiny portion of the circuit diagram for the Z80 microprocessor.]