The long way towards resilience - Part 6

Surprises and antagonists

November 22, 2024 Uwe Friedrichsen

The long and winding road towards resilience - Part 6

In the previous post, we discussed the plateau of robustness, the second interim stop on the journey towards resilience, what it is good for, what its limitations are and what it means to get there.

In this post, we will discuss what it means to also prepare for surprises and the additional realizations needed to guide us to the next plateau. We will also meet what is probably the biggest obstacle on our way.

Surprise, surprise

Most people do not like surprises – at least not negative ones. Companies tend to hate surprises. Most managers tend to hate surprises because surprises mess with their carefully designed plans and introduce unexpected risks. Employees tend to hate surprises because they usually mean stressful extra work and sometimes even put their employment at risk.

However, surprises happen no matter how hard we try to control everything. But then there are people who claim that surprises just mean you did not plan hard enough …

Annotated known-unknown quadrant. See text for details.

However, if we look at the known-unknown quadrant, it becomes clear that even the best planning cannot protect us completely from surprises:

Known knowns – the things we know and are aware of. The explicit knowledge. We usually take these things into account. The more carefully we plan, the more complete our list of known knowns will be.
Unknown knowns – the things we implicitly know but are not aware of. This is our implicit knowledge. If things go wrong, they often direct our “gut feeling” in the right direction. We may take these things into account. However, as it is not easy to make implicit knowledge explicit, chances are high we miss some of them.
Known unknowns – the things we do not know and are aware that we do not know. If we work hard, we may be able to identify some of them and turn them into known knowns. Maybe some other people can help us to fill (some of) the gaps. However, chances are high we miss at least some of them.
Unknown unknowns – the things we do not know and are not aware that we do not know. No matter how hard we try, we will miss them.

If we work really, really hard and are lucky, we may identify all potential adverse events and situations that hide in the first three quadrants. Usually, we do not. Usually, we will miss some of them even if we work very hard. For example, try to list all items in your kitchen on command. Most likely you will miss a few things. And these are all things you definitely know and have probably seen many times. Figuring out unknown knowns or known unknowns is a lot harder.

But even if we were lucky and able to identify all adverse events and situations that belong to the first three quadrants, we are out of luck when it comes to the last quadrant. The definition of unknown unknowns is that we do not have any clue they exist. If they strike, they take us by surprise. The only thing we know about them is that they exist and that they are not an empty set. In other words: There are always adverse events and situations that may hit us without our having any idea they existed until we see them.

The fourth insight

This immediately leads to our fourth insight:

Surprises are inevitable.

As soon as we accept this, we are ready to continue our journey.

The limits of technical systems

Accepting that surprises are inevitable, we start to ponder how to handle them successfully. How can we build our IT systems in a way that they can handle surprises successfully?

Well, we cannot.

It is not possible to build IT systems (or any technical system) in a way that they can deal with surprises. Just think about the corresponding requirement for a moment: “Write code that handles an adverse situation nobody knows what it is and how it manifests”. Good luck with that!

We can build the ability to handle expected adverse events and situations into technical systems. This is what we mainly do at the plateau of robustness. At best, we can try to make sure our system falls back into a defined safe state if something unexpected happens that it does not know how to handle (this is part of safety design, a discipline the regular enterprise software engineer is notoriously bad at). But we cannot build the ability to handle unexpected, i.e., still unknown adverse events and situations into our technical systems. ¹
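To make the difference between handling an expected adverse event and falling back into a safe state a bit more tangible, here is a minimal sketch in Python. All names in it (`Controller`, `Mode`, `UpstreamTimeout`, the returned messages) are made up for illustration and are not taken from any real system or library. The sketch assumes that "safe state" simply means "stop doing work and record what happened". What a safe state actually looks like depends entirely on the system at hand.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"


class UpstreamTimeout(Exception):
    """Hypothetical example of an expected adverse event."""


@dataclass
class Controller:
    """Illustrative wrapper that runs operations and guards them."""

    mode: Mode = Mode.NORMAL
    incidents: list[str] = field(default_factory=list)

    def run(self, operation: Callable[[], str]) -> str:
        # Once in the safe state, refuse further work until someone intervenes.
        if self.mode is Mode.SAFE:
            return "rejected: system is in safe state"
        try:
            return operation()
        except UpstreamTimeout:
            # Robustness: a known failure mode with a prepared response.
            return "served cached value"
        except Exception as exc:
            # Unknown failure: we cannot handle it here. We only make sure
            # the system stops in a defined state and record the incident.
            self.mode = Mode.SAFE
            self.incidents.append(repr(exc))
            return "rejected: entered safe state"
```

Note what the last branch does not do: it does not resolve the surprise. It only stops the system in a defined state and leaves a trace of what happened.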

But how can we handle adverse surprises successfully if we cannot build it into our technical systems?

Image with the socio-technical system encompassing the technical system. While the technical system can be made robust, we need the socio-technical system for resilience. See text for details.

If we ponder this question for a moment, we realize we need to have humans in the loop because currently only humans have the capabilities to successfully deal with surprises. The whole human history is a story of adverse surprises and the ability of humans to successfully overcome them. Otherwise, we would not exist anymore as a species.

This brings us to the concept of so-called “socio-technical systems”. The term has become quite fashionable lately. However, it does not mean anything new. It simply describes a technical system including its encompassing social system(s). For IT systems, depending on the context the encompassing social systems are the ordering organization, the development organization, the operating organization, the bug-fixing organization, the users of the system including the organization they are embedded in, and sometimes even more parties.

A key aspect of a socio-technical system is how the technical and social systems influence, complement and limit each other. E.g., if people talk about Conway’s law, it is about the effect the communication paths in the development organization have on the design of the IT system.

When it comes to handling surprises, we need to broaden our view and include the relevant encompassing organization(s) and their processes in our reasoning because only with human creativity in the loop can we respond to adverse events and situations that we do not even know exist before they hit us.

The fifth insight

This leads to our fifth insight:

Surprises cannot be handled at the technical level alone. We need to leverage the whole socio-technical system.

This insight and the previous one lead us to our next interim stop, the high-plateau of basic resilience.

Efficiency obsession

But before we are ready to proceed to the high plateau of basic resilience, we first need to discuss the biggest antagonist of resilience: Efficiency obsession.

Probably the biggest obstacle on our way towards resilience is the need to balance resilience and efficiency. In the end, it all comes down to letting go of a bit of efficiency in order to be able to also deal successfully with surprises. This idea alone gives most companies and decision makers a hard time at best. Usually, they determinedly reject the idea. Reducing efficiency is nothing they can envisage.

In my blog post “Forget efficiency”, I discussed why companies are so efficiency-obsessed. In short, it boils down to deeply ingrained habits at an individual as well as at a company level that stem from a time with different market conditions: For most of the last century, most markets were industrial markets, which are a lot more predictable than the post-industrial markets that dominate today. Also, at least for the western hemisphere, the overarching environment was a lot more predictable:

Relatively clear political situation.
(Most) western countries being economically dominant with a continuously growing GDP.
Ecological situation still quite stable.
And so on …

Of course, not everything was peace and sunshine. But overall, the biggest challenge for most companies was not preparing for surprises but cost-efficiently increasing production in order to maximize profits. Based on this situation, a whole body of knowledge formed on how to be successful in such a setting.

The most important ingredient in this optimization process was increasing efficiency. The underlying implicit assumption was that the right thing was produced and big surprises would not happen – which often was true for many companies at the time this body of knowledge formed. As a consequence, most business leaders were pushing efficiency because it was the key to success (and because you do not get fired for pushing something everyone agrees upon).

Over time, increasing efficiency became an end in itself. It became ingrained into the DNA and culture of the companies: Here is a new idea. Does it increase efficiency? Yes? Okay, let’s do it. No? Dismissed.

It even went further. It became the sole problem solving response of many companies. They basically forgot the other means to address problems and challenges. Whatever happened, increasing efficiency was the response.

Of course, not only the company culture became efficiency-obsessed. Also, the organization and processes were set up to maximize efficiency:

Maximize division of labor.
Foster specialization.
Maximize utilization.
Minimize slack.
Enforce unthinking adherence to rules.
Reduce variance.
Minimize resources until any further reduction would bring the system to a standstill – even without any unexpected events.
…

All these measures, refined over the course of decades, led to highly (sometimes overly) optimized organizations and processes where everything breathes efficiency. In short: Most companies are efficiency-obsessed. Everything is about increasing efficiency. All problems are addressed by attempting to increase efficiency.

Efficiency as a resilience antagonist

Enter resilience.

Resilience is the ability to also cope with unexpected adverse events and situations – with surprises. Surprises do not care about your carefully reduced variance, resource minimization, and all those other efficiency optimization measures that only function as long as nothing unexpected happens. Surprises are the unexpected.

The underlying assumption of efficiency-obsessed people is that with enough hard work and willpower surprises can be eradicated.

However, they cannot.

Like failures in technical systems, surprises are inevitable.

But what does it take to deal with surprises?

It takes, e.g.:

Spare resources
Slack in the system, especially some wiggle room for the people involved
Creativity
Outside-the-box thinking
Spontaneous collaborative working
…

If you show this list to efficiency advocates, they will cringe and wail in pain. This is exactly what they consider bad. If you try to maximize efficiency, you need to suppress and eradicate all these things. This means:

Efficiency-obsession and resilience are antagonists.

If you increase efficiency beyond some low-hanging fruit, you will compromise resilience. Also, if you want to be resilient, you cannot arbitrarily increase efficiency.

Note that this is not a binary either-or. It means that solely focusing on increasing efficiency will compromise your resilience. It does not mean that efficiency must be completely given up to become resilient. It means there is a sweet spot at which you are fairly efficient but still have the required prerequisites in place you need to successfully cope with adverse surprises.

However, the efficiency advocates will still cringe. After all, they have been conditioned over many years that nothing is as important as increasing efficiency. Therefore, they sometimes come up with a seemingly smart idea:

“Let us establish a bimodal model. Everything runs in high-efficiency mode as long as everything works as expected. We only switch to resilience mode if something unexpected and adverse happens. This way, we do not need to compromise efficiency.”

While this is a seemingly valid thought, it does not work in practice. Highly efficient organizations are extremely rigid and inert. They are great at doing what they were designed for. But they are extremely resistant to any deviation from the norm. And they have no idea how to respond to situations they were not designed for.

However, responding to a surprise requires exactly this ability. It requires the organization to very quickly and flexibly respond to an unexpected and unknown situation. An organization, optimized for maximum efficiency over many years, has lost this ability.

Additionally, the people working in highly efficient organizations are conditioned not to think and act in the ways that are required to respond to surprises. Abilities like creativity, outside-the-box thinking or spontaneous collaboration are not incentivised. Quite the opposite: Such traits are actively penalized, and either you learn to fit in and unlearn those traits or you had better leave the company.

Therefore, such a bimodal model will not work. You cannot easily switch modes from highly efficient to resilient when needed. You have to build the ability to respond to adverse surprises into your (regular) processes and organization. Again, you do not have to give up efficiency for that. You only need to balance it with the need to be resilient and find the sweet spot.

Nevertheless, efficiency-obsessed people and organizations will consider the measures required to become resilient as a threat (not infrequently even as an existential threat) and therefore fight them.

Hence, the hardest part of the work of becoming resilient usually is breaking the efficiency obsession and bringing it back to a level that does not compromise resilience. Or using our mountain-climb metaphor: This part of the journey may feel like climbing up a steep face with an overhang. ²

Summing up

Now we are ready to proceed to the high-plateau of basic resilience. But as this post is already long enough, we will postpone our ascent until the next post.

We discussed that surprises are inevitable, no matter how hard we try to eradicate them. Therefore, we need to accept them. We also discussed that we cannot handle surprises at a technical level alone but that we need the whole socio-technical system for it. We need human creativity in the loop to cope with surprises.

Finally, we discussed the biggest obstacle on our way towards resilience: The omnipresent efficiency obsession we can observe in most companies. We can only successfully become resilient if we are willing to balance resilience and efficiency. If our sole focus is increasing efficiency, we will become rigid and fragile instead of resilient.

In the next post, we will explore the high-plateau of basic resilience. Stay tuned … ;)

¹ For the sake of completeness: Organic computing and residuality theory both claim that it is possible to build the ability to deal with unexpected and still unknown adverse events and situations into technical systems by applying their respective techniques to system design. However, organic computing is still a research area and leads to very different system designs than the ones we typically see in enterprise computing. And at the time of writing this post, residuality theory is still a one-man show in a much earlier research state than organic computing. Many proofs regarding its underlying assumptions are not yet publicly available and thus lack public validation. Therefore, I would currently take both approaches with at least a grain of salt. Nevertheless, even if I personally consider both far from being “production-ready”, I still think they are both worth a look. Additionally, an existing adversity handling routine may accidentally be able to handle a not yet known adverse surprise, too (which is the core claim residuality theory is built upon). However, this is something that may happen, not something we can count on. It does not guarantee that our systems will be able to handle all surprises they may be confronted with.
² As I discussed in my blog post “Forget efficiency”, it also makes a lot of sense to move away from that omnipresent efficiency obsession for many other reasons.