The long way towards resilience - Part 7

The high-plateau of basic resilience

Uwe Friedrichsen

A meadow just below a mountain peak

The long and winding road towards resilience - Part 7

In the previous post, we discussed what it means to also prepare for surprises, the realizations needed to guide us to the next plateau, including the probably biggest obstacle in our way: efficiency obsession.

In this post, we will discuss what we find at the third plateau, the high-plateau of basic resilience.

The high-plateau of basic resilience

The high-plateau of basic resilience is the third interim stop companies tend to reach on their journey towards resilience.

A sketch of “Mt.Resilience” with an arrow pointing to the third plateau labeled “High-plateau of basic resilience”

Only a few companies are there – at least from an IT point of view. The main reasons are:

  • It takes time to accept that surprises are inevitable.
  • Preparing for surprises usually requires changes in the processes and organization – and change is notoriously hard.
  • Resilience and efficiency are antagonists to some degree. Highly efficient organizations cannot be resilient, and resilient organizations are not highly efficient. The keyword is “highly”: resilient organizations are usually efficient, but they do not try to squeeze the last bit of efficiency out of their organization. Efficiency and resilience need to be balanced, which tends to be extremely hard because most organizations are so efficiency-obsessed that their answer to every problem is “more efficiency”, which prevents them from becoming resilient. We already discussed this point in the previous post.

As before, if we move on to the high-plateau of basic resilience, it does not mean the lower plateaus become irrelevant. It is quite pointless to be able to handle surprises if we cannot handle expected adverse events and situations properly. Hence, we still need the concepts from the lower plateaus.

Balance robustness measures

The concepts of the new plateau create another layer and complement the prior concepts. They also help to readjust and fine-tune priorities. At the plateau of robustness, we learned that trying to excessively maximize MTTF is not useful. At the high-plateau of basic resilience, we learn that excessively building robustness into our applications is not useful.

The reason for this is twofold:

  1. Each robustness measure adds complexity to the solution. The code becomes more complex. The runtime environment becomes more complex. However, complexity is an antagonist of robustness. The simpler a solution, the more robust it tends to be: easier to understand, fewer moving parts at runtime, less error-prone. This means you need to balance the simplicity of the solution against the number of robustness measures you build in. As so often, there is a sweet spot between the two extremes.
  2. Robustness measures can sometimes create unexpected, emergent behavior. A measure may handle the kind of adverse situation it was designed for well but lead to surprises under different conditions, e.g., metastable failures 1. Of course, we do not want to introduce new adverse surprises by handling expected adverse situations. Therefore, we need to carefully ponder which robustness measures we build into our systems, how we build them in, and which parts we leave to the organization to handle. However, hitting the sweet spot is not easy and it may take a while to get there.
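
To make the second point a bit more tangible, here is a minimal toy simulation of how a well-meant robustness measure (client-side retries) can keep a system overloaded after the original trigger is gone. It is only a sketch under stated assumptions: the function name, all numbers and especially the rule that an overloaded server wastes part of its work on requests whose callers have already given up are made up for illustration and not taken from any real system.

```python
def simulate(max_retries, ticks=60, base_load=80.0, capacity=100.0,
             degraded_capacity=40.0, trigger=range(5, 10)):
    """Return the goodput (successful requests) per tick.

    All parameters are hypothetical. During the ticks in `trigger`,
    the server capacity temporarily drops to `degraded_capacity`.
    """
    # cohorts[k] = number of requests making attempt k+1 in the current tick
    cohorts = [base_load] + [0.0] * max_retries
    goodput = []
    for tick in range(ticks):
        cap = degraded_capacity if tick in trigger else capacity
        offered = sum(cohorts)
        # Toy assumption: below capacity everything succeeds. Above capacity,
        # part of the work is wasted, so the success fraction drops
        # quadratically with the overload factor.
        ok = 1.0 if offered <= cap else (cap / offered) ** 2
        goodput.append(offered * ok)
        failed = [c * (1.0 - ok) for c in cohorts]
        # Failed attempts are retried in the next tick.
        # Requests that failed their last allowed attempt are dropped.
        cohorts = [base_load] + failed[:max_retries]
    return goodput


without_retries = simulate(max_retries=0)
with_retries = simulate(max_retries=3)
print("goodput at the end without retries:", round(without_retries[-1], 1))
print("goodput at the end with retries:   ", round(with_retries[-1], 1))
```

In this toy model, the variant without retries returns to serving the full base load as soon as the capacity is back. The variant with retries stays stuck below the base load long after the trigger has ended because the retries alone keep the offered load above the capacity. Real systems behave in more nuanced ways, but the feedback loop is the same kind of surprise.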

A good rule of thumb is to start with a rather small number of robustness measures that complement each other and then see if additional measures are required. Complementing measures are measures that are able to handle more adverse events and situations in conjunction than they could in isolation, i.e., together they create some kind of “the whole is more than the sum of its parts” constellation. Such a set of complementing measures may take a while to figure out but in the end, it is worth the effort.

Evaluation of the high-plateau of basic resilience

With those introductory remarks, let us explore the high-plateau of basic resilience in more detail using the evaluation schema we already know.

Core driver

The core driver is to expect the unexpected.

With the acceptance that surprises are inevitable comes the insight that we need to prepare for them if we want to reduce the risk of being knocked over by the next surprise that hits us.

Leading questions

This leads to a quite different set of typical leading questions, e.g.:

  • How can I maximize the odds of detecting and responding quickly to an unexpected error before it turns into a failure?
  • How can I organize best to be able to respond quickly and successfully to adverse surprises?
  • Which resources does my IT organization require to be able to respond quickly and successfully to adverse surprises?
  • How do I balance resilience and efficiency?

It is important to understand that we cannot guarantee that surprises will never knock us over, no matter how hard we try. We can only reduce the probability. However, a 50% chance of withstanding or quickly recovering from an adverse surprise is a lot better than a 0% chance, and if you deal with a safety-critical system, each additional percent will eventually prevent human harm or even save human lives.

Resilience also affects the organization and processes. Typically coming from organizations that are optimized for efficiency, we most likely need to change our processes and organizations to a certain degree to become resilient. The good part of the story is that resilient organizations usually are a perfect fit for economies of speed. Thus, if economies of speed are an issue (which they usually are), going for resilience ultimately pushes in the same direction. 2

We already discussed in the previous post that we need to balance efficiency and resilience. Optimizing for efficiency means getting the job done with as few resources as possible. Resilience, on the other hand, requires some spare resources and human capacity to withstand adverse surprises or to quickly recover from them. Hence, if we really want to improve our resilience, we need to limit our efficiency efforts in a way that they do not forestall resilience.
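
A simple way to illustrate why spare capacity matters is a textbook queueing model. The following sketch assumes an M/M/1 queue (a single server with random arrivals and random service times), which is of course a strong simplification of a real organization or IT system. The function name is made up for illustration. In this model, the mean response time is the bare service time multiplied by 1 / (1 - utilization).

```python
def relative_response_time(utilization: float) -> float:
    """Mean response time of an M/M/1 queue in multiples of the bare service time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


# The closer the utilization gets to 100%, the less room is left to absorb anything extra.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.0%}: response time x{relative_response_time(rho):.0f}")
```

At 50% utilization, the response time is about twice the bare service time, at 90% it is about ten times as long. A system that is tuned for maximum utilization thus has hardly any capacity left to absorb an additional, unexpected load.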

Typical measures

We enter an area where we need to take into account not only the IT systems but, more importantly, the encompassing organization and processes. Hence, the typical measures also focus more on the organization and processes than on the technical systems:

  • Self-organized teams help to establish the spontaneous collaboration patterns we need to cope with unexpected situations.
  • Fire drills & chaos engineering create the required routine, helping to make surprises “boring”.
  • Encouragement of broad knowledge and creativity fosters the outside-the-box thinking that is often needed to successfully master unexpected adverse situations.
  • Slack in the system provides the capacity and resources required to withstand or quickly recover from unexpected adversities.
  • Observability increases the chances of detecting imminent adverse situations earlier and helps to better analyze the situation.

In contrast to the plateau of robustness, here we actually need full-fledged observability, including the possibility to ask the observability solution arbitrary questions regarding the current, past and likely future system state and behavior. It is important that the observability solution does not limit us to querying the data already shown in the dashboards but also allows for ad hoc queries that are not related to the normally monitored and presented data.
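
The difference between a fixed dashboard and an ad hoc query can be sketched in a few lines. The example below assumes that raw request events with all their fields are retained. The event fields, values and the helper function are hypothetical and do not refer to any specific observability product.

```python
from collections import Counter

# Hypothetical raw request events, kept with all their fields.
events = [
    {"service": "checkout", "region": "eu", "app_version": "2.3", "duration_ms": 120, "error": False},
    {"service": "checkout", "region": "eu", "app_version": "2.4", "duration_ms": 2300, "error": False},
    {"service": "checkout", "region": "us", "app_version": "2.4", "duration_ms": 2100, "error": True},
    {"service": "search", "region": "us", "app_version": "2.4", "duration_ms": 90, "error": False},
    {"service": "checkout", "region": "us", "app_version": "2.3", "duration_ms": 140, "error": False},
]


def count_by(events, dimension, predicate):
    """Count the events matching `predicate`, grouped by an arbitrary field."""
    return Counter(e[dimension] for e in events if predicate(e))


# A dashboard typically answers a question that was fixed in advance,
# e.g., the number of errors per service.
errors_per_service = count_by(events, "service", lambda e: e["error"])

# An ad hoc question nobody planned for: which app versions show slow requests?
slow_by_version = count_by(events, "app_version", lambda e: e["duration_ms"] > 1000)

print(errors_per_service)
print(slow_by_version)
```

The point of the sketch is that the second question can only be answered if the raw data and a way to slice it by any field are available. If only the pre-aggregated error count per service had been stored, the question could not be asked at all.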

Trade-offs

The high-plateau of basic resilience approach has several trade-offs:

  • It takes a lot of work and effort to reach the plateau, especially because of the required changes regarding the organization and processes.
  • A change of mindset is needed, away from efficiency obsession towards a sensible balance of efficiency and resilience, which tends to be very hard for companies that are traditionally efficiency-obsessed.
  • The whole socio-technical system needs to be taken into account.
  • Usually, collaboration modes at the system boundaries need to be reshaped, i.e., how the respective business and IT departments (or cross-functional stream-aligned teams) interact with other parts of the company.
  • The approach works very well with an economies of speed business model, i.e., for companies that focus on market feedback cycle times.
  • It allows for very high availability even in the face of unexpected adverse situations.
  • It enables very high innovation speed without compromising dependability even in highly uncertain environments. 3
  • Resilience, i.e., the ability to successfully cope with expected and unexpected adverse events and situations, is achieved.

When to use

Such a setup is required for safety-critical contexts.

It also is suitable if the availability demands are very high but the technical environment tends to be very unreliable.

It also is suitable if high innovation speed is required in highly uncertain business environments.

When to avoid

Such a setup is not sufficient if the threat surface changes frequently (see also the “Blind spot” below).

Impact radius

The impact radius is the full socio-technical system, i.e., it includes the IT systems, the IT department, the business department, the processes, the organization and the collaboration modes at the (socio-technical) system boundaries.

Blind spot

The blind spot of this setup is a lack of progress. While the organization is able to handle surprises, it does not continuously adapt to a changing threat surface.

Summing up

The high-plateau of basic resilience is the third interim stop on our journey towards resilience. It accepts that surprises are inevitable and that surprises cannot be handled by technical systems alone but require humans in the loop. Therefore, it changes the attitude towards expecting the unexpected and takes the whole socio-technical system into account.

It takes a lot of effort to reach the plateau: A big mindset change is needed. More parties are involved. Processes and the organization are affected. Collaboration modes at the system boundaries are affected. Therefore, only a few companies have reached this plateau so far.

It allows for excellent availability, even in the face of surprises. Therefore, it is also suitable for safety-critical contexts. However, it does not unleash the full potential of resilience because it lacks continuous adaptation to a changing threat landscape.

In the next post, we will discuss what it takes to continuously adapt to a changing threat and surprise landscape, which will finally bring us to the peak of Mt. Resilience. Stay tuned … ;)


  1. Metastable failures are failure modes that, due to unexpected side effects of robustness measures, persist even after the original failure cause has been removed. If you would like to dive deeper into that topic, I recommend starting with the introductory paper “Metastable Failures in Distributed Systems” by Bronson et al. and the paper “Metastable Failures in the Wild” by Huang et al., which contains a practical examination of the topic. ↩︎

  2. The complementing observation is that economies of speed do not try to maximize efficiency (minimizing cost of work). Instead, they try to maximize effectiveness (maximizing revenue by maximizing impact of work). To maximize profits, they try to improve the left side of the balance sheet, not the right side. From an economic point of view, resilience attempts to minimize the risk of losing revenue (due to being unavailable), i.e., it also aims at the left side of the balance sheet. Therefore, resilience and economies of speed are not at odds like resilience and efficiency-obsession are. For a more detailed discussion of efficiency and effectiveness, see the blog post “Forget efficiency”. ↩︎

  3. Remember that one of the consequences of the ongoing digital transformation is that business and IT have become inseparable. This also means you cannot change anything on the business side without touching IT. Therefore, the speed at which IT can implement and release changes without compromising quality (which is essential for robustness and resilience) also limits how fast business can respond to market movements. Additionally, the ability of the IT department to deal with adverse events and situations – expected and unexpected – also limits the ability of the business to respond to adverse events and situations. ↩︎