Uwe Friedrichsen



The long and winding road towards resilience - Part 1

At its core, this post series will discuss three questions:

  1. What is resilience?
  2. How can we become resilient?
  3. Do we always need to go the full 9 yards?

While the second question probably makes sense to you immediately, the first and third questions may make you wonder. Is it not obvious what resilience is? And is there any alternative to doing everything needed if we want to become resilient? Based on my observations, there is a lot of confusion regarding resilience. Almost everybody I talk to means something different when they talk about resilience.

Which leads to the third question. Quite often, people are not actually interested in resilience when they talk about resilience (especially in the context of IT). There are several prototypical evolution steps companies go through on their journey towards resilience. However, depending on the task at hand, it may be perfectly fine to stop your journey at one of the interim steps. Of course, what they achieved in such a situation is not actual resilience, but it may be completely sufficient to solve their task.

This blog series is based on a presentation I gave a few times (see, e.g., the J On the Beach 2024 recording). At its core, the post series follows the same storyline as the presentation. However, it dives deeper into the topic and (hopefully) fills in the blanks the presentation had to leave due to time and scope restrictions.

As all this would be way too long for a single blog post, I split it up into several posts:

  1. What is resilience? (this post)
  2. The valley of feature-completeness
  3. The plateau of stability (link will follow)
  4. Availability revisited (link will follow)
  5. The plateau of robustness (link will follow)
  6. Surprises and antagonists (link will follow)
  7. The high-plateau of basic resilience (link will follow)
  8. Responding to changing threat landscapes (link will follow)
  9. The peak of advanced resilience (anti-fragility) (link will follow)
  10. Do we always need to go the full 9 yards? (link will follow)

The limits of storytelling

Initially, I planned to write this blog series as a hero’s journey. But along the way I realized that this is a journey a single person cannot shape. A single person can act as an evangelist and maybe a catalyst to get things going. However, the person will never be able to shape the whole journey.

Hence, either the person would have needed to pass on the baton along the way, and the next person would have needed to do the same, and so on, which makes identifying with the hero harder. Or I would have needed to build a bigger group along the way, which brings its own challenges.

Therefore, I decided to refrain from the storytelling approach because the topic at hand makes it hard and I am afraid that goes beyond my personal storytelling skills.

As a consequence, I will explain the journey in a more descriptive way, i.e., I will not discuss it from a first-person perspective. Instead, I will show what makes a company move from one level to another, how to get there, the tradeoffs of the distinct levels and their blind spots (which in turn point to the next level).

While the emotional attachment of a good hero’s journey will be missing, I hope I will still get the important information across, albeit in a somewhat more prosaic way.

Final preparations

Before we jump into the subject matter, I would like to add the following remarks:

  • Usually, it is better to start explaining a topic by answering the question “Why” first instead of “What”. “Why” provides purpose. “What” without asking “Why” can lead to solutioneering in the worst case. I skip the “Why” here because I already discussed it in prior blog posts, such as this general introduction to why we need resilience or, more recently, as an essential building block towards a more sensible IT. Hence, if you are looking for the “Why”, I recommend reading the aforementioned blog posts first.
  • I present a series of reasoning models in this blog post. Also, the journey itself is sort of a model because it leaves out a lot of details that may influence your actual journey. As I discussed in a previous blog post: “All models are wrong. Some are useful!”. Hence, if your particular journey is different, if you cannot precisely identify your current resilience level because you implemented parts of level A and parts of level B, that is perfectly fine. I do not provide the one and only irrefutable truth. I try to provide you with a reasoning model, a simplification of reality that supports you in reasoning about your situation and what may be a useful next step to take. Nothing more. But also nothing less.

With that being written, let us jump in.

What is resilience?

I hear a lot of people talking about resilience. After all, it has become fashionable to “be resilient” or at least to “have a resilient IT”. I talk to IT decision makers who claim to have a resilient IT. I see suppliers praising the resilience of their solutions. I talk to developers wanting to write resilient software. And so on.

The question that always immediately comes to my mind is: Are we talking about the same thing? Is it really resilience you talk about or do you talk about something else?

Hence, what is resilience?

Maybe before answering the question, let us point out what resilience is not.

I was surprised when I realized that people tend to consider things resilient that function over a longer period of time. The reasoning is: It worked up to now. Thus, it must be resilient.

It is not!

Something is not resilient just because it functions over a longer period of time.

If something functions over a longer period of time, it just proves that its basic design works within the expected conditions. That is it. Nothing else.

Resilience, or the lack of it, shows when something unexpected, something outside the expected conditions, happens. Think of supply chains, for example. Many people were convinced the global supply chains were highly resilient because they functioned well. However, they only functioned well as long as everything went as expected within very narrow tolerance boundaries.

Now let us add something unexpected. Hello, Ever Given! A single ship blocked the Suez Canal for a few days, and it took the global supply chains months to recover from it, even in places that did not have any connection to the Suez Canal. So much for resilience. The global supply chains are not resilient. They only work as expected as long as nothing unexpected happens.

Which brings us back to the question: What is resilience?

Of course, as with many other topics, you also find the usual know-it-all-and-better people on the Internet who claim to hold the authoritative truth regarding resilience. The problem is that there is more than one such person, and each of them insists on a different authoritative truth, which makes things complicated.

Therefore, I tried a different approach. I looked up definitions of resilience from several different sources and domains and tried to create a common bottom line based on them. This still leaves the risk I missed some important detail while distilling the bottom line. But at least it avoids the single, bigger-than-life authority kind of approach which often is turned into some dogmatic “either you submit to our definition or you are our enemy” belief thing by the disciples of the authority. 1

So, I went through domains like psychology, ecology, supply chain, organizational theory, systems engineering, IT and more. I also looked at the definitions that some of the authorities in the field of resilience, like Erik Hollnagel and David Woods, came up with. Overall, I worked through more than two dozen definitions 2 and tried to distill a common bottom line. This led me to the following definition of resilience:

Resilience is the ability to successfully cope with adverse events and situations, including

  1. handling expected adverse events and situations (robustness)
  2. handling unexpected adverse events and situations (coping with surprise)
  3. improving due to adverse events and situations (anti-fragility)

Note that robustness alone (the first part of the definition) is not resilience. You also need to be able to successfully cope with unexpected adverse events and situations (the second part of the definition). However, without being able to also successfully handle expected events and situations, the ability to handle surprises is of little value. You need both to implement resilience. 3

Not all definitions of resilience explicitly include the ability to improve due to the adversities faced (the third part of the definition). However, if you dive deeper into the fine print following the definitions themselves, most of the definitions include adapting to change or even transforming. In the end, I do not think it is a sign of actual resilience if you face the same kinds of problems over and over again (even if they may come in different disguises) and do not ponder how to adapt to a place where these kinds of problems become less likely. Therefore, and because many definitions of resilience explicitly include it, I also included this concept in the definition.

Examples of adverse events and situations

Now that we know what resilience is, let us briefly ponder what such adverse events and situations could comprise in the context of IT:

  • Most people I talked to think about hardware failures first and overload situations second.
  • Quite a few people equate resilience with IT security and thus equate adverse events with cyberattacks.
  • Some people also have situations in mind like a central process becoming latent, i.e., responding slowly.
  • Some people also consider input parameter errors, i.e., input that cannot be processed correctly.
  • Few people consider problems like a firmware bug in an infrastructure component like, e.g., a switch.
  • Also, few people think about a critical software bug as a potential adverse situation.
  • Very few people think about actual surprises. E.g., I once experienced a situation in which the triple-redundant cooling of our (sole) data center failed simultaneously. As a consequence, all production servers triggered emergency shutdowns due to overheating. Nobody (including me) expected that such a situation could ever happen.
  • Also, very few people consider the business domain as a source of surprises. E.g., imagine your strongest competitor launches a disruptive new product. If you are not able to respond very quickly, you will lose lots of customers to your competitor, in the worst case losing your viability. Due to the consequences of the ongoing digital transformation, this affects your IT organization in the same way.

And so on. And we have not yet talked about the really big drivers of uncertainty like a global crisis (COVID or climate change come to mind as two popular examples) or the basically unpredictable future geopolitical and economic developments. 4

As we can see, adverse events and situations in the context of IT comprise many more things than just the usual expected technical issues. However, I see few people who have such topics, especially surprises, in mind when they talk about resilience. We will dive deeper into this discussion when we discuss the question of how we can become resilient.

Resilient software design

Before wrapping up this first post, I would like to add a complementary definition, the definition of resilient software design. Resilient software design (RSD) is a topic that quite a few people (including me) often talk about. However, it is important to understand the distinction between resilience and RSD.

My definition for RSD is:

Resilient software design is designing and building software-based systems in ways that improve their dependability and thus support resilience according to the definition above.

If you ponder this definition for a moment, you will realize that software solutions on their own are not actually resilient. They can be robust. However, it is basically impossible to implement code for handling surprises, i.e., something you do not expect, something you do not even know can happen at all. If you are lucky, your code will accidentally handle the surprise successfully. But this would be luck. Usually, your system will simply fail in the face of an adverse surprise.

Nevertheless, even if a software system cannot be resilient on its own, it is of great value if it is robust, i.e., if it is able to handle known and expected adverse events. As written above: Without this capability, the capability of handling surprises is of little value: “Hey, we are perfectly prepared for surprises. However, our systems inevitably crash if any part of them should temporarily be unreachable.”

So, robustness is a required step on our way towards resilience. RSD is about creating robust IT solutions. 5
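To make the notion of robustness a bit more tangible, here is a minimal sketch of handling one known and expected adverse event: a dependency that is temporarily unreachable. It is only an illustration under stated assumptions, not a recommended implementation. All names (TemporarilyUnreachable, call_with_fallback, FlakyPriceService) are hypothetical and chosen for this example only.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TemporarilyUnreachable(Exception):
    """Hypothetical error a dependency raises when it cannot be reached."""


def call_with_fallback(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try the operation a bounded number of times, then degrade gracefully."""
    for attempt in range(attempts):
        try:
            return operation()
        except TemporarilyUnreachable:
            # Expected adverse event: back off exponentially before retrying,
            # but do not wait after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: answer with a degraded result instead of crashing.
    return fallback()


class FlakyPriceService:
    """Hypothetical dependency that is unreachable for its first N calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def current_price(self) -> float:
        self.calls += 1
        if self.calls <= self.failures:
            raise TemporarilyUnreachable("price service not reachable")
        return 42.0


if __name__ == "__main__":
    service = FlakyPriceService(failures=2)
    # The lambda stands in for a cached or default value (an assumption).
    print(call_with_fallback(service.current_price, lambda: 0.0))
```

Note that the sketch only covers a failure we anticipated. It does nothing for a surprise, which is exactly the distinction between robustness and resilience made above. The number of retries is deliberately bounded because unbounded retries are themselves a robustness measure that can have unwanted side effects under load.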

Summing up

We started our journey towards resilience by clarifying what resilience is. This is an important prerequisite for our journey because, from what I see, most people do not mean resilience when they say resilience. We also added a few examples of adverse events and situations to further clarify what resilience is. Finally, we differentiated resilience from resilient software design.

With these necessary preparations done, with our backpack packed and all gear in place, we are ready to actually start our journey. In the next post, we will kick the journey off by looking at the place where many companies still live and discussing why it is no longer a good idea to linger there. Stay tuned …


  1. Of course, all these disciples will declare me an enemy because I did not blindly submit to their belief. But well, I guess I will have to deal with that. ↩︎

  2. I will not list all the definitions here because that would be a post on its own. I am currently writing a book about resilient software design in which I cite and discuss quite a few of the definitions I went through (I left out definitions that were basically only repetitions of the definitions I listed). Unfortunately, at the moment I do not know when I will find the time to complete the book. So, please bear with me. ↩︎

  3. Sometimes, you see people from the systems engineering community insisting that robustness is not part of resilience. Personally, I think this is due to the fact that most of those people come from safety-critical domains. In safety-critical domains people can (and will) die if the system fails. Therefore, all those systems are built with robustness in mind, i.e., robustness is basically a “given” in those domains. The challenge for those people is not to make systems robust (because they already are) but to prepare for surprises. Hence, my guess is that those people are a bit misled by the properties of their domain when it comes to resilience. However, in IT, especially enterprise IT, we are far away from building robust systems. Therefore, I think it is important also to stress the relevance of robustness as a required prerequisite to becoming resilient. ↩︎

  4. I will touch on those global drivers of uncertainty (and thus surprises) only lightly in this post series because these topics are “beyond the pay grade” of most of us. However, it is important to keep in mind that completely ignoring those topics can easily create a life-threatening risk for a company. ↩︎

  5. I leave out really nasty failure modes like metastable failures in this blog series, i.e., failure modes that, due to unexpected side effects of robustness measures, persist even after the original failure cause has been removed. If you would like to dive deeper into that topic, I recommend starting with the introductory paper “Metastable Failures in Distributed Systems” by Bronson et al. and the paper “Metastable Failures in the Wild” by Huang et al., which contains a practical examination of the topic. ↩︎