The long way towards resilience - Part 10
Do we always need to go the full nine yards?

In the previous post, we discussed what we find at the peak of Mt. Resilience, the peak of advanced resilience (anti-fragility).
In this post, we discuss the final remaining question: do we always need to climb to the top, or is it okay to stop our journey at one of the interim plateaus?
All models are wrong …
Before we dive into the question, let me first repeat a statement I made at the very beginning of this blog series:
Note that I will describe a prototypical journey of a company from zero to resilience. The starting point and the interim stops are states I have seen several times on my own. Nevertheless, this is a simplified model, which means that the journey of your particular company might be different.
Our journey was really long, and our way to the top was arduous and full of obstacles. Thus, most of you have probably forgotten this statement from the beginning of our journey. However, it is very important to keep it in mind, and thus I repeat it here.
Your mileage will vary. E.g., you may find some aspects of the first and third plateau in your company while others are completely missing. This is normal. Reality is always more complex (some say chaotic) than a model. And this prototypical journey is a model. It is a deliberate simplification of reality. As I explained in a prior post, models must be simpler than reality in order to be useful.
This model is meant to help you reason about the state of your company and ponder useful next steps. It provides the realizations needed to ignite a change towards a higher level of resilience. It discusses typical measures required to get to that higher level. It looks at tradeoffs and blind spots. All these things are reasoning tools. Nothing more. But also nothing less.
Overall, this model is meant to help you to make decisions without getting lost in details. Usually, after making a decision based on the reasoning tools of the model you still need to adapt it a bit to fit it into the complex, often chaotic mess of your reality. But you have a basic decision you would not have without the model.
Therefore, always be aware that the prototypical journey described in the prior posts does not reflect your exact journey with all its details and subtleties. If it did, it would be so complex that it would be useless to you. 1
With this being said (or more accurately: written), let us move on to our final question.
Do we always need to go the full nine yards?
After we finally made it to the peak of Mt. Resilience in the previous post and enjoyed the nice view for a while, we looked back on our journey and were amazed at where it had led us:
- We started with the intention to make our IT systems resilient. In the beginning, we thought this would be a purely technical task.
- But then we learned that we need to include the business department and also touch our processes if we want to become at least robust.
- However, this was still not enough to become resilient. We learned we must even touch our organization, our collaboration modes at the organizational boundaries, have arduous discussions about balancing efficiency and resilience and always ponder the whole socio-technical system to become resilient in a narrower sense.
- And this was still not the end of the journey. We have seen that to become fully resilient (which is sometimes also called anti-fragility), we need a continuous adaptation process, including the ability to leapfrog, for basically the whole company or at least big parts of it. This also includes reshaping the collaboration modes at the market boundaries as needed.
We never intended to be there. We are software engineers (probably most of us are), maybe IT leaders. This stuff further up Mt. Resilience seems to be way outside our “pay grade” and the idea of discussing organizational changes and even tougher topics (in terms of expected discussions and resistance) feels at least awkward for most of us. Still, the journey inevitably led us there.
Hmmm …
The obvious question that most of you (and admittedly also I sometimes) have in mind is:
Do we always need to go the full nine yards of resilience?
Do we always need to climb to the top to become resilient or is it okay to stop somewhere further down Mt. Resilience, even if we understand we do not get full resilience?
Well, as so often, it depends on what you need to achieve. When people talk about resilience and their need to become more resilient, they often mean very different things. Sometimes, they actually mean resilience. Sometimes, they mean something different and only use the word “resilience”.
The answer to this question is basically spread across the prior posts. However, it is a bit cumbersome to gather up the pieces from all those locations. Therefore, I grouped it all in a single table here.
The table consists of the different interim stops of our journey as rows and some essential properties of the interim stops as well as when it is okay to stop the journey there as columns.
The first row describes the starting point of our journey, the valley of feature-completeness.
Based on my experience, many companies are still there: Here is Dev, responsible for implementing new features, measured by feature throughput. There is Ops, responsible for reliably running systems, measured by availability. I.e., everyone except Ops treats availability as a S.E.P. (somebody else’s problem).
This approach is okay if the system landscape consists of mostly isolated, monolithic systems that communicate via batch interfaces. However, today’s IT system landscapes are complex, distributed and highly interconnected, and communicate mainly via online interfaces. For such IT landscapes, the approach is no longer suitable, even less so given the intensifying effects of highly dynamic markets and the ongoing digital transformation.
My basic recommendation is to move away from the valley of feature-completeness as it is not suitable to meet the demands of today’s markets and IT.
The second row is about the first interim stop, the plateau of stability.
We also find relatively many companies there. The core idea is to avoid failures at all costs, usually with a focus on crash failures and overload situations. This focus makes the people involved vulnerable to the 100% availability trap: the fallacy that everything beyond the boundaries of one’s own application is 100% available, so they do not plan for failures of the other systems.
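To illustrate why the 100% availability trap is a fallacy, here is a minimal sketch in Python. It assumes that a request only succeeds if our own application and all of its dependencies are up, and that failures are independent. Both are simplifications, and the function name and all numbers are hypothetical:

```python
from math import prod


def compound_availability(own: float, dependencies: list[float]) -> float:
    """Availability of an application that needs itself AND all dependencies up.

    Assumption: failures are independent and there are no fallbacks.
    """
    return own * prod(dependencies)


# Hypothetical landscape: our application and five other systems it calls.
trap_view = compound_availability(0.999, [1.0] * 5)    # "the others never fail"
realistic = compound_availability(0.999, [0.999] * 5)  # the others fail, too

print(f"assumed: {trap_view:.4%}, more realistic: {realistic:.4%}")
```

Even with these friendly numbers, the compound availability is noticeably lower than the availability of our own application alone. The more dependencies we add, the wider the gap becomes.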
Going for stability as described is okay if the availability demands are not too high: less than 99.9% availability (“three nines”) is a good rule of thumb. Additionally, the systems should not be distributed internally (i.e., not have a service-based architecture) and planned downtimes should also be possible. In such a setting, it can be okay to stop the journey at the first plateau.
My basic recommendation is to use this approach only for non-mission-critical systems where unexpected downtimes are not a big issue, even if they last from several hours up to a few days.
The third row is about the second interim stop, the plateau of robustness.
If your availability needs are higher and/or your application is distributed internally but the application is not safety-critical, the plateau of robustness may be sufficient. At this plateau, you have overcome the 100% availability trap and accepted that failures are inevitable. Therefore, you have started to embrace failures of all kinds, which leads to failure response patterns that are quite different from the ones seen at the plateau of stability. Based on my experience, not so many companies have made their way up to the plateau of robustness.
My basic recommendation is to aim for this plateau for most of your enterprise software systems unless they are safety-critical. If they are mission-critical, if very high availability (99.9% and higher) is mandatory and downtimes longer than a few minutes at most are not an option, this level is basically the mandatory minimum resilience level.
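To get a feeling for what such availability targets mean in practice, the following small Python sketch converts an availability target into a yearly downtime budget. It assumes a year of 365 days and counts every kind of downtime against the budget; the function name is made up for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def yearly_downtime_hours(availability: float) -> float:
    """Maximum total downtime per year that still meets the availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


for target in (0.99, 0.999, 0.9999):
    hours = yearly_downtime_hours(target)
    print(f"{target:.2%} availability allows {hours:.2f} hours of downtime per year")
```

For “three nines”, the budget is roughly 8.76 hours per year. A single outage that takes most of a working day to fix would already use up the whole budget, which is why avoiding failures alone is not enough at this level.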
The fourth row is about the third and last interim stop, the high-plateau of basic resilience.
In safety-critical contexts, going at least for the high-plateau of basic resilience is a must because a long-lasting failure due to an unexpected adversity (a surprise) is not an option. People’s health or even lives are at risk if the system fails. This changes the core driver of your resilience measures to “Expect the unexpected”. The handling of potential surprises becomes part of your considerations, which widens the scope to the whole socio-technical system. Considering surprises also enables you to move reliably in highly unpredictable and uncertain technical and business environments. However, only relatively few companies are found at the high-plateau of basic resilience.
My basic recommendation is to go for this plateau whenever safety comes into play. In such settings, this plateau becomes the mandatory minimum resilience level. Keep this in mind because due to the ongoing conflation of IT and OT, more and more safety demands will find their way into traditional enterprise systems.
The fifth and last row is about the final destination of the journey, the peak of advanced resilience.
The peak of advanced resilience adds continuous adaptation – or leapfrogging if needed – to the game, i.e., it embraces surprises as an opportunity to learn from, enabling anti-fragility. This prepares for a successful endless game in a VUCA world, a world in which we cannot reliably anticipate anymore what will happen next.
Even if this tends to be more of a business issue, IT is affected equally because, due to the (still poorly understood) effects of the ongoing digital transformation, business and IT have become inseparable. However, even fewer companies have made their way to the peak.
My basic recommendation is to carefully examine this peak, not only from an IT perspective but from a whole company perspective. As the degree of VUCA rises for most companies and the calls for more resilience become louder everywhere, it becomes more and more important to understand what resilience actually means and what needs to be done to get there. You need to aim for this peak if you want to survive and thrive in a VUCA world with more and more unforeseen threats that go way beyond a bit of cybercrime (which already overstrains most companies). However, these considerations belong at the C-level. A mere software engineer usually will not be able to do much about this.
Overall, this means if you are just interested in highly available systems which are not safety-critical, the plateau of robustness might be sufficient for you. If your systems are safety-critical, you must set up your IT organization to also deal reliably with surprises, i.e., move up at least to the high-plateau of basic resilience.
Staying below the plateau of robustness is only okay if availability is not an important system property, if your IT system landscape is not too complex and if your systems are not distributed internally. As this is rarely the case anymore today, staying below the plateau of robustness is only okay in few situations.
Beyond safety-critical contexts, moving up to the high-plateau of basic resilience or the peak of advanced resilience is also advisable if the business or technical environment is highly uncertain (keyword: “VUCA”) and you want to set up your IT systems and organization (or your whole company) in a way that can successfully deal with all kinds of uncertainty, i.e., surprises.
Note that, like the whole journey, this table is also a reasoning model. It leaves out details to support (or even enable) the decision making process. Your reality will be more nuanced. This means that while this model may support you in finding a general direction, you may need to fine-tune your individual path based on your concrete needs.
Also note that at least for the lower plateaus, it is possible to decide for each system if you need to strive for robustness or if stability is sufficient. You do not necessarily make a single decision for the whole IT landscape.
However, if you go for actual resilience, it will also strongly affect your IT organization. This means, if you go up that high, you most likely make a decision for your whole IT landscape. But you still may decide per system how much effort you want to put into implementing robustness and resilience because most likely not all your systems will be business- or safety-critical.
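The recommendations above can be condensed into a small decision helper. The following Python sketch is only one possible reading of them, not a definitive rule set. All names (`Stop`, `SystemProfile`, `minimum_stop`) are made up for illustration, and the 99.9% threshold is the rule of thumb from above, not a hard limit:

```python
from dataclasses import dataclass
from enum import IntEnum


class Stop(IntEnum):
    """Interim stops of the journey, ordered from bottom to top."""
    FEATURE_COMPLETENESS = 0
    STABILITY = 1
    ROBUSTNESS = 2
    BASIC_RESILIENCE = 3
    ADVANCED_RESILIENCE = 4


@dataclass
class SystemProfile:
    target_availability: float            # e.g. 0.999 for "three nines"
    distributed_internally: bool = False  # e.g. service-based architecture
    mission_critical: bool = False
    safety_critical: bool = False
    highly_uncertain_environment: bool = False  # keyword: "VUCA"
    planned_downtime_possible: bool = True


def minimum_stop(profile: SystemProfile) -> Stop:
    """Lowest stop that still seems acceptable for the given system."""
    # Safety or high uncertainty: surprises must be handled reliably.
    if profile.safety_critical or profile.highly_uncertain_environment:
        return Stop.BASIC_RESILIENCE
    # Stability only suffices if all of the relaxed conditions hold.
    stability_is_enough = (
        profile.target_availability < 0.999
        and not profile.distributed_internally
        and not profile.mission_critical
        and profile.planned_downtime_possible
    )
    return Stop.STABILITY if stability_is_enough else Stop.ROBUSTNESS


# Hypothetical landscape: the decision is made per system.
landscape = {
    "internal reporting": SystemProfile(target_availability=0.99),
    "web shop": SystemProfile(target_availability=0.9995, mission_critical=True),
    "device control": SystemProfile(target_availability=0.999, safety_critical=True),
}
for name, profile in landscape.items():
    print(f"{name}: aim at least for {minimum_stop(profile).name}")
```

Note that the helper never returns the valley of feature-completeness or the peak of advanced resilience. The valley is not recommended anymore, and the decision to aim for the peak is a company-level decision that cannot be derived from the properties of a single system.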
Where are we now?
While you prepare for your journey, you may ask yourself where you currently are.
We introduced a series of reasoning tools that can also be used to reason about our current positioning.
We looked at the
- Impact radius
- IT collaboration model
- Treatment of availability
- Failure types considered
- Consideration of surprises
- System level regarded
- Resilience response types
As shown in the table above, we can use these tools to determine our current positioning regarding our journey towards resilience:
- If only Ops cares about availability (no IT collaboration), we are still in the valley of feature-completeness.
- If Dev also cares about availability, but the primary driver is to maximize MTTF (avoid failures), we are at the plateau of stability.
- If we embrace failures, i.e., also think about minimizing MTTR to maximize availability overall, we are at the plateau of robustness.
- As soon as we accept that surprises are inevitable and prepare for them by setting up the whole socio-technical system accordingly, we are at the high-plateau of basic resilience.
- If we extend our resilience response types to include adaptation and transformation, i.e., if we embrace surprises as an opportunity to improve and build anti-fragility, we are at the peak of advanced resilience.
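The difference between only maximizing MTTF and also minimizing MTTR can be made concrete with the common steady-state approximation, availability = MTTF / (MTTF + MTTR). The following Python sketch uses this formula with purely hypothetical numbers:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: share of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours on average, takes 4 hours to recover.
baseline = availability(mttf_hours=1000, mttr_hours=4)
doubled_mttf = availability(mttf_hours=2000, mttr_hours=4)  # fail half as often
halved_mttr = availability(mttf_hours=1000, mttr_hours=2)   # recover twice as fast

print(f"baseline:     {baseline:.4%}")
print(f"doubled MTTF: {doubled_mttf:.4%}")
print(f"halved MTTR:  {halved_mttr:.4%}")
```

Under this formula, halving the recovery time improves availability exactly as much as doubling the time between failures. In practice, recovering faster is often the cheaper of the two options, which is one reason why embracing failures pays off.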
Moving further up Mt. Resilience also widens the impact radius from a purely technical consideration (stability) over business and process considerations (robustness) and organizational considerations (basic resilience) to the organizational boundaries (advanced resilience). However, I think it is easier to determine the current positioning regarding resilience using the other reasoning tools as described before.
Again, all these tools and their embedding in the journey form a reasoning model, deliberately leaving out details and thus simplifying reality. Your actual positioning may be a bit more subtle. It is even possible that you already do things I only included at higher levels while you did not yet implement something from a lower level. Still, these models should help you to determine your basic positioning which you can use as a starting point.
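The positioning rules listed above can also be written down as a simple checklist. The following Python sketch is a deliberately crude interpretation: it assumes that the criteria build on each other and stops at the first one that is not met. The function and parameter names are made up for illustration:

```python
def current_position(
    dev_cares_about_availability: bool,
    embraces_failures: bool,
    prepares_for_surprises: bool,
    adapts_and_transforms: bool,
) -> str:
    """Highest stop for which this criterion and all lower criteria are met."""
    criteria = [
        (dev_cares_about_availability, "plateau of stability"),
        (embraces_failures, "plateau of robustness"),
        (prepares_for_surprises, "high-plateau of basic resilience"),
        (adapts_and_transforms, "peak of advanced resilience"),
    ]
    position = "valley of feature-completeness"
    for criterion_met, stop in criteria:
        if not criterion_met:
            break  # a missing lower-level criterion ends the climb
        position = stop
    return position


# Hypothetical company: Dev cares about availability and failures are embraced,
# but surprises are not yet part of the considerations.
print(current_position(True, True, False, False))
```

Because the sketch stops at the first unmet criterion, a company that already does some things from a higher level while missing something from a lower level is positioned at the lower level. This is a simplification you may want to handle differently for your own assessment.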
Moving on
We have come a long way. We started with the question what resilience actually means because we see many people using the term in very different ways. This led us to the question how to become resilient. We started at a typical division-of-labor focused IT organization where Dev is responsible for implementing business features and Ops is responsible for running the IT system landscape reliably. Then we moved towards resilience step by step until we eventually reached the peak of Mt. Resilience.
Additionally, we discussed whether we always need to go the full nine yards, i.e., always strive for full resilience, and learned that, depending on our needs, it can be sufficient to end the journey at one of the interim stops. While this usually does not provide actual resilience, it may be enough for the given context.
Finally, we used the different reasoning tools we introduced along the way as a means to determine our current positioning on our journey towards resilience.
Again, it was a prototypical journey, leaving out many details and subtleties. Otherwise, the journey’s path probably would have turned into a maze and we all would be lost in there up to now. Reality is always a bit of a maze and your individual journey will almost certainly deviate at least a bit from the one I sketched. However, the prototypical journey should be useful to determine the relevant steps needed to move closer towards resilience.
We have reached the end of this blog series. I have to admit that it became longer than I expected. But then there are so many things that are important to understand when it comes to resilience. I still left out a lot of things that also would be worth writing down – maybe I will do it in some post(s) in the future.
I hope the series helped you to understand resilience better, why it is useful, why it is relevant and what it takes to get there (or at least to robustness). Personally, I think resilience will become an increasingly essential property of IT systems, IT organizations and whole companies as everything becomes more and more complex and uncertainty and surprises become the new normal. In many places, they already are.
Probably the biggest obstacle on our journey towards resilience is the unbridled and usually short-sighted obsession with efficiency that rules most companies. It leads in the opposite direction, to highly rigid and fragile systems and organizations that lack the ability to cope with adverse events and situations, no matter if they are foreseeable or not. This opposing force already hits to a certain degree when it comes to robustness. But it unfolds its full detrimental power when it comes to resilience.
So, the journey will not be easy. However, I am convinced it is a vital one for almost every IT organization and company. Thus, I hope to meet you on the path sometime soon …
1. This is also known as “Bonini’s paradox”, sometimes also called “Valéry’s paradox”, named after Paul Valéry who was one of the first observers of this paradox.