Resilience
Recently, I had two experiences within a few days that made me think regarding system dependability. In both situations, the systems acted detached from their surrounding reality and thus became confusing or even annoying – even if it would have been easy for them to detect their reality detachment.
About a decade ago, Jeffrey Dean and Luiz AndrĂ© Barroso published their IMO great article “The tail at scale” in the Communications of the ACM . The article dives into the topic of latency tail-tolerance.
We have discussed the business case for resilient software design in my previous post. Let us assume, you have a budget and you know which are the most critical business processes/capabilities/interactions (whatever term suits your needs best) you need to secure, i.e., make more resilient.
The business case of resilience is a bit tricky. You find quite disparate forces at work: While some people tend to underrate the need and value of resilience a lot, other people find it hard to stop adding resilience measures. As so often, the sweet spot is somewhere in the middle.
I recently had a short discussion with a Product Owner after a talk I gave about resilient software design. His question was: “How can I motivate developers to care more about resilience?”