Resilience
About a decade ago, Jeffrey Dean and Luiz AndrĂ© Barroso published their IMO great article “The tail at scale” in the Communications of the ACM . The article dives into the topic of latency tail-tolerance.
We have discussed the business case for resilient software design in my previous post. Let us assume, you have a budget and you know which are the most critical business processes/capabilities/interactions (whatever term suits your needs best) you need to secure, i.e., make more resilient.
The business case of resilience is a bit tricky. You find quite disparate forces at work: While some people tend to underrate the need and value of resilience a lot, other people find it hard to stop adding resilience measures. As so often, the sweet spot is somewhere in the middle.
I recently had a short discussion with a Product Owner after a talk I gave about resilient software design. His question was: “How can I motivate developers to care more about resilience?”
I briefly mentioned the 100% availability trap in a prior post. As this misconception is so widespread, I decided to discuss it in more detail in this post.