The business case of resilience is a bit tricky. You find quite disparate forces at work: While some people tend to underrate the need and value of resilience a lot, other people find it hard to stop adding resilience measures. As so often, the sweet spot is somewhere in the middle.
I recently had a short discussion with a Product Owner after a talk I gave about resilient software design. His question was: “How can I motivate developers to care more about resilience?”
I briefly mentioned the 100% availability trap in a prior post. As this misconception is so widespread, I decided to discuss it in more detail in this post.
In the previous post, we discussed why the imponderabilities of distributed systems will hit us at the application level and we cannot leave their handling to the operations teams as we did in the past.
In the previous post, we discussed availability, and how more nodes and the effects of remote communication affect it negatively. We learned that failures in today’s distributed, highly interconnected system landscapes are unavoidable and that we need to embrace them if we want to create highly available solutions.