Distributed Systems
Recently, AWS experienced one of its rare partial outages. Its DynamoDB service experienced a disruption in the US-East-1 region that could be tracked down to a latent race condition in the DynamoDB DNS management system which caused the disruption. A comprehensive post-event summary describing the outage, its cause and the resulting effects can be found here.
In the previous post we discussed what eventual consistency actually means and why we sometimes need to favor eventual consistency over strong consistency. We also saw that most of the time we will not perceive any differences between eventual and strong consistency if set up properly. The differences only become apparent if the system encounters adverse conditions like, e.g., a network partition, loss of a node, or alike.
Recently I saw a skeet on Bluesky:
We have discussed the business case for resilient software design in my previous post. Let us assume, you have a budget and you know which are the most critical business processes/capabilities/interactions (whatever term suits your needs best) you need to secure, i.e., make more resilient.
The business case of resilience is a bit tricky. You find quite disparate forces at work: While some people tend to underrate the need and value of resilience a lot, other people find it hard to stop adding resilience measures. As so often, the sweet spot is somewhere in the middle.