Achieving Resilient Cloud Infrastructures: Removing Single Points of Failure

Hey there, let's talk about single points of failure (SPOFs). They're like the weak link in a chain: if one fails, the whole system goes down. A highly available or reliable system cannot have a SPOF. But don't worry, we can remove SPOFs by employing some clever techniques.

Here are a few ways to remove SPOFs:

Introduce Redundancy: Redundancy comes in two flavors. In standby redundancy, we add a secondary resource that takes over when the primary resource fails. The failover typically takes some time to complete, and during that window the resource is unavailable. Standby redundancy is the usual choice for stateful components. In active redundancy, requests are distributed across multiple nodes, and if a node fails, its workload is redistributed among the remaining healthy nodes. It's like having a backup plan in case your primary plan falls through.
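
To make active redundancy a bit more concrete, here's a minimal sketch of a round-robin pool that skips nodes marked as down. The class name `ActivePool` and its methods are made up for illustration; a real load balancer does far more than this.

```python
class ActivePool:
    """Round-robin over nodes, skipping any that are marked unhealthy.

    Hypothetical helper for illustration only, not a real load balancer API.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_down(self, node):
        # Stop sending traffic to a failed node.
        self.healthy.discard(node)

    def mark_up(self, node):
        # Put a recovered node back into rotation.
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        # Try each node at most once per call, in round-robin order.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")
```

With nodes "a", "b", and "c", calling `mark_down("b")` means later `pick()` calls alternate between "a" and "c", so the surviving nodes absorb the failed node's share of the work.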

Detect Failure: We should design good health checks for our backend nodes. We need to be able to detect when a node fails, so we can take appropriate action, such as routing traffic away from it or triggering a failover. It's like having a doctor who can detect when you're sick and prescribe the right treatment.
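
One common way to keep a health check from overreacting to a single blip is to require several consecutive failed probes before declaring a node down, and a few consecutive successes before bringing it back. Here's a rough sketch of that idea. The `HealthTracker` class and its threshold defaults are assumptions for illustration, and how you actually probe the node (HTTP request, TCP connect, and so on) is left out.

```python
class HealthTracker:
    """Tracks one node's health from a stream of probe results.

    Hypothetical helper: the thresholds are illustrative defaults,
    not values recommended by any particular tool.
    """

    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold  # consecutive failures before "down"
        self.rise_threshold = rise_threshold  # consecutive successes before "up"
        self.healthy = True
        self._fails = 0
        self._successes = 0

    def record(self, probe_ok):
        """Record one probe result and return the node's current health."""
        if probe_ok:
            self._fails = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.rise_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy
```

The trade-off is detection speed versus false alarms: lower thresholds notice failures sooner but flap more on transient errors.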

Durable Data Storage: Synchronous replication only acknowledges a transaction after it has been durably stored in both the primary location and its replicas. It is ideal for protecting the integrity of the data in the event of a failure of the primary node. In asynchronous replication, changes on the primary node are not immediately reflected on the replicas, so the replicas can lag behind and the most recent writes may be lost if the primary fails. That makes it best suited for horizontally scaling read capacity, for queries that can tolerate some replication lag. It's like having a safety deposit box where you store your valuables, so they're protected in case of a disaster.
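
The difference between the two modes is easiest to see in a toy model. The sketch below is an in-memory simulation only: `Primary`, `Replica`, and `flush` are hypothetical names, and "durably stored" is reduced to a dictionary write. It just shows when the replica sees a change relative to the acknowledgment.

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Toy primary node. Hypothetical model, not a real database API."""

    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # changes not yet shipped (async mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Sync: every replica has the change before we acknowledge.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Async: acknowledge now, replicate later.
            self.pending.append((key, value))
        return "ack"

    def flush(self):
        """Ship queued changes; stands in for background replication."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.apply(key, value)
        self.pending.clear()
```

In synchronous mode the replica already holds the value when `write` returns. In asynchronous mode the replica stays empty until `flush` runs, and anything still sitting in `pending` is what you'd lose if the primary died at that moment.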

If you want to learn more about these techniques, check out the AWS Whitepapers & Guides.