As you venture into more complex architectures, there are a few things engineers often misunderstand when setting up consensus-based systems. Understanding quorums and fault tolerance is critical to reasoning about such systems.

The problems in these systems are numerous, and they often involve state, which complicates things further. Generally, when designing systems like this, avoiding state management where possible simplifies the design and leads to more robust systems.

So let's take the example of a distributed data system whose actions need to be recoverable in the event of a crash. Distributing the data allows us to recover from single failures, but do we need every server to hold a copy to ensure the data is available? That would undoubtedly be wasteful. It would also reduce the overall throughput of the system, since every node would have to acknowledge each action taken, and latency would grow as the cluster size increases.
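To make the cost of waiting on every replica concrete, here is a minimal sketch in Python. The `simulate_write` helper and the per-replica latencies are made up for illustration; the point is only to contrast a write that waits for every acknowledgement with one that waits for just a majority:

```python
import random

def simulate_write(num_replicas: int, wait_for: int, seed: int = 0) -> float:
    """Simulate one write: each replica acks after a random delay (ms);
    the write completes once `wait_for` acks have arrived."""
    rng = random.Random(seed)
    # Hypothetical per-replica ack latencies, with one slow straggler.
    latencies = [rng.uniform(1.0, 5.0) for _ in range(num_replicas - 1)] + [50.0]
    return sorted(latencies)[wait_for - 1]

n = 5
majority = n // 2 + 1  # 3 of 5
print("wait for all acks:   %.1f ms" % simulate_write(n, n))         # dominated by the straggler
print("wait for a majority: %.1f ms" % simulate_write(n, majority))  # ignores the slowest nodes
```

With a five-node cluster, the majority write finishes as soon as the three fastest replicas respond, while the all-replicas write is held hostage by the single slow straggler.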

What do we do if we can't reach other nodes in the cluster? How many failures can we tolerate?
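A common way to answer both questions is with a majority quorum: out of N nodes, any set of floor(N/2) + 1 nodes overlaps with any other majority, so the cluster can keep making progress as long as a majority is reachable. A quick sketch of the arithmetic:

```python
def majority_quorum(cluster_size: int) -> int:
    """Smallest group of nodes guaranteed to overlap any other majority."""
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can fail while a majority quorum remains reachable."""
    return cluster_size - majority_quorum(cluster_size)

for n in (3, 4, 5, 7):
    print(f"N={n}: quorum={majority_quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Notice that an even-sized cluster tolerates no more failures than the odd-sized cluster one node smaller, which is why clusters of 3, 5, or 7 nodes are the usual recommendation.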
