Definition: the system continues to operate despite an arbitrary number of messages being dropped or delayed by the network.
In reality, no distributed system is safe from partitions. They can occur because of problems in the network, or simply because a node goes down for one reason or another. In any of these cases, communication within the system is interrupted, and the interruption can be short or long-lived.
The CAP theorem says more about how a distributed system should deal with partitions.

Fault Tolerance is not the same as Resiliency. The two terms are sometimes used interchangeably, but they are different.

Fault Tolerance

Fault tolerance is the ability of a system to survive (tolerate) a fault when it occurs, e.g. a server crash or a network partition.
There may be a temporary drop in overall performance, but system features are not affected.
Mechanisms such as checkpoint/restore and replicated state machines can provide this.
The system usually has the ability to detect faults by itself and fail over.
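
As a rough illustration of the checkpoint/restore mechanism mentioned above, here is a minimal sketch in Python. Everything in it is hypothetical and made up for this note: the `CheckpointedCounter` class, the JSON state file, and the `crash_at` parameter that simulates a crash. It only shows the idea that a restarted worker resumes from its last saved state instead of starting over.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Hypothetical worker that sums a list and checkpoints after every item."""

    def __init__(self, path):
        self.path = path
        self.next_index = 0
        self.total = 0
        self._restore()

    def _restore(self):
        # On startup, resume from the last checkpoint if one exists.
        if os.path.exists(self.path):
            with open(self.path) as f:
                state = json.load(f)
            self.next_index = state["next_index"]
            self.total = state["total"]

    def _checkpoint(self):
        # Write to a temporary file, then rename it over the old checkpoint,
        # so a crash mid-write does not leave a half-written state file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": self.next_index, "total": self.total}, f)
        os.replace(tmp, self.path)

    def process(self, items, crash_at=None):
        while self.next_index < len(items):
            if crash_at is not None and self.next_index == crash_at:
                raise RuntimeError("simulated crash")
            self.total += items[self.next_index]
            self.next_index += 1
            self._checkpoint()
        return self.total


def run_with_crash_and_restart(items, crash_at):
    """Crash one worker, start a second one on the same checkpoint file."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "state.json")
        try:
            CheckpointedCounter(path).process(items, crash_at=crash_at)
        except RuntimeError:
            pass  # the first worker died here
        resumed = CheckpointedCounter(path)
        resumed_from = resumed.next_index
        return resumed_from, resumed.process(items)
```

Calling `run_with_crash_and_restart([1, 2, 3, 4, 5], crash_at=3)` makes the second worker pick up at index 3 and still produce the full sum. A real system would also have to decide how often to checkpoint and where to store the state so that it survives the loss of the machine.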

Example(s)

If one of N instances of a microservice sitting behind a reverse proxy such as nginx fails, the service is still available. However, throughput decreases. Hence the service is fault-tolerant.
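
The example above can be modelled with a toy in-process sketch in Python. This is not how nginx is implemented or configured; `Instance` and `RoundRobinProxy` are hypothetical names, and the "proxy" simply tries the next instance when one refuses a request. It is only meant to show why callers still get answers while one instance is down.

```python
class Instance:
    """Hypothetical service instance that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class RoundRobinProxy:
    """Toy reverse proxy: round-robin with failover to the next instance."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0

    def forward(self, request):
        # Try each instance at most once per request.
        for _ in range(len(self.instances)):
            instance = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next instance
        raise RuntimeError("no healthy instances available")
```

With three instances and one marked unhealthy, every call to `forward` still returns a response, but the work is now shared by two instances instead of three, which is the drop in throughput described above. Only when every instance is down does the proxy give up.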
