What is fault tolerance in a distributed system?


Definition of fault tolerance:

Fault tolerance is the property of a system that lets it keep working when a fault occurs.

A distributed system is a system whose components run on different machines, often in different locations, that work together to perform a task. For example, Google's services run on a distributed system: servers located in different countries cooperate to handle each request.

Early computer systems were not distributed and did not share computing resources. Today, many systems are distributed: their machines run independently but cooperate on a common task. Suppose a system has 10 nodes and one of them develops a fault. The other 9 nodes take over the faulty node's work, and the user notices no disruption. Large e-commerce stores run this way. If one server fails, the remaining servers keep the website running, and customers can continue purchasing products.
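The ten-node scenario above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Node` class, the `serve_with_failover` function, and the node names are hypothetical, and a real system would also need health checks, load balancing, and state replication.

```python
class Node:
    """A hypothetical worker node that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        # An unhealthy node refuses the request, simulating a fault.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} served {request}"


def serve_with_failover(nodes, request):
    """Try each node in order and return the first successful response."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this node is faulty, so move on to the next one
    raise RuntimeError("all nodes failed")


nodes = [Node(f"node-{i}") for i in range(10)]
nodes[0].healthy = False  # simulate a fault on one of the ten nodes
result = serve_with_failover(nodes, "checkout")
print(result)  # node-1 served checkout
```

The caller never sees the fault on `node-0`; it only sees a successful response from the next healthy node.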

Three kinds of problems occur in distributed systems:

  • Failures
  • Errors
  • Faults

A fault is a defect or problem in some part of the system. If we inspect the system after a fault occurs, we find errors: incorrect states in the code or in the way the system operates. A failure is the state in which the system cannot deliver its expected outcome because of those faults and errors.

A fault can be either transient or permanent.

Permanent vs Transient Fault:

  • As the names suggest, a permanent fault persists until it is repaired, whereas a transient fault lasts only a short time.
  • Permanent faults can affect the system badly, whereas transient faults usually have a smaller impact.
  • A permanent fault is easy to identify, but a transient fault is difficult to identify.
  • An example of a permanent fault is a node that becomes unavailable. Examples of transient faults include temporary processor, network, and media faults.
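One common way to act on this distinction is to retry an operation after a transient fault but give up immediately on a permanent one. The sketch below assumes two hypothetical exception types, `TransientFault` and `PermanentFault`, and a hypothetical helper `call_with_retry`. Real systems usually have to infer the kind of fault from timeouts or error codes.

```python
import time


class TransientFault(Exception):
    """Hypothetical marker for faults that may clear on their own."""


class PermanentFault(Exception):
    """Hypothetical marker for faults that will not clear without repair."""


def call_with_retry(operation, attempts=3, base_delay=0.01):
    """Retry on transient faults with exponential backoff.

    Permanent faults are not caught, so they propagate on the first try.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientFault:
            if attempt == attempts - 1:
                raise  # still failing after the last attempt
            time.sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_read():
    # Fails twice with a transient fault, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientFault("network timeout")
    return "data"


value = call_with_retry(flaky_read)
print(value, calls["count"])  # data 3
```

Retrying a permanent fault only wastes time, which is why the helper lets `PermanentFault` pass through untouched.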

A fault occurs when a hardware or software component stops working. It can also occur when malicious code enters the system through unauthorized access.

We need fault tolerance in distributed systems to ensure reliability, availability, and security.

Reliability means the system keeps working continuously without problems. Availability means the system is ready to serve its users, so that data keeps flowing between the system and the user. Security means that no unauthorized user can access the data.

Examples of systems that need fault tolerance:

  • Air traffic control
  • Banking services
  • Patient monitoring system

