Tolerating Transient and Permanent Failures

Efthymios Anagnostou and Vassos Hadzilacos

Self-stabilizing protocols are resilient to *transient
* failures affecting potentially *all* processors of a distributed
system. In contrast, fault-tolerant protocols are resilient to *permanent*
failures affecting some
*fraction* of processors. We investigate the possibility of designing
protocols that are both self-stabilizing and fault-tolerant, in the asynchronous
model of distributed systems. We prove that no such protocols exist for a wide
range of problems, and that this is so even if we allow randomization. The
problems that our impossibility results cover include, among others: Determining
(even approximately) the size of the distributed system; leader election;
mutual exclusion; computing a spanning tree; and computing a maximal independent
set of nodes. All of these problems are solvable in asynchronous systems using
(randomised) protocols that are only fault-tolerant (but not self-stabilizing);
or only self-stabilizing (but not fault-tolerant). We then focus on the problem
of computing distinct names for the processors in a ring. We give three
(randomised) protocols that solve this problem in three settings satisfying
increasingly weak assumptions: When processors know the exact size $n$ of the
ring; when they only know an upper bound on $n$; and when they have no
information about $n$.