Network Working Group                             K. P. Birman (Cornell)
Request for Comments: 992                         T. A. Joseph (Cornell)
                                                           November 1986
On Communication Support for Fault Tolerant Process Groups
K. P. Birman and T. A. Joseph
Dept. of Computer Science, Cornell University
Ithaca, N.Y. 14853
607-255-9199
1. Status of this Memo.
This memo describes a collection of multicast communication primi-
tives integrated with a mechanism for handling process failure and
recovery. These primitives facilitate the implementation of fault-
tolerant process groups, which can be used to provide distributed
services in an environment subject to non-malicious crash failures.
Unlike other process group approaches, such as Cheriton's "host
groups" (RFCs 966, 988, [Cheriton]), our approach provides powerful
guarantees about the behavior of the communication subsystem when
process group membership is changing dynamically, for example due to
process or site failures, recoveries, or migration of a process from
one site to another. Our approach also addresses delivery ordering
issues that arise when multiple clients communicate with a process
group concurrently, or a single client transmits multiple multicast
messages to a group without pausing to wait until each is received.
Moreover, the cost of the approach is low. An implementation is
being undertaken at Cornell as part of the ISIS project.
Here, we argue that the form of "best effort" reliability provided by
host groups may not address the requirements of those researchers who
are building fault tolerant software. Our basic premise is that
reliable handling of failures, recoveries, and dynamic process
migration is an important aspect of programming in distributed
environments, and that communication support that provides unpredictable
behavior in the presence of such events places an unacceptable burden
of complexity on higher level application software. This complexity
does not arise when using the fault-tolerant process group alterna-
tive.
This memo summarizes our approach and briefly contrasts it with other
process group approaches. For a detailed discussion, together with
figures that clarify the details of the approach, readers are re-
ferred to the papers cited below.
Distribution of this memo is unlimited.
Birman & Joseph