Network Working Group                             K. P. Birman (Cornell)
Request for Comments: 992                         T. A. Joseph (Cornell)
                                                  November 1986



       On Communication Support for Fault Tolerant Process Groups

                     K. P. Birman and T. A. Joseph
             Dept. of Computer Science, Cornell University
                           Ithaca, N.Y. 14853
                              607-255-9199


1. Status of this Memo.

   This memo describes a collection of multicast communication primi-
   tives integrated with a mechanism for handling process failure and
   recovery.  These primitives facilitate the implementation of fault-
   tolerant process groups, which can be used to provide distributed
   services in an environment subject to non-malicious crash failures.
   Unlike other process group approaches, such as Cheriton's "host
   groups" (RFCs 966, 988, [Cheriton]), our approach provides powerful
   guarantees about the behavior of the communication subsystem when
   process group membership is changing dynamically, for example due to
   process or site failures, recoveries, or migration of a process from
   one site to another.  Our approach also addresses delivery ordering
   issues that arise when multiple clients communicate with a process
   group concurrently, or a single client transmits multiple multicast
   messages to a group without pausing to wait until each is received.
   Moreover, the cost of the approach is low.  An implementation is
   being undertaken at Cornell as part of the ISIS project.

   Here, we argue that the form of "best effort" reliability provided by
   host groups may not address the requirements of those researchers who
   are building fault-tolerant software.  Our basic premise is that
   reliable handling of failures, recoveries, and dynamic process
   migration is an important aspect of programming in distributed
   environments, and that communication support that behaves
   unpredictably in the presence of such events places an unacceptable
   burden of complexity on higher level application software.  This
   complexity does not arise when using the fault-tolerant process group
   alternative.

   This memo summarizes our approach and briefly contrasts it with other
   process group approaches.  For a detailed discussion, together with
   figures that clarify the details of the approach, readers are re-
   ferred to the papers cited below.

   Distribution of this memo is unlimited.




Birman & Joseph