On communication support for fault tolerant process groups




RFC 992                                                    November 1986


2. Acknowledgments

   This memo was adapted from a paper presented at the Asilomar workshop
   on fault-tolerant distributed computing, March 1986, and summarizes
   material from a technical report that was issued by Cornell Universi-
   ty, Dept. of Computer Science, in August 1985, which will appear in
   ACM Transactions on Computer Systems in February 1987 [Birman-b].
   Copies of these papers, and other relevant papers, are available on
   request from the author: Dept. of Computer Science, Cornell
   University, Ithaca, New York 14853.  The ISIS project also maintains
   a mailing list.  To be added to this list, contact M. Schmizzi.

   This work was supported by the Defense Advanced Research Projects
   Agency (DoD) under ARPA order 5378, Contract MDA903-85-C-0124, and by
   the National Science Foundation under grant DCR-8412582.  The views,
   opinions and findings contained in this report are those of the au-
   thors and should not be construed as an official Department of De-
   fense position, policy, or decision.

3. Introduction

   At Cornell, we recently completed a prototype of the ISIS system,
   which transforms abstract type specifications into fault-tolerant
   distributed implementations, while insulating users from the mechan-
   isms by which fault-tolerance is achieved.  This version of ISIS, re-
   ported in [Birman-a], supports transactional resilient objects as a
   basic programming abstraction.  Our current work undertakes to pro-
   vide a much broader range of fault-tolerant programming mechanisms,
   including fault-tolerant distributed bulletin boards [Birman-c] and
   fault-tolerant remote procedure calls on process groups [Birman-b].
   The approach to communication that we report here arose as part of
   this new version of the ISIS system.

   Unreliable communication primitives, such as the multicast group
   communication primitives proposed in RFCs 966 and 988 and in
   [Cheriton], leave some uncertainty in the delivery status of a
   message when failures and other exceptional events occur during
   communication.  Rather than guaranteeing delivery, they provide a
   form of "best effort" delivery, with the possibility that some member
   of a group of processes did not receive the message if the group
   membership was changing just as communication took place.  When we
   tried to use this sort of primitive in our original work on ISIS,
   which must behave reliably in the presence of such events, we had to
   address this aspect at an application level.
   The resulting software was complex, difficult to reason about, and
   filled with obscure bugs, and we were eventually forced to abandon
   the entire approach as infeasible.
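
   The delivery uncertainty described above can be illustrated with a
   small simulation.  This is only a sketch under stated assumptions:
   the Process class, the best_effort_multicast function, and the
   scenario in which a process joins during a send are all hypothetical
   and are not taken from ISIS or from the cited multicast proposals.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical group member that records every message it receives."""
    name: str
    inbox: list = field(default_factory=list)


def best_effort_multicast(members_at_send, message):
    """Deliver to the membership snapshot taken when the send began.

    There are no acknowledgments or retries, so the sender learns nothing
    about members that joined after the snapshot was taken.
    """
    for process in members_at_send:
        process.inbox.append(message)


# Assumed scenario: p3 joins the group while the multicast is in progress.
p1, p2, p3 = Process("p1"), Process("p2"), Process("p3")
group = [p1, p2]

snapshot = list(group)   # the sender's view of the group when it starts
group.append(p3)         # membership changes concurrently
best_effort_multicast(snapshot, "update-1")

# Current members that never received the message.
missed = [p.name for p in group if "update-1" not in p.inbox]
print(missed)  # prints ['p3']
```

   In this sketch the sender has no way to detect that p3 missed the
   message, so any recovery would have to be handled by the application
   itself.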

   A wide range of reliable communication primitives have been proposed
   in the literature, and we became convinced that by using them, the
   complexity of our software could be greatly reduced.  These range



Birman & Joseph