Network Working Group                                       J. McQuillan
Request for Comments: 528                                        BBN-NET
NIC: 17164                                                  20 June 1973
SOFTWARE CHECKSUMMING IN THE IMP AND NETWORK RELIABILITY
As the ARPA Network has developed over the last few years, and our
experience with operating the IMP subnetwork has grown, the issue of
reliability has assumed greater importance and greater complexity.
This note describes some modifications that have recently been made
to the IMP and TIP programs in this regard. These changes are
mechanically minor and do not affect Host operation at all, but they
are logically noteworthy, and for this reason we have explained the
workings of the new IMP and TIP programs in some detail. Host
personnel are advised to note particularly the modifications
described in sections 4 and 5, as they may wish to change their own
programs or operating procedures.
1. A Changing View of Network Reliability
Our idea of the Network has evolved as the Network itself has grown.
Initially, it was thought that the only components in the network
design that were prone to errors were the communications circuits,
and the modem interfaces in the IMPs are equipped with a CRC checksum
to detect "almost all" such errors. The rest of the system,
including Host interfaces, IMP processors, memories, and interfaces,
was considered to be error-free. We have had to re-evaluate
this position in the light of our experience. In operating the
network we are faced with the problem of having to perform remote
diagnosis on failures which cannot easily be classified or
understood. Some examples of such problems include reports from Host
personnel of lost RFNMs and lost Host-Host protocol allocate
messages, inexplicable behavior in the IMP of a transient nature,
and, finally, the problem of crashes -- the total failure of an IMP,
perhaps affecting adjacent IMPs. These circumstances are infrequent
and are therefore difficult to correlate with other failures or with
particular attempted remedies. Indeed, it is often impossible to
distinguish a software failure from a hardware failure.
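
As an illustration of why a checksum applied only at the modem
interfaces leaves the rest of the system unprotected, the following is
a minimal sketch in Python, not the IMP code. It assumes a hypothetical
one-byte destination Host number at the front of each message and uses
the standard library's zlib.crc32 as a stand-in for the hardware CRC;
the names send_over_circuit and flip_bit are invented for this example.

```python
import zlib


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (a 'picked' or 'dropped' bit)."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def send_over_circuit(payload: bytes, flip_bit_on_line=None) -> bytes:
    """Hypothetical single hop: CRC appended on send, verified on receive.

    The CRC protects the message only while it is on the circuit.
    """
    frame = payload + zlib.crc32(payload).to_bytes(4, "big")
    if flip_bit_on_line is not None:
        frame = flip_bit(frame, flip_bit_on_line)  # simulated line noise
    body, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(body) != crc:
        raise ValueError("circuit error detected by CRC")
    return body


# Assumed layout: first byte is the destination Host number (5 here).
message = bytes([5]) + b"hello"

# Case 1: a bit is damaged on the circuit; the receiving side detects it.
try:
    send_over_circuit(message, flip_bit_on_line=3)
    line_error_detected = False
except ValueError:
    line_error_detected = True

# Case 2: the message arrives intact, is damaged while held in memory,
# and is then forwarded. The next hop computes a fresh CRC over the
# already-damaged data, so the check passes.
in_memory = send_over_circuit(message)
in_memory = flip_bit(in_memory, 1)        # destination 5 becomes 7
delivered = send_over_circuit(in_memory)

print(line_error_detected, delivered[0], delivered[1:])
```

In this sketch the error on the circuit is caught, while the error
introduced between hops is delivered without complaint to a different
destination, since each hop checks only what it was handed.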
In post-mortem analysis of crashes, we have sometimes found incorrect
instructions in the IMP program -- sometimes with just one or two
bits picked or dropped. Clearly, memory errors can account for
almost any failure, not only program crashes but also data errors
which can lead to many other syndromes. For instance, if the address
of a message is changed in transit, then one Host thinks the message
was lost, and another Host may receive an extra message. Errors of
this kind fall into two general classes: errors in Host messages,