The Document Architecture for the Cornell Digital Library



RFC 1691               CDL Document Architecture             August 1994


Document Architecture Overview

   Just as a conventional library contains books rather than pages, so
   the electronic library must contain documents rather than images.
   During the scanning process, images are automatically linked into
   documents by creating document structure files which order the image
   files in the same way the binding of a book orders the pages.  Thus,
   the digital book as currently configured consists of two parts: a set
   of individual pages stored as discrete bit map image files, and the
   document structure files which "bind" the image files into a
   document.  In addition, a database entry is made for each digital
   document, permitting searches by author and title (i.e., by
   bibliographic information).

   Beyond the order of the pages, the arrangement of a physical book
   provides information to readers.  The title page and publication
   information come first; the table of contents usually precedes the
   text; the text is divided into sections or chapters; if there is an
   index, it follows the text.  The reader often refers to these
   components of a book when browsing the library shelves, in order to
   determine whether to read the book.

   The document structure provides direct access to the components of an
   electronic document, storing the information that would otherwise be
   lost when the book is disbound for scanning.
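
   As an illustration only, the two-part digital book described above
   (ordered page images plus a structure that "binds" them and names
   their components) might be modeled as in the sketch below.  The
   class names, field names, and sample values are all hypothetical;
   this is not the structure file format itself, merely one way to
   picture ordered pages with direct access to components.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A named part of a book (hypothetical model, not the CDL format)."""
    name: str        # e.g. "title page", "table of contents", "index"
    first_page: int  # 1-based position in the page order
    last_page: int   # inclusive


@dataclass
class Document:
    """Ordered page images plus the bibliographic data used for searching."""
    author: str
    title: str
    pages: List[str]  # image file names, in binding order
    components: List[Component] = field(default_factory=list)

    def pages_of(self, name: str) -> List[str]:
        """Return the image files that make up the named component."""
        for component in self.components:
            if component.name == name:
                return self.pages[component.first_page - 1:component.last_page]
        raise KeyError(name)


# Sample values below are invented purely for illustration.
book = Document(
    author="A. Author",
    title="An Example Book",
    pages=[f"{n:04d}.TIF" for n in range(1, 9)],
    components=[
        Component("title page", 1, 1),
        Component("table of contents", 2, 3),
        Component("text", 4, 7),
        Component("index", 8, 8),
    ],
)

print(book.pages_of("table of contents"))
```

   A reader browsing such a document could jump straight to the table
   of contents or the index without paging through every image, which
   is the kind of direct access the document structure is meant to
   provide.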

Document Architecture Requirements

   Listed below are the requirements that were initially set down for
   the Cornell Digital Library Architecture.

   1. The architecture must be open (i.e., published and freely
      available).

   2. The architecture should be as simple as possible (to facilitate
      product development).

   3. The architecture should assume data storage in UNIX file systems.

   4. The architecture should allow for standard data usage, such as via
      FTP and Gopher servers.  That is, the pages of a document must
      exist in a single directory, and the naming convention used must
      order them in the standard collating sequence, as in the series
      "0001.TIF, 0002.TIF,..., 0411.TIF".  (NOTE: a series such as
      "1.TIF, 2.TIF,..., 10.TIF" would be ordered "1.TIF, 10.TIF,
      2.TIF,..." which is not acceptable.)

   5. The architecture should provide for storing the same information
      in different formats, for example, when a page of a document is
      available at several different resolutions.
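
   The naming convention in requirement 4 can be checked with a short
   sketch.  It assumes that the "standard collating sequence" behaves
   like a plain character-by-character comparison of ASCII file names,
   which is how Python compares strings.  The helper name
   page_file_name and its parameters are hypothetical; only the file
   name patterns are taken from the requirement.

```python
def page_file_name(page_number: int, width: int = 4, ext: str = "TIF") -> str:
    """Return a zero-padded page file name such as '0001.TIF'.

    Hypothetical helper: the width of 4 matches the example series
    in requirement 4 but is otherwise an assumption.
    """
    if page_number < 1 or page_number >= 10 ** width:
        raise ValueError("page number does not fit in the padded width")
    return f"{page_number:0{width}d}.{ext}"


# Ten pages named with and without zero padding.
padded = [page_file_name(n) for n in range(1, 11)]
unpadded = [f"{n}.TIF" for n in range(1, 11)]

# sorted() compares the names character by character.
print(sorted(padded) == padded)      # padded names stay in page order
print(sorted(unpadded) == unpadded)  # unpadded names do not
print(sorted(unpadded)[:3])          # the first three names after sorting
```

   With padding, a directory listing from an FTP or Gopher server
   presents the pages in reading order without any help from the
   client.  Without it, "10.TIF" sorts ahead of "2.TIF", as the note
   in requirement 4 warns.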



Turner