The Document Architecture for the Cornell Digital Library
RFC 1691 CDL Document Architecture August 1994
Document Architecture Overview
Just as a conventional library contains books rather than pages, so
the electronic library must contain documents rather than images.
During the scanning process, images are automatically linked into
documents by creating document structure files which order the image
files in the same way the binding of a book orders the pages. Thus,
the digital book as currently configured consists of two parts: a set
of individual pages stored as discrete bit map image files, and the
document structure files which "bind" the image files into a
document. In addition, a database entry is made for each digital
document which permits searching by author and title (i.e.,
bibliographic information).
Beyond the order of the pages, the arrangement of a physical book
provides information to readers. The
title page and publication information come first; the table of
contents usually precedes the text; the text is divided into sections
or chapters; if there is an index, it follows the text. The reader
often refers to these components of a book when browsing the library
shelves, in order to determine whether to read the book.
The document structure provides direct access to the components of an
electronic document, storing the information that would otherwise be
lost when the book is disbound for scanning.
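As a rough illustration of the two-part arrangement described above,
the following Python sketch models a digital book as an ordered list
of page image files plus a structure that records where each component
begins. The class, field, and component names here are hypothetical,
chosen only for illustration; they are not the CDL structure file
format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalDocument:
    """Hypothetical in-memory model; NOT the CDL structure file format."""

    # Bibliographic information, as would be held in the database entry.
    author: str
    title: str
    # Page image files, listed in binding order.
    pages: List[str] = field(default_factory=list)
    # Component name -> index into `pages` of the component's first page.
    components: Dict[str, int] = field(default_factory=dict)

    def first_page_of(self, component: str) -> str:
        """Return the image file that opens the named component."""
        return self.pages[self.components[component]]


# Placeholder values; a six-page book with four components.
book = DigitalDocument(
    author="Example Author",
    title="Example Title",
    pages=[f"{n:04d}.TIF" for n in range(1, 7)],
    components={"title page": 0, "table of contents": 1, "text": 2, "index": 5},
)

print(book.first_page_of("index"))  # 0006.TIF
```

Keeping the component map separate from the page list mirrors the
split between image files and structure files: the images stay plain
files, and direct access to a component is a lookup in the structure.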
Document Architecture Requirements
Listed below are the requirements that were initially set down for
the Cornell Digital Library Architecture.
1. The architecture must be open (i.e., published and freely
available).
2. The architecture should be as simple as possible (to facilitate
product development).
3. The architecture should assume data storage in UNIX file systems.
4. The architecture should allow for standard data usage, such as via
FTP and Gopher servers (i.e., pages of a document must exist in a
single directory, and the naming convention used must order them
in the standard collating sequence, such as the series "0001.TIF,
0002.TIF,..., 0411.TIF"). NOTE: a series such as "1.TIF, 2.TIF,...,
10.TIF" would be ordered "1.TIF, 10.TIF, 2.TIF, ...", which is not
acceptable.
5. The architecture should provide for storing the same information
in different formats, for example when a page of a document is
available at several different resolutions.
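The naming convention in requirement 4 can be checked with a short
Python sketch. It assumes a fixed width of four digits and a "TIF"
extension, as in the example series; the helper name page_filename is
hypothetical. Python's sorted() compares strings character by
character, which stands in here for the standard collating sequence of
a directory listing.

```python
def page_filename(page_number: int, width: int = 4, extension: str = "TIF") -> str:
    """Build a zero-padded page file name, e.g. 1 -> '0001.TIF'."""
    return f"{page_number:0{width}d}.{extension}"


page_numbers = (1, 2, 10)

# Zero-padded names sort in page order.
padded = [page_filename(n) for n in page_numbers]
print(sorted(padded))    # ['0001.TIF', '0002.TIF', '0010.TIF']

# Unpadded names do not: '10.TIF' sorts before '2.TIF'.
unpadded = [f"{n}.TIF" for n in page_numbers]
print(sorted(unpadded))  # ['1.TIF', '10.TIF', '2.TIF']
```

With a width of four digits, the order holds for documents of up to
9999 pages; a longer document would need a wider field.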
Turner