This is a draft published here to elicit wider comment. Note that this is only a partial effort at what plainly could be a much longer and more detailed document. There are many issues to resolve and many points on which we need to be clear. Please feel free to ask questions or add comments on one or more of the various discussion lists.
This document is a general overview of the technical considerations that lie behind the Stoa's efforts to promote the widespread use of electronic resources in Humanistic study. Its intended audience is anyone considering authoring a document for publication by the Stoa. It is likely that this group includes people with a very wide range of technical skills and therefore it is difficult for a single document to serve as an introduction to current discussions of electronic publication and at the same time keep the attention of more experienced readers. The approach we take here is to describe general principles illustrated by straightforward examples and to provide pointers to further reading. Readers can therefore use this document either as a starting point for thinking about electronic publication or as a high-level description of the technical criteria by which the Stoa will evaluate submissions.
Because the technical criteria that this document establishes should contribute to the overall goals of the Stoa, these are reviewed here. As presented in the introductory web-page, the Stoa intends:
The second and third goals are the most directly relevant to this document. The World Wide Web has shown the promise and importance of interlinked electronic resources for scholarly collaboration. Taken together, projects such as Perseus, The Online Resource Book for Medieval Studies, Diotima, De Imperatoribus Romanis and Suda-On-Line are beginning to construct new modes of accessing and using humanistic materials. When thought of as a collection of sites on the Web, however, this new environment falls short of the well-established requirements of ongoing scholarly discourse in two important ways: neither the permanence of each resource nor the permanence of any links between these resources is guaranteed. Compare this with the current level of permanence provided by the storage of printed materials in libraries. By means of rare book departments, libraries provide access to materials that are hundreds of years old. Via interlibrary loan one can, ideally, track down a reference to any published work. Additionally, so long as a particular edition is cited, the author and reader will use exactly the same version. Though the specific processes may change, we expect this happy situation to last into the foreseeable future. This is not currently the case for the World Wide Web. Pages are updated, their addresses on the internet change, or they may cease to exist entirely, subject to the whims of personal, departmental, or institutional support. The Stoa hopes to encourage the creation of a new interlinked electronic environment without losing the advantages of the old printed one. This document describes one part of the process by which this will be accomplished.
One last introductory matter remains. The Stoa will publish any sort of document. Already planned are monographs, primary sources and translations, encyclopedias, interactive geographic resources, and archaeological databases. It is not possible to present succinctly the specific technical criteria by which each of these types will be evaluated. This again requires mixing statements of general principle with specific examples. Readers are therefore encouraged to extrapolate a general approach to electronic publication and to raise specific questions in the Stoa-sponsored public discussion forum.
An important principle to establish is that the Stoa will evaluate the structure and content of documents but not their appearance. The structure of a document refers to the divisions within a text or database that allow both navigation and reference. For printed books and journals, page numbers provide a convenient reference system that is usually combined with divisions such as chapters and sections, which are in turn indicated by bold-faced headings, italicized lines, or other visual cues. These divisions are also used to create external references into a text, such as "Chapter 1, p. 85". Clearly, electronic documents cannot be divided into pages. They also should not use visual cues to indicate structure. This second point may be less obvious but can easily be explained with a brief example that compares the markup of a book in both HTML and XML. [link to fuller discussion elsewhere of html and xml]
In HTML the structure of a very simple book might look like this:

    <html>
      <title>A Dog's Life</title>
      <body>
        <h1>Chapter 1</h1>
        <p>A paragraph.</p>
        <p>Another paragraph</p>
        <h1>Chapter 2</h1>
        <p>Yet another paragraph.</p>
      </body>
    </html>

An XML representation of the same book might be:

    <book title="A Dog's Life">
      <chapter num="1">
        <p>A paragraph.</p>
        <p>Another paragraph</p>
      </chapter>
      <chapter num="2">
        <p>Yet another paragraph.</p>
      </chapter>
    </book>

The most important difference between these two versions of the same document is that the XML explicitly marks both the beginning and end of each chapter and explicitly indicates that each chapter has a number. Clearly marking these features of a document's structure makes it easy to implement automated searches such as "List all chapters containing the word 'yet'." The need for clear markup increases for more complex documents that include footnotes, figure and table headings, appendices and other types of text. Using the formatting elements that HTML provides might well obscure this structure rather than making it plain. It is also important to note that the XML version can easily be converted into the simpler HTML version for reading with a web browser. The reverse, however, is not possible, because the HTML version does not record where each chapter ends or what its number is.
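The search mentioned above, "List all chapters containing the word 'yet'", can be sketched in a few lines. This is only an illustration, not a tool the Stoa provides: it assumes the sample XML book shown earlier and uses Python's standard `xml.etree.ElementTree` module, and the function name `chapters_containing` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The sample book from the text, repeated here so the sketch is self-contained.
BOOK_XML = """
<book title="A Dog's Life">
  <chapter num="1">
    <p>A paragraph.</p>
    <p>Another paragraph</p>
  </chapter>
  <chapter num="2">
    <p>Yet another paragraph.</p>
  </chapter>
</book>
"""


def chapters_containing(book_xml, word):
    """Return the 'num' attribute of each chapter whose text contains `word`.

    Matching is case-insensitive and on whole words (an assumption made
    for this sketch; a real search tool might behave differently).
    """
    root = ET.fromstring(book_xml)
    needle = word.lower()
    hits = []
    for chapter in root.iter("chapter"):
        # Gather all text inside the chapter, whatever elements it sits in.
        text = " ".join(chapter.itertext()).lower()
        if needle in re.findall(r"\w+", text):
            hits.append(chapter.get("num"))
    return hits


if __name__ == "__main__":
    print(chapters_containing(BOOK_XML, "yet"))
```

Because each chapter is an explicit element carrying its own number, the program never has to guess where a chapter starts or stops. The same search against the HTML version would have to infer chapter boundaries from heading tags.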
This example raises a crucial distinction between data and functionality that is central to the Stoa's mission. From a technical standpoint a Stoa publication is a representation of a document or database that readily supports multiple uses. Reading and browsing by end-users and sophisticated searching that takes account of structure are just two possible uses. Some readers may by now have grown comfortable with the idea of web sites as electronic publications. But in the system proposed by the Stoa, any web site is just one version of a Stoa publication, which must also be available as the original data that the site makes accessible and attractive. This separation of underlying data from any particular interface or mechanism for using it is a central component of the Stoa's effort to promote the longevity of electronic resources in the Humanities.
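To make the separation of data from interface concrete, the following sketch derives a browser-ready HTML view from the XML data of the sample book. It is a minimal illustration under stated assumptions: the function name `book_to_html` is hypothetical, the choice of `<h1>` for chapter headings is ours, and the `<head>` wrapper around the title is added for tidiness. It uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# The sample book from the text, repeated here so the sketch is self-contained.
BOOK_XML = """
<book title="A Dog's Life">
  <chapter num="1">
    <p>A paragraph.</p>
    <p>Another paragraph</p>
  </chapter>
  <chapter num="2">
    <p>Yet another paragraph.</p>
  </chapter>
</book>
"""


def book_to_html(book_xml):
    """Build one possible HTML presentation of the structured book data."""
    book = ET.fromstring(book_xml)

    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = book.get("title", "")
    body = ET.SubElement(html, "body")

    for chapter in book.findall("chapter"):
        # The chapter number lives in the data; the heading is presentation.
        ET.SubElement(body, "h1").text = "Chapter " + chapter.get("num", "")
        for para in chapter.findall("p"):
            ET.SubElement(body, "p").text = para.text

    return ET.tostring(html, encoding="unicode", method="html")


if __name__ == "__main__":
    print(book_to_html(BOOK_XML))
```

The HTML is generated from the data and can be regenerated at any time, so a change of presentation (different headings, a table of contents, one page per chapter) requires no change to the underlying document.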
Three more general criteria for the technical evaluation of Stoa documents are:
A basic expectation of a Stoa document is that its use not depend on any particular operating system, proprietary file format, or delivery mechanism. It would therefore not be appropriate to publish Stoa documents as Adobe Acrobat files, Microsoft Word documents, FileMaker database files, or even HTML files. Experience has repeatedly shown that such technologies have relatively short lives and that dependence upon them ultimately leads to unreadable data and wasted effort. Rigorous application of this principle will help ensure a document's longevity despite the continued rapid development of computer software.
Likewise, all Stoa documents should have an automatically parseable reference system. We have already seen that printed materials often base their reference systems on a combination of page numbers and other divisions such as chapters. The ability to refer to previously published work is so essential to scholarly discourse that Stoa documents must implement a reference system just as reliable and easy to use as that found in books. Fortunately, this is not hard to do. Databases should have unique keys for all records; encyclopedias should explicitly mark the title of each entry; and full-text documents such as scholarly monographs should have a clear structure of sections, perhaps even down to the numbering of paragraphs.
Why must these reference systems be automatically parseable? In discussing a sample XML document above we posited the trivial task of searching for all chapters containing the word 'yet'. In order for a search program to produce the correct list of chapters, it must be able to determine which chapter it is in when it finds the desired word. Likewise, when a user asks to read one of the chapters in that list, it must be possible for a computer program to locate the text of that chapter when it is given a reference to it.
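As a rough illustration of what "automatically parseable" means in practice, the sketch below builds a lookup table of references for the sample XML book and then resolves a reference back to its text. The reference syntax used here ("2" for a chapter, "2.1" for the first paragraph of chapter 2) and the function names `build_index` and `resolve` are assumptions made for this example, not a scheme the Stoa prescribes.

```python
import xml.etree.ElementTree as ET

# The sample book from the text, repeated here so the sketch is self-contained.
BOOK_XML = """
<book title="A Dog's Life">
  <chapter num="1">
    <p>A paragraph.</p>
    <p>Another paragraph</p>
  </chapter>
  <chapter num="2">
    <p>Yet another paragraph.</p>
  </chapter>
</book>
"""


def build_index(book_xml):
    """Map each reference string to the text it identifies.

    "N" refers to chapter N as a whole; "N.M" refers to paragraph M
    of chapter N (a hypothetical convention for this sketch).
    """
    root = ET.fromstring(book_xml)
    index = {}
    for chapter in root.findall("chapter"):
        num = chapter.get("num")
        if num is None or num in index:
            # A reference system only works if every key is present and unique.
            raise ValueError("missing or duplicate chapter number: %r" % num)
        paragraphs = [p.text or "" for p in chapter.findall("p")]
        index[num] = "\n".join(paragraphs)
        for i, text in enumerate(paragraphs, start=1):
            index["%s.%d" % (num, i)] = text
    return index


def resolve(index, ref):
    """Return the text a reference points to, or raise LookupError."""
    if ref not in index:
        raise LookupError("no such reference: %r" % ref)
    return index[ref]


if __name__ == "__main__":
    index = build_index(BOOK_XML)
    print(resolve(index, "2.1"))
```

The lookup is possible only because the chapter numbers are recorded in the markup itself. A reference that depended on visual layout, such as a page number in one particular printing, could not be resolved this way.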
The Stoa does not mandate the use of particular standards or procedures for the publication of online materials so much as it tries to support the development and use of practices and methods that will achieve the goal of longevity. Of course, the use of standards such as TEI-conformant SGML and XML for the representation and presentation of document structure, DOI for the identification of resources, and ISO-8859 and Unicode for character codes is fundamental to achieving longevity for electronic publications. It is therefore likely that most Stoa-sponsored projects will avail themselves of these pre-existing resources. However, technical excellence will ultimately be defined individually for each project, and flexibility is important if the Stoa is to include as wide a range of materials as possible.