Stoa Consortium


1.8.1. TEI and Latin Texts



Our proposal to create a digital archive of Latin texts is not unique. Indeed, there are digital archives on the Web whose breadth and scope far exceed our modest selection of Erasmus' Colloquia. The Latin Library (www.thelatinlibrary.com) is an excellent example of such an archive. It houses a vast collection of texts and continues to grow thanks to people all over the world who submit electronic documents for inclusion. Not only does the Latin Library contain full-text versions of much of the Classical canon, but it also offers an impressive collection of Latin texts from the medieval period through the 20th century.

While the Latin Library is a valuable resource for direct access to a wide variety of material, it is also, from a technical standpoint, quite primitive. The Latin Library is essentially a group of HTML pages connected by hyperlinks. Because it is made up of a large number of separate files, the opportunity to automate the organization of the archive is limited. For example, the home page contains a manually created table listing all of the authors in alphabetical order. If the archive manager wants to add a text by a new author, he or she must create an entirely new table, shifting all of the other authors to fit the new author into the correct alphabetical position. Additionally, if the archive manager wants to modify the look and feel of the archive (e.g., by adding a logo or changing the background color), he or she must make these changes to each page individually, a daunting task with such a large archive.

By contrast, the present collection of Erasmus' Colloquia familiaria is marked up in XML, in compliance with the Text Encoding Initiative (TEI) guidelines for creating texts directly in, or converting them to, electronic form. We chose to create our texts in XML instead of HTML because we wanted to privilege the meaning and structure of the data contained in the document over the display of the document itself. The present archive, including the introductions, the interpretive questions, and the texts themselves, resides entirely within a single XML file that parses successfully and is organized in a structured hierarchy. This has a number of advantages. It enables us to accommodate new texts easily, without disrupting the overall structure of the archive. It allows us to generate the table of contents (and other navigational elements) automatically, which means that updating the TOC as new items are added takes a matter of seconds. It allows for global transformations in format. Finally, the privileging of structure over display means that the text can be delivered as HTML, as a PDF, or in any other document format the user may choose. Another key advantage of XML is that, unlike HTML, its status as a supported archival format increases the likelihood that this collection will be preserved over the long term.
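To make the point about automatic navigation concrete, here is a minimal sketch in Python of how a table of contents might be derived from a single structured file. It is not the stylesheet or tooling actually used for this collection. The element names (`text`, `body`, `div`, `head`) follow TEI conventions, but the namespace-free root element, the `id` values, and the choice of titles are assumptions made purely for illustration, as are the function names `build_toc`, `render_html`, and `render_text`.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily simplified archive. The ids and the selection of
# titles are illustrative and are not copied from the real collection file.
ARCHIVE = """
<TEI>
  <text>
    <body>
      <div type="colloquium" id="c1">
        <head>Naufragium</head>
        <p>...</p>
      </div>
      <div type="colloquium" id="c2">
        <head>Abbatis et eruditae</head>
        <p>...</p>
      </div>
    </body>
  </text>
</TEI>
"""


def build_toc(root):
    """Return (id, title) pairs for every colloquium div, in document order."""
    toc = []
    for div in root.iter("div"):
        if div.get("type") == "colloquium":
            title = div.findtext("head", default="(untitled)")
            toc.append((div.get("id"), title))
    return toc


def render_html(toc):
    """Deliver the table of contents as an HTML list of links."""
    items = "\n".join(
        f'  <li><a href="#{escape(ident)}">{escape(title)}</a></li>'
        for ident, title in toc
    )
    return f"<ul>\n{items}\n</ul>"


def render_text(toc):
    """Deliver the same table of contents as a plain numbered list."""
    return "\n".join(f"{n}. {title}" for n, (_, title) in enumerate(toc, start=1))


root = ET.fromstring(ARCHIVE)
toc = build_toc(root)
print(render_html(toc))

# Adding a new text means adding one more div; the TOC is simply rebuilt.
body = root.find("text/body")
new_div = ET.SubElement(body, "div", {"type": "colloquium", "id": "c3"})
ET.SubElement(new_div, "head").text = "Convivium religiosum"
updated_toc = build_toc(root)
print(render_text(updated_toc))
```

The same list of entries feeds both renderers, which is the practical meaning of privileging structure over display: the navigation is computed from the markup rather than maintained by hand, and the output format is chosen at delivery time.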

Additionally, the descriptive nature of XML enables us to embed information within the text itself, allowing for a greater level of searchability than is possible with an HTML file. While it is true that <META> tags in the <HEAD> area of an HTML file provide some opportunity for document description, they cannot describe a document at the same level of granularity as an XML file can. Again, if we examine the HTML-based Latin Library, it is apparent that the site has limited search functionality. In fact, the only ways to find a text are to use a drop-down menu organized by author, to read through the page looking for the author and text you want, or to search the page by keyword (Ctrl+F). This is inefficient, and it favors known-item searching over the ability to browse not only by author but also by other aspects such as "subject," "period," or "country."
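As a rough illustration of the kind of browsing that descriptive markup makes possible, the Python sketch below filters a small catalog by period, country, and subject. Everything in it is an assumption made for the sake of the example: the `archive` and `text` elements, the `period` and `country` attributes, and the sample entries are hypothetical and are neither TEI markup nor the structure of our own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: element names, attributes, and entries are invented
# for illustration and do not reflect the markup of the actual collection.
CATALOG = """
<archive>
  <text id="t1" period="renaissance" country="Netherlands">
    <author>Erasmus</author>
    <title>Colloquia familiaria</title>
    <subject>dialogues</subject>
    <subject>education</subject>
  </text>
  <text id="t2" period="classical" country="Italy">
    <author>Cicero</author>
    <title>De amicitia</title>
    <subject>dialogues</subject>
    <subject>friendship</subject>
  </text>
  <text id="t3" period="renaissance" country="England">
    <author>Thomas More</author>
    <title>Utopia</title>
    <subject>political philosophy</subject>
  </text>
</archive>
"""


def find_texts(root, period=None, country=None, subject=None):
    """Return the titles of texts matching every criterion that was given."""
    matches = []
    for text in root.iter("text"):
        if period is not None and text.get("period") != period:
            continue
        if country is not None and text.get("country") != country:
            continue
        subjects = [s.text for s in text.findall("subject")]
        if subject is not None and subject not in subjects:
            continue
        matches.append(text.findtext("title"))
    return matches


catalog_root = ET.fromstring(CATALOG)
print(find_texts(catalog_root, subject="dialogues"))
print(find_texts(catalog_root, period="renaissance", country="England"))
```

None of these queries is possible with a page-level keyword search, because the period, country, and subject of a text are not stated anywhere a browser's find function could distinguish them from ordinary words.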

For the time being, the promise of powerful searchability within the archive itself admittedly remains more potential than actual, even for our present collection. For example, we have declared "subject" tags at the collection level only, whereas we should provide the same for each individual colloquium. We also plan to tag real place names, proper adjectives, and the original publishing date of each dialogue, so that the texts can be searched according to these parameters. At the collection level, however, our file contains very detailed information about its author, its sources, various editorial choices, and publication information. This is because the TEI guidelines require the creator of a document to provide this sort of information within the TEI header, and they permit an even greater level of detail (we have, for example, included Library of Congress Subject Headings). Indeed, the TEI header essentially embeds cataloging information about the resource within the resource itself. If done correctly, this not only facilitates the discovery of and access to that resource, but could also make it possible for the resource to be housed in online databases, such as a library's catalog.
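The idea that the header carries catalog-ready information can be sketched as follows. This is a simplified illustration in Python, not a copy of our actual header: the elements shown (`teiHeader`, `fileDesc`, `titleStmt`, `publicationStmt`, `sourceDesc`, `profileDesc`, `textClass`, `keywords`, `term`) are TEI element names, but the header is pared down, the values are placeholders, and the function name `catalog_record` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Pared-down, illustrative header. The values are placeholders, and a real
# TEI header contains considerably more detail than is shown here.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Colloquia familiaria</title>
      <author>Erasmus, Desiderius</author>
    </titleStmt>
    <publicationStmt>
      <p>Placeholder for publication information.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Placeholder for the description of the source text.</p>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <textClass>
      <keywords scheme="LCSH">
        <term>Placeholder subject heading one</term>
        <term>Placeholder subject heading two</term>
      </keywords>
    </textClass>
  </profileDesc>
</teiHeader>
"""


def catalog_record(header):
    """Collect the fields a library catalog would most likely want."""
    subject_path = "profileDesc/textClass/keywords[@scheme='LCSH']/term"
    return {
        "title": header.findtext("fileDesc/titleStmt/title"),
        "author": header.findtext("fileDesc/titleStmt/author"),
        "subjects": [term.text for term in header.findall(subject_path)],
    }


record = catalog_record(ET.fromstring(HEADER))
for field, value in record.items():
    print(f"{field}: {value}")
```

Because the record is read directly out of the resource, a catalog entry built this way cannot drift out of step with the file it describes.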




Date: last revised 2003-12-18 Author: Jennifer K. Nelson.
This page is covered by a Creative Commons ShareAlike license.