Stoa Consortium


1.8.4. Text Encoding Initiative



The Text Encoding Initiative (TEI) is an XML application for describing humanities texts. In 1987, a planning group sponsored by the Association for Computers and the Humanities (ACH) and funded by the National Endowment for the Humanities (NEH) began to discuss the feasibility of developing an encoding scheme for electronic texts that would meet the needs of scholars, archivists and other researchers in the humanities.[114] When the TEI Consortium became an official working group in 1988, it was funded not only by the NEH but also by the Commission of the European Communities, the Andrew W. Mellon Foundation and the Social Sciences and Humanities Research Council of Canada.[115] The consortium itself was made up of scholars and researchers, as well as representatives from the library community, from North America and Europe. A first draft of the TEI Guidelines, based on SGML, was released in 1990. After the W3C finalized XML in 1998, the TEI Guidelines were updated to support XML as well.

The purpose of encoding a text goes beyond the simple goal of enabling someone to read it. As mentioned in our introduction, text encoding provides information about a text that allows a computer program to perform functions on that text.[116] For example, Susan Hockey points out that, without explicit instruction, a computer cannot distinguish between the personal pronoun I and the Roman numeral I.[117] Without semantic markup, therefore, only very simple searches can be performed on texts, as with those in the Latin Library. TEI provides a standard for text encoding that facilitates document exchange within a scholarly community. The goals of the TEI scheme for encoding electronic texts are the following: to provide a standard format for data exchange; to provide guidance for encoding texts in TEI format; to support the encoding of all kinds of features of all kinds of texts studied by researchers; and to be application independent.[118]
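As a hedged illustration of Hockey's point, the following Python sketch contrasts a plain string search with a markup-aware one. The sample sentence is invented for this example, and the choice of the TEI <num> element (with a value attribute) to mark the Roman numeral is our assumption about how an encoder might tag it, not something taken from our texts.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one pronoun "I" and one Roman numeral "I".
SAMPLE = '<p>I read Book <num value="1">I</num> of the Colloquia.</p>'

root = ET.fromstring(SAMPLE)

# A plain-text search sees two identical tokens and cannot tell them apart.
plain_text = "".join(root.itertext())
plain_hits = re.findall(r"\bI\b", plain_text)

# A markup-aware search can ask only for the tokens tagged as numbers.
numeral_hits = [num.text for num in root.iter("num")]

print(len(plain_hits))  # every standalone "I", whatever it means
print(numeral_hits)     # only the "I" that was marked up as a number
```

The plain search can only count occurrences of the letter, while the second query relies entirely on the encoder having made the distinction explicit.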

To meet the needs of the vast area of study called "the humanities," the TEI DTD provides genre-specific base tag sets that encoders invoke when marking up particular kinds of texts: prose, verse, drama, transcriptions of speech, print dictionaries and terminological databases. Because Erasmus' Colloquia is a collection of dialogues with varying numbers of interlocutors, we chose the drama base tags to mark up our texts. These allow not only for common elements, such as the explicit division of the text's hierarchy via <div> tags and the demarcation of content in <p> tags, but also for a declaration of the "cast" at the beginning of each dialogue, enclosed in <castList><castItem></castItem></castList> tags. Within the dialogue, each speech is enclosed in <sp></sp> tags, which contain the speaker's name in <speaker></speaker> tags followed by the words spoken in <p> tags. An explicitly declared cast list looks like this: <castList><castItem>Paedagogus</castItem><castItem>Puer</castItem></castList>; and within the dialogue itself a complete speech, tags plus content, looks like this: <sp><speaker>Puer</speaker><p>Numquid aliud vis?</p></sp>.
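To show what a program can do with this markup, here is a minimal Python sketch that reads the cast list and the speech quoted above using the standard library's ElementTree parser. The enclosing <div> is an assumption added only so that the fragment parses as a single document; it is not a claim about where the TEI DTD permits <castList>.

```python
import xml.etree.ElementTree as ET

# Fragment assembled from the cast list and speech quoted in the text.
# The <div> wrapper is an assumption added so the fragment has one root.
DIALOGUE = """
<div>
  <castList>
    <castItem>Paedagogus</castItem>
    <castItem>Puer</castItem>
  </castList>
  <sp><speaker>Puer</speaker><p>Numquid aliud vis?</p></sp>
</div>
"""

root = ET.fromstring(DIALOGUE)

# The declared cast, in document order.
cast = [item.text for item in root.iter("castItem")]

# Each <sp> pairs a <speaker> label with one or more <p> of spoken text.
speeches = [
    (sp.findtext("speaker"), [p.text for p in sp.findall("p")])
    for sp in root.iter("sp")
]

print(cast)
print(speeches)
```

Because the speaker label and the spoken words sit in separate elements, the program never has to guess where a name ends and a speech begins.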

Another important part of a TEI/XML document is the TEI header. While tags such as <castItem> and <speaker> describe elements within the text, the TEI header describes the text as a whole, making explicit its author, its source, its encoding and its revision history. According to Susan Hockey, one problem with digital documents retrieved on the Web is that it is hard to ascertain who created them, what transcription policies their creators followed, what exact sources were used, and whether any usage or copyright statements are associated with the original text.[119] The relationship of the TEI header to the document is akin to that between a surrogate record stored in a library catalog and the information resource it describes: the header contains a subset of meaningful information about the resource that can facilitate the discovery of and access to the resource itself.

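Whether a text is being converted to digital format or is "born digital," the TEI header remains embedded in the document itself and requires the author to make all of these aspects explicit through four main parts:[120] a file description (<fileDesc>), an encoding description (<encodingDesc>), a text profile (<profileDesc>) and a revision history (<revisionDesc>).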
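To make the shape of such a header concrete, here is a minimal Python sketch that checks a header for the four components named in the TEI Guidelines: <fileDesc>, <encodingDesc>, <profileDesc> and <revisionDesc>. The header content is invented placeholder text, not the header of our edition, and it has not been validated against any TEI DTD; the helper name missing_parts is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# The four top-level components of a TEI header.
HEADER_PARTS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]

# Placeholder header: every value here is an illustrative assumption.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample dialogue</title></titleStmt>
    <publicationStmt><p>Unpublished sample.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>Speeches are tagged with sp and speaker.</p></encodingDesc>
  <revisionDesc><change>First draft.</change></revisionDesc>
</teiHeader>
"""

def missing_parts(header_xml):
    """Return the header components that are absent, in canonical order."""
    header = ET.fromstring(header_xml)
    present = {child.tag for child in header}
    return [part for part in HEADER_PARTS if part not in present]

print(missing_parts(HEADER))  # the sample deliberately omits one part
```

A check of this kind is no substitute for validating against the DTD, but it shows how the header's fixed structure lets software confirm that source and revision information has actually been recorded.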

Although time has not permitted us to tag the document itself to the level of granularity we would ultimately like, we hope that this module demonstrates the potential for XML, applied via a community-specific DTD, to create electronic documents that are useful to scholars as more than texts to read. By packing the <teiHeader> section of a document with meaningful creation, source, editorial and revision information, and by explicitly declaring elements within the document itself, the author turns the text into a resource upon which further analysis can be performed. XML essentially allows an author to transform a text into a database within which a computer can sort, filter and organize information according to whatever view a researcher chooses.
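The database analogy can be sketched in a few lines of Python. In the sample below only the first speech is quoted from our text; the other two speeches are invented for illustration, and the record layout (a dictionary with speaker and text fields) is simply one assumed way of viewing the data.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Sample dialogue: only the first speech comes from our text,
# the other two are invented for this illustration.
DIALOGUE = """
<div>
  <sp><speaker>Puer</speaker><p>Numquid aliud vis?</p></sp>
  <sp><speaker>Paedagogus</speaker><p>Nihil.</p></sp>
  <sp><speaker>Puer</speaker><p>Vale.</p></sp>
</div>
"""

root = ET.fromstring(DIALOGUE)

# Treat each <sp> as a record with a speaker field and a text field.
records = [
    {
        "speaker": sp.findtext("speaker"),
        "text": " ".join(p.text for p in sp.findall("p")),
    }
    for sp in root.iter("sp")
]

# Filter: every speech by one speaker.
puer_lines = [r["text"] for r in records if r["speaker"] == "Puer"]

# Organize: speakers ordered by how many speeches they have.
speech_counts = Counter(r["speaker"] for r in records).most_common()

print(puer_lines)
print(speech_counts)
```

The same records could just as easily be sorted by length or grouped by dialogue, which is the sense in which the encoded text behaves like a database.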




Date: last revised 2003-12-18 Author: Jennifer K. Nelson.
This page is covered by a Creative Commons ShareAlike license.