How to Mark Up a Text

Last changed: April 11, 2011    

Marking Up a Text


This document is intended for Stoa collaborators, Perseus editors, and Perseus programmers. Readers should be familiar with the basic ideas of SGML. The Stoa's Introduction to Structured Markup and the Text Encoding Initiative's Gentle Introduction to SGML supply useful background information, knowledge of which will be assumed in what follows. The TEI Guidelines for Electronic Text Encoding and Interchange are the basis for the markup schema described here.

Writers and editors creating a new text in electronic format will not be concerned with some of the more technical information in this document. Programmers dealing with raw output from optical scans or data entry, on the other hand, should read the hints about automatic processing. Editors doing more detailed markup on a text that has already been turned into correct SGML will have requirements intermediate between those of the other groups. Editors with little experience in SGML should experiment with simple files first; on a first reading, ignore anything that appears too complicated.


    Marking the structure
    Marking the language
    Marking names and dates
    Marking quotations and references
    Optional features
Creating the TEI Header
    General information
    Source information
    Structure information
    Language information
    Audit information
Details and Special Features
    General considerations
    Footnotes and other annotations
    Apparatus critici and textual notes
    Figures and diagrams
    Metrical schemata
    Dictionaries and lexica
Validating the Markup
    Syntactic validation
    What happens next?

Marking the structure

SGML markup can encode many features of a text. The most important of these is the structure of the text: how its parts fit together. In the Perseus DTDs, structure is represented with numbered divs and with milestones. The largest structural element is always <div1>. The canonical citation scheme for a classical text is always encoded somehow, either with lower-level (higher-numbered) <div>s or with <milestone>s. Which one to choose depends on which scheme naturally fits the text; see the following examples.

Some texts divide very neatly into pieces. Virgil's Eclogues, for example, can be structured as ten separate <div1>s. Pindar's Epinicians might have a <div1> for each of the four books, followed by a <div2> for each poem, like this:

<div1 type="book" n="Olympian Odes">
<div2 type="poem" n=1>
<l>a)/riston me\n u(/dwr,...
<div2 type="poem" n=2>
<l>a)nacifo/rmigges u(/mnoi,...

As a rule, any text that has books should have a <div>-level for books. Individual poems in a collection are also natural choices for <div>s. Chapters in a prose text are frequently <div>s, too. Conceptually, the <div> levels are the levels that would appear in an outline of the text; there can be as many as are necessary to make the structure clear, up to <div7>. If a text naturally divides into books, parts, chapters, sections, subsections, and named paragraphs, then use six <div> levels, from <div1> for the books down to <div6> for the named paragraphs.

Sometimes the canonical citation scheme does not fit the apparent structure of the text. The obvious example is Stephanus pages for Plato's dialogues. In this case, we cite the text based on how many words fit on a printed page in a particular edition. This is a natural use for <milestone>s. <Div> tags enclose a section of text: a <div1> contains some <div2>s, which in turn contain <div3>s, and so on until the lowest level in use, which contains paragraphs. <Milestone> tags, on the other hand, simply mark a place; they do not come in pairs (there is no </milestone> tag). As a result, <milestone> tags can be inserted anywhere they are needed. They can even, if necessary, describe a different structure from the one represented by the <div> tags.

For example, an editor might choose to divide Plato's Apology into sections, represented by <div1>s (the entire speech being the <text> itself, of course). These <div1>s correspond to logical sense-units in the text, according to this editor's view. They will generally not coincide with the Stephanus pages, so Plato's Apology will have two structures: the editor's and the printer's. It might look like this:

<text type=speech>
<div1 type=section n=1>
<milestone ed=Stephanus unit=page n=17>
<p>o(/ti me\n u(mei=s, w)= a)/ndres *)aqhnai=oi,... th=| fwnh=| te kai\ tw=| tro/pw|
<milestone ed=Stephanus unit=page n=18>
e)/legon e)n oi(=sper e)teqra/mmhn,...

Here the two structures are so different that the <milestone> actually appears inside a sentence of the Greek text. This is a fairly extreme example, of course, because the second structure is based on the appearance of the text in a particular edition, not on any logical features within the text.

Whichever structural scheme or schemes you choose will be declared in the <refsDecl> section of the TEI Header; see below.

Marking the language

Marking the language of a text or passage allows language-specific processing. For texts integrated with Perseus, this includes morphological analysis. The main language of the text is the first language to appear in the <langUsage> section of the TEI Header. This language can also be declared with the lang attribute on the opening <text> tag.

Within the text, every quotation, phrase, or other passage in a different language should be marked <quote lang=XX> or <foreign lang=XX>, where XX is the language identifier from the langUsage header section. Use the <quote> element for quotations, for example if Cicero quotes Euripides; use <foreign> for other foreign expressions, for example the Greek phrases in Cicero's letters. There is a standard set of language identifiers, described below in the discussion of the langUsage header section.

English, Latin, French, German, and Italian are written in the Roman alphabet. Standard entities like &agrave; or &uuml; (for à and ü respectively) are available to encode letters with diacriticals. Greek is written in Beta-code, as defined by the Thesaurus Linguae Graecae Project. Note, however, that Perseus and Stoa texts should not use the "escape sequences" for punctuation, formatting, and other features; the only Beta-code characters recognized by standard Perseus code are those that represent letters and diacriticals.
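The restriction on Beta-code escape sequences can be checked mechanically before a file is submitted. The following is a minimal sketch in Python, not part of the Perseus tools: the function name `find_beta_escapes` is hypothetical, and the set of flagged characters is an assumption based on the characters that introduce escapes in TLG Beta-code. It should be run only on the content of Greek passages, since `&` also begins SGML entities in the surrounding markup.

```python
import re

# Assumption: these characters introduce TLG Beta-code escapes for
# fonts, punctuation, brackets, quotation marks, and page formatting.
# Letters, the asterisk for capitals, and diacriticals are not flagged.
BETA_ESCAPES = re.compile(r'[$&%#@^{}\[\]"]')


def find_beta_escapes(greek_text):
    """Return (offset, character) pairs for suspected escape characters."""
    return [(m.start(), m.group()) for m in BETA_ESCAPES.finditer(greek_text)]


if __name__ == "__main__":
    print(find_beta_escapes("a)/ndres *)aqhnai=oi"))
    print(find_beta_escapes("$mh=nin &abc"))
```

An empty result means the passage uses none of the flagged characters; any hits should be reviewed by hand.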

Marking names and dates

Although it is not required, it is often useful to tag names of persons and places that appear in the text. It is also useful to mark dates. Both kinds of markup make it possible to extract information from the text, answering questions like: whom does Cicero mention in letters of 63 BC? How often does Cicero refer to the year of his consulate in the Philippics? Do characters in Plautus often refer to the city where the play is supposed to be set? And so on.

Names can most easily be marked with the <name> element, like this:

<name type=person key=Cic>Cicero</name>
<name type=person key=Cic>Tully</name>
<name type=person key=Cic>the senior consul of 63</name>
<name type=place key=Col>the Colosseum</name>
<name type=place key=Col>the Flavian Amphitheater</name>

The key attribute can be used to tie together different ways of referring to the same person or place. Other ways of marking names are also acceptable: the <rs> element or the <persName> and <placeName> elements.
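To illustrate what the key attribute makes possible, here is a rough Python sketch that collects every way a text refers to the same person or place. It is an illustration only: the function `names_by_key` is hypothetical, and the regular expression assumes that type precedes key, as in the examples above, and that <name> elements are not nested.

```python
import re
from collections import defaultdict

# Assumption: attributes appear in the order type, key; the values may
# be quoted or unquoted, as SGML allows.
NAME_RE = re.compile(
    r'<name\s+type="?(\w+)"?\s+key="?(\w+)"?\s*>(.*?)</name>',
    re.IGNORECASE | re.DOTALL,
)


def names_by_key(sgml):
    """Map each (type, key) pair to the list of strings tagged with it."""
    groups = defaultdict(list)
    for name_type, key, content in NAME_RE.findall(sgml):
        groups[(name_type, key)].append(content)
    return dict(groups)


if __name__ == "__main__":
    sample = (
        "<name type=person key=Cic>Cicero</name>, also called "
        "<name type=person key=Cic>Tully</name>, never saw "
        "<name type=place key=Col>the Colosseum</name>."
    )
    for group, forms in names_by_key(sample).items():
        print(group, forms)
```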

Dates are easy: just use the <date> element for a single date, and <dateRange> for a range. The value attribute, in the form year-month-day, gives the date in a standard, machine-readable form. The year is positive for dates AD or CE, negative for BC or BCE. Here are some examples:

<date value=-63>63 BC</date>
<date value=-63>M. Tullio Cicerone C. Antonio consulibus</date>
<date value=-63>the year of the notorious Catilinarian conspiracy</date>
<date value=69>AD 69</date>
<date value=69>69 CE</date>
<date value=69>the so-called Year of Four Emperors</date>
<date value=1999-10-31>Halloween 1999</date>
<date value=1999-10>October 1999</date>
<dateRange from=-106 to=-43>Cicero's lifetime</dateRange>
<dateRange from=1999-10-24 to=1999-10-30>the last week in October</dateRange>

Simple dates can frequently be identified and roughly tagged by a program.

Marking quotations and references

Ancient authors frequently quote or allude to even more ancient authors. Modern commentators frequently note parallels to a passage in other works. In each case, it is convenient to mark the quoted material. The <quote> element contains an actual quotation. Use the lang attribute, as noted above, if the quotation is in a different language from the surrounding text.

The <bibl> element marks bibliographic information about a work being cited. It can contain <title>, <author>, and other fields, as appropriate for the information given in the text. The <bibl> and <quote> elements can be combined into a <cit> element provided there is no other extraneous information between them. For markup of existing text, this is not always convenient; for newly-written text, it is usually possible.

Here are some examples:

In <bibl n="Catul. 1">Catullus 1</bibl>, the word refers to the appearance of the book, cf. <cit><bibl n="Pl. Ps. 27">Pl. Ps. 27</bibl> <quote lang="la">lepidis litteris, lepidis tabellis, lepida conscripta manu</quote></cit>
The most famous line in <bibl>Merrill's commentary on Catullus</bibl> must be this one, on <bibl n="Catul. 32">poem 32</bibl>: <quote>Contents, execrable. Date, undeterminable. Meter, Phalaecean.</quote>

Within the <bibl> element, the n attribute gives the canonical citation for the text being referred to, if it is an ancient text. These should use the standard Perseus abbreviations, where they exist. The n attribute should be coded whether or not the text currently exists in the Perseus corpus; subsequent processing creates links for available texts and ignores unavailable texts. Currently there is no standard method for assigning Perseus abbreviations (and the other associated codes) to new texts, or to referenced texts that do not appear in the standard abbreviation lists. For Stoa documents, if the referenced text is available in a known place on the Internet, the site and the text can be added to the Stoa lookup database.

For references to modern or secondary sources, the <bibl> element can be used, as for ancient sources, although there are no standard citations. It is also convenient to collect references to secondary sources into a <listBibl> table. For example, references within the text might look like this:

As Goodwin argues (<bibl><title>Moods and Tenses</title>, 62.4</bibl>), this relative clause is equivalent to a protasis.

A table of bibliography references might look like this:

<listBibl>
<head>Works Cited</head>
<bibl id=MT>Goodwin, <title>Syntax of the Moods and Tenses of the Greek Verb</title></bibl>
<bibl id=OCD>Hornblower and Spawforth, eds., <title>Oxford Classical Dictionary</title></bibl>
<bibl id=Barrett>Barrett, <title>Euripides' Hippolytus</title></bibl>
</listBibl>

The id attributes can be used as the target of <ref> links from elsewhere in the document:

As Goodwin argues (<ref target=MT><title>Moods and Tenses</title>, 62.4</ref>), this relative clause is equivalent to a protasis.

The difference between this example and the previous one is that the <ref> allows subsequent processing to create a hyperlink from the reference to the full citation in the <listBibl>.
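Because a <ref> is only useful if its target exists, it can be worth checking that every target matches the id of some <bibl>. The sketch below is one possible way to do this in Python; the function name `unresolved_targets` is hypothetical, and the regular expressions are a rough stand-in for a real SGML parser.

```python
import re

# Assumption: id and target values are simple names, quoted or unquoted.
ID_RE = re.compile(r'<bibl\s[^>]*\bid\s*=\s*"?([\w.-]+)"?', re.IGNORECASE)
TARGET_RE = re.compile(r'<ref\s[^>]*\btarget\s*=\s*"?([\w.-]+)"?', re.IGNORECASE)


def unresolved_targets(sgml):
    """Return the sorted <ref> targets that match no <bibl> id."""
    ids = set(ID_RE.findall(sgml))
    return sorted({t for t in TARGET_RE.findall(sgml) if t not in ids})


if __name__ == "__main__":
    doc = (
        "<bibl id=MT>Goodwin, <title>Moods and Tenses</title></bibl> "
        "As Goodwin argues (<ref target=MT>62.4</ref>); "
        "see also <ref target=LSJ>s.v.</ref>"
    )
    print(unresolved_targets(doc))
```

In this sample the target MT resolves and LSJ does not, so only LSJ is reported.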

Optional features

Anything in the text that might be useful can be marked. For example, the met attribute can be used on the <div> for a poem to give a rough idea of its meter. Strophes, antistrophes, epodes, and mesodes can be tagged as separate <div>s or as line-groups. Within a play, scenes, episodes, stichomythia, stasima, parabases, agones, and so on can be tagged, perhaps as <div>s. References to manuscripts, in a textual commentary, might be marked as <name> or <rs>, perhaps with a type=MS attribute. Technical terms can be marked <term>. Emphatic words can be marked <emph>. There are as many possibilities as the editor's patience and ambition allow.

On the other hand, it is not always necessary to tag everything. When standard typographical conventions are adequately clear, it may not be worth the effort to add extremely precise tags. For example, some words that appear within quotation marks are quotations and get tagged as <quote>. Other quoted words are not quotations, but are mentioned terms (<mention>), sarcasm or unusual slang (<soCalled>), terms being defined (<term>), glosses or definitions (<gloss>), or distinct in some other way (<distinct>). It is generally not useful to mark all of these cases.

A general guideline is to mark features that can be analysed, displayed, or searched by subsequent processing. If you want to provide a timeline for a text, you will need to mark dates. If you want to analyze different characters' diction, you will need to mark their speeches. If you will be studying the history of the text, you will need to mark references to manuscripts. But you do not need to mark any of these features if you have no use for them.

Another guideline is to follow current practice in similar texts. Where previous editors have chosen to mark a particular feature, you will probably wish to do the same; if previous editors have not tagged the feature, you can also safely not tag it. For example, suppose you are editing a newly-discovered manuscript of a long-lost play by Sophocles. If existing texts of Greek tragedy mark anapestic systems as <div2 type=anapests>, you should mark anapestic systems similarly. If existing texts do not mark stichomythia, however, then you need not mark it either. You might, however, if you expect to study how the stichomythia in this hypothetical new play differs from those in the other seven plays.

Creating the TEI Header

General information

The TEI Header appears at the beginning of each SGML source file. It indicates which specific DTD is in use, what options are required, the general characteristics of the text, the source of the text, and the choices the editor made in marking it up. Headers of different files are generally more similar than different; it is a reasonable practice to copy the header from an existing file and make the few necessary changes to create a new one.

The header begins with the lines that declare the specific DTD in use for the file. For Perseus-style files containing classical literature, the choices are PersProse (prose texts and commentaries), PersVerse (lyric or epic verse), or PersDrama (drama). Most modern secondary sources also use the PersProse DTD. There are several more specialized DTDs for unusual texts, including catalog entries; in general, the best way to select the correct DTD for a particular text is to use the same one used by another similar text.

The DTD declaration is followed by the document root element, always <tei.2>, and then by the <teiheader> element. After the header comes the <text>, structured as described above.

Within the header, certain elements must appear and others are optional; those that appear have a prescribed order. The <filedesc>, or file description, comes first. It includes a <titlestmt> giving the title of the work and a <sourcedesc> giving basic information about the original text, if this is an electronic version of an existing text. Texts belonging to Perseus always include the standard Perseus &responsibility; and &Perseus.publish; entities. Texts published with the Stoa will include a standard entity to indicate this. These are similar to the publication information found in a printed text.

Source information

The source information in the <sourcedesc> section of the <filedesc> should describe the original text, not the electronic edition unless that is the original. If the text was scanned from a book, include the ISBN and OCLC numbers for the book. If the text was originally written as an electronic text, say so. Here are some examples:

Text taken from the Teubner edition of Suetonius:
<bibl><idno type=ISBN n=3519018276></bibl>

The source description for the present document:
<bibl>Text created in electronic format</bibl>

Structure information

The <encodingdesc> section contains information about how the SGML markup was done. Most important for our purposes is the <refsDecl> section, which describes how the <div>s and <milestone>s work together to describe the structure or structures of the text. There can be several different <refsDecl> elements if there are several structures in use, as will be the case when there are both <div>s and <milestone>s. There may also be different structures for different parts of the text, for example for an introductory essay and a commentary.

Within a <refsDecl> element, <step> or <state> elements define the individual components of the citation scheme for the text. For example, a text divided into books, chapters, and sections, represented respectively by <div1>, <div2>, and <div3>, will have three <step>s, like this:

<step refunit="book" delim="." from="DESCENDANT (1 DIV1 N %1)">
<step refunit="chapter" delim="." from="DESCENDANT (1 DIV2 N %2)">
<step refunit="section" from="DESCENDANT (1 DIV3 N %3)">

This indicates that a reference of the form "11.10.12" refers to the position in the text within <div1 n=11>, <div2 n=10>, and <div3 n=12>. The delim in the <step> definition indicates what character separates the part being defined from the next one. In this example, both delimiters are periods. They need not both be the same; consider one standard reference scheme for the Bible:

<step refunit="book" delim=" " from="DESCENDANT (1 DIV1 N %1)">
<step refunit="chapter" delim=":" from="DESCENDANT (1 DIV2 N %2)">
<step refunit="verse" from="DESCENDANT (1 DIV3 N %3)">

This specification indicates that references might look like "Exod. 1:2": book, followed by space, chapter, followed by colon, and verse. Moreover, books are represented by <div1> elements, chapters by <div2>s, and verses by <div3>s.
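The way the delim values carve up a reference can be sketched in a few lines of Python. This is an illustration of the idea only, not the Perseus text-processing code: the function `split_reference` and the list-of-pairs representation of the <step> elements are assumptions made for the example.

```python
def split_reference(reference, steps):
    """Split a citation into named parts.

    steps is a list of (refunit, delim) pairs in order; the last step
    has no delimiter, so its delim is None.
    """
    parts = {}
    rest = reference
    for refunit, delim in steps:
        if delim is None:
            # Final step: everything that remains belongs to it.
            parts[refunit] = rest
            rest = ""
        else:
            value, separator, rest = rest.partition(delim)
            if not separator:
                raise ValueError(f"missing {delim!r} after {refunit}")
            parts[refunit] = value
    return parts


# The two schemes described above, written as (refunit, delim) pairs.
BOOK_CHAPTER_SECTION = [("book", "."), ("chapter", "."), ("section", None)]
BIBLE = [("book", " "), ("chapter", ":"), ("verse", None)]

if __name__ == "__main__":
    print(split_reference("11.10.12", BOOK_CHAPTER_SECTION))
    print(split_reference("Exod. 1:2", BIBLE))
```

Splitting at the first occurrence of each delimiter is a simplification: a book name that itself contains a space would need special handling under the second scheme.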

The from attribute is encoded in TEI extended pointer syntax. The Perseus text-processing system does not currently parse this attribute, so it is not essential that it be semantically correct. Follow the examples given here; refer to the TEI Guidelines, section 5.3.5, for additional information.

<State> elements usually correspond to <milestone> markers within the text. The <state> has a unit attribute, which is similar to the refunit attribute of <step>; this is the name of the kind of unit the text is divided into (book, chapter, section, verse, or whatever is appropriate). The <state> element also has an attribute called ed, intended to indicate which edition this reference scheme is drawn from.

When there is more than one structure in use, code one <refsDecl> for each one. For example, the Plato text described above, with editorial sections encoded by <div>s and Stephanus pages encoded by <milestone>s, might have the following <refsDecl>s:

<refsDecl>
<step refunit="speech" delim="." from="DESCENDANT (1 TEXT)">
<step refunit="section" from="DESCENDANT (1 DIV1 N %3)">
<refsDecl>
<state ed="Stephanus" unit="page">

When different parts of the text have different structures, for example an introduction divided into sections and a commentary divided into commLines, use the n attribute on the <refsDecl> elements to tell them apart, like this:

<refsDecl n="text=comm">
<step refunit="text" from="DESCENDANT (1 TEXT)">
<step refunit=commLine n=chunk from="DESCENDANT (1 DIV3 N %1)">
<refsDecl n="text=intro">
<step refunit="text" from="DESCENDANT (1 TEXT)">
<step refunit="section" n=chunk from="DESCENDANT (1 DIV2 N %1)">

Finally, the n=chunk attribute on some of the <step> elements is a signal to the Perseus text-processing system that the text should, by default, be displayed to the user in these units. That is, for the text given above, one commLine of the commentary, or one section of the introduction, will be on the screen at a time.

Language information

The <langusage> section of the <profiledesc> indicates what languages are in use in the text. Each language is represented by a standard code, which denotes not only the language but also the character set it uses (strictly, the Writing System Declaration). Perseus utilities assume the first language in the <langusage> is the default language for any text that does not have a lang attribute.

The following are the most common language codes:

<language id=en>English
<language id=la>Latin
<language id=greek>Greek (in Beta-code)
<language id=xgreek>transliterated Greek (in Roman characters)
<language id=it>Italian
<language id=de>German
<language id=fr>French
<language id=sp>Spanish

Within the text, every quotation or foreign phrase must be marked with its language. Strictly, it is not necessary to mark the language of a quotation if it is in the same language as the surrounding text, but it is usually easier to do so. The lang attribute is available on all elements; it is most often used on <text>, <quote>, <lemma>, and <foreign>, as in the following examples:

That the parallels are not purely imaginary strikes the eye from this paragraph from a respected and sober general work: <quote lang=en>`Here we already find the essential themes which will characterize Augustine's thought throughout his career.'</quote>
Compare <bibl n="Psalm 146.5">Ps. 146.5</bibl>, containing the similar phrase <quote lang=la>`magnus dominus noster et magna virtus eius et sapientiae eius non est numerus'</quote>
If the expressions of <bibl n="Catul. 4">c. 4</bibl> were to be taken literally, we must understand that the <foreign lang=la>phasellus</foreign> carried its master actually up the Po and the little Mincius into the Garda-lake, even to the shores of Sirmio itself.

Audit information

We use the <revisionDesc> section to keep track of what has changed in a text file. When the text is managed by a version control system, one change entry should contain that system's log information; this is generally done by inserting a keyword into the file. If the file is not managed by such a system, editors must remember to add change entries themselves.

A change entry looks like this:

  <date>29-Oct-99</date><respStmt><name>Anne Mahoney</name><resp></resp></respStmt>
  <item>Initial writing</item>

Details and Special Features

General considerations

Among those we generally deal with, the simplest kind of text to mark up is an original ancient work. Its structure is usually clear, and there is usually a canonical reference scheme based on books, sections, poems, or the like. In such a text, it is sufficient to mark the language and the structure.

Other features that can be marked include quotations of other works (e.g., quotations of poetry by characters in Plato), dates, and place names. These are generally not marked in the Greek and Latin texts in Perseus, partly because it can be difficult to mark them automatically. Quotations written in Greek in a Latin text are easy, of course. Quotations that are marked with quotation marks, spaced type, or a distinct layout can also be tagged, if those typographical features have been retained in the on-line copy of the text. The markup editor will probably have to supply the reference (the n attribute on the <bibl> element) by hand, however.

Dates in modern texts often have standard forms, like "April 18, 1775" or "4 July 1776". It is not difficult to write a program to identify and tag such dates. Dates in ancient texts may look more like "M. Tullio Cicerone C. Antonio coss.", referring to the consuls, eponymous archon, or other magistrate or priest who names the year. While it is possible to identify phrases of the appropriate form, it is harder to supply the value attribute for the <date> elements.
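As a rough illustration of how such a program might handle the two modern forms just mentioned, here is a minimal Python sketch. The function name `tag_dates` is hypothetical, only English month names are recognized, and no attempt is made at ancient dates or at skipping text that is already tagged.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}
MONTH = "|".join(MONTHS)

# Two alternatives: "April 18, 1775" and "4 July 1776".
DATE_RE = re.compile(
    rf"\b(?:(?P<m1>{MONTH}) (?P<d1>\d{{1,2}}), (?P<y1>\d{{4}})"
    rf"|(?P<d2>\d{{1,2}}) (?P<m2>{MONTH}) (?P<y2>\d{{4}}))\b"
)


def tag_dates(text):
    """Wrap recognized dates in <date> with a year-month-day value."""

    def replace(match):
        month = match.group("m1") or match.group("m2")
        day = match.group("d1") or match.group("d2")
        year = match.group("y1") or match.group("y2")
        value = f"{int(year):04d}-{MONTHS[month]:02d}-{int(day):02d}"
        return f"<date value={value}>{match.group(0)}</date>"

    return DATE_RE.sub(replace, text)


if __name__ == "__main__":
    print(tag_dates("Between April 18, 1775 and 4 July 1776 much changed."))
```

The output of a tool like this is only a first pass and still needs review by an editor.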

Place names in both ancient and modern texts are best identified by reference to an authority list. If the string "Ostia" appears in the authority list, the string in the text is tagged as a place name; this is a straightforward pattern-matching program. The authority list can contain multiple representations (e.g. "Ostia" in English, "Ostie" in French) and specify a key attribute to be given to all of them. Programmers writing code to identify place names should use one of the standard authority lists for their project.
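The pattern-matching approach can be sketched as follows in Python. The authority list shown here is invented for the example and is not one of the standard lists; the function name `tag_places` is likewise hypothetical. The sketch assumes that keys are valid unquoted attribute values and that the text does not already contain <name> tags.

```python
import re

# Hypothetical authority list: surface form -> key shared by all forms.
AUTHORITY = {
    "Ostia": "Ostia",
    "Ostie": "Ostia",
    "Rome": "Roma",
    "Roma": "Roma",
}


def tag_places(text, authority):
    """Tag every whole-word occurrence of a listed name as a place."""
    if not authority:
        return text
    # Try longer names first so that a long form wins over its prefix.
    names = sorted(authority, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def replace(match):
        name = match.group(1)
        return f"<name type=place key={authority[name]}>{name}</name>"

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(tag_places("The road from Rome to Ostia", AUTHORITY))
    print(tag_places("la route de Rome vers Ostie", AUTHORITY))
```

Both the English and the French sentence receive key=Ostia for the port, which is what allows later processing to treat them as the same place.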

Verse quoted within prose texts should be tagged as line groups, <lg>, which in turn contain lines, <l>. A quotation of stichic verse or a portion of a stanza needs no more structure; a quotation of several stanzas can be divided into several line groups. A single-line quotation does not need to be enclosed in a line group, however.

When the original source file has been generated from a printed text, whether by optical scanning or by keyboarding, the typographical conventions of the print edition are generally preserved fairly well. These are a guide to the structure of the text, although it is not always clear how a program can disambiguate, say, various uses of italics (for titles, for phrases in other languages that use the Roman alphabet, for emphasis). One fairly widespread convention is the use of boldface for lemmata in commentaries; this can often be recognized automatically, especially when boldface is not used for anything else in the text. Another convention is the use of spaced type for emphasis, in Greek or German passages; this can sometimes be recognized automatically, and can be coded as <emph>. When a text is typed in, as opposed to being scanned, the data entry operators will follow a standard set of rules for encoding unusual features (including Greek letters, unreadable characters, page breaks, and so on); if there are special rules for the text you are working on, you should have a copy of them.


A commentary is a "meta-text", in that it is of little value by itself but is generally read in connection with the text it comments on (its "base text"). The commentary usually quotes passages from the base text; when such a quotation occurs at the head of a block of notes, it is called a lemma. Normally, the commentary annotates passages from the base text in the order in which they occur, one after another from the beginning of the work to the end. Commentaries are almost invariably written in prose, so they are marked up as prose texts, even if the base text is in verse.

The divisions of a commentary will often match those of the text it comments on, so that the same citation scheme can be used for both. This structural correspondence makes it possible to match a part of the commentary with the part of the base text it comments on. For example, the text of Catullus might be divided into poems (represented by <div1> elements) and lines (represented by <l> elements). A commentary on this edition of Catullus will also be divided into poems and lines. Then a reference to "5.1" means the first line of poem 5 in the text, and the comments on the first line of poem 5 in the commentary. In the commentary, the lines may be represented by actual divisions (<div2>, if the poems themselves are <div1>), since they are not actually lines of verse but paragraphs of text commenting on the verses of the base text.

The names of the divisions need not be the same in the text and the commentary, however. In particular, a commentary on a verse text will generally be organized by lines, or by lines within poems or scenes. Since the commentary may quote parallel passages of verse from other poets, it is not practical to mark the divisions of the commentary with type=line; instead, they should be commLine or the like. This is because the Perseus text-processing system will be confused by lines (of cited text) appearing within lines (of commentary). Use the following form:

The text:
<div1 type="poem" n="5" met="phalaecean">
<l>Vivamus, mea Lesbia, atque amemus,</l>
<l>rumoresque senum severiorum</l>
<l>omnes unius aestimemus assis.</l>
The commentary:
<div1 type="Poem" n="5" id=p5>
<p>To Lesbia; an exhortation to enjoy love and despise censure.
<div2 type="commLine" n="1" id=p5l1>
<p><lemma lang="la">vivamus</lemma>: the key-note of the whole poem is struck in the first word; with <foreign lang="la">vivere</foreign> in this pregnant sense, &lsquo;to enjoy life,&rsquo; cf. <cit><bibl n="Verg. Copa 38">Verg. Copa 38</bibl> <quote lang="la"><l>mors aurem vellens &lsquo;vivite&rsquo; ait &lsquo;venio&rsquo;</quote></cit>; <cit><bibl n="Mart. 1.15.12">Mart. 1.15.12</bibl> <quote lang="la"><l>sera nimis vita est crastina; vive hodie;</quote></cit> and the proverbial <foreign lang="la">dum vivimus, vivamus</foreign>.

If the <div2>s of the commentary were marked type=line, they would appear to be lines of verse just like the lines quoted from (pseudo-)Virgil and Martial within the commentary.

If the commentary also has an introduction, the introduction is most likely a <div1>, and the sections or chapters within the introduction are <div2>s (and smaller <div>s as required, of course). The introduction may also be tagged as a <text>, with the body of the commentary being another <text>, both within a <group>, like this:

<text n=intro>
text of the introduction here
<text n=comm>
text of the commentary proper here

The characteristic feature of a commentary is lemmata. These are tagged with the <lemma> element. When the text and the commentary are in the same language (as is the case for scholia, for Servius's commentary on Vergil, and the like), there is no need for a lang attribute. In the more usual case, the text will be in Latin or Greek and the commentary in English; then, the lang attribute on the <lemma> element ensures the language is recognized. Here are some examples:

<div1 type=poem n=1 id=Cp1>
<div2 type=commLine n=1 id=Cp1l1>
<p><lemma lang=la>dono</lemma>: The indicative present with future meaning indicates imminent decision.
<p><lemma lang=la>lepidum novum</lemma>: of the external rather than of the internal character of the book.

Note that the lemma is enclosed in a <p> element, and the text of the lemma does not have a <p> of its own. That is, the <p> tag comes before the <lemma> tag. The lemmata in these examples are quite short, but this will not always be the case; similarly, the comments shown here have been curtailed for purposes of the examples, while in actual commentaries notes on a particular passage may run to several pages.

When the lemma omits some words from the base text, the ellipsis can be marked with the &hellip; entity ("horizontal ellipsis") or simply with dots; the <gap> element is intended for textual lacunae, not for editorial omissions. For example,

<p><lemma lang="la">quidquid &hellip; qualecumque</lemma>: said with modest self-depreciation.

Here the commentator quotes two words that are not contiguous in the poem; the actual text reads Quare habe tibi quidquid hoc libelli, qualecumque. The ellipsis marks the place of the intervening words not included in the lemma.
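A program that links lemmata to the base text has to allow for such ellipses. One possible approach, sketched here in Python under the assumption that the lemma and the base text use the same spelling, is to split the lemma at each ellipsis and look for the pieces in order. The function name `lemma_matches` is hypothetical.

```python
import re


def lemma_matches(lemma, base_text):
    """Check that the pieces of a lemma occur, in order, in base_text.

    The lemma is split at each &hellip; entity or run of three dots.
    Matching ignores case.
    """
    pieces = [p.strip() for p in re.split(r"&hellip;|\.\.\.", lemma)]
    pieces = [p for p in pieces if p]
    haystack = base_text.lower()
    position = 0
    for piece in pieces:
        found = haystack.find(piece.lower(), position)
        if found < 0:
            return False
        position = found + len(piece)
    return bool(pieces)


if __name__ == "__main__":
    line = "Quare habe tibi quidquid hoc libelli, qualecumque"
    print(lemma_matches("quidquid &hellip; qualecumque", line))
    print(lemma_matches("qualecumque ... quidquid", line))
```

The first call succeeds because both words occur in that order; the second fails because the order is reversed.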

While lemmata are the most characteristic elements in a commentary, they are not the only ones. Frequently there will be introductions to individual poems, books, scenes, or whatever other natural divisions the text has. These introductory notes will be inside the structural unit of the commentary that represents or corresponds to that unit of the base text. For example, in a work with several books, each of which has chapters, there may be an introduction to book 1, an introduction to book 1 chapter 1, and then a series of lemmata commenting on particular phrases within that chapter. Here's how it would look:

<div1 type=book n=1 id=Cb1><head>Book One</head>
<p>The first book describes Augustine's childhood.
<div2 type=chapter n=1 id=Cb1c1>
<p>The opening words of the text are unusual. Most Latin prose works begin with a formal proem, but this does not.
<p><lemma lang=la>magnus es, domine</lemma>: A confession of praise.

A commentary on a Greek play might have introductory notes on the various episodes or lyrics, like this:

<div1 type=commentary>
<p>Scene: Colonus in Attica. The back-scene shows the sacred grove of the Eumenides.
<div2 type=episode id=C1>
<head>1-116: Prologue</head>
<p>Oedipus has sat down to rest, when a man of the place warns him that he is on holy ground.
<div3 type=commLine n=1 id=Cl1>
<p><lemma lang=greek>ge/rontos</lemma>: The action of this play is some years after that of the OT.

Within the commentary text, there will usually be references to the base text, to other works of the same author, and to other ancient works. All of these are tagged in the same way, as <bibl> elements (see above). It is particularly important to tag the references to the base text, and this can usually be automated. For example, the following might appear in a commentary on Catullus 1:

<p><lemma lang=la>lepidum novum</lemma>: of the external rather than of the internal character of the book; cf. <cit><bibl n="Catul. 22.6">Catul. 22.6</bibl> <quote lang=la>novi libri</quote></cit>

The original file may have said just "22.6"; the program creating the <bibl> elements must recognize that this is a reference to Catullus because it appears within a commentary on Catullus. St. Augustine's Confessions are a more complicated case. A commentary on this work may refer not only to other sections of the Confessions but to the City of God, the Soliloquies, and any of Augustine's dozens of other works. These works will be cited by title, without the author's name. The program creating <bibl> elements for this commentary must recognize all these as works of Augustine and create the appropriate n attribute. References to other authors are frequently easier to find, as the author's name will generally appear.
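
As a rough illustration of the simplest case, the sketch below wraps bare "poem.line" references in <bibl> elements, taking the author abbreviation from the context of the commentary. The function name tag_bare_refs and the regular expression are assumptions made for this example, not the program actually used for Perseus texts; a pattern this simple will also catch decimal numbers that are not references.

```python
import re

# A bare reference: digits, a dot, digits, not already inside a tag or attribute
# and not part of a longer dotted number such as 1.2.3.
BARE_REF = re.compile(r'(?<![\w.>])(\d+\.\d+)(?!\.\d|[\w<"])')

def tag_bare_refs(text: str, author_abbrev: str) -> str:
    """Wrap each bare reference in a <bibl> whose n attribute names the author."""
    def wrap(match):
        ref = match.group(1)
        return f'<bibl n="{author_abbrev} {ref}">{ref}</bibl>'
    return BARE_REF.sub(wrap, text)

print(tag_bare_refs("cf. 22.6 novi libri", "Catul."))
# cf. <bibl n="Catul. 22.6">22.6</bibl> novi libri
```

A commentary on Augustine would need more than this: a table mapping work titles to abbreviations, for instance, so that references cited by title alone receive the appropriate n attribute.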

Any program attempting to create <bibl> elements from references in the text will tag most of the references correctly, will mis-tag some of them, and will miss others altogether, so it will always be necessary to review this work by hand even after using an automated tool. References that are mis-tagged may include cases where the original text is inconsistent. For example, in a commentary on Sophocles, Electra without an author's name may by convention refer to the play by Sophocles. If it ever refers to the play by Euripides, however, confusion is inevitable. References that are omitted may include those that do not have closely associated titles. A reference like "Soph. OC 176, 510, 695" should be tagged as three separate <bibl> elements, but a simple program may well only mark the reference to line 176 and omit the other two.
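
The list case can be handled with a little extra work. The following sketch, under the assumption that the list is written as an abbreviated author and work followed by comma-separated numbers, produces one <bibl> per number; expand_list_ref is a hypothetical name, and the output still needs to be reviewed by hand.

```python
import re

# One or more capitalized (possibly abbreviated) words, then a list of numbers.
LIST_REF = re.compile(r"((?:[A-Z][A-Za-z]*\.? )+)(\d+(?:, \d+)+)")

def expand_list_ref(text: str) -> str:
    """Turn 'Soph. OC 176, 510, 695' into three separate <bibl> elements."""
    def expand(match):
        work = match.group(1).strip()
        numbers = match.group(2).split(", ")
        # The first element keeps the author and work in its displayed text.
        parts = [f'<bibl n="{work} {numbers[0]}">{work} {numbers[0]}</bibl>']
        # Later elements display only the number but carry the full n attribute.
        parts += [f'<bibl n="{work} {num}">{num}</bibl>' for num in numbers[1:]]
        return ", ".join(parts)
    return LIST_REF.sub(expand, text)

print(expand_list_ref("cf. Soph. OC 176, 510, 695"))
```

Any capitalized word followed by a list of numbers will match this pattern, so it will mis-tag some ordinary prose; that is exactly the kind of error the manual review is meant to catch.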

For references from the commentary to itself, it is convenient to use the <ref> element. Each section of the commentary (each low-level <div>) gets an id attribute, and these become the targets of the <ref>s. For example, here a note on Catullus 5.4 refers to a note on Catullus 3.11:

<div2 type=commLine n=4 id=p5l4>
<p>On the general conception see <ref target=p3l11>3.11 n.</ref>

The id attributes can be generated automatically based on the reference scheme for the text, for example "id=Cb1c2s3" for "Commentary, book 1, chapter 2, section 3", or "id=p1l3" for "poem 1, line 3". The references can also be generated automatically: "see on 1.2.3" can be turned into "see on <ref target=Cb1c2s3>1.2.3</ref>". Once this is done, the SGML parser will verify that targets actually exist. This validation is valuable for identifying incorrect reference numbers: if the text refers to 1.3.5, but there is no section 1.3.5 in the commentary, one of those figures has been mis-typed or mis-scanned and must be corrected.
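
A minimal sketch of this kind of automatic generation follows. The function names make_id and tag_self_refs are hypothetical, and the sketch assumes that internal references always take the form "see on" followed by dotted numbers.

```python
import re

def make_id(prefix: str, levels: str, numbers) -> str:
    """Build an id such as Cb1c2s3 from a prefix, level letters, and numbers."""
    # zip stops at the shorter sequence, so "1.2" with levels "bcs" gives Cb1c2.
    return prefix + "".join(letter + num for letter, num in zip(levels, numbers))

def tag_self_refs(text: str, prefix: str = "C", levels: str = "bcs") -> str:
    """Turn 'see on 1.2.3' into 'see on <ref target=Cb1c2s3>1.2.3</ref>'."""
    def wrap(match):
        citation = match.group(1)
        target = make_id(prefix, levels, citation.split("."))
        return f"see on <ref target={target}>{citation}</ref>"
    return re.sub(r"see on (\d+(?:\.\d+)*)", wrap, text)

print(make_id("", "pl", ["1", "3"]))   # p1l3
print(tag_self_refs("see on 1.2.3"))   # see on <ref target=Cb1c2s3>1.2.3</ref>
```

The sketch does not check that a target exists; that check is left to the SGML parser, as described above.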

For very large, complicated commentaries, it is convenient to be able to validate references to the base text in the same way. If the text and the commentary are in the same file, of course, this validation is automatic. More often, however, the text and commentary are to be processed independently, or are impractically large when taken together; the normal case, therefore, is for text and commentary to be in separate files. In such a case, the desired validation can be accomplished with some temporary code and some automatically-generated ids. First, create an additional <text> element in the commentary file and copy the <div> lines from the text into this space, like this:

<text lang=la>
<div1 type=book n=1>
<div2 type=chapter n=1>
<div3 type=section n=1>
<text lang=en>
<div1 type=book n=1 id=Cb1c1s1>

Then add ids to the empty dummy <div>s, using the same sort of code as you used to add ids to the <div>s in the commentary. It is convenient to use similar names, distinguished with a leading character ("T" for the text, "C" for the commentary), so that the text of book 1, chapter 1, section 1 has id Tb1c1s1, and the commentary on that section has id Cb1c1s1.
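
The id generation for the dummy <div>s might be sketched as follows. This assumes each <div> start tag sits on its own line in exactly the form shown above (unquoted type and n attributes, in that order); the names add_ids and LEVEL_LETTER are made up for this example, and a different citation scheme would need a different table of letters.

```python
import re

# Hypothetical mapping from div type to the letter used in ids.
LEVEL_LETTER = {"book": "b", "chapter": "c", "section": "s"}
DIV_TAG = re.compile(r"<div(\d) type=(\w+) n=(\w+)>")

def add_ids(lines, prefix):
    """Add an id to each <divN type=... n=...> line, tracking the enclosing divs."""
    current = {}  # div level -> id fragment such as "b1"
    result = []
    for line in lines:
        match = DIV_TAG.fullmatch(line.strip())
        if not match:
            result.append(line)  # not a bare div start tag; leave unchanged
            continue
        level, div_type, n = int(match.group(1)), match.group(2), match.group(3)
        current[level] = LEVEL_LETTER[div_type] + n
        # Forget deeper levels left over from the previous branch of the text.
        for deeper in [k for k in current if k > level]:
            del current[deeper]
        div_id = prefix + "".join(current[k] for k in sorted(current))
        result.append(f"<div{level} type={div_type} n={n} id={div_id}>")
    return result

dummy = ["<div1 type=book n=1>", "<div2 type=chapter n=1>", "<div3 type=section n=1>"]
for line in add_ids(dummy, "T"):
    print(line)
```

Calling the same function with the prefix "C" on the commentary's own <div> lines would give the parallel ids described above.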

Now augment the <bibl> elements that refer to the base text with <ref> elements whose targets use the ids of the dummy <div>s. For example "see <bibl n="August. conf. 1.1.1">1.1.1</bibl>" might become "see <ref target=Tb1c1s1><bibl n="August. conf. 1.1.1">1.1.1</bibl></ref>"; alternatively, you could replace the <bibl> elements with <ref> elements while you are performing this validation. You can now verify that every target corresponds to an actual id. If one does not, fix the content as well as the target attribute and the n attribute (if you have retained the <bibl> tags). Finally, when all the mistakes are corrected, strip the extraneous <ref> elements, and restore the <bibl> elements if you removed them.
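
The final clean-up step can also be automated. The sketch below removes only those <ref> wrappers whose target begins with "T" and which directly enclose a single <bibl>, so that the commentary's internal <ref>s (with "C" targets) are left alone. The function name strip_temp_refs is hypothetical, and the pattern assumes the wrapper and its <bibl> are on one line.

```python
import re

# A temporary <ref> (target beginning with "T") wrapped directly around one <bibl>.
TEMP_REF = re.compile(r"<ref target=T\w+>(<bibl\b[^>]*>.*?</bibl>)</ref>")

def strip_temp_refs(text: str) -> str:
    """Remove temporary <ref> wrappers, keeping the <bibl> elements inside them."""
    return TEMP_REF.sub(r"\1", text)

wrapped = 'see <ref target=Tb1c1s1><bibl n="August. conf. 1.1.1">1.1.1</bibl></ref>'
print(strip_temp_refs(wrapped))
# see <bibl n="August. conf. 1.1.1">1.1.1</bibl>
```

The dummy <text> element and its <div>s would be deleted separately once they are no longer needed.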

A scholarly commentary may quote other works in a variety of languages. It is not too hard to place <quote> tags with a program, but getting the languages correct is probably a manual process. One method is to guess the preponderant language (probably the language of the base text) and automatically set the lang attribute for all quotes to that language. Then run through the file with an editor, searching for "quote lang=", and change the language codes where required. This is tedious but can be done in under an hour per megabyte.
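
The automatic first pass might look something like the following sketch, which gives the default language to every <quote> start tag that does not already have a lang attribute. The function name set_default_lang is an assumption for this example.

```python
import re

def set_default_lang(text: str, lang: str) -> str:
    """Add lang=<lang> to every <quote> start tag that has no lang attribute."""
    def add_lang(match):
        attrs = match.group(1)
        if re.search(r"\blang\s*=", attrs):
            return match.group(0)  # already has a language; leave it alone
        return f"<quote lang={lang}{attrs}>"
    # <quote\b matches start tags only; end tags begin with "</".
    return re.sub(r"<quote\b([^>]*)>", add_lang, text)

sample = "<quote>novi libri</quote> and <quote lang=greek>ge/rontos</quote>"
print(set_default_lang(sample, "la"))
# <quote lang=la>novi libri</quote> and <quote lang=greek>ge/rontos</quote>
```

Because existing lang attributes are preserved, the pass can be repeated safely after some of the language codes have been corrected by hand.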


A translation should have the same general structure as the original text. Most likely, this means the translation will have the same <div>s or <milestone>s as the original, and if the original has both <div>s and <milestone>s, the translation probably has both as well. Matching the structure of the translation to that of the base text facilitates matching parallel sections.

There may be notes written by the translator as well as notes by the original author; the resp attribute on the notes will distinguish them.

Otherwise, translations are no more complicated to mark than original texts.

Footnotes and other annotations

Footnotes and other notes, as discussed above, are marked with the <note> element with the resp and place attributes. Notes frequently contain <bibl> references or <quote>s, and may contain almost anything else that might be inside a paragraph of text. It is convenient to start all <note> elements with a paragraph mark <p>, especially when this can be done automatically; this ensures that other elements that might be included in the <note> will not cause errors.
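
Adding the paragraph marks automatically is a simple substitution. This sketch, with the hypothetical function name add_note_paragraphs, inserts <p> after every <note> start tag that is not already followed by one.

```python
import re

def add_note_paragraphs(text: str) -> str:
    """Insert <p> after each <note ...> start tag not already followed by <p>."""
    # The lookahead skips notes that already begin with a paragraph mark.
    return re.sub(r"(<note\b[^>]*>)(?!\s*<p\b)", r"\1<p>", text, flags=re.IGNORECASE)

note = "<note place=foot resp=RCJ>This is the traditional Latin title</note>"
print(add_note_paragraphs(note))
# <note place=foot resp=RCJ><p>This is the traditional Latin title</note>
```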

The resp attribute indicates who wrote the note: the author of the original text, the editor of a prior edition, or the present markup editor. By convention, the value of this attribute is the person's initials.

The place attribute indicates whether this is a footnote (place=foot, the default if this attribute is omitted), a marginal note (place=marg), or an embedded or "in-line" note (place=inline).

An ordinary footnote might appear as follows:

... from stray hints in the <title>Coloneus</title> <note place=foot resp=RCJ><p>This is the traditional Latin title</note> and in the <title>Tyrannus</title>.

This markup indicates that the footnote explains the word Coloneus: the footnote appears next to that word, and the remainder of the sentence continues after the note is over. The display engine interpreting the markup should put the footnote number exactly where the <note> element appears in the text, and should put the footnote itself in some other suitable place, such as the bottom of the display page.

When the electronic edition is being prepared from a printed text, our convention is that the place attribute records the place of the note in that printed text, as opposed to the place where the markup editor would like to see it displayed. This is only a convention, not a fixed rule, so markup editors are at liberty to use place for the place where the note should be displayed, as opposed to the place where it was displayed. Another possibility, though somewhat more complicated, is to create a separate display specification for this document; this is especially appropriate when there are several documents to be treated the same way, for example a series of commentaries on the plays of Sophocles. Display specifications are outside the scope of the present discussion.

Apparatus critici and textual notes

Some texts include critical notes on the text and the decisions made by the editor. Commentaries, in particular, frequently give an apparatus criticus as part of the discussion of a difficult passage. In this case, the apparatus will be inside the main text of the commentary. Perseus texts generally do not have separate apparatus critici, or apparatus as separate documents; in what follows, we will discuss only critical notes included inside a text or commentary.

Use the <app> element to encode a critical apparatus. Within this element, the <lem> element gives the main reading, and <rdg> elements give variants. For an apparatus in a commentary, use <lem> to record the reading chosen by the editor of the base text. In other cases, there may be no reason to distinguish one reading from the others, and then all can be marked <rdg>.

Each <lem> or <rdg> element can have a wit attribute giving the witnesses for this reading, normally in the form of sigla. Those sigla can be documented in a witness list, <witList>, typically at the head of the text; the <witList> element contains <witness> elements, each with a sigil attribute to give the abbreviation that will appear in the wit attributes of <lem> and <rdg> elements later.

The <lem> and <rdg> elements may carry the lang attribute. Because variant readings are often unusual spellings or non-words, however, it is not always helpful to request morphological analysis for them. As a result it is generally simpler not to specify a language for variant readings; then they will all be treated alike, and neither the actual words nor the mistaken ones will receive morphology links.

A simple apparatus fragment, in its context, looks like this:

<p><lemma lang=la>quaerentes enim inveniunt eum</lemma>:
Elsewhere Augustine describes the errors of his youth as a failure to knock and seek.
<app><lem wit="G O1 S Knoll Skut. Ver.">inveniunt</lem>
<rdg wit="C D O2 Maur.">invenient</rdg>
<rdg wit="F Q">invenirent</rdg>

That is, six witnesses read "inveniunt," agreeing with this editor, four read "invenient," and two read "invenirent." Note that the apparatus appears as part of the comments on a phrase of the text.
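
Counts like these can be read off the wit attributes mechanically. The sketch below assumes that the sigla within a wit attribute are separated by spaces and that the attribute value is quoted, as in the fragment above; witness_counts is a hypothetical name.

```python
import re

# A <lem> or <rdg> start tag with a quoted wit attribute, then the reading itself.
READING = re.compile(r'<(lem|rdg) wit="([^"]*)">([^<]*)')

def witness_counts(app_markup: str):
    """Return (element, reading, number of sigla) for each <lem> or <rdg>."""
    return [(tag, reading.strip(), len(wit.split()))
            for tag, wit, reading in READING.findall(app_markup)]

fragment = '''<app><lem wit="G O1 S Knoll Skut. Ver.">inveniunt</lem>
<rdg wit="C D O2 Maur.">invenient</rdg>
<rdg wit="F Q">invenirent</rdg>'''

for tag, reading, count in witness_counts(fragment):
    print(tag, reading, count)
# lem inveniunt 6
# rdg invenient 4
# rdg invenirent 2
```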

It is convenient to encode the editor's comments in the apparatus as inline notes, as follows:

<app><lem wit="D2 O2 Maur. Knoll Skut. Ver.">omniaque <note place=inline resp=JJOD>(O2 ut vid.; O1 unclear)</note></lem>
<rdg wit="C D1 G S">omniaquae</rdg>

Figures and diagrams

Many texts contain figures, diagrams, or illustrations. We do not currently have a standard way of marking them within the text, and in many cases the figures may not have been captured during optical scanning or data entry of the original print edition.

Metrical schemata

Commentaries on verse may discuss the meter, and verse texts often use metrical symbols to indicate what is known about the shape of a lacuna. It is therefore sometimes necessary to include metrical symbols in a text. This rarely comes up with accentual-syllabic verse (as in English), but is not uncommon with quantitative verse (as in the classical languages).

We are working on a set of entities to describe metrical phenomena in quantitative verse. As a stop-gap, some entities in the ISO Diacriticals set can be used: &macr; for a macron (denoting a heavy syllable or a princeps element), &breve; for a breve (denoting a light syllable). There are no good entities yet for resolutions, contractions, ancipitia, or other metrical features.

Dictionaries and lexica

Dictionaries, encyclopedias, and lexica have complicated requirements of their own, beyond the scope of this document.

Validating the Markup

Syntactic validation

Once the text is marked up, it is necessary to check the SGML syntax. In our environment, this is done with the nsgmls parser program. When the parser reports no errors, the SGML is syntactically correct. It may not include everything that it needs, but it is at least a true SGML document. On a Unix or Linux system, the parser is most conveniently run from within emacs, though it can also be run from the command line. Similar tools are available for other platforms.

Note that nsgmls, like many other language processors, is prone to cascading errors. That is, one error in the SGML file may cause nsgmls to emit dozens or even hundreds of messages. Typically, cascading errors come from missing or mis-spelled end tags. Because the element whose end tag has been omitted is still open, the SGML parser complains about any subsequent elements that are not valid within the open one, and may complain about every <p>, <lemma>, <div3>, and so on all the way to the next <div1>. When there are many parser messages after only a small change to the source file, something like this has probably happened. A constructive strategy is to fix the first reported error and see whether the output improves.
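
When the parser output is long, a small script can pick out the first error. The sketch below assumes that each message has the colon-separated form program:file:line:column:type: text, with the type E marking an error; that layout is an assumption to check against your own parser's output, and it will not work for file names that themselves contain colons. The sample messages are placeholders, not actual nsgmls wording.

```python
import re

# Assumed message layout: program:file:line:column:type: text
MESSAGE = re.compile(
    r"^[^:]+:(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+):(?P<kind>[A-Z]):\s*(?P<text>.*)$"
)

def first_error(parser_output: str):
    """Return (file, line, column, text) for the first error message, or None."""
    for raw in parser_output.splitlines():
        m = MESSAGE.match(raw)
        if m and m.group("kind") == "E":
            return m.group("file"), int(m.group("line")), int(m.group("col")), m.group("text")
    return None

# Placeholder output for illustration only.
sample = """nsgmls:conf.sgml:3:0:W: (warning text)
nsgmls:conf.sgml:120:14:E: (first error text)
nsgmls:conf.sgml:121:3:E: (second error text)"""

print(first_error(sample))
```

After correcting the problem at the reported line, run the parser again; many of the later messages may simply disappear.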

What happens next?

After all the desired features have been tagged and the SGML syntax has been found to be correct, the text is ready for use. For Perseus and Stoa texts, this means a conversion to HTML, which takes place in several steps. The end result can be displayed in a web browser. Other utilities can use the SGML source in other ways, for example counting and analyzing the words tagged as Greek or Latin.


Examples in this document are drawn from: James J. O'Donnell's commentary on Augustine's Confessions, Oxford: 1992; Sir Richard C. Jebb's commentary on Sophocles's Oedipus at Colonus, Cambridge, UK: 1899; and E. T. Merrill's commentary on Catullus, Cambridge: 1893.

Document written by Anne Mahoney
HTML generated at 12:47:26, Tuesday, 21 December 1999


Please send your comments concerning The Stoa: A Consortium for Electronic Publication in the Humanities to Ross Scaife.
This document was published on: 21 December 1999