Words and Lemmatization

Tagging lexical words in the text and/or linking to lemmata for purposes of indexing or search.

Explicit markup of words (tokenization) and identification of their dictionary headwords (lemmatization) are both optional. Many projects simply leave these features unmarked, or rely on automated processes in search software to detect word-breaks and link to lemmatizing tools such as Morpheus. (The Papyrological Navigator, Perseus, and the TLG all use methods similar to this.)

To mark up lexical words explicitly in a papyrological or epigraphic text, however, each word in the text should be enclosed in a <w> element. (For ease of processing, inter-word spacing, punctuation and other features should be left outside this element, and if possible there should be no spaces or line breaks within the <w> element.)

<w>maximo</w>
<w>tribunicia</w>
<w>potestate</w>
<num>XXIIII</num>
<w>imperatori</w>
(IRT)
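
Where a project prefers to generate this markup automatically, the wrapping step might be sketched as follows. This is only an illustration under stated assumptions: the function name wrap_words is hypothetical, the input is assumed to be plain transcribed text with no existing markup, and a "word" is taken naively to be any unbroken run of letters. Numerals such as XXIIII above, abbreviations and restored text need editorial judgement and are not handled here.

```python
import re
import unicodedata
from xml.sax.saxutils import escape

# Naive assumption: a word is any unbroken run of Unicode letters.
WORD = re.compile(r"[^\W\d_]+")

def wrap_words(text: str) -> str:
    """Wrap each run of letters in <w>, leaving spaces and punctuation outside."""
    # Compose accented characters first so that combining marks do not split words.
    text = unicodedata.normalize("NFC", text)
    out = []
    pos = 0
    for match in WORD.finditer(text):
        out.append(escape(text[pos:match.start()]))  # spacing and punctuation stay outside <w>
        out.append(f"<w>{escape(match.group())}</w>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)

print(wrap_words("maximo tribunicia potestate"))
```

Because everything between matches is copied through unchanged, the output contains no whitespace inside any <w> element, in line with the advice above.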

To record the lemma or dictionary headword of the word in question, the simplest solution is to enter the uninflected form in a lemma attribute, which may be used, for example, to generate the entries in a lexical index to the corpus.

<w lemma="ἵστημι">ἕστηκα</w>
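
As a rough illustration of how such attributes could feed a lexical index, the sketch below groups the attested forms in a fragment under their lemmata. The function name lemma_index, the wrapping <ab> element and the two extra Greek forms in the sample are assumptions made for the example; a real project would more likely do this in its existing XSLT or search pipeline.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def lemma_index(xml_text: str) -> dict:
    """Map each lemma attribute value to the list of forms tagged with it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for el in root.iter():
        # Compare the local name only, so the sketch works with or without the TEI namespace.
        if el.tag.rsplit("}", 1)[-1] != "w":
            continue
        lemma = el.get("lemma")
        if lemma:
            index[lemma].append("".join(el.itertext()))
    return dict(index)

# Illustrative sample only; the second and third words are not from the guidelines.
SAMPLE = (
    '<ab><w lemma="ἵστημι">ἕστηκα</w> '
    '<w lemma="ἵστημι">ἔστησα</w> '
    '<w lemma="φέρω">ἤνεγκα</w></ab>'
)

for lemma, forms in lemma_index(SAMPLE).items():
    print(lemma, forms)
```

Words without a lemma attribute are simply skipped, so partially lemmatized texts can still be indexed.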

A slightly more sophisticated solution is to supply, in a lemmaRef attribute, a URL or other URI that points to the entry for the word in question in a database or online dictionary. This allows, for example, better disambiguation of homonymous words, or linking to morphological and statistical information about the word.

<w lemmaRef="http://www.perseus.tufts.edu/hopper/morph?l=fero&amp;la=la">tulisti</w>
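
A processor that wants to use such a link might begin by reading the attribute and unpacking the URL, roughly as sketched below. The function name lemma_ref_params is hypothetical, and reading the l and la parameters of the example URL as lemma and language is an assumption based on their values, not something these guidelines specify.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

def lemma_ref_params(w_xml: str) -> dict:
    """Return the query parameters of a <w> element's lemmaRef URL."""
    w = ET.fromstring(w_xml)
    ref = w.get("lemmaRef")
    if ref is None:
        return {}
    # The XML parser has already turned &amp; back into & at this point.
    query = parse_qs(urlsplit(ref).query)
    # Keep the first value of each parameter.
    return {key: values[0] for key, values in query.items()}

W = '<w lemmaRef="http://www.perseus.tufts.edu/hopper/morph?l=fero&amp;la=la">tulisti</w>'
print(lemma_ref_params(W))
```

Note that the ampersand must be written as &amp; inside the attribute for the XML to be well formed; the parser restores the plain & before the URL is split.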

Responsibility for this section

  1. Simona Stoyanova, author
  2. Gabriel Bodard, author
Date: 2013-05-02