Unicode Polytonic Greek for the World Wide Web


Version 0.9.7

D R A F T

Creating Markup Texts Utilizing Unicode Polytonic Greek

Introduction to the concept of markup (two paragraphs, with reference to Anne Mahoney's pages)

What is a markup language?

A markup language is a method of describing the structural relationships of the different elements of a text by means of inline annotations; for example, a markup language uses codes to indicate that a paragraph is a paragraph, that a heading is a heading, and so forth.

One could describe the Leiden method of transcribing inscriptions as a markup language. Those letters which are uncertain are marked with an underdot; those which are lost and supplied conjecturally are marked with brackets. The markings, or markup, indicate the structural relationship between the epigrapher's account of the inscription and the inscription itself.

Most modern markup languages used in computing follow a syntax originally developed by Goldfarb, Mosher, and Lorie for their Generalized Markup Language, and carried over into its descendant, the Standard Generalized Markup Language (SGML). This syntax models a document as a collection of elements, each of which has a specific structural function in the document. For instance, the title of a document is one element, having the function of identifying the entire document by a single name; the chapters are other elements, each one indicating a logical division in the document's narrative; each chapter and section heading is an element, each one providing the identifying name for a given chapter or section; and paragraphs are elements, each one identifying the expression of a closely connected thought. (For reasons of feasibility, individual sentences are rarely treated as separate elements, though lines of verse are.)

In SGML and its descendants, including XML, these elements are marked off from one another by means of tags, identifying labels for each element which are distinguished from the actual text of the element by means of angle brackets: for instance, in HTML (the most common SGML vocabulary, that used for web pages on the World Wide Web) and in TEIxLITE (the XML vocabulary most used for literary works), paragraph elements are distinguished by being labeled with the tag <p>. Note that tags are intended to mark the structural function, not the appearance, of a given element.

<p>This is a paragraph of text, contained between
two paragraph tags, together (the tags and the text)
comprising a paragraph element.</p>

Elements can also contain other elements, depending upon the rules for the vocabulary: for example, both the TEIxLITE and XHTML (the XML version of HTML) vocabularies allow ordered and bulleted list elements which contain list item elements as subelements; these are called subelements (or child elements, to use the more correct term) in the context of their containing, or parent, element.

<list type="ordered">
  <item>the first item in a list</item>
  <item>the second item in a list</item>
  <item>the third item in a list</item>
  <item>the fourth item in a list; each item is
  represented by a child list item element of the parent
  list element.</item>
</list>

The meaning of these elements can be modified by the application of attributes, which narrow the application of a specific element tag. For instance, a document might be divided into a number of sections, including a table of contents, a sequence of chapters, and an index; in TEIxLITE, each might be marked using the <div1> or first-order text division tag, and the type of division indicated using the type attribute: <div1 type="toc"> for the table of contents and <div1 type="chapter"> for a chapter.

<div1 type="chapter" n="1">
  <head>Chapter 1: Unicode Polytonic Greek for the Web</head>
  <p>Here we have a first-order division element, which is modified
  by two attributes, one indicating the type of division
  it represents, and the other giving the ordinal number for that division
  in the context of the document as a whole. This division element
  contains three sub-elements: a heading element, marked by the head tags and
  comprising both the head tags and the text contained within them;
  a paragraph element, marked by the paragraph tags and containing most of the
  explanatory text; and a list element, marked by the list tags and modified by a
  type attribute indicating that it is a bulleted list, and itself containing
  four item elements, each of which represents one item in a list, is
  marked off by item tags, and contains the text of an item. So, this example
  first-order division element comprises all these things:</p>
  <list type="bulleted">
    <item>the tags</item>
    <item>the attributes</item>
    <item>the subelements, their tags and attributes</item>
    <item>the text contained within the subelements.</item>
  </list>
</div1>
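The parent and child relationships described above can be inspected programmatically. The following is only a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser and a shortened, made-up division element (the sample text is mine, not taken from any real TEI document):

```python
import xml.etree.ElementTree as ET

# A shortened, hypothetical division element for illustration only.
sample = """<div1 type="chapter">
  <head>Chapter 1</head>
  <p>Some explanatory text.</p>
  <list type="bulleted">
    <item>the tags</item>
    <item>the attributes</item>
  </list>
</div1>"""

root = ET.fromstring(sample)

# The parent element's name and its attributes.
print(root.tag, root.attrib)   # div1 {'type': 'chapter'}

# The names of its direct child elements, in document order.
children = [child.tag for child in root]
print(children)                # ['head', 'p', 'list']

# The text of each item element inside the list element.
items = [item.text for item in root.find("list")]
print(items)                   # ['the tags', 'the attributes']
```

Each element reports only its own children, so the item elements appear under the list element rather than directly under the division.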

In XML, elements must be marked off with both an opening and a closing tag (with special exceptions called empty elements). The opening tag begins with the element name and also includes any attributes (e.g., <div1 type="chapter">); the closing tag includes only a slash character followed by the element name (e.g., </div1>).

In SGML, closing tags are often optional.
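One way to see the XML requirement in practice is to hand markup to an XML parser and observe whether it is accepted. This is a rough sketch in Python using the standard library's xml.etree.ElementTree; the helper name is_well_formed is my own invention, and the check covers well-formedness only, not validity against TEI or XHTML:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# The list is closed with a proper closing tag.
closed = "<list type='ordered'><item>first</item><item>second</item></list>"

# The final tag lacks its slash, so it opens a second list instead of
# closing the first one, and neither list is ever closed.
unclosed = "<list type='ordered'><item>first</item><item>second</item><list>"

print(is_well_formed(closed))    # True
print(is_well_formed(unclosed))  # False
```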

Markup Languages and Vocabularies

Technically speaking, although it is called a markup language (remember, XML stands for eXtensible Markup Language), XML is actually a meta-language for markup languages: it provides the syntax which an XML markup language, or vocabulary, must follow. XML is an analogue to SGML with a simplified syntax which eases the parsing of documents by computer programs, and was designed with web publication applications in mind.

XML and SGML vocabularies are collections of element names and attribute names which can be used to describe a particular genre of document. For example, HTML is an SGML vocabulary designed to mark up web pages, and XHTML is its XML analogue. DocBook is a vocabulary (with both SGML and XML versions) designed to mark up computer program documentation.

In addition to XHTML, the vocabularies of most interest to classicists are those defined by the Text Encoding Initiative, or TEI, for literary texts. The Text Encoding Initiative has defined two markup languages for the markup of literary texts using SGML, TEI and TEI-Lite, the latter being appropriate for most editions of most literary texts, while the former is more appropriate for scholarly editions, especially editiones maiores and full bibliographic descriptions of editiones principes. TEI has also defined these vocabularies in terms of XML.

There are also a number of vocabularies defined by the World Wide Web Consortium (W3C) and other organizations for the manipulation of XML documents in general. The most important of these are those defined by the W3C: XSLT, RDF, XPath, XPointer, XLink, XML Schema (XSD), and XSL-FO.

Finally, there are XML vocabularies which have been designed for special purposes. For instance, Bruce G. Robertson of [] has designed an XML (or SGML) vocabulary for the markup of historical events in historical texts called the Historical Events Markup Language, or HEML; Tom Elliott of the Ancient World Mapping Center at UNC/Chapel Hill (and others) have designed a vocabulary for the markup of inscriptions called EpiDoc; &c. ad nauseam.

Details of XML

All XML vocabularies have the following requirements in common:

Web Pages: XHTML

Introduction to the concept of XHTML (two paragraphs).

The parts of a web page that are relevant to making an XHTML Unicode Polytonic Greek page

Choosing the Right Unicode

The recommended means of using Unicode in a World Wide Web document is to use the UTF-8 encoding, and to use NFC. (The full names for these are Unicode Transformation Format, 8-bit, and Normalization Form C.) The following instructions will assume that the publisher is in fact preparing his or her text in the UTF-8 encoding and in Normalization Form C. For an explanation of the reasons why these are the recommended methods, see the sections on encoding forms and normalization forms.

Note that to use NFC, one must use precomposed characters (and vice-versa).
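The relationship between the two normalization forms can be checked with a few lines of code. This is a small sketch in Python, assuming the standard library's unicodedata module; the variable names are mine, and the single letter used (alpha with smooth breathing and acute) is only an example:

```python
import unicodedata

# GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA as one precomposed code point.
precomposed = "\u1f04"

# The same letter written as a base alpha followed by two combining marks
# (COMBINING COMMA ABOVE, then COMBINING ACUTE ACCENT).
decomposed = "\u03b1\u0313\u0301"

# NFC composes the sequence; NFD decomposes the precomposed character.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(len(nfc), len(nfd))          # 1 3
print(nfc == precomposed)          # True
print(len(nfc.encode("utf-8")))    # 3
```

The last line shows that the single precomposed character occupies three bytes once the text is saved in UTF-8.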

The Encoding Declaration

 

To indicate that a web page is using the UTF-8 encoding of Unicode, an author needs to include an empty element called an encoding declaration in the headmatter of the page. For XHTML, the best-supported method of adding an encoding declaration is with a <meta> empty element with two attributes: an http-equiv attribute of "Content-Type", indicating that the meta element is defining the content type of the page, and a content attribute indicating that the content type of the page is html in the UTF-8 encoding, which is done with the value "text/html; charset=utf-8". These values must be typed exactly as shown in the example below.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Because the meta element is an empty element in an XML vocabulary, it is self-closing (i.e., has a slash at the end of the tag). Those who insist upon using non-well-formed HTML can omit that slash.
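If you want to confirm that a page carries the declaration, and that the file is actually saved in the encoding it declares, something like the following may help. It is a sketch only, written in Python with the standard library's XML parser; the function name declared_charset and the tiny sample page are assumptions of mine, and it works only on well-formed XHTML:

```python
import xml.etree.ElementTree as ET
from typing import Optional

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A tiny, made-up page; the Greek word is written with escapes so the
# example does not depend on how this listing itself was saved.
page = f"""<html xmlns="{XHTML_NS}">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Sample</title>
  </head>
  <body><p class="Greek">\u1f0c\u03bb\u03ba\u03b7\u03c3\u03c4\u03b9\u03c2</p></body>
</html>"""

def declared_charset(xhtml_bytes: bytes) -> Optional[str]:
    """Return the charset named in the Content-Type meta element, if any."""
    root = ET.fromstring(xhtml_bytes)
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        if meta.get("http-equiv", "").lower() == "content-type":
            for part in meta.get("content", "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.lower() == "charset":
                    return value.strip().lower()
    return None

# Saving as UTF-8 is what makes the declaration true.
data = page.encode("utf-8")
print(declared_charset(data))   # utf-8
```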

Using Cascading Style Sheets to Declare Fonts for Different Classes

Ideally, your browser will (as Mozilla and its derivatives can) be able to locate a font which is capable of displaying any character it encounters and automatically assign that font to the necessary character. In practice, most browsers will merely check to see if a list of fonts has been provided using either the deprecated <font> element tag or (the preferred method) Cascading Style Sheet language (CSS), and only use the first available font from whichever list has been provided, regardless of whether or not that font provides a glyph for the necessary character.
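The difference between the two behaviors can be modelled in a few lines. The following Python sketch is purely illustrative: the font names and their character coverage are invented for the example and do not describe any real font or browser.

```python
from typing import Dict, List, Optional, Set

LATIN = set("abcdefghijklmnopqrstuvwxyz")

# Hypothetical installed fonts and the characters each one covers.
INSTALLED: Dict[str, Set[str]] = {
    "LatinOnly": LATIN,
    "GreekCapable": LATIN | {"\u03b1", "\u1f04"},
}

def first_available(font_list: List[str]) -> Optional[str]:
    """Pick the first listed font that is installed, ignoring glyph coverage."""
    for name in font_list:
        if name in INSTALLED:
            return name
    return None

def per_character(font_list: List[str], ch: str) -> Optional[str]:
    """Pick the first listed, installed font that actually covers ch."""
    for name in font_list:
        if ch in INSTALLED.get(name, set()):
            return name
    return None

fonts = ["NotInstalled", "LatinOnly", "GreekCapable"]
alpha_psili_oxia = "\u1f04"

print(first_available(fonts))                  # LatinOnly
print(per_character(fonts, alpha_psili_oxia))  # GreekCapable
```

Under the first strategy the polytonic character is handed to a font that has no glyph for it, which is why the order of the font list matters so much.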

The following section explains how to use Cascading Style Sheets to define default fonts for various classes; specifically, how to define a class called "Greek" to mark off document elements containing Greek text, and how to set Unicode polytonic Greek fonts as the default fonts for all document element instances which are members of that class.

Ideally, one would prefer to use the lang attribute of XHTML rather than creating classes for each language. Unfortunately, this method does not seem to work properly with either IE6 or Mozilla, let alone earlier browsers.

To begin with, define two classes. Call one "Greek" and the other "Latin" or "English" or whichever you prefer (since strictly speaking we're discussing writing systems rather than languages, "Latin" would be more consistent; but "English" may be less confusing for now). One does so by creating a stylesheet, either in a separate file (usually called filename.css, where filename is whatever file name the author chooses), or more easily within a <style> element within the <head> element of the <html> element.

<style type="text/css">
<!--
.Greek {font-family: Palatino Linotype, Arial Unicode MS, Athena, Lucida Sans Unicode;}
.English {font-family: Palatino Linotype, Palatino, Clas Garamond, Garamond, Book Antiqua, Times New Roman, serif;}
//-->
</style>

It is important to include the type attribute (type="text/css") in the element's opening tag; while most XHTML documents do use CSS for stylesheets, you cannot be sure that every user agent will assume that unidentified style sheets are CSS.

Here is another example designed with the distinction between precomposed characters (NFC) and combining diacriticals (NFD) in mind. It is very, very unlikely that you will be using both NFC and NFD in the same document, but I have included this example to demonstrate how the differences in each font's support of the three relevant Unicode ranges, basic Greek, combining diacriticals, and extended Greek, can be handled in the style sheet.

<style type="text/css">
<!--
.Greek {font-family: 'Palatino Linotype','Athena Unicode',Cardo,'Vusillus Old Face','Arial Unicode MS','Georgia Greek','Athena Roman',Athena,Code2000,'TITUS Cyberbit Basic','Aisa Unicode','Lucida Sans Unicode';}
.Combining {font-family: Cardo,'Vusillus Old Face','Arial Unicode MS','Georgia Greek',Code2000,'TITUS Cyberbit Basic','Lucida Sans Unicode';}
.English {font-family: 'Palatino Linotype',Palatino,'Clas Garamond',Garamond,'Book Antiqua',Bembo,Georgia,'Times New Roman',serif;}
//-->
</style>

(The following comments may no longer be relevant): In theory, the font names should be in quotation marks; but some browsers object to double quotation marks in the style sheet, while others object to quotation marks altogether (i.e., they do not recognize the font names when they are in quotation marks). This behavior (failing to recognize quoted font names) is a violation of the CSS recommendation, and should be considered a bug; but it is common enough that it may be necessary for you to leave the quotation marks out. For my examples, I have followed the CSS recommendation; if your experience tells you that this does not work with the browsers used by your target audience, you should try the same style definitions without quotation marks. If you do use quotation marks, single quotes might be best for this purpose (unlike in attribute values in markup tags, where either single or double quotes can be used, and double quotes are often easier to read, at least for Americans).

The order in the font-family listing is by preference, from most preferred to least preferred. You should begin defining the classes and the fonts for each class with whichever writing system is dominant in your document; if Greek is dominant, create the Greek class selector first; if English (or Latin, etc.) is dominant, create the English (or Latin, etc.) selector first.

The reason for beginning with the dominant language and script in your document is that you want the type used for the two scripts to harmonize, and the font selection with the most influence on the overall look and feel of your document is the font used for its dominant language and script. It is best to choose the font for that writing system first, and then find a font for the other writing system that harmonizes with it.

For a document in English with Greek quotes, begin with your preferred Latin font. Personally, I prefer Palatino and Garamond fonts for most web publishing, so I have arranged a list in order of preference, from most preferred (first) to least preferred (last), of the fonts that I suggest for reading the English text in this document. The final entry, 'serif', is a generic font family name: it matches whatever is the default serif font in the web browser. Note that if any of the fonts listed before 'serif' is available to the reader, that font will be used in preference (thus my text will only appear in Book Antiqua if you do not have the Palatino Linotype, Palatino, Clas Garamond, or Garamond font available).

For Latin fonts (here I am using the word "Latin" to refer to the script as used by Latin, English, French, and many other languages), keep the following issues in mind:

Now create the second selector with the second class. Things to keep in mind when choosing a Greek font include the following:

Using UTF-8 Encoded Unicode Text

<div class="Greek">
<h5>Ἄλκηστις</h5>
<p>
<br />
Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει, <br />
λέξαι θέλω σοι πρὶν θανεῖν ἃ βούλομαι. <br />
ἐγώ σε πρεσβεύουσα κἀντὶ τῆς ἐμῆς <br />
ψυχῆς καταστήσασα φῶς τόδ' εἰσορᾶν <br />
θνῄσκω, παρόν μοι μὴ θανεῖν ὑπὲρ σέθεν, <br />
ἀλλ' ἄνδρα τε σχεῖν Θεσσαλῶν ὃν ἤθελον <br />
καὶ δῶμα ναίειν ὄλβιον τυραννίδι. <br />
κοὐκ ἠθέλησα ζῆν ἀποσπασθεῖσα σοῦ <br />
σὺν παισὶν ὀρφανοῖσιν, οὐδ' ἐφεισάμην <br />
ἥβης, ἔχους' ἐν οἷς ἐτερπόμην ἐγώ. <br />
</p>
</div>

Note that on this page I have used the monospaced font Courier New for all code examples, and that font does not contain glyphs for the precomposed characters; unless you are using Mozilla, Netscape 6, or Netscape 7, you likely will not be able to read the accented characters. When editing your own documents, you may or may not be able to read all the characters, depending upon what editor you use. For instance, Figure 1 provides a sample of what this page's code looks like in the Mozilla view source window.

Figure 1. Screen shot of Mozilla view source window with polytonic Greek XHTML markup.

Unfortunately, not all text editors and HTML editors handle Unicode so well. Figure 2 shows what this page's code looks like in the main editing window of my favorite Windows development editor, HTML-Kit, in Windows ME.

Figure 2. Screen shot of HTML-Kit with polytonic Greek XHTML markup.

HTML-Kit now includes a Unicode pad that allows the user to see Unicode text in Unicode, though unfortunately it is limited to one font, and therefore to only the characters in that font.

An Alternative: Unicode Numeric Entities

Below you'll find the numeric entity codes to display the text of Euripides, Alcestis (280ff.) in Unicode Greek using precomposed polytonic characters.

<div class="Greek">
<strong class="speaker">
&#7948;&#955;&#954;&#951;&#963;&#964;&#953;&#962;</strong>
<br /><br />
&#7948;&#948;&#956;&#951;&#952;',
      &#8001;&#961;&#8119;&#962; &#947;&#8048;&#961;
      &#964;&#7936;&#956;&#8048;
      &#960;&#961;&#8049;&#947;&#956;&#945;&#952;' &#8033;&#962;
      &#7956;&#967;&#949;&#953;,
      <br />
&#955;&#8051;&#958;&#945;&#953; &#952;&#8051;&#955;&#969;
      &#963;&#959;&#953; &#960;&#961;&#8054;&#957;
      &#952;&#945;&#957;&#949;&#8150;&#957; &#7939;
      &#946;&#959;&#8059;&#955;&#959;&#956;&#945;&#953;.
      <br />
&#7952;&#947;&#8061; &#963;&#949;
      &#960;&#961;&#949;&#963;&#946;&#949;&#8059;&#959;&#965;&#963;&#945;
      &#954;&#7936;&#957;&#964;&#8054; &#964;&#8134;&#962;
      &#7952;&#956;&#8134;&#962;
      <br />
&#968;&#965;&#967;&#8134;&#962;
      &#954;&#945;&#964;&#945;&#963;&#964;&#8053;&#963;&#945;&#963;&#945;
      &#966;&#8182;&#962; &#964;&#8057;&#948;'
      &#949;&#7984;&#963;&#959;&#961;&#8118;&#957;
      <br />
&#952;&#957;&#8132;&#963;&#954;&#969;,
      &#960;&#945;&#961;&#8057;&#957; &#956;&#959;&#953;
      &#956;&#8052; &#952;&#945;&#957;&#949;&#8150;&#957;
      &#8017;&#960;&#8050;&#961; &#963;&#8051;&#952;&#949;&#957;,
      <br />
&#7936;&#955;&#955;' &#7940;&#957;&#948;&#961;&#945;
      &#964;&#949; &#963;&#967;&#949;&#8150;&#957;
      &#920;&#949;&#963;&#963;&#945;&#955;&#8182;&#957;
      &#8003;&#957; &#7972;&#952;&#949;&#955;&#959;&#957;
      <br />
&#954;&#945;&#8054; &#948;&#8182;&#956;&#945;
      &#957;&#945;&#8055;&#949;&#953;&#957;
      &#8004;&#955;&#946;&#953;&#959;&#957;
      &#964;&#965;&#961;&#945;&#957;&#957;&#8055;&#948;&#953;.
      <br />
&#954;&#959;&#8016;&#954;
      &#7968;&#952;&#8051;&#955;&#951;&#963;&#945;
      &#950;&#8134;&#957;
      &#7936;&#960;&#959;&#963;&#960;&#945;&#963;&#952;&#949;&#8150;&#963;&#945;
      &#963;&#959;&#8166;
      <br />
&#963;&#8058;&#957; &#960;&#945;&#953;&#963;&#8054;&#957;
      &#8000;&#961;&#966;&#945;&#957;&#959;&#8150;&#963;&#953;&#957;,
      &#959;&#8016;&#948;'
      &#7952;&#966;&#949;&#953;&#963;&#8049;&#956;&#951;&#957;
      <br />
&#7973;&#946;&#951;&#962;,
      &#7956;&#967;&#959;&#965;&#962;' &#7952;&#957;
      &#959;&#7991;&#962;
      &#7952;&#964;&#949;&#961;&#960;&#8057;&#956;&#951;&#957;
      &#7952;&#947;&#8061;.
      <br />			
</div>
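Typing such codes by hand is tedious and error-prone, so in practice one would generate them. The following is a minimal sketch in Python; the function name to_numeric_references is my own, and the sketch assumes that plain ASCII characters (including any markup) should be left untouched and that every other character becomes a decimal reference:

```python
import html

def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

# The speaker's name, Ἄλκηστις, written with escapes so that the
# precomposed code points are unambiguous.
word = "\u1f0c\u03bb\u03ba\u03b7\u03c3\u03c4\u03b9\u03c2"

encoded = to_numeric_references(word)
print(encoded)   # &#7948;&#955;&#954;&#951;&#963;&#964;&#953;&#962;

# Decoding the references gives back the original text.
print(html.unescape(encoded) == word)   # True
```

The output for this word matches the first line of entities in the listing above.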

 Copyright © 1998-2002 Patrick Rourke. All rights reserved.
D R A F T - Under Development
 Please do not treat this as a published work until it is finished!