Unicode Polytonic Greek for the World Wide Web

Version 0.9.7

D R A F T

Introduction

Unicode is a universal standard for character encoding, developed and published by the Unicode Consortium, that permits more than a million separate characters to be referenced with one standard: enough for all the alphabets, syllabaries, logographic and mixed scripts used by modern readers as well as a large number of ancient scripts. Where the original ASCII encoding used only seven bits for each character, allowing only 128 possible characters, and the more modern ISO 8859 encodings use one byte (eight bits), each thus allowing 256 possible characters, Unicode uses (depending upon encoding) anywhere from one to six bytes for each character, theoretically allowing 2^21 (about 2.1 million) possible characters. The code space the standard actually defines, U+0000 through U+10FFFF, holds about 1.1 million code points, most of which remain available for characters once the private use areas and other reserved code points are excluded.
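The variable byte counts mentioned above can be observed directly. The following is a minimal sketch in Python (a language chosen here for illustration only; it is not used elsewhere in this book), and the helper name encoded_lengths is hypothetical. Note that UTF-8 as implemented in Python never uses more than four bytes for one character.

```python
# Count how many bytes one character occupies in three Unicode encodings.
samples = {
    "Latin capital A": "A",                      # U+0041
    "Greek small alpha": "\u03b1",               # U+03B1
    "alpha with psili and oxia": "\u1f04",       # U+1F04, extended Greek
    "Old Italic letter A": "\U00010300",         # U+10300, beyond U+FFFF
}

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    # The "-be" variants are used so that no byte order mark is counted.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {encoded_lengths(ch)}")
```

The extended Greek characters cost three bytes each in UTF-8 but only two in UTF-16, one of the trade-offs between the encodings.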

The current version of the Unicode Standard, Unicode 3.1.1, defines 102,655 characters, including all the characters for the major European, Asian, African, and American alphabetic and syllabic writing systems, the most important characters for the Chinese writing system and those ideographic/logographic systems derived from Chinese, and a number of other minority and historical writing systems.

For classicists, the most important facts are these:

  1. Unicode is a universal standard maintained by the International Organization for Standardization (ISO) and the international Unicode Consortium, and it has been adopted by the international World Wide Web Consortium as the standard method of encoding text for World Wide Web documents. Heretofore most ISO standards have had useful lives measured in decades; for instance, the ISO standard for text markup, SGML, was first adopted in 1986 and is today (in the forms of XML and HTML) the most widely used method of representing rich text documents in electronic form.
  2. Unicode is entirely platform independent. Unicode text can be read on Macintosh computers with OS X and either the OmniWeb (version 4.0 or higher) or Mozilla (version 0.9.6 or higher) web browsers; on Windows computers with the Netscape (version 4.5 or higher), Mozilla (version M14-M18 and 0.6 or higher), or Internet Explorer (version 4.0 or higher) web browsers; on Linux computers with XFree86 (version 4.0 or higher) and the Konqueror (version 1.0 or higher), Netscape (version 6.0 or higher), or Mozilla (version M16-M18 and 0.6 or higher) web browsers; and on several other computing platforms. The only widely used platform which is excluded by the use of Unicode is Macintosh OS versions 8.6 through 9.2.1, which is also excluded by many of the proprietary encodings.
  3. Unicode includes ranges for basic Greek and Coptic, extended Greek characters, and combining diacriticals, which together allow for the representation of all character and diacritical combinations in the polytonic classical Greek writing system. (There has been some discussion of severing Coptic from the Greek range, but that I believe remains only in the discussion stage.)
  4. Unicode also includes two Aegean scripts, Cypriot and Linear B, as well as the Old Italic script used for Etruscan and several Italic languages, and Byzantine musical symbols, and will likely be expanded in the future to represent other writing systems and symbol repertoires of importance to classicists.

The main focus of this electronic book, Unicode Polytonic Greek for the World Wide Web (henceforth UPGW3) will be upon the use of Unicode for the representation of polytonic Greek for World Wide Web-based electronic publications (XML and XHTML documents) which require polytonic Greek text. The intended audience is the community of professional and dedicated amateur classicists with limited technical expertise in markup and electronic publishing, though it is hoped that these pages will be of use to readers with more developed technical knowledge and of use to readers interested in biblical and Byzantine scholarship, among others. The intended purpose is for UPGW3 to serve as a resource for those who want a comprehensive introduction to using the Unicode encoding to publish Greek text on the World Wide Web. Where possible, UPGW3 will provide the techniques required to

Although other encoding methods have been used in the past, Unicode provides a much richer, more robust set of capabilities for the publication of electronic texts in multiple languages, and even of texts exclusively in polytonic Greek. A discussion of the factors which make Unicode the best choice for the electronic encoding of polytonic Greek is provided in the section entitled Why Unicode?.

At the end of this page there is an example text from Euripides' Alcestis in Unicode polytonic Greek using the methods recommended in this book (for Unicode, UTF-8 and Normalization Form C [precomposed characters]; for the markup, XHTML and CSS1). Additional sections describe how to read this text in Windows, Macintosh OS X, and Red Hat Linux 7.0. Subsequent sections explain how to apply the same methods, with Unicode-compliant tools, to create your own pages of Greek text.

In order to read this text, you will need the following:

  1. An operating system that supports Unicode and the Unicode features of the font and the browser (Windows 95, 98, 98 Second Edition, NT 4.0, 2000, or XP; Macintosh OS X; Linux with XFree86 4.0; BeOS 5).
  2. A Unicode-enabled web browser that understands the Cascading Style Sheet language (Mozilla 0.9.6 or higher for Windows, OS X, or Linux, Netscape 6.2 for Windows or Linux, Netscape 4.7 for Windows, OmniWeb 4.0 for OS X, Konqueror for Linux with KDE 2, or NetPositive for BeOS 5).
  3. A Unicode font with support for polytonic Greek, specifically with support for precomposed characters.

How it works

There are many different ways of using Unicode to represent Greek text in an XML document or web page. For instance, you can use one of the Unicode encodings (e.g., UTF-8, UTF-16, or UCS-2), or use numerical entities in a non-Unicode encoding (which any Unicode-supporting browser can translate into the proper characters). You can also choose between two standardized methods of representing the diacriticals. Normalization Form C uses one character with the letter and all the diacriticals preassembled for accurate display, and only uses the simplest possible coding for each character or character and diacriticals combination. Normalization Form D uses separate characters for each letter and diacritical, and, in supporting operating systems and applications, requires that the diacriticals be properly arranged electronically; it likewise only uses the simplest possible representation for each character or diacritical. There are also a number of non-standard methods, which are strongly discouraged.

In accordance with the recommendations of the World Wide Web Consortium (W3C), authors of electronic texts utilizing HTML, XHTML, or other XML vocabularies should utilize Normalization Form C, which uses precomposed characters to represent polytonic Greek. Because it is the most widely supported Unicode encoding, authors of World Wide Web documents should use the UTF-8 encoding (rather than UTF-16) to represent Unicode text. Authors who are concerned that their readers will not be able to set their browsers to automatically detect the UTF-8 encoding, or who are publishing on web servers which they do not maintain and which provide an encoding other than UTF-8 in the hypertext transfer protocol header sent with each web page, may choose to use numerical entities in their pages instead, but this will prevent those operating systems and applications which use Unicode encodings as their native encodings from displaying the HTML source of the Greek text in text editor windows, so it is not recommended for other uses.
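To make the difference between raw UTF-8 and numerical entities concrete, here is a minimal Python sketch (Python is used purely for illustration and is an assumption of this example, not one of the tools discussed in this book). It writes the word ἄνδρα both ways, using only standard library calls.

```python
import html

# ἄνδρα in Normalization Form C: the first letter is one precomposed character.
word = "\u1f04\u03bd\u03b4\u03c1\u03b1"

# Option 1: the raw UTF-8 bytes, as they would be stored in a UTF-8 web page.
as_utf8 = word.encode("utf-8")

# Option 2: numerical entities, which survive in a page served as plain ASCII
# or as any other non-Unicode encoding.
as_entities = word.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(as_utf8)
print(as_entities)

# A Unicode-supporting browser translates the entities back into characters;
# html.unescape performs the same translation here.
print(html.unescape(as_entities) == word)
```

The entity form is readable by any browser regardless of the declared encoding, but, as noted above, the source itself is no longer legible as Greek in a text editor.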

For the purpose of simplification, henceforth all references to "precomposed characters" will assume the use of Normalization Form C, and all references to "combining diacriticals" will assume the use of Normalization Form D; it will be assumed that all web documents will be prepared utilizing the UTF-8 encoding. For a more detailed discussion of UTF-8 and other Unicode encodings, see the Encodings page.

To simplify, Greek text can be typed using either precomposed characters, in which all the diacriticals occur as part of the same glyph or item of type as the character they modify, or using combining diacriticals, in which the diacritical combinations are on a separate glyph from the character they modify and are displayed in the same space as the character they modify.
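The two representations can be compared side by side. This is a minimal sketch in Python, relying on the standard unicodedata module; the helper name code_points is hypothetical and the choice of Python is an assumption made for illustration.

```python
import unicodedata

# ἄνδρα, typed with a precomposed first letter (alpha with psili and oxia).
word = "\u1f04\u03bd\u03b4\u03c1\u03b1"

nfc = unicodedata.normalize("NFC", word)  # precomposed characters
nfd = unicodedata.normalize("NFD", word)  # letters plus combining diacriticals

def code_points(text):
    """List the code points of text in the usual U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

print("NFC:", code_points(nfc))
print("NFD:", code_points(nfd))
print("characters:", len(nfc), "vs", len(nfd))
print("UTF-8 bytes:", len(nfc.encode("utf-8")), "vs", len(nfd.encode("utf-8")))
```

Both strings should display identically in software that fully supports combining diacriticals, yet they are different sequences of characters, which is why a document should use one form consistently.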

Combining Diacriticals

One subrange of Unicode is dedicated to combining diacriticals. Combining diacriticals are characters which are used as diacriticals to modify other characters; when typed after a character (in normal Greek text) they are displayed above, below, to the side, around or within a character. For example, a combining acute accent following an alpha should be displayed above the alpha; a combining iota subscript following an alpha should be displayed below the alpha.

Combining diacriticals can be stacked; for instance, one can follow an alpha character with a smooth aspirate, a circumflex accent, and an iota subscript, each from the combining diacriticals set, and expect a properly displayed alpha with a smooth aspirate, circumflex accent, and iota subscript. Combining diacriticals should be entered in a normalized order: beginning with the diacritical closest to and above the character to that furthest from and above, followed by the diacritical closest to and below the character to that furthest from and below.
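The stacking example above can be reproduced in a few lines. The following is a sketch only, written in Python with the standard unicodedata module (an assumption of this example); the constant names are chosen here for readability and are not official identifiers.

```python
import unicodedata

ALPHA = "\u03b1"
PSILI = "\u0313"          # combining comma above: the smooth aspirate
PERISPOMENI = "\u0342"    # combining Greek perispomeni: the circumflex
YPOGEGRAMMENI = "\u0345"  # combining Greek ypogegrammeni: the iota subscript

# The order recommended above: marks above the letter first, then marks below.
recommended = ALPHA + PSILI + PERISPOMENI + YPOGEGRAMMENI

# The same marks with the iota subscript typed first.
subscript_first = ALPHA + YPOGEGRAMMENI + PSILI + PERISPOMENI

# Each mark carries a numeric combining class that normalization sorts by.
for mark in (PSILI, PERISPOMENI, YPOGEGRAMMENI):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# Normalization Form D moves the subscript after the marks above the letter.
print(unicodedata.normalize("NFD", subscript_first) == recommended)

# Normalization Form C then yields a single precomposed character.
composed = unicodedata.normalize("NFC", recommended)
print(f"U+{ord(composed):04X}", unicodedata.name(composed))
```

Normalization only reorders marks whose combining classes differ. The breathing and the accent share a class, so their relative order is left exactly as typed and must be entered correctly by the author.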

Different Unicode compliant fonts and applications provide different levels of support for combining diacriticals. For example, in most Linux distributions there is no support for placing combining diacriticals properly, and they are usually displayed (when they are displayed at all) as overstrikes, which (depending upon the design of the font) can be very difficult to read. The same situation applies in the browsers on all platforms. On the other hand, Microsoft Word for Windows 2000 can place combining diacriticals exactly where needed.

Precomposed Characters

The obvious solution to this issue was to provide glyphs which precompose character and diacritical combinations: in other words, have separate glyphs each for alpha with smooth aspirate, alpha with smooth aspirate and circumflex accent, and alpha with smooth aspirate and acute accent. These were added to the Unicode Standard in version 2.0, as the extended Greek character block of the Unicode Standard. Unfortunately, this has certain consequences: programmers of search engines and other character manipulation applications must convert precomposed characters to a more simplified normalization (the decomposed form) in order to properly search and manipulate text.
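As an illustration of that conversion, here is a minimal Python sketch of a search that ignores diacriticals by decomposing the text first. The function names are hypothetical, and this is one possible approach under stated assumptions, not a description of how any particular search engine works.

```python
import unicodedata

def strip_diacriticals(text):
    """Decompose to Normalization Form D, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_ignoring_diacriticals(needle, haystack):
    """Report whether needle occurs in haystack, ignoring diacriticals and case."""
    def fold(s):
        # casefold() also maps final sigma to medial sigma.
        return strip_diacriticals(s).casefold()
    return fold(needle) in fold(haystack)

# Line 2 of the sample passage, in precomposed characters.
line = "λέξαι θέλω σοι πρὶν θανεῖν ἃ βούλομαι."

# A reader searching without accents would miss the word in a naive search.
print("naive search:", "θανειν" in line)
print("decomposed search:", contains_ignoring_diacriticals("θανειν", line))
```

Without the decomposition step, the precomposed iota with circumflex is simply a different character from a plain iota, and the naive comparison fails.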

In the end, the World Wide Web Consortium (W3C) settled on Normalization Form C, which utilizes precomposed characters for polytonic Greek, as the recommended form of Unicode for World Wide Web documents (W3C recommendations are adhered to by all professional web publishers who are concerned with interoperability). Unfortunately, some websites publishing ancient Greek use tools which were programmed before this recommendation was promulgated, and use Normalization Form D for Unicode text. Other sites provide their content in either normalization form, in particular the Perseus Digital Library and other resources which utilize the Perseus toolkit, including the Bryn Mawr Classical Review (BMCR) and many Stoa Consortium resources (e.g., the Suda On Line).

For more details on the Normalization Forms, see the Normalization Forms page.

In order to read electronic texts utilizing Normalization Form D for Unicode polytonic Greek, you will need the following:

  1. An operating system that supports Unicode and the Unicode features of the font and the browser (Windows 95, 98, 98 Second Edition, NT 4.0, 2000, or XP; Macintosh OS X; Linux with XFree86 4.0; BeOS 5).
  2. A Unicode-enabled web browser that understands the Cascading Style Sheet language (Mozilla 0.9.6 or higher for Windows, OS X, or Linux, Netscape 6.2 for Windows or Linux, Netscape 4.7 for Windows, OmniWeb 4.0 for OS X, Konqueror for Linux with KDE 2, or NetPositive for BeOS 5).
  3. A Unicode font with support for polytonic Greek, specifically with support for combining diacriticals.

Reading Unicode Polytonic Greek

If you have one of the operating systems with support for Unicode (Windows 95, Windows 98, Windows NT, Windows 2000, Windows ME, and Windows XP, Mac OS X, Linux with XFree86 4.x, or BeOS 5), the next step is to download and install a Unicode-compatible font with support for basic Greek and either combining diacriticals (to read those electronic publications which utilize them, like the Perseus Digital Library, Bryn Mawr Classical Review, and the Suda On Line) or extended Greek precomposed characters (to read the texts on this web site, on the Thesaurus Linguae Graecae web site, the Perseus Digital Library, Bryn Mawr Classical Review, and other electronic publications which can utilize Normalization Form C), or both. To choose a font, see the section on Fonts With Support for Unicode Polytonic Greek, which provides details on the freeware or shareware fonts currently available.

Next, unless you have Windows 98, Windows ME, Windows 2000, or Windows XP, you should download a web browser with support for Unicode and the Cascading Style Sheet language. Usually this means Netscape 6.2 (Linux), Mozilla 0.9.6 or higher (OS X, Linux, BeOS), Konqueror (which comes with Linux distributions that include KDE 2, which Konqueror requires), another Mozilla-based browser (Galeon for Linux distributions with Gnome, Beonex for Linux), or OmniWeb 4.0 for Mac OS X. Internet Explorer 4, 5, and 6 for Windows (but not for Macintosh) support Unicode, and are preinstalled on Windows 98, Windows ME/2000, and Windows XP respectively. Then you merely need to configure your browser.

Specific hints for each platform are provided on the Quick Start Guide page. More detailed discussions are provided for each platform in the section on Platforms With Support for Unicode Polytonic Greek.

Writing Unicode Polytonic Greek

The purpose of Unicode Polytonic Greek for the World Wide Web, however, is not so much to explain how to read Unicode polytonic Greek as it is to explain how to use Unicode polytonic Greek to publish Greek texts on the World Wide Web. Toward this end, a comprehensive guide has been provided: how Unicode interacts with XML and Cascading Style Sheets, what text editing programs are available and how they can be used, and, for those who prefer to work with more user-friendly tools, how to use WYSIWYG ("what you see is what you get") web editors and Word Processors to create Unicode polytonic Greek texts.

Tools and Resources

Finally, a number of tools and resources have been provided to guide you in working with Unicode polytonic Greek: an annotated bibliography of both online and print resources, a sample Perl script that performs a conversion from Beta Code to UTF-8 in Normalization Form C, a group of Code Charts, some sample texts, a discussion of other Unicode ranges of interest to classicists, and a page of acknowledgments.
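To give a sense of what such a conversion involves, here is a deliberately small sketch in Python. It is not the Perl script mentioned above and does not reproduce its logic; the function name beta_to_unicode is hypothetical, and the sketch assumes lowercase words only (capitals marked with an asterisk, punctuation, and editorial symbols are passed through unconverted).

```python
import unicodedata

# Beta Code letters and their Greek equivalents (lowercase only in this sketch).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritical signs mapped to Unicode combining diacriticals.
MARKS = {
    ")": "\u0313",   # smooth aspirate
    "(": "\u0314",   # rough aspirate
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
    "=": "\u0342",   # circumflex accent
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}

def beta_to_unicode(beta):
    """Convert lowercase Beta Code to UTF-8-ready text in Normalization Form C."""
    text = beta.lower()
    out = []
    for i, ch in enumerate(text):
        if ch == "s":
            following = text[i + 1:i + 2]
            # Final sigma only at the true end of a word: a sigma followed by
            # an elision apostrophe (as in e)/xous') stays medial.
            medial = following in LETTERS or following in MARKS or following == "'"
            out.append("σ" if medial else "ς")
        elif ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in MARKS:
            out.append(MARKS[ch])
        else:
            out.append(ch)  # spaces, punctuation, anything unrecognized
    # Build with combining diacriticals, then compose to precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))

print(beta_to_unicode("a)/ndra"))
print(beta_to_unicode("yuxh=s"))
print(beta_to_unicode("e)/xous'"))
```

Building the text from combining diacriticals and normalizing once at the end keeps the mapping tables short: no table of precomposed characters is needed, because Normalization Form C supplies them.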


Euripides' Alcestis in Unicode Polytonic Greek, Utilizing XHTML and CSS


If this is not readable, visit the Stoa Consortium website's configuration page; select Unicode under Greek Display, then click the set configuration button at the bottom of the page. Then return to this page. If you have followed the instructions provided, this should appear in Unicode.

Ἄλκηστις

Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει,
λέξαι θέλω σοι πρὶν θανεῖν ἃ βούλομαι.
ἐγώ σε πρεσβεύουσα κἀντὶ τῆς ἐμῆς
ψυχῆς καταστήσασα φῶς τόδ' εἰσορᾶν
θνῄσκω, παρόν μοι μὴ θανεῖν ὑπὲρ σέθεν,
ἀλλ' ἄνδρα τε σχεῖν Θεσσαλῶν ὃν ἤθελον
καὶ δῶμα ναίειν ὄλβιον τυραννίδι.
κοὐκ ἠθέλησα ζῆν ἀποσπασθεῖσα σοῦ
σὺν παισὶν ὀρφανοῖσιν, οὐδ' ἐφεισάμην
ἥβης, ἔχουσ' ἐν οἷς ἐτερπόμην ἐγώ.
καίτοι σ' ὁ φύσας χἠ τεκοῦσα προύδοσαν,
καλῶς μὲν αὐτοῖς κατθανεῖν ἧκον βίου,
καλῶς δὲ σῶσαι παῖδα κεὐκλεῶς θανεῖν.
μόνος γὰρ αὐτοῖς ἦσθα, κοὔτις ἐλπὶς ἦν
σοῦ κατθανόντος ἄλλα φιτύσειν τέκνα.
κἀγώ τ' ἂν ἔζων καὶ σὺ τὸν λοιπὸν χρόνον,
κοὐκ ἂν μονωθεὶς σῆς δάμαρτος ἔστενες
καὶ παῖδας ὠρφάνευες. ἀλλὰ ταῦτα μὲν
θεῶν τις ἐξέπραξεν ὥσθ' οὕτως ἔχειν.
εἶεν: σύ νύν μοι τῶνδ' ἀπόμνησαι χάριν:
αἰτήσομαι γάρ σ' ἀξίαν μὲν οὔποτε
ψυχῆς γὰρ οὐδέν ἐστι τιμιώτερον,
δίκαια δ', ὡς φήσεις σύ: τούσδε γὰρ φιλεῖς
οὐχ ἧσσον ἢ γὼ παῖδας, εἴπερ εὖ φρονεῖς:
τούτους ἀνάσχου δεσπότας ἐμῶν δόμων
καὶ μὴ πιγήμῃς τοῖσδε μητρυιὰν τέκνοις,
ἥτις κακίων οὖσ' ἐμοῦ γυνὴ φθόνῳ
τοῖς σοῖσι κἀμοῖς παισὶ χεῖρα προσβαλεῖ.
μὴ δῆτα δράσῃς ταῦτά γ', αἰτοῦμαί σ' ἐγώ.
ἐχθρὰ γὰρ ἡ πιοῦσα μητρυιὰ τέκνοις
τοῖς πρόσθ', ἐχίδνης οὐδὲν ἠπιωτέρα.
καὶ παῖς μὲν ἄρσην πατέρ' ἔχει πύργον μέγαν
[ὃν καὶ προσεῖπε καὶ προσερρήθη πάλιν]:
σὺ δ', ὦ τέκνον μοι, πῶς κορευθήσῃ καλῶς;
ποίας τυχοῦσα συζύγου τῷ σῷ πατρί;
μή σοί τιν' αἰσχρὰν προσβαλοῦσα κληδόνα
ἥβης ἐν ἀκμῇ σοὺς διαφθείρῃ γάμους.
οὐ γάρ σε μήτηρ οὔτε νυμφεύσει ποτὲ
οὔτ' ἐν τόκοισι σοῖσι θαρσυνεῖ, τέκνον,
παροῦσ', ἵν' οὐδὲν μητρὸς εὐμενέστερον.
δεῖ γὰρ θανεῖν με: καὶ τόδ' οὐκ ἐς αὔριον
οὐδ' ἐς τρίτην μοι μηνὸς ἔρχεται κακόν,
ἀλλ' αὐτίκ' ἐν τοῖς οὐκέτ' οὖσι λέξομαι.
χαίροντες εὐφραίνοισθε: καὶ σοὶ μέν, πόσι,
γυναῖκ' ἀρίστην ἔστι κομπάσαι λαβεῖν,
ὑμῖν δέ, παῖδες, μητρὸς ἐκπεφυκέναι.



Disclaimer

The author of this site makes no guarantee or warranty that the instructions provided will work and will not damage your computer. They worked for him. Anything that happens is your own fault.

 Unicode Polytonic Greek for the World Wide Web Version 0.9.7
 Copyright © 1998-2002 Patrick Rourke. All rights reserved.
D R A F T - Under Development
 Please do not treat this as a published work until it is finished!