Unicode Polytonic Greek for the World Wide Web

D R A F T

Why Unicode?

Unicode is a universal standard maintained by the International Organization for Standardization (ISO) and the international Unicode Consortium, and it has been adopted by the international World Wide Web Consortium as the standard method of encoding text for World Wide Web documents. Heretofore most ISO standards have had useful lives measured in decades; for instance, the ISO standard for text markup, SGML, was adopted in 1986 and is today (in the forms of XML and HTML) the most widely used method of representing rich text documents in electronic form.

Unicode is a universal standard for character encoding that permits millions of separate characters to be referenced: enough for all the alphabets, syllabaries, logographic and mixed scripts used by modern readers as well as a large number of ancient scripts. Where the original ASCII encoding and its 8-bit extensions use only one byte for each character, allowing at most 256 possible characters, Unicode uses (depending upon the encoding) anywhere from one to six bytes for each character, theoretically allowing 2^21 possible characters - that's well over one million characters. The current version of the Unicode standard, Unicode 3.1.1, defines 102,655 characters, including all the characters for the Latin, Greek, Cyrillic, Coptic, Hebrew, Devanagari, Hangul, Hiragana and Katakana, and Tamil writing systems, and the so-called CJK Unified Ideographs sets, which define tens of thousands of ideographs. These can be represented graphically as hanzi, kanji, hanja, and chu Han ideographs in the Chinese, Japanese, Korean, and Vietnamese scripts, respectively; they were unified from a much larger number of different ideograph forms along etymological principles (and can be displayed in different forms by using different fonts for each language).

(The number of characters can be increased further through the use of different fonts, etc., but these are issues that are not appropriate for this page; and it should be understood that throughout I am simplifying issues, e.g., by pretending that there is no distinction between the UTF-8, UTF-16, and UTF-32 encodings.)

The previous version of the Unicode standard, Unicode 3.0, defined 49,194 of those characters, including all the characters for the Latin, Greek, Cyrillic, Coptic, Hebrew, Devanagari, Hangul, Hiragana and Katakana, and Tamil writing systems, and the so-called CJK Unified Ideographs set. That set defines 27,786 ideographs which can be represented graphically as hanzi, kanji, hanja, and chu Han ideographs in the Chinese, Japanese, Korean, and Vietnamese scripts, respectively, and which were unified from 120,000 different ideograph forms along etymological principles (and can be displayed in different forms by using different fonts for each language).
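The byte counts mentioned above depend on which Unicode encoding is chosen, a distinction this page otherwise glosses over. The following is a minimal sketch, assuming a Python 3 interpreter; the sample characters and the helper name `encoded_sizes` are illustrative choices, not anything defined by the Unicode standard.

```python
# Sample characters, chosen for illustration only.
samples = {
    "LATIN SMALL LETTER A": "a",
    "GREEK SMALL LETTER ALPHA": "\u03b1",
    "GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA": "\u1f04",
    "CJK UNIFIED IDEOGRAPH-4E2D": "\u4e2d",
}

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    # The little-endian codec names are used so that no byte order mark
    # is included in the count.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for name, ch in samples.items():
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {name}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

In UTF-8 the Latin letter takes one byte, the basic Greek letter two, and the polytonic and CJK characters three each, while UTF-32 spends four bytes on every character.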

Unicode is a Universal Standard

All other methods of representing polytonic Greek characters in fonts are proprietary: the character mappings are peculiar to a specific input method editor/keyboard utility (such as GreekKeys, WinGreek, and the WordPerfect character set) and cannot be easily converted to another such encoding (except with the help of Sean Redmond's extremely useful Greek font converter, which does convert GreekKeys, WinGreek, and beta code to Unicode, among others). Of these, GreekKeys seems to be the most popular among classical scholars; however, it is no longer supported for use in Windows, the author (Jeffrey Rusten) having decided that users should transition to the use of Unicode.

The Unicode standard, on the other hand, is an accepted ISO/ANSI standard; it is not proprietary. As such, different input utilities can use different fonts and keyboard scripts, but as long as these utilities share the same encoding, the Unicode encoding, users can be assured that texts will convert between them (as long as the vendors are consistent in their application of Unicode).

The methods used by GreekKeys fonts, etc. are proprietary font encodings: each imposes a different set of glyphs on the ISO 8859-1 encoding of the Latin characters. The computer is told that the font is an ISO 8859-1 font. It "sees" a but displays α (alpha). On the other hand, there is the ISO 8859-7 encoding for modern Greek; when this encoding is used, the computer "sees" α. However, ISO 8859-7 doesn't provide for polytonic Greek, and there are other advantages to the use of a single encoding (e.g., the UTF-8 encoding, which is one of the Unicode-compliant encodings) for multiple scripts.
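The difference between a font encoding and a character encoding can be seen in the stored bytes themselves. This is only a sketch, assuming Python 3 and its built-in `iso8859_1`, `iso8859_7`, and `utf-8` codecs; the helper `can_encode` is a hypothetical name introduced for the example.

```python
alpha = "\u03b1"       # GREEK SMALL LETTER ALPHA
polytonic = "\u1f04"   # alpha with psili and oxia

# Font-encoding approach: the file stores the Latin letter "a" (byte 0x61),
# and only the installed font makes it look like an alpha.
stored_by_font_hack = "a".encode("iso8859_1")

# Character-encoding approach: the stored bytes identify alpha itself.
stored_iso_8859_7 = alpha.encode("iso8859_7")
stored_utf8 = alpha.encode("utf-8")

def can_encode(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(stored_by_font_hack, stored_iso_8859_7, stored_utf8)
print(can_encode(polytonic, "iso8859_7"), can_encode(polytonic, "utf-8"))
```

Without the special font, the font-encoded text is read back as plain Latin letters, whereas the ISO 8859-7 and UTF-8 bytes remain Greek whichever font is used; only UTF-8 can also carry the polytonic character.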

Unicode is Platform-Independent

Unicode is entirely platform independent: Unicode text can be read on Macintosh computers with OS X and either the OmniWeb (version 4.0 or higher) or Mozilla (version 0.9.6 or higher) web browsers; on Windows computers with the Netscape (version 4.5 or higher), Mozilla (versions M14-M18 and 0.6 or higher), or Internet Explorer (version 4.0 or higher) web browsers; on Linux computers with XFree86 (version 4.0 or higher) and the Konqueror (version 1.0 or higher), Netscape (version 6.0 or higher), or Mozilla (versions M16-M18 and 0.6 or higher) web browsers; and on several other computing platforms. The only widely used platform which is excluded by the use of Unicode is Macintosh OS versions 8.6 through 9.2.1, which is also excluded by many of the proprietary encodings.

Unicode Includes All the Ranges of Characters of Interest to Classicists, or Soon Will

  • Unicode includes ranges for basic Greek and Coptic, extended Greek characters, and combining diacriticals, which together allow for the representation of all character and diacritical combinations in the polytonic classical Greek writing system.
  • Unicode also includes two Aegean scripts (Cypriot and Linear B), Etruscan, and Byzantine musical symbols, and will likely be expanded in the future to represent other writing systems and symbol repertoires of importance to classicists.
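To make the ranges named above concrete, here is a small sketch that reports which block each character of a polytonic word falls in. It assumes Python 3; the block boundaries and names are those of the current Unicode code charts, and `block_of` and `blocks_used` are hypothetical helper names.

```python
# Block boundaries as listed in the current Unicode code charts.
BLOCKS = [
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1F00, 0x1FFF, "Greek Extended"),
]

def block_of(ch):
    """Return the name of the block containing ch, or "other"."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"

def blocks_used(text):
    """Return the sorted set of block names used by text."""
    return sorted({block_of(ch) for ch in text})

# The word anthropos with its first letter written two ways:
# as one precomposed character, and as alpha plus two combining marks.
precomposed = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"
decomposed = "\u03b1\u0313\u0301\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

print(blocks_used(precomposed))
print(blocks_used(decomposed))
```

The same word draws either on the extended Greek range or on the combining diacriticals, depending on how its accented letter is written.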
Other Font Encodings

    The most widely used method of storing Greek text on the web is beta code, which was developed for use in the TLG database of Greek literature. The good thing about beta code is that it takes up barely more memory than an equivalent text in a modern Latin-script language. The bad thing is that beta code isn't really Greek, but a transliteration modified with non-alphabetic characters to preserve diacritical information and with unused alphabetic characters to preserve the distinctions between long and short vowels that are made in the Greek alphabet; it's also downright ugly and hard to read. I was originally an opponent of the use of beta code for e-mail and ASCII text display when I first encountered it in the very late eighties or very early nineties (traditionally, typesetters without access to a Greek font have simply transliterated Greek in italic Latin characters), but now I have to admit that it's a good temporary standard for data storage and is as universal as I believe Unicode will become. I'd still argue that it's not something one should force anyone to read.

    Fortunately, there has been some success in writing utilities to convert beta code in a database into a higher-level encoding for display purposes; the most interesting are Sean Redmond's page linked above and the Perseus Project's display scripts. The Suda On Line also uses a variation of the Perseus system, and the TLG itself is working on a similar script. However, while Sean Redmond's page uses the precomposed characters (he doesn't say so, but I suspect he is using Normalization Form C), the Perseus script uses combining diacriticals, which can be a problem with the Athena and Palatino Linotype fonts.
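The difference between precomposed characters and combining diacriticals can be shown with Python's standard `unicodedata` module. This is a minimal sketch of the two normalization forms in general, not a description of what either converter actually does internally.

```python
import unicodedata

precomposed = "\u1f04"             # alpha with psili and oxia, one code point
combining = "\u03b1\u0313\u0301"   # alpha + combining comma above + combining acute

# The two spellings look the same on screen but compare unequal as stored.
print(precomposed == combining)

# Normalization Form C composes; Normalization Form D decomposes.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfd])
```

Normalizing both texts to the same form before comparing or searching removes the mismatch, and a page author who finds that a font handles combining marks poorly can apply Form C before publishing.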

    Beta code can be considered a proprietary character encoding without a font, although Ralph Hancock's Betaread font does provide what little font support for beta code is feasible. I've done a small page called the "betapal" palette that gives a reduced version of the beta code encoding for ease of use; it can currently be found linked from the Suda On Line page at http://www.stoa.org/sol/betapal.shtml. A graphic character map provided by Perseus can also be seen on that page.
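To give a feel for what a beta code conversion involves, here is a deliberately small sketch. It is not the Perseus, TLG, or Sean Redmond converter; it assumes only the basic beta code conventions (Latin letters for Greek letters, `)` `(` `/` `\` `=` `|` `+` for diacritics, `*` before a capital) and ignores punctuation and editorial symbols. The function name `beta_to_unicode` is hypothetical.

```python
import re
import unicodedata

# Basic letter correspondences of beta code (lowercase form).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw",
                   "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta code diacritic signs mapped to Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}

def beta_to_unicode(beta):
    """Convert a simple beta code string to precomposed Unicode Greek."""
    out, pending, upper_next = [], [], False
    for ch in beta.lower():
        if ch == "*":
            upper_next = True
        elif ch in MARKS:
            # After "*", diacritics precede the letter they belong to.
            (pending if upper_next else out).append(MARKS[ch])
        elif ch in LETTERS:
            letter = LETTERS[ch].upper() if upper_next else LETTERS[ch]
            upper_next = False
            out.append(letter)
            out.extend(pending)
            pending.clear()
        else:
            out.append(ch)
    text = "".join(out)
    # A sigma not followed by another letter is written as final sigma.
    text = re.sub(r"σ(?!\w)", "ς", text)
    return unicodedata.normalize("NFC", text)

print(beta_to_unicode("mh=nin a)/eide qea/"))
print(beta_to_unicode("*)axilh=os"))
```

The sketch builds the text with combining diacriticals first and then composes it with Normalization Form C; leaving out that final step would yield combining-mark output instead.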

    The SGreek font encoding is a modification of beta code which allows distinction between upper case and lower case characters without the use of an additional byte.

    SMK GreekKeys, WinGreek, the WordPerfect character sets and other Greek keyboard/font utilities are useful for creating printed documents, and they can be adapted to the web - but they are all proprietary to one degree or another, and so less platform independent. In order to read a web page with Greek in GreekKeys, WinGreek, or WordPerfect Greek, the reader must have the appropriate font installed. For WinGreek, this means one must have WinGreek; for WordPerfect, one must have WordPerfect, and for the GreekKeys encoding, one must have GreekKeys or have downloaded the freely-available Athenian font provided for users of Perseus. As more Unicode fonts become available, the choice in fonts will expand dramatically for Unicode users.

    Unicode is the future; eventually, the makers of Greek input methods will find it necessary to adopt Unicode as their encoding or fall behind. With the freeware Athena font and the widely distributed (but not really free, since one must buy one of the Office 2000 products or Windows 2000, respectively) Arial Unicode MS and Palatino Linotype sharing one standardized non-proprietary encoding, web authors can probably best serve their readers by providing content in the Unicode encoding. They will also serve themselves: Unicode pages will not have to be converted to a future encoding, while almost certainly GreekKeys and WinGreek will eventually have to be converted to Unicode to avoid obsolescence.

    Why It's a Bad Idea to Use Graphic Representations of Greek Instead of Greek Text Characters

    Of course, one way to get Greek text on the web is to type it into some program that can easily save a graphic file: for example, PowerPoint can easily output a jpeg graphic of any slide typed in PowerPoint. There are several problems with this method, though.

    1. Graphic files are much larger than text and so take much longer to download. I usually remind the students in my Internet classes that pictures are worth a thousand words; the file sizes really are that different. A full-screen image (bitmapped with a 32-bit color depth and no compression) of an 800 × 600 pixel screen takes up 1,920,000 bytes - 1.83 megabytes! By comparison, this page is less than 80,000 bytes, and yet on my 1024 × 768 pixel display it fills the screen almost 17 times. Even if everything on this page were in ancient Greek, doubling the file size, it would still be (very approximately) 200 times smaller than a comparable graphic file.
    2. Graphics are not searchable in the way that text is - when Unicode becomes more universal, most search engines will have the capability to search Greek text in Unicode documents! This will require some kind of standardized input method editor, something analogous to GreekKeys. Right now, Ralph Hancock has already created Antioch for Word 97/2000, a $50 software package which can be obtained at the Antioch web site, but which doesn't work outside Word 97/2000, on Macintosh, Windows NT, or Linux, or apparently with native bidi systems; I haven't tried this software, but Ralph does know his fonts. There is also a Visual Basic for Applications script designed to work in Word 97, written by (available at http://members.aol.com/AtticGreek/); I haven't tried it and know little about it. There are also Unicode text editors (on the Text Editor page, I've described SC UniPad).
    3. Text fonts are scalable; web graphics are not. In text documents the fonts (well, TrueType and OpenType fonts) are scalable to the resolution of the printer, while web graphics, which are almost universally raster (bitmapped) graphics, are not scalable, and usually look terrible when printed on high-resolution printers (and sometimes on low-resolution printers). For a masthead in a document without Greek text, or for a number of other limited purposes, graphics make sense; but ultimately they are far less effective as representations of text than Greek fonts used in HTML encodings.
    4. The use of graphics complicates accessibility issues for users with visual disabilities.
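The file-size comparison in point 1 can be checked with a few lines of arithmetic. This sketch uses only the figures quoted there and assumes, as the estimate appears to, that the 17 screens counted on a 1024 × 768 display are each priced as an 800 × 600 bitmap; the function name `uncompressed_bitmap_bytes` is hypothetical.

```python
def uncompressed_bitmap_bytes(width, height, bits_per_pixel=32):
    """Size in bytes of an uncompressed bitmap of the given dimensions."""
    return width * height * bits_per_pixel // 8

screen_bytes = uncompressed_bitmap_bytes(800, 600)
screen_mib = screen_bytes / 2**20

page_bytes = 80_000                 # upper bound quoted for this page
greek_page_bytes = 2 * page_bytes   # the doubling assumed for all-Greek text
screens = 17                        # screens the page is said to fill

ratio = screens * screen_bytes / greek_page_bytes

print(screen_bytes, round(screen_mib, 2), ratio)
```

Under these assumptions the graphic version comes out at roughly 200 times the size of the text, which agrees with the very approximate figure given in point 1.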


     Unicode Polytonic Greek for the World Wide Web Version 0.9.7
     Copyright © 1998-2002 Patrick Rourke. All rights reserved.
    D R A F T - Under Development
     Please do not treat this as a published work until it is finished!