Unicode Polytonic Greek for the Web

Unicode Polytonic Greek
for the World Wide Web

Version 0.9.1

D R A F T

Unicode Encodings

UTF-8

UTF-8 is the most common Unicode encoding used on the web, and the one I have recommended for use in these pages. UTF-8 represents each ASCII character with one byte, exactly equivalent to the ASCII value for that character. UTF-8 uses two bytes to represent several common scripts, including most non-ASCII Latin characters, and the Cyrillic, Greek and Coptic, Arabic, Hebrew, Syriac, Armenian, and Thaana scripts, as well as combining diacriticals: the leading byte indicates that the character uses a two-byte code and carries the high-order bits of the character's value (in effect, which range of characters is relevant), while the second byte carries the low-order bits that identify the specific character within that range.

UTF-8 uses three bytes to represent the remaining 63,488 code positions in the so-called Basic Multilingual Plane, including (among very many others) the extended Greek precomposed characters as well as the main Indic scripts (Devanagari, Tamil, Malayalam, Telugu, etc.), Georgian, Mongolian, Tibetan, katakana and hiragana, Hangul, and the basic set of CJK unified ideographs, which represent the ideographic character sets of Chinese, Japanese, Korean, and Vietnamese as written before the adoption of the Latin alphabet. The leading byte indicates that the character uses a three-byte code and carries the highest-order bits of its value, while the second and third bytes carry the remaining bits that identify the specific character.

Finally, UTF-8 uses four bytes to represent the characters in the planes above the Basic Multilingual Plane, including the Aegean scripts, Etruscan, and any future additions to Unicode of relevance to classicists, as well as additional CJK ideographs and a number of other scripts. As you can guess, the leading byte indicates a four-byte character and carries the highest-order bits of its value, while the remaining three bytes carry the rest of the bits that identify the character.
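The four length classes described above can be checked directly. The following is a minimal sketch in Python using its built-in UTF-8 codec; the sample characters and the helper name utf8_length are chosen here for illustration and do not come from the original pages.

```python
# One sample character for each UTF-8 length class described above.
samples = {
    "A": "LATIN CAPITAL LETTER A",                      # ASCII, U+0041
    "\u03b1": "GREEK SMALL LETTER ALPHA",               # basic Greek, U+03B1
    "\u1f00": "GREEK SMALL LETTER ALPHA WITH PSILI",    # extended Greek, U+1F00
    "\U00010300": "OLD ITALIC LETTER A",                # Etruscan (Old Italic), U+10300
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s)")
```

The loop prints one line per sample, with the byte count growing from one for ASCII to four for the character outside the Basic Multilingual Plane.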

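For classicists, this means that basic Greek characters and combining diacriticals require two bytes each (which means that e.g. an alpha with an iota subscript, a smooth breathing and a circumflex accent, encoded as a base letter followed by three combining marks, will require a total of eight bytes), while a precomposed extended Greek character requires three bytes.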
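The byte counts for that alpha can be verified with a short sketch. It assumes Python's standard unicodedata module and uses NFC normalization to obtain the precomposed form; the variable names are illustrative only.

```python
import unicodedata

# Alpha followed by three combining marks:
# U+03B1 alpha, U+0313 smooth breathing (psili),
# U+0342 circumflex (perispomeni), U+0345 iota subscript (ypogegrammeni).
decomposed = "\u03b1\u0313\u0342\u0345"

# NFC normalization composes the sequence into a single precomposed
# character from the extended Greek range.
precomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), "code points,", len(decomposed.encode("utf-8")), "bytes")
print(len(precomposed), "code point,", len(precomposed.encode("utf-8")), "bytes")
print(f"precomposed form: U+{ord(precomposed):04X}")
```

The two strings should display identically in a font that supports both forms; only the storage differs.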

UTF-16

UTF-16 is an encoding which represents all Basic Multilingual Plane characters (including basic Greek, extended Greek, and combining diacriticals) with two bytes and all other characters (including Aegean scripts, Etruscan, and any future scripts of relevance to classicists) with four bytes. The four-byte combinations use surrogates: a two-byte high surrogate (in the range D800 to DBFF hexadecimal) followed by a two-byte low surrogate (in the range DC00 to DFFF hexadecimal). Together the pair indicates that the intended character is outside the Basic Multilingual Plane, in which plane it is located, and which character in that plane is meant.

UTF-16 is a superset of UCS-2. For Latin, basic Greek, extended Greek, and combining diacriticals, UTF-16 is exactly equivalent to UCS-2. Note that all four ranges are within the Basic Multilingual Plane, and so can be represented in UTF-16 with only two bytes. Aegean scripts, Etruscan, and any future ranges of interest to classicists will be assigned outside the Basic Multilingual Plane, and so will require 4 bytes for their representation.
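The surrogate mechanism can be sketched as a small calculation. This is an illustrative Python version, not code from the original pages; the function name to_surrogates is hypothetical, and the result is compared against Python's own big-endian UTF-16 codec.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above the Basic Multilingual Plane into a
    (high, low) pair of UTF-16 surrogate values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not above the Basic Multilingual Plane")
    offset = cp - 0x10000              # 20 bits remain after removing 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


# U+10300 is OLD ITALIC LETTER A, used here as an Etruscan sample.
high, low = to_surrogates(0x10300)
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in codec (big-endian, no byte order mark).
encoded = "\U00010300".encode("utf-16-be")
print(encoded.hex(" ").upper())
```

A Basic Multilingual Plane character such as U+03B1 never reaches this function: it is stored directly as the single two-byte value 03B1.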

Technical Details of the UTF-8 and UTF-16 Encodings

Table 1. Mapping of Unicode Scalar Values to UTF-16 and UTF-8 Byte Values: Binary Representation

Unicode Scalar Value       | UTF-16 Value                        | UTF-8 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
00000000 0xxxxxxx          | 00000000 0xxxxxxx                   | 0xxxxxxx       |          |          |
00000yyy yyxxxxxx          | 00000yyy yyxxxxxx                   | 110yyyyy       | 10xxxxxx |          |
zzzzyyyy yyxxxxxx          | zzzzyyyy yyxxxxxx                   | 1110zzzz       | 10yyyyyy | 10xxxxxx |
000uuuuu zzzzyyyy yyxxxxxx | 110110wwwwzzzzyy + 110111yyyyxxxxxx | 11110uuu*      | 10uuzzzz | 10yyyyyy | 10xxxxxx

Note that the bits marked x in the first column map to the bits marked x in the other columns, and likewise for the bits marked y, z, and u.
* uuuuu = wwww + 1 (to account for the addition of 10000₁₆ for surrogates; see The Unicode Standard 3.0, Section 3.7, "Surrogates").
Table reproduced from The Unicode Standard 3.0
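The bit layouts in Table 1 translate directly into shifts and masks. The following Python function is a minimal sketch of that translation, written for illustration (the name encode_utf8 is hypothetical); in practice one would simply call the built-in codec, which is used here as a cross-check.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value by following the UTF-8 bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare with Python's own codec for one character per row of the table.
for cp in (0x41, 0x3B1, 0x1F00, 0x10300):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ').upper()}")
```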


Table 2. Ranges of Unicode Scalar Values as Represented in UTF-16 and UTF-8 Byte Values: Decimal Values

Unicode Scalar Value | UTF-16 Value                  | UTF-8 1st Byte | 2nd Byte  | 3rd Byte  | 4th Byte
0 - 127              | 0 - 127                       | 0 - 127        | None      | None      | None
128 - 2047           | 128 - 2047                    | 194 - 223      | 128 - 191 | None      | None
2048 - 65535         | 2048 - 65535                  | 224 - 239      | 128 - 191 | 128 - 191 | None
65536 - 1114111      | 55296 - 56319 & 56320 - 57343 | 240 - 244      | 128 - 191 | 128 - 191 | 128 - 191

Unicode scalar values end at 1114111, the highest value that a UTF-16 surrogate pair can express.

Table 3. Ranges of Unicode Scalar Values as Represented in UTF-16 and UTF-8 Byte Values: Hexadecimal Values

Unicode Scalar Value | UTF-16 Value                  | UTF-8 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
00 - 7F              | 00 - 7F                       | 00 - 7F        | None     | None     | None
80 - 07 FF           | 80 - 07 FF                    | C2 - DF        | 80 - BF  | None     | None
08 00 - FF FF        | 08 00 - FF FF                 | E0 - EF        | 80 - BF  | 80 - BF  | None
01 00 00 - 10 FF FF  | D8 00 - DB FF & DC 00 - DF FF | F0 - F4        | 80 - BF  | 80 - BF  | 80 - BF

Unicode scalar values end at 10 FF FF, the highest value that a UTF-16 surrogate pair can express.
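Read in the other direction, the first byte of a UTF-8 sequence tells a reader how many bytes belong to the character. The sketch below illustrates this in Python; the function names sequence_length and split_sequences are hypothetical. It tests only the leading bit pattern, so it is more permissive than a validating decoder: it does not check the continuation bytes or reject lead bytes that never occur in well-formed UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with `lead` occupies."""
    if (lead & 0x80) == 0x00:    # 0xxxxxxx: single byte (ASCII)
        return 1
    if (lead & 0xE0) == 0xC0:    # 110xxxxx: two-byte sequence
        return 2
    if (lead & 0xF0) == 0xE0:    # 1110xxxx: three-byte sequence
        return 3
    if (lead & 0xF8) == 0xF0:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a leading byte (continuation or invalid byte)")


def split_sequences(data: bytes) -> list:
    """Cut a UTF-8 byte string into one bytes object per character."""
    parts = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        parts.append(data[i:i + n])
        i += n
    return parts


# The word "logos" in basic Greek letters: every character takes two bytes.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2".encode("utf-8")
for part in split_sequences(word):
    print(part.hex(" ").upper(), "->", part.decode("utf-8"))
```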

 

UCS-2

UCS-2 is an encoding which represents only the Basic Multilingual Plane characters (including basic Greek, extended Greek, and combining diacriticals) using two bytes. For the characters which UCS-2 will represent, it is byte-for-byte equivalent to UTF-16; but UCS-2 cannot be used to represent the characters above the Basic Multilingual Plane.

In other words, UCS-2 is a subset of UTF-16.
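The subset relationship can be illustrated with a short sketch. Python has no built-in UCS-2 codec, so the function encode_ucs2_be below is a hypothetical stand-in written for this example; it assumes big-endian byte order and makes no attempt to handle unpaired surrogate values.

```python
def encode_ucs2_be(text: str) -> bytes:
    """Encode text as big-endian UCS-2: exactly two bytes per character,
    refusing anything above the Basic Multilingual Plane."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} cannot be represented in UCS-2")
        out += cp.to_bytes(2, "big")
    return bytes(out)


# Basic and extended Greek: UCS-2 and UTF-16 produce identical bytes.
greek = "\u03b1\u1f86"
assert encode_ucs2_be(greek) == greek.encode("utf-16-be")
print(encode_ucs2_be(greek).hex(" ").upper())

# A character above the Basic Multilingual Plane: UTF-16 needs four bytes,
# and UCS-2 has no representation for it at all.
etruscan = "\U00010300"
print(len(etruscan.encode("utf-16-be")), "bytes in UTF-16")
try:
    encode_ucs2_be(etruscan)
except ValueError as err:
    print("UCS-2:", err)
```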

The UCS-2 encoding is part of the ISO 10646-1 standard, an International Organization for Standardization (ISO) encoding standard that is exactly equivalent to Unicode 2.0. One can effectively use ISO 10646 and Unicode interchangeably, for they are the work of two different standards bodies cooperating to create and maintain an identical standard.

Windows NT uses UCS-2 as its native encoding standard for all text; Windows 2000 and Windows XP use UTF-16, which is byte-for-byte equivalent to UCS-2 for all Basic Multilingual Plane characters.

UTF-32

UCS-4

Unicode Entities in Other Encodings

The codes you will see in the source code (reproduced on an attached page) represent Greek polytonic glyphs (letters and letter/diacritical combinations). Each line of the source code represents one Greek word; each numeric "entity" represents a single Greek character with or without diacritical marks.
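As an illustration of how such numeric entities relate to the characters they stand for, here is a minimal Python sketch. The function name to_numeric_entities and the sample text are assumptions made for this example; the attached source page may have been produced by other means.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity,
    leaving ASCII characters untouched."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# Alpha with smooth breathing and acute accent (U+1F04), then nu (U+03BD).
sample = "\u1f04\u03bd"
print(to_numeric_entities(sample))

# Python's codec error handler "xmlcharrefreplace" gives the same result
# when encoding to an encoding that lacks the characters, such as ASCII.
print(sample.encode("ascii", "xmlcharrefreplace").decode("ascii"))
```

Because the entities consist only of ASCII characters, text written this way survives in a page served in any of the non-Unicode encodings, provided the browser resolves the entities to Unicode characters.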

Non-Unicode Encodings

ISO-8859-1 (Latin)

Microsoft Reference

ISO-8859-7 (Greek)

Microsoft Reference

Windows Code Page 1252 (Latin)

Microsoft Reference

Windows Code Page 1253 (Greek)

Microsoft Reference

MacRoman (Latin)

No character map is available for MacRoman on the Microsoft, Apple, or Unicode websites; however, Alan Wood does provide a character map at his web site.

GreekKeys for MS-DOS/Windows (APA)

GreekKeys for Macintosh (APA)

Beta Code (TLG)

SGreek (Silver Mountain Software)

SPIonic

LaserGreek

Porson Greek

WinGreek and Son of WinGreek


 Unicode Polytonic Greek for the World Wide Web Version 0.9.7
 Copyright © 1998-2002 Patrick Rourke. All rights reserved.
D R A F T - Under Development
 Please do not treat this as a published work until it is finished!