6.3. Unicode Transformation Formats

As we've mentioned, with hundreds of thousands of characters in a character repertoire now, it's no longer possible to fit one character into one byte. We've introduced the concept of UTF-16, the native character encoding for Unicode, but there are several other standard encodings. Those starting with UTF are defined by the Unicode Standard or associated Unicode Technical Reports; the two UCS encodings are defined by the ISO 10646 standard.

6.3.1. UCS-2

UCS-2 is the two-byte ISO 10646 encoding. Recall that ISO defines the UCS in terms of groups and planes, where planes consist of 256 rows and 256 columns. In UCS-2, the first byte encodes the row, and the second encodes the column. Hence, UCS-2 can only encode the 65,536 characters in the Basic Multilingual Plane; furthermore, ISO does not recognize the surrogate pair extension mechanism, so UCS-2 cannot be used to access any characters outside the BMP.
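The row and column split can be sketched in a few lines of Perl. The helper name ucs2_bytes is hypothetical, chosen only for this illustration; it assumes the codepoint is given as a plain integer.

```perl
use strict;
use warnings;

# Hypothetical helper: split a BMP codepoint into its UCS-2 row and column bytes.
sub ucs2_bytes {
    my ($cp) = @_;
    die sprintf("U+%04X is outside the BMP\n", $cp) if $cp > 0xFFFF;
    my $row = $cp >> 8;      # first byte: the row
    my $col = $cp & 0xFF;    # second byte: the column
    return pack("CC", $row, $col);
}

my $bytes = ucs2_bytes(0x0169);
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $bytes)), "\n";   # 01 69
```

The result is the same as pack("n", $cp), a 16-bit big-endian integer, which is one way to see why a codepoint above U+FFFF simply has nowhere to go.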

6.3.2. UTF-8

Formerly known as File System Safe UCS Transformation Format, UTF-8 is the Unicode encoding supported natively by Perl. It is an integral part of the Unicode Standard and is recognized by the ISO standard.

UTF-8 is a variable-width encoding, using between one and four bytes per character; this is regarded as a compromise, as you may remember that one of the Unicode design goals was that encodings should be fixed width.

One redeeming feature of UTF-8, however, is that it is a superset of seven-bit ASCII. That is, data that is purely seven-bit ASCII (containing no bytes 128 or above) is valid UTF-8. Additionally, UTF-8 encodes codepoints 128 and above using only bytes 128 and above, so that the bytes 0 to 127 in a UTF-8 encoded string only ever correspond to the codepoints 0 to 127 (the ASCII characters). This means that an application that gives special meaning to some ASCII characters but is unaware of UTF-8, such as a filesystem that allows bytes 0 to 255 in filenames and treats "/" as a directory separator, cannot be confused or tricked by UTF-8 data.
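A small check of this property, using the core Encode module. The path and variable names are made up for the example; the point is only that the "/" byte never turns up inside a multi-byte sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A path whose middle component contains two non-ASCII characters.
my $path  = "home/\x{169}\x{263A}/file";
my $bytes = encode("UTF-8", $path);

# Pure ASCII text comes out of the encoder unchanged.
my $ascii_same = encode("UTF-8", "home/file") eq "home/file";

# Count the "/" bytes and the bytes that are 128 or above.
my $slashes = () = $bytes =~ m{/}g;
my @high    = grep { $_ >= 128 } unpack("C*", $bytes);

printf "%d slashes, %d high bytes\n", $slashes, scalar @high;   # 2 slashes, 5 high bytes
```

U+0169 takes two bytes and U+263A takes three, all of them 128 or above, so a byte-oriented program splitting on "/" still finds exactly the two real separators.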

UTF-8's encoding algorithm is slightly complex, because the algorithm used depends on the codepoint. For codepoints up to 127 (U+007F), the character is encoded as in ASCII: one byte per codepoint. From U+0080 up to U+07FF, the codepoint is converted to its bit pattern, and this bit pattern is split over two bytes. For instance, U+0169 LATIN SMALL LETTER U WITH TILDE has the bit sequence 0000000101101001. The six least significant bits and the next five significant bits are 101001 and 00101. We prefix the five with 110 to make 11000101, and the six with 10 to make 10101001. Hence, our character in UTF-8 is encoded as 11000101 10101001; that is, byte 197 followed by byte 169. The following Perl code demonstrates this technique and extends it to characters requiring three or four bytes to encode:

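    # Encode one codepoint $uv as a string of UTF-8 bytes.
    # (The sub name is ours, for illustration.)
    sub uv_to_utf8 {
        my ($uv) = @_;
        my $d = "";
        if ($uv < 0x80) {                   # one byte: 0xxxxxxx, as in ASCII
            return chr($uv);
        }
        if ($uv < 0x800) {                  # two bytes: 110xxxxx 10xxxxxx
            $d .= chr(( $uv >> 6)   | 0xc0);
            $d .= chr(( $uv & 0x3f) | 0x80);
            return $d;
        }
        if ($uv < 0x10000) {                # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            $d .= chr(( $uv >> 12)         | 0xe0);
            $d .= chr((($uv >>  6) & 0x3f) | 0x80);
            $d .= chr(( $uv        & 0x3f) | 0x80);
            return $d;
        }
        if ($uv < 0x200000) {               # four bytes: 11110xxx and three 10xxxxxx
            $d .= chr(( $uv >> 18)         | 0xf0);
            $d .= chr((($uv >> 12) & 0x3f) | 0x80);
            $d .= chr((($uv >>  6) & 0x3f) | 0x80);
            $d .= chr(( $uv        & 0x3f) | 0x80);
            return $d;
        }
        die sprintf("Codepoint 0x%X does not fit in four bytes\n", $uv);
    }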
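The worked example for U+0169 can be cross-checked against the core Encode module. This is only a quick sanity check, not part of the algorithm itself; the variable names are ours.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode U+0169 and look at the resulting bytes.
my @bytes = unpack("C*", encode("UTF-8", "\x{169}"));
print "@bytes\n";                                            # 197 169
print join(" ", map { sprintf "%08b", $_ } @bytes), "\n";    # 11000101 10101001

# Decoding the two bytes gives the codepoint back.
my $cp = ord decode("UTF-8", pack("C*", 197, 169));
printf "U+%04X\n", $cp;                                      # U+0169

# Encoded length at each boundary mentioned in the text.
my @lengths = map { length encode("UTF-8", chr $_) }
              (0x7F, 0x80, 0x7FF, 0x800, 0x10000);
print "@lengths\n";                                          # 1 2 2 3 4
```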

6.3.3. UTF-16BE

Unicode's own native encoding is the two-byte UTF-16. This is available in big-endian and little-endian flavors; data sent over a network is expected to be in network order (big-endian).

UTF-16 is very similar to UCS-2; in fact, any UCS-2 encoded data is valid UTF-16BE. However, UTF-16 is extended to characters beyond the BMP by the use of surrogate pairs.

The surrogate pair mechanism uses two characters, one from the High Surrogate Zone, which ranges from U+D800 to U+DBFF, and one from the Low Surrogate Zone, which stretches from U+DC00 up to U+DFFF. The codepoint of a pair of characters so used is calculated as (HIGH - 0xD800) * 0x400 + (LOW - 0xDC00) + 0x10000. With 1024 high and 1024 low surrogates, the surrogate pair mechanism extends UTF-16 with another 1,048,576 characters.
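The arithmetic is easy to get wrong, so here is a minimal sketch in both directions. The two sub names are hypothetical; note that when combining a pair, the low offset and the constant 0x10000 are both added, and only the high offset is multiplied by 0x400.

```perl
use strict;
use warnings;

# Hypothetical helper: split a codepoint above U+FFFF into a surrogate pair.
sub to_surrogates {
    my ($cp) = @_;
    my $v = $cp - 0x10000;                  # 20-bit offset past the BMP
    return (0xD800 + ($v >> 10),            # top 10 bits pick the high surrogate
            0xDC00 + ($v & 0x3FF));         # bottom 10 bits pick the low surrogate
}

# Hypothetical helper: combine a surrogate pair back into a codepoint.
sub from_surrogates {
    my ($high, $low) = @_;
    return ($high - 0xD800) * 0x400 + ($low - 0xDC00) + 0x10000;
}

my ($high, $low) = to_surrogates(0x1D11E);  # MUSICAL SYMBOL G CLEF
printf "U+%04X U+%04X\n", $high, $low;                # U+D834 U+DD1E
printf "U+%X\n", from_surrogates($high, $low);        # U+1D11E
```

The largest pair, U+DBFF with U+DFFF, comes out as U+10FFFF, which is where the figure of 1,048,576 extra characters comes from.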

6.3.4. UTF-16LE

UTF-16LE is the little-endian version of UTF-16BE.
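To see the difference in byte order, the same character can be encoded both ways with the core Encode module. A small sketch; the variable names are ours.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{169}";
my $be   = encode("UTF-16BE", $char);
my $le   = encode("UTF-16LE", $char);

# Print each encoding as a sequence of hex bytes.
printf "BE: %s\n", join(" ", map { sprintf "%02X", $_ } unpack("C*", $be));   # BE: 01 69
printf "LE: %s\n", join(" ", map { sprintf "%02X", $_ } unpack("C*", $le));   # LE: 69 01
```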

6.3.5. UCS-4

UCS-4 is the four-byte ISO 10646 encoding. Whereas UCS-2 encoded just the row and column coordinates of a character cell in the BMP, UCS-4 encodes the four-dimensional coordinates of group, plane, row, and column; within the BMP, the first two bytes of each character will be zero. The advantage of UCS-4 is that it can encode every single codepoint in the UCS; the disadvantage is that every single character requires four bytes.
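The four coordinates are just the four bytes of a 32-bit big-endian integer, which pack and unpack can show directly. A sketch with made-up variable names:

```perl
use strict;
use warnings;

# Split a codepoint into its UCS-4 group, plane, row, and column bytes.
my @bmp   = unpack("C4", pack("N", 0x0169));    # a character in the BMP
my @above = unpack("C4", pack("N", 0x1D11E));   # a character in plane 1

print "@bmp\n";      # 0 0 1 105
print "@above\n";    # 0 1 209 30
```

For the BMP character, the group and plane bytes are both zero, and the last two bytes are the same row and column that UCS-2 would use.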

6.3.6. UTF-32

UTF-32 is, roughly speaking, the Unicode equivalent of ISO's UCS-4. The only difference is that UTF-32 is restricted to encoding the same range of codepoints as UTF-16, U+0000 to U+10FFFF. Unlike UTF-16, it needs no surrogate pairs to reach beyond the BMP, because every codepoint in that range fits in a single four-byte unit. It can therefore be thought of as a wide form of UTF-16, and, like UTF-16, comes in big-endian and little-endian flavors. Nobody seems to know what it's for.
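As a quick comparison, here is how the core Encode module lays out one character from beyond the BMP in UTF-16BE and in UTF-32BE. The variable names are ours.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $clef  = "\x{1D11E}";                  # MUSICAL SYMBOL G CLEF
my $utf16 = encode("UTF-16BE", $clef);
my $utf32 = encode("UTF-32BE", $clef);

print unpack("H*", $utf16), "\n";   # d834dd1e
print unpack("H*", $utf32), "\n";   # 0001d11e
```

Both results are four bytes long, but the UTF-16 bytes are two 16-bit surrogates, while the UTF-32 bytes are the codepoint itself written as one 32-bit number.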

6.3.7. UTF-EBCDIC

UTF-EBCDIC is the encoding method designed for EBCDIC systems; it is specified in Unicode Technical Report 16. It's not intended as an open interchange format and should only be used internally to EBCDIC machines. Unless you're exchanging data with or using an EBCDIC system, you're unlikely to need this.

6.3.8. UTF-7

UTF-7 is another of those UTFs you'll probably never need. It's used in environments where eight-bit cleanliness is not guaranteed, such as passing mail through a VAX. Ergh. But fear not, Perl versions 5.8.1 and higher know how to translate it.
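Should you ever need it, a round trip through UTF-7 looks like this with the Encode module. This is a sketch assuming an Encode recent enough to know the "UTF-7" encoding name; the sample text is ours.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "Hi Mom \x{263A}";             # ends with WHITE SMILING FACE
my $utf7 = encode("UTF-7", $text);
my $back = decode("UTF-7", $utf7);

# Every byte of the encoded form is below 128, so it survives a seven-bit channel.
my $seven_bit = $utf7 !~ /[^\x00-\x7F]/;

print "$utf7\n";
print "round trip ok\n" if $back eq $text and $seven_bit;
```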
