Section 6.2. What Is Unicode?

In the bad old days of data handling, if you wanted to work with text in a different language, you'd probably have to deal with a different character set. This could mean a different character repertoire, or a different character encoding, or both. Applications that needed to process Japanese data had to deal with at least two major character sets in any of three encodings, and more if you needed to deal with Latin 1 as well. Each encoding would need special-case code to handle it, and programming was not fun.

Unicode (or, more formally, the Unicode Standard) is an attempt to put that right. The core of the Unicode Standard defines a universal character repertoire; it then also defines standard encodings for that repertoire. The Unicode Standard is augmented by a series of Unicode Technical Reports (UTRs), which provide additional information: more encodings; additions and corrections to the standard; algorithms for collation; and so on.

The Unicode effort was started in the late '80s (the term Unicode was first used in 1987) by programmers working at Xerox and Apple. The first edition of the Unicode Standard was released in 1991.

Unicode is based on four primary design principles (quoted from Tony Graham's book Unicode: A Primer):

Universal. The character repertoire should be large enough to encompass all characters likely to be used in general text interchange.
Efficient. Plain text, composed of a sequence of fixed-width characters, is simple to parse, and software does not need to maintain state, look for special escape sequences, or search forward or backward through text to identify characters.
Uniform. A fixed-length character code allows efficient sorting, searching, display and editing of text.
Unambiguous. Any given 16-bit value always represents the same character.

These goals were achieved through a combination of an extensive character repertoire and a fixed-width native coding scheme, UTF-16.

6.2.1. What Is UCS?

At the same time as the Unicode teams at Apple and Xerox were putting together a universal character set, the ISO standards organization was developing an international character set standard, ISO 10646. Realizing the futility of having two standard, universal character sets, the Unicode team and the ISO working group (ISO/IEC JTC1/SC2/WG2) agreed in 1991 to join forces. This has ensured that the industry standard, Unicode, and the international standard, ISO 10646, have remained, to all intents and purposes, identical.

However, since we have two cooperating standards, we have two sets of terminology to deal with. Unfortunately, ISO standards tend to use different terms from industry standards. Hence, the Unicode character repertoire, as defined by the Unicode Standard, is known as the Universal Character Set, or UCS, in ISO legalese. UCS is also slightly different: while it is character-for-character identical with the Unicode character repertoire, it allows for much more expansion.

As far as the Unicode Standard is concerned, the character repertoire consists of a maximum of 65,536 characters. This was initially thought to be far more than required for all the world's languages. By the time the second edition of the Unicode Standard was published, there were still 18,000 unassigned code points; by Unicode 3.0, there were 8,000 code points to go. This is obviously not enough, especially with the thousands of rare Chinese and Japanese characters that have been submitted for inclusion. The Unicode way of coping with this is to represent each additional character as a pair of 16-bit values, by means of the surrogate pair extension mechanism. In ISO 10646, however, the 65,536 characters form something called the Basic Multilingual Plane (BMP), and the UCS is made up of multiple planes.
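The surrogate pair mechanism can be sketched in a few lines. The example below is in Python for brevity; the function names are hypothetical, and the constants (0xD800, 0xDC00, and the 10-bit split) are those of UTF-16 as it is generally described, not something stated in this section.

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond the BMP into two 16-bit surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not beyond the BMP")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    # U+1D11E is MUSICAL SYMBOL G CLEF, a character outside the BMP.
    high, low = to_surrogate_pair(0x1D11E)
    print(f"{high:04X} {low:04X}")
```

Each half of the pair falls in a range that is reserved for surrogates, so a decoder can tell a surrogate from an ordinary 16-bit character by its value alone.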

The UCS is conceptually a series of cubes, or groups. There are 256x256 cells in a plane, and 256 planes in a group. There are 128 groups in total, allowing UCS to encode a massive 256x256x256x128 = 2,147,483,648 characters. These will never all be assigned, of course; Unicode's native encoding format UTF-16, with its surrogate pair mechanism, can only reach 16 planes beyond the BMP (1,048,576 additional characters).
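The arithmetic behind these figures is easy to check. The following Python sketch uses hypothetical constant names; the count of 1,024 high and 1,024 low surrogate values is an assumption taken from the usual description of UTF-16 rather than from this section.

```python
CELLS_PER_PLANE = 256 * 256     # 65,536 cells, the size of the BMP
PLANES_PER_GROUP = 256
GROUPS = 128

# Total capacity of the UCS code space: 2,147,483,648, or 2**31.
UCS_CAPACITY = CELLS_PER_PLANE * PLANES_PER_GROUP * GROUPS

# Assumed: 1,024 high surrogates combine with 1,024 low surrogates.
SURROGATE_CAPACITY = 1024 * 1024

# Number of whole planes the surrogate pairs can address: 16.
SUPPLEMENTARY_PLANES = SURROGATE_CAPACITY // CELLS_PER_PLANE

# Everything UTF-16 can reach: the BMP plus the supplementary planes.
UTF16_CAPACITY = CELLS_PER_PLANE + SURROGATE_CAPACITY

if __name__ == "__main__":
    print(UCS_CAPACITY, SUPPLEMENTARY_PLANES, UTF16_CAPACITY)
```

Counting the BMP as well, UTF-16 covers 17 planes in all, a tiny corner of the space that UCS allows for.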

The ISO standard also defines two encoding mechanisms for UCS: UCS-2 and UCS-4. UCS-2 is conceptually identical to UTF-16, minus the surrogate pair mechanism. We will examine both encodings in the section on UTFs later in this chapter.
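As a rough illustration of the difference in width, the Python sketch below encodes the same text at two and at four bytes per character. Python has no codecs named UCS-2 or UCS-4, so this assumes that its big-endian UTF-16 and UTF-32 codecs are acceptable stand-ins, which holds only for characters inside the BMP; the function name is hypothetical.

```python
def encoded_sizes(text):
    """Return (bytes at 16 bits per unit, bytes at 32 bits per unit)."""
    two_byte = text.encode("utf-16-be")    # stand-in for UCS-2 (BMP text only)
    four_byte = text.encode("utf-32-be")   # stand-in for UCS-4
    return len(two_byte), len(four_byte)


if __name__ == "__main__":
    # Two BMP characters: 4 bytes in the 2-byte form, 8 in the 4-byte form.
    print(encoded_sizes("A\u00e9"))
    # The same letter "A" in each form, differing only in padding.
    print("A".encode("utf-16-be"), "A".encode("utf-32-be"))
```

For a character outside the BMP the two sizes come out equal, because the 16-bit form has to fall back on a surrogate pair, which plain UCS-2 cannot do.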

6.2.2. What Is the Unicode Consortium?

After the ISO and Unicode efforts merged, a consortium of interested parties was set up to manage and develop the Unicode portion of the combined standard. The Unicode Consortium was founded in 1990 and incorporated as Unicode, Inc., in 1991.

The technical work of the consortium is carried out by the Unicode Technical Committee (UTC), which publishes the Unicode Standard and also issues Unicode Technical Reports.

The consortium also maintains many mailing lists, FAQs, and other resources available from http://www.unicode.org/, the Unicode Consortium web site.

Membership in the consortium is open to anyone, and there are a variety of membership levels. Perl is a member of the consortium at associate member level, and was the first programming language to be independently represented in the consortium.

6.2.3. Why Should I Care?

The most important thing that this chapter can teach you about Unicode is that you should find out more about it and start being aware of it in your own programs. Unicode is coming.

If you're already working with data in various languages, you'll know the hell you need to go through to get everything working. Unicode makes it a lot easier.

If you're not already working with different languages, you will. Unicode can help you internationalize and localize your programs; Unicode awareness and support can make multinationalization a great deal more straightforward. Once your program is Unicode-aware, common tasks such as sorting, searching, and regular expression matching just work in any language.
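To give a flavor of what Unicode-aware matching and searching look like, here is a minimal sketch in Python, whose strings and regular expressions work on Unicode characters by default. The helper names are hypothetical. Language-sensitive sorting is left out, since it generally needs a collation library and is not something a short example can do justice to.

```python
import re


def words(text):
    """Find runs of word characters, whatever script they belong to."""
    return re.findall(r"\w+", text)


def same_ignoring_case(a, b):
    """Compare two strings caselessly using Unicode case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # The same pattern picks out French and Japanese words alike.
    print(words("naïve café 東京"))
    # Case folding knows that German sharp s corresponds to "ss".
    print(same_ignoring_case("STRASSE", "straße"))
```

Nothing in either helper mentions a particular language or encoding, which is the point: the character knowledge lives in the Unicode data, not in special-case code.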

And if you don't think you will ever work with different languages, you still need to know about Unicode. Will you be receiving data from external sources? There's a growing possibility this data will be in Unicode, and you're going to need to know how to handle it.

If you're a Perl module author, there's absolutely no excuse; you have no idea how people will use your module or what data they might throw at it. If it can't cope with that data, it's broken, and people will blame you.

Finally, even if you're sure you'll never ever touch data that's not in good ol' ASCII, it does you good to know about Unicode anyway, since it is the way the world's going. Unicode support is very easy to achieve, especially in Perl, and it makes you a better programmer. The Perl value of laziness is important, but good laziness means you'll take the time to make your programs Unicode-aware first, so you won't need to make any changes when the time comes to support non-ASCII data.