It's a good idea to take a little time out, before we think about what Unicode is and what problem it solves, to clarify in our minds a few terms that have been widely used and abused in the programming world. In particular, the term character set is more troublesome than it might appear.
We often talk about the ASCII character set, but this relates to many different ideas: it could mean the actual suite of characters involved, or the order in which they are placed in that suite, or the way that a piece of text is represented in bytes. In fact, when people talk about text from an ASCII system, it may not even be ASCII. The potential for confusion comes because ASCII is a seven-bit character set, whereas for the past 25 years or so, computers have had eight-bit bytes. ASCII only defines the meaning of the first 128 entries in the set, so what should be done with the other 128? Rather than leave them unused and wasted, nearly every ASCII system chooses to define them in some way, usually with accented characters and extra symbols.

Many manufacturers chose to make their machines use one of the range of national sets defined by ISO standard 8859. Of these sets, ISO-8859-1--generally called "Latin 1"--was the most popular because it provides all the accented letters needed by most Western European languages. It is also the default encoding assumed by protocols such as HTTP. So prior to Unicode, many computers supposedly using ASCII actually produced text using all 8 bits and assumed that any machine they exchanged data with attached the same meaning to the 128 non-ASCII characters. You can see the potential for mistakes here, and that's just with the data. There's also ambiguity about what the term character set means, so really we want to avoid it altogether and replace it with some more precise terms:
- character

A character is somewhat easier to define: it is the abstract description of a symbol, devoid of any formatting expectations. There are any number of ways that one might format the character that Unicode calls LATIN SMALL LETTER A: upright, italic, bold, in any number of typefaces, and so on. However, they all represent the same character. This is distinct from a glyph.
- glyph

A glyph is the physical, visual representation of a character. A glyph concerns itself with shape, typeface, point size, boldness, slant, and so on; a character does not. An upright "a" and an italic "a" are different glyphs, but the same character.
Unicode does not concern itself with glyphs in any way; it does not determine how its characters should look, just what they are. On the other hand, character repertoires such as the Japanese JIS standards do specify not just the collection of characters used, but also their appearance.
- character repertoire
A character repertoire is a collection of characters. Latin 1 has a character repertoire of 256 characters. The character repertoire itself does not specify the order in which the characters appear, nor does it map characters to codepoints. (See below.)
- character code
The order and the mapping are specified by the character code. This is what tells us, for instance, that the Unicode character LATIN SMALL LETTER B comes directly after LATIN SMALL LETTER A.
- codepoint

A character's codepoint is the number giving the position of that character in a given character code. The Perl function to get a character's codepoint is ord.
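A quick sketch of this in Perl, using ord and its inverse chr (this assumes a Unicode-aware Perl, 5.8 or later):

```perl
use strict;
use warnings;

# ord returns a character's codepoint; chr is its inverse.
print ord('a'), "\n";           # 97, i.e. U+0061
print ord('b'), "\n";           # 98: directly after LATIN SMALL LETTER A
print ord("\x{263A}"), "\n";    # 9786, i.e. U+263A WHITE SMILING FACE
die unless chr(97) eq 'a';      # chr reverses ord
```

Note that ord reports the position in the character code, not anything about bytes; how a codepoint is stored is a separate question, which we come to next.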
- character encoding
When dealing with a 256-character repertoire such as Latin 1, it is easy to see how the codepoints should be represented to a computer: each codepoint is simply encoded as one byte. When we get to 65,536 characters and above, on the other hand, we need to specify rather precisely how we're going to represent each character as a sequence of bytes. This is the character encoding of our data.
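The one-byte-per-codepoint scheme is exactly where the pre-Unicode ambiguity described earlier bites: the same byte means different characters under different eight-bit encodings. Here's a small sketch using Perl's core Encode module; the byte 0xA4 is CURRENCY SIGN in Latin 1 but EURO SIGN in its successor, ISO-8859-15:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte 0xA4 decodes to different characters
# depending on which ISO 8859 variant we assume.
my $latin1  = decode('ISO-8859-1',  "\xA4");  # U+00A4 CURRENCY SIGN
my $latin9  = decode('ISO-8859-15', "\xA4");  # U+20AC EURO SIGN

printf "Latin 1: U+%04X\n", ord($latin1);     # Latin 1: U+00A4
printf "Latin 9: U+%04X\n", ord($latin9);     # Latin 9: U+20AC
```

Without out-of-band knowledge of which encoding produced the byte, there is no way to tell the two readings apart.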
Unicode typically uses a set of well-specified character encodings it calls Unicode Transformation Formats or UTFs. We'll look at the most commonly used UTFs later on in the chapter.
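As a brief taste of what's to come, this sketch (again using the core Encode module) shows how a single codepoint becomes different byte sequences under two of these encodings:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+263A WHITE SMILING FACE is one character, but its byte
# representation depends on the chosen encoding.
my $smiley = "\x{263A}";

my $utf8  = encode('UTF-8',    $smiley);   # 3 bytes: E2 98 BA
my $utf16 = encode('UTF-16BE', $smiley);   # 2 bytes: 26 3A

printf "UTF-8:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
printf "UTF-16BE: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf16;
```

Same character, same codepoint, different encodings; keeping those three ideas separate is the whole point of the terminology above.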