
6.4. Handling UTF-8 Data

So much for theory. Let's now look at what Unicode means for Perl.

Let's suppose we've got some text encoded in UTF-8, and we want to mess about with it in Perl. You'd think we could just open the file and it would all magically work, but fortunately, Perl's not that clever. I say fortunately because we don't actually want Perl to automatically treat data as UTF-8; imagine, for instance, handling a piece of binary data, such as a JPEG image, while Perl obliviously tries to treat it as UTF-8.

Instead, Perl has two distinct processing modes for data: byte mode and character mode. The default is byte mode, and this works equally well for binary data and text encoded in a system that requires one byte per character, such as ordinary Latin 1. Character mode, on the other hand, treats the data as UTF-8. What does this mean in practice? Well, let's suppose we have the following text file, encoded in UTF-8:

    Üñîçöðè

The UTF-8 representation of that string is:

    C3 9C C3 B1 C3 AE C3 A7 C3 B6 C3 B0 C3 A8 0A

(0x0A is the newline at the end.)

So we can see that the file itself is 15 bytes long. And if we don't inform Perl, we get 15 bytes:

    % perl -e 'open IN, "foo.utf8"; $a = <IN>; print length ($a)'
    15

But we also know that, although there are 15 bytes in the file, there are only 8 characters: seven letters plus the newline. So we tell Perl to open the file as UTF-8, and now:

    % perl -e 'open IN, "<:utf8", "foo.utf8"; $a = <IN>; print length ($a)'
    8

Once you have your UTF-8 data correctly treated as UTF-8, everything works as you would expect; Perl converts the UTF-8 data to its internal Unicode format^[*] on input, and you can use length (as demonstrated), substr, index, and all other built-in Perl functions on character data, and they'll use character positions instead of byte positions.
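The same byte-versus-character distinction can be reproduced without a file. The following is a minimal sketch, assuming the core Encode module is available; the count_units helper and the variable names are illustrative and not part of the original example. It takes the 15 bytes from the hex dump above and decodes them by hand, which is roughly what the input layer does for you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the length of a byte string before and after decoding it as UTF-8.
sub count_units {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # bytes in, characters out
    return (length $octets, length $chars);
}

# The 15 bytes from the hex dump above.
my $file_bytes = "\xC3\x9C\xC3\xB1\xC3\xAE\xC3\xA7\xC3\xB6\xC3\xB0\xC3\xA8\x0A";

printf "%d bytes, %d characters\n", count_units($file_bytes);
# prints: 15 bytes, 8 characters
```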

^[*] Perl's internal Unicode format happens to be UTF-8, but you don't need to know these implementation details to be able to use Unicode in Perl unless you write XS code. Use a recent 5.8.x release and simply treat the internals as a black box.

If you get your input and output correct, most of the rest of your problems go away. Convert your input to Unicode right away, as it enters the program. Convert your output to the desired encoding at the last possible point, just as it leaves your program. This ensures all the data inside your program is Unicode and doesn't need to be converted. If you add conversions somewhere inside your dataflow, you run the risk of performing the conversions more than once, or wrongly concatenating data in two different encodings. The most common symptom of this kind of problem is outputting double encoding, that is, the UTF-8 encoding of the UTF-8 encoding of a Unicode string. This is similar to entity-encoding text in a web page that's already entity-encoded, so a literal > would give you &amp;gt; in the HTML source instead of the correct &gt;. Convert at the boundaries and let Perl keep track of things internally; it's what it's good at.
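Double encoding is easy to see at the byte level. The sketch below assumes the core Encode module; hex_dump is a hypothetical helper written for this illustration. It encodes one character correctly, then encodes the result a second time, which is the mistake described above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex, for inspection.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my $text  = "\x{DC}";                  # one character: U+00DC
my $once  = encode('UTF-8', $text);    # correct: converted once, at the boundary
my $twice = encode('UTF-8', $once);    # bug: each byte is re-read as a character

print hex_dump($once),  "\n";   # C3 9C
print hex_dump($twice), "\n";   # C3 83 C2 9C
```

The second result is four bytes long because the two bytes C3 and 9C were each treated as a Latin 1 character and encoded again.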

6.4.1. Entering Unicode Characters

We've looked at how to read in UTF-8 data from external sources (filehandles); how about generating Unicode from inside our program? There are three main ways to do this.

The first way is perhaps the most obvious: functions like chr are automatically extended to produce Unicode strings when they need to. In fact, for lack of a decent Unicode editor, I generated some of my test files for this chapter using code like this:

    binmode(STDOUT, ":utf8");
    print chr $_ for
        (0x30b8, 0x30a7, 0x30c3, 0x30cb, 0x306f, 0x5927, 0x597d, 0x304d, 0xff01);

In the same way that we told Perl to treat data from a particular input filehandle as UTF-8, we also need to tell it that a particular output filehandle expects UTF-8 data. The call to binmode in the previous example sets this UTF-8 handling on a filehandle that's already open.
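Input and output layers can be checked together with a round trip through a temporary file. This is a sketch under a few assumptions: it uses the core File::Temp module, the round_trip helper is invented for the illustration, and it uses the :encoding(UTF-8) layer, a stricter relative of :utf8 that also validates the data. The byte count assumes a system that does not translate newlines.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a string through a UTF-8 output layer, then read it back through
# a matching input layer. Returns (size on disk in bytes, length in characters).
sub round_trip {
    my ($text) = @_;

    my ($out, $path) = tempfile(UNLINK => 1);
    binmode($out, ':encoding(UTF-8)');      # the handle is already open
    print {$out} $text;
    close $out or die "close failed: $!";

    my $size = -s $path;

    open my $in, '<:encoding(UTF-8)', $path or die "open failed: $!";
    my $line = <$in>;
    close $in;

    return ($size, length $line);
}

# Two katakana characters (three bytes each in UTF-8) plus a newline.
my ($bytes, $chars) = round_trip("\x{30B8}\x{30A7}\n");
printf "%d bytes on disk, %d characters in memory\n", $bytes, $chars;
```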

The second way of entering Unicode data is as string literals. In this case the \x notation is extended beyond \xFF by means of curly braces:

    binmode(STDOUT, ":utf8");
    print "\x{30B8}\x{30A7}...";

And third, if your Unicode characters happen to have names defined in the Unicode Standard, you can use the \N literal notation in conjunction with the charnames pragma.

    use charnames ":full";
    binmode(STDOUT, ":utf8");

    print "I \N{HEAVY BLACK HEART} Unicode\n";
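The three notations are interchangeable, which a short check makes concrete. This is only a sketch; the spellings helper is hypothetical, and the code point U+2764 is the one the Unicode Standard assigns to HEAVY BLACK HEART.

```perl
use strict;
use warnings;
use charnames ':full';

# Return the same character written three different ways.
sub spellings {
    return (
        chr(0x2764),               # numeric code point via chr
        "\x{2764}",                # hex escape with curly braces
        "\N{HEAVY BLACK HEART}",   # name from the Unicode Standard
    );
}

my ($by_chr, $by_hex, $by_name) = spellings();
print "all three spellings agree\n"
    if $by_chr eq $by_hex && $by_hex eq $by_name;
```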

Writing the full names can be tedious sometimes, particularly when you're entering characters from particular alphabets. Instead, charnames provides a shorter form to access characters from particular Unicode blocks:

    use charnames ":short";
    binmode(STDOUT, ":utf8");

    print "\N{hebrew:alef} \N{greek:omega}\n";

This only works where the Unicode name is of the form SCRIPT LETTER NAME or SCRIPT CAPITAL/SMALL LETTER NAME. Capitals can be obtained, intuitively, by starting the character name with a capital:

    use charnames ":short";
    binmode(STDOUT, ":utf8");

    print "\N{greek:Sigma}\N{greek:iota}\N{greek:mu}\N{greek:omicron}\N{greek:nu}";

But as you can see, this also gets tedious if you're working in the same alphabet. The charnames pragma allows you to import particular alphabets, like so:

    use charnames qw(greek hebrew);
    binmode(STDOUT, ":utf8");

    print "\N{Sigma}\N{iota}\N{mu}\N{omicron}\N{nu}\n";
    print "\N{alef}\N{bet}\N{gimel}\n";

On a Unicode terminal, this may output:

    Σιμον
    אבג

Notice that although Perl outputs the Hebrew characters in alphabetical order, the terminal is responsible for handling the right-to-left aspects of the Hebrew output.
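If no Unicode terminal is at hand, the short names can be checked by asking charnames for the full name of each character it produced. The sketch below assumes the charnames::viacode function, which maps a code point back to its name; the describe helper is made up for this example.

```perl
use strict;
use warnings;
use charnames qw(greek hebrew);

# List the code point and full Unicode name of every character in a string.
sub describe {
    my ($string) = @_;
    return map { sprintf 'U+%04X %s', ord($_), charnames::viacode(ord($_)) }
           split //, $string;
}

print "$_\n" for describe("\N{Sigma}\N{alef}");
# U+03A3 GREEK CAPITAL LETTER SIGMA
# U+05D0 HEBREW LETTER ALEF
```

Printing names rather than the characters themselves keeps the output plain ASCII, so no output layer is needed.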

Of course, perhaps the most intuitive way of all for getting Unicode characters into Perl literals is simply to dump them into the middle of a string. You can do this, so long as you use the utf8 pragma. Perl allows you to use Unicode characters for string literals, comments, and, if you feel so inclined, the names of Perl identifiers.
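A minimal sketch of the utf8 pragma at work follows. It assumes the source file itself is saved as UTF-8; the char_and_byte_counts helper is hypothetical, and utf8::encode is the built-in function that converts a string in place to its UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;    # literals in this file are read as characters, not bytes

# Return the number of characters in a string and the number of bytes
# in its UTF-8 representation.
sub char_and_byte_counts {
    my ($word) = @_;
    my $octets = $word;        # work on a copy
    utf8::encode($octets);     # in-place conversion to UTF-8 bytes
    return (length $word, length $octets);
}

my ($chars, $bytes) = char_and_byte_counts("Üñîçöðè");
print "$chars characters, $bytes bytes\n";   # 7 characters, 14 bytes
```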

6.4.2. Unicode Regular Expressions

Perl's regular expression engine supports what it calls polymorphic regular expressions; when matching against Unicode data, regular expression operators have character semantics, and when matching against non-Unicode data, the same regular expressions have byte semantics. No change to your code is needed to make regular expressions do the right thing in each context.
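One way to see the polymorphism is to run a single pattern against the same data twice, once as raw bytes and once decoded. This is a sketch assuming the core Encode module; last_unit is an illustrative helper, and the six bytes are the UTF-8 form of the two katakana characters U+30B5 and U+30A4.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the numeric value of whatever "." matches at the end of the string.
sub last_unit {
    my ($string) = @_;
    return unless $string =~ /(.)$/;
    return sprintf '0x%X', ord $1;
}

my $octets = "\xE3\x82\xB5\xE3\x82\xA4";   # undecoded: six separate bytes
my $chars  = decode('UTF-8', $octets);     # decoded: two characters

print last_unit($octets), "\n";   # 0xA4   (byte semantics)
print last_unit($chars),  "\n";   # 0x30A4 (character semantics)
```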

What does it mean for a regular expression to have character semantics? The first and most obvious thing is that operators such as . don't just match a single byte, but match an entire Unicode character:

    use charnames qw(katakana);
    binmode(STDOUT, ":utf8");

    $x = "\N{sa}\N{i}";

    $x =~ /(.)$/;
    print $1;

This prints イ, the last character in the string, instead of the last byte in the UTF-8 representation, which is \xA4. So far so good; Perl does what you mean.

If this isn't what you mean, and you do want to slice up a string into its component bytes, you have two ways of doing so; the first is the lexically scoped use bytes pragma, which pretends we are in 5.005 land, where Unicode does not even exist:

    use charnames qw(katakana);
    binmode(STDOUT, ":utf8");

    $x = "\N{sa}\N{i}";

    {
      use bytes;
      $x =~ /(.)$/;
      printf "\\x%X\n", ord $1;   # ord gives the numeric value of the matched byte
    }

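This one does, indeed, print out \xA4. Your other alternative is to use the new \C match operator, which matches an individual byte.^[*] (\C was later deprecated and then removed from regular expressions in Perl 5.24, so it is only an option on older releases.) Both of these methods require some caution, as they make it easy to generate ill-formed UTF-8.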

^[*] \C was named, perhaps unwisely, after C's char data type; char, of course, is a byte, not a character.
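A third route, not one of the two described above, is to ask for the bytes explicitly and leave the original string alone. The sketch assumes the core Encode module; last_octet is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the last byte of a string's UTF-8 encoding, formatted as \xNN.
sub last_octet {
    my ($string) = @_;
    my @octets = unpack 'C*', encode('UTF-8', $string);   # list of byte values
    return sprintf '\\x%02X', $octets[-1];
}

print last_octet("\x{30B5}\x{30A4}"), "\n";   # \xA4
```

Because the character string is never sliced at the byte level, this approach cannot leave ill-formed UTF-8 behind in it.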

Other properties of Unicode regular expressions are much friendlier. For instance, character classes such as \d and \w take their definitions from the Unicode Standard; we now know about more numbers than just our Arabic digits:

    use charnames qw(:full);
    binmode(STDOUT, ":utf8");

    $x = "Some numbers: \N{DEVANAGARI DIGIT TWO}\N{DEVANAGARI DIGIT SIX}";

    print "Found a number: $1" if $x =~ /(\d+)/;

    Found a number: २६

Sadly, there's no easy way to get at the digit value yet, and non-Arabic numbers do not convert between strings and numbers in the usual Perl way. You cannot (yet) say "\N{DEVANAGARI DIGIT TWO}" + 3 to get 5.
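Perl releases newer than the one discussed here (5.14 and later) ship a num function in the core Unicode::UCD module that fills part of this gap. The following is a sketch of how it might be used, not a replacement for native string-to-number conversion, which still does not handle these digits.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# DEVANAGARI DIGIT TWO followed by DEVANAGARI DIGIT SIX
my $digits = "\x{0968}\x{096C}";

# num() returns undef unless the whole string is a safe, valid number.
my $value = num($digits);
die "not a usable number\n" unless defined $value;

print $value + 3, "\n";
```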

The Unicode Standard also declares that particular characters have particular properties, and regular expressions can match against these properties using the \p{...} notation. For instance, $ and the symbols for other currencies all have the Unicode CurrencySymbol property, so /\p{CurrencySymbol}/ matches them. A negated version, \P{...}, matches all characters that don't have the named property.
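As a small illustration of property matching, the sketch below pulls every currency symbol out of a string. The currency_symbols helper and the sample text are invented for the example; the three symbols used are the dollar sign, the pound sign (U+00A3), and the euro sign (U+20AC).

```perl
use strict;
use warnings;

# Return every character in the text that has the CurrencySymbol property.
sub currency_symbols {
    my ($text) = @_;
    return $text =~ /(\p{CurrencySymbol})/g;
}

my @found = currency_symbols("\$5, \x{A3}3 and \x{20AC}2");

# Print code points rather than the characters, so the output is plain ASCII.
print join(' ', map { sprintf 'U+%04X', ord($_) } @found), "\n";
# U+0024 U+00A3 U+20AC
```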

Finally, Perl provides the \X shortcut for matching complete decomposed characters. Many characters in the Unicode character repertoire can combine with a myriad variety of accents, marks, voicings, and other decorations; naturally, it's not practical for the character repertoire to include all possible combinations of marks on each character. Instead, base characters can be followed by combining characters that should all be treated as a single unit. For example, as Table 6-1 shows, the polytonic Greek character ΰ can be broken down (decomposed) into three constituent parts.

Table 6-1. Decomposing Unicode characters

    Glyph   Code point   Character name
    υ       U+03C5       GREEK SMALL LETTER UPSILON
    ◌̈       U+0308       COMBINING DIAERESIS
    ◌́       U+0301       COMBINING ACUTE ACCENT

\X allows you to match these single decomposed units:

    % perl -le 'my $x = chr(0x03c5).chr(0x0308).chr (0x0301); $x=~/(\X)/ and print length $1'

    3

In general, you should use \X rather than . to pick out individual, meaningful characters; for example, I was recently asked to write some code that displayed names vertically by putting a newline in between each character.^[*] Doing this with print "$_\n" for $name =~/(.)/g led to the occasional surprise with decomposed data, because each combining mark was split off onto a line of its own.

^[*] Thankfully, I was guaranteed that the names would not be in any of the right-to-left or top-to-bottom scripts, which would have led to a whole world of pain.

The right answer, of course, was to use /(\X)/g, which keeps each base character together with its combining marks.
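The difference between the two patterns can be sketched with a short decomposed name. The name used here, "Zoe" followed by COMBINING DIAERESIS, and the code_points helper are made up for the illustration; they are not the data from the original task.

```perl
use strict;
use warnings;

# Render a string as a list of code points, so the output is plain ASCII.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

my $name = "Zoe\x{308}";   # the final e carries a combining diaeresis

my @by_dot     = $name =~ /(.)/g;    # one piece per code point
my @by_cluster = $name =~ /(\X)/g;   # one piece per displayed character

print scalar(@by_dot),     " pieces with .\n";     # 4 pieces with .
print scalar(@by_cluster), " pieces with \\X\n";   # 3 pieces with \X

# Each line below is what would appear on one row of the vertical display.
print code_points($_), "\n" for @by_cluster;
# U+005A
# U+006F
# U+0065 U+0308
```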

The most important thing for most people to know about handling Unicode data in Perl, however, is that if you never use any Unicode data (if none of your files are marked as UTF-8 and you don't use UTF-8 locales), then you can happily pretend that you're back in Perl 5.005_03 land; the Unicode features will in no way interfere with your code unless you're explicitly using them. Sometimes the twin goals of embracing Unicode and not disturbing old-style byte-oriented scripts have led to compromise and confusion, but it's the Perl way to silently do the right thing, which is what Perl ends up doing.
