6.5. Encode

Life would be so much easier if everything in the world were already Unicode. We'd have this one standard data interchange format, data processing would be trivial, world peace would be easily achievable, and Perl programmers could get back to finding cures for cancer and watching Buffy the Vampire Slayer DVDs.

Sadly, that hasn't happened yet, and we still have to deal with a wide variety of character encodings in existing data. Based on initial work by Nick Ing-Simmons, Dan Kogai has produced the Encode family of modules, which do an admirable job of converting data between various character encodings and Perl's own internal format. We'll examine these modules in a little more detail later in the chapter.

Suppose I have some text in Shift JIS, the standard Japanese encoding for Windows machines, and I want to manipulate it in Perl. I can read the bytes from a file into the scalar $text, but Perl sees opaque bytes, rather than a sequence of Unicode characters. So I need to start by changing it into Perl's internal format, using the Encode function decode:

    use Encode;
    my $intern = decode("shiftjis", $text);

The text in $intern is Unicode characters, which Perl can understand. Internally, Perl stores them as UTF-8, but unless you're dealing directly with Perl's internals, the representation of Unicode characters isn't important. Because ASCII is a strict subset of UTF-8, $intern is also a valid ASCII string if the input happened to contain only characters in the ASCII range (though, given that it was initially Japanese, this is unlikely here). Indeed, on a UTF-8 terminal, I can now print out $intern as Unicode text:

    binmode(STDOUT, ":utf8");
    print $intern;

Or, I can perform the same conversion on the command line using the -C command-line option to set STDOUT to UTF-8:

    % perl -C2 -MEncode -MFile::Slurp\
    -e 'print decode("shiftjis", read_file("japanese.sjis"));'

    
The -C command-line option sets UTF-8 handling on STDIN, STDOUT, STDERR, @ARGV, and the PerlIO layer. The PERL_UNICODE environment variable is equivalent and takes the same options as -C. These are available in Perl Versions 5.8.1 and higher. Read more in the perlrun documentation file.


There's also a corresponding function called encode for turning data from Perl's internal representation into another representation; we can use these two functions to make a cheap and cheerful character set converter:

    #!/usr/bin/perl -n0 -MEncode
    BEGIN{($from, $to) = splice @ARGV,0,2};

    print encode($to, decode($from, $_));

This allows us to say, for instance:

    % transcode shiftjis euc-jp < japanese.sjis > japanese.euc

to convert a file between two of the more common Japanese encodings. (Transcoding is the jargon for converting from one encoding to another.)

The conversion direction of the two functions encode and decode isn't instantly memorable. It may help to remember that the Perl interpreter only understands UTF-8 and subsets of UTF-8 (ASCII, Latin 1), and so anything else needs decoding before the interpreter can understand it as text.
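
One way to keep the direction straight is to compare lengths: decode turns bytes into characters, and encode turns characters back into bytes. The following is a minimal sketch; the two-byte sample input is an assumption chosen for illustration, not data from the examples above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: the two bytes that encode U+00C1
# (capital A with acute accent) in UTF-8.
my $bytes = "\xC3\x81";

# decode: bytes from the outside world -> Perl characters.
my $chars = decode("UTF-8", $bytes);

# encode: Perl characters -> bytes for the outside world.
my $round_trip = encode("UTF-8", $chars);

printf "bytes: %d, characters: %d, code point: %d\n",
    length($bytes), length($chars), ord($chars);
```

This should report 2 bytes, 1 character, and code point 193, and $round_trip ends up identical to $bytes.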


How do we know what encodings are available? Well, we can ask Encode to tell us:

    % perl -MEncode -le 'print for Encode->encodings(":all")'

    7bit-jis
    AdobeStandardEncoding
    AdobeSymbol
    AdobeZdingbat
    ascii
    ...

We use the :all parameter to include not just the standard set of encodings that Encode provides, but also those defined in any Encode::* modules that it's been able to find; for instance, many of the Japanese encodings are stored in Encode::JP.
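
If you only need to know whether one particular encoding is available, Encode's find_encoding function can be used instead of scanning the whole list. A small sketch follows; encoding_name is a hypothetical helper written for this illustration, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Return the canonical name of an encoding, or undef if Encode does not
# know it. encoding_name() is a hypothetical helper, not an Encode function.
sub encoding_name {
    my ($name) = @_;
    my $enc = find_encoding($name);
    return defined($enc) ? $enc->name : undef;
}

for my $candidate ("latin1", "no-such-encoding") {
    my $canonical = encoding_name($candidate);
    print "$candidate: ",
        (defined($canonical) ? $canonical : "not available"), "\n";
}
```

Aliases are resolved along the way, so "latin1" comes back under its canonical name, iso-8859-1.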

There's also a handy shorthand for transcoding, called from_to. The only thing to note about this is that it converts the string in-place, modifying its input.
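
As a rough sketch of from_to in use, assuming a one-byte Latin 1 sample string rather than real Japanese data:

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Assumed sample input: capital A with acute accent as a single Latin 1 byte.
my $data = "\xC1";

# from_to rewrites $data in place and returns the new length in bytes
# (or undef if the conversion fails).
my $new_length = from_to($data, "iso-8859-1", "utf8");

printf "%d bytes: %s\n", $new_length,
    join(" ", map { sprintf("%02X", ord($_)) } split(//, $data));
```

Note the argument order: the data comes first, then the source encoding, then the target. The call above has the same effect as $data = encode("utf8", decode("iso-8859-1", $data)).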

6.5.1. The PerlIO Trick

Perl 5.8.0 came with a very neat feature called PerlIO, which is a complete standard I/O library written exclusively for Perl. Normally, this would only excite really hard-core Perl maintainers (I must confess to being pretty baffled by most of it), but it provides a number of useful hooks to allow Perl modules to play about with any data going through the I/O system.

The upshot is that you can tell Perl to automatically encode and decode data as it's read from and written to a filehandle. If we want to transcode a file from Shift-JIS to EUC, we can just say:

    use Encode;
    open IN,  "<:encoding(shiftjis)", "data.jis" or die $!;
    open OUT, ">:encoding(euc-jp)",   "data.euc" or die $!;
    print OUT <IN>;

Anything read from IN will be decoded from Shift-JIS into Perl's internal format; similarly, anything written to OUT will be encoded as EUC.
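
To see what an encoding layer actually writes, the following sketch sends one character through a UTF-8 layer and reads the raw bytes back. The temporary file and the choice of UTF-8 are assumptions made so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one character through an encoding layer.
open my $out, ">:encoding(UTF-8)", $path or die "Cannot write $path: $!";
print {$out} chr(193);                  # capital A with acute accent
close $out or die "Cannot close $path: $!";

# Read the file back with no decoding at all.
open my $in, "<:raw", $path or die "Cannot read $path: $!";
my $raw = do { local $/; <$in> };
close $in;

printf "1 character became %d bytes on disk\n", length($raw);
```

The single character is stored as two bytes, which is the encoding step happening inside the filehandle rather than in your own code.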

6.5.2. The Gory Details

You should probably not read this section unless you're working with XS code that handles Unicode data, or you're doing extremely clever things with Unicode and can't get Encode to do what you want.


There are two dirty secrets about Encode and handling Unicode data in Perl. The first dirty secret is that Perl knows very little about Unicode, but it knows a lot about UTF-8. That's to say, Perl primarily cares about whether or not a string is UTF-8 encoded, and it cares little about the string's actual character code; knowing that a string is encoded in UTF-8 does not tell you whether it's Unicode, Latin 1, or anything else. Perl does not keep track of the character code anywhere, but assumes, for the purposes of regular expression matching, that things that are marked as UTF-8 will be Unicode. Many of the problems that people have with Unicode come about by thinking that once they've got data in UTF-8, they can do Unicode things with it; that's not the case. Similarly, you can't assume anything about the character coding of a string that isn't UTF-8. It might be Latin 1, but it might be something else entirely.

UTF-8 is just a character encoding, and it implies nothing about character repertoires.


The other dirty secret is how Perl decides how to treat a string. There isn't a global setting as to whether we're in byte or character mode; the decision about what to do with a string is made on a string-by-string basis. Each Perl string has a flag inside it that determines whether it's in UTF-8 encoding or not. There's only one flag to determine both whether a string is internally stored as UTF-8, and whether a string is to be treated with Unicode semantics by the regular expression engine and functions such as lc. So, if a string is converted to UTF-8 internally, it will be treated as Unicode.[*]

[*] Arguably this is a bug, but it's one we have to live with until Perl 6.
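
The effect of that single flag can be sketched with lc. This example assumes a Perl running without the unicode_strings feature and without a locale in effect; utf8::upgrade is used here only to switch the internal representation.

```perl
use strict;
use warnings;

# The same character in two internal representations.
my $plain    = chr(193);        # capital A with acute accent, stored as one byte
my $upgraded = chr(193);
utf8::upgrade($upgraded);       # same character, now stored as UTF-8, flag on

# The two strings compare equal...
print "equal\n" if $plain eq $upgraded;

# ...but lc only applies Unicode case rules to the flagged one.
printf "lc(plain) = %d, lc(upgraded) = %d\n",
    ord(lc $plain), ord(lc $upgraded);
```

Under those assumptions the unflagged string is left at 193 while the flagged one becomes 225 (lowercase a with acute accent), even though the two strings hold the same character.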

This has historically led to some interesting conundrums with what to do when data of one type meets data of another. Take this piece of Perl code:

    my $acute = chr(193);
    print $acute;

    my $identity = $acute . chr(194); chop $identity;
    print $identity;

    $identity = $acute . chr(257); chop $identity;
    print $identity;

Character 193 in Latin 1 is a capital A with acute accent (Á), so when I run this, I would expect to see ÁÁÁ. This works nicely on Perl 5.8.0, but on Perl 5.6.0, I see ÁÁÃ.

This is a leakage of what's going on inside Perl's Unicode support. When our non-UTF-8 string ($acute) meets the UTF-8 string chr(257), Perl has to recode the original character in UTF-8 before concatenating it. This is to avoid situations where the original string contains valid UTF-8 representations of a completely different character. It's similar to the situation where you have to escape text before putting it inside HTML, as symbols like <, >, and & have different meanings there.

So our Á is now encoded in UTF-8, and when Perl 5.6.0 comes to print it out, it prints the UTF-8 bytes. The first byte is character 195, Ã. Oops. Perl 5.8.0 corrects this by attempting to downgrade strings from UTF-8 to Latin 1 when they're output to filehandles not explicitly marked as UTF-8, but it gives you an idea of the shenanigans that are required to make the byte-character duality work.
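
On a modern Perl the recoding is invisible in the output, but it can still be observed through the flag. A minimal sketch using the built-in utf8::is_utf8 function:

```perl
use strict;
use warnings;

my $acute    = chr(193);              # fits in one byte, flag off
my $identity = $acute . chr(257);     # chr(257) forces UTF-8 storage
chop $identity;                       # remove the wide character again

printf "acute flagged: %s\n",    utf8::is_utf8($acute)    ? "yes" : "no";
printf "identity flagged: %s\n", utf8::is_utf8($identity) ? "yes" : "no";
printf "same characters: %s\n",  $acute eq $identity      ? "yes" : "no";
```

The original string stays unflagged, while the copy keeps its flag after the chop, and the two still compare equal as characters.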

What does this mean for troubleshooting Unicode problems? Well, the most common problems occur when a scalar's internal UTF-8 flag is incorrectly set and Perl treats the string with the wrong semantics. If the flag is wrongly turned off, then Perl treats what should be a Unicode string as a sequence of bytes. These bytes are the UTF-8 encoding of the Unicode characters, because Perl's internal representation of Unicode has been accidentally exposed. If the flag is wrongly turned on, then Perl provides Unicode semantics for that scalar and treats whatever sequence of bytes were in the scalar as UTF-8. Perl's internals will expect the bytes to be valid UTF-8, and will issue loud warnings if they are not. The easiest way to get this internal flag incorrect is by marking a filehandle as UTF-8 when it is not, or forgetting to mark it when it is.

For instance, while writing this chapter, I had a sample file containing some non-ASCII text, encoded in UTF-8, and I ran the following code in a UTF-8-aware terminal:

    open IN, "<:utf8", "foo.utf8" or die $!;
    $a = <IN>;
    print $a;

I was mildly surprised to get gibberish thrown back at me (since I know how the internals store Unicode) until I remembered what was going on: standard output was not marked as expecting UTF-8, so Perl automatically downgraded the string to Latin 1 on printing it. The downgraded string was not valid UTF-8, so my terminal went mad. The upshot is that this new-fangled Unicode-aware Perl code didn't work on a new-fangled Unicode-aware terminal, although it works just fine on an old-fashioned Latin 1 terminal.
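
The downgrade can be reproduced without a terminal by printing to in-memory filehandles. This is only a sketch: the single sample character and the in-memory handles are assumptions standing in for the file and for standard output.

```perl
use strict;
use warnings;

my $text = chr(233);            # e with acute accent
utf8::upgrade($text);           # stored as UTF-8, like data read via :utf8

# Handle with no layer: Perl downgrades to one Latin 1 byte per character.
open my $plain_fh, ">", \my $plain_out or die "Cannot open: $!";
print {$plain_fh} $text;
close $plain_fh;

# Handle marked as UTF-8: the character is written as its UTF-8 bytes.
open my $utf8_fh, ">:utf8", \my $utf8_out or die "Cannot open: $!";
print {$utf8_fh} $text;
close $utf8_fh;

printf "unmarked: %d byte(s), :utf8: %d byte(s)\n",
    length($plain_out), length($utf8_out);
```

The unmarked handle receives one byte and the marked handle receives two, which suggests that calling binmode(STDOUT, ":utf8") before printing is another way to keep a UTF-8 terminal happy.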

The Encode module allows you to generate the UTF-8 encoding of any Perl string with encode("utf8", $string).

    use Encode;

    open IN, "<:utf8", "foo.utf8" or die $!;
    $a = <IN>;
    $b = encode("utf8", $a);
    print $b;

This made my UTF-8 terminal happy again, because Perl's output is a string of bytes that is valid UTF-8. Perl doesn't know (or care) that the characters $b contains happen to be UTF-8. They're just characters between 0 and 255, and as standard output is taking bytes (the default), it will output one byte per character. If we were to ask Perl for the lengths of the two strings, we'd see that $a had 8 characters and $b had 15. As internals gurus we know that they are probably stored in memory as the same sequence of bytes, but the interface Perl presents to the programmer is that strings are built from characters, and how those characters are stored should remain hidden.

If you have the opposite problem (data that you believe to be Unicode but which Perl is still storing as a sequence of UTF-8 bytes), you can convert a string to Unicode using decode("utf8", $string). These functions can be handy for ensuring that data coming into or going out of your routines will be in the form you expect.
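
A routine that guards its own boundaries might look like the following sketch. shout_utf8 is a hypothetical function invented for this example: it accepts UTF-8 bytes, works on characters internally, and hands UTF-8 bytes back.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# shout_utf8() is a hypothetical routine: UTF-8 bytes in, UTF-8 bytes out,
# with all text processing done on decoded characters in between.
sub shout_utf8 {
    my ($bytes) = @_;
    my $chars = decode("UTF-8", $bytes);    # bytes -> characters
    return encode("UTF-8", uc $chars);      # characters -> bytes
}

my $input  = "caf\xC3\xA9";                 # "cafe" with an acute e, as UTF-8 bytes
my $output = shout_utf8($input);
print join(" ", map { sprintf("%02X", ord($_)) } split(//, $output)), "\n";
```

Because uc runs on decoded characters, the accented letter is uppercased too; the final two bytes come out as C3 89 instead of being passed through untouched.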

So far we haven't worked out how to determine whether any given string uses byte or character semantics, because the Perl way is that you shouldn't have to care and Perl should transparently do the right thing. But since we're discussing how to deal with situations where Perl is not doing the right thing, let's look at how to deal with the UTF-8 flag directly.

Encode provides three internal-use functions that we can import on demand: is_utf8, _utf8_on, and _utf8_off.

Let's suppose we've just read some data from an I/O socket, using read. By default, Perl will assume that this data has byte semantics. The only thing that can determine whether the string is bytes or UTF-8 encoded characters is the specification for the protocol that we're reading: are we expecting to see UTF-8 data? If we are, then we can take advantage of our knowledge that Perl stores its Unicode strings internally as UTF-8. We just need some way of telling Perl to treat the data that it just read as Unicode. _utf8_on comes to our rescue here:

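    use Encode qw(_utf8_on);
    my ($length, $data);
    # Assumes the protocol sends the length as two ASCII digits; a binary
    # length prefix would need unpack("n", $length) before use.
    read(SOCKET, $length, 2);
    read(SOCKET, $data, $length);
    _utf8_on($data);    # trusts the bytes: no validity check is performed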

Now we can use $data with the correct semantics. There is another way to achieve the same effect without using Encode; whether it is considered more or less ugly is a matter of taste. It relies on the new U modifier to pack: pack("U", $number) is now equivalent to chr($number). The difference is that if U is the first template in the call to pack, it is guaranteed to return a UTF-8-on string:

    use Encode qw(is_utf8);

    $s1 = chr(70);
    print "String 1 is ", (is_utf8($s1) ? "" : "not "), "UTF-8 encoded\n";

    $s2 = pack("C", 70);
    print "String 2 is ", (is_utf8($s2) ? "" : "not "), "UTF-8 encoded\n";

    $s3 = pack("U", 70);
    print "String 3 is ", (is_utf8($s3) ? "" : "not "), "UTF-8 encoded\n";

This produces:

    String 1 is not UTF-8 encoded
    String 2 is not UTF-8 encoded
    String 3 is UTF-8 encoded

To force a string to be treated as containing Unicode characters, we create a pack format that begins with U, but packs zero characters. Internally, pack creates a string with the UTF-8 flag set. Then we fill the string up with ordinary characters using the C* pattern. This special pattern tells pack to ignore whether the scalars are internally encoded as UTF-8 and to directly use the raw bytes stored, so it will fill up the string with whatever UTF-8 encoded bytes you throw in. You're directly manipulating the internal representation of scalars here, so you need to be sure of what you're doing: pack won't check that the UTF-8 sequence it is building is valid. In this case, as long as we pass in valid UTF-8 byte sequences, all will be fine. The end result is to turn on Perl's internal UTF-8 flag without changing the raw bytes, which makes Perl treat those bytes as Unicode characters. The code to do this looks like this:

    $string = pack("U0C*", unpack("C*", $string));
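
Both flag-flipping approaches trust the incoming bytes. When the data comes from an untrusted source, a checked decode is a safer alternative (a different technique from the flag manipulation described here). The sketch below uses Encode's FB_CROAK and LEAVE_SRC check constants; checked_utf8 is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# checked_utf8() is a hypothetical helper: it returns the decoded characters,
# or undef if the bytes are not well-formed UTF-8.
sub checked_utf8 {
    my ($bytes) = @_;
    my $chars = eval {
        # FB_CROAK makes decode die on malformed input;
        # LEAVE_SRC keeps decode from modifying $bytes.
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;
}

# One valid sequence, then two bytes that can never appear in UTF-8.
for my $sample ("\xC3\x81", "\xFF\xFE") {
    my $chars = checked_utf8($sample);
    print defined($chars)
        ? "valid, " . length($chars) . " character(s)\n"
        : "rejected\n";
}
```

The first sample decodes to one character and the second is rejected, whereas _utf8_on would have flagged both without complaint.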

Another useful feature is the bytes pragma, which lexically turns off any kind of UTF-8 processing and allows you to see any string as its byte representation, no matter what:

    open IN, "<:utf8", "foo.utf8" or die $!;
    $a = <IN>;
    chomp $a;

    print length $a; # 8

    {
      use bytes;
      print length $a; # 15
    }

This can be handy if we're dealing with data that has to be sent over a network connection, or packed into a fixed-length structure.
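
An alternative to use bytes for this job, one that avoids peeking at the internal representation, is to encode explicitly and measure the result. The sketch below assumes a made-up wire format (a 2-byte big-endian length followed by that many UTF-8 bytes); pack_message and unpack_message are hypothetical helpers.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical wire format: 2-byte big-endian length, then the UTF-8 bytes.
sub pack_message {
    my ($chars) = @_;
    my $bytes = encode("UTF-8", $chars);
    return pack("n/a*", $bytes);        # the length counts bytes, not characters
}

sub unpack_message {
    my ($packet) = @_;
    my ($bytes) = unpack("n/a*", $packet);
    return decode("UTF-8", $bytes);
}

my $packet = pack_message("caf" . chr(233));   # four characters
printf "%d bytes on the wire for %d characters\n",
    length($packet), length(unpack_message($packet));
```

Here four characters occupy seven bytes on the wire: two for the length prefix and five for the encoded text.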
