6.6. Unicode for XS Authors

If you write XS routines, Unicode means a whole new set of rules for processing strings. Standard C tricks for iterating over the characters in a string no longer work in the Unicode world. Instead, Perl provides a series of functions and macros that make handling Unicode strings a little easier.

6.6.1. Traversing Strings

The first problem everyone comes across is that they have a large amount of legacy code that assumes that everything is in some seven- or eight-bit character encoding, and they can write:

    while (*s) {
       /* Do something with *s here */
       s++;
    }

Along comes a string that has its data encoded in UTF-8, and it all goes horribly wrong. What can we do about this situation?

First, we should take note that this situation means we can no longer pass raw char* strings around; we need to know whether or not such a C string is encoded in UTF-8. The most obvious way to do that is to pass around SVs instead of char*s, but where this isn't possible, you either need to use an explicit interface convention between the functions of your XS code, or pass around a boolean denoting the UTF-8 encoding of the string.
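One way to picture the "pass a boolean along with the string" convention is a small struct that bundles the pointer, its byte length, and the encoding flag. This is only a sketch in plain C: the `flagged_str` type and the `flagged_str_chars` function are hypothetical names invented for illustration and are not part of the Perl API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the Perl API: a C string that
   carries its own byte length and encoding flag. */
typedef struct {
    const char *buf;     /* string data */
    size_t      len;     /* length in bytes, not characters */
    bool        is_utf8; /* true if buf holds UTF-8, false if one byte is one character */
} flagged_str;

/* Count characters. For UTF-8, count every byte that is not a
   continuation byte (10xxxxxx); otherwise every byte is a character.
   Assumes well-formed input. */
size_t flagged_str_chars(flagged_str s)
{
    size_t i, n = 0;

    if (!s.is_utf8)
        return s.len;

    for (i = 0; i < s.len; i++) {
        if (((unsigned char)s.buf[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The same five bytes are reported as four characters when the flag is set and as five when it is not, which is exactly why the flag has to travel with the pointer.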

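Once we have a string and know whether it's supposed to be encoded in UTF-8, we can use some of Perl's Unicode handling functions to help us walk along it. The most obviously useful one is utf8_to_uvchr, which pulls a code point out of a string and reports how many bytes it occupied (newer Perls deprecate it in favor of utf8_to_uvchr_buf, which also takes a pointer to the end of the buffer):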

    STRLEN len;
    while (*s) {
         /* Decode one character; len receives its length in bytes */
         UV c = utf8_to_uvchr((U8 *)s, &len);
         printf("Saw a character with codepoint %" UVuf ", length %lu\n",
                c, (unsigned long)len);
         s += len;
    }

Perl deals with Unicode codepoints as UVs, unsigned integer values. This actually gives Perl support for UTF-8 characters beyond the range that the Unicode Standard provides, but that's OK. Maybe they'll catch up with us one day.
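To see roughly what pulling a code point out of a UTF-8 string involves, here is a minimal decoder in plain C. It is a sketch under stated assumptions, not Perl's implementation: `decode_utf8_char` is a hypothetical name, the input is assumed to be well-formed, and only the standard one- to four-byte sequences are handled.

```c
#include <stddef.h>

/* Hypothetical stand-in for Perl's decoder, for illustration only.
   Decodes one code point from well-formed UTF-8 at s, stores the number
   of bytes consumed in *len, and returns the code point. Handles
   sequences of up to four bytes and performs no validation. */
unsigned long decode_utf8_char(const unsigned char *s, size_t *len)
{
    unsigned long c;
    size_t n, i;

    if (s[0] < 0x80)      { c = s[0];        n = 1; }  /* 0xxxxxxx */
    else if (s[0] < 0xE0) { c = s[0] & 0x1F; n = 2; }  /* 110xxxxx */
    else if (s[0] < 0xF0) { c = s[0] & 0x0F; n = 3; }  /* 1110xxxx */
    else                  { c = s[0] & 0x07; n = 4; }  /* 11110xxx */

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    for (i = 1; i < n; i++)
        c = (c << 6) | (s[i] & 0x3F);

    *len = n;
    return c;
}
```

A caller walks the string the same way as in the loop above: decode, then advance the pointer by the returned length.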

If you want to avoid extra work in the case of invariant characters (those that look just the same in UTF-8 and in byte encodings), you can use the UTF8_IS_INVARIANT( ) macro to test for this:

    while (*s) {
         if (UTF8_IS_INVARIANT(*s)) {
            /* Use *s just like in the good old ASCII days */
            s++;
         } else {
            STRLEN len;
            UV c = utf8_to_uvchr(s, &len);
            /* Do the Unicode thing. */
            s += len;
         }
    }

If you're not interested in looking at the Unicode characters, you can just skip over them, but you have to do this in a sensible way. If you just skip the first byte in the character, you can end up horribly misaligned and seeing characters that aren't there. Instead, use the UTF8SKIP( ) macro to fetch the length of the character, and use that to skip over it:

    while (*s) {
         if (UTF8_IS_INVARIANT(*s)) {
            /* Use *s just like in the good old ASCII days */
            s++;
         } else {
            /* Don't care about these scary high characters */
            s += UTF8SKIP(s);   /* UTF8SKIP takes the pointer, not the byte */
         }
    }
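The idea behind skipping by character length can be sketched in plain C: the first byte of a character tells you how many bytes the character occupies. The names `utf8_char_len` and `count_ascii` are hypothetical, and this sketch only covers standard one- to four-byte UTF-8, which may be narrower than what Perl's own macro accepts.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the idea behind UTF8SKIP: given a pointer
   to the first byte of a character, return the character's byte length.
   A stray continuation byte is reported as length 1 so that a loop
   always makes progress. */
size_t utf8_char_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (*s < 0xC0) return 1;   /* 10xxxxxx: stray continuation byte */
    if (*s < 0xE0) return 2;   /* 110xxxxx */
    if (*s < 0xF0) return 3;   /* 1110xxxx */
    return 4;                  /* 11110xxx */
}

/* Count the ASCII characters in a NUL-terminated UTF-8 string, skipping
   each multi-byte character as a unit. */
size_t count_ascii(const unsigned char *s)
{
    size_t n = 0;

    while (*s) {
        if (*s < 0x80) {
            n++;
            s++;
        } else {
            size_t skip = utf8_char_len(s);
            /* Advance by the character length, but never past the NUL
               if the string is truncated mid-character. */
            while (skip-- && *s)
                s++;
        }
    }
    return n;
}
```

Skipping by the reported length keeps the pointer on character boundaries, so continuation bytes are never mistaken for characters in their own right.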

6.6.2. Encoding Strings

As well as getting data out of strings, we might occasionally find ourselves wanting to put Unicode characters into a string. We can do this in a number of ways. First, we can enter characters one codepoint at a time, much in the same way as we traversed strings one character at a time. When getting Unicode codepoints out of strings, we used utf8_to_uvchr, so it should be no surprise that to put Unicode codepoints into strings, we can use uvchr_to_utf8. As UTF-8 is a variable-length encoding, we cannot infer the number of bytes needed to store our string from the number of characters, so allocating the correct amount of memory is tricky. The easiest thing to do is loop twice, once to work out the number of bytes needed, and once to act.

    /* Convert an array of numbers into a Unicode string */
    I32 len, i;
    STRLEN bytes = 0;   /* named to avoid shadowing the C library's strlen() */
    SV* sv;
    U8* s;

    len = av_len(av) + 1;

    /* First pass: count the bytes the encoded string will need */
    for (i = 0; i < len; i++) {
        SV** sav = av_fetch(av, i, 0);
        if (! sav) continue;
        bytes += UNISKIP(SvUV(*sav));
    }

    /* Allocate space for the string */
    sv = newSV(bytes);
    s = (U8*)SvPVX(sv);

    /* Second pass: encode each codepoint into the buffer */
    for (i = 0; i < len; i++) {
        SV** sav = av_fetch(av, i, 0);
        if (! sav) continue;
        s = uvchr_to_utf8(s, SvUV(*sav));
    }

    /* Perl internally expects a NUL byte after every buffer, so write one */

    *s = '\0';

    /* Tell Perl how long our scalar is, that it has a valid string
    buffer, and that the buffer holds UTF-8 */

    SvCUR_set(sv, bytes);
    SvPOK_on(sv);
    SvUTF8_on(sv);

As can be seen from this example, uvchr_to_utf8 returns the advanced pointer after the new character has been added. This is the recommended UTF-8-aware way of adding a character to a buffer, unlike *s++ = c;, which assumes all characters are the same size. The UNISKIP macro returns the number of bytes required to UTF-8-encode a Unicode codepoint.
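The measure-then-write pattern can also be tried outside of Perl. The following plain C sketch uses hypothetical stand-ins (`utf8_encoded_len` for the length calculation, `utf8_encode_char` for the encoder, `utf8_encode_all` for the two-pass driver) and assumes every code point is at most U+10FFFF; it is not Perl's implementation.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical stand-in for the length macro: bytes needed to encode c.
   Assumes c <= 0x10FFFF. */
size_t utf8_encoded_len(unsigned long c)
{
    if (c < 0x80)    return 1;
    if (c < 0x800)   return 2;
    if (c < 0x10000) return 3;
    return 4;
}

/* Hypothetical stand-in for the encoder: write c at d and return the
   pointer advanced past the bytes written. */
unsigned char *utf8_encode_char(unsigned char *d, unsigned long c)
{
    static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
    size_t n = utf8_encoded_len(c);
    size_t i;

    if (n == 1) {
        *d = (unsigned char)c;
        return d + 1;
    }
    /* Fill the continuation bytes from the end, six bits at a time. */
    for (i = n - 1; i > 0; i--) {
        d[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    d[0] = (unsigned char)(lead[n] | c);
    return d + n;
}

/* Two passes: measure, then allocate and write. Stores the byte count
   (excluding the trailing NUL) in *bytes. The caller frees the result.
   Returns NULL if allocation fails. */
unsigned char *utf8_encode_all(const unsigned long *cps, size_t count,
                               size_t *bytes)
{
    size_t i, total = 0;
    unsigned char *buf, *d;

    for (i = 0; i < count; i++)
        total += utf8_encoded_len(cps[i]);

    buf = malloc(total + 1);
    if (!buf)
        return NULL;

    d = buf;
    for (i = 0; i < count; i++)
        d = utf8_encode_char(d, cps[i]);
    *d = '\0';

    *bytes = total;
    return buf;
}
```

Because the encoder hands back the advanced pointer, the writing loop never needs to know how wide the previous character was.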

If you have a string that is Unicode but stored as bytes instead of UTF-8, you can use the sv_utf8_upgrade function, which converts an existing SV to UTF-8. Conversely, if you have a string that is valid UTF-8 but Perl doesn't know that fact yet, you can use the SvUTF8_on(sv) macro to turn on the UTF-8 flag:

    sv_gets(sv, fp, 0);
    /* But we expect that to be Unicode */
    SvUTF8_on(sv);

Of course, the problem here is that we haven't checked that the data really is valid UTF-8 before telling Perl that it is. We can do this with is_utf8_string to avoid problems later:

    STRLEN len;
    char *s;

    sv_gets(sv, fp, 0);
    s = SvPV(sv, len);
    if (is_utf8_string((U8 *)s, len)) {
       SvUTF8_on(sv);
    } else {
       /* Not really UTF-8--what is going on? */
    }
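For readers who want to see what such a validity check looks at, here is a plain C sketch. `looks_like_utf8` is a hypothetical name, and the rules it applies (standard UTF-8 only: no overlong forms, no surrogates, nothing above U+10FFFF) are an assumption of this example that may be stricter than Perl's own check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validity check, for illustration only. Returns true if
   the len bytes at s form well-formed standard UTF-8. */
bool looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned long c, min;
        size_t n, k;

        if (b < 0x80)                { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 2; c = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 3; c = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 4; c = b & 0x07; min = 0x10000; }
        else return false;              /* stray continuation or bad lead byte */

        if (len - i < n)
            return false;               /* character truncated by end of buffer */

        for (k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;           /* expected a continuation byte */
            c = (c << 6) | (s[i + k] & 0x3F);
        }

        if (c < min)                    return false;  /* overlong encoding */
        if (c > 0x10FFFF)               return false;  /* beyond Unicode */
        if (c >= 0xD800 && c <= 0xDFFF) return false;  /* surrogate */

        i += n;
    }
    return true;
}
```

Data in a single-byte encoding such as Latin-1 usually fails a check like this as soon as it contains a high-bit character, which is what makes the test worthwhile before setting the flag.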

Transcoding with XS is quite tricky, and you are best off doing that stage in Perl. There are plans to allow easy transcoding from C in the future, but for the moment, the only available option is to do something like this to get an Encode::XS object:

    ENTER;
    SAVETMPS;

    PUSHMARK(sp);
    XPUSHp("euc-jp", 6);
    PUTBACK;
    call_pv("Encode::find_encoding", G_SCALAR);
    SPAGAIN;
    encoding_obj = POPs;
    PUTBACK;

And then use this object to perform decoding and encoding:

    PUSHMARK(sp);
    XPUSHs(encoding_obj);
    XPUSHs(euc_data);
    XPUSHi(0);
    PUTBACK;
    if (call_method("decode", G_SCALAR) != 1) {
        Perl_die(aTHX_ "panic: decode did not return a value");
    }
    SPAGAIN;
    uni = POPs;
    PUTBACK;

It isn't pretty, but it works. The code in ext/PerlIO/encoding/encoding.xs in the Perl source tree is probably the only example of this around at the moment.
