Приглашаем посетить

Section 1.1. Introspection

1.1. Introspection

First, though, introspection. These introspection techniques appear time and time again in advanced modules throughout the book. As such, they can be regarded as the most fundamental of the advanced techniqueseverything else will build on these ideas.

1.1.1. Preparatory Work: Fun with Globs

Globs are one of the most misunderstood parts of the Perl language, but at the same time, one of the most fundamental. This is a shame, because a glob is a relatively simple concept.

When you access any global variable in Perlthat is, any variable that has not been declared with mythe perl interpreter looks up the variable name in the symbol table. For now, we'll consider the symbol table to be a mapping between a variable's name and some storage for its value, as in Figure 1-1.

Note that we say that the symbol table maps to storage for the value. Introductory programming texts should tell you that a variable is essentially a box in which you can get and set a value. Once we've looked up $a, we know where the box is, and we can get and set the values directly. In Perl terms, the symbol table maps to a reference to $a.

Figure 1-1. Consulting the symbol table, take 1

You may have noticed that a symbol table is something that maps names to storage, which sounds a lot like a Perl hash. In fact, you'd be ahead of the game, since the Perl symbol table is indeed implemented using an ordinary Perl hash. You may also have noticed, however, that there are several things called a in Perl, including $a, @a, %a, &a, the filehandle a, and the directory handle a.

This is where the glob comes in. The symbol table maps a name like a to a glob, which is a structure holding references to all the variables called a, as in Figure 1-2.

Figure 1-2. Consulting the symbol table, take 2

As you can see, variable look-up is done in two stages: first, finding the appropriate glob in the symbol table; second, finding the appropriate part of the glob. This gives us a reference, and assigning it to a variable or getting its value is done through this reference.

1.1.1.1 Aliasing

This disconnect between the name look-up and the reference look-up enables us to alias two names together. First, we get hold of their globs using the *name syntax, and then simply assign one glob to another, as in Figure 1-3.

Figure 1-3. Aliasing via glob assignment

We've assigned b's symbol table entry to point to a's glob. Now any time we look up a variable like %b, the first stage look-up takes us from the symbol table to a's glob, and returns us a reference to %a.

The most common application of this general idea is in the Exporter module. If I have a module like so:

    package Some::Module;
    use base 'Exporter';
    our @EXPORT = qw( useful );

    sub useful { 42 }

then Exporter is responsible for getting the useful subroutine from the Some::Module package to the caller's package. We could mock our own exporter using glob assignments, like this:

    package Some::Module;
    sub useful { 42 }

    sub import {
    no strict 'refs';
    *{caller(  )."::useful"} = *useful;
    }

Remember that import is called when a module is used. We get the name of the calling package using caller and construct the name of the glob we're going to replacefor instance, main::useful. We use a symbolic reference to turn the glob's name, which is a string, into the glob itself. This is just the same as the symbolic reference in this familiar but unpleasant piece of code:

    $answer = 42;
    $variable = "answer";

    print ${$variable};

If we were using the recommended strict pragma, our program would die immediatelyand with good reason, since symbolic references should only be used by people who know what they're doing. We use no strict 'refs'; to tell Perl that we're planning on doing good magic with symbolic references.

Section 1.1. Introspection
Many advanced uses of Perl need to do some of the things that strict prevents the uninitiated from doing. As an initiated Perl user, you will occasionally have to turn strictures off. This isn't something to take lightly, but don't be afraid of it; strict is a useful servant, but a bad master, and should be treated as such.

Now that we have the *main::useful glob, we can assign it to point to the *useful glob in the current Some::Module package. Now all references to useful( ) in the main package will resolve to &Some::Module::useful.

That is a good first approximation of an exporter, but we need to know more.

1.1.1.2 Accessing parts of a glob

With our naive import routine above, we aliased main::useful by assigning one glob to another. However, this has some unfortunate side effects:

    use Some::Module;
    our $useful = "Some handy string";

    print $Some::Module::useful;

Since we've aliased two entire globs together, any changes to any of the variables in the useful glob will be reflected in the other package. If Some::Module has a more substantial routine that uses its own $useful, then all hell will break loose.

All we want to do is to put a subroutine into the &useful element of the *main::useful glob. If we were exporting a scalar or an array, we could assign a copy of its value to the glob by saying:

    ${caller(  )."::useful"} = $useful;
    @{caller(  )."::useful"} = @useful;

However, if we try to say:

    &{caller(  )."::useful"} = &useful;

then everything goes wrong. The &useful on the right calls the useful subroutine and returns the value 42, and the rest of the line wants to call a currently non-existant subroutine and assign its return value the number 42. This isn't going to work.

Thankfully, Perl provides us with a way around this. We don't have to assign the entire glob at once. We just assign a reference to the glob, and Perl works out what type of reference it is and stores it in the appropriate part, as in Figure 1-4.

Figure 1-4. Assigning to a glob's array part

Notice that this is not the same as @a=@b; it is real aliasing. Any changes to @b will be seen in @a, and vice versa:

    @b = (1,2,3,4);
    *a = \@b;

    push @b, 5;
    print @a; # 12345

    # However:
    $a = "Bye"
    $b = "Hello there!";
    print $a; # Bye

Although the @a array is aliased by having its reference connected to the reference used to locate the @b array, the rest of the *a glob is untouched; changes in $b do not affect $a.

You can write to all parts of a glob, just by providing the appropriate references:

    *a = \"Hello";
    *a = [ 1, 2, 3 ];
    *a = { red => "rouge", blue => "bleu" };

    print $a;        # Hello
    print $a[1];     # 2
    print $a{"red"}; # rouge

The three assignments may look like they are replacing each other, but each writes to a different part of the glob depending on the appropriate reference type. If the assigned value is a reference to a constant, then the variable's value is unchangeable.

    *a = \1234;
    $a = 10; # Modification of a read-only value attempted

Now we come to a solution to our exporter problem; we want to alias &main::useful and &Some::Module::useful, but no other parts of the useful glob. We do this by assigning a reference to &Some::Module::useful to *main::useful:

    sub useful { 42 }
    sub import {
    no strict 'refs';
    *{caller(  )."::useful"} = \&useful;
    }

This is similar to how the Exporter module works; the heart of Exporter is this segment of code in Exporter::Heavy::heavy_export:

    foreach $sym (@imports) {
        # shortcut for the common case of no type character
        (*{"${callpkg}::$sym"} = \&{"${pkg}::$sym"}, next)
            unless $sym =~ s/^(\W)//;

        $type = $1;
        *{"${callpkg}::$sym"} =
            $type eq '&' ? \&{"${pkg}::$sym"} :
            $type eq '$' ? \${"${pkg}::$sym"} :
            $type eq '@' ? \@{"${pkg}::$sym"} :
            $type eq '%' ? \%{"${pkg}::$sym"} :
            $type eq '*' ?  *{"${pkg}::$sym"} :
            do { require Carp; Carp::croak("Can't export symbol:$type$sym") };
        }

This has a list of imports, which have either come from the use Some::Module '...'; declaration or from Some::Module's default @EXPORT list. These imports may have type sigils in front of them, or they may not; if they do not, such as when you say use Carp 'croak';, then they refer to subroutines.

In our original case, we had set @EXPORT to ("useful"). First, Exporter checks for a type sigil and removes it:

    (*{"${callpkg}::$sym"} = \&{"${pkg}::$sym"}, next)
        unless $sym =~ s/^(\W)//;

Because $sym is "useful"with no type sigilthe rest of the statement executes with a result similar to:

    *{"${callpkg}::$sym"} = \&{"${pkg}::$sym"};
    next;

Plugging in the appropriate values, this is very much like our mock exporter:

    *{$callpkg."::useful"} = \&{"Some::Module::useful"};

On the other hand, where there is a type sigil the exporter constructs the reference and assigns the relevant part of the glob:

    *{"${callpkg}::$sym"} =
       $type eq '&' ? \&{"${pkg}::$sym"} :
       $type eq '$' ? \${"${pkg}::$sym"} :
       $type eq '@' ? \@{"${pkg}::$sym"} :
       $type eq '%' ? \%{"${pkg}::$sym"} :
       $type eq '*' ?  *{"${pkg}::$sym"} :
       do { require Carp; Carp::croak("Can't export symbol: $type$sym") };

Accessing Glob Elements

The *glob = ... syntax obviously only works for assigning references to the appropriate part of the glob. If you want to access the individual references, you can treat the glob itself as a very restricted hash: *a{ARRAY} is the same as \@a, and *a{SCALAR} is the same as \$a. The other magic names you can use are HASH, IO, CODE, FORMAT, and GLOB, for the reference to the glob itself. There are also the really tricky PACKAGE and NAME elements, which tell you where the glob came from.

These days, accessing globs by hash keys is really only useful for retrieving the IO element. However, we'll see an example later of how it can be used to work with glob references rather than globs directly.

1.1.1.3 Creating subroutines with glob assignment

One common use of the aliasing technique in advanced Perl is the assignment of anonymous subroutine references, and especially closures, to a glob. For instance, there's a module called Data::BT::PhoneBill that retrieves data from British Telecom's online phone bill service. The module takes comma-separated lines of information about a call and turns them into objects. An older version of the module split the line into an array and blessed the array as an object, providing a bunch of read-only accessors for data about a call:

    package Data::BT::PhoneBill::_Call;
    sub new {
      my ($class, @data) = @_;
      bless \@data, $class;
    }

    sub installation { shift->[0] }
    sub line         { shift->[1] }
    ...

Closures

A closure is a code block that captures the environment where it's definedspecifically, any lexical variables the block uses that were defined in an outer scope. The following example delimits a lexical scope, defines a lexical variable $seq within the scope, then defines a subroutine sequence that uses the lexical variable.

{ my $seq = 3; sub sequence { $seq += 3 } } print $seq; # out of scope print sequence; # prints 6 print sequence; # prints 9

Printing $seq after the block doesn't work, because the lexical variable is out of scope (it'll give you an error under use strict. However, the sequence subroutine can still access the variable to increment and return its value, because the closure { $seq += 3 } captured the lexical variable $seq.

See perlfaq7 and perlref for more details on closures.

Of course, the inevitable happened: BT added a new column at the beginning, and all of the accessors had to shift down:

    sub type         { shift->[0] }
    sub installation { shift->[1] }
    sub line         { shift->[2] }

Clearly this wasn't as easy to maintain as it should be. The first step was to rewrite the constructor to use a hash instead of an array as the basis for the object:

    our @fields = qw(type installation line chargecard _date time
                     destination _number _duration rebate _cost);

    sub new {
      my ($class, @data) = @_;
      bless { map { $fields[$_] => $data[$_] } 0..$#fields } => $class;
    }

This code maps type to the first element of @data, installation to the second, and so on. Now we have to rewrite all the accessors:

    sub type         { shift->{type} }
    sub installation { shift->{installation} }
    sub line         { shift->{line} }

This is an improvement, but if BT adds another column called friends_and_family_discount, then I have to type friends_and_family_discount three times: once in the @fields array, once in the name of the subroutine, and once in the name of the hash element.

It's a cardinal law of programming that you should never have to write the same thing more than once. It doesn't take much to automatically construct all the accessors from the @fields array:

    for my $f (@fields) {
        no strict 'refs';
        *$f = sub { shift->{$f} };
    }

This creates a new subroutine in the glob for each of the fields in the arrayequivalent to *type = sub { shift->{type} }. Because we're using a closure on $f, each accessor "remembers" which field it's the accessor for, even though the $f variable is out of scope once the loop is complete.

Creating a new subroutine by assigning a closure to a glob is a particularly common trick in advanced Perl usage.

1.1.2. AUTOLOAD

There is, of course, a simpler way to achieve the accessor trick. Instead of defining each accessor individually, we can define a single routine that executes on any call to an undefined subroutine. In Perl, this takes the form of the AUTOLOAD subroutinean ordinary subroutine with the magic name AUTOLOAD:

    sub AUTOLOAD {
        print "I don't know what you want me to do!\n";
    }

    yow(  );

Instead of dying with Undefined subroutine &yow called, Perl tries the AUTOLOAD subroutine and calls that instead.

To make this useful in the Data::BT::PhoneBill case, we need to know which subroutine was actually called. Thankfully, Perl makes this information available to us through the $AUTOLOAD variable:

    sub AUTOLOAD {
        my $self = shift;
        if ($AUTOLOAD =~ /.*::(.*)/) { $self->{$1} }

The middle line here is a common trick for turning a fully qualified variable name into a locally qualified name. A call to $call->type will set $AUTOLOAD to Data::BT::PhoneBill::_Call::type. Since we want everything after the last ::, we use a regular expression to extract the relevant part. This can then be used as the name of a hash element.

We may want to help Perl out a little and create the subroutine on the fly so it doesn't need to use AUTOLOAD the next time type is called. We can do this by assigning a closure to a glob as before:

    sub AUTOLOAD {
    if ($AUTOLOAD =~ /.*::(.*)/) {
       my $element = $1;
       *$AUTOLOAD = sub { shift->{$element} };
       goto &$AUTOLOAD;
    }

This time, we write into the symbol table, constructing a new subroutine where Perl expected to find our accessor in the first place. By using a closure on $element, we ensure that each accessor points to the right hash element. Finally, once the new subroutine is set up, we can use goto &subname to try again, calling the newly created Data::BT::PhoneBill::_Call::type method with the same parameters as before. The next time the same subroutine is called, it will be found in the symbol tablesince we've just created itand we won't go through AUTOLOAD again.

Section 1.1. Introspection
goto LABEL and goto &subname are two completely different operations, unfortunately with the same name. The first is generally discouraged, but the second has no such stigma attached to it. It is identical to subname(@_) but with one important difference: the current stack frame is obliterated and replaced with the new subroutine. If we had used $AUTOLOAD->(@_) in our example, and someone had told a debugger to set a breakpoint inside Data::BT::PhoneBill::_Call::type, they would see this backtrace:

. = Data::BT::PhoneBill::_Call::type ... . = Data::BT::PhoneBill::_Call::AUTOLOAD ... . = main::process_call

In other words, we've exposed the plumbing, if only for the first call to type. If we use goto &$AUTOLOAD, however, the AUTOLOAD stack frame is obliterated and replaced directly by the type frame:

. = Data::BT::PhoneBill::_Call::type ... . = main::process_call

It's also concievable that, because there is no third stack frame or call-return linkage to handle, the goto technique is marginally more efficient.

There are two things that every user of AUTOLOAD needs to know. The first is DESTROY. If your AUTOLOAD subroutine does anything magical, you need to make sure that it checks to see if it's being called in place of an object's DESTROY clean-up method. One common idiom to do this is return if $1 eq "DESTROY". Another is to define an empty DESTROY method in the class: sub DESTROY { }.

The second important thing about AUTOLOAD is that you can neither decline nor chain AUTOLOADs. If an AUTOLOAD subroutine has been called, then the missing subroutine has been deemed to be dealt with. If you want to rethrow the undefined-subroutine error, you must do so manually. For instance, let's limit our Data::BT::PhoneBill::_Call::AUTOLOAD method to only deal with real elements of the hash, and not any random rubbish or typo that comes our way:

    use Carp qw(croak);
    ...
    sub AUTOLOAD {
        my $self = shift;
        if ($AUTOLOAD =~ /.*::(.*)/ and exists $self->{$1}) {
            return $self->{$1}
        }
        croak "Undefined subroutine &$AUTOLOAD called"; }

1.1.3. CORE and CORE::GLOBAL

Two of the most misunderstood pieces of Perl arcana are the CORE and CORE::GLOBAL packages. These two packages have to do with the replacement of built-in functions. You can override a built-in by importing the new function into the caller's namespace, but it is not as simple as defining a new function.

For instance, to override the glob function in the current package with one using regular expression syntax, we either have to write a module or use the subs pragma to declare that we will be using our own version of the glob typeglob:

    use subs qw(glob);

    sub glob {
        my $pattern = shift;
        local *DIR;
        opendir DIR, "." or die $!;
        return grep /$pattern/, readdir DIR;
    }

This replaces Perl's built-in glob function for the duration of the package:

    print "$_\n" for glob("^c.*\\.xml");

    ch01.xml
    ch02.xml
    ...

However, since the <*.*> syntax for the glob operator is internally resolved to a call to glob, we could just as well say:

    print "$_\n" for <^c.*\\.xml>;

Neither of these would work without the use subs line, which prepares the Perl parser for seeing a private version of the glob function.

If you're writing a module that provides this functionality, all is well and good. Just put the name of the built-in function in @EXPORT, and the Exporter will do the rest.

Where do CORE:: and CORE::GLOBAL:: come in, then? First, if we're in a package that has an overriden glob and we need to get at Perl's core glob, we can use CORE::glob( ) to do so:

    @files = <ch.*xml>;      # New regexp glob
    @files = CORE::glob("ch*xml"); # Old shell-style glob

CORE:: always refers to the built-in functions. I say "refers to" as a useful fictionCORE:: merely qualifies to the Perl parser which glob you mean. Perl's built-in functions don't really live in the symbol table; they're not subroutines, and you can't take references to them. There can be a package called CORE, and you can happily say things like $CORE::a = 1. But CORE:: followed by a function name is special.

Because of this, we can rewrite our regexp-glob function like so:

    package Regexp::Glob;
    use base 'Exporter';
    our @EXPORT = qw(glob);

    sub glob {
        my $pattern = shift;
        return grep /$pattern/, CORE::glob("*");
    }
    1;

There's a slight problem with this. Importing a subroutine into a package only affects the package in question. Any other packages in the program will still call the built-in glob:

    use Regexp::Glob;
    @files = glob("ch.*xml");      # New regexp glob

    package Elsewhere;
    @files = glob("ch.*xml");      # Old shell-style glob

Our other magic package, CORE::GLOBAL::, takes care of this problem. By writing a subroutine reference into CORE::GLOBAL::glob, we can replace the glob function throughout the whole program:

    package Regexp::Glob;

    *CORE::GLOBAL::glob = sub {
        my $pattern = shift;
        local *DIR;
        opendir DIR, "." or die $!;
        return grep /$pattern/, readdir DIR;
    };

    1;

Now it doesn't matter if we change packagesthe glob operator and its <> alias will be our modified version.

So there you have it: CORE:: is a pseudo-package used only to unambiguously refer to the built-in version of a function. CORE::GLOBAL:: is a real package in which you can put replacements for the built-in version of a function across all namespaces.

1.1.4. Case Study: Hook::LexWrap

Hook::LexWrap is a module that allows you to add wrappers around subroutinesthat is, to add code to execute before or after a wrapped routine. For instance, here's a very simple use of LexWrap for debugging purposes:

    wrap 'my_routine',
       pre => sub { print "About to run my_routine with arguments @_" },
       post => sub { print "Done with my_routine"; }

The main selling point of Hook::LexWrap is summarized in the module's documentation:

Unlike other modules that provide this capacity (e.g. Hook::PreAndPost and Hook::WrapSub), Hook::LexWrap implements wrappers in such a way that the standard "caller" function works correctly within the wrapped subroutine.

It's easy enough to fool caller if you only have pre-hooks; you replace the subroutine in question with an intermediate routine that does the moral equivalent of:

    sub my_routine {
        call_pre_hook(  );
        goto &Real::my_routine;
    }

As we saw above, the goto &subname form obliterates my_routine's stack frame, so it looks to the outside world as though my_routine has been controlled directly.

But with post-hooks it's a bit more difficult; you can't use the goto & trick. After the subroutine is called, you want to go on to do something else, but you've obliterated the subroutine that was going to call the post-hook.

So how does Hook::LexWrap ensure that the standard caller function works? Well, it doesn't; it actually provides its own, making sure you don't use the standard caller function at all.

Hook::LexWrap does its work in two parts. The first part assigns a closure to the subroutine's glob, replacing it with an imposter that arranges for the hooks to be called, and the second provides a custom CORE::GLOBAL::caller. Let's first look at the custom caller:

    *CORE::GLOBAL::caller = sub {
        my ($height) = ($_[0]||0);
        my $i=1;
        my $name_cache;
        while (1) {
            my @caller = CORE::caller($i++) or return;
            $caller[3] = $name_cache if $name_cache;
            $name_cache = $caller[0] eq 'Hook::LexWrap' ? $caller[3] : '';
            next if $name_cache || $height-- != 0;
            return wantarray ? @_ ? @caller : @caller[0..2] : $caller[0];
        }
    };

The basic idea of this is that we want to emulate caller, but if we see a call in the Hook::LexWrap namespace, then we ignore it and move on to the next stack frame. So we first work out the number of frames to back up the stack, defaulting to zero. However, since CORE::GLOBAL::caller itself counts as a stack frame, we need to start the counting internally from one.

Next, we do a slight bit of trickery. Our imposter subroutine is compiled in the Hook::LexWrap namespace, but it has the name of the original subroutine it's emulating. So if we see something in Hook::LexWrap, we store its subroutine name away in $name_cache and then skip over it, without decrementing $height. If the thing we see is not in Hook::LexWrap, but comes directly after something that is, we replace its subroutine name with the one from the cache. Finally, once $height gets down to zero, we can return the appropriate bits of the @caller array.

By doing this, we've created our own replacement caller function, which hides the existence of stack frames in the Hook::LexWrap package, but in all other ways behaves the same as the original caller. Now let's see how our imposter subroutine is built up.

Most of the wrap routine is actually just about argument checking, context propagation, and return value handling; we can slim it down to the following for our purposes:

    sub wrap (*@) {
        my ($typeglob, %wrapper) = @_;
        $typeglob = (ref $typeglob || $typeglob =~ /::/)
            ? $typeglob
            : caller(  )."::$typeglob";
        my $original = ref $typeglob eq 'CODE'
                       ? $typeglob
                       : *$typeglob{CODE};
        $imposter = sub {
            $wrapper{pre}->(@_) if $wrapper{pre};
            my @return = &$original;
            $wrapper{post}->(@_) if $wrapper{post};
            return @return;
        };
        *{$typeglob} = $imposter;
    }

To make our imposter work, we need to know two things: the code we're going to run and where it's going to live in the symbol table. We might have been either handed a typeglob (the tricky case) or the name of a subroutine as a string. If we have a string, the code looks like this:

    $typeglob = $typeglob =~ /::/ ? $typeglob : caller(  )."::$typeglob";
    my $original = *$typeglob{CODE};

The first line ensures that the now badly named $typeglob is fully qualified; if not, it's prefixed with the calling package. The second line turns the string into a subroutine reference using the glob reference syntax.

In the case where we're handed a glob like *to_wrap, we have to use some magic. The wrap subroutine has the prototype (*$); here is what the perlsub documentation has to say about * prototypes:

A "*" allows the subroutine to accept a bareword, constant, scalar expression, typeglob, or reference to a typeglob in that slot. The value will be available to the subroutine either as a simple scalar or (in the latter two cases) as a reference to the typeglob.

So if $typeglob turns out to be a typeglob, it's converted into a glob reference, which allows us to use the same syntax to write into the code part of the glob.

The $imposter closure is simple enoughit calls the pre-hook, then the original subroutine, then the post-hook. We know where it should go in the symbol table, and so we redefine the original subroutine with our new one.

So this relatively complex module relies purely on two tricks that we have already examined: first, globally overriding a built-in function using CORE::GLOBAL::, and second, saving away a subroutine reference and then glob assigning a new subroutine that wraps around the original.

1.1.5. Introspection with B

There's one final category of introspection as applied to Perl programs: inspecting the underlying bytecode of the program itself.

When the perl interpreter is handed some code, it translates it into an internal code, similar to other bytecode-compiled languages such as Java. However, in the case of Perl, each operation is represented as the node on a tree, and the arguments to each operation are that node's children.

For instance, from the very short subroutine:

    sub sum_input {
        my $a = <>;
        print $a + 1;
    }

Perl produces the tree in Figure 1-5.

Figure 1-5. Bytecode tree

The B module provides functions that expose the nodes of this tree as objects in Perl itself. You can examineand in some cases modifythe parsed representation of a running program.

There are several obvious applications for this. For instance, if you can serialize the data in the tree to disk, and find a way to load it up again, you can store a Perl program as bytecode. The B::Bytecode and ByteLoader modules do just this.

Those thinking that they can use this to distribute Perl code in an obfuscated binary format need to read on to our second application: you can use the tree to reconstruct the original Perl code (or something quite like it) from the bytecode, by essentially performing the compilation stage in reverse. The B::Deparse module does this, and it can tell us a lot about how Perl understands different code:

    % perl -MO=Deparse -n -e '/^#/ || print'

    LINE: while (defined($_ = <ARGV>)) {
        print $_ unless /^#/;
    }

This shows us what's really going on when the -n flag is used, the inferred $_ in print, and the logical equivalence of X || Y and Y unless X.^[*] (Incidentally, the Omodule is a driver that allows specified B::* modules to do what they want to the parsed source code.)

^[*] The -MO=Deparse flag is equivalent to use O qw(Deparse);.

To understand how these modules do their work, you need to know a little about the Perl virtual machine. Like almost all VM technologies, Perl 5 is a software CPU that executes a stream of instructions. Many of these operations will involve putting values on or taking them off a stack; unlike a real CPU, which uses registers to store intermediate results, most software CPUs use a stack model.

Perl code enters the perl interpreter, gets translated into the syntax tree structure we saw before, and is optimized. Part of the optimization process involves determining a route through the tree by joining the ops together in a linked list. In Figure 1-6, the route is shown as a dotted line.

Figure 1-6. Optimized bytecode tree

Each node on the tree represents an operation to be done: we need to enter a new lexical scope (the file); set up internal data structures for a new statement, such as setting the line number for error reporting; find where $a lives and put that on the stack; find what filehandle <> refers to; read a line from that filehandle and put that on the stack; assign the top value on the stack (the result) to the next value down (the variable storage); and so on.

There are several different kinds of operators, classified by how they manipulate the stack. For instance, there are the binary operatorssuch as addwhich take two values off the stack and return a new value. readline is a unary operator; it takes a filehandle from the stack and puts a value back on. List operators like print take a number of values off the stack, and the nullary pushmark operator is responsible for putting a special mark value on the stack to tell print where to stop.

The B module represents all these different kinds of operators as subclasses of the B::OP class, and these classes contain methods allowing us to get the next module in the execution order, the children of an operator, and so on.

Similar classes exist to represent Perl scalar, array, hash, filehandle, and other values. We can convert any reference to a B:: object using the svref_2object function:

    use B;

    my $subref = sub {
        my $a = <>;
        print $a + 1;
    };

    my $b = B::svref_2object($subref); # B::CV object

This B::CV object represents the subroutine reference that Perl can, for instance, store in the symbol table. To look at the op tree inside this object, we call the START method to get the first node in the linked list of the tree's execution order, or the ROOT method to find the root of the tree.

Depending on which op we have, there are two ways to navigate the op tree. To walk the tree in execution order, you can just follow the chain of next pointers:

    my $op = $b->START;

    do {
        print B::class($op). " : ". $op->name." (".$op->desc.")\n";
    } while $op = $op->next and not $op->isa("B::NULL");

The class subroutine just converts between a Perl class name like B::COP and the underlying C equivalent, COP; the name method returns the human-readable name of the operation, and desc gives its description as it would appear in an error message. We need to check that the op isn't a B::NULL, because the next pointer of the final op will be a C null pointer, which B handily converts to a Perl object with no methods. This gives us a dump of the subroutine's operations like so:

    COP : nextstate (next statement)
    OP : padsv (private variable)
    PADOP : gv (glob value)
    UNOP : readline (<HANDLE>)
    COP : nextstate (next statement)
    OP : pushmark (pushmark)
    OP : padsv (private variable)
    SVOP : const (constant item)
    BINOP : add (addition (+))
    LISTOP : print (print)
    UNOP : leavesub (subroutine exit)

As you can see, this is the natural order for the operations in the subroutine. If you want to examine the tree in top-down order, something that is useful for creating things like B::Deparse or altering the generated bytecode tree with tricks like optimizer and B::Generate, then the easiest way is to use the B::Utils module. This provides a number of handy functions, including walkoptree_simple. This allows you to set a callback and visit every op in a tree:

    use B::Utils qw( walkoptree_simple );
    ...
    my $op = $b->ROOT;

    walkoptree_simple($op, sub{
        $cop = shift;
        print B::class($cop). " : ". $cop->name." (".$cop->desc.")\n";
    });

Note that this time we start from the ROOT of the tree instead of the START; traversing the op tree in this order gives us the following list of operations:

    UNOP : leavesub (subroutine exit)
    LISTOP : lineseq (line sequence)
    COP : nextstate (next statement)
    UNOP : null (null operation)
    OP : padsv (private variable)
    UNOP : readline (<HANDLE>)
    PADOP : gv (glob value)
    COP : nextstate (next statement)
    LISTOP : print (print)
    ...

Working with Perl at the op level requires a great deal of practice and knowledge of the Perl internals, but can lead to extremely useful tools like Devel::Cover, an op-level profiler and coverage analysis tool.

Table of Contents