Section 4.2.  Object Serialization

4.2. Object Serialization

Now we want to move on from the relatively simple key-value mechanism of DBMs to the matter of saving and restoring more complex Perl data structures, chiefly objects. These data structures are interesting and more difficult than scalars, because they come in many shapes and sizes: an object may be a blessed hash, or it might be a blessed array, which could itself contain any number and any depth of nesting of hashes, including other objects, arrays, scalars, or even code references.

While we could reassemble all our data structures from their original sources every time a program is run, the more complex our structures become, the more efficient it is to be able to store and restore them wholesale. Serialization is the process of representing complex data structures in a binary or text format from which they can be faithfully reconstructed later. In this section we're going to look at the various techniques that have been developed to do this, again with reference to their implementation in CPAN modules.

4.2.1. Our Schema and Classes

To compare the different techniques here and in the rest of the chapter, we're going to use the same set of examples: some Perl classes whose objects we want to be somehow persistent. The schema and classes are taken from the example application used by Class::DBI: a database of CDs in a collection, with information about the tracks, artists, bands, singers, and so on.

We'll create our classes using the Class::Accessor::Assert module, which not only creates constructors and accessors for the data slots we want, but also ensures that relationships are handled by constraining the type of data that goes in the slots. So, for instance, the CD class would look like this:

    package CD;
    use base "Class::Accessor::Assert";
    __PACKAGE__->mk_accessors(qw(
       artist=CD::Artist title publishdate=Time::Piece songs=ARRAY
    ));

This checks that artist is a CD::Artist object, that publishdate is a Time::Piece object, and that songs is an array reference. (Sadly, we can't check that it's an array of CD::Song objects, but this will do for now.) Notice that things are going to be slightly different between the schema and the Perl code; for instance, we don't need a separate class for CD::Track, which specifies the order of songs on a CD, because we can just do that with an array of songs.

With that in mind, the rest of the classes look like this:

    package CD::Song;
    use base 'Class::Accessor';
    __PACKAGE__->mk_accessors("title");

    package CD::Person;
    use base 'Class::Accessor::Assert';
    __PACKAGE__->mk_accessors(qw(gender haircolor birthdate=Time::Piece));

    package CD::Band;
    use base 'Class::Accessor::Assert';
    __PACKAGE__->mk_accessors( qw( members=ARRAY
                                   creationdate=Time::Piece
                                   breakupdate=Time::Piece ));

    package CD::Artist;
    use base 'Class::Accessor::Assert';
    __PACKAGE__->mk_accessors(qw( name popularity person band ));

    # Dispatch "band" accessors if it's a band
    for my $accessor (qw(members creationdate breakupdate)) {
        *$accessor = sub {
           my $self = shift;
           return $self->band->$accessor(@_) if $self->band
        };
    }

    # And dispatch "person" accessors if it's a person
    for my $accessor (qw(gender haircolor birthdate)) {
        *$accessor = sub {
           my $self = shift;
           return $self->person->$accessor(@_) if $self->person
        };
    }

Now we can create artists, tracks, and CDs, like so:

    my $tom = CD::Artist->new({ name => "Tom Waits",
                                person => CD::Person->new(  ) });

    $tom->popularity(2);
    $tom->haircolor("black");

    my $cd = CD->new({
       artist => $tom,
       title => "Rain Dogs",
       songs => [ map { CD::Song->new({title => $_ }) }
                  ("Singapore", "Clap Hands", "Cemetary Polka",
                   # ...
                  ) ]
    });

The rest of the chapter addresses how we can store these objects in a database and how we can use the classes as the frontend to an existing database.

4.2.2. Dumping Data

One basic approach would be to write out the data structure in full: that is, to write the Perl code that could generate the data structure, then read it in, and revive it later. That is, we would produce a file containing:

    bless( {
      'title' => 'Rain Dogs',
      'artist' => bless( {
           'popularity' => 2,
               'person' => bless( { 'haircolor' => 'black' }, 'CD::Person' ),
                 'name' => 'Tom Waits'
          }, 'CD::Artist' ),
      'songs' => [
        bless( { 'title' => 'Singapore'      }, 'CD::Song' ),
        bless( { 'title' => 'Clap Hands'     }, 'CD::Song' ),
        bless( { 'title' => 'Cemetary Polka' }, 'CD::Song' ),
        # ...
      ],
    }, 'CD' )

and later use do to reconstruct this data structure. This process is known as serialization, since it turns the complex, multidimensional data structure into a flat piece of text. The most common module used to do the kind of serialization shown above is the core module Data::Dumper.
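As a minimal sketch of that write-then-do cycle, the following dumps a cut-down stand-in for the CD object to a temporary file and reads it back. The single-song structure and the file name cd.pl are illustrative assumptions, and $Data::Dumper::Terse (explained below) is set so that the file holds a bare expression that do can return.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempdir);

# A cut-down stand-in for the CD object; no class definitions are needed
# just to bless and dump it.
my $cd = bless {
    title => "Rain Dogs",
    songs => [ bless({ title => "Singapore" }, "CD::Song") ],
}, "CD";

# Terse drops the "$VAR1 =" prefix so the file is a single expression.
local $Data::Dumper::Terse = 1;

my $file = tempdir(CLEANUP => 1) . "/cd.pl";   # illustrative file name
open my $out, ">", $file or die "cannot write $file: $!";
print {$out} Dumper($cd);
close $out or die "cannot close $file: $!";

# do FILE compiles and runs the file, returning its last expression.
my $copy = do $file;
die "could not reload $file: " . ($@ || $!) unless defined $copy;

print ref($copy), ": ", $copy->{title}, "\n";
```

Because do executes whatever the file contains, this approach is only appropriate for files you wrote yourself.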

This process of serialization is also incredibly important during the debugging process; by dumping out a representation of a data structure, it's very easy to check whether it contains what you think it should. In fact, pretty much my only debugging tool these days is a carefully placed:

    use Data::Dumper; die Dumper($whatever);

If you're using the Data::Dumper module for serializing objects, however, there's a little more you need to know about it than simply the Dumper subroutine. First, by default, Dumper's output will not just be the raw data structure but will be an assignment statement setting the variable $VAR1 to the contents of the data structure.

You may not want your data to go into a variable called $VAR1, so there are two ways to get rid of this: first, you can set $Data::Dumper::Terse = 1, which will return the raw data structure without the assignment, which you can then assign to whatever you like; second, you can provide a variable name for Data::Dumper to use instead of $VAR1. This second method is advisable since having an assignment statement rather than a simple data structure dump allows Data::Dumper to resolve circular data structures. Here's an example that sets up a circular data structure:

    my $dum = { name => "Tweedle-Dum" };
    my $dee = { name => "Tweedle-Dee" };
    $dee->{brother} = $dum;
    $dum->{brother} = $dee;

If we dump $dum using the Data::Dumper defaults, we get:

    $VAR1 = {
              'brother' => {
                             'brother' => $VAR1,
                             'name' => 'Tweedle-Dee'
                           },
              'name' => 'Tweedle-Dum'
            };

This is fine for debugging but cannot be used to reconstruct the variable later, since $VAR1 is probably undef while the hash is being put together. Instead, you can set $Data::Dumper::Purity = 1 to output additional statements to fix up the references:

    $VAR1 = {
              'brother' => {
                             'brother' => {  },
                             'name' => 'Tweedle-Dee'
                           },
              'name' => 'Tweedle-Dum'
            };
    $VAR1->{'brother'}{'brother'} = $VAR1;

Naturally, this is something that we're going to need when we're using Data::Dumper to record real data structures, but it cannot be done without the additional assignments and, hence, a variable name. You have two choices when using Data::Dumper for serialization: either you can specify the variable name you want, like so:

    use Data::Dumper;
    $Data::Dumper::Purity = 1;
    open my $out, ">", "dum.pl" or die $!;
    # Dumper() only takes values; naming the variable needs the Dump method
    print $out Data::Dumper->Dump([ $dee ], [ "dee" ]);

or you can just make do with $VAR1 and use local when you re-eval the code.
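Putting the pieces together, here is a sketch of a full round trip for the circular pair: dump with Purity and an explicit variable name, then re-evaluate the text into a fresh variable. The inner do block and the $revived name are illustrative choices, not part of Data::Dumper itself.

```perl
use strict;
use warnings;
use Data::Dumper;

# The circular pair from the text.
my $dum = { name => "Tweedle-Dum" };
my $dee = { name => "Tweedle-Dee" };
$dee->{brother} = $dum;
$dum->{brother} = $dee;

# Purity adds the fix-up assignments needed for the cycle.
local $Data::Dumper::Purity = 1;

# The Dump class method takes the values and, in parallel, their names.
my $code = Data::Dumper->Dump([ $dee ], [ "dee" ]);

# The dumped text assigns to $dee, so declare a fresh lexical of that
# name in an inner scope and evaluate the text there.
my $revived = do {
    my $dee;
    eval $code;
    die "could not re-evaluate dump: $@" if $@;
    $dee;
};

print $revived->{brother}{brother}{name}, "\n";   # Tweedle-Dee
```

The revived structure is a copy: its cycle points back at the new hash, not at the original $dee.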

Data::Dumper has spawned a host of imitators, but none more successful than YAML (YAML Ain't Markup Language). This is another text-based data serialization format that is not Perl-specific and is also optimized for human readability. Using YAML's Dump or DumpFile on the Tom Waits CD gives us:

    --- #YAML:1.0 !perl/CD
    artist: !perl/CD::Artist
      name: Tom Waits
      person: !perl/CD::Person
        haircolor: black
      popularity: 2
    songs:
      - !perl/CD::Song
        title: Singapore
      - !perl/CD::Song
        title: Clap Hands
      - !perl/CD::Song
        title: Cemetary Polka
      ...
    title: Rain Dogs

This is more terse and, hence, easier to follow than the equivalent Data::Dumper output; although with Data::Dumper, at least you're reading Perl. Once you know that YAML uses key: value to specify a hash pair, - element for an array element, indentation for nesting data structures, and ! for language-specific processing instructions, it's not hard.

YAML uses a system of references and links to notate circular structures; Tweedle-Dum looks like this:

    --- #YAML:1.0 &1
    brother:
      brother: *1
      name: Tweedle-Dee
    name: Tweedle-Dum

The *1 is a reference to the target &1 at the top, stating that Tweedle-Dee's brother slot points back to the top-level structure itself. This is much neater, as it means you can save and restore objects without messing about with what the variable name ought to be. To restore an object with YAML, use Load or LoadFile:

    use YAML;

    my $dum = YAML::Load(<<EOF);
    --- #YAML:1.0 &1
    brother:
      brother: *1
      name: Tweedle-Dee
    name: Tweedle-Dum
    EOF

    print $dum->{brother}{brother}{name}; # Tweedle-Dum

4.2.3. Storing and Retrieving Data

As well as the text-based serialization methods, such as Data::Dumper and YAML, there are also binary serialization formats; the core module Storable is the most well known and widely used of these, but the CPAN module FreezeThaw deserves an honorable mention.

Storable can store and retrieve data structures directly to a file, like so:

    use Storable;
    store $dum, "dum.storable";

    # ... later ...

    my $dum = retrieve("dum.storable");
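A self-contained sketch of the same store and retrieve cycle, using the circular pair from earlier to show that Storable keeps the cycle intact within a single stored structure. The temporary directory is an assumption made so the example cleans up after itself.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# The circular pair again.
my $dum = { name => "Tweedle-Dum" };
my $dee = { name => "Tweedle-Dee", brother => $dum };
$dum->{brother} = $dee;

# Write the whole structure reachable from $dum to one file.
my $file = tempdir(CLEANUP => 1) . "/dum.storable";
store($dum, $file) or die "cannot store to $file";

# Read it back as a new, independent copy.
my $copy = retrieve($file);

print $copy->{brother}{name}, "\n";
print "cycle preserved\n" if $copy->{brother}{brother} == $copy;
```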

This technique is used by the CPANPLUS module to store a parsed representation of the CPAN module tree. This is perhaps the ideal use of serialization: when you have a very large data structure that was created by parsing a big chunk of data that would be costly to reparse. For our examples, where we have many relatively small chunks of interrelated data, the process has a problem.

4.2.4. The Pruning Problem

The problem is that we serialize every reference or object that we store, but the serializations don't refer to each other. It's as if each object is the root of a tree, and everything else is subordinate to it; unfortunately, that's not always the case. As a simple example, let's take our two variables in circular reference. When we serialize and store them, our serializer sees the two variables like this:

    $dum = {
              'brother' => {
                             'brother' => $dum,
                             'name' => 'Tweedle-Dee'
                           },
              'name' => 'Tweedle-Dum'
            };
    $dee = {
              'brother' => {
                             'brother' => $dee,
                             'name' => 'Tweedle-Dum'
                           },
              'name' => 'Tweedle-Dee'
            };

We've been serializing them one at a time, so the serializer is forced to serialize everything it needs to fully retrieve either one of these two variables; this means it has to repeat information. In the worst case, where all the data structures we store are interconnected, each and every piece of data we store will have to contain the data for the whole set. If there were some way to prune the data, so that the serializer saw:

    $dum = {
              'brother' => (PLEASE RETRIEVE $dee FOR THIS DATA),
              'name' => 'Tweedle-Dum'
            };
    $dee = {
              'brother' => (PLEASE RETRIEVE $dum FOR THIS DATA),
              'name' => 'Tweedle-Dee'
            };

then all would be well. But that requires a lot more organization. We'll see techniques to handle that later in the chapter.
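The duplication is easy to observe. This sketch freezes the two brothers separately with Storable and checks that each frozen image carries both names, and that thawing the two images yields unrelated copies. The %image hash and the substring check are illustrative; the check relies on Storable writing string data as raw bytes.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my $dum = { name => "Tweedle-Dum" };
my $dee = { name => "Tweedle-Dee", brother => $dum };
$dum->{brother} = $dee;

# Serialize each variable on its own, as we have been doing so far.
my %image = (dum => freeze($dum), dee => freeze($dee));

for my $who (sort keys %image) {
    my $both = index($image{$who}, "Tweedle-Dum") >= 0
            && index($image{$who}, "Tweedle-Dee") >= 0;
    printf "%s image: %d bytes, holds both names: %s\n",
        $who, length $image{$who}, $both ? "yes" : "no";
}

# Restoring both gives four hashes where we started with two:
# the link between the restored $dum and the restored $dee is gone.
my $new_dum = thaw($image{dum});
my $new_dee = thaw($image{dee});
print "shared after thaw: ",
    ($new_dum->{brother} == $new_dee ? "yes" : "no"), "\n";
```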

4.2.5. Multilevel DBMs

Besides the pruning problem, there's another problem with the file-based serialization we've been using so far. If we're dealing with more than one data structure (which programs tend to do), we need to either put everything we want to deal with into one big array or hash and store and retrieve that, which is very inefficient, or we have a huge number of files around and we have to work out how we're going to manage them.

DBM files are one solution, as they relate one thing (an ID or variable name for the data structure) to another (the data structure itself) and hence organize individual data structures in a single file in a random-access way. However, when we last left DBMs, we were lamenting the fact that they cannot store and retrieve complex data structures, only scalars. But now that we've seen a way of turning a complex data structure into a scalar and back again, we can use these serialization techniques to get around the limitations of DBMs.

There are two ways of doing this: the new and reckless way, or the old and complicated way. We'll start with the new and reckless way since it demonstrates the idea very well.

In recent versions of Perl, there's a facility for adding filter hooks onto DBM access. That is, when you store a value into the database, a user-defined subroutine gets called to transform the data and, likewise, when you retrieve a value from the database. Your subroutine gets handed $_, you do what you need to it, and the transformed value gets used in the DBM. This filter facility has many uses. For instance, you can compress the data that you're storing to save space:

    use DB_File;
    use Compress::Zlib;

    $db = tie %hash, "DB_File", "music.db" or die $!;
    $db->filter_store_value(sub { $_ = compress($_)   });
    $db->filter_fetch_value(sub { $_ = uncompress($_) });

Or you can null-terminate your strings, for both keys and values, to ensure that C programs can use the same database file:

    $db->filter_fetch_key  ( sub { s/\0$//    } ) ;
    $db->filter_store_key  ( sub { $_ .= "\0" } ) ;
    $db->filter_fetch_value( sub { s/\0$//    } ) ;
    $db->filter_store_value( sub { $_ .= "\0" } ) ;

Or you can do what we want to do, which is to use Storable's freeze and thaw functions to serialize any references we get passed:

    use Storable qw(freeze thaw);

    $db->filter_store_value( sub { $_ = freeze($_) } );
    $db->filter_fetch_value( sub { $_ = thaw($_)   } );

That's the easy way, but it has some disadvantages. First, it ties you down, as it were, to using Storable for your storage. It also requires the DBM filter facility, which came into Perl in version 5.6.0; this shouldn't be much of a problem these days, but you never know. The most serious disadvantage, however, is that it's unfamiliar to other programmers, which means maintenance coders may not appreciate the significance of these two lines in your program.
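Here is a complete, runnable sketch of the filter approach. It assumes the core SDBM_File module in place of DB_File so that nothing beyond a stock Perl is needed; SDBM limits each key/value pair to roughly 1 KB, so it only suits small records like this one. The file name and the stored hash are illustrative.

```perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use SDBM_File;
use Storable qw(freeze thaw);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $db  = tie my %hash, "SDBM_File", "$dir/music", O_RDWR | O_CREAT, 0666
    or die "cannot tie: $!";

# Freeze references on the way in, thaw them on the way out.
$db->filter_store_value(sub { $_ = freeze($_) });
$db->filter_fetch_value(sub { $_ = thaw($_)   });

# The value is a nested structure, yet the DBM only ever sees a string.
$hash{album1} = { title => "Rain Dogs", artist => { name => "Tom Waits" } };

my $album = $hash{album1};
print "$album->{title} by $album->{artist}{name}\n";
```

With these filters installed every value must be a reference, since freeze does not accept plain scalars.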

The way to scream to the world that you're using a multilevel DBM is to use the MLDBM module. Eventually, this ought to be rewritten to use the DBM filter hooks, but you don't need to care about that. MLDBM abstracts both the underlying DBM module and the serialization module, like so:

    use MLDBM qw(DB_File Storable); # Use a Sleepycat DB and Storable

    tie %hash, "MLDBM", "music.db" or die $!;

    my $tom = CD::Artist->new({ name => "Tom Waits",
                              person => CD::Person->new(  ) });
    $tom->popularity(1);

    $hash{"album1"} = CD->new({
          artist => $tom,
          title  => "Rain Dogs",
          songs  => [ map { CD::Song->new({title => $_ }) }
                      ("Singapore", "Clap Hands", "Cemetary Polka",
                       # ...
                      ) ]
    });

We could also choose FreezeThaw or Data::Dumper to do the serialization, or any of the other DBM drivers for the storage.

One thing people expect to be able to do with MLDBM, but can't, is write to intermediate references. Let's say we have a simple hash of hashes:

    use MLDBM qw(DB_File Storable); # Use a Sleepycat DB and Storable
    tie %hash, "MLDBM", "hash.db" or die $!;
    $hash{test} = { "Hello" => "World" };

This works fine. But when we do:

    $hash{test}->{Hello} = "Mother";

the assignment seems to have no effect. In short, you can't store to intermediate references using MLDBM. If you think about how MLDBM works, this is quite obvious. Our assignment has done a fetch, which has produced a new data structure by thawing the scalar in the database. Then we've modified that data structure. However, modifying the data structure doesn't cause a STORE call to write the new data to the database; STORE is only called when we write directly to the tied hash. So to get the same effect, we need the rather more ugly:

    $hash{test} = { %{$hash{test}}, Hello => "Mother" };
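The same fetch-without-store behavior can be reproduced without MLDBM. This sketch swaps in the core SDBM_File module with Storable filters (an assumption made so it runs on a stock Perl) and shows both the lost update and a fetch, modify, store-back alternative using a hypothetical $entry temporary.

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use Storable qw(freeze thaw);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $db  = tie my %hash, "SDBM_File", "$dir/hash", O_RDWR | O_CREAT, 0666
    or die "cannot tie: $!";
$db->filter_store_value(sub { $_ = freeze($_) });
$db->filter_fetch_value(sub { $_ = thaw($_)   });

$hash{test} = { Hello => "World" };

# This only changes a freshly thawed copy; nothing is written back.
$hash{test}{Hello} = "Mother";
my $after_nested = $hash{test}{Hello};
print "after nested write: $after_nested\n";      # still World

# Fetch once, modify the copy, then assign to the tied hash to force STORE.
my $entry = $hash{test};
$entry->{Hello} = "Mother";
$hash{test} = $entry;
my $after_store = $hash{test}{Hello};
print "after write-back:   $after_store\n";       # now Mother
```

The temporary variable does the same job as the one-line hash-merge idiom above, but avoids copying every key when only one needs to change.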


Since MLDBM uses a deep serializer, our example not only stores the CD object, but also the CD::Song objects and the CD::Artist object. When we retrieve album1 again, everything is available.

4.2.6. Pixie

The Pixie module from CPAN is an automated, ready-made implementation of all that we've been talking about in this section. It uses Storable to serialize objects, and then stores them in a data store: a relational database using DBI by default, but you can also define your own stores.

Pixie has two advantages over the hand-knit method we've used. First, and most important, it solves the pruning problem: it retrieves each new object in the data structure as it's referenced, rather than pulling everything in as a lump. If, for instance, we have a tree data structure where every object can see every other object, something based on MLDBM would have to read the entire tree structure into memory when we fetched any object in it. That's bad. Pixie doesn't do that.

The other advantage, and the way Pixie gets around this first problem, is that it stores each new object in the data structure separately. So when we stored our Tom Waits CD with MLDBM, we serialized the whole thing, including all the CD::Song and CD::Artist objects, into a scalar and stored that. If we stored a different CD by the same artist, we'd serialize all of its data, including the CD::Artist object, into a scalar and store that as well. We now have two copies of the same artist data stored in two different albums. This can only get worse. In the worst case of a tree structure, every object we serialize and store will have to contain the entire contents of the tree. That's bad. Pixie doesn't do that, either.

To demonstrate using Pixie, we'll use the default DBI data store. Before we can start storing objects, we first have to deploy the data store; that is, set up the tables that Pixie wants to deal with. We do this as a separate setup process before we use Pixie the first time:

    use Pixie::Store::DBI;
    Pixie::Store::DBI->deploy("dbi:mysql:dbname=pixie");

The deploy method creates new tables, so it will fail if the tables already exist. Now if we have pure-Perl, pure-data objects, Pixie just works. Let's take our Rain Dogs CD again, since that's what I was listening to when I wrote this chapter:

    my $cd = CD->new({
       artist => $tom,
       title => "Rain Dogs",
       songs => [ map { CD::Song->new({title => $_ }) }
                  ("Singapore", "Clap Hands", "Cemetary Polka",
                   # ...
                  ) ]
    });
    my $pixie = Pixie->new->connect("dbi:mysql:dbname=pixie");
    my $cookie = $pixie->insert($cd);

This will store the data and return a GUID (globally unique identifier); mine was EAAC3A08-F6AA-11D8-96D6-8C22451C8AE2, and yours hopefully will not be. Now I can use this GUID in a completely different program, and I get the data back:

    use Pixie;
    use CD;
    my $pixie = Pixie->new->connect("dbi:mysql:dbname=pixie");
    my $cd = $pixie->get("EAAC3A08-F6AA-11D8-96D6-8C22451C8AE2");

    print $cd->artist->name; # "Tom Waits"

Notice that Pixie has not only stored the CD object that we asked it about, but it has also stored the CD::Artist, CD::Person and all the CD::Song objects that related to it. It only retrieves them, however, when we make the call to the relevant accessor. It's very clever.

For our purposes, that's all there is to Pixie, but that's because our purposes are rather modest. Pixie works extremely well when all the data belonging to an object is accessible from Perl space: a blessed hash or blessed array reference. However, objects implemented by XS modules often have data that's not available from Perl: C data structures referred to by pointers, for instance. In that case, Pixie doesn't know what to do and requires help from the programmer to explain how to store and reconstruct the objects.

We'll use a pure Perl example, however, to demonstrate what's going on. In our example, we have a bunch of Time::Piece objects in our storage. If these were instead DateTime objects, we'd have to store all this every time we store a date:

    $VAR1 = bless( {
                     'tz' => bless( {
                                      'name' => 'UTC'
                                    }, 'DateTime::TimeZone::UTC' ),
                     'local_c' => {
                                    'quarter' => 3,
                                    'minute' => 13,
                                    'day_of_week' => 7,
                                    'day' => 19,
                                    'day_of_quarter' => 81,
                                    'month' => 9,
                                    'year' => 2004,
                                    'hour' => 13,
                                    'second' => 3,
                                    'day_of_year' => 263
                                  },
                     ...,
                   }, 'DateTime' );

This is not amazingly efficient, just to store what can be represented by an epoch time. Even though this is all pure Perl data, we can make it a bit tidier by making DateTime complicit with Pixie.

To do this, we implement a few additional methods in the DateTime namespace. First we use a proxy object to store the essential information about the DateTime object:

    sub DateTime::px_freeze {
        my $datetime = shift;
        bless [ $datetime->epoch ], "Proxy::DateTime";
    }

Now when Pixie comes to store a DateTime object, all it does instead is convert it to a Proxy::DateTime object that knows the epoch time and stores that instead.[*] Next, we need to be able to go from the proxy to the real DateTime object, when it is retrieved from the database. Remember that this needs to be a method on the proxy object, so it lives in the Proxy::DateTime namespace:

[*] Design pattern devotees call this the "memento" pattern.

    sub Proxy::DateTime::px_thaw {
        my $proxy = shift;
        DateTime->from_epoch(epoch => $proxy->[0]);
    }
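The freeze/thaw pair can also be exercised by hand, without Pixie, to see what the proxy buys us. This sketch assumes the core Time::Piece class in place of DateTime and a hypothetical Proxy::TimePiece class; the px_freeze and px_thaw names mirror the hooks above, but here we call them ourselves and use Storable directly, rather than letting Pixie do the work.

```perl
use strict;
use warnings;
use Time::Piece;                 # overrides gmtime to return an object
use Storable qw(freeze thaw);

# Reduce the date to a small proxy that knows only the epoch time.
sub Time::Piece::px_freeze {
    my $time = shift;
    return bless [ $time->epoch ], "Proxy::TimePiece";
}

# Rebuild the real object from the proxy.
sub Proxy::TimePiece::px_thaw {
    my $proxy = shift;
    my $time  = gmtime($proxy->[0]);
    return $time;
}

my $date = gmtime(1095599583);   # the moment shown in the dump above

# Called by hand here; a persistence layer would call these hooks for us.
my $stored = freeze($date->px_freeze);   # serialize the proxy only
my $back   = thaw($stored)->px_thaw;     # revive proxy, then the real object

print length($stored), " bytes stored\n";
print $back->datetime, "\n";
```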

Some objects, like blessed scalars or code refs, are a bit more tricky to serialize. Because of this, Pixie won't serialize anything other than hash- or array-based classes, unless we explicitly tell it that we've handled the serialization ourselves:

    sub MyModule::px_is_storable { 1 }

And that, really, is all there is to it.
