wftk tutorial

Storing data in a tab-delimited file
Specifying the fields for a flat file
Reading data from an existing file
Using the file adapter for non-file storage
Repository configuration II: permanently configuring a storage list with file-like data predefined

The last section introduced the list construct used by the wftk for encapsulating data. Lists stored in memory are useful for intermediate results, and it's certainly fun to write SQL queries against lists of hashes we write on the fly -- but the real purpose of the list data structure, of course, is to represent data which is stored in some permanent storage location.

For each type of data storage, the wftk provides a subclass of Workflow::wftk::Data, the class that defines in-memory lists. The first and simplest of these stores data as individual tab-delimited lines in a flat file in the filesystem. It is called, appropriately enough, ``file:''.

The file adapter has some different variants it can support. For instance, it can either write a header line with the names of fields, or define the fields in its invocation -- either of these modes is useful in different circumstances.

Storing data in a tab-delimited file

The basic functionality of the file adapter is exactly like the in-memory list -- which is, of course, the whole point of formalizing the notion of lists of data. So let's run down through the basic operations we did for in-memory lists, and confirm that it all works.

   use Workflow::wftk;

   my $wftk = Workflow::wftk->new();

   # Create a list stored in 'test1.txt' without a header line:
   my $list = $wftk->open('mylist', 'file:test1.txt');

   # Add a record the easy way, with a hashref.
   my $key = $list->add ( {field1 => 'value1', field2 => 'value2'} );

   # Look at the fields in the list again:
   @fields = $list->fields();          # Returns ('field1', 'field2') now, just like a memory list.

   # Let's check what we just did:
   open IN, 'test1.txt';
   my $line = scalar <IN>;
   close IN;
   print $line;                        # This prints "value1\tvalue2\n". Note that we didn't save the
                                       # field names to the file -- but the list object still knows them.

   my $key2 = $list->add ( ['different value', 'value2'] );
   my $key3 = $list->add ( ['third value', 'value2'] );

   # Get a list of keys:
   my @keys = $list->keys();

   # Iterate through the list:
   for (my $key = $list->first_key(); $key; $key = $list->next_key()) {
      # ...
   }

   # Get a record:
   $rec = $list->get($key2);

   # Get a value:
   print "Value is " . $rec->get('field2');

   # Modify an existing record.
   $rec->put ('field1', 'new value 1');
   $list->mod ($key2, $rec);

   # Delete a record.
   $list->del ($key3);

   # Display the contents of the list:
   print $list->as_text;

   # The print statement above will print the following:
   # +-----+-------------+--------+
   # | key | field1      | field2 |
   # +-----+-------------+--------+
   # | 1   | value1      | value2 |
   # | 2   | new value 1 | value2 |
   # +-----+-------------+--------+

   # Display the contents of the file:
   system 'cat test1.txt';

   # That (if you're running Unix) will print this:
   # value1   value2
   # new value 1     value2

I won't go over the command-line, SQL, DBI, or hash tying interfaces in detail here, because it would be a waste of time. They all work exactly as you would expect. We'll test some of the other interfaces in our further testing below; this subsection was just a demonstration that file storage really and truly works.

Specifying the fields for a flat file

In the example above, we just created the list as an amorphous thing, and left it to the first ``add'' to define the fields for the file. When working with memory lists, this is often as structured as we need to be, but the point of working in the file system (besides persistence) is to be able to interface with other systems. In that case, we probably need to know the list of fields in advance, and so there is a ``fields'' parameter available for the list specifier. We've already seen it in 02-a Memory lists, but let's look at it again.

   use Workflow::wftk;

   my $wftk = Workflow::wftk->new();

   # Create a list stored in 'test1.txt' with specified fields.
   my $list = $wftk->open('mylist', 'file:test1.txt;fields=from_value,to_value');

   # Now we can add records without naming the fields.
   my $key = $list->add ( ['value1', 'value2'] );

   # Let's check what we just did:
   open IN, 'test1.txt';
   my $line = scalar <IN>;
   close IN;
   print $line;                        # This prints "value1\tvalue2\n", just as expected.

But what happens if we now add a new record using a hash? If the hash keys match the existing field names, nothing is changed; but if we introduce a new name, will the field be added?

   my $key2 = $list->add ( { value3 => 'different value'} );
   @fields = $list->fields(); # => returns ('from_value', 'to_value', 'value3')

It will, for the file adapter, because files are pretty flexible. But in a case where other systems expect a certain schema, this might constitute an error. In that case, you can set the fixed flag on the list definition, and the wftk will refuse to add fields that aren't already there. It won't be an error; they'll just be ignored if you try to add them.

   my $list2 = $wftk->open('list2', 'file:test2.txt;fields=from_value,to_value;fixed');
   $key = $list2->add (['value1', 'value2']);
   $key2 = $list2->add ( { value3 => 'value3' });

   @fields = $list2->fields(); # => returns ('from_value', 'to_value');

And note that if we now retrieve the record just added, it's going to be very boring:

   @fields = $list2->get($key2)->values(); # => returns ('', '');
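
To make the difference between the two modes concrete, here is a minimal sketch of the behavior just described, in plain Perl. It doesn't use the wftk at all: the add_hash function and its arguments are invented for this illustration, and the real adapter may well be implemented quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the wftk: turn a hashref into a row
# for a list with the given fields. Unless $fixed is true, unknown hash
# keys are appended to @$fields as new fields; if it is true they are
# silently ignored.
sub add_hash {
    my ($fields, $rec, $fixed) = @_;
    unless ($fixed) {
        my %known = map { $_ => 1 } @$fields;
        push @$fields, grep { !$known{$_} } sort keys %$rec;
    }
    # Fields the record doesn't mention come out as empty strings.
    return [ map { defined $rec->{$_} ? $rec->{$_} : '' } @$fields ];
}

my @open_fields  = ('from_value', 'to_value');
my @fixed_fields = ('from_value', 'to_value');

my $row1 = add_hash(\@open_fields,  { value3 => 'different value' }, 0);
my $row2 = add_hash(\@fixed_fields, { value3 => 'value3' },          1);

print join(',', @open_fields),  "\n";   # from_value,to_value,value3
print join(',', @fixed_fields), "\n";   # from_value,to_value
print scalar(@$row2), " empty values\n" if !grep { length } @$row2;
```

The second call is the boring record from above: two fields, both empty, and no complaint about value3.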

Reading data from an existing file

Flat file data storage is a convenient way to cache data casually, because it can be edited by hand, with simple scripts, or with applications like Excel. Here's an example of how to generate data with a simple Perl script, then read it into a file list in the wftk.
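
A sketch of such an example might look like the following. The file name people.txt and the field names name and code are made up for illustration. The generating half is plain Perl; the wftk half uses only the open call, the ``fields'' specifier, and as_text as shown earlier in this chapter, and it is wrapped in an eval so that the script still runs on a machine where Workflow::wftk isn't installed.

```perl
use strict;
use warnings;

# Hypothetical file name and sample data, chosen only for this sketch.
my $file = 'people.txt';
my @rows = (
    [ 'Alice', 'AAA' ],
    [ 'Bob',   'BBB' ],
    [ 'Carla', 'CCC' ],
);

# Write one tab-delimited line per record, with no header line.
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} join("\t", @$_), "\n" for @rows;
close $out or die "Cannot close $file: $!";

# Read the file back into a file list. This part assumes that
# Workflow::wftk is installed and behaves as shown earlier in this
# chapter; if it isn't, the eval fails and we just print a warning.
eval {
    require Workflow::wftk;
    my $wftk = Workflow::wftk->new();
    my $list = $wftk->open('people', "file:$file;fields=name,code");
    print $list->as_text;
    1;
} or warn "Skipping the wftk half of the example: $@";
```

Since the file has no header line, the field names are supplied in the list specifier, exactly as in the previous subsection.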

Using the file adapter for non-file storage

Sometimes we have tab-delimited line data that isn't actually in a file. One choice is to write it to a file first, but does that sound like something I'd make you do? No, we can persuade the file adapter to treat non-file storage like file storage. The wftk already has a few examples: strings, streams, and fields in existing objects. You can also write your own pretty easily, but if I put that into this tutorial it's going to be a little further down the road.

Like files, the storage for non-files is updated when a change is made to the data. That means, for instance, that if I bind a string to a file list as its storage location, then make changes to the data in that list, the string will magically change. This is useful because I can then store it elsewhere at my leisure. And of course, the same applies to data stored in the field of another record somewhere; we'll use this later for something I call ``data in data'', which provides sublists and references within records.
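
Perl itself offers a small-scale version of the same trick, which may help make the idea concrete: a filehandle can be opened on a scalar reference, so that ordinary file writes change a string. The sketch below uses only core Perl and says nothing about how the wftk implements string storage; the variable names are made up.

```perl
use strict;
use warnings;

# The string that will act as our "file".
my $storage = "value1\tvalue2\n";

# Open an in-memory filehandle for appending to the string.
open my $fh, '>>', \$storage or die "Cannot open in-memory handle: $!";
print {$fh} join("\t", 'different value', 'value2'), "\n";
close $fh;

# The string now holds both tab-delimited lines.
my @records = map { [ split /\t/ ] } split /\n/, $storage;
print scalar(@records), " records, last field1 is '$records[-1][0]'\n";
```

Nothing wrote to $storage directly, yet it changed; that is the ``magic'' in miniature.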

A similar set of functionality is the ability to treat non-file resources as pseudofiles; for instance, we can give the list a string reference or an open __DATA__ section as input data (along the lines of DBD::File) and have it parse out a list for us. With a little more care, we can even have the list update the data source if data is changed.

The presence or absence of a header is given as a Boolean ``header'' parameter. Its default is ``off''; you can turn it on by simply mentioning it in the ad-hoc list specification string (e.g. open file:test.txt;header) or by adding a line +header to the list definition in the repository configuration. (For more on the syntax of these configuration parameters, see 02-h Records.)
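
What the header flag changes can be sketched in a few lines of plain Perl. The parse_lines function below is invented for this illustration and is not part of the wftk: with header on, field names come from the first line of the data; with it off, they have to be supplied by the caller, much as the ``fields'' parameter does.

```perl
use strict;
use warnings;

# Hypothetical parser, not the wftk's: split tab-delimited text into a
# list of hashrefs. With header => 1 the field names are taken from the
# first line; otherwise fields => [...] must name them.
sub parse_lines {
    my ($text, %opt) = @_;
    my @lines  = grep { length } split /\n/, $text;
    my @fields = $opt{header}
               ? split(/\t/, shift @lines)
               : @{ $opt{fields} || [] };
    my @records;
    for my $line (@lines) {
        my %rec;
        @rec{@fields} = split /\t/, $line, -1;   # -1 keeps trailing empty values
        push @records, \%rec;
    }
    return (\@fields, \@records);
}

# The same record, once with a header line and once without.
my ($f1, $r1) = parse_lines("name\tcode\nAlice\tAAA\n", header => 1);
my ($f2, $r2) = parse_lines("Alice\tAAA\n", fields => [ 'name', 'code' ]);

print "$r1->[0]{name} / $r2->[0]{code}\n";
```

Either way the caller ends up with the same fields and the same record; the only difference is where the names came from.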

Repository configuration II: permanently configuring a storage list with file-like data predefined

Of course, the repository definition mechanism shown at the end of the last chapter also works for file lists. And for file lists, we can even specify the data right in the repository definition; this is useful for relatively short, constant lists.

Here is an example of using a string to specify the configuration when opening a repository session. Note that the data for one list is stored in a string, and the other is loaded from a __DATA__ section.

This record specification uses a lot of features we haven't seen before. For the details of everything you're seeing here, you'll probably want to check 02-h Records, but in the meantime, just assume that everything you see is perfectly intuitive, because it is.

   use Workflow::wftk;

   my $wftk = Workflow::wftk->new(<<"   CONF");
       ! [mylist: list]
       ! type: memory:load_as=file;fields=field1,field2
       ! +readonly
       ! load_from: DATA
       !
       ! [second_list: list]
       ! type: memory
       ! load_as: file
       ! *fields: field1 field2 field3
       ! description: My second list
       ! data: <<EOF
       ! AAA\t234\tyes
       ! CCC\t123\tno
       ! DDD\t123\tyes
       ! EOF
       !
      CONF

   %lists = $wftk->lists();      # Before opening anything, our list of lists
                                 # shows us what we have defined.

   $wftk->execute_sql ("select mylist.field1, second_list.field3 from mylist, second_list
                        where mylist.field2=second_list.field1 order by mylist.field1");
   print $wftk->list()->as_text;

   __DATA__
   Alice   AAA
   Bob     BBB
   Carla   CCC
   Doug    DDD
   Edith   EEE
   Frank   FFF

This defines two lists, specifies their data in two different ways, and then executes an SQL query against both of them, dumping the result in conveniently formatted form. I know I'm expected to think this, but ... I love the wftk.
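
For readers who want to see what that query is asking for, here is the same join written out by hand in plain Perl, with no wftk involved. The data is copied from the configuration above; the variable names are invented, and this is only a sketch of the join's logic, not of how the wftk evaluates SQL.

```perl
use strict;
use warnings;

# mylist: field1 (a name), field2 (a code), as in the __DATA__ section.
my @mylist = map { [ split ' ' ] } (
    'Alice AAA', 'Bob BBB',   'Carla CCC',
    'Doug DDD',  'Edith EEE', 'Frank FFF',
);

# second_list: field1 (a code), field2, field3, as in the "data" entry.
my @second_list = (
    [ 'AAA', 234, 'yes' ],
    [ 'CCC', 123, 'no'  ],
    [ 'DDD', 123, 'yes' ],
);

# Index second_list by its field1 so each mylist row can find its match.
my %by_code = map { $_->[0] => $_ } @second_list;

# Keep rows where mylist.field2 = second_list.field1, ordered by
# mylist.field1, and select mylist.field1 and second_list.field3.
my @result = map  { [ $_->[0], $by_code{ $_->[1] }[2] ] }
             grep { exists $by_code{ $_->[1] } }
             sort { $a->[0] cmp $b->[0] } @mylist;

print join("\t", @$_), "\n" for @result;
```

Run as is, this sketch prints three rows (Alice/yes, Carla/no, Doug/yes), since Bob, Edith, and Frank have no matching code in the second list.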

That concludes the basic functionality for storage of data in delimited lines in a file or file-like object. The file storage class is not intended to provide great performance -- it's here for convenience. For performance, use a relational database such as MySQL (addressed in 02-d Storing Data in MySQL).