Using the wftk: repository definition

Whenever you talk to the wftk, you are actually talking to a specific repository. The repository can be thought of as the context for an application, and there may be any number of repositories on a single physical machine. By default, a wftk repository occupies a directory in the file system of its home machine. You might be able to use the APIs to convince the wftk to work with a dynamically-defined repository which is not bound to a given directory, but I've never tried it. So to simplify the situation, let's just assume that the repository is a directory.

Within that directory is the repository definition, a single XML file which defines the data resources (and other things) used by that repository. When you start up a conversation with a wftk system, the first thing the wftk does is to read the repository definition so it knows the context in which you will be working.

For instance, when starting the command-line repmgr (repository manager) under Windows, you might use the following command:

C:\projects\sites\test>repmgr -r site.opm
+000: Repository open.
repmgr v1.0 listening: type 'help' for help.
++done++

Note the use of the -r parameter to name the file containing the repository definition. That file is opened and read into an XML structure, which is then kept for the duration of the session. If you are speaking to a SOAP system, the file is read when the SOAP server is started; under AOLserver, it is read when AOLserver is started. When using the C or Python API, you start by opening a repository, and you pass the open repository structure into all subsequent calls to the API.

No matter how you end up addressing the repository, the definition file is always in the same format, and the explanation of that format is the purpose of this document.

Here is how this document is structured:

The simplest possible repository
More complicated object structures (LIST_localdir storage)
Storing data in relational databases

The simplest possible repository

A repository's basic component is a set of lists. Each list corresponds to a set of uniquely keyed objects made up of fields. (Objects can also contain things other than fields, but we'll worry about that later.) The simplest system possible, then, is a repository with a single list. To make things very simple, we'll store the list as tab-delimited fields in a file.

<site>
  <list id="mylist" storage="delim:mylist.txt">
    <field id="field1"/>
    <field id="field2"/>
  </list>
</site>

I'm going to assume you understand the basics of how XML works (if not, this document and the wftk are both going to be tough going). Even if you don't, though, this is a pretty simple file. The key is the <list> tag, which defines the list within the repository. The id attribute identifies the list as mylist, and the storage attribute defines where mylist will be stored.

If you look more closely at the storage spec, you'll see that it consists of an adaptor name delim, followed by a colon, followed by a filename. This tells the wftk that the data in this list is managed using the LIST_delim adaptor and the adaptor then knows that the file to store the data in should be mylist.txt. Each adaptor knows how to handle its own storage specs; some may also use additional attributes on the list definition to store additional information about the storage of the data.
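To make the shape of a storage spec concrete, here is a minimal Python sketch that reads a definition like the one above and splits each storage attribute at the first colon. This is not the wftk's own code or API: the names split_storage_spec and lists_in_definition are hypothetical, and the sketch assumes only the "adaptor name, colon, remainder" layout just described.

```python
import xml.etree.ElementTree as ET

# The example definition from this section, inlined so the sketch is self-contained.
DEFINITION = """
<site>
  <list id="mylist" storage="delim:mylist.txt">
    <field id="field1"/>
    <field id="field2"/>
  </list>
</site>
"""

def split_storage_spec(spec):
    """Split 'adaptor:rest' at the first colon (hypothetical helper)."""
    adaptor, _, rest = spec.partition(":")
    return adaptor, rest

def lists_in_definition(xml_text):
    """Map each list id to (adaptor, adaptor-specific part, field ids)."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for lst in root.findall("list"):
        adaptor, rest = split_storage_spec(lst.get("storage", ""))
        fields = [f.get("id") for f in lst.findall("field")]
        result[lst.get("id")] = (adaptor, rest, fields)
    return result

if __name__ == "__main__":
    for list_id, (adaptor, rest, fields) in lists_in_definition(DEFINITION).items():
        print(list_id, adaptor, rest, fields)
```

Everything after the colon is handed back untouched, since each adaptor is responsible for interpreting its own part of the spec.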

The delimited text adaptor LIST_delim is a simple way to store simple objects. This adaptor stores each record in the list as a single line in the named text file. To illustrate this usage, let's assume the mylist.txt file contains the following, where [tab] stands for an actual tab character:

# This is a comment line.

First line [tab] second field
Second line [tab] blah blah

Third line [tab] blorgh

# Maybe another comment
Here is another [tab] hahaha

In this file, note that blank lines are ignored, as are comment lines (those beginning with #). Any line not a comment line or a blank line is assumed to be a record in mylist.
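The reading rules can be sketched in a few lines of Python. This is an illustration of the rules as stated here, not the LIST_delim source: the function name read_delim_records is hypothetical, and trimming whitespace around each value is an assumption.

```python
# Sample content of mylist.txt, with real tab characters written as \t.
SAMPLE = (
    "# This is a comment line.\n"
    "\n"
    "First line\tsecond field\n"
    "Second line\tblah blah\n"
    "\n"
    "Third line\tblorgh\n"
    "\n"
    "# Maybe another comment\n"
    "Here is another\thahaha\n"
)

def read_delim_records(text, field_ids):
    """Return one dict per record line, skipping blank and '#' comment lines."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comment lines are not records
        # Assumption: whitespace around each tab-separated value is trimmed.
        values = [v.strip() for v in line.split("\t")]
        records.append(dict(zip(field_ids, values)))
    return records

if __name__ == "__main__":
    for rec in read_delim_records(SAMPLE, ["field1", "field2"]):
        print(rec)
```

With the sample content, the two comment lines and three blank lines are skipped, leaving four records.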

So if we run repmgr from the command line, we can poke around a little. The following transcript is from Windows.

C:\projects\sites\test>repmgr -r site.opm
+000: Repository open.
repmgr v1.0 listening: type 'help' for help.
++done++
list
+100: OK, data follows.  1 key(s) found:
 mylist
+000: OK ++done++
list mylist
+100: OK, data follows.  4 key(s) found:
 first_line
 second_line
 third_line
 here_is_another
+000: OK ++done++
get mylist here_is_another
+200: OK, XML follows.
<rec id="here_is_another" key="here_is_another" list="mylist">
<field id="field1">Here is another</field>
<field id="field2">hahaha</field>
<field id="linenum">3</field>
</rec>
>>
bye
+000: Ciao ragazzo. ++done++

There are some interesting points to be made about this little conversation:

The "delim" list storage adaptor takes the first field by default and makes it a key.
Keys in repmgr can't contain uppercase letters, punctuation, or spaces, so the keys are munged to make them work.
The "list" command without a list ID calls repos_list on the pseudolist "_lists" to get a list of the lists defined in the system.
The "list" command otherwise calls repos_list on the named list and returns a list of keys. The repos_list API call, though, actually returns a list of (sometimes simplified) records, each of which is marked with a key. Sometimes this is useful, and we'll see why later.
The "get" command calls repos_get on the list ID and a key, and retrieves an XML structure which encodes the object. In this case, this is built from the corresponding line in mylist.txt, but sometimes there may be a lot more involved.
The "++done++" after every response is for handshaking when a repository defines a list which is actually stored by a remote repository. We'll get to that later, too.
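The key munging mentioned above can be approximated as follows. The exact rule the wftk applies is not spelled out here, so this Python sketch is an assumption that merely reproduces the four keys seen in the transcript; the function name munge_key is hypothetical.

```python
import re

def munge_key(value):
    """Lowercase, turn runs of whitespace into '_', drop other punctuation."""
    key = value.strip().lower()
    key = re.sub(r"\s+", "_", key)        # spaces become underscores
    key = re.sub(r"[^a-z0-9_]", "", key)  # assumption: anything else is dropped
    return key

if __name__ == "__main__":
    for first_field in ["First line", "Second line", "Third line", "Here is another"]:
        print(munge_key(first_field))
```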

However, the most crucial point here is that retrieval of data is very simple in the wftk. You give a key, and you get an object back which is an XML structure. The delim: adaptor has a pretty strictly delineated object structure; other adaptors are free to return whatever XML may be appropriate, but since delim: is building the XML for you from the line of text it sees, there's not a lot of flexibility. But for all adaptors, "get" is always the same: give a key, get an object.

delim: always delivers XML that is a <rec> element, containing a list of <field> elements precisely as defined in the list definition. It then tacks another field on the end with the line number; this can be used for any purpose.
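As a rough illustration of that structure, the following Python sketch assembles a <rec> element shaped like the one in the transcript. It is not the adaptor's implementation: build_rec is a hypothetical name, and the line number is simply passed in, since how the wftk counts lines is not stated here.

```python
import xml.etree.ElementTree as ET

def build_rec(list_id, key, field_ids, values, linenum):
    """Build a <rec> element with one <field> per defined field, plus linenum."""
    rec = ET.Element("rec", {"id": key, "key": key, "list": list_id})
    for field_id, value in zip(field_ids, values):
        ET.SubElement(rec, "field", {"id": field_id}).text = value
    # The extra field tacked on the end carries the line number.
    ET.SubElement(rec, "field", {"id": "linenum"}).text = str(linenum)
    return rec

if __name__ == "__main__":
    rec = build_rec("mylist", "here_is_another",
                    ["field1", "field2"], ["Here is another", "hahaha"], 3)
    print(ET.tostring(rec, encoding="unicode"))
```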

The wftk allows you not only to read this data, but also to modify it and add new entries. That will be covered in the data storage portion of the User's Manual.

More complicated object structures (LIST_localdir storage)

Although delim: is a very easy and flexible way to slap some data into a file and get the wftk to use it, it has some limitations, so it isn't the default storage adaptor. The default is the localdir: adaptor (local directory), which stores each object as a named XML file in a subdirectory of the repository's local directory. The localdir: adaptor can also store attachments as separate files in that same directory; this is used for versioned document management and for storing the process definitions for workflow.

The localdir: adaptor's XML storage is completely arbitrary. It can be anything at all. Let's look at an example.

Storing data in relational databases

All well and good, but let's face it, most data in the world is in relational databases, and for a very good reason: they work really well. The wftk has adaptors for a number of different databases (ODBC under Windows, Oracle, and MySQL) and writing new database code is quite easy, so if you have a favorite database in your system environment, you can get the wftk to talk to it without a lot of hassle.

At any rate, we'll use MySQL to illustrate database connectivity, with the assurance that any other database will work in the exact same way.

The wftk can't create tables in your database, so for a list stored in a database, the list definition describes storage that already exists rather than establishing it. This example will use a MySQL table created using the following code. (If you use a different database, of course, you'll modify this correspondingly.)

create table mylist (
  field1 varchar(50),
  field2 varchar(50)
);

Given that table, we can now define a list in the repository to access it, as follows: