lpml alpha version

lpml, alpha version Perl script


This is the first cut of lpml (Literate Programming Markup Language). It is a simple Perl script that doesn't do anything fancy yet -- but it works. My eventual plans for it are extensive, and I don't expect any of the code now in existence to survive long.

I'd be overjoyed if someone else found this system to be useful. If you do, please tell me about it and of course feel free to update me with any changes you might have made to the code.

Table of contents: [##itemlist##]

The completed code (a Perl script) is available here as a plain text file suitable for download. To install it, just replace the first line (which is the path to my Perl interpreter) with the path to your own Perl interpreter. I haven't written any user documentation except a list of the elements I use. The overall setup of a presentation isn't described anywhere yet.

You can, of course, study the XML-ish code I used to generate this code and documentation. This is pretty thin user documentation but soon I hope to augment it.


This code and documentation are released under the terms of the GNU license. They are additionally copyright (c) 2000, Vivtek. All rights reserved except those explicitly granted under the terms of the GNU license.

This is the overall layout of the file, and indeed of most single-file Perl scripts. As such, it will serve as a useful pattern someday when this project gets to the point where patterns are explicitly supported.

I don't know how other people write Perl scripts, but I start out by processing the input, so I can quit early if there's already a problem. Then come globals and subroutine declarations (in most of my larger pieces, of course, I eschew globals and put subroutines into separate files.) Then comes the meat of the code, which executes the following basic algorithm:

  • Scan
    The initial scan reads the file and creates a list of items keyed by name. Each item's information includes its label (for printing into the documentation) and its initial text pieces (for tangling.) Note that some pieces concatenate onto other items than the one they're in; in this case, the label may not even be known before the first text piece is added, but that won't stop us, we're Perl.
  • Tangle
    Once the scan is complete, we have all the information we need to generate the text targets defined by this file. (That's tangling.) So we get that task out of the way. The other reason that tangle goes first is that in a later version than this, the tangle will accumulate index data which will be used during the weave phase (where documentation is generated.)
  • Prepare indices
    Using information gained during the initial scan and possibly augmented during tangle, we can prepare various indices which can then be used in the generation of the documentation. At the moment, these indices are simply a hierarchical list of items and a list of objects generated; no information is created during tangle. Knuth's original WEB, of course, generated a comprehensive list of crossreferences and a list of identifiers during tangle.
  • Weave
    The weave step generates the documentation, one page per item, by name. If an item is called "index" (thus generating "index.html") then it will be omitted from the list of items, and its label will be available for use in subsequent pages. The <insert> tag is replaced by the label of its item, suitably formatted. Text pieces are formatted as code blocks.
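
The scan and tangle steps listed above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the script itself: the document is a hypothetical in-memory string, only the `name` attribute is recognized (and it must come first), and `tangle` returns a string instead of writing to a file.

```perl
use strict;
use warnings;

# Hypothetical two-item document; every tag sits on a line by itself.
my @lines = split /\n/, <<'END';
<item name="main">
<piece>
print "start\n";
<insert name="body"/>
</piece>
</item>
<item name="body">
<piece>
print "body\n";
</piece>
</item>
END

# Scan: collect the text of each piece under the name of its item.
my %pieces;
my $name     = '';
my $in_piece = 0;
for my $line (@lines) {
    if ($line =~ /^<item\s+name="([^"]*)"/i) { $name = $1; next; }
    if ($line =~ /^<\/item\s*>/i)            { $name = ''; $in_piece = 0; next; }
    if ($line =~ /^<piece\b/i)               { $in_piece = 1 if $name ne ''; next; }
    if ($line =~ /^<\/piece\s*>/i)           { $in_piece = 0; next; }
    $pieces{$name} .= "$line\n" if $in_piece;
}

# Tangle: expand <insert> tags depth-first, starting from one item.
sub tangle {
    my ($item) = @_;
    my $out = '';
    for my $line (split /\n/, $pieces{$item} // '') {
        if ($line =~ /<insert\s+name="([^"]*)"/i) { $out .= tangle($1); }
        else                                       { $out .= "$line\n"; }
    }
    return $out;
}

my $tangled = tangle('main');
print $tangled;
```

The sketch has no protection against an item that inserts itself, directly or indirectly; such a document would recurse without end.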

    #!/usr/local/bin/perl
    # This is the lpml alpha Perl script.
    # Copyright (c) 2000 Vivtek. All rights reserved except those explicitly granted
    # under the GNU Public License.
    # http://www.vivtek.com/lpml/lpml_alpha/index.html for more information.

The initial scan of the input file is pretty straightforward. It simply reads everything and builds a list of items, keyed by name and containing their labels and their text. The only weird case you have to watch out for is when a piece concatenates to an item that hasn't been encountered yet. In that case, the piece is stashed anyway, then when the item is defined, if it has a text piece in it then that piece will be inserted before any text already collected.

Due to the crude nature of my current weave, I have all this in one big blob of text. This is because I can't bring myself to break it onto separate pages. And of course the other reason for this is that I'm still not literate-adapted; I have always tended to write code in a rather monolithic fashion (breaking code into subroutines to increase readability has always really irritated me. I guess that's why I'm working on a literate programming system instead.)

One assumption I'm making here: the input file is open on INPUT. It will have to be rewound before doing the weave. Tangle won't require a further pass, because this scan step will gather everything we need for tangling. First, let's set up some globals we'll be using.

    @items = ();
    @objects = ();
    @formats = ();
    $name = '';
    $piecename = '';
    $parentname = '';
    $formatname = '';

The way I'm doing this is that I'm effectively using the name of the current item, and the name of the item receiving the current piece, as state variables. When we leave the element in each case, I reset the value to blank, so that the scanner can tell we're not in an item or piece respectively.

So at the end of the loop, you see the two little chunks that actually read all the interesting things in this pass: each line is attached to a piece if the piecename is set, or to a format if the formatname is set. Obviously, the two shouldn't be set at the same time, but there's nothing in the code which formally precludes that.

    while (<INPUT>) {
      if ($piecename ne '') { $pieces{$piecename} .= $_; }
      if ($formatname ne '') { $format{$formatname} .= $_; }
    }

Tag handling is pretty easy: each time we have a line, we check it for a match with one of the tags that we know how to handle. This means that all LPML tags must be on lines by themselves, but so far that requirement hasn't been too onerous. When I build in a real XML tokenizer, this whole section will look a lot different.

The tags we're looking for are <object>, <item>, and <piece>. Oh, and I've just added <format>.

One of the things the tag handlers do is to set and unset various state markers. For instance, I terminate the <item> tag by setting the $name global to blank. I also set the $piecename global to blank in case the user forgot to terminate the current piece. I know that violates the principles of XML tokenization, but again, later we'll get into real XML tokenization and I don't want to mess with it yet.

    if (/(<object .*>)/i) {
    }
    if (/(<item .*>)/i) {
      next;
    }
    if (/(<\/item\s*>)/i) {
      if ($name !~ /\./) { $parentname = $name; }
      $name = '';
      $piecename = '';
      next;
    }
    if (/(<piece.*>)/i) {
      next if $name eq ''; # Pieces are silent outside of items.
      next;
    }
    if (/(<\/piece\s*>)/i) {
      $piecename = '';
      next;
    }
    if (/(<format.*>)/i) {
      next;
    }
    if (/(<\/format\s*>)/i) {
      $formatname = '';
      next;
    }

The object scanner is a little simpler than the item and piece scanners, so I'll explain it first. As each line is scanned, it's checked for being an <object> tag. Note that this is assuming that the tag will be the only thing on the line. I don't want to get into real tokenizing of the XML input, because that will be the province of the QDMT, which is my next four-letter vowelless acronym. The next version of lpml will use the QDMT to tokenize its input.
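
Every tag handler in this scanner reads its attributes the same way: it splits the tag on double quotes, so that attribute names and values alternate. A minimal sketch of that idea follows. The helper name `parse_attrs` and the sample tag are hypothetical; the script itself repeats the loop inline in each handler rather than calling a subroutine.

```perl
use strict;
use warnings;

# parse_attrs is a hypothetical helper; the script repeats this loop inline.
sub parse_attrs {
    my ($tag, $element) = @_;
    $tag =~ s/^<\Q$element\E\s*//i;    # drop the leading "<element"
    my %attrs;
    my $attr = '';
    for my $part (split /"/, $tag) {
        if ($attr eq '') {
            # Parts before a quote are names: strip spaces and the trailing '='.
            $attr = $part;
            $attr =~ s/^\s*//;
            $attr =~ s/\s*=\s*$//;
        } else {
            # Parts between quotes are the values.
            $attrs{$attr} = $part;
            $attr = '';
        }
    }
    # Whatever follows the last quote (here the closing '>') is never stored.
    return %attrs;
}

my %object = parse_attrs('<object name="lpml.pl" language="perl" item="main">', 'object');
print "$_ = $object{$_}\n" for sort keys %object;
```

The approach depends on every value being wrapped in double quotes and on no value containing a double quote itself, which is part of why a real tokenizer is planned.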

At any rate, if the object tag is encountered, I read its attributes into the %thistag hash. The other handlers reuse this hash.

    $tag = $1;
    $tag =~ s/^<object\s+//i;
    $attr = "";
    %thistag = (name => '', language => '', item => '');
    foreach $piece (split /"/, $tag) {
      if ($attr eq '') {
        $attr = $piece;
        $attr =~ s/^\s*//;
        $attr =~ s/\s*=\s*$//;
      } else {
        $thistag{$attr} = $piece;
        $attr = '';
      }
    }

Now that we've loaded the current object's attributes, let's check the attributes for consistency. This is kind of a validation of the kind-of DTD.

    if ($thistag{name} eq '') {
      print STDERR "$. : Nameless object encountered.\n";
      next;
    }
    if ($thistag{item} eq '') {
      print STDERR "$. : Object '$thistag{name}' has no starting item.\n";
      next;
    }

Validation complete, we'll build up a list of objects, and stash the start item of this particular object. The language isn't being used yet, so we won't stash it.

    @objects = (@objects, $thistag{name});
    $starter{$thistag{name}} = $thistag{item};

Handling items works pretty much the same way, except that there's more to keep track of.

    $tag = $1;
    $tag =~ s/^<item\s+//i;
    $attr = "";
    %thistag = (name => '', label => '', pattern => '', language => '', format => 'default');
    foreach $piece (split /"/, $tag) {
      if ($attr eq '') {
        $attr = $piece;
        $attr =~ s/^\s*//;
        $attr =~ s/\s*=\s*$//;
      } else {
        $thistag{$attr} = $piece;
        $attr = '';
      }
    }
    if ($thistag{name} eq '') {
      print STDERR "$. : Nameless item encountered.\n";
      next;
    }

    $name = $thistag{name};
    $lastchild{$name} = $name;
    $children{$name} = 0;
    if ($name !~ /\./) {
      $parentname = '';
      $parent{$name} = '';
    } else {
      $parentname = $name;
      $parentname =~ s/\..*?$//;
      $parent{$name} = $parentname;
      $lastchild{$parentname} = $name;
      $children{$parentname} += 1;
    }
    push @items, $name;
    if (defined $label{$name}) {
      print STDERR "$. : Duplicate item name '$name'.\n";
    }
    if ($thistag{label} eq '') { $thistag{label} = $name; }
    $label{$name} = $thistag{label};
    if ($parentname eq '') {
      $url{$name} = "$name.html";
    } else {
      $n = $name;
      $n =~ s/^.*?\.//;
      $url{$name} = $url{$parentname} . '#' . $n;
    }

And finally, the <piece> tag, which is pretty analogous to <item>.

    $tag = $1;
    $tag =~ s/^<piece\s*//i;
    $attr = "";
    %thistag = ('add-to' => '', language => '');
    foreach $piece (split /"/, $tag) {
      if ($attr eq '') {
        $attr = $piece;
        $attr =~ s/^\s*//;
        $attr =~ s/\s*=\s*$//;
      } else {
        $thistag{$attr} = $piece;
        $attr = '';
      }
    }
    $piecename = $name;
    $piecename = $thistag{'add-to'} if $thistag{'add-to'} ne '';

The <format> tag simply takes its content and stashes it into a hash, just like pieces. The only attribute we care about in a format tag is its name.

    $tag = $1;
    $tag =~ s/^<format\s*//i;
    $attr = "";
    %thistag = (name => '');
    foreach $piece (split /"/, $tag) {
      if ($attr eq '') {
        $attr = $piece;
        $attr =~ s/^\s*//;
        $attr =~ s/\s*=\s*$//;
      } else {
        $thistag{$attr} = $piece;
        $attr = '';
      }
    }
    if ($thistag{name} eq '') {
      print STDERR "$. : Nameless format encountered.\n";
      next;
    }
    $formatname = $thistag{name};
    push @formats, $formatname;

The tangle step is really pretty easy. What we do in this step is use the %pieces hash we populated during the scan step to generate code. We insert items in other items by using the <insert> tag.

Since our coding language is XML, we can't use the character '<' explicitly in our code. Irritating, but that's one of the tiny little disadvantages of using XML, and it's offset by so many advantages that living with it isn't so hard. All we do is use '[[' instead (hard-coded in this version, soon to be specifiable.) So the tangle step also takes care of substituting our '<' back in.

Tangle is inherently recursive. I'm dumbly following the whole recursive structure in a depth-first manner; the smarter thing would be to cache items as they're tangled in case of reuse (then I wouldn't have to retangle that item). Since I'm working with small files here, that probably isn't worth the effort, so I'm not going into it.
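
The caching idea mentioned above is not implemented in this version. As a rough sketch of what it might look like, assuming the piece table is complete and no longer changes once the scan is over: the names `tangle_cached`, `%cache` and `$expansions`, and the sample pieces, are all hypothetical, and the routine returns text instead of printing to OUT.

```perl
use strict;
use warnings;

# Hypothetical piece table, roughly as the scan step might leave it.
my %pieces = (
    main   => "<insert name=\"header\">\nprint 1;\n<insert name=\"header\">\n",
    header => "# shared header\n",
);

my %cache;              # item name => fully tangled text
my $expansions = 0;     # counts how many items were actually expanded

sub tangle_cached {
    my ($item) = @_;
    return $cache{$item} if exists $cache{$item};   # reuse earlier work
    $expansions++;
    my $out = '';
    for my $line (split /\n/, $pieces{$item} // '') {
        if ($line =~ /<insert\s+name="([^"]*)"/i) {
            $out .= tangle_cached($1);
        } else {
            $out .= "$line\n";
        }
    }
    return $cache{$item} = $out;
}

my $result = tangle_cached('main');
print $result;
```

Here `header` is inserted twice but expanded only once. For the small files this script handles, the saving is negligible, which matches the decision above to skip it.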

Note also that since Perl can define subroutines right in the middle of the rest of the code, I found it more natural simply to include the recursive routine here right in the middle of things, instead of packing it off into some subroutine block.

    foreach $obj (@objects) {
      open OUT, ">$obj";
      tangle_one ($starter{$obj});
      close OUT;
    }

The tangle procedure is recursive, and writes out the tangled output to the object file. Among other things, it needs to convert special characters to their literal equivalents; special characters are those, like < and &, which XML doesn't allow anywhere but in meaningful positions. These replacements occur right at the bottom.

    sub tangle_one {
      my $item = shift;
      foreach $_ (split /\n/, $pieces{$item}) {
        if (/(<insert .*>)/i) {
          $tag = $1;
          $tag =~ s/^<insert\s+//i;
          $attr = "";
          %thisobject = (name => '');
          foreach $piece (split /"/, $tag) {
            if ($attr eq '') {
              $attr = $piece;
              $attr =~ s/^\s*//;
              $attr =~ s/\s*=\s*$//;
            } else {
              $thisobject{$attr} = $piece;
              $attr = '';
            }
          }
          if ($thisobject{name} =~ /^\./) {
            $thisobject{name} = $item . $thisobject{name};
          }
          if ($thisobject{name} eq '') {
            print STDERR "$. : Item to be inserted not named.\n";
            next;
          }
          tangle_one ($thisobject{name});
          next;
        }
        s/\[\[/</g;
        s/#\^7/&/g;
        s/#\^lt#/</g;
        print OUT "$_\n";
      }
    }

The weave step is the other main function of traditional literate programming: it takes the items as defined and creates the documentation file. Since the documentation in this case is a set of pages, the philosophy is a little different. For each main item, we write a separate documentation page (subitems are left on the same page as their respective parents, with headings.)

The formats used to generate pages are, of course, those read in on the initial scan into the format hash. If no format is specified for an item, the format 'default' is used; if a format is named but not defined, then the body of the item is simply printed (i.e. an undefined format is effectively equal to '[##body##]'). This sounds useless but it comes in handy for embedding fixed pages (such as index.html) directly into the XML document.
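
The fallback behaviour described in this paragraph can be sketched as follows. This is an illustration of the described rule, not the script's own code: `apply_format` is a hypothetical name, the sample format and tags are made up, and the substitution is done in a single pass, so replacement text is not scanned again for further placeholders.

```perl
use strict;
use warnings;

# A made-up format table; the real one is filled from <format> tags.
my %format = (
    default => "<h1>[##label##]</h1>\n[##body##]\n",
);

# apply_format is a hypothetical name. A format that was never defined
# behaves as if it were just '[##body##]'.
sub apply_format {
    my ($formatname, %tags) = @_;
    my $text = exists $format{$formatname} ? $format{$formatname} : '[##body##]';
    # Replace each [##tag##] with its value; unknown tags become empty.
    $text =~ s/\[##(.*?)##\]/defined $tags{$1} ? $tags{$1} : ''/ge;
    return $text;
}

my %tags = (label => 'Scan', body => '<p>text</p>');
my $page  = apply_format('default', %tags);
my $plain = apply_format('missing', %tags);
print $page, $plain, "\n";
```

With an undefined format name the result is just the body, which is what makes embedding a fixed page such as index.html possible.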

The file <project>_format.html is used to generate each of these pages; a more elaborate formatting control mechanism would be nice later, but this will do the trick for now. Code pieces are written enclosed in <code> tags, and <insert> tags are replaced with the label, not the name, of the respective items.

The overall code for weave looks a lot like the scan code, because it's doing very similar things. It scans lines of input, using global variables to mark its current state.

Setup is straightforward: we rewind the INPUT filehandle, load the template into a list (because we'll be scanning it once for each item page) and set some globals to blank values.

    seek INPUT, 0, 0;
    $name = '';
    $item = 0;
    $piece = 0;
    while (<INPUT>) {

So let's get down to business scanning. The thing weave is interested in is items, so let's look for <item> tags first. For each item we encounter, we'll set up replacement tags as follows:

  • [##next##] and [##prev##]
    These contain the URLs of the next and previous items in the list, respectively. The first item has a [##prev##] pointing to the index, and the last has a [##next##] pointing to the index.
  • [##nextlabel##] and [##prevlabel##]
    The labels of the next and previous items, for nice link formatting.
  • [##body##]
    The body of the item is its documentation and properly formatted code. In other words, most of the work of weave is to build up the [##body##] tag.
  • [##name##], [##url##] and [##label##]
    The name, URL and label of the current item.
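
The previous and next links wrap around the item list. A minimal sketch of that wraparound, assuming a flat list of top-level items only: the item names and the helper `neighbours` are hypothetical, and the script's extra step of replacing a subitem by its parent is left out.

```perl
use strict;
use warnings;

my @items = ('index', 'scan', 'tangle', 'weave');   # hypothetical top-level items

# neighbours is a hypothetical helper returning (previous, next) item names.
sub neighbours {
    my ($i) = @_;
    my $n = scalar @items;
    # Perl's % with a positive right operand never returns a negative
    # number, so index -1 wraps around to the last item.
    my $prev = $items[ ($i - 1) % $n ];
    my $next = $items[ ($i + 1) % $n ];
    return ($prev, $next);
}

for my $i (0 .. $#items) {
    my ($prev, $next) = neighbours($i);
    print "$items[$i]: prev=$prev next=$next\n";
}
```

In the script the same effect comes from explicit tests against 0 and $#items rather than from modular arithmetic.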
Eventually we'll want change dates as well, but we're not tracking it yet, so there's not much point. All in good time.

    if (/(<item .*>)/i) {
      $tag = $1;
      $tag =~ s/^<item\s+//i;
      $attr = "";
      %thisitem = (name => '', label => '', pattern => '', language => '', format => 'default');
      foreach $part (split /"/, $tag) {
        if ($attr eq '') {
          $attr = $part;
          $attr =~ s/^\s*//;
          $attr =~ s/\s*=\s*$//;
        } else {
          $thisitem{$attr} = $part;
          $attr = '';
        }
      }
      $name = $thisitem{name};
      $formatname = $thisitem{format};
      if ($parent{$name} eq '') {
        $tags{name} = $name;
        $tags{url} = $url{$name};
        $tags{label} = $label{$name};
        if ($item == 0) {
          $i = $items[$#items];
        } else {
          $i = $items[$item-1];
        }
        if ($parent{$i} ne '') { $i = $parent{$i}; }
        $tags{prev} = $url{$i};
        $tags{prevlabel} = $label{$i};
        $tags{body} = '';
      } else {
        $n = $name;
        $n =~ s/^.*?\.//;
        $tags{body} .= "<br><br>\n";
        $tags{body} .= "<a name=\"$n\">\n";
        $tags{body} .= "<i>$label{$name}</i><br>\n";
      }
      if ($item == $#items) {
        $tags{next} = $url{$items[0]};
        $tags{nextlabel} = $label{$items[0]};
      } else {
        $tags{next} = $url{$items[$item+1]};
        $tags{nextlabel} = $label{$items[$item+1]};
      }
      next;
    }
    next if $name eq '';

And of course we terminate the item similarly to the scanner. The difference is that we also write out the current item's page using the various tags we've accumulated during scanning. Note that one of the things we're doing while writing the file is to convert HTML-like XML tags into bona fide HTML; eventually this sort of thing should be handled in some formatting module, but in the meantime I'm intent on HTML output.

Things are only written when either the current item has no children (and no parent), or the current item is the last child of its parent. If the current item is a child, then its body has been written onto the pre-existing body of its parent, and other tags (like next and prev) have already been set during processing of the parent. And the format to be used is also already set by the parent. (Corollary: if you set a format in a child item, nothing will happen.)

    if (/(<\/item\s*>)/i) {
      if (($parent{$name} eq '' && $children{$name} == 0) ||
          $lastchild{$parent{$name}} eq $name) {
        if ($parent{$name} eq '') {
          open OUT, ">$url{$name}";
        } else {
          open OUT, ">$url{$parent{$name}}";
        }
        foreach $line (split /\n/, $format{$formatname}) {
          $_ = $line . "\n";
          while (/\[##(.*?)##\]/) {
            $tag=$1;
            s/\[##$tag##\]/$tags{$tag}/e;
          }
          s(</li>)()g;
          s(<p/>)(<p></p>)g;
          s(<br/>)(<br>)g;
          s(<hr(.*?)/>)(<hr$1>)g;
          s(<nbsp/>)(&nbsp;)g;
          s(<li><b>)(<b><li>)g;
          print OUT;
        }
        close OUT;
      }
      $name = '';
      $item++;
      next;
    }

While in an <item>, we scan for pieces, just like scan. Text inside pieces is formatted differently (with spacing and linebreaks intact.)

    if (/(<piece.*>)/i) {
      $tag = $1;
      $tag =~ s/^<piece\s*//i;
      $attr = "";
      %thispiece = ('add-to' => '', language => '');
      foreach $part (split /"/, $tag) {
        if ($attr eq '') {
          $attr = $part;
          $attr =~ s/^\s*//;
          $attr =~ s/\s*=\s*$//;
        } else {
          $thispiece{$attr} = $part;
          $attr = '';
        }
      }
      $tags{body} .= "<table width=100%>\n";
      $tags{body} .= "<tr><td width=30 bgcolor=eeeeee>&nbsp;</td><td width=100%>\n";
      if ($thispiece{'add-to'} ne '') {
        $tags{body} .= "<i>Add the following to \"$label{$thispiece{'add-to'}}\"</i><br>\n";
      }
      $tags{body} .= "<pre>";
      $piece = 1;
      next;
    }
    if (/<\/piece\s*>/i) {
      $tags{body} .= "</pre>";
      $tags{body} .= "</td></tr></table>\n";
      $piece = 0;
      next;
    }

Almost done. The only remaining thing is to format <insert> tags.

    if (/(<insert\s.*>)/i) {
      $tag = $1;
      $before = $`;
      $after = $';
      $tag =~ s/^<insert\s+//i;
      $attr = "";
      %thisinsert = (name => '');
      foreach $part (split /"/, $tag) {
        if ($attr eq '') {
          $attr = $part;
          $attr =~ s/^\s*//;
          $attr =~ s/\s*=\s*$//;
        } else {
          $thisinsert{$attr} = $part;
          $attr = '';
        }
      }
      if ($thisinsert{name} =~ /^\./) {
        $thisinsert{name} = $name . $thisinsert{name};
      }
      $tags{body} .= $before . "<i>See <a href=\"$url{$thisinsert{name}}\">";
      $tags{body} .= "$label{$thisinsert{name}}</a></i>$after";
      next;
    }

And finally, any line which doesn't contain an <insert> and which is actually within an <item> simply gets tacked onto the current item's body after we make sure all the characters are going to display correctly. Note especially the attention to detail with the [## delimiter -- if we didn't replace that with a version containing an HTML entity instead (&#35;) then we'd have a problem with recursion once we get into the template application code. Which, of course, I realized only after having a problem with recursion once I got into the template application code.

      s/&/&amp;/g if $piece;
      s/\[\[/&lt;/g if $piece;
      s/\[\#\#/[&#35;#/g;
      s/#\^7/&amp;/g if $piece;
      s/#\^lt#/&lt;/g if $piece;
      $tags{body} .= $_ if $name ne '';
    }

In the first version of this app, I used a file named <project>_index.html as the template for the index page of the application. This page had the following tags available:

  • itemlist
  • objectlist
However, once I had the <format> tag working, it turned out to be much cleaner simply to make those tags available for all pages, then allow an item named "index" to write out the index page. So instead of writing the index page in this block, I'm just going to build the tags and trust that the document will use them. And frankly, if not, no skin off my nose. In that case, the document will need some kind of explicitly generated navigation.

So let's build our item and object indices. Each of these is hardwired to be a bullet point list at the moment; that would be a nice thing to open up to configuration but that sort of enhancement will come later.

Since the index page is now a plain old item, I'm going to omit it from the table of contents. It's kind of jarring to see it there. On the other hand, since the index page is now a plain old item, I can use its label in further link formatting!

The use of the tags hash has to do with the way I do template printing.
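
The item list built in this section nests subitems one level under their parents. A minimal sketch of that nesting, using made-up item names and a dot in the name as the test for a subitem: the extra closing tag for a trailing subitem is an addition of this sketch and is not something the script is known to do.

```perl
use strict;
use warnings;

# Hypothetical items in document order; names containing a dot are subitems.
my @items = ('scan', 'scan.tags', 'scan.object', 'tangle');

my $html  = "<ul>\n";
my $level = 0;                       # 0 = top level, 1 = inside a sublist
for my $item (@items) {
    my $is_child = ($item =~ /\./) ? 1 : 0;
    if (!$level && $is_child)  { $level = 1; $html .= "<ul>\n"; }
    if ($level && !$is_child)  { $level = 0; $html .= "</ul>\n"; }
    $html .= "<li>$item</li>\n";
}
$html .= "</ul>\n" if $level;        # close a sublist left open by a trailing subitem
$html .= "</ul>\n";
print $html;
```

Only two levels are handled, which is all the single $level flag can represent.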

The item list has a variable $level which keeps track of the table of contents level. Subitems are indented under their parents. When a non-subitem is encountered and the $level is 1, then we have to terminate the sublevel.

    $level = 0;
    $tags{itemlist} = "<ul>\n";
    foreach $item (@items) {
      if ($item eq 'index') {
        $tags{indexlabel} = $label{$item};
      } else {
        if (!$level && $parent{$item} ne '') {
          $level = 1;
          $tags{itemlist} .= "<ul>\n";
        }
        if ($level && $parent{$item} eq '') {
          $level = 0;
          $tags{itemlist} .= "</ul>\n";
        }
        $tags{itemlist} .= "<li> <a href=\"$url{$item}\">$label{$item}</a>\n";
      }
    }
    $tags{itemlist} .= "</ul>\n";

    $tags{objectlist} = "<ul>\n";
    foreach $obj (@objects) {
      $tags{objectlist} .= "<li> <code>$obj</code>:";
      $tags{objectlist} .= "<a href=\"$url{$starter{$obj}}\">$label{$starter{$obj}}</a>\n";
    }
    $tags{objectlist} .= "</ul>\n";

Finally, now that we have the good stuff out of the way, we're ready to talk about how we open our input file. The scan and weave steps expect input in the INPUT file handle.

    if (@ARGV != 1) {
      print STDERR "Usage: $0 <filename>\n";
      exit 1;
    }
    open INPUT, $ARGV[0] or die "Can't open file '$ARGV[0]'\n";
    $format_html = $ARGV[0];
    $format_html =~ s/\.xml$/_format.html/i;
    $index_html = $ARGV[0];
    $index_html =~ s/\.xml$/_index.html/i;

And after all is said and done, we'll want to clean all this up, too.

    close INPUT;

This is of course a work in progress, even this alpha version. I'm going to cut off this particular version whenever I complete qdmt but until then I've got a few things I want to build into even this primitive little app.

  • Initial application functional 3/21/2000
    The initial version could (just barely) tangle and weave. It could process itself and it produced readable documentation. Qualifies as literate programming in my book, anyway.
  • Second pass 3/26/2000
    Did a few things. Changed the highlighting of code segments in woven documentation by sticking that little grey bar next to them. Eventually that sort of thing should be done with some sort of map (or better yet, lpml should work with DocBook SGML.) Added subitem functionality.
  • Support for <format> tag 5/7/2000
    Not only did I do better formatting (and remove that awkward dependence on external files) I also made it possible for an XML-correct document to generate HTML during weave, and parsed this entire document with expat to ensure that it's well-formed XML, which it is. It was like pulling teeth! I've fallen into a number of HTML coding habits that really make XML complain! It is now also possible to generate a single-page documentation, since there is no explicitly "different" index page.
Here's a list of things I want to introduce before I do better XML tokenization:
  • Hide non-display items (like the cleanup item from this very project.)
  • Variant support (to define objects which share some items but differ in others.)
And then there's the further future:
  • Version control and documentation.
  • Real XML tokenizer (QDMT or expat).
  • DocBook XML instead of HTML output.
  • Management beyond tangle (i.e. something on the make level.)
  • Active items (generated from data structures aside from text, like database queries.)
  • More sophisticated integration with my other content management toolset.
Plenty to do. Maybe I'll get some of it done sooner or later. Watch this space for further details.