Initial scan
The initial scan of the input file is pretty straightforward. It simply reads everything and builds a list of items, keyed by name and containing their labels and their text. The only weird case you have to watch out for is when a piece concatenates to an item that hasn't been encountered yet. In that case, the piece is stashed anyway; then, when the item is defined, any text piece it contains is inserted before the text already collected.

Due to the crude nature of my current weave, I have all this in one big blob of text, because I can't bring myself to break it onto separate pages. The other reason is that I'm still not literate-adapted; I have always tended to write code in a rather monolithic fashion. (Breaking code into subroutines to increase readability has always really irritated me. I guess that's why I'm working on a literate programming system instead.)

One assumption I'm making here: the input file is open on INPUT. It will have to be rewound before doing the weave. Tangle won't require a further pass, because this scan step gathers everything we need for tangling. First, let's set up some globals we'll be using.
@items = ();
@objects = ();
@formats = ();
$name = '';
$piecename = '';
$parentname = '';
$formatname = '';
while (<INPUT>) {
  See Looking for tags
  if ($piecename ne '') { $pieces{$piecename} .= $_; }
  if ($formatname ne '') { $format{$formatname} .= $_; }
}
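The forward-reference case described above can be sketched in isolation. This is only an illustration of one way to get the described ordering, not the scanner's own code: the `add_piece` and `define_item` helpers are hypothetical names introduced here, and the sketch assumes that an item's own text should simply be placed in front of whatever was stashed for it earlier.

```perl
use strict;
use warnings;

my %pieces;   # collected text, keyed by item name

# Append text to an item, whether or not the item has been defined yet.
# (Hypothetical helper, not part of the original scanner.)
sub add_piece {
    my ($target, $text) = @_;
    $pieces{$target} = '' unless defined $pieces{$target};
    $pieces{$target} .= $text;
}

# Define an item: its own text goes in front of anything stashed earlier.
# (Hypothetical helper, not part of the original scanner.)
sub define_item {
    my ($name, $text) = @_;
    my $stashed = defined $pieces{$name} ? $pieces{$name} : '';
    $pieces{$name} = $text . $stashed;
}

add_piece('main', "second line\n");    # 'main' has not been defined yet
define_item('main', "first line\n");   # the definition arrives later
print $pieces{'main'};
```

Because the stash is keyed by name, the piece needs no special handling at the moment it is seen; the ordering is sorted out when the item finally turns up.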
Looking for tags
Tag handling is pretty easy: each time we have a line, we check it for a match with one of the tags that we know how to handle. This means that all LPML tags must be on lines by themselves, but so far that requirement hasn't been too onerous. When I build in a real XML tokenizer, this whole section will look a lot different. The tags we're looking for are <object>, <item>, and <piece>. Oh, and I've just added <format>.
One of the things the tag handlers do is to set and unset various state markers. For instance, I terminate the <item> tag by setting the $name global to blank. I also set the $piecename global to blank in case the user forgot to terminate the current piece. I know that violates the principles of XML tokenization, but again, later we'll get into real XML tokenization and I don't want to mess with it yet.
if (/(<object .*>)/i) {
  See Handle object tag
}
if (/(<item .*>)/i) {
  See Handling item tags
  next;
}
if (/(<\/item\s*>)/i) {
  if ($name !~ /\./) { $parentname = $name; }
  $name = '';
  $piecename = '';
  next;
}
if (/(<piece.*>)/i) {
  next if $name eq ''; # Pieces are silent outside of items.
  See Handle piece tag within item
  next;
}
if (/(<\/piece\s*>)/i) {
  $piecename = '';
  next;
}
if (/(<format.*>)/i) {
  See Handle format tag
  next;
}
if (/(<\/format\s*>)/i) {
  $formatname = '';
  next;
}
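To see the line-by-line matching on its own, here is a small self-contained sketch. The `classify_line` function and its return labels are hypothetical, invented for this illustration; only the regular expressions are taken from the chunk above.

```perl
use strict;
use warnings;

# Classify one input line by the tag it carries, if any.
# Returns a short label; 'text' means no known tag was found.
# (Hypothetical helper; the real scanner acts on the match directly.)
sub classify_line {
    my ($line) = @_;
    return 'object'     if $line =~ /<object .*>/i;
    return 'item'       if $line =~ /<item .*>/i;
    return 'item-end'   if $line =~ /<\/item\s*>/i;
    return 'piece'      if $line =~ /<piece.*>/i;
    return 'piece-end'  if $line =~ /<\/piece\s*>/i;
    return 'format'     if $line =~ /<format.*>/i;
    return 'format-end' if $line =~ /<\/format\s*>/i;
    return 'text';
}

my @kinds = map { classify_line($_) } (
    '<item name="scan" label="Initial scan">',
    '<piece>',
    'while (<INPUT>) {',
    '</piece>',
    '</item>',
);
print join(', ', @kinds), "\n";
```

Note that a code line such as `while (<INPUT>) {` falls through as plain text, since none of the patterns match an arbitrary angle-bracketed word.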
Handle object tag
The object scanner is a little simpler than the item and piece scanners, so I'll explain it first. As each line is scanned, it's checked for being an <object> tag.
Note that this is assuming that the tag will be the only thing on the line. I don't want to
get into real tokenizing of the XML input, because that will be the province of the QDMT, which
is my next four-letter vowelless acronym. The next version of lpml will use the QDMT to tokenize
its input.
At any rate, if the object tag is encountered, I read its attributes into the %thistag hash. The other handlers reuse this hash.
$tag = $1;
$tag =~ s/^<object\s+//i;
$attr = "";
%thistag = (name => '', language => '', item => '');
foreach $piece (split /"/, $tag) {
  if ($attr eq '') {
    $attr = $piece;
    $attr =~ s/^\s*//;
    $attr =~ s/\s*=\s*$//;
  } else {
    $thistag{$attr} = $piece;
    $attr = '';
  }
}
if ($thistag{name} eq '') {
  print STDERR "$. : Nameless object encountered.\n";
  next;
}
if ($thistag{item} eq '') {
  # Report the object's name; the item attribute is empty at this point.
  print STDERR "$. : Object '$thistag{name}' has no starting item.\n";
  next;
}
push @objects, $thistag{name};
$starter{$thistag{name}} = $thistag{item};
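The attribute loop is the same in every handler, so it may help to see it run by itself. The sketch below wraps it in a function; `parse_attrs` and the sample tag are hypothetical, but the splitting logic is the one used above: split on double quotes, treat the odd fields as attribute names and the even fields as values.

```perl
use strict;
use warnings;

# Parse name="value" pairs from the inside of a tag by splitting on
# double quotes: odd fields are attribute names, even fields are values.
# (Hypothetical wrapper around the loop the handlers use inline.)
sub parse_attrs {
    my ($tag, %defaults) = @_;
    my %attrs = %defaults;
    my $attr  = '';
    foreach my $field (split /"/, $tag) {
        if ($attr eq '') {
            $attr = $field;
            $attr =~ s/^\s*//;       # drop leading whitespace
            $attr =~ s/\s*=\s*$//;   # drop the trailing '='
        } else {
            $attrs{$attr} = $field;
            $attr = '';
        }
    }
    return %attrs;
}

my $tag = '<object name="lpml.pl" language="perl" item="main">';
$tag =~ s/^<object\s+//i;
my %t = parse_attrs($tag, name => '', language => '', item => '');
print "$_ = $t{$_}\n" for sort keys %t;
```

The closing `>` ends up as a dangling attribute name with no value, so it is silently dropped. The approach only works for double-quoted values that contain no double quotes themselves, which is one of the things a real tokenizer would take care of.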
Handling item tags
Handling items works pretty much the same way, except that there's more to keep track of.
$tag = $1;
$tag =~ s/^<item\s+//i;
$attr = "";
%thistag = (name => '', label => '', pattern => '', language => '', format => 'default');
foreach $piece (split /"/, $tag) {
  if ($attr eq '') {
    $attr = $piece;
    $attr =~ s/^\s*//;
    $attr =~ s/\s*=\s*$//;
  } else {
    $thistag{$attr} = $piece;
    $attr = '';
  }
}
if ($thistag{name} eq '') {
  print STDERR "$. : Nameless item encountered.\n";
  next;
}
$name = $thistag{name};
$lastchild{$name} = $name;
$children{$name} = 0;
if ($name !~ /\./) {
  $parentname = '';
  $parent{$name} = '';
} else {
  $parentname = $name;
  $parentname =~ s/\..*?$//;
  $parent{$name} = $parentname;
  $lastchild{$parentname} = $name;
  $children{$parentname} += 1;
}
push @items, $name;
if (defined $label{$name}) {
  print STDERR "$. : Duplicate item name '$name'.\n";
}
if ($thistag{label} eq '') { $thistag{label} = $name; }
$label{$name} = $thistag{label};
if ($parentname eq '') {
  $url{$name} = "$name.html";
} else {
  $n = $name;
  $n =~ s/^.*?\.//;
  $url{$name} = $url{$parentname} . '#' . $n;
}
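The parent and URL bookkeeping is the part of the item handler that is easiest to get lost in, so here is a minimal sketch of just that piece. The `register_item` function and the item names are hypothetical; the two substitutions are the ones from the handler, and the sketch assumes, as the handler does, that a parent item is seen before its children.

```perl
use strict;
use warnings;

my (%parent, %url);

# Record the parent and URL for an item name, using the same regexes
# as the item handler. Parents must be registered before children.
# (Hypothetical helper for illustration only.)
sub register_item {
    my ($name) = @_;
    if ($name !~ /\./) {
        $parent{$name} = '';
        $url{$name}    = "$name.html";
    } else {
        (my $p = $name) =~ s/\..*?$//;   # keep what precedes the first dot
        (my $n = $name) =~ s/^.*?\.//;   # keep what follows the first dot
        $parent{$name} = $p;
        $url{$name}    = $url{$p} . '#' . $n;
    }
}

my @names = qw(scan scan.tags scan.tags.object);
register_item($_) for @names;
print "$_ -> $url{$_}\n" for @names;
```

With these patterns, a name with more than one dot is attached to its top-level item rather than to its immediate ancestor: `scan.tags.object` gets the parent `scan` and the URL `scan.html#tags.object`.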
Handle piece tag within item
And finally, the <piece> tag, which is pretty analogous to <item>.
$tag = $1;
$tag =~ s/^<piece\s*//i;
$attr = "";
# 'add-to' must be quoted: => only auto-quotes plain identifiers.
%thistag = ('add-to' => '', language => '');
foreach $piece (split /"/, $tag) {
  if ($attr eq '') {
    $attr = $piece;
    $attr =~ s/^\s*//;
    $attr =~ s/\s*=\s*$//;
  } else {
    $thistag{$attr} = $piece;
    $attr = '';
  }
}
$piecename = $name;
$piecename = $thistag{'add-to'} if $thistag{'add-to'} ne '';
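The last two lines of the handler decide where the piece's text will be collected. A minimal sketch of that decision follows; the `piece_target` function and the item names are hypothetical and exist only for this example.

```perl
use strict;
use warnings;

# Decide which item a piece's text belongs to: the add-to target if one
# was given, otherwise the item currently being scanned.
# (Hypothetical helper for illustration only.)
sub piece_target {
    my ($current_item, %attrs) = @_;
    my $target = $current_item;
    $target = $attrs{'add-to'}
        if defined $attrs{'add-to'} && $attrs{'add-to'} ne '';
    return $target;
}

print piece_target('scan', 'add-to' => ''), "\n";          # plain <piece>
print piece_target('scan', 'add-to' => 'globals'), "\n";   # <piece add-to="globals">
```

The hash key is written as `'add-to'` with quotes throughout, because Perl's `=>` operator only auto-quotes a bare word that looks like an identifier, and a hyphenated name does not qualify.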
Handle format tag
The <format> tag simply takes its content and stashes it into a hash, just like pieces. The only attribute we care about in a format tag is its name.
$tag = $1;
$tag =~ s/^<format\s*//i;
$attr = "";
%thistag = (name => 'default');
foreach $piece (split /"/, $tag) {
  if ($attr eq '') {
    $attr = $piece;
    $attr =~ s/^\s*//;
    $attr =~ s/\s*=\s*$//;
  } else {
    $thistag{$attr} = $piece;
    $attr = '';
  }
}
$formatname = $thistag{name};
push @formats, $formatname;
This code and documentation are released under the terms of the GNU license. They are copyright (c) 2000-2006, Vivtek. All rights reserved except those explicitly granted under the terms of the GNU license.