Proof of concept: using Marpa with an external tokenizer
Since learning of Marpa (I don't know when I first ran across it - the earliest notes I can find are from mid-2012, and they all mention Marpa as something I really wanted to learn someday), there have been two basic realms of problems I've wanted to use it for: parsing computer languages, specifically my Decl declarative language, and parsing natural language.

This article is about natural language.

If you haven't dealt much with natural language, there are a couple of areas where it differs strongly from computer languages. First, the grammars of natural languages are far more complex: Pro3gres, a state-of-the-art research parser for English, with a hand-written grammar, has about 1200 rules - and its design goals do *not* include full coverage of the English language. (See the technical report here if you're interested in Pro3gres - it's pretty nifty, and the report is well worth reading for its good overview of current NLP parsing research alone.) In contrast, the C parser written in Marpa by Jean-Damien Durand, MarpaX::Languages::C::AST, consists of just 102 productions by my count.

In fact, many researchers consider hand-writing a grammar for a natural language to be essentially an impossible task, so most research today focuses on statistical methods. I'm hoping (and currently believe) that Marpa will be a game-changer in this respect. Watch this space for further details, I guess. And wish me luck.

The second difference between computer languages and natural languages is that the tokens in a formal computer language are necessarily quite simple and well-marked. The tokenizer is a set of regular expressions and the token type is easily predicted - and never ambiguous.

In marked contrast, natural languages *love* ambiguity. The part of speech (the token type) of a given word can be quite impossible to know out of context. Marpa's own timeflies.t test case is an illustration of this ambiguity - the words "time", "flies", and even "like" can all take on different parts of speech and entirely different meanings, so the sentence "Time flies like an arrow" can itself have multiple meanings. And that's not all - the part of speech of a given word can't be determined from its appearance, unlike, say, an identifier in C. Instead, lexicalization of natural language words is a database-intensive task. You simply have to know detailed information about tens of thousands of words.

And that's just in English, where words are mostly separated by spaces when written down. Chinese and Japanese don't use spaces at all, and many languages use complicated systems of prefixes and suffixes to build words on the fly. English does this too, a little: we can put "un-" on the beginning of just about anything and still get some meaning from it, and novel words like "unmeaning" are immediately comprehensible even though they don't appear in any dictionary. In other languages, like Hungarian, "the table" is "az asztal", while the phrase "for the table", which uses a separate preposition in English, *changes* the Hungarian word so that it's "az asztalnak". (It gets worse: "that table" is "az az asztal", and "for that table" is "annak az asztalnak". And don't even get me started on Turkish, in which words of fifty syllables can easily be constructed - although this is essentially only done when you want to make people laugh.)

Tokenization is hard: outsource it

So in short, finding lexemes in natural languages - tokenizing them - is a hard problem. And as a result, a practical approach to natural language starts with a tokenizer. I've been working hard on just such a tokenizer for a few months now; it's not on CPAN yet, but it will appear under Lingua::Tok, with the lexical database work being done by one or more lexicon classes subclassing Lingua::Lex (including a German lexicon in Lingua::Lex::DE).

Naturally, I want to feed tokens from my dictionary-based tokenizer into Marpa, but all the existing examples I could find treat lexicalization as something very simple. In fact, "vanilla Marpa" simply tokenizes in the L0 layer of the grammar, which consists of regular expressions on plain old strings. This works great for formal languages like programming languages, but not so much for full natural language.
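For contrast, here's a minimal sketch of that vanilla approach - my own illustration, not taken from any particular Marpa example: the lexemes are L0 rules, essentially regular expressions, and Marpa scans the raw string itself.

use Marpa::R2;
use Data::Dumper;

# A toy "vanilla" SLIF grammar: tokenization happens entirely in the L0
# layer, with lexemes defined as character-class patterns over the input.
my $dsl = <<'EOF';
:default ::= action => [name,values]
lexeme default = latm => 1
:start ::= sentence
sentence ::= word+
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
EOF

my $g = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $value_ref = $g->parse( \'time flies like an arrow' );
print Dumper ($$value_ref);   # ['sentence', 'time', 'flies', 'like', 'an', 'arrow']

That's all it takes when a regular expression can decide what a token is; the rest of this article is about bypassing that L0 layer entirely.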

Fortunately, there is a perfectly serviceable API for feeding tokens to Marpa directly, and so the rest of this article introduces my proof-of-concept test script for doing that. The script is presented in its entirety here - there are a couple of sections that deserve a little extra discussion, though.

Any Marpa grammar breaks down into a general preamble, then top-level G1 rules followed by lower-level L0 rules. Here, I've explicitly reflected that by breaking these up into separate strings, which are concatenated into the full grammar when we build the parser. Here's the grammar I'm working with:

11 my $preamble = <<'EOF';
12 :default ::= action => [name,values]
13 lexeme default = latm => 1
14 :start ::= text
15 EOF
16 
17 my @tokens = ('n', 'aa', 'prep', 'unk');
18 my $tlist = join (' | ', @tokens);
19 
20 my $rules = <<"EOF";
21 text ::= heading | salad
22 heading ::= np
23 np ::= n | adjp n | np pp
24 pp ::= prep np
25 adjp ::= aa+
26 salad ::= lex+
27 lex ::= np | $tlist
28 EOF
29 
30 
31 my %types;
32 my $tokens = '';
33 foreach my $token (@tokens) {
34     $types{$token} = '_'.$token unless $token eq 'unk';
35     $tokens .= "$token ::= _$token\n";
36     $tokens .= "_$token ~ 'a'\n";
37 }

The grammar itself is pretty stupid - my original stab was just a "word salad", which parsed each series of words into ... a series of words, without further analysis. The idea here is that when I have a sentence that can't be parsed, I still want Marpa to tell me the parts it can parse, so a word salad can either be a bunch of undifferentiated words, or it might have noun phrases in it, and so on.

Turns out that doesn't work well, because Marpa just sees a whole lot of ambiguity and I get a lot of different possible "parses" back that consist of calling larger and larger parts of the sentence word salad. More on that approach later.

Anyway, each token has a type and a word. Marpa doesn't actually care about the word, so I just pass the word's offset in the token stream, along with a type that is either an actual guess at a part of speech, or 'unk' for unknown.

In line 26, we define a word salad as a series of lexemes, and in line 27, we define a lexeme as either a noun phrase or any one of the parts of speech we've identified as coming from the tokenizer.

In lines 33 through 37, we loop through our token types, define a G1 rule pairing each token type with an underscored version that will be an L0 lexer rule, and then define that lexer rule as a dummy. If the L0 rule isn't there, Marpa won't accept lexemes of that type, so there you go.
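The construction of the grammar object itself isn't shown above; it's just the three strings concatenated and handed to the SLIF constructor, along these lines (the $grammar variable is the one used in the token loop below):

# Concatenate the preamble, the G1 rules, and the generated token rules
# into one SLIF source string and build the grammar from it.
my $source  = $preamble . $rules . $tokens;
my $grammar = Marpa::R2::Scanless::G->new( { source => \$source } );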

Set up Lingua::Tok with appropriate lexica (lines 41-49)

The setup of the tokenizer itself is relatively unimportant for this test; essentially I'm recycling a common configuration I use elsewhere. The extra words lexicon is a local lexicon for any words the main lexicon doesn't catch, and the main lexicon is the standard German lexicon Lingua::Lex::DE. The test file 'test.ttx' is an actual translation file from a recent job.

Iterate through tokens, parsing as we go (lines 54-79)

The loop for token processing deserves a little comment, as it's mildly opaque:

54 while (1) {
55     my $token = $tok->token();
56     last if not defined $token;
57     if (ref $token eq 'ARRAY') {
58         next if $token->[0] eq 'S';
59         if ($token->[0] eq 'FSB') {
60             $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
61             my $input = ' ' x 100;
62             $recce->read(\$input, 0, 0);
63             @cursent = ();
64         } elsif ($token->[0] eq 'FSE') {
65             my $v = $recce->value();
66             out ($$v, @cursent);
67             #while ($v = $recce->value()) {
68             #    print " --- OR ---\n";
69             #    out ($$v, @cursent);
70             #}
71         } else {
72             my $type = $types{$token->[0]} || '_unk';
73             $recce->lexeme_read ($type, 1, 1, scalar @cursent);
74             push @cursent, $token;
75         }
76     } else {
77         die "Hit a non-array token $token";
78     }
79 }

To process tokens, we just grab one at a time with $tok->token(). Each token is an arrayref (although I'm a little paranoid that I might have missed something). The document model returns 'FSB' and 'FSE' tokens to mark the start and end of segments; each segment is nominally a sentence, so this is a good unit to pass into the parser. 'S' tokens are whitespace; here, I'm ignoring whitespace (in line 58).
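Just for orientation, a segment's worth of tokens looks roughly like this - the exact fields depend on what Lingua::Tok and the lexicon report, so treat this as illustrative only:

# Illustrative token stream for one segment. Each token is an arrayref
# whose first element is its type; 'FSB'/'FSE' bracket the segment and
# 'S' marks whitespace.
my @example_segment = (
    ['FSB'],
    ['aa',   'Beste'],
    ['S',    ' '],
    ['n',    'Partner'],
    ['S',    ' '],
    ['prep', 'für'],
    ['S',    ' '],
    ['aa',   'beste'],
    ['S',    ' '],
    ['n',    'Lösungen', 'Lösen+ung+en', 'n+f+p'],
    ['FSE'],
);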

So on an 'FSB' token, I build a scanner for the sentence to come (line 60). Line 61 might no longer be necessary, but Marpa doesn't like it when you tell it the start of a token is beyond the end of the input string, so I give it a long dummy string. Line 62 primes the scanner with a dummy read, and line 63 resets the current sentence, which is just the list of tokens we've pushed into the scanner.

On an 'FSE' token, we extract the first parse value from the scanner and send it to our output function; the commented-out loop (lines 67-70) would iterate over any additional parses of an ambiguous sentence.

Finally, on any other kind of token, we look up the token's type in our table of types the grammar will accept, falling back to '_unk' for anything unexpected. We use lexeme_read to push the token into the scanner - probably not quite correctly, since every token is marked as starting at position 1 with length 1. The value of the lexeme is just an index into the current list of tokens; then we push the token itself onto that list in line 74.

Routine to output a parsed sentence (lines 81-96)

The output function is relatively uninteresting, although I should note that unless you provide some indication of nested phrases, a tabular output of tokens and parse structure together is nearly impossible to comprehend.
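The routine isn't worth showing in full, but in outline it amounts to something like this - a depth-first walk over the [name, values] tree, printing a token line for each pre-terminal (whose value is the index we passed to lexeme_read) and a "name --> phrase" line for each larger constituent. Exactly which node types get a phrase line is cosmetic; here I skip the uninformative wrappers.

# Sketch of the output routine. $node is the parse tree; @words is the
# sentence's token list (@cursent at the call site).
sub out {
    my ($node, @words) = @_;
    walk ($node, \@words);
}

# Depth-first walk: print a token line for each pre-terminal and a
# "name --> phrase" line for each larger constituent, and return the
# surface words this node covers so the caller can build its own line.
sub walk {
    my ($node, $words) = @_;
    my ($name, @children) = @$node;
    if (@children == 1 and not ref $children[0]) {
        # Pre-terminal: its single child is the index we gave lexeme_read.
        my $token = $words->[$children[0]];
        print join ("\t", $name, $children[0], @$token), "\n";
        return $token->[1];    # the surface word
    }
    my @covered = map { walk ($_, $words) } @children;
    unless ($name eq 'lex' or $name eq 'text') {   # skip the uninformative wrapper nodes
        print "$name\t--> ", join (' ', @covered), "\n";
    }
    return @covered;
}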

Sample output

The resulting output file ends up looking pretty much like this:

aa	0	aa	Beste		
adjp	--> Beste
n	1	n	Partner		
np	--> Beste Partner
prep	2	prep	für		
aa	3	aa	beste		
adjp	--> beste
n	4	n	Lösungen	Lösen+ung+en	n+f+p
np	--> beste Lösungen
pp	--> für beste Lösungen
np	--> Beste Partner für beste Lösungen
salad	--> Beste Partner für beste Lösungen
n	0	n	Übersicht		
n	1	n	Produkt		
np	--> Produkt
unk	2	P	-		
n	3	n	Portfolio		
np	--> Portfolio
salad	--> Übersicht Produkt - Portfolio

Double-parse technique

As I said earlier, my original thought was to let Marpa return a heading (as a noun phrase) or a sentence if that's what it managed to parse correctly, and fall back to a "word salad" if it couldn't find a parse. The word-salad approach, however, potentially sees everything as a word salad, so any given phrase is ambiguous: it could be a noun phrase, or it could be a word salad with a smaller noun phrase inside it. That's useless if I want to ask Marpa about multiple valid parses for a given sentence. What I should probably do instead is parse first with a grammar that doesn't contain word salads at all; then, if Marpa can't parse the sentence with that grammar, re-parse in word-salad mode so that Marpa can at least chunk the identifiable parts of the sentence for me. A sketch of that two-pass fallback follows.
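Something like this should do it, assuming the sentence's tokens have been collected into @sentence first (the same list the loop above builds as @cursent), and assuming two grammar objects: $strict_grammar, built without the salad rules, and $salad_grammar, built from the grammar shown earlier.

# Try to parse one sentence's tokens against one grammar; return the parse
# value ref, or undef if Marpa rejects a token or finds no parse.
sub try_parse {
    my ($grammar, $sentence) = @_;
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input = ' ' x 100;                 # same dummy-input trick as above
    $recce->read (\$input, 0, 0);
    my $index = 0;
    foreach my $token (@$sentence) {
        my $type = $types{$token->[0]} || '_unk';
        defined (eval { $recce->lexeme_read ($type, 1, 1, $index++) })
            or return undef;               # token rejected: no parse this way
    }
    return eval { $recce->value() };       # undef if there's no parse
}

# First pass: the strict grammar. Second pass: fall back to word salad.
my $v = try_parse ($strict_grammar, \@sentence)
     || try_parse ($salad_grammar,  \@sentence);
out ($$v, @sentence) if $v;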

Which brings me to the notion of chunking. Marpa in salad mode seems to me to be an ideal chunker - no machine learning involved. It simply finds all the (say) noun phrases it can, given the lexical knowledge we already have. This will definitely be one thing I'll be investigating as I spend more time on NLP with Marpa.

And note that this technique of double parsing rests on the notion that Marpa parsers are quick to build. If we generalize that idea, we come to the possibility of building all kinds of parsers on the fly for very fine-tuned analysis of particular kinds of text. This is another promising thread of investigation I hope to follow up later.

A final note on chunking, yet another experimental series I hope to address: if I can safely identify all the words in a sentence but one, and if most of the sentence can therefore be chunked, it should be possible to make a reasonable guess at the part of speech of the remaining word by trying parts of speech until the sentence parses. That should be an excellent way to work through a corpus to discover parts of speech, so again - a promising thread I'll be looking at in the months to come.
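In terms of the try_parse sketch above, that experiment would be about this simple - given @sentence and $unknown_at, the index of the word the lexicon missed, just relabel that one token with each known part of speech and see which relabelings let the strict grammar parse:

# Collect every part of speech under which the whole sentence parses.
# (@tokens is the type list from the grammar section; 'unk' is excluded.)
my @guesses;
foreach my $pos (grep { $_ ne 'unk' } @tokens) {
    my @trial = @sentence;                   # copy the token list...
    my @relabeled = @{ $trial[$unknown_at] };
    $relabeled[0] = $pos;                    # ...and relabel just the unknown word
    $trial[$unknown_at] = \@relabeled;
    push @guesses, $pos if try_parse ($strict_grammar, \@trial);
}
print "possible parts of speech: @guesses\n";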

Handling ambiguity in parts of speech

Marpa is notable for its ability to backtrack over lexicalization, so that if a word is initially identified as a given part of speech, the tokenizer can "change its mind" and try a different tokenization. There are a couple of different ways to do that; the most obvious is simply to use the lexeme_alternative and lexeme_complete calls to provide Marpa with a set of alternative readings of a given token.
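In terms of the script above, that first approach means replacing the single lexeme_read call with a small group of calls whenever the lexicon reports more than one possible reading - say, a capitalized German word like "Beste" that could be an adjective or a noun. The position and length are the same dummies lexeme_read uses above.

# Offer Marpa two alternative readings of the same token, then complete the
# position once; Marpa keeps whichever alternatives can extend a parse.
$recce->lexeme_alternative ('_aa', scalar @cursent);   # read it as an adjective...
$recce->lexeme_alternative ('_n',  scalar @cursent);   # ...or as a noun
$recce->lexeme_complete (1, 1);                        # same dummy span as lexeme_read
push @cursent, $token;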

But Ruslan Shvedov proposes a fascinating alternative in email: simply define a new grammar for each sentence, feeding it specific lexemes appropriate to just that sentence. It's an elegant solution, and I might try it out at some point.

Going forward

So in short, it turns out to be quite possible to use Marpa with an external tokenizer, and I have to say that this little script has really given me a taste of how sweet it's going to be to have Marpa in my toolkit. There are a boatload of things I want to try it out on, and hopefully I'll be writing many more articles like this one.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.