Perl and NLP

2014-05-20 nlp

So I've long had a thing about doing NLP-type stuff in Perl. I know, I know. All the cool kids use NLTK in Python. So why Perl?

As always, the answer is CPAN. I can get a good, quick start in nearly anything by installing a CPAN module, and I know it has already been tested on Windows thanks to CPAN Testers. And anything I write will be tested six ways from Sunday, too.

So Perl.

A few years ago I hacked out the beginnings of a tokenizer for NLP usage. It really just consisted of a convenient iterator wrapper around some very simple regexes, along with some n-gram type stuff for collocations (not that I've ever had much luck with those - yet). I've revived it and I've been tossing some actual translation jobs at it to see what sticks, and it's nearly ready for release.
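To make "an iterator wrapper around some very simple regexes" concrete, here is a minimal sketch of the idea. It is written in Python only to keep it short and runnable; it is not the actual tokenizer, and the pattern and the names `tokens`, `ngrams`, and `bigram_counts` are assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: a word (allowing internal hyphens or apostrophes)
# or a single punctuation character. Whitespace is skipped.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokens(text):
    """Yield tokens lazily, one regex match at a time."""
    for match in TOKEN_RE.finditer(text):
        yield match.group(0)

def ngrams(items, n=2):
    """Yield consecutive n-tuples from any iterable of tokens."""
    window = []
    for item in items:
        window.append(item)
        if len(window) > n:
            window.pop(0)
        if len(window) == n:
            yield tuple(window)

def bigram_counts(text):
    """Count lowercased word bigrams, ignoring punctuation tokens."""
    words = (t.lower() for t in tokens(text) if t[0].isalnum())
    return Counter(ngrams(words, 2))
```

Because both `tokens` and `ngrams` are generators, nothing is held in memory beyond the current window, and the bigram counts are the raw material a collocation finder would start from.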

I had the revelation, though, that what even NLTK is missing in practical use is a clean way to get text out of real documents; retrieving information from them is a mess. So my tokenizer explicitly works with a source document, which delivers a stream of text and formatting commands in a pre-tokenization step. The tokenizer passes the formatting commands straight through.
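The pass-through idea can be sketched roughly like this. Again, this is Python for illustration only, and the event shape (`("text", ...)` and `("format", ...)` pairs) and the name `tokenize_stream` are assumptions, not the real interface.

```python
import re

# Hypothetical pattern: a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize_stream(events):
    """Split ("text", str) events into tokens; pass all other events through untouched."""
    for kind, payload in events:
        if kind == "text":
            for match in TOKEN_RE.finditer(payload):
                yield ("token", match.group(0))
        else:
            yield (kind, payload)

# Hypothetical output of a source-document reader for a sentence
# in which one word is set in bold.
events = [
    ("text", "Bitte "),
    ("format", "bold-on"),
    ("text", "sofort"),
    ("format", "bold-off"),
    ("text", " antworten."),
]
result = list(tokenize_stream(events))
```

The point of the design is that whatever consumes the tokens still sees exactly where the formatting sat, so it can be put back in the right place after the text has been processed.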

Along the way, I realized that to do part-of-speech tagging I was going to need a lexicon. I've got a dumb model of a lexicon running against SQLite (which will be good for job-specific vocabulary), but for the main lexicon in German, it just isn't possible to get around the morphological structure of the German language. So I'm currently adapting the igerman98 ispell dictionary. Its affix script is a pretty good run-down of German morphology, although it doesn't encode parts of speech very accurately. (Nouns are capitalized, of course, and adjectives/adverbs are pretty much "A"-flagged decliners.)
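As a rough picture of what a "dumb" SQLite lexicon with guessing might look like, here is a sketch in Python (illustration only). The table layout, the tag names, and the ending list are assumptions; in particular, the ending list is a toy stand-in and has nothing to do with the actual igerman98 affix rules.

```python
import sqlite3

def make_lexicon(entries):
    """Build an in-memory lexicon from (word, part-of-speech) pairs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE lexicon (word TEXT, pos TEXT, PRIMARY KEY (word, pos))")
    db.executemany("INSERT INTO lexicon VALUES (?, ?)", entries)
    return db

# Toy list of adjective endings, longest first. Not the igerman98 rules.
ADJ_ENDINGS = ("en", "em", "er", "es", "e")

def lookup(db, word):
    """Return the possible parts of speech for a word, guessing if it is unknown."""
    rows = db.execute(
        "SELECT pos FROM lexicon WHERE word = ? ORDER BY pos", (word,)
    ).fetchall()
    if rows:
        return [pos for (pos,) in rows]
    # Unknown form: try stripping an adjective ending and looking up the stem.
    for ending in ADJ_ENDINGS:
        if word.endswith(ending):
            stem = word[: -len(ending)]
            hit = db.execute(
                "SELECT 1 FROM lexicon WHERE word = ? AND pos = 'ADJ'", (stem,)
            ).fetchone()
            if hit:
                return ["ADJ"]
    # Educated guess: capitalized unknown words are probably nouns.
    if word[:1].isupper():
        return ["NOUN?"]
    return []

db = make_lexicon([
    ("schnell", "ADJ"),
    ("Haus", "NOUN"),
    ("sein", "VERB"),
    ("sein", "PRON"),
])
```

With this data, `lookup(db, "sein")` returns both readings, `lookup(db, "schnelles")` falls back to the stem `schnell`, and an unknown capitalized word such as `Fahrrad` comes back as a guessed noun. The capitalization guess would need extra care for sentence-initial words.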

There's going to be a lot of tweaking involved, but the end result is going to be a pretty good data-based lexicon that can probably fall back on some educated guesses for parts of speech of unknown words.

Here's the kicker. If the part of speech is ambiguous at the word level, Marpa (the Earley-based parser on CPAN) can simply figure it out from context (usually). I think I have a good plan for this, but until I have reasonable coverage of parts of speech in my lexicon, I won't have anything to experiment with.
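To show what "figure it out from context" means, here is a deliberately crude stand-in in Python. It does not use Marpa and is not how a real parser works internally: it simply tries every combination of candidate tags and keeps the ones a toy sentence pattern accepts. The pattern, the tag names, and the function `disambiguate` are all assumptions for illustration.

```python
import re
from itertools import product

# Toy "grammar" over tag sequences: a noun phrase is an optional determiner,
# any number of adjectives, then a noun. A sentence is NP VERB [NP].
NP = r"(?:DET )?(?:ADJ )*NOUN "
SENTENCE = re.compile(rf"{NP}VERB (?:{NP})?")

def disambiguate(candidates):
    """candidates: list of (word, [possible tags]).

    Return every tag sequence that the toy grammar accepts.
    """
    readings = []
    for tags in product(*(tags for _, tags in candidates)):
        if SENTENCE.fullmatch("".join(tag + " " for tag in tags)):
            readings.append(tags)
    return readings

# "die" and "das" can each be a determiner or a pronoun at the word level.
sentence = [
    ("die", ["DET", "PRON"]),
    ("alte", ["ADJ"]),
    ("Frau", ["NOUN"]),
    ("sieht", ["VERB"]),
    ("das", ["DET", "PRON"]),
    ("Haus", ["NOUN"]),
]
readings = disambiguate(sentence)
```

Of the four tag combinations, only the one that reads both ambiguous words as determiners fits the pattern. Brute-force enumeration blows up quickly on longer sentences, which is the reason to hand the ambiguity to a proper parser rather than filter combinations like this.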

Soon, though, I'm going to be able to make some specific contributions to making NLP in Perl a reality. I've been talking about doing this for a long, long time indeed. It's exciting to be actually making progress with it for a change.