Something I've wanted to do for a couple of years now is a sort of online chatbot framework: a testbed for different language-analysis techniques that could be played with online and tested against real people.
An extension would be to connect a given chatbot to some other chatbot out there somewhere, and see them talk to each other. That could be fun.
The basic framework for this kind of venture could be pretty simple, but could, of course, end up arbitrarily complex. You'd need some kind of principled semantic framework, one that would start as a simple box with words in it and ramify through increasingly sophisticated syntactic and semantic analyses. The idea is a framework that can support everything from simplistic single-word pattern matching to select a response, through sentence frames that extract subject patterns to be manipulated in the response, right up to a hypothetical Turing-complete NLP parser.
The session would contain a list of facts and "stuff" which corresponds to the, I dunno, dialog memory of a conversation. There could optionally be some kind of database memory of earlier conversations with a given contact. Again, this would run the gamut from simple named strings to be substituted into a response pattern, to complete semantic structures of unknown nature, which would be used to generate more sophisticated conversation.
Then the third and final component would be the definition of a chatbot itself. This would consist of a set of responses to given situations (a situation being the current incoming string plus whatever semantic structure has been accumulated during the course of the conversation). There could be a spontaneity "response", i.e. something new said after some period of time without an answer from the other side. Again -- it should be possible to start small and stupid, with simple word patterns, random-response lists, and the like, and build upwards to more complicated semantics.
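The "small and stupid" end of that spectrum fits in a few lines. Here's a minimal sketch of a chatbot defined as an ordered list of pattern/response pairs, with captures from the pattern substituted into the response; all names and patterns are invented for illustration, not part of any existing framework:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A chatbot as an ordered list of [pattern => response] pairs.
# The first matching pattern wins; $1-style placeholders in the
# response are filled from the pattern's captures.
my @rules = (
    [ qr/\bmy name is (\w+)/i => 'Nice to meet you, $1.' ],
    [ qr/\bhello\b/i          => 'Hello yourself!'       ],
    [ qr/(?:)/                => 'Tell me more.'         ],  # always-true fallback
);

sub respond {
    my ($input) = @_;
    for my $rule (@rules) {
        my ($pattern, $response) = @$rule;
        if (my @captures = $input =~ $pattern) {
            # Substitute $1, $2, ... with the captured text.
            $response =~ s/\$(\d+)/$captures[$1 - 1]/ge;
            return $response;
        }
    }
}

print respond('Well, my name is Ishmael'), "\n";  # Nice to meet you, Ishmael.
print respond('hello there'), "\n";               # Hello yourself!
```

From here, "building upwards" means replacing the flat rule list with something that also consults the accumulated conversation state before choosing a response.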
The ability to detect and switch languages would be of great use, of course, and there should be some kind of facility for that as well.
Wouldn't it be nice to be able to build a chatbot for language practice in, say, Klingon or Láadan? I mean, how else could you reasonably practice a constructed language?
Anyway, when I have time, I'll certainly be doing something with this idea. Any year now, yessir, any year.
So I've long had a thing about doing NLP-type stuff in Perl. I know, I know. All the cool kids use NLTK in Python. So why Perl?
As always, the answer is CPAN. I can get a good, quick start in nearly anything by installing a CPAN module, and I know it has been tested on Windows already thanks to CPANtesters. And anything I write will be tested six ways from Sunday, too.
A few years ago I hacked out the beginnings of a tokenizer for NLP usage. It really just consisted of a convenient iterator wrapper around some very simple regexes, along with some n-gram type stuff for collocations (not that I've ever had much luck with those - yet). I've revived it and I've been tossing some actual translation jobs at it to see what sticks, and it's nearly ready for release.
I had the revelation, though, that what even NLTK is missing for practical use is document handling: it's a mess trying to retrieve information from real documents. So my tokenizer explicitly works with a source document, which can deliver a series of text and formatting commands in a pre-tokenization step. The formatting commands are passed straight through by the tokenizer.
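The iterator-plus-passthrough idea can be sketched roughly like this; the names and the representation of formatting commands are made up here for illustration, not the released module's API:

```perl
use strict;
use warnings;

# A tokenizer as a closure-based iterator. The input stream mixes plain
# text with formatting commands (represented as array refs); text is
# split into word/punctuation tokens with a very simple regex, while
# formatting commands are passed through untouched.
sub make_tokenizer {
    my @stream = @_;
    my @pending;
    return sub {
        until (@pending or not @stream) {
            my $item = shift @stream;
            if (ref $item) {
                push @pending, $item;   # formatting command: pass through
            }
            else {
                # Simplistic tokenization: runs of word chars, or
                # single non-space punctuation characters.
                push @pending,
                    map { [ $_ =~ /\w/ ? 'word' : 'punct', $_ ] }
                    $item =~ /(\w+|[^\w\s])/g;
            }
        }
        return shift @pending;          # undef when exhausted
    };
}

my $next = make_tokenizer('Hello, world', [ italic => 1 ], 'fin');
while (my $tok = $next->()) {
    print join(' ', @$tok), "\n";
}
```

A downstream consumer that only cares about words just skips the non-text tokens; one that reconstructs the document keeps them in order.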
Along the way, I realized that to do part-of-speech tagging I was going to need a lexicon. I've got a dumb model of a lexicon running against SQLite (which will be good for job-specific vocabulary), but for the main lexicon in German, it just isn't possible to get around the morphological structure of the German language. So I'm currently adapting the igerman98 ispell dictionary. Its affix file is a pretty good run-down of German morphology, although it doesn't encode parts of speech very accurately. (Nouns are capitalized, of course, and adjectives/adverbs are pretty much "A"-flagged decliners.)
There's going to be a lot of tweaking involved, but the end result is going to be a pretty good data-based lexicon that can probably fall back on some educated guesses for parts of speech of unknown words.
Here's the kicker. If the part of speech is ambiguous at the word level, Marpa can simply figure it out from context (usually). I think I have a good plan for this, but until I have a reasonable coverage of parts of speech in my lexicon, I won't have anything to experiment with yet.
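The mechanism I have in mind is Marpa's ability to accept more than one token at the same input position. This is only a sketch under my reading of the Marpa::R2 "thick" recognizer interface (the toy grammar and word lists are invented), but the shape of it would be:

```perl
use strict;
use warnings;
use Marpa::R2;   # CPAN module

# Toy grammar: Pronoun Verb Article Noun. If the lexicon can't decide
# whether a surface form is a noun or a verb, offer Marpa both readings
# at the same position and let the grammar discard the one that can't
# participate in a parse.
my $grammar = Marpa::R2::Grammar->new({
    start => 'Sentence',
    rules => [ [ Sentence => [qw(Pronoun Verb Article Noun)] ] ],
});
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new({ grammar => $grammar });

# "we fish the nets" -- 'fish' and 'nets' are ambiguous at word level.
$recce->read('Pronoun', 'we');
$recce->alternative('Noun', \'fish', 1);   # lexicon guess 1 (rejected here)
$recce->alternative('Verb', \'fish', 1);   # lexicon guess 2
$recce->earleme_complete();                # both offered; the grammar decides
$recce->read('Article', 'the');
$recce->alternative('Noun', \'nets', 1);
$recce->alternative('Verb', \'nets', 1);   # rejected here
$recce->earleme_complete();

# Only the Verb reading of 'fish' and the Noun reading of 'nets'
# survive, so exactly one parse comes back.
print $recce->value() ? "parsed\n" : "no parse\n";
```

The lexicon's job, then, is just to produce a reasonable set of candidate tags per word; the grammar does the disambiguation.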
Soon, though, I'm going to be able to make some specific contributions to making NLP in Perl a reality.
I've been talking about doing this for a long, long time indeed. It's exciting to be actually making progress with it for a change.
Oh, now this is cool. My lexicon has also been a convenient place to collect n-grams, but I've never found the raw n-grams to be all that helpful. Turns out (as is so often the case) I've been doing it wrong. There are a boatload of statistical measures of n-grams that try to capture how much more frequently a collocation appears than would be expected by sheer chance given the probability of the individual words.
The entire boatload appears to be encoded in Text::NSP, whose chief command, count.pl, looks from a cursory examination to be readily adaptable to my tokenizer. That's entirely cool. I'm looking forward to getting set up to the point that I can get serious about my 10-million-word corpus of technical German. (Yeah, that's how much I've translated in my career so far, starting in about 2004 and continuing to today.)
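To make that concrete, here's the simplest member of the boatload, pointwise mutual information, in miniature. This isn't Text::NSP's code, just the underlying idea, and the counts are invented for illustration:

```perl
use strict;
use warnings;

# Pointwise mutual information for a bigram: how much more often the
# pair occurs than chance predicts from the individual word counts.
sub pmi {
    my ($pair_count, $w1_count, $w2_count, $total) = @_;
    my $observed = $pair_count / $total;
    my $expected = ($w1_count / $total) * ($w2_count / $total);
    return log($observed / $expected) / log(2);   # log base 2
}

# Suppose a 1000-bigram corpus where word A occurs 10 times, word B
# occurs 20 times, and the bigram "A B" occurs 5 times. Chance predicts
# 0.2 occurrences, so the pair is wildly overrepresented:
printf "PMI = %.2f bits\n", pmi(5, 10, 20, 1000);   # prints PMI = 4.64 bits
```

Raw n-gram counts bury collocations like this under high-frequency noise ("und der", "in dem"); the association measures are what dig them back out.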
Soon, compadre. Soon!
I'm getting hip-deep in German lexicography these days; since I can now actually, realistically tokenize my translation documents, I keep trying to, well, tokenize my translation documents.
I've broken things up into Lingua::Lex, which is general tools for management of a lexicon, and Lingua::Lex::DE, which is specifically my German lexicon. This allows me to test things at the level of the mechanism in one place, and at the level of specific lexical rules in another. (Note: I haven't gotten as far as testing specific lexical rules, but the structure is there, anyway.)
The distribution version of Lingua::Lex::DE is flat files in a convenient ASCII-only encoding; as part of the setup procedure, an SQLite database is built for the actual lexical work. I'm not sure how big the actual lexicon will end up being yet, but judging from the igerman98 distro, it's not frighteningly large or impossible for CPAN to handle. We'll see.
The lexical rules from igerman98 only scratch the surface - and they're grammatically naive, as there was no reason to try to encode parts of speech, so for instance any word that takes an 's' on the end can be lumped into the same flag no matter why it takes that 's'. Is it genitive? Plural? Something else? A spell checker doesn't care, but a lexicon driving a parser does.
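Here's a sketch of what splitting such a flag looks like. The flag names and tag scheme are invented for illustration (igerman98 uses its own flag letters), but the point is that the same surface suffix gets grammatically distinct rules:

```perl
use strict;
use warnings;

# One spell-checker flag "word may take -s" becomes two tagged rules,
# so the lexicon can report *why* a form exists, not just that it does.
my %rules = (
    # flag => list of [ suffix, tag of the resulting form ]
    'S_GEN' => [ [ 's', 'noun:gen:sg' ] ],
    'S_PL'  => [ [ 's', 'noun:nom:pl' ] ],
);

sub expand {
    my ($stem, @flags) = @_;
    my @forms;
    for my $flag (@flags) {
        for my $rule (@{ $rules{$flag} || [] }) {
            my ($suffix, $tag) = @$rule;
            push @forms, [ $stem . $suffix, $tag ];
        }
    }
    return @forms;
}

# 'Auto' takes -s for both genitive singular and plural. A spell
# checker needs one rule; a lexicon driving a parser needs both:
for my $form (expand('Auto', 'S_GEN', 'S_PL')) {
    print "@$form\n";   # Autos noun:gen:sg / Autos noun:nom:pl
}
```

Multiplied across every flag in the affix file, that's where the tweaking goes.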
So there's work to be done, and I have the framework hammered out. I've made a lot of progress with it; the compounding mechanism works, and suffixes are mostly working as of today. Once I start trying to tokenize real text, improvements should proceed apace. I figure maybe another week before I'm to the point of trying to feed these token streams to Marpa - but that's not too long at all!
Looking at the adoption list of CPAN modules (these are ranked by a need score based on the number of issues registered, the number of dependencies, and the age of the issues), there are actually quite a few in the Lingua namespace.
It would probably be character-building and instructive to adopt a few and fix them up.