2014-05-21

Oh, now this is cool. My lexicon has also been a convenient place to collect n-grams, but I've never found the raw n-grams to be all that helpful. Turns out (as is so often the case) I've been doing it wrong. There's a boatload of statistical measures for n-grams that try to capture how much more frequently a collocation appears than would be expected by sheer chance, given the probabilities of the individual words.
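To give a flavor of what such a measure does, here's a minimal sketch of one of them, pointwise mutual information (PMI), in Python. It compares how often a word pair actually shows up against how often it would show up if the two words were independent. This is an illustration only, not how Text::NSP computes anything: the function name `bigram_pmi` is made up, and it assumes the text has already been tokenized into a flat list of words.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    Hypothetical helper for illustration; assumes `tokens` is a flat
    list of already-tokenized words.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # observed probability of the pair
        p_w1 = unigrams[w1] / n_uni    # probability of each word alone
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0: the pair occurs more often than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    sample = "a b a b c d".split()
    for pair, score in sorted(bigram_pmi(sample).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

One caveat worth knowing: PMI tends to overrate pairs of very rare words, so in practice it's usually combined with a minimum frequency cutoff or swapped for a measure like log-likelihood.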

The entire boatload appears to be encoded in Text::NSP, which, from a cursory examination of its chief command, count.pl, appears to be readily adaptable to my tokenizer. That's entirely cool. I'm looking forward to getting set up to the point that I can get serious about my 10-million-word corpus of technical German. (Yeah, that's how much I've translated in my career so far, starting in about 2004 and continuing to today.)

Soon, compadre. Soon!

