2014-05-21 nlp

Oh, now this is cool. My lexicon has also been a convenient place to collect n-grams, but I've never found the raw n-grams to be all that helpful. Turns out (as is so often the case) I've been doing it wrong. There is a boatload of statistical measures for n-grams that try to capture how much more frequently a collocation appears than would be expected by sheer chance, given the probabilities of the individual words.
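To make that concrete for myself, here is a minimal sketch of one such measure, pointwise mutual information (PMI), for bigrams. It compares the observed probability of a word pair with the probability you would expect if the two words were independent. The function name `bigram_pmi` and the toy token list are my own placeholders, not anything taken from Text::NSP, and the probability estimates are the simplest possible ones (plain relative frequencies, no smoothing).

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with all probabilities
    estimated as relative frequencies in the token list (an assumption
    of this sketch; real tools offer more careful estimates).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_expected = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_expected)
    return scores


if __name__ == "__main__":
    # Hypothetical toy input; a real run would use tokenizer output.
    toy = "the hot water tank feeds the hot water pipe".split()
    ranked = sorted(bigram_pmi(toy).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 3))
```

A score above zero means the pair shows up more often than chance would predict. PMI is known to flatter rare pairs, so in practice it is usually combined with a minimum-count cutoff or swapped for a measure such as log-likelihood.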

The entire boatload appears to be encoded in Text::NSP, whose chief command, count.pl, looks from a cursory examination to be readily adaptable to my tokenizer. That's entirely cool. I'm looking forward to getting set up well enough that I can get serious about my 10-million-word corpus of technical German. (Yeah, that's how much I've translated in my career so far, starting in about 2004 and continuing to today.)

Soon, compadre. Soon!

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.