Text plausibility metrics
Text plausibility is an interesting notion. There are two classes of "implausible text" I see in spam. The first is Bayes spoilers, which largely consist of (mostly) English words strung together, sometimes grammatically, sometimes not. This is most often seen in email spam, but I recently also saw it in the lottery spam pages as filler to make Google notebooks look legitimate.

To test plausibility of that category, we'd have to have some kind of parser or phrasal lookup or something. This will be difficult, but if you whistle a couple of bars, I'll bet I can fake it.

But there is another category of implausible text, which I've been seeing in forum spam, and frankly I have no idea what the spammer's goal is. Here's a sample from Donttasemeblog.com:

Commenter: whxydt dbaekt
Email:     bkigz@gmail.com
URL:       http://www.jdkzrwx.suwdcb.com
Text:      rhzbns uhspfvrbt ufmaso zvnpq tpdwxrzf vigbjsdfp ghuyc

What is this one doing? I don't know! The URL doesn't resolve, there are no links ... I have no idea. But I see a lot of it, so it's not just a test. It might be a probe to see what gets indexed by Google, since the words are obviously unique. But I just don't know!

A similar class of spam which I'm sure we've all seen has titles of the form "JApILjIxZkgzEOmMD" or "INPfYwTQQIkviegmINv" (both samples from the archive in the last five minutes.)

OK -- both of these are made of implausible words, and those words are easily recognizable because they don't have the right letter frequencies. What would be very nice is a quick, cheap metric I could apply to a string that would give the probability that the string is English, along with how reliable that assessment is (longer text = more reliable). Then I could use that metric to ding these.
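Here's a rough sketch of what such a metric might look like, in Python. It scores a string by the average log-likelihood of its letters under English letter frequencies, relative to a uniform "random typing" model, and returns the letter count alongside as a crude stand-in for reliability. The names ENGLISH_FREQ and english_score are mine, the frequency table holds approximate, commonly cited percentages, and the zero cutoff is an assumption rather than a tuned threshold.

```python
import math

# Approximate English letter frequencies in percent (commonly cited values;
# treat them as illustrative, not authoritative).
ENGLISH_FREQ = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702,
    "f": 2.228, "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153,
    "k": 0.772, "l": 4.025, "m": 2.406, "n": 6.749, "o": 7.507,
    "p": 1.929, "q": 0.095, "r": 5.987, "s": 6.327, "t": 9.056,
    "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150, "y": 1.974,
    "z": 0.074,
}
TOTAL = sum(ENGLISH_FREQ.values())
UNIFORM_LOGP = math.log(1 / 26)  # log-probability of a letter picked at random


def english_score(text):
    """Return (score, letter_count) for a string.

    score is the mean per-letter log-likelihood ratio of "English letter
    frequencies" versus "uniformly random letters". Positive leans English,
    negative leans random. letter_count says how much evidence there was.
    """
    letters = [c for c in text.lower() if c in ENGLISH_FREQ]
    if not letters:
        return 0.0, 0
    total = sum(
        math.log(ENGLISH_FREQ[c] / TOTAL) - UNIFORM_LOGP for c in letters
    )
    return total / len(letters), len(letters)


if __name__ == "__main__":
    samples = [
        "there are no links in the text and the url does not resolve",
        "rhzbns uhspfvrbt ufmaso zvnpq tpdwxrzf vigbjsdfp ghuyc",
        "JApILjIxZkgzEOmMD",
    ]
    for s in samples:
        score, n = english_score(s)
        print(f"{score:+.2f} over {n} letters: {s}")
```

Single-letter frequencies are about the cheapest signal available, and they would be fooled by random letters drawn in English proportions. Letter-pair (bigram) frequencies would probably discriminate better, at the cost of a bigger table.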

A fantastic list of word and letter frequencies in English.

Wiktionary has some useful links. Actually, wouldn't it be very useful to guess the language of a given text? Given the top 100 words and letter frequencies of a language, that ought to be possible with some given probability. Then if somebody posts in Spanish to an English forum (or vice-versa) it could be regarded with correspondingly increased skepticism.
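The common-word half of that idea could be sketched like this in Python. The word sets below are a tiny hand-picked sample of frequent function words, not the real top-100 lists, and COMMON_WORDS and guess_language are hypothetical names. The function counts how many tokens fall in each language's set and reports the winner along with the fraction of tokens that matched.

```python
import re

# Small illustrative samples of very common words; a real version would load
# the top-100 lists for each language instead.
COMMON_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "that", "it",
                "for", "was", "on", "with", "as", "be", "this", "have"},
    "spanish": {"de", "la", "que", "el", "en", "y", "los", "se",
                "del", "las", "un", "por", "con", "una", "es", "para"},
}


def guess_language(text):
    """Return (language, hit_fraction), or (None, 0.0) if nothing matched."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return None, 0.0
    hits = {
        lang: sum(w in vocab for w in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(hits, key=hits.get)  # ties go to the first language listed
    if hits[best] == 0:
        return None, 0.0
    return best, hits[best] / len(words)


if __name__ == "__main__":
    print(guess_language("The URL does not resolve and there are no links in the text"))
    print(guess_language("El gato se sentó en la mesa de la cocina"))
    print(guess_language("rhzbns uhspfvrbt ufmaso zvnpq"))
```

A (None, 0.0) result is itself a useful signal: text with no common words from any known language is either very short or gibberish of the kind shown earlier.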






This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.