To test plausibility of that category, we'd have to have some kind of parser or phrasal lookup or something. This will be difficult, but if you whistle a couple of bars, I'll bet I can fake it.
But there is another category of implausible text, which I've been seeing in forum spam, and frankly I have no idea what the spammer's goal is. Here's a sample from Donttasemeblog.com:
Commenter: whxydt dbaekt Email: email@example.com URL: http://www.jdkzrwx.suwdcb.com Text: rhzbns uhspfvrbt ufmaso zvnpq tpdwxrzf vigbjsdfp ghuyc
What is this one doing? I don't know! The URL doesn't resolve, there are no links ... I have no idea. But I see a lot of it, so it's not just a test. It might be a probe to see what gets indexed by Google, since the words are obviously unique. But I just don't know!
A similar class of spam, which I'm sure we've all seen, has titles of the form "JApILjIxZkgzEOmMD" or "INPfYwTQQIkviegmINv" (both samples from the archive in the last five minutes).
OK -- both of these are made of implausible words, and those words are easily recognizable because they don't have the right letter frequencies. What would be very nice is a quick, cheap metric I could apply to a string that would give the probability that the string is English, along with how reliable that assessment probably is (longer text = more reliable). Then I could use that metric to ding these.
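Here's a rough sketch of what such a metric might look like. It's my own illustration, not something tested against real spam: it compares each letter's likelihood under approximate English letter frequencies against a uniform "random typing" model, then turns the summed log-likelihood ratio into a probability, assuming equal priors and independent letters. The frequency table holds rounded, commonly quoted values, and the name `english_probability` is made up for this example.

```python
import math

# Approximate English letter frequencies in percent (rounded, commonly
# quoted values; an assumption of this sketch, not measured data).
ENGLISH_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7,
    "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8,
    "u": 2.8, "m": 2.4, "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0,
    "p": 1.9, "b": 1.5, "v": 0.98, "k": 0.77, "j": 0.15, "x": 0.15,
    "q": 0.095, "z": 0.074,
}
TOTAL = sum(ENGLISH_FREQ.values())
UNIFORM = 1.0 / 26  # probability of each letter under "random typing"


def english_probability(text):
    """Return (probability the text is English, number of letters used).

    Two-class naive model: English letter frequencies vs. uniformly
    random letters, with equal priors. Letters are treated as
    independent, so the result is overconfident on long strings.
    """
    letters = [c for c in text.lower() if c in ENGLISH_FREQ]
    if not letters:
        return 0.5, 0  # no evidence either way
    # Summed log-likelihood ratio: positive favours English.
    llr = sum(math.log((ENGLISH_FREQ[c] / TOTAL) / UNIFORM) for c in letters)
    llr = max(-50.0, min(50.0, llr))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-llr)), len(letters)


if __name__ == "__main__":
    for sample in (
        "there are no links and the words are obviously unique",
        "rhzbns uhspfvrbt ufmaso zvnpq tpdwxrzf vigbjsdfp ghuyc",
        "JApILjIxZkgzEOmMD",
    ):
        print(sample, english_probability(sample))
```

Because the evidence accumulates letter by letter, a short string stays closer to 0.5 and a long one gets pushed toward 0 or 1, which is roughly the "longer text = more reliable" behavior I was after. Single-letter frequencies are a blunt instrument, though; letter pairs would probably separate real words from noise much better.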
Wiktionary has some useful links. Actually, wouldn't it be very useful to guess the language of a given text? Given the top 100 words and the letter frequencies of a language, that ought to be possible with some quantifiable confidence. Then if somebody posts in Spanish to an English forum (or vice versa) it could be regarded with correspondingly increased skepticism.
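A minimal sketch of the common-word half of that idea follows. The word lists are tiny hand-picked stand-ins rather than real top-100 lists, the letter-frequency half is left out for brevity, and `guess_language` is a hypothetical name, so treat it as a toy and not a working detector.

```python
import re

# Small illustrative samples of very common words. A real version would
# load the top 100 or so words per language from a frequency list.
COMMON_WORDS = {
    "english": {"the", "of", "and", "to", "a", "in", "is", "that", "it",
                "for", "was", "on", "with", "as", "be", "this", "are",
                "not", "have", "but"},
    "spanish": {"el", "la", "de", "que", "y", "en", "un", "una", "los",
                "las", "por", "con", "no", "para", "es", "se", "del",
                "al", "lo", "como", "a"},
}


def guess_language(text):
    """Return (best language or None, that language's share of word hits)."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in COMMON_WORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return None, 0.0  # no common words from any known language
    best = max(hits, key=hits.get)
    return best, hits[best] / total


if __name__ == "__main__":
    print(guess_language("The cat is on the table and it is not for sale"))
    print(guess_language("El gato no es de la casa que los vecinos tienen"))
    print(guess_language("rhzbns uhspfvrbt ufmaso"))
```

A text that matches no common word in any language comes back as `None`, which is itself a useful signal for the gibberish spam above.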