Lottery spam example

This is a page linked from spam posted to the nowarblog forum just now. The page is hosted as a public notebook on Google Notebook's Belgian site at:

http://www.google.be/notebook/public/09289702668278931362/BDSVoIgoQrd3E6J8j

Note the auto-generated notebook name. This same spam contained 96 links to such notebooks (prefaced by the text "Your site is very usefuls. Thanks.").

The username is the first multi-digit number in the URL; Google Notebook helpfully gives us a list of notebooks by that user. There are 52 of them, all edited today, interestingly enough. (This alone would be a spam sign, of course, but I'm not sure exactly how to automate its detection.)
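Pulling the user number out of a link is at least easy to automate. Here is a minimal sketch, assuming every public notebook URL has the same shape as the one above (notebook / public / user number / notebook name). The function names are mine, made up for illustration, and this says nothing about how to get the list of a user's notebooks.

```python
from collections import Counter
from urllib.parse import urlparse

def notebook_user_id(url):
    """Return the numeric user ID from a public notebook URL, or None."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumed path shape: notebook / public / <user id> / <notebook id>
    if (len(parts) >= 4 and parts[0] == "notebook"
            and parts[1] == "public" and parts[2].isdigit()):
        return parts[2]
    return None

def links_per_user(urls):
    """Count how many links in one message point at each notebook user."""
    ids = (notebook_user_id(u) for u in urls)
    return Counter(i for i in ids if i is not None)

url = "http://www.google.be/notebook/public/09289702668278931362/BDSVoIgoQrd3E6J8j"
print(notebook_user_id(url))
```

A single comment carrying dozens of links that all resolve to one user number would be a cheap signal in itself, though where to set the threshold is a guess.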

Also, we learn that Flemish for "notebook" is "kladblok," which is a very useful thing to know!

The structure of the notebook text is kind of cool, actually. The payload link is up at the top, linking off to a forwarding link, which forwards to another forwarding link, which hits a page at www.casinogoldsun.com, which has links to a numeric-IP server with encoded parameters, which forwards to google.com if your headers don't include a believable User-Agent (I suspect). At any rate, in context, this reeks to high heaven when you do link analysis.
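The chain-following part of that analysis can be sketched without touching the network. This is only a sketch under assumptions: next_hop is a hypothetical callable that you would have to supply (something that fetches a URL and returns where it forwards to, or None), and the hosts in the example are made-up placeholders, not the real spam hosts.

```python
from urllib.parse import urlparse

def follow_chain(start_url, next_hop, max_hops=10):
    """Return the list of URLs visited, starting with start_url.

    next_hop(url) is a caller-supplied (hypothetical) function returning
    the URL that `url` forwards to, or None when the chain ends.
    """
    chain = [start_url]
    seen = {start_url}
    while len(chain) <= max_hops:
        target = next_hop(chain[-1])
        if target is None or target in seen:  # end of chain, or a loop
            break
        chain.append(target)
        seen.add(target)
    return chain

def distinct_hosts(chain):
    """List the hostnames in a chain, in first-seen order."""
    hosts = []
    for url in chain:
        host = urlparse(url).hostname
        if host and host not in hosts:
            hosts.append(host)
    return hosts

# Made-up forwarding table standing in for real HTTP requests.
fake_hops = {
    "http://a.example/go": "http://b.example/fwd",
    "http://b.example/fwd": "http://c.example/landing",
}
chain = follow_chain("http://a.example/go", fake_hops.get)
print(len(chain) - 1, "hops across", distinct_hosts(chain))
```

Several forwards across several unrelated hosts before any real content shows up is the sort of thing that reeks. The cutoff for "several" is a judgment call I haven't tested.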

Below that link, however, is a great deal of text which links to other notebooks in the group of notebooks. It would be interesting to track those links, but we'll leave that for another day. What's got me sitting up straight right now (figuratively speaking) is the text itself. It's obviously autogenerated in order to make the page look real, in much the same way that text meant to foil Bayesian filters is inserted into email spam. It smells the same, you know? Here's a short sample:

The land lottery in washington co ga two envelopes problem is a puzzle or
paradox within the subjectivistic interpretation of probability theory; more
specifically within Bayesian new york state lottery numbers decision theory.
This is still an find winning repeat lottery numbers Lucky Lottery Numbers open
missouri lottery raffle problem among the subjectivists as no consensus has been
reached yet.

That's what made me start thinking: one of the source texts used is itself an article about Bayesian analysis! It's interspersed at random with set phrases related to the spam topic (here, state lotteries; the linked site is more generally about online gambling, and I'm not ruling out attempts to inject botnet viruses somewhere in there).

Surely, he said, it would be possible to detect this? Like, some kind of grammar-plausibility scanner that wouldn't be too expensive to run? That would be useful, wouldn't it? Yes, it would. Watch this space for further details, I guess. (You might be watching for a while.)
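To be clear, what follows is not a grammar-plausibility scanner. It is a much cruder stand-in that I am only assuming might be a useful first pass: find the most frequent content word and see what share of all the words it takes up. The stopword list and the function name are invented for this sketch, and I haven't calibrated any threshold against real pages.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "this", "that", "with", "within", "more", "still",
             "has", "been", "among", "and", "yet"}

def top_keyword_density(text):
    """Return (word, share) for the most frequent content word.

    share is that word's count divided by the total number of words.
    """
    words = re.findall(r"[a-z]+", text.lower())
    content = [w for w in words if len(w) > 2 and w not in STOPWORDS]
    if not content:
        return None, 0.0
    word, count = Counter(content).most_common(1)[0]
    return word, count / len(words)
```

Run over the sample above, this picks out "lottery", which is the stuffed topic word. Whether ordinary prose stays reliably below whatever cutoff you choose is exactly the part that would need testing.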

Anyway, Googling on the second sentence, after reconstructing it ("This is still an open problem among the subjectivists as no consensus has been reached yet"), gives us a link to Wikipedia (an article on the "two envelopes problem", a paradox in probability theory). It also gives us a number of links to other Webspam. I find this truly fascinating. Incidentally, one of the Webspam pages is a search engine foiler for "envelopes". That's just freaky.
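I did that reconstruction by eye, but if you already had the list of set phrases, it would be mechanical. Here is a minimal sketch, assuming the three phrases below are the injected ones (that split is my reading of the sample, not something the spam tells you). The function name is made up.

```python
import re

def strip_phrases(text, phrases):
    """Remove each injected phrase (case-insensitive) and tidy whitespace."""
    # Longest first, so a short phrase cannot break up a longer one.
    for phrase in sorted(phrases, key=len, reverse=True):
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())

# Second sentence of the sample, as it appears in the notebook.
SPAMMED = ("This is still an find winning repeat lottery numbers Lucky Lottery "
           "Numbers open missouri lottery raffle problem among the subjectivists "
           "as no consensus has been reached yet.")

# Assumed set phrases, picked out by hand.
PHRASES = ["find winning repeat lottery numbers",
           "Lucky Lottery Numbers",
           "missouri lottery raffle"]

print(strip_phrases(SPAMMED, PHRASES))
```

The hard part, of course, is getting the phrase list in the first place. Collecting phrases that recur across many notebooks from the same user might be one way in, but I haven't tried it.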