modbot, google counter, and the spam archive

Three related things today. First, the scripts I've been putting together for forum spam blocking have more or less coalesced into a "modbot". The program tries to automate the tasks a human moderator performs, and could in principle be dropped into any Web spam moderation setup. It is currently running happily in an Eboard 4.0 installation and blocking roughly 93% of spam, while still allowing anonymous posting to that forum. I'll be packaging it up for release into the public domain. Watch this space for further details -- one of the more fascinating notions I've had is to let it receive moderation emails from Blogger and thus automate comment moderation there.

One of the rules/tools used by the modbot is to count Google hits for the numeric IP of an untrusted poster. It turns out that HTTP proxies have a real proclivity for getting indexed. A lot. Legitimate IPs, not so much. I wrote a little online tool to call Google to get these counts; the tool is here and the write-up of the code is here. This rule is currently blocking about 40% of spam (I don't have good statistical analysis in place yet, so that's very approximate).
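To make the hit-count rule concrete, here is a minimal sketch in Python. It is not the modbot's actual code: the function name, the threshold of 10 hits, and the stand-in lookup table are all assumptions made up for illustration, and the real search call is deliberately left as a pluggable function rather than guessed at.

```python
from typing import Callable

# Hypothetical cutoff: how many indexed hits make an IP look like a proxy.
# The real modbot's threshold is not stated; 10 is an assumption.
DEFAULT_THRESHOLD = 10


def looks_like_proxy(
    ip: str,
    hit_count: Callable[[str], int],
    threshold: int = DEFAULT_THRESHOLD,
) -> bool:
    """Return True when a search engine has indexed this IP suspiciously often.

    `hit_count` is whatever function asks the search engine how many
    results it has for the IP string; it is passed in so the rule
    itself stays independent of any particular search API.
    """
    return hit_count(ip) >= threshold


# Stand-in for the real lookup, using made-up counts for documentation IPs.
FAKE_COUNTS = {"203.0.113.7": 250, "198.51.100.23": 0}


def fake_hit_count(ip: str) -> int:
    return FAKE_COUNTS.get(ip, 0)


if __name__ == "__main__":
    for ip in FAKE_COUNTS:
        verdict = "block" if looks_like_proxy(ip, fake_hit_count) else "allow"
        print(ip, verdict)
```

Passing the lookup in as a function keeps the rule testable offline, and a modbot could presumably combine this verdict with its other rules rather than blocking on it alone.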

Finally, as a spinoff of this project, I've started a spam archive. There's nothing to present yet, but I hope to start doing some interesting analysis and, more specifically, to build a searchable database of the spam itself, along with a searchable database of spamvertised sites. That ought to overlap with the sites spamvertised by email spam as well, and that's going to be an interesting thing to look at. We'll see.

I've stumbled onto a spam link network of staggering extent in the course of examining forum spam. A spammer has a site somewhere, and then spamvertises it. But then some of the spam starts to link to other forum spam, which in turn links to the site. Some sites auto-forward to other sites using obscured JavaScript (I haven't figured out just why yet; if you have a rationale, I'd be happy to hear it). Anyway, after that goes on for a while, the result is a huge network of vulnerable fora linking to other vulnerable fora. There is a true treasure trove of information available to the interested party. Which would, of course, be me. I will definitely be following up on that and posting on it.
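One way to start poking at a link network like that is to treat it as a directed graph and walk it backwards from a spamvertised site. The sketch below is only an illustration of the idea, assuming the links have already been scraped into a dictionary; the function name and the example page names are invented, and nothing here reflects tooling I actually have in place.

```python
from collections import deque
from typing import Dict, Iterable, Set


def pages_leading_to(target: str, links: Dict[str, Iterable[str]]) -> Set[str]:
    """Return every page that reaches `target` by following one or more links.

    `links` maps a page to the pages it links to (hypothetical scraped data).
    """
    # Build the reverse graph: for each page, who links to it?
    reverse: Dict[str, Set[str]] = {}
    for src, dsts in links.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)

    # Breadth-first walk backwards from the target.
    found: Set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for src in reverse.get(node, ()):
            if src != target and src not in found:
                found.add(src)
                queue.append(src)
    return found


if __name__ == "__main__":
    # Made-up example: forum spam linking to other forum spam, then to the site.
    links = {
        "forumA/post1": ["spamsite.example"],
        "forumB/post9": ["forumA/post1"],
        "forumC/post3": ["forumB/post9"],
        "forumD/post2": ["unrelated.example"],
    }
    print(sorted(pages_leading_to("spamsite.example", links)))
```

Grouping the resulting pages by forum would give a rough list of the vulnerable fora feeding a given site.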

Anyway, it's been nice talking to you. Back to work!

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.