I first got seriously interested in forum despamming when manual deletion of spam from my primitive old Toonbots forum (motto: all expense spared) got to the point where I got ... bored with it. So I said, hey, time to automate this tedium. The following article was the result. The techniques here weren't terribly effective -- I mean, they were far better than doing everything by hand! But they deleted good posts, missed spam, and generally ended up being, well, a first try. The second try was the modbot, and it started to get pretty decent.
So I have a forum where some friends of mine (used to) hang out. It got a lot louder in there after I reworked the site and provided links into and from the archives, because that's when the spammers moved in. So naturally I take this as a technical challenge. I'm going to despam the bastards.
The forum is based on WebBBS (currently at version 5.12, but I run version 3.20 -- hey, it does what I need it to and I don't need to change anything). Posts are stored as individual text files in the discussion directory, each with a numeric filename. I run a periodic script to archive old threads when the post count goes above 256, and the same script does some other fun stuff, too.
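To make that layout concrete, here is a minimal Python sketch (not the actual periodic script) that lists the numerically named post files and reports which ones sit beyond the 256-post mark. It assumes that lower numbers mean older posts, and it works on individual posts rather than whole threads, which is a simplification. The function names and the directory argument are made up for illustration.

```python
import re
from pathlib import Path

ARCHIVE_THRESHOLD = 256  # the post count mentioned in the text

NUMERIC_NAME = re.compile(r"[0-9]+")


def numbered_posts(discussion_dir):
    """Return post files whose names are purely numeric, lowest number first."""
    posts = [p for p in Path(discussion_dir).iterdir()
             if p.is_file() and NUMERIC_NAME.fullmatch(p.name)]
    # Sort by numeric value so that "10" comes after "2".
    return sorted(posts, key=lambda p: int(p.name))


def posts_over_threshold(discussion_dir, threshold=ARCHIVE_THRESHOLD):
    """Return the lowest-numbered posts in excess of the threshold.

    Assumption: lower number means older post. Returns an empty list
    when the directory holds no more than `threshold` posts.
    """
    posts = numbered_posts(discussion_dir)
    excess = len(posts) - threshold
    return posts[:excess] if excess > 0 else []
```

A real archiver would still have to group those posts into threads before moving anything, which this sketch does not attempt.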
Now, my suspicion, based on a few greps several weeks ago, is that the spammers generally act unlike actual posters, because actual posters read the site first or are otherwise known to me. Thus by correlating the hit logs (easy to obtain) with the IP of the poster, I should easily be able to discern spamminess without even needing to look at content. This is a Good Thing, because you generally don't want to rely on a machine to judge the humanness of anything. But if I'm going to automate the rather boring process of despamming the forum, I'm going to have to know more specifics than this, so first I want to do a little analysis, then get down to the business of actually despamming. (That part's easy: just delete the post's file, and the message index, and the post is gone!)
And it occurs to me that this is going to be kind of a fun script to write, and a pretty brief one (at least initially) and so I'm going to document it as I go.
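The correlation idea can be sketched in a few lines of Python. This is only an illustration, not the script the article goes on to develop: it assumes the hit log is in Apache common or combined log format, and that the poster's IP is already known from somewhere. The name `hits_by_ip` is made up for this example.

```python
import re
from collections import defaultdict

# Matches the start of an Apache common/combined log line:
#   ip ident user [timestamp] "METHOD path protocol" ...
# (Assumed format; adjust the pattern for other log layouts.)
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*"')


def hits_by_ip(log_lines):
    """Group (method, path) pairs by client IP, preserving log order."""
    hits = defaultdict(list)
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that don't look like log entries
            ip, method, path = match.groups()
            hits[ip].append((method, path))
    return dict(hits)
```

With the log grouped this way, `hits_by_ip(open_log).get(poster_ip, [])` gives everything a given poster's IP requested, which is the raw material for judging whether that poster behaved like a reader.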
First: Extracting the preliminary results.
Next: A look at preliminary results.
Spammers enter the site and proceed directly to a forum. They do not read content. Some don't even read the forum first, but rather post directly (these have bought the forum's address from somewhere). They do not leave referrer information (otherwise I could find them).
Finally: Using our analysis to despam the forum.
A new surprise: A slight miscalculation, corrected.
A new technique: Counting Google hits to judge proxying IPs.
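The preliminary findings listed above (spammers go straight to the forum, never read content, and sometimes post without reading anything at all) suggest a simple rule of thumb. Here is a hedged Python sketch of that rule, not the final despamming logic: it takes one IP's requests as `(method, path)` pairs, and the `/forum/` prefix and the category labels are assumptions invented for the example.

```python
def classify_visitor(requests, forum_prefix="/forum/"):
    """Label one IP's behaviour from its list of (method, path) requests.

    The forum prefix and the labels are hypothetical. Any GET outside
    the forum counts as "reading content", which is a simplification
    (an image or stylesheet fetch would count too).
    """
    read_content = any(method == "GET" and not path.startswith(forum_prefix)
                       for method, path in requests)
    read_forum = any(method == "GET" and path.startswith(forum_prefix)
                     for method, path in requests)
    posted = any(method == "POST" for method, _path in requests)

    if not posted:
        return "no post"
    if read_content:
        return "reader"        # looked at the site: probably a real poster
    if read_forum:
        return "forum only"    # went straight to the forum: suspicious
    return "blind poster"      # posted without reading anything: very suspicious
```

For example, an IP whose only request is a POST to the forum comes back as `"blind poster"`, while one that fetched a comic page before posting comes back as `"reader"`. Ordering is ignored here, so a more careful version would only count hits made before the post.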
Please note: if you're interested in my help despamming your own forum, drop me a line and we'll talk.