Now that I've been collecting spam from actual fora for a little while, I have some initial statistics and musings.
I've collected spam from one eBoard 4.0 forum since May 5; it is now May 13. The spam filters I'm using are blocking about 93% of the postings, making the moderation burden manageable for that forum. In those 8 days I have collected 1,235 spam samples. That's about 150 spams a day, from a fairly obscure forum; in retrospect, even though the forum's actual traffic seems low, this is a lot of spam.
Those 1,235 spam samples contain a total of 10,795 links. I haven't yet built analysis machinery to get much farther than that; I've mostly been just looking at the links, retrieving the pages, and musing about how all that might be automated in an interesting and useful way.
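Pulling the links out of the samples is the easy part. A minimal sketch of how that extraction might look (the regex and the function name here are my own, not part of any existing tooling):

```python
import re

# Crude pattern for http(s) URLs embedded in forum-post bodies.
URL_RE = re.compile(r"https?://[^\s\"'<>\)]+", re.IGNORECASE)

def extract_links(post_body):
    """Return every URL-looking string found in one spam posting."""
    return URL_RE.findall(post_body)

# Two toy postings stand in for the real spam corpus.
posts = [
    "Buy cheap meds http://example-pills.test/offer now!",
    "see http://a.test and http://b.test/page",
]
links = [url for post in posts for url in extract_links(post)]
# links now holds three URLs across the two sample postings.
```

A real extractor would also have to cope with BBCode, HTML anchors, and deliberately obfuscated URLs, but a pattern like this is enough to get counts.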
Some of the spam links point to actual sites being advertised. I don't yet have a feel for how many links point to sites other than those actually advertised, but there are some interesting commonalities. For instance, a lot of pages have been placed onto vulnerable fora and other venues which simply link to other pages. In some cases, it's easy to tell why: it's Google spamming, and also a simple way to counter attempts to block posts which link to particular URLs.
I have a separate notion to find and track those vulnerable sites, and to attempt to mine them for further information on these spam networks.
One spam has a huge number of links to different domains, all of which resolve to the same IP. That's an interesting feature. I'm not sure how to track it yet. What I really want to do is some kind of generic analysis framework, but I don't have a good picture of what that framework would look like, or indeed precisely what it is that I expect it to do.
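Detecting that "many domains, one IP" feature is mostly a matter of resolving each domain and grouping. A sketch under the assumption that plain forward DNS lookups are good enough (the function name is hypothetical):

```python
import socket
from collections import defaultdict

def group_by_ip(domains):
    """Map each resolved IP address to the list of domains pointing at it."""
    groups = defaultdict(list)
    for domain in domains:
        try:
            ip = socket.gethostbyname(domain)
        except socket.gaierror:
            ip = None  # unresolvable; bucket these separately
        groups[ip].append(domain)
    return dict(groups)

# Any bucket holding more than one domain is a candidate spam network.
```

In practice the lookups would need caching and rate limiting, and spammers rotate IPs, so the grouping would have to be tracked over time rather than taken as a one-shot snapshot.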
It seems that what I want to do is to build a kind of task list for an incoming event. That task list would consist of a certain (small) number of analysis steps which themselves generate new analysis events. Each step is a test. The results of the tests are cached, so that all possible duplicated effort is avoided, but also so that relationships such as "these spam efforts share an IP" can be found.
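One way that task-list idea might be sketched, with every name here hypothetical: each test takes an input, returns a result plus any follow-up events, and a cache keyed on (test, input) both avoids duplicated work and doubles as the place where shared features (like a common IP) become visible.

```python
class Analyzer:
    """Run analysis tests on incoming events, caching each (test, input)
    result so no test is ever run twice on the same value."""

    def __init__(self):
        self.cache = {}   # (test_name, value) -> result
        self.queue = []   # pending (test_name, value) events

    def submit(self, test_name, value):
        self.queue.append((test_name, value))

    def run(self, tests):
        while self.queue:
            name, value = self.queue.pop(0)
            key = (name, value)
            if key in self.cache:
                continue  # duplicated effort avoided
            result, follow_ups = tests[name](value)
            self.cache[key] = result
            for event in follow_ups:  # tests may generate new events
                self.submit(*event)
        return self.cache

# Toy tests: extracting a domain spawns a "resolve" event for it.
def extract_domain(url):
    domain = url.split("/")[2]
    return domain, [("resolve", domain)]

def resolve(domain):
    return "192.0.2.1", []  # stubbed lookup; no follow-ups

a = Analyzer()
a.submit("extract_domain", "http://a.test/x")
a.submit("extract_domain", "http://a.test/y")
cache = a.run({"extract_domain": extract_domain, "resolve": resolve})
# Two URLs on the same domain produce only one "resolve" entry.
```

The cache here is just an in-memory dict; persisting it is what would let "these spam efforts share an IP" relationships accumulate across runs.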
There's a certain exponential explosion involved, it seems at times. But there are also patterns which could cut down on the amount of work done. Of those 10,795 links I have so far (oops, in the time it's taken to write this much, two more spams have arrived, so I now have 10,886 links to analyze) -- of those 10,886 links, many of them are hosted at
Well, anyway, this is just a little talking out loud while I muse about how to automate all this analysis. Eventually I'll get down to posting graphs of some sort. That will be fun. The other thing, of course, is some way to ask about a URL, "Is this URL a spam indicator?" I hope it will also cross-fertilize with Despammed.com. Wish me luck.
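A first cut at the "is this URL a spam indicator?" query could be nothing fancier than checking the URL's host against the domains already harvested from the corpus. A sketch, where `known_spam_domains` is a stand-in for whatever the collection actually produces:

```python
from urllib.parse import urlparse

# Stand-in set; in reality this would come from the spam corpus.
known_spam_domains = {"example-pills.test", "a.test"}

def is_spam_indicator(url):
    """True if the URL's host has appeared in previously collected spam."""
    host = urlparse(url).hostname
    return host in known_spam_domains
```

That ignores subdomain games and redirector pages, both of which the collected samples already exhibit, so it would only be a starting point.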