Some thoughts on trackback spam
Since January 2008, I've been working with Movable Type on a particular blog, installed on the same machine serving you this very page. As the tech guy and server owner, it's been up to me to figure out how to moderate comments (of course) and trackbacks.

Now here's the thing. Comments are what I think of as "standard Webspam" -- they come with a name, usually an email address, and a text that is meant to contribute to a conversation. So when we look at a comment, we can judge whether it's spam by looking at the number of links, the IP it's posted from, and so on.
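To make the comment case concrete, here is a minimal sketch in Python of those two checks. The function name, the link threshold, and the IP blocklist are all made up for illustration; this is not how Movable Type does it.

```python
import re

# Counts anything that looks like the start of an http(s) URL.
LINK_RE = re.compile(r"https?://", re.IGNORECASE)

def comment_strikes(text, ip, bad_ips=frozenset(), max_links=2):
    """Return the reasons a comment looks spammy (empty list means none).

    max_links and bad_ips are assumed tuning knobs, not real settings.
    """
    strikes = []
    link_count = len(LINK_RE.findall(text))
    if link_count > max_links:
        strikes.append(f"{link_count} links")
    if ip in bad_ips:
        strikes.append("blocklisted IP")
    return strikes
```

A comment with no strikes isn't necessarily clean, of course; the point is only that a comment gives us a few cheap signals to start from.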

But with trackback spam, we have a different beast entirely. The IP is going to be a server with a blog on it, whether it's spam or not. The "text" is a reference to a blog post, whether it's spam or not.

All we can really do is examine the page it's pointing to and make a judgment based on the content of that page. That kind of analysis is useful anyway in a number of other spam contexts (links from email, for instance).

Here are some criteria we can apply to a URL to judge spamminess:

  • Forwards to another page
    This applies to trackbacks, for sure -- maybe not for other categories of link, though. We don't want to exclude any comment referring to a tinyurl forwarder, for instance. (But that wouldn't be appropriate in a trackback.)
  • Contains malware or obfuscated JavaScript
    Malware would definitely be a spam sign. Obfuscated JS, though ... maybe a strike against, without being a definite spam sign (unless the obfuscated script is itself malware of some sort, of course).
  • Links to a known spammed site
    Data-based filters are hard to keep up with, but here, I'm thinking of gateway pages which front for, say, "pharmacies".
  • Text analysis yields spam sign
    If a page has 100 instances of the word "phentermine", we're pretty sure it's not a page we're interested in.
  • A link list to suspicious pages
    Again -- this is an intuitive thing which a human would see and immediately say, "spam". How to get the machine to recognize it, well, that's where despamming gets fun, you see.
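Some of the criteria above can be roughed out in code. The following is only a sketch in Python, working on HTML that has already been fetched. The word list, the host blocklist, and every threshold are assumptions I've invented for the example. It treats a meta refresh as the "forwards to another page" sign (an HTTP redirect would have to be caught when fetching), and it makes no attempt at the malware or obfuscated-script check.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

SPAM_WORDS = {"phentermine"}          # hypothetical word list
KNOWN_SPAM_HOSTS = {"pills.example"}  # hypothetical blocklist

class PageScan(HTMLParser):
    """Collects link targets, text, and any meta refresh from a page."""
    def __init__(self):
        super().__init__()
        self.links, self.text, self.meta_refresh = [], [], False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.meta_refresh = True

    def handle_data(self, data):
        # Note: this also picks up inline script text; fine for a sketch.
        self.text.append(data)

def host_of(url):
    """Hostname of a URL, or None for relative or malformed links."""
    try:
        return urlparse(url).hostname
    except ValueError:
        return None

def spam_strikes(html, word_limit=10, link_limit=50):
    """Return the list of criteria the page trips (thresholds are guesses)."""
    scan = PageScan()
    scan.feed(html)
    scan.close()
    strikes = []
    if scan.meta_refresh:
        strikes.append("forwards to another page")
    if {host_of(h) for h in scan.links} & KNOWN_SPAM_HOSTS:
        strikes.append("links to a known spammed site")
    words = re.findall(r"[a-z]+", " ".join(scan.text).lower())
    if sum(1 for w in words if w in SPAM_WORDS) >= word_limit:
        strikes.append("spam word repeated")
    # Many links with hardly any text around them looks like a link list.
    if len(scan.links) >= link_limit and len(words) < 5 * len(scan.links):
        strikes.append("link list")
    return strikes
```

Returning a list of strikes rather than a yes/no answer keeps the "strike against, but not definite" idea intact: how many strikes add up to spam is a separate decision. The link-list test here only counts links against text, which is a long way from what a human sees at a glance.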

Anyway, that's some food for thought.






Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.