Now here's the thing. Comment spam is what I think of as "standard" webspam -- a comment comes with a name, usually an email address, and some text which is supposed to contribute to a conversation. So when we look at a comment, we can judge whether it's spam by looking at the number of links, the IP address it was posted from, and so on.
But with trackback spam, we have a different beast entirely. The IP is going to belong to a server with a blog on it, whether it's spam or not. The "text" is a reference to a blog post, whether it's spam or not.
All we can really do is examine the page the trackback points to, and make a judgment based on the content of that page. That check will be useful anyway in a number of other spam contexts (links in email, for instance).
Here are some criteria we can apply to a URL to judge spamminess:
- Forwards to another page
This applies to trackbacks, for sure -- maybe not to other categories of link, though. We don't want to reject every comment that links through a tinyurl forwarder, for instance. (But a forwarder wouldn't be appropriate in a trackback.)
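To make the forwarding check a bit more concrete, here is a minimal sketch in Python. It assumes the page has already been fetched without following redirects, so that we have its HTTP status code and its HTML in hand. The names `forwards_elsewhere` and `MetaRefreshFinder` are made up for illustration, and the sketch only catches HTTP redirects and meta-refresh tags, not JavaScript forwarding.

```python
from html.parser import HTMLParser

# HTTP status codes that send the client to another URL.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


class MetaRefreshFinder(HTMLParser):
    """Collects the target URL of any <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=http://example.com/".
        _, _, rest = attr.get("content", "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append(rest[4:].strip("'\" "))


def forwards_elsewhere(status_code, html):
    """True if the response redirects via HTTP status or meta refresh."""
    if status_code in REDIRECT_STATUSES:
        return True
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return bool(finder.targets)
```

For a trackback, a `True` here could count as a strong strike; for a link in a comment it would probably be a much weaker one.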
- Contains malware or obfuscated JavaScript
Malware would definitely be a spam sign. Obfuscated JS, though ... maybe a strike against, without being a definite spam sign (unless the obfuscated script is itself malware of some sort, of course).
- Links to a known spammed site
Data-based filters are hard to keep up to date, but here I'm thinking of gateway pages which front for, say, "pharmacies".
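As a rough sketch of what a data-based check might look like, the Python below pulls the links out of a page and compares each host against a list of known spammed sites. The list itself (`KNOWN_SPAMMED_HOSTS`) and the host in it are placeholders I made up; a real list would have to come from somewhere and be kept current, which is exactly the hard part.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist; the entry is a placeholder, not a real site.
KNOWN_SPAMMED_HOSTS = {"cheap-pills.example"}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_known_spam(html, blocklist=KNOWN_SPAMMED_HOSTS):
    """Return the links whose host is on the blocklist (or a subdomain of it)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    hits = []
    for href in collector.hrefs:
        # Relative links have no hostname and are skipped.
        host = (urlparse(href).hostname or "").lower()
        if any(host == bad or host.endswith("." + bad) for bad in blocklist):
            hits.append(href)
    return hits
```

Matching on subdomains as well as the bare host is a design choice on my part, on the guess that gateway pages tend to move around under one domain.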
- Text analysis yields spam sign
If a page has 100 instances of the word "phentermine", we're pretty sure it's not a page we're interested in.
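The crudest possible version of that text analysis is just counting. Here is a small sketch, assuming we already have the page's visible text as a string. The term list and both thresholds are guesses for illustration only, not tuned values.

```python
import re
from collections import Counter

# Hypothetical term list; the post only names "phentermine".
SPAM_TERMS = {"phentermine"}


def spam_term_counts(text, terms=SPAM_TERMS):
    """Return (Counter of spam-term hits, total word count) for the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in terms)
    return counts, len(words)


def looks_like_keyword_stuffing(text, max_hits=20, max_ratio=0.05):
    """Flag text with many spam-term hits, in absolute or relative terms.

    max_hits and max_ratio are illustrative thresholds, not tuned values.
    """
    counts, total = spam_term_counts(text)
    if total == 0:
        return False
    hits = sum(counts.values())
    return hits >= max_hits or hits / total >= max_ratio
```

The ratio test is there so that a short page stuffed with the word still gets caught, while a long article that mentions it once in passing does not.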
- A link list to suspicious pages
Again -- this is an intuitive thing which a human would see and immediately say, "spam". How to get the machine to recognize it, well, that's where despamming gets fun, you see.
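One way to approximate that human glance, as a sketch: measure how much of the page's text sits inside links. The Python below assumes a page counts as a "link list" when it has a fair number of links and nearly all of its text is anchor text. The thresholds are made up, and the sketch says nothing about whether the link targets are suspicious; that part would need one of the other checks.

```python
from html.parser import HTMLParser


class LinkDensity(HTMLParser):
    """Counts links and how many text characters fall inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.anchor_depth = 0
        self.anchor_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Note: script and style contents are counted as text here too.
        length = len(data.strip())
        self.total_chars += length
        if self.anchor_depth:
            self.anchor_chars += length


def looks_like_link_list(html, min_links=10, min_anchor_share=0.8):
    """True if the page has many links and is mostly anchor text.

    min_links and min_anchor_share are illustrative thresholds.
    """
    parser = LinkDensity()
    parser.feed(html)
    parser.close()
    if parser.link_count < min_links or parser.total_chars == 0:
        return False
    return parser.anchor_chars / parser.total_chars >= min_anchor_share
```

A blogroll or a sitemap would trip this too, so on its own it is a strike rather than a verdict.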
Anyway, that's some food for thought.