A note on the Despammed archives: when spam comes into despammed.com or vivtek.com, its headers and origin get checked by custom code I've been running since 1999. If the recipient is an unknown despammed.com address, I chuck it unread; if the recipient is an unknown vivtek.com address, I cache it. And if the IP resolves to anything looking like a DSL line (for instance, with numbers in the rDNS name that match the IP address), I cache it too. So most botnet spam never gets to me.
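A minimal sketch of the "looks like a DSL line" check, in Python for illustration. The function name and the threshold of three matching octets are assumptions, not the filter's actual rule; the idea is only that dynamic-line rDNS names tend to embed the IP's own numbers.

```python
import ipaddress
import re
from collections import Counter


def looks_like_dynamic_line(ip: str, rdns_name: str, min_octets: int = 3) -> bool:
    """Heuristic: True if the rDNS name embeds at least `min_octets` of the IP's octets.

    The threshold is an assumed value for illustration.
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    # Collect every run of digits in the host name, dropping leading zeros
    # so that "007" in the name matches octet "7".
    available = Counter(str(int(n)) for n in re.findall(r"\d+", rdns_name))
    matches = 0
    for octet in octets:
        if available[octet] > 0:
            available[octet] -= 1  # each number in the name can match only once
            matches += 1
    return matches >= min_octets
```

With this sketch, `looks_like_dynamic_line("203.0.113.45", "dsl-203-0-113-45.example.net")` is true, while an ordinary name like `mail.example.com` is not flagged.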
Cached email is saved in a header file and a body file on my server's hard drive, and a record is written to a database containing its ID, subject, recipient, envelope and header From addresses, originating IP (if forwarded from a known server, I find the *real* originating IP), and why the filters didn't like it. So I can write SQL queries to find out a lot about my stored spam. The old server crash lost me most of my historical archives, but I have complete data going back to last year, and the active database's window of three or so months usually contains about a million spam entries.
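To give a feel for the kind of query this allows, here is a small sketch using an in-memory SQLite database as a stand-in. The table name, column names, and sample rows are all hypothetical; the real schema and database engine aren't described here.

```python
import sqlite3

# Hypothetical schema mirroring the fields listed above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE spam (
        id            TEXT PRIMARY KEY,
        subject       TEXT,
        recipient     TEXT,
        envelope_from TEXT,
        header_from   TEXT,
        origin_ip     TEXT,
        reason        TEXT   -- why the filters didn't like it
    )
""")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO spam VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("m1", "McCain picks Osama bin Laden as VP", "abc@vivtek.com",
         "x@example.org", "x@example.org", "203.0.113.5", "unknown recipient"),
        ("m2", "McCain picks Osama bin Laden as VP", "abd@vivtek.com",
         "y@example.org", "y@example.org", "203.0.113.9", "dsl rdns"),
        ("m3", "Cheap watches", "user@despammed.com",
         "z@example.org", "z@example.org", "198.51.100.7", "dsl rdns"),
    ],
)


def subjects_by_volume(conn, limit=10):
    """Return (subject, message count, distinct sending IPs), busiest first."""
    return conn.execute(
        """
        SELECT subject, COUNT(*) AS n, COUNT(DISTINCT origin_ip) AS ips
        FROM spam
        GROUP BY subject
        ORDER BY n DESC, subject
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```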
So data, I got.
When I first started looking for botnet-specific spam (on August 1), it could be recognized by a subject using a provocative (but usually false) news headline. Inside the body of the spam, then, was a second provocative headline, plus a link to a landing page. I'll leave the discussion of landing page analysis for a different time; this page documents the detection and tracking of email only.
At any rate, that week the botnet operators were drawing on the same pool of headlines for both the email subject line and the body title, but any one message used a different headline for each. So my first step was to look for email with subjects I had decided were botnet subjects (like "McCain picks Osama bin Laden as VP" -- I loved this stuff), read out the secondary headline from the body, and then find more matching spam with that headline as its subject.
Then for each headline, I'd open up the header, get the content-type, and then read the body of the mail to see if it met my criteria. The original version of that scanner was really hacky, so it didn't do HTML well. I've since switched to HTML::TreeBuilder, which is frankly the most brilliant module on CPAN.
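The scan step might look roughly like the following sketch, written in Python with only the standard library rather than HTML::TreeBuilder. It assumes a single-part body that is already decoded, and it assumes the secondary headline is the first non-blank line of text; both are simplifications, and `scan_message` is a hypothetical name.

```python
import re
from email.parser import HeaderParser
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible text chunks and link targets from an HTML body."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def scan_message(header_text, body_text):
    """Return (content type, assumed headline, links) for one cached message."""
    headers = HeaderParser().parsestr(header_text)
    content_type = headers.get_content_type()  # defaults to text/plain
    if content_type == "text/html":
        parser = TextAndLinks()
        parser.feed(body_text)
        parser.close()
        lines, links = parser.text, parser.links
    else:
        lines = [ln.strip() for ln in body_text.splitlines() if ln.strip()]
        links = re.findall(r"https?://\S+", body_text)
    # Assumption: the secondary headline is the first non-blank text line.
    headline = lines[0] if lines else None
    return content_type, headline, links
```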
At any rate, once my pool started to grow, I noticed two things: first, the IP pool injecting this spam was relatively limited (to "only" a few thousand), and second, the unknown vivtek.com addresses I was seeing looked kind of similar. At some point in the past, after I put vivtek.com on the filters and started accepting unknown mail instead of bouncing it, the operators of the botnet obviously did some dictionary-search exploration of the domain. Since all the addresses were found to be valid, they're spamming them still. There were obviously a few different dictionary searches, too: random strings, common names, and single letters appended to existing known addresses. They all succeed, so I get a lot of botnet spam.
But only the botnet uses them. So once I noticed that, I started looking at those addresses. I developed a "purity" test, calculating the quantity of botnet-propagation spam as a percentage of the total spam sent to an address. For existing despammed.com addresses, this purity was low -- 10% or 20% (already pretty amazing, really -- the botnet is really pounding me). But for unknown vivtek.com addresses, it reached up to 70% or more.
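The purity calculation can be sketched as below. The `is_botnet` callback stands in for whatever test decides that a subject is a botnet subject; it and the function name are assumptions for illustration.

```python
from collections import Counter


def purity_by_recipient(messages, is_botnet):
    """Percentage of each recipient's spam that is botnet-propagation spam.

    `messages` is an iterable of (recipient, subject) pairs; `is_botnet`
    is a hypothetical predicate on the subject line.
    """
    total = Counter()
    botnet = Counter()
    for recipient, subject in messages:
        total[recipient] += 1
        if is_botnet(subject):
            botnet[recipient] += 1
    return {r: 100.0 * botnet[r] / total[r] for r in total}
```

Sorting the result by value, highest first, gives the list of "highly pure" addresses worth examining.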
So my second line of discovery was to look at highly pure recipient addresses and test their subjects for botnet-ness. As my indexing for the Despammed.com archive was poor, these queries took a while, so I ran them at night for a few days. (Then I realized I had been stupid about the indexing and defined some better indexes; the whole process roared along, and I started running it hourly throughout the day.)
When the CNN Top 10 subjects came along, of course, discovery of new subjects was suddenly rendered moot, so I stopped looking that week. And by the time the msnbc.com BREAKING NEWS subjects came along, I'd already gotten comfortable with the idea of scanning just the IP addresses for the known botnet, and discovering new subjects that way -- it's a much purer way to go anyway.
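Discovering subjects from the known IP pool might be sketched like this. The `min_ips` cutoff (requiring a new subject to arrive from more than one known botnet IP before it counts) is an assumption added to cut down on noise, not something described above.

```python
from collections import defaultdict


def new_subjects_from_known_ips(messages, known_ips, known_subjects, min_ips=2):
    """Find unfamiliar subjects sent from IPs already attributed to the botnet.

    `messages` is an iterable of (subject, origin_ip) pairs.
    `min_ips` is an assumed threshold for illustration.
    """
    senders = defaultdict(set)
    for subject, ip in messages:
        if ip in known_ips and subject not in known_subjects:
            senders[subject].add(ip)
    return sorted(s for s, ips in senders.items() if len(ips) >= min_ips)
```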
But in the meantime, I had discovered what may be two more botnets. Whereas BN1 had spanned some 15,000 IPs, BN2 (with subject line "Internet Explorer 7") came from about 300 IPs, and BN3 (with subject line "BBC NEWS" and others) also used about 300 IPs. So far, I have found a single IP which appears to belong to both BN1 and BN3. This could mean that the botnets are really different segments of the same botnet, or it could conceivably mean that one PC has been zombied twice. Further analysis may shed more light on that, but for now it's an open question.
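Checking for shared IPs between the groups is a simple set intersection. A sketch, with made-up IPs and a hypothetical function name:

```python
from itertools import combinations


def shared_ips(botnets):
    """Map each pair of botnet labels to the IPs seen in both, if any.

    `botnets` maps a label such as "BN1" to a set of IP strings.
    """
    overlaps = {}
    for (a, ips_a), (b, ips_b) in combinations(sorted(botnets.items()), 2):
        common = ips_a & ips_b
        if common:
            overlaps[(a, b)] = common
    return overlaps
```

A small overlap like the single IP mentioned above doesn't settle the question either way, but tracking how the overlap changes over time might.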
Currently, then, I'm grouping spam simply by its subject. Since these are using very distinctive subjects, that works -- but it may stop working at some point. And indeed, this week BN1 has been sending its standard spam using the fake headlines from the middle of July, so just looking at subjects and assuming that will always work is really naive, even though it's been successful and fun so far.
Really, though, it would be much better to group spam by "modus operandi". This would take the spam, match it against known subjects, then analyze the body, match that against known patterns, then find the link it's spamming, and match that content against a known pattern.
The result would be a fingerprint of sorts, something like this:
- Subject "Kick-up - % - video"
- Body pattern xxxx (where xxxx defines a pattern, below)
- Link yyyy.com/xxx.php (where yyyy might index a list of hijacked servers)
- Content pattern xxxx (where xxxx defines a content pattern)
- Ultimate payload xxxx (where xxxx identifies the malware)
OK? So that overall pattern, which should approximate what goes through my mind when looking at a given instance, would characterize a group of spams sent out over a given range of time. Then new spam could be characterized in the same way.
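One way such a fingerprint could be represented is sketched below. The class, its field names, and the choice of regular expressions for the body and link patterns are all assumptions; the `%` in the subject is treated as a wildcard, as in the "Kick-up - % - video" example. The content pattern and payload fields are carried along but not checked, since matching them would mean fetching the landing page.

```python
import re
from dataclasses import dataclass
from typing import Optional


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a subject like 'Kick-up - % - video' into an anchored regex."""
    parts = [re.escape(p) for p in pattern.split("%")]
    return re.compile("^" + ".+".join(parts) + "$")


@dataclass(frozen=True)
class Fingerprint:
    name: str
    subject: str                           # '%' marks a variable part
    body_pattern: str                      # regex, hypothetical format
    link_pattern: str                      # regex, hypothetical format
    content_pattern: Optional[str] = None  # landing page; not checked here
    payload: Optional[str] = None          # malware identifier; not checked here

    def matches(self, subject: str, body: str, link: str) -> bool:
        """True if subject, body, and link all fit this fingerprint."""
        return (
            wildcard_to_regex(self.subject).match(subject) is not None
            and re.search(self.body_pattern, body) is not None
            and re.search(self.link_pattern, link) is not None
        )
```

New spam would then be characterized by trying it against each known fingerprint in turn and recording which one, if any, it matches.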
So what defines a "body pattern"? That's a good question, and I'm glad you asked it. I wish you'd answer it for me.
I'd like to do some kind of template discovery. Given a set of spam, I'd like to be able to determine shared features, then abstract the shared features away to leave variables embedded in the template. This is clearly how they're generated in the first place, so we ought to be able to reverse-engineer the process to a certain extent.
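As a very rough sketch of what template discovery could look like, the following aligns messages word by word with Python's `difflib.SequenceMatcher`, keeps the words every message shares, and collapses whatever differs into a `%` variable. This is one possible technique, not a claim about how the spam is actually generated, and the sample subjects in the usage note are invented.

```python
from difflib import SequenceMatcher

VAR = "%"  # placeholder for a variable slot in the template


def merge(template, tokens):
    """Keep tokens common to both sequences; replace differing runs with VAR."""
    matcher = SequenceMatcher(a=template, b=tokens, autojunk=False)
    merged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(template[i1:i2])
        elif not merged or merged[-1] != VAR:
            merged.append(VAR)  # one VAR per differing run
    return merged


def discover_template(messages):
    """Fold a non-empty list of messages into one whitespace-tokenized template."""
    token_lists = [m.split() for m in messages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = merge(template, tokens)
    return " ".join(template)
```

Fed three subjects that differ only in the middle word (say "Kick-up - Olympics - video", "Kick-up - Election - video", and "Kick-up - Hurricane - video"), this yields `Kick-up - % - video`. Whole message bodies would need more care, since word-level alignment gets noisy when the variable parts are long.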
I could start by hand-coding the templates, then perhaps start to automate the process later. So this may or may not be where the project goes next. It really depends on how much time I have, and how driven I become. Watch This Space for Further Details.