Despamming the Toonbots forum, Chapter the First
Extracting the preliminary results

So I'm putting "despammer.pl" into the discussion directory, and the first thing I want to do is just to find the posts:

opendir D, ".";
@files = grep /^\d+$/, readdir(D);
closedir D;

foreach $post (@files) {
   print "$post\n";
}

This simply scans the directory and pulls out the filenames that consist entirely of digits. When I run it, I get a long list of numbers. So far, so good. Next, I want to ignore anything older than a week. This will make sure that I'm not fooled in case I move log files elsewhere, and it'll also cut down a bit on overhead (not that this will actually incur much overhead, but still: waste not, want not). Perl's -M file test gives a file's age in days, so I add one line:

opendir D, ".";
@files = grep /^\d+$/, readdir(D);
closedir D;

foreach $post (@files) {
   next if (-M $post > 7);

   print "$post\n";
}

Now I have a more manageable list of posts. (19, according to "perl despammer.pl | wc".)
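A note on -M: it reports the age in days (fractions included) since the file was last modified, measured from the moment the script started. Here is a minimal sketch of the same filter wrapped up as a function. The name recent_posts, the lexical directory handle, and the extra -f check are my assumptions for illustration; none of them are in despammer.pl itself.

```perl
use strict;
use warnings;

# Return the all-numeric filenames in $dir that were modified
# no more than $max_days days ago, sorted numerically.
# (recent_posts is a hypothetical helper, not part of despammer.pl.)
sub recent_posts {
    my ($dir, $max_days) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @posts = grep {
        /^\d+$/                            # numeric-only names
            && -f "$dir/$_"                # plain files only
            && (-M "$dir/$_") <= $max_days # age in days
    } readdir($dh);
    closedir($dh);
    my @sorted = sort { $a <=> $b } @posts;
    return @sorted;
}

print "$_\n" for recent_posts(".", 7);
```

Sorting numerically is optional, but it makes the list easier to compare from one run to the next.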

OK. For each post, I have a file that looks like this:

SUBJECT>cialis vs viagra -
POSTER>Paydaybess
EMAIL>
DATE>1170330822
EMAILNOTICES>no
IP_ADDRESS>84.19.188.16
PREVIOUS>6286
NEXT>
IMAGE>
LINKNAME>
LINKURL>
<P>viagra vs cialis cialis viagra vs cialis levitra viagra vs cialis drug viagra vs cialis strong viagra vs cialis effects side viag
<P>cialis vs viagra

Yeah, I'm pretty sure that one's spam. So I open the file and extract the line that starts with IP_ADDRESS> (I don't care about the rest of the file), as follows:

opendir D, ".";
@files = grep /^\d+$/, readdir(D);
closedir D;

foreach $post (@files) {
   next if (-M $post > 7);

   open P, $post;
   while (<P>) {
      next unless /^IP_ADDRESS>/;
      chomp;
      s/^IP_ADDRESS>//;
      $ip = $_;
      last;
   }
   close P;

   print "$post - $ip\n";
}

Now I have output more like this:

6869 - 74.131.29.150
6908 - 63.166.111.6
6817 - 137.110.127.65
6917 - 84.19.188.16
6886 - 58.15.127.29
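
If more than one field turns out to be useful, the header can be read into a hash in one pass. This is only a sketch: it assumes that every header line has the form FIELD>value with an uppercase field name, and that the first line not matching that pattern starts the post body. The name parse_post_header is made up for illustration.

```perl
use strict;
use warnings;

# Parse "FIELD>value" header lines into a hash, stopping at the
# first line that does not look like a header (the post body).
# (parse_post_header is a hypothetical helper.)
sub parse_post_header {
    my (@lines) = @_;
    my %field;
    for my $line (@lines) {
        chomp(my $copy = $line);
        last unless $copy =~ /^([A-Z_]+)>(.*)$/;
        $field{$1} = $2;
    }
    return %field;
}

# A few lines taken from the sample post shown earlier.
my %post = parse_post_header(
    "SUBJECT>cialis vs viagra -\n",
    "POSTER>Paydaybess\n",
    "IP_ADDRESS>84.19.188.16\n",
    "<P>cialis vs viagra\n",
);
print "$post{IP_ADDRESS} - $post{SUBJECT}\n";
```

Stopping at the first non-header line also means that text in the body can never be mistaken for a header field.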

Fortunately, I already have scripts running hourly to pull interesting things out of the log for me, so to look at these IPs I can just grep those files. If a post is too recent for the hourly scripts to have caught it, I'll look at the current raw access log instead. All of those are in my log directory: the current log is access.log and the older ones are all in meat.*. I'm going to use a naive grep to get the information out, because I don't really care much about scalability. If I did, I'd probably build this into the forum posting script itself and scan for each IP as it came in. But this approach lets me try out techniques before doing that, without risking disabling the forum while I tinker in its guts. (Again: not that disabling this forum would kill me, but these habits all stem from cold-sweat-inducing incidents in my checkered past.)
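
One caveat about naive matching: a plain pattern like 84.19.188.16 also matches 84.19.188.160, and the dots match any character. If that ever matters, a stricter test is easy to write in Perl. The sketch below assumes that each log line starts with the client IP followed by whitespace; the function name line_is_from and the sample lines are invented for illustration.

```perl
use strict;
use warnings;

# True if the first whitespace-delimited field of $line is exactly $ip.
# \Q...\E quotes the dots so they match literally.
# (line_is_from is a hypothetical helper.)
sub line_is_from {
    my ($line, $ip) = @_;
    return $line =~ /^\Q$ip\E\s/ ? 1 : 0;
}

# Made-up sample lines, not real log data.
my @sample = (
    "84.19.188.16 - - first request",
    "84.19.188.160 - - different host",
    "10.0.0.1 - - referer mentions 84.19.188.16",
);
for my $line (@sample) {
    print "$line\n" if line_is_from($line, "84.19.188.16");
}
```

Only the first sample line is printed; the other two are the kind of false positives a naive search would let through.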

Let's grep. And let's pull the subject of the post out, too, so we can keep track of what's what.

$logdir = "/usr/local/aolserver/servers/vivtek/modules/nslog";

opendir D, ".";
@files = grep /^\d+$/, readdir(D);
closedir D;

sub get_field {
   my $in = shift;
   chomp($in);
   $in =~ s/^[^>]*>//;   # strip the field name, up to the first ">" only
   return $in;
}

sub extract_info {
   chomp;
   $line = $_;
   ($first, $dt, $last) = split / *[\[\]] */, $line;
   $dt =~ s/ -\d*$//;
   ($ip, $j1, $j2) = split / +/, $first;
   ($j1, $req, $respbw, $ref, $j2, $agent, $j3) = split / *" */, $last;
   ($meth, $url, $protocol) = split / /, $req;
   ($resp, $bandwidth) = split / /, $respbw;

   $file = $url;
   $file =~ s/.*\///;

   # Mark requests for images, scripts, and icons as chaff.
   # The dot is escaped so that it only matches a real file extension.
   $chaff = 0;
   $chaff = 1 if $file =~ /\.(?:gif|jpg|js|png|ico)\b/i;
}


foreach $post (@files) {
   next if (-M $post > 7);

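   ($ip, $subj) = ("", "");   # don't carry values over from the previous post
   open P, $post or next;
   while (<P>) {
      $ip = get_field ($_) if /^IP_ADDRESS>/;
      $subj = get_field ($_) if /^SUBJECT>/;
   }
   close P;
   next unless $ip;           # nothing to grep for

   # -F makes grep treat the IP as a literal string rather than a regex.
   if (-M $post > 0.1) {      # 0.1 days is about 2.4 hours
      open G, "grep -hF $ip $logdir/meat.* |";
   } else {
      open G, "grep -hF $ip $logdir/access.log |";
   }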
   print "------------------------------------------------------------------------------------\n";
   print "$post - $ip - $subj\n";
   print "------------------------------------------------------------------------------------\n";
   while (<G>) {
      extract_info();
      print "$dt - $url - $ref\n";
   }
   close G;
}
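
For comparison, the chain of splits in extract_info can be collapsed into a single regular expression. This is a sketch under an assumption: that the log lines follow the usual combined layout (IP, two ignored fields, bracketed timestamp, quoted request, status, size, quoted referer, quoted agent), which is what the splits above suggest. The name parse_log_line and the sample line are invented; the sample is not taken from a real log.

```perl
use strict;
use warnings;

# Parse one combined-format log line into a hash reference.
# Returns nothing if the line does not match the assumed layout.
# (parse_log_line is a hypothetical helper.)
sub parse_log_line {
    my ($line) = @_;
    my ($ip, $when, $method, $url, $status, $bytes, $referer, $agent) =
        $line =~ m/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"/
        or return;
    return {
        ip => $ip, when => $when, method => $method, url => $url,
        status => $status, bytes => $bytes, referer => $referer, agent => $agent,
    };
}

# Made-up sample line for illustration only.
my $sample = q{84.19.188.16 - - [01/Feb/2007:12:00:00 -0500] "POST /forum/post HTTP/1.1" 200 512 "http://example.com/" "Mozilla/4.0"};
my $hit = parse_log_line($sample);
printf "%s - %s - %s\n", @{$hit}{qw(when url referer)} if $hit;
```

Returning a hash reference instead of setting globals makes it harder for a value from one line to leak into the next, which is the same concern as resetting $ip between posts.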

Next: A look at preliminary results.






This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.