Despamming the Toonbots forum, Chapter the Fourth
Day one dawns with a new surprise

Earlier chapters: Extracting preliminary information, Analysis of preliminary information, and Using our knowledge to despam the forum. And so I put this script into the crontab, and lo! on the first morning, there was a surprise. A spam had survived.

????: 6817 - Re: Eating the piece of fruit
????: 6854 - Re: Eating the piece of fruit
????: 6856 - Re: Eating the piece of fruit
????: 6888 - Re: Eating the piece of fruit
????: 6979 - Don't feed the trolls
????: 6980 - Fredy

Well. I'd sacrificed my display ability when putting in the despamming code itself. So I put it back, but with a little better control. If I pass a post number to the script, I want to see the log dump for that post, but the script should take no action and ignore all other posts but that one.

No sooner said than done, most of the changes being in the top loop (you can see the full current script at the bottom of this page.) Now I display post 6980: "perl despammer.pl 6980":

------------------------------------------------------------------
 6980 - 222.76.215.173 - Fredy
------------------------------------------------------------------
19/Jan/2007:12:55:23 - /make-contact - POST - http://www.vivtek.com/webmaster.html
19/Jan/2007:14:14:44 - /make-contact - POST - http://www.vivtek.com/webmaster.html
23/Jan/2007:03:19:31 - /make-contact - POST - http://www.vivtek.com/webmaster.html
27/Jan/2007:03:10:18 - /make-contact - POST - http://www.vivtek.com/webmaster.html
01/Feb/2007:05:10:09 - /make-contact - POST - http://www.vivtek.com/answers.html
01/Feb/2007:05:10:37 - /make-contact - POST - http://www.vivtek.com/webmaster.html
04/Feb/2007:03:11:12 - /toonbots/discuss.pl?post - POST - http://www.vivtek.com/toonbots/discuss.pl
Spammy? maybe

So this guy has been around for a while, "visited" non-forum pages, but never actually done a GET. Just POST. If that ain't filterable, I don't know what is. The "make-contact" URL is what I use for Web-based messages from my site to my own email. Now I know who's been injecting spam into it. (Bastards.) I figure their strategy is that postable fields are nearly always fora, so posting into a non-forum field is just acceptable collateral damage. They have no costs, so they don't care.

Anyway, the filtration is easy. Here's the modified script:

$logdir = "/usr/local/aolserver/servers/vivtek/modules/nslog";
$discussion = "/home/michael/vivtek/pages/toonbots/discussion";

chdir $discussion;

$p = $ARGV[0];

opendir D, ".";
@files = grep /^\d+$/, readdir(D);
closedir D;

sub extract_post_info;
sub extract_log_info;
sub apply_rules;

$spam_count = 0;
foreach $post (sort @files) {
   next if (-M $post > 7);
   if ($p) {
      next unless $p == $post;
   }
   extract_post_info();
   if ($p) {
      print "------------------------------------------------------------------\n";
      print " $post - $ip - $subj\n";
      print "------------------------------------------------------------------\n";
   }
   extract_log_info();
   $spammy = apply_rules();
   if ($p) {
      print "Spammy? $spammy\n";
      last;
   } else {
      if ($spammy eq 'yes') {
         print "SPAM: $post - $subj\n"; 
         system "mv $post spam";
         $spam_count += 1;
      } elsif ($spammy eq 'maybe') {
         print "????: $post - $subj\n";
      }
   }
}

if ($spam_count) { system "rm messages.idx"; }

# ------- Applying rules ----------------------------
sub apply_rules {
   if ($hits == 0) { return 'no'; }
   if ($hits == $post_posts) { return 'yes'; }
   if ($content_hits == 0) { return 'yes'; }
   if ($initial_content_hits == $content_hits && $content_hits < 3) { return 'yes'; }
   return 'maybe';
}

# ------- Extracting post information ---------------
sub get_field;
sub extract_post_info {
   open P, $post;
   while (<P>) {
      $ip = get_field ($_) if /^IP_ADDRESS>/;
      $subj = get_field ($_) if /^SUBJECT>/;
   }
   close P;
}
sub get_field {
   my $in = shift;
   chomp($in);
   $in =~ s/^.*>//;
   return $in;
}

# ------- Extracting log information ---------------
sub extract_logline_info;
sub extract_log_info {
   if (-M $post > 0.1) {
      open G, "grep -h $ip $logdir/meat.* |";
   } else {
      open G, "grep -h $ip $logdir/access.log |";
   }
   $hits = 0;
   $content_hits = 0;
   $initial_content_hits = 0;
   $post_posts = 0;
   while (<G>) {
      extract_logline_info();
      $hits += 1;
      $content_hits += 1 unless $forum_post;
      $initial_content_hits += 1 unless $forum_post or $hits != $content_hits;
      $post_posts += 1 if $meth eq 'POST';
   }
   close G;
}

# ------- Extracting log line information ---------
sub extract_logline_info {
   chomp;
   $line = $_;
   ($first, $dt, $last) = split / *[\[\]] */, $line;
   $dt =~ s/ -\d*$//;
   ($ip, $j1, $j2) = split / +/, $first;
   ($j1, $req, $respbw, $ref, $j2, $agent, $j3) = split / *" */, $last;
   ($meth, $url, $protocol) = split / /, $req;
   ($resp, $bandwidth) = split / /, $respbw;

   $file = $url;
   $file =~ s/.*\///;

   $forum_post = 0;
   $forum_post = 1 if $file =~ /^discuss.pl/;
   $forum_post = 1 if $file =~ /^board_/;

   if ($p) { print "$dt - $url - $meth - $ref\n"; }
}

Still a pretty easy script to grok, and it's doing a bangup job keeping the forum usable. Here's the result:

????: 6817 - Re: Eating the piece of fruit
????: 6854 - Re: Eating the piece of fruit
????: 6856 - Re: Eating the piece of fruit
????: 6888 - Re: Eating the piece of fruit
????: 6979 - Don't feed the trolls
SPAM: 6980 - Fredy

I hope you've enjoyed today's lesson.






Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.