Despamming the Toonbots forum, Chapter the Third
Using our newfound knowledge to fight forum spam

Earlier chapters: Extracting preliminary information (we're going to extend that script with our rules and deletion) and Analysis of preliminary information. Now we'll actually do the despamming.

First, we want to skip anything with no recorded access hits. We also know that any single hit which is just a POST is spam. These rules are very simple to implement, so let's rearrange the code a little to make it easier to work with (it's starting to grow) and do another test run. Here's the code:

$logdir = "/usr/local/aolserver/servers/vivtek/modules/nslog";

opendir D, ".";
@files = grep /^\d+$/, readdir(D);
closedir D;

sub extract_post_info;
sub extract_log_info;
sub apply_rules;

foreach $post (sort @files) {
   next if (-M $post > 7);
   extract_post_info();
   extract_log_info();
   $spammy = apply_rules();
   if ($spammy eq 'yes') {
      print "SPAM: $post - $subj\n"; 
   } elsif ($spammy eq 'maybe') {
      print "????: $post - $subj\n";
   }
}

# ------- Applying rules ----------------------------
sub apply_rules {
   if ($hits == 0) { return 'no'; }
   if ($hits == 1 && $meth eq 'POST') { return 'yes'; }
   return 'maybe';
}

# ------- Extracting post information ---------------
sub get_field;
sub extract_post_info {
   open P, $post;
   while (<P>) {
      $ip = get_field ($_) if /^IP_ADDRESS>/;
      $subj = get_field ($_) if /^SUBJECT>/;
   }
   close P;
}
sub get_field {
   my $in = shift;
   chomp($in);
   $in =~ s/^.*>//;
   return $in;
}

# ------- Extracting log information ---------------
sub extract_logline_info;
sub extract_log_info {
   if (-M $post > 0.1) {
      open G, "grep -h $ip $logdir/meat.* |";
   } else {
      open G, "grep -h $ip $logdir/access.log |";
   }
   $hits = 0;
   while (<G>) {
      extract_logline_info();
      $hits += 1;
   }
   close G;
}

# ------- Extracting log line information ---------
sub extract_logline_info {
   chomp;
   $line = $_;
   ($first, $dt, $last) = split / *[\[\]] */, $line;
   $dt =~ s/ -\d*$//;
   ($ip, $j1, $j2) = split / +/, $first;
   ($j1, $req, $respbw, $ref, $j2, $agent, $j3) = split / *" */, $last;
   ($meth, $url, $protocol) = split / /, $req;
   ($resp, $bandwidth) = split / /, $respbw;

   $file = $url;
   $file =~ s/.*\///;
}

From here on out, I won't quote the entire script (it's getting unwieldy), just individual functions I'm changing. The output from this version is already shaping up a little:

????: 6817 - Re: Eating the piece of fruit
????: 6847 - home equity loan rate calculator
????: 6854 - Re: Eating the piece of fruit
????: 6856 - Re: Eating the piece of fruit
????: 6869 - yesterday that large auto is relevant against,
????: 6886 - samsung last month made the RAZR V3x Ferrari Chall
????: 6888 - Re: Eating the piece of fruit
????: 6908 - Wholesale Video Games
????: 6917 - cialis vs viagra -
????: 6966 - 43 - It is the greatest treatment
????: 6967 - My Video collection :)
SPAM: 6968 - Cool site. Thanks!!! pharmacy [url=http://hulpen.i
????: 6969 - I - Real good treatment - 1
????: 6971 - II - Wanna some more - 1?
????: 6972 - Nice pornstar Nice busty babe
????: 6973 - Computer Geeks Pick Up Lines
????: 6974 - very usful pages about..
????: 6975 - free hardcore sex videos

We've eliminated one post from consideration, and positively identified another. Now we can get down to business. First, if there are hits to a non-forum page after the first hit to a forum page, we eliminate the post as legitimate. To do this, we modify extract_logline_info as follows:

sub extract_logline_info {
   chomp;
   $line = $_;
   ($first, $dt, $last) = split / *[\[\]] */, $line;
   $dt =~ s/ -\d*$//;
   ($ip, $j1, $j2) = split / +/, $first;
   ($j1, $req, $respbw, $ref, $j2, $agent, $j3) = split / *" */, $last;
   ($meth, $url, $protocol) = split / /, $req;
   ($resp, $bandwidth) = split / /, $respbw;

   $file = $url;
   $file =~ s/.*\///;

   $forum_post = 0;
   $forum_post = 1 if $file =~ /^discuss.pl/;
   $forum_post = 1 if $file =~ /^board_/;
}

Now we can determine boardness and non-boardness of a given profile. If there are only board hits, we call it spam. To do that, let's just count board hits and non-board hits in extract_log_info:

sub extract_log_info {
   if (-M $post > 0.1) {
      open G, "grep -h $ip $logdir/meat.* |";
   } else {
      open G, "grep -h $ip $logdir/access.log |";
   }
   $hits = 0;
   $content_hits = 0;
   while (<G>) {
      extract_logline_info();
      $hits += 1;
      $content_hits += 1 unless $forum_post;
   }
   close G;
}

And finally, let's define a rule to take advantage of this information:

sub apply_rules {
   if ($hits == 0) { return 'no'; }
   if ($hits == 1 && $meth eq 'POST') { return 'yes'; }
   if ($content_hits == 0) { return 'yes'; }
   return 'maybe';
}

Now we've already made significant progress! Here's our new output:

????: 6817 - Re: Eating the piece of fruit
SPAM: 6847 - home equity loan rate calculator
????: 6854 - Re: Eating the piece of fruit
????: 6856 - Re: Eating the piece of fruit
SPAM: 6869 - yesterday that large auto is relevant against,
SPAM: 6886 - samsung last month made the RAZR V3x Ferrari Chall
????: 6888 - Re: Eating the piece of fruit
SPAM: 6908 - Wholesale Video Games
SPAM: 6917 - cialis vs viagra -
????: 6966 - 43 - It is the greatest treatment
SPAM: 6967 - My Video collection :)
SPAM: 6968 - Cool site. Thanks!!! pharmacy [url=http://hulpen.i
????: 6969 - I - Real good treatment - 1
SPAM: 6971 - II - Wanna some more - 1?
SPAM: 6972 - Nice pornstar Nice busty babe
SPAM: 6973 - Computer Geeks Pick Up Lines
SPAM: 6974 - very usful pages about..
SPAM: 6975 - free hardcore sex videos
SPAM: 6978 - What to do with my collection?

Already pretty convincing! But it's missing two spams, the two by the guy who hit the blog post before spamming. This guy's in Russia, by the way, according to his IP of 81.177.26.50. Interesting. XRUMER is also Russian. The Russian forum spam programmers seem pretty clever, and I'm sure we'll be seeing much more of them in the years to come.

Let's implement our "initial content doesn't count as non-spam" rule to get this guy. Here, I want to be careful; if content hits go over about two, I want to avoid filtering the post as spam, because that scenario fits a technical-question kind of person. Not really a factor on this forum, but I'll be reusing this on the wftk forum as well, and in the exceedingly unlikely event that somebody someday stops by there again and posts something legitimate, it would be a shame to filter it out as spam.

First, we change extract_log_info to record initial content hits as well as total content hits:

   sub extract_log_info {
   if (-M $post > 0.1) {
      open G, "grep -h $ip $logdir/meat.* |";
   } else {
      open G, "grep -h $ip $logdir/access.log |";
   }
   $hits = 0;
   $content_hits = 0;
   $initial_content_hits = 0;
   while (<G>) {
      extract_logline_info();
      $hits += 1;
      $content_hits += 1 unless $forum_post;
      $initial_content_hits += 1 unless $forum_post or $hits != $content_hits;
   }
   close G;
}

And then we add another rule:

sub apply_rules {
   if ($hits == 0) { return 'no'; }
   if ($hits == 1 && $meth eq 'POST') { return 'yes'; }
   if ($content_hits == 0) { return 'yes'; }
   if ($initial_content_hits == $content_hits && $content_hits < 3) { return 'yes'; }
   return 'maybe';
}

OK? Now we correctly identify all the spam from the last week, and don't false-positive on any legitimate posts:

????: 6817 - Re: Eating the piece of fruit
SPAM: 6847 - home equity loan rate calculator
????: 6854 - Re: Eating the piece of fruit
????: 6856 - Re: Eating the piece of fruit
SPAM: 6869 - yesterday that large auto is relevant against,
SPAM: 6886 - samsung last month made the RAZR V3x Ferrari Chall
????: 6888 - Re: Eating the piece of fruit
SPAM: 6908 - Wholesale Video Games
SPAM: 6917 - cialis vs viagra -
SPAM: 6966 - 43 - It is the greatest treatment
SPAM: 6967 - My Video collection :)
SPAM: 6968 - Cool site. Thanks!!! pharmacy [url=http://hulpen.i
SPAM: 6969 - I - Real good treatment - 1
SPAM: 6971 - II - Wanna some more - 1?
SPAM: 6972 - Nice pornstar Nice busty babe
SPAM: 6973 - Computer Geeks Pick Up Lines
SPAM: 6974 - very usful pages about..
SPAM: 6975 - free hardcore sex videos
SPAM: 6976 - II - Real good treatment - 2
SPAM: 6977 - I - Yes, this is it - 2
SPAM: 6978 - What to do with my collection?

Finally, we want to move spam into a spam folder and, if any spam was moved, delete the message index so that it will regenerate on the next visit to the forum. To do this, we modify the outer loop as follows:

$spam_count = 0;
foreach $post (sort @files) {
   next if (-M $post > 7);
   extract_post_info();
   extract_log_info();
   $spammy = apply_rules();
   if ($spammy eq 'yes') {
      print "SPAM: $post - $subj\n";
      system "mv $post spam";
      $spam_count += 1;
   } elsif ($spammy eq 'maybe') {
      print "????: $post - $subj\n";
   }

}
if ($spam_count) { system "rm messages.idx"; }

And that is our preliminary forum despammer. There are more refined things we could do, but if the problem stays as lowkey as it currently is, then there's no real point in spending more time on it.

I hope you enjoyed our little chat. There is another chapter with the full script.






Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.