So I'm putting "despammer.pl" into the discussion directory, and the first thing I want to do is just to find the posts:
    opendir D, ".";
    @files = grep /^\d+$/, readdir(D);
    closedir D;

    foreach $post (@files) {
        print "$post\n";
    }
This simply scans the directory and pulls out the filenames which are all numeric-only. When I run it, I get a long list of numbers. So far, so good. Next, I want to ignore anything older than a week. This will make sure that I'm not fooled in case I move log files elsewhere, and it'll also cut down a bit on overhead (not that this will actually incur much overhead, but still: waste not, want not.) So I add one line:
    opendir D, ".";
    @files = grep /^\d+$/, readdir(D);
    closedir D;

    foreach $post (@files) {
        next if (-M $post > 7);   # skip anything modified more than 7 days ago
        print "$post\n";
    }
Now I have a more manageable list of posts. (19, according to "perl despammer.pl | wc".)
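For anyone who hasn't met the -M file test before, here is a minimal sketch of what it returns. It assumes nothing about the forum: it makes a scratch file with File::Temp and backdates it ten days with utime, purely for illustration. -M reports the file's age in (fractional) days, measured from the moment the script started.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file and backdate it by ten days (illustration only).
my ($fh, $name) = tempfile(UNLINK => 1);
close $fh;
my $ten_days_ago = time - 10 * 24 * 60 * 60;
utime $ten_days_ago, $ten_days_ago, $name or die "utime failed: $!";

# -M is the age in days since last modification, relative to script start.
my $age = -M $name;
printf "%s is %.1f days old\n", $name, $age;
print "older than a week, would be skipped\n" if $age > 7;
```

Because the result is fractional, a threshold like 0.5 would mean "older than twelve hours".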
OK. For each post, I have a file that looks like this:
    SUBJECT>cialis vs viagra -
    POSTER>Paydaybess
    EMAIL>
    DATE>1170330822
    EMAILNOTICES>no
    IP_ADDRESS>84.19.188.16
    PREVIOUS>6286
    NEXT>
    IMAGE>
    LINKNAME>
    LINKURL>
    <P>viagra vs cialis cialis viagra vs cialis levitra viagra vs cialis drug viagra vs cialis strong viagra vs cialis effects side viag
    <P>cialis vs viagra
Yeah, I'm pretty sure that one's spam. So I open the file and extract the line that starts IP_ADDRESS>, as follows: (I don't care about the rest of the file.)
    opendir D, ".";
    @files = grep /^\d+$/, readdir(D);
    closedir D;

    foreach $post (@files) {
        next if (-M $post > 7);
        open P, $post;
        while (<P>) {
            next unless /^IP_ADDRESS>/;
            chomp;
            s/^IP_ADDRESS>//;
            $ip = $_;
            last;
        }
        close P;
        print "$post - $ip\n";
    }

Now I have an output more like this:
    6869 - 74.131.29.150
    6908 - 63.166.111.6
    6817 - 137.110.127.65
    6917 - 84.19.188.16
    6886 - 58.15.127.29
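The loop above only wants one field, but the same idea extends to the whole header. The following is a minimal sketch, assuming each header sits on its own line as NAME>value and that the first line not of that shape starts the body. The parse_post name is made up for this illustration and is not part of despammer.pl.

```perl
use strict;
use warnings;

# Parse "NAME>value" header lines into a hash; stop at the first body line.
sub parse_post {
    my ($text) = @_;
    my %field;
    for my $line (split /\n/, $text) {
        last unless $line =~ /^([A-Z_]+)>(.*)$/;
        $field{$1} = $2;
    }
    return \%field;
}

# A shortened copy of the sample post shown earlier.
my $sample = join "\n",
    'SUBJECT>cialis vs viagra -',
    'POSTER>Paydaybess',
    'EMAIL>',
    'DATE>1170330822',
    'IP_ADDRESS>84.19.188.16',
    '<P>viagra vs cialis';

my $post = parse_post($sample);
print "$post->{IP_ADDRESS} - $post->{SUBJECT}\n";
```

Empty fields such as EMAIL> come back as empty strings rather than being dropped, which makes it easy to tell "blank" from "missing".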
Fortunately, I already have scripts running hourly to pull interesting things out of the log for me, and so to look at these IPs I can grep those files for the IP. If the post is less than an hour old, I'll look at the current raw access log. All those are in my log directory; the current log is access.log and the older ones are all in meat.*. I'm just going to use a naive grep to get the information out, because I don't really care too much about scalability. If I cared about scalability, I'd probably build this into the forum posting script itself, and just scan for each IP as it came in. But this approach lets me try out techniques before doing that, and possibly disabling the forum while I tinkered in its guts. (Again: not that this forum would kill me, but these habits all stem from cold-sweat-inducing incidents in my checkered past.)
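One thing "naive" means here: handed to grep as a pattern, the dots in an IP match any character, and 84.19.188.16 is also a prefix of 84.19.188.160. On the command line, grep -F treats the pattern as a fixed string, which handles the dots. A minimal Perl-side sketch of a stricter match follows. The lines_for_ip helper and the log lines are invented for illustration; only the IP comes from the sample post.

```perl
use strict;
use warnings;

# Return the lines containing $ip as a literal address: \Q...\E escapes the
# dots, and the lookarounds reject a longer number on either side.
sub lines_for_ip {
    my ($ip, @lines) = @_;
    my $re = qr/(?<!\d)\Q$ip\E(?!\d)/;
    return grep { $_ =~ $re } @lines;
}

# Made-up log lines, not real traffic.
my @log = (
    '84.19.188.16 - - [01/Feb/2007:06:53:42 -0500] "GET /a HTTP/1.0" 200 10',
    '84.19.188.160 - - [01/Feb/2007:06:54:00 -0500] "GET /b HTTP/1.0" 200 10',
    '84x19y188z16 - - [01/Feb/2007:06:55:00 -0500] "GET /c HTTP/1.0" 200 10',
);

my @hits = lines_for_ip('84.19.188.16', @log);
print "$_\n" for @hits;
```

Only the first line survives. An unescaped pattern would have matched all three.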
Let's grep. And let's pull the subject of the post out, too, so we can keep track of what's what.
    $logdir = "/usr/local/aolserver/servers/vivtek/modules/nslog";

    opendir D, ".";
    @files = grep /^\d+$/, readdir(D);
    closedir D;

    # Strip everything up to the last ">" to leave just the field's value.
    sub get_field {
        my $in = shift;
        chomp($in);
        $in =~ s/^.*>//;
        return $in;
    }

    # Break one access-log line (in $_) into globals: $dt, $ip, $url, $ref, ...
    sub extract_info {
        chomp;
        $line = $_;
        ($first, $dt, $last) = split / *[\[\]] */, $line;
        $dt =~ s/ -\d*$//;
        ($ip, $j1, $j2) = split / +/, $first;
        ($j1, $req, $respbw, $ref, $j2, $agent, $j3) = split / *" */, $last;
        ($meth, $url, $protocol) = split / /, $req;
        ($resp, $bandwidth) = split / /, $respbw;
        $file = $url;
        $file =~ s/.*\///;
        # Flag images, scripts and icons as chaff (dots escaped so they match literally).
        $chaff = 0;
        $chaff = 1 if $file =~ /\.gif/i;
        $chaff = 1 if $file =~ /\.jpg/i;
        $chaff = 1 if $file =~ /\.js/i;
        $chaff = 1 if $file =~ /\.png/i;
        $chaff = 1 if $file =~ /\.ico/i;
    }

    foreach $post (@files) {
        next if (-M $post > 7);
        open P, $post;
        while (<P>) {
            $ip = get_field ($_) if /^IP_ADDRESS>/;
            $subj = get_field ($_) if /^SUBJECT>/;
        }
        close P;

        # -M is in days, so 0.1 is roughly 2.4 hours.
        if (-M $post > 0.1) {
            open G, "grep -h $ip $logdir/meat.* |";
        } else {
            open G, "grep -h $ip $logdir/access.log |";
        }

        print "------------------------------------------------------------------------------------\n";
        print "$post - $ip - $subj\n";
        print "------------------------------------------------------------------------------------\n";
        while (<G>) {
            extract_info();
            print "$dt - $url - $ref\n";
        }
        close G;
    }
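To see what those splits in extract_info actually produce, here is a minimal sketch that runs the same three splits on one line and returns a hash instead of setting globals. It assumes the log is in the usual NCSA combined format, which is what the splitting implies. The parse_log_line name and the sample line are invented for illustration.

```perl
use strict;
use warnings;

# Split one combined-format log line the way extract_info does,
# returning the pieces in a hash.
sub parse_log_line {
    my ($line) = @_;
    chomp $line;
    # Brackets delimit the timestamp: host part, [date], then the rest.
    my ($first, $dt, $last) = split / *[\[\]] */, $line;
    $dt =~ s/ -\d*$//;    # drop a negative timezone offset such as -0500
    my ($ip) = split / +/, $first;
    # Quotes delimit request, referrer and agent; the first field is empty
    # because the remainder starts with a quote.
    my (undef, $req, $respbw, $ref, undef, $agent) = split / *" */, $last;
    my ($meth, $url, $protocol) = split / /, $req;
    my ($resp, $bandwidth) = split / /, $respbw;
    return { ip => $ip, dt => $dt, url => $url, ref => $ref,
             agent => $agent, resp => $resp, bytes => $bandwidth };
}

# A made-up log line, not real traffic.
my $sample = '84.19.188.16 - - [01/Feb/2007:06:53:42 -0500] '
           . '"GET /forum/post.html HTTP/1.0" 200 5120 '
           . '"http://example.com/ref" "Mozilla/4.0"';

my $r = parse_log_line($sample);
print "$r->{dt} - $r->{url} - $r->{ref}\n";
```

Note that the timezone substitution only strips offsets that begin with a minus sign. A log written with a positive offset would keep it in the date string.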