Topic: AOLserver -- Date fixup for access log bug

AOLserver index ]
For two years I've hated this stupid bug. The site in question has about 10,000 hits per week (excluding all graphics) and every few weeks AOLserver (v2.1 on Solaris) glitches the access log, transposing a couple of digits in the date, or leaving a couple of bytes out, or simply incrementing the year by 100 or so. It's weird.

So I wouldn't care that much, except for two things. First, the first-pass log analysis is done by Analog, which is a darned good program. In the case where one year is a hundred years later, though, Analog gives me a bar chart of the entire hundred years of traffic. Most of it's pretty empty, but the resulting Analog report, instead of the accustomed couple hundred K, comes out to several megabytes of download. Not good.

But in the last year or so, we've need much more detailed analysis of site usage (the site is Techspex if you're interested; it's a largish online database). So we've been postprocessing the logs and loading large chunks of them into the database and doing lots of really involved queries on them. And to do that, I postprocess the logs into a tab-delimited format for mass insertion via micopy (yeah, still on Illustra). And micopy croaks on those bad dates, some of them, meaning that the reports don't run automatically and I have to go change the file by hand and run them manually.

So I finally broke down and wrote a script to fix the logs. And here it is. You can also get a plaintext version here for easier download if you want to use it. Knock yourself out. Tell me if you change it.

# Copyright (c) 2000 Vivtek --
# Use this any way you like, but don't blame me if it doesn't work or somehow crashes your machine and burns your office
# down.  Fix it and tell me what you did.  I don't promise you anything at all.
First, a quick little function which breaks a line into the parts we're interested in: the date, the time, and everything else. The everything else goes into two sections, of course.
sub crack { 
   my ($peer, $user, $whatever, $rest) = split (/ /, shift (@_), 4);
   my ($stamp, $stuff) = split /\]/, $rest, 2;
   $stamp = substr($stamp, 1);

   my ($d, $nont) = split / /, $stamp, 2;
   my ($dt, $t) = split /:/, $d, 2;

   return ("$peer $user $whatever [", $dt, $t, " $nont] $stuff");

We store two previous lines' worth of info. We're always working on the last line before the one we just read. The idea is that we want our dates and times to be continuous. If parts of the date are the same on either side of the last line, but different in the last line, we know the last line has a bogus date, see?
$lastpre = "";
$lastpost = "";
$lastdt = "";
$lastti = "";

$llastpre = "";
$llastpost = "";
$llastdt = "";
$llastti = "";

Here a little logging functionality just as a sanity check. Obviously you should change the location of the log file to suit your filesystem.
system "date +%y/%m/%d >> /usr/local/AOLserver/web/nslog/analog_done/fixups.log";
open LOG, ">>/usr/local/AOLserver/web/nslog/analog_done/fixups.log";

$line = 0;

The rest is pretty obvious. Get a line. If there was a last line, process the last line and print it, then save everything for the next line. Processing is just a matter of applying some tests and fixing things if the line fails. Most of the tests compare the line to the lines before and after it.
while (<>) {
   $line ++;
   ($pre, $dt, $ti, $post) = crack ($_);

   $dateok = 0;
   $timeok = 0;

   if ($lastpre ne "") {
      # Check time first for being a valid time
      if (length ($lastti) < 8) {
         print LOG "$line: time was '$lastti', replaced with '$ti'\n";
         $lastti = $ti;
      ($h, $m, $s) = split /:/, $lastti, 3;
      if ($h < 0 || $h > 23 || $m < 0 || $m > 59 || $s < 0 || $s > 59) {
         print LOG "$line: time was '$lastti', replaced with '$ti'\n";
         $lastti = $ti; # Use the next time up as the timestamp if we've lost the real info.
      # OK, now let's check the date.  For date checking we need *both* bracketing dates.
      if ($llastpre ne "") {
         if ($lastdt ne $dt) {
            # Well, the dates differ.  They can obviously do that anyway...  But not by much.
            ($lld, $llm, $lly) = split m'/', $llastdt, 3;
            ($ld, $lm, $ly) = split m'/', $lastdt, 3;
            ($d, $m, $y) = split m'/', $dt, 3;

            # These three rules restore continuity in almost all cases.  If a glitch occurs
            # right on the boundary between days, you're still screwed, but frankly with any
            # appreciable traffic the probability is extremely low so I'm not too worried
            # about it.  If you want to fix it, do it and tell me how.!
            if ($y == $lly && $ly != $y) {
               $ly = $y;
            if ($d == $lld && $ld != $d) {
               $ld = $d;
            if ($m eq $llm && $lm ne $m) {
               $lm = $m;
            if ($lastdt ne "$ld/$lm/$ly") {
               print LOG "$line: date was '$lastdt', replaced with '$ld/$lm/$ly'\n";
               $lastdt = "$ld/$lm/$ly";
      print "$lastpre$lastdt:$lastti$lastpost";

   $llastpre = $lastpre;
   $llastdt = $lastdt;
   $llastti = $lastti;
   $llastpost = $lastpost;

   $lastpre = $pre;
   $lastdt = $dt;
   $lastti = $ti;
   $lastpost = $post;

if ($lastpre ne "") { # Print the line hanging in the buffer if there was any data at all.
   print "$lastpre$lastdt:$lastti$lastpost";

close LOG;

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.