Blog topic tag cloud weighted by traffic

This is something I've wanted to do for a couple of weeks now -- I have a handy set of scripts that filter the chaff out of my hit logs and grep the rest into convenient category files (like "all interesting non-bot traffic to the blog"). So I've written a script that takes all that blog traffic and works out which tags each hit should be attributed to. A hit on a tag page counts toward that tag; a hit on an individual post counts toward every tag on that post.

The resulting tag cloud is on the keyword tag cloud page next to the cloud weighted by posts. Weighting by traffic is a useful way to get a feel for what people are actually finding interesting, as opposed to what I happen to post about most. A possible refinement would be to time-weight the hits so that more recent hits count for more (that would be pretty easy to do, actually -- even something as cheesy as multiplying all the counts by 90% after every ten hits would do it).
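Here's a rough sketch of that cheesy decay idea, just to show it really is only a few lines. It assumes the hits are processed oldest-first; add_decayed_hit, $every and $decay are names made up for this example and aren't part of the script below, and the hit stream is invented.

```perl
use strict;
use warnings;

my %tag_count;
my $hits_seen = 0;
my $every = 10;     # decay interval: the "every ten hits" part
my $decay = 0.9;    # decay factor: the "90%" part

# Hypothetical helper: count one hit for a tag, then shrink every
# count by $decay each time another $every hits have gone by.
sub add_decayed_hit {
    my ($tag) = @_;
    $tag_count{$tag} += 1;
    $hits_seen++;
    if ($hits_seen % $every == 0) {
        $_ *= $decay for values %tag_count;   # values() aliases the hash
    }
}

# Invented hit stream: ten older hits on 'perl', then ten newer on 'wiki'
add_decayed_hit('perl') for 1 .. 10;
add_decayed_hit('wiki') for 1 .. 10;

# The largest count has to be found after the fact, since decay
# would leave a running maximum stale.
my ($max_count) = sort { $b <=> $a } values %tag_count;

printf "%s %.2f\n", $_, $tag_count{$_} for sort keys %tag_count;
```

Both tags got ten hits, but the older 'perl' hits have been decayed twice and the 'wiki' hits only once, so 'wiki' ends up with the larger weight. Note that a running maximum like the one inc_kw_count keeps wouldn't survive the decay step, which is why this sketch computes it at the end.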

The Perl code to read the logs and build the cloud file is below the fold. It's not brain surgery, but I still think it's kinda cool:

open IN, '<', 'blog_hits.txt' or die "Can't open blog_hits.txt: $!";

$max_count = 0;
sub inc_kw_count {
   my $tag = shift;
   my $inc = shift;
   $inc = 1 unless $inc;
   $tag_count{$tag} += $inc;
   $max_count = $tag_count{$tag} if $tag_count{$tag} > $max_count;
}

while (<IN>) {
   chomp;
   $line = $_;
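   # Brackets split the line into client fields, timestamp, and the rest
   ($first, $dt, $last) = split / *[\[\]] */, $line;
   $dt =~ s/ [-+]\d+$//;   # drop the timezone offset, whichever its sign
   ($ip, $j1, $j2) = split / +/, $first;
   # Quotes split the rest into request, status/bytes, referrer, and agent
   ($j1, $req, $respbw, $ref, $j2, $agent, $j3) = split / *" */, $last;
   ($meth, $url, $protocol) = split / /, $req;
   ($resp, $bandwidth) = split / /, $respbw;

   # /blog/kw/foo.html -> type 'kw', tag 'foo.html'
   ($j1, $j3, $type, $tag, $j2) = split /\//, $url;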

   $tag =~ s/\.html$//;
   # No type and no tag means the blog front page; count it as '[index]'
   if (not $type and not $tag) {
      $type = 'kw';
      $tag = '[index]';
   }
   next unless $type;
   next unless $tag;

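   if ($type eq 'kw') {
      # A hit on a tag page counts toward that tag directly
      inc_kw_count($tag);
   } elsif ($type eq 't') {
      # A hit on a post is tallied per page and handed to its tags later
      $page_count{$tag} += 1;
   }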
}
close IN;

# Credit each post's hits to every keyword on that post
open P, '<', 'posts.txt' or die "Can't open posts.txt: $!";
while (<P>) {
   chomp;
   ($post, $date, $title, $page, $keywords) = split /\t/;
   next unless $page_count{$page};
   foreach $kw (split / /, $keywords) { inc_kw_count($kw, $page_count{$page}); }
}
close P;

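$sm = 70;            # smallest font size, in percent
$lg = 200;           # largest font size, in percent
$del = $lg - $sm;
open CLOUD, '>', 'hitwords.tag' or die "Can't write hitwords.tag: $!";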
foreach $kw (sort keys %tag_count) {
   $weight = $tag_count{$kw} / $max_count;
   $font = sprintf ("%d", $sm + $del * $weight);
   $url = "/blog/kw/$kw";
   if ($kw eq '[index]') { $url = '/blog/'; }
   print CLOUD "<a href=\"$url\" style=\"font-size: $font%;\">$kw</a>\n";
}
close CLOUD;
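
The chain of splits in the main loop is the densest part, so here's a small sketch that runs the same splits against a single made-up log line. It assumes blog_hits.txt is in Apache "combined" log format, which is what the splits seem to expect; the IP, date, referrer and agent are all invented for the example.

```perl
use strict;
use warnings;

# Made-up log line, assuming Apache "combined" format
my $line = '10.0.0.1 - - [12/Mar/2008:10:15:32 -0500] '
         . '"GET /blog/kw/perl.html HTTP/1.1" 200 5120 '
         . '"http://example.com/" "Mozilla/5.0"';

# Brackets separate client fields, timestamp, and everything else
my ($first, $dt, $last) = split / *[\[\]] */, $line;

# Quotes separate request, status/bytes, referrer, and user agent.
# The leading quote yields an empty first field, and the gap between
# the referrer and agent quotes yields another; both are skipped.
my (undef, $req, $respbw, $ref, undef, $agent) = split / *" */, $last;

my ($meth, $url, $protocol) = split / /, $req;

# '/blog/kw/perl.html' -> ('', 'blog', 'kw', 'perl.html')
my (undef, undef, $type, $tag) = split /\//, $url;
$tag =~ s/\.html$//;

print "type=$type tag=$tag\n";
```

For this line the script would see a 'kw' hit on the 'perl' tag. A request for /blog/ splits into just ('', 'blog'), which leaves the type and tag empty, and that's the case the '[index]' branch picks up.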






This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.