Keyword perl


Today's fun task was the creation of a little prototype to format the tag cloud for the drop handler project. I did it in the context of this blog, and so first I had to get my keywords functional. I already had a database column for them, but it turned out my updater wasn't writing them to the database. That was an easy fix.

Once I had keywords attached to my blog posts, I turned my attention to formatting them into keyword directories (the primary motivation for this was to enable Technorati tagging, on which more later). Once that was done, I had all my keywords in a hash, so it occurred to me that I was most of the way towards implementing a tag cloud formatter anyway.

Here's the Perl I wrote just to do the formatting. It's actually amazingly simple (of course), and you can peruse the up-to-the-minute result of its invocation by my blog scanner on the keywords page for this blog. Perl:

sub keyword_tagger {
   my $max_count = shift @_;    # maximum post count over all tags

   my $sm  = 70;                # smallest font size, in percent
   my $lg  = 200;               # largest font size, in percent
   my $del = $lg - $sm;
   my $ret = '';
   foreach my $k (sort keys %kw_count) {
      my $weight = $kw_count{$k} / $max_count;            # 0..1
      my $font   = sprintf ("%d", $sm + $del * $weight);
      $ret .= "<a href=\"/blog/kw/$k/\" style=\"font-size: $font%;\">$k</a>\n";
   }

   return $ret;
}

This is generally not the way to structure a function, because it works with a global hash, but y'know, I don't follow rules too well (and curse myself often, yes). The assumptions:

  • The only argument passed is the maximum post count over all tags ($max_count in the code above), determined by an earlier scan of the tags while writing their index pages (sketched just after this list).
  • $sm and $lg are effectively configuration; they determine the smallest and largest font sizes of the tag links (in percent).
  • The loop runs through the tags in alphabetical order; they are all assumed to be in the %kw_count global hash, which stores the number of posts associated with each tag (we build that while scanning the posts).
  • For every tag, we look up its post count in the %kw_count hash and scale its font size linearly between $sm and $lg, then format the link at that size. This is a rather hardwired approach (the link should really be a configurable template), but as a prototype and for my own blogging management script, it works well.
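
For context, the earlier scan that builds %kw_count and finds the maximum might look roughly like this -- a sketch only, since the @posts array and its keywords field are hypothetical stand-ins for however the posts are actually stored:

our %kw_count;

sub scan_keywords {
   my @posts = @_;              # hypothetical: one hashref per post
   my $max_count = 0;
   foreach my $post (@posts) {
      foreach my $kw (@{ $post->{keywords} }) {
         $kw_count{$kw}++;
         $max_count = $kw_count{$kw} if $kw_count{$kw} > $max_count;
      }
   }
   return $max_count;
}

# Then producing the cloud is just:
# my $cloud = keyword_tagger (scan_keywords (@posts));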

For our file cloud builder, we'll want to do this very same thing, but in Python (since that's our target language). But porting is cake, now that we know what we'll be porting.

Thus concludes the sermon for today.



This is something I've wanted to do for a couple of weeks now -- I have a handy set of scripts to filter the chaff out of my hit logs and grep the rest into convenient category files (like "all interesting non-bot traffic to the blog"). So I've written a script to take all that blog traffic and determine which tags each hit should be attributed to. Hits to individual pages boost the counts of all their tags.

The resulting tag cloud is on the keyword tag cloud page, next to the cloud weighted by posts. This is a really meaningful way to analyze blog traffic and get a feel for what people are actually finding interesting. A possible refinement would be to time-weight the hits so that more recent hits count for more (that would be pretty easy to do, actually -- even as cheesily as counting hits and multiplying all the counts by 90% for every ten hits, or something).

The Perl code to read the logs and build the cloud file is below the fold.
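
In the meantime, here's a minimal sketch of the idea, cheesy recency decay included. It assumes hits.log holds one request path per line (already filtered down to interesting blog traffic) and that %tags_for maps each post path to its tags -- both hypothetical stand-ins for the real formats:

use strict;
use warnings;

# Sketch only: the paths and tags here are made-up examples.
my %tags_for = (
   '/blog/some-post/'    => [ 'perl', 'wiki' ],
   '/blog/another-post/' => [ 'wftk', 'python' ],
);

my (%hit_count, $hits_seen);
open my $log, '<', 'hits.log' or die "hits.log: $!";
while (my $path = <$log>) {
   chomp $path;
   next unless $tags_for{$path};
   # The cheesy recency weighting: every ten hits, decay all the
   # counts by 10%, so hits later in the log count for more.
   if (++$hits_seen % 10 == 0) {
      $hit_count{$_} *= 0.9 for keys %hit_count;
   }
   $hit_count{$_}++ for @{ $tags_for{$path} };   # boost every tag on the page
}
close $log;

# Write the cloud file: one "tag count" line per tag.
open my $out, '>', 'hit_cloud.txt' or die "hit_cloud.txt: $!";
printf $out "%s %.2f\n", $_, $hit_count{$_} for sort keys %hit_count;
close $out;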

2008-05-31 wiki perl

So a week or two ago I was suddenly seized by the desire to Wiki-ize my venerable old site. I know, I know. There are pages here I hand-coded in 1996. There's stuff I tweaked into magnetic core memory using tweezers and a small rare-earth magnet in 1948. And we felt lucky to have that cardboard box!

But, well, I love vi. Lately, though, I've been feeling the need to stray from my first love: the ability to whack content into a simple form, click a button, and have it published with no further ado, with all the sidebars and stuff in place -- well, I needed that.

So I did it. And as with everything else in my life, I did it with an idiosyncratic blend of Perl for the guts and AOLserver Tcl for the Web presentation and input parsing. Eventually I will present the code. But in the meantime, I'll note two things. First: it works, and works well, and works extremely efficiently, because Wiki pages are published once, when changed, and are then served as flat HTML files when requested. Contrast this with MediaWiki, which hangs interminably on the database every damned time it generates the sidebar menu. Bad design, if you ask me (but of course, nobody did).
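
To make that concrete, the core of the publish-once design boils down to something like this. A sketch only: the toy render_wiki stands in for the real markup expander, and the output path is made up (the real system publishes into the site's htdocs).

use strict;
use warnings;

# Toy markup expander: turns [[PageName]] into a link and wraps a paragraph.
sub render_wiki {
   my $text = shift;
   $text =~ s{\[\[(\w+)\]\]}{<a href="$1.html">$1</a>}g;
   return "<p>$text</p>\n";
}

# Publish once, on save. Afterwards the web server hands out the
# flat file; no database is touched on a page view.
sub publish_page {
   my ($name, $wikitext) = @_;
   open my $out, '>', "$name.html" or die "$name: $!";
   print $out render_wiki ($wikitext);   # sidebars etc. would be baked in here
   close $out;
}

publish_page ('HomePage', 'Welcome! See the [[TodoList]].');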

Second: it integrates the beginnings of a pretty efficient data management tool. I'm using it for to-do lists right now, but I'm looking at various other applications as well. And it will probably feed right back into workflow, if all goes well. The most exciting thing about this aspect of the system is that organized data can be anchored and commented upon in the Wiki system. I'll be putting this to much more extensive use in the analysis of spam over at Despammed.com, but even in the context of my to-do list management it's proving a powerful tool for data organization.

Other extensions I hope to explore are a CodeWiki (which will allow literate commentary on program code and other textual resources), a document management tool for binary objects like images, and, more immediately, the replacement of this blog tool with Wiki-based code that does the same thing.

This last month has been quite productive in terms of the code I use in my everyday life, and the Wiki tool has been a big part of that. So I hope this burst of momentum continues.


2009-03-20 wftk python perl ruby

So I had this really, really stupid idea a couple of days ago, but I just can't shake it. See, I'm rewriting the wftk in Perl in tutorial form, something that I've planned for a really long time.

Well, here's the thing. The Muse picked Perl, essentially because WWW::Modbot is an OOification of the original modbot stuff I wrote in Perl. And the Term::Shell approach to the modbot turned out to resonate so well with what I wanted to do that I just ... transitioned straight from the modbot back to wftk in the same framework. But Perl -- even though I love Perl -- is not something I'm utterly wedded to, you know?

And now, I'm working in a unit-testing paradigm for the development. I've carefully defined the API in each subsection, tested it, and know where I'm going.

So here's the stupid idea. It just won't let go of me. Why stick to Perl?

Why not take each class and each unit test, and do that in selected other languages? It would be a fascinating look at comparative programming between the languages, wouldn't it? And the whole point of the wftk is not to be restrictive when it comes to your existing infrastructure -- wouldn't one facet of that unrestrictiveness be the ability to run natively in Python? Ruby? Java? C? Tcl? LISP?

It just won't let go.


2014-05-24 perl nlp

Looking at the adoption list of CPAN modules (these are ranked by a need score based on the number of issues registered, the number of dependencies, and the age of the issues), I see there are actually quite a few in the Lingua namespace.

It would probably be character-building and instructive to adopt a few and fix them up.






This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.