Vivtek 2.0, or, The Blog

2014-09-06 blogmeta

Traditionally, our family has made use of summers to visit whichever of our two countries we're not currently in. For twenty years, that meant a pilgrimage to Budapest; now that we live in Budapest it means road tripping in the States. This year was no exception - from June 14 to August 30, according to my calculations, I drove a total of 4927 miles, mostly shuttling various family members around Indiana for this event and that.

Even at half-power for the paying work, the result was absolutely zero mental capacity for anything resembling coding, writing, writing about coding, coding related to translation - a complete and utter void. I didn't even take any meaningful notes, beyond the occasional recap of The Plan. (The Plan is a multi-threaded thing that starts with better file handling, goes through literate exegesis of codebases and declarative accounting structures, visits machine translation on the way, and culminates in Hofstadterian generality at its most tenuous reaches - I want to say it's been a constant companion in my life for many years, except that it is utterly mutable. It's actually been a very inconstant companion in my life.)

Anyway, now I'm back in Budapest, Facebook is blocked, I have nobody to visit, no class reunions, all known at-risk family members have already died, I have no need or means to drive people around the countryside, and I'm over the jetlag and the virus I caught in August - in short, session has resumed.

Hold onto your butts.

2014-05-24 perl nlp

Looking at the adoption list of CPAN modules (these are ranked by a need score based on the number of issues registered, the number of dependencies, and the age of the issues), there are actually quite a few in the Lingua namespace.

It would probably be character-building and instructive to adopt a few and fix them up.

2014-05-24 nlp

I'm getting hip-deep in German lexicography these days; since I can actually realistically tokenize my translation documents now, I keep trying to, well, tokenize my translation documents now.

I've broken things up into Lingua::Lex, which is general tools for managing a lexicon, and Lingua::Lex::DE, which is specifically my German lexicon. This allows me to test things at the level of the mechanism in one place, and at the level of specific lexical rules in another. (Note: I haven't gotten as far as testing specific lexical rules, but it's there in potential, anyway.)

The distribution version of Lex::DE consists of flat files in convenient ASCII-only encoding; as part of the setup procedure we build an SQLite database for the actual lexical work. I'm not sure how big the actual lexicon will end up being yet, but judging from the igerman98 distro, it's not frighteningly large or impossible for CPAN to handle. We'll see.

The lexical rules from igerman98 only scratch the surface - and they're grammatically naive, as there was no reason to try to encode parts of speech, so for instance any word that takes an 's' on the end can be lumped into the same flag no matter why it takes that 's'. Is it genitive? Plural? Something else? A spell checker doesn't care, but a lexicon driving a parser does.
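Here's a toy illustration of the difference (the structures and names are my invention for this post, not Lingua::Lex's actual interface): each suffix rule carries a grammatical tag, so an ambiguous form like "Autos" comes back with both readings instead of one undifferentiated "takes -s" flag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy sketch: suffix rules tagged grammatically, not just spell-check flags.
my %suffix_rules = (
    'Hund' => [ { suffix => 'es', tag => 'gen-sg' },   # des Hundes
                { suffix => 'e',  tag => 'nom-pl' } ], # die Hunde
    'Auto' => [ { suffix => 's',  tag => 'gen-sg' },   # des Autos
                { suffix => 's',  tag => 'nom-pl' } ], # die Autos
);

# Analyze a surface form: return base+tag for every rule that matches.
sub analyze {
    my ($word) = @_;
    my @readings;
    for my $base (keys %suffix_rules) {
        for my $rule (@{ $suffix_rules{$base} }) {
            push @readings, "$base+$rule->{tag}"
                if $word eq $base . $rule->{suffix};
        }
    }
    return sort @readings;
}

print join(', ', analyze('Autos')), "\n";  # ambiguous: genitive or plural
print join(', ', analyze('Hundes')), "\n"; # unambiguous genitive
```

A spell checker collapses both "Autos" readings into one flag; a parser-driving lexicon has to keep them apart.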

So there's work to be done, and I have the framework hammered out. I've made a lot of progress with it; the compounding mechanism works, and suffixes are mostly working as of today. Once I start trying to tokenize real text, then improvements should proceed apace. I figure maybe another week before I'm to the point of trying to feed these token streams to Marpa - but that's not too long at all!

2014-05-21 nlp

Oh, now this is cool. My lexicon has also been a convenient place to collect n-grams, but I've never found the raw n-grams to be all that helpful. Turns out (as is so often the case) I've been doing it wrong. There are a boatload of statistical measures of n-grams that try to capture how much more frequently a collocation appears than would be expected by sheer chance given the probability of the individual words.
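Pointwise mutual information is about the simplest of these measures; a self-contained sketch of the idea (Text::NSP offers this and many fancier ones):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# PMI for a bigram: how much more often the pair occurs than chance
# would predict from the individual word frequencies.
sub pmi {
    my ($pair_count, $w1_count, $w2_count, $total) = @_;
    my $p_xy = $pair_count / $total;
    my $p_x  = $w1_count  / $total;
    my $p_y  = $w2_count  / $total;
    return log($p_xy / ($p_x * $p_y)) / log(2);   # log base 2
}

# In a 1000-token corpus, "New" appears 20 times, "York" 15 times,
# and the pair "New York" 12 times -- far more often than chance.
printf "PMI = %.2f bits\n", pmi(12, 20, 15, 1000);
```

A raw count of 12 tells you nothing by itself; the PMI score is what tells you the collocation is real.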

The entire boatload appears to be encoded in Text::NSP, which from a cursory examination of its chief command appears to be readily adapted to my tokenizer. That's entirely cool. I'm looking forward to getting set up to the point that I can get serious about my 10-million-word corpus of technical German. (Yeah, that's how much I've translated in my career so far, starting in about 2004 and continuing to today.)

Soon, compadre. Soon!

2014-05-20 nlp

So I've long had a thing about doing NLP-type stuff in Perl. I know, I know. All the cool kids use NLTK in Python. So why Perl?

As always, the answer is CPAN. I can get a good, quick start in nearly anything by installing a CPAN module, and I know it has been tested on Windows already thanks to CPANtesters. And anything I write will be tested six ways from Sunday, too.

So Perl.

A few years ago I hacked out the beginnings of a tokenizer for NLP usage. It really just consisted of a convenient iterator wrapper around some very simple regexes, along with some n-gram type stuff for collocations (not that I've ever had much luck with those - yet). I've revived it and I've been tossing some actual translation jobs at it to see what sticks, and it's nearly ready for release.

I had the revelation, though, that what even NLTK is missing in terms of practical use is convenient retrieval of information from documents. So my tokenizer explicitly works with a source document, which can deliver a series of text and formatting commands in a pre-tokenization step. The formatting commands are passed right through by the tokenizer.
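In sketch form, the pass-through idea looks like this (toy code for illustration, not the module itself):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pre-tokenization step yields a mixed stream of text pieces and
# formatting commands; the tokenizer splits the text and forwards the rest.
my @stream = (
    [ text => 'Hello, world.' ],
    [ fmt  => 'bold-on' ],
    [ text => 'Important!' ],
    [ fmt  => 'bold-off' ],
);

sub tokenize {
    my @out;
    for my $item (@_) {
        my ($kind, $payload) = @$item;
        if ($kind eq 'text') {
            # very simple regex tokenization: words and punctuation
            push @out, [ token => $_ ] for $payload =~ /(\w+|[^\w\s])/g;
        }
        else {
            push @out, $item;   # formatting commands pass straight through
        }
    }
    return @out;
}

print "$_->[0]: $_->[1]\n" for tokenize(@stream);
```

The point is that the downstream NLP machinery never has to know or care about the document format; the formatting survives the trip.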

Along the way, I realized that to do part-of-speech tagging I was going to need a lexicon. I've got a dumb model of a lexicon running against SQLite (which will be good for job-specific vocabulary), but for the main lexicon in German, it just isn't possible to get around the morphological structure of the German language. So I'm currently adapting the igerman98 ispell dictionary. Its affix script is a pretty good run-down of German morphology, although it doesn't encode parts of speech very accurately. (Nouns are capitalized, of course, and adjectives/adverbs are pretty much "A"-flagged decliners.)

There's going to be a lot of tweaking involved, but the end result is going to be a pretty good data-based lexicon that can probably fall back on some educated guesses for parts of speech of unknown words.

Here's the kicker. If the part of speech is ambiguous at the word level, Marpa can simply figure it out from context (usually). I think I have a good plan for this, but until I have a reasonable coverage of parts of speech in my lexicon, I won't have anything to experiment with yet.

Soon, though, I'm going to be able to make some specific contributions to making NLP in Perl a reality. I've been talking about doing this for a long, long time indeed. It's exciting to be actually making progress with it for a change.

2014-05-08 webcomics humor

2014-04-24 writing

So I just finished WWW::KeePassRest and wrote a little article about how I did that, and now, flush with victory, I'm casting around a little to see what else I can write about.

There's a lot.

First up will probably be the second installment on the TRADOS language saga; with Win32::RunAsAdmin I now have the crucial tool I was missing to wrap the prototype code up into a module suitable for framing. That gives me a decent little utility I can call from the command line - easier to remember than looking for a script, that's for sure!

But after that... I'm not sure. I've done some work getting IE::Mechanize ready for new generations of Internet Explorer, but until I'm actually done with that effort, I don't think I want to write about it. Although my tools for doing so are getting better.

Similarly, the work I've done with Exegete (my tool for writing about code) has only begun to scratch the surface. It'll turn into good writing as well as a good writing tool soon enough.

There are three big topics I want to address, but they're all pretty daunting. They're going to be article series instead of single articles. The first is Win32::OLE: specifically, documenting the XS code and the APIs it exposes but never documents anywhere. That's been a problem for me for some time, and I'm not the only one.

Kind of tangential to this would be a module generator for COM interface modules; Win32::Shortcut would be a good starting point for that kind of very simple wrapper. Since this would be a decent supplement to a better understanding of OLE in the first place, and since I halfway suspect it's going to end up being my only actual option for automating IE, it's actually a pretty decent early target.

Where exegesis is going to really get serious is GPG. Thomas Ptacek proposed some kind of code-level documentation a few months ago and I've really been working towards building the tools and the skills to write that kind of text. That will be useful for a wider audience, too, so from a point of view of mindshare it will be an excellent target.

Finally, there's always OpenLogos, the machine translation tool I adopted (and have since neglected) a couple of years ago. It's massively huge and atrociously (that is: not) documented, so again, it's something for which the whole concept of a code exegesis will pay off.

So there's no shortage of grist for the mill. It's just a question of picking closer targets first. I suspect the order I've presented these systems in will be roughly the order I start writing about them.

Oh, I suppose an article about the refactoring approach I've been taking to these little CPAN modules might not be a bad idea.

2014-04-12 blogmeta

Between 2009 and 2014, a lot happened to me, but none of it shows in this blog. Mostly that's because I wanted to blog the house renovation, and my blogging code of the day was a horrendous, unmaintainable mess that never seemed to do anything I wanted - so I switched to Blogger. (You can see the house blog here, but as we left the country again two years ago and unloaded the house last summer, it's gotten kinda boring.)

In November of that year, I started thinking harder about what programming actually means. And that turned into the semantic programming blog, also at Blogger. Over the next two years, that effort resulted in a kind of neat declarative programming system for Perl, which collapsed under its own technical debt shortly thereafter.

I had family stress for the year after that, culminating in our leaving the country for Budapest in 2012 after our daughter graduated high school. The summer after that we sold the house and I discovered that my blood pressure could best be measured in psi. And now things with family, health, and finances are actually pretty good, and my thoughts wander back to writing.

The Blogger system is great for short-to-medium text and pictures that can be written quickly and don't need any internal structure. It was a fantastic medium for the house blog. It is not fantastic for code given its rigid three-column format, difficulty comprehending monospace fonts, and refusal to handle indentation, and is also less than great for any presentation longer than a few paragraphs. So to write real technical articles, I needed to revitalize the Vivtek site.

Back in the day, the site was hosted on AOLserver for historical reasons, and elements like the sidebar menu and a lot of the other bits and pieces were handled dynamically. The content was compiled on the server. But as that system aged and the box was put to additional uses, the cracks in its structure became apparent. Anything exposed to Internet input became an instantaneous spam archive; MediaWiki, in a different site I was hosting there, was a relentless processor hog; the Despammed spam filter made sure that any resource problem would be magnified tenfold as the queue backed up behind the kink; and the accreted weirdness of twelve years of haphazard Perl scripts running behind the scenes had forced me to move the site's static content to Github hosting. The dynamic content was left to hang in the weather. I had more important fish to fry.

So to get the site back into working shape, I had to reinvent my content handling code on my laptop and integrate with the static site. And there simply hasn't been enough time to think about that - especially given my penchant, when given any programming task, to think about how it should be possible to code it at a higher semantic level. I can talk to myself at length about just finishing this damn script today and worrying about entire new programming paradigms at some later date, but it doesn't help.

But at the end of last month, after working two months' worth of jobs in March, I decided that what I needed in the first week of April was a sabbatical. I wanted to write at least one article. (Actually I have a whole page of one-line ideas of things to do this week. And instead, it's Saturday night, day 9 of the sabbatical, work already queued up for Monday, and I haven't even written the article yet - but I did do the research for it, and wrote a kick-ass tool for my daily work, and that's really the point of this week.)

And lo! I have the site compiling again - and with a tool I'm not even ashamed of! One that will hopefully grow into a more mature writing environment, including a notion I've been calling "code exegesis", about which more later. Hopefully, now that the site builder works well, I can keep things moving with less than a full week off work, and build up some momentum. I have a lot of things I want to write about, so that's not a problem. Historically, family and financial stress have been my greatest impetus sinks, and those (knock on wood) are at low ebb currently; may they stay that way forever.

And that's why the blog below goes from 2014 to 2009 with no visible transition. I am cautiously hopeful that things will start moving again now.

2014-04-09 blogmeta

I've finally got the blog publisher working, so all the historical posts and keyword assignments now appear where they should (mostly; still have a couple of weirdnesses to figure out).

It all gets compiled and built on my local machine, written to static HTML, and pushed via git to Github, which as you know, Bob, now serves my static content. (This statement implies that my non-static content is hosted elsewhere, which is technically true - it's still on my old box, but doesn't actually work right now. That is a can of worms for another day.)

I had a lot of fun setting things up, and eventually I intend to post about the new site publication system. But in the meantime, there are lots of other things I also want to write about, and my sabbatical week is already half over just on the blog publishing system alone. (Sigh.)

Over the past two years, I've trained myself to use a note-taking system of my own design to track programming work and ideas, and I've augmented that system to publish some notes to the blog. This is the first of those notes. I hope to rebuild the habit of technical writing now that my translation productivity has risen - which should at least potentially free up some time. Right?

Anyway. There's lots to do. I'm going to get to it.

2011-01-01 blogmeta

There were no posts between 2009 and 2014 for reasons I explain in the state of the site post.

2009-07-31 house

Reader tom writes in to say:

Subject: Can anyone edit your blog?

I hope not. But just in case we can I'd just like to say (and I speak for all of humanity here) that you should let us people on the internet come and live in your carriage house whenever we feel like it. We wouldn't pay you per se. But think of it as wiki-living.

thank you, -tom

Tom, I can't tell you how horrifying an idea I find that, but the sad fact of the matter is that my family and I are living in the carriage house at the moment, before getting underway renovating the big house.

However, finding tom's post in the spam filter (sorry, tom, nothing personal) reminds me that I kind of left my home-grown blog high and dry in favor of the shiny new House-specific Blogspot blog I started shortly after arrival here in town. Check it out; lots of status updates and pictures.

2009-04-14 house

Just about one month until I get to see my house!

2009-04-13 humor

Just a little something that occurred to me whilst perusing the Curmudgeon:

Sorry, sorry.

2009-03-31 house

So I got this spam today: Lower house Payments 30%. And I realized: 70% of zero is still zero.

How cool is that?

2009-03-24 economics house

Tomorrow, March 25, I'm closing on the house. The title company has the money, my sister has my power of attorney, and all is set; tomorrow I become a landowner again. But there's something I don't understand, now that I've seen the HUD document detailing the financing of this deal.

John Fitch bought this house to form the Renaissance House in 2003, financing $54,000. I very much doubt he'd paid more than he had to. So five years later, I can't imagine he owed less than $50,000 or so. Following me?

Of the $8000 changing hands tomorrow, $2500 goes to the realtor, $700 to a property management company (coincidentally also the realtor, but that's not the point), $500 to a listing placement company, etc. etc. The actual current holder of the mortgage will be getting a tad over $4000 for this magnificent structure.

So here's my question. Given that this is more than a 90% loss -- why? Why would they foreclose? Is this one of those "pennies on the dollar mortgage purchases" one reads about? Who benefits?

Because somebody thought this was a good idea. The realtor clearly thinks so, with reason. I think so, because I get a cheap house (actually, Renaissance House may end up getting the use of the building for a year or two; we'll see). But the original holder of the mortgage?

Somebody's really insane in all this. And it seems indicative of the whole economy.

2009-03-22 misc

So I was just measuring the latitude and longitude of my new house (closing in three days!) to play around with solar heating ideas -- when I noticed how close to 40 degrees of latitude it really is. Just a shade further north, and it would be right smack on it.

So I panned north a little, then a little west... My Dad's chicken house is precisely at 40.0000 degrees latitude. My old bedroom is at 39.9997, if you were wondering. The chicken house's longitude (and that of my bedroom) is 85.1561 degrees, not nearly as interesting a number.

Fascinating. I hadn't really ever considered how small a distance one ten-thousandth of a degree of latitude might be; it works out to about 36 feet.
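A quick back-of-the-envelope check of that figure, using the standard value of about 111.32 km per degree of latitude:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One degree of latitude is roughly 111.32 km anywhere on Earth,
# so one ten-thousandth of a degree is:
my $meters_per_degree = 111_320;
my $m  = $meters_per_degree * 0.0001;
my $ft = $m * 3.28084;   # meters to feet
printf "0.0001 deg latitude = %.1f m = %.1f ft\n", $m, $ft;
```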

2009-03-20 wftk python perl ruby

So I had this really, really stupid idea a couple of days ago, but I just can't shake it. See, I'm rewriting the wftk in Perl in tutorial form, something that I've planned for a really long time.

Well, here's the thing. The Muse picked Perl, essentially because WWW::Modbot is an OOification of the original modbot stuff I wrote in Perl. And the Term::Shell approach to the modbot turned out to resonate so well with what I wanted to do, that I just ... transitioned straight from the modbot back to wftk in the same framework. But Perl -- even though I love Perl -- is not something I'm utterly wedded to, you know?

And now, I'm working in a unit-testing paradigm for the development. I've carefully defined the API in each subsection, tested it, and know where I'm going.

So here's the stupid idea. It just won't let go of me. Why stick to Perl?

Why not take each class, each unit test, and do that in selected other languages? It would be a fascinating look at comparative programming between the languages, wouldn't it? And the whole point of the wftk is not to be restrictive when it comes to your existing infrastructure -- wouldn't one facet of that unrestrictiveness be an ability to run native in Python? Ruby? Java? C? Tcl? LISP?

It just won't let go.

2009-03-18 house

The plot thickens. It turns out that the edifice in question was until some point in time no earlier than August, 2007 an "intentional community ... balanc[ing] work, ministry, and restoration." Here is a picture of a man with an alligator behind or to the side of the house:

Fascinating, the notion of buying a house with history attached.

The house was mentioned in Quaker Life's September 2003 issue: "Renaissance House: A Ministry of Renovation". It is still linked to from, but the domain is gone to the Bitbucket in the Sky, and no, the Wayback Machine only caches an older owner of the domain, not the 2005-2007-vintage site, which is a shame. Pictures would have been nice.

Update 2009-03-18: The alligator's name is Amos Moses, and the man himself is John Fitch, the director of Renaissance House and former owner of the edifice in question. It turns out he's a pretty nice guy (well, you expect that of Quakers) and the Renaissance House community is still going strong in two nearby houses with much lower payments.

So that's good.

A tutorial approach to workflow -- right here!

So for the last month or so I've been working on rewriting the wftk from the ground up, in Perl, with proper object orientation (which is to say, given it's Perl, just enough object orientation to let me not trip over my feet, but not so much that it makes me crazy).

When I started my casual redevelopment, it was because I had discovered Term::Shell, which is brilliant for casual development of code where you're really not quite sure what you want to do with it. Just start writing, and develop new commands as they occur to you. Cycle often. This worked great for the current version of WWW::Modbot (available on CPAN, but not very well-documented yet, because I got sidetracked on the wftk, you see).

Anyway, I got quite a ways in before I realized I was starting to break things I'd done earlier. So I did something I'd never done before, but had always intended to: I reorganized the entire project to use test-driven development. Each new set of features or "featurelets" is in a subsection of a tutorial. I write a new section of the tutorial, making up the code as I think it should work, copy that code into a new test section, and run "make test". Then I fix it.
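The loop is easy to show in miniature with core Perl's Test::More (the featurelet here is a throwaway of mine, not actual wftk code): write the example as it appears in the tutorial text, copy it into a test, run it, fix until green.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

# "Tutorial code": a trivial featurelet, written the way I wish it worked.
sub slugify {
    my ($title) = @_;
    my $slug = lc $title;
    $slug =~ s/[^a-z0-9]+/-/g;   # runs of non-alphanumerics become hyphens
    $slug =~ s/^-|-$//g;         # trim stray hyphens at the ends
    return $slug;
}

# "Test section": mirrors the tutorial text's examples exactly.
is slugify('A Tutorial Approach to Workflow'),
   'a-tutorial-approach-to-workflow', 'titles become slugs';
is slugify('  wftk!  '), 'wftk', 'leading/trailing junk trimmed';

done_testing();
```

Run it with prove (or "make test" in a normal distribution layout), and every tutorial subsection doubles as a regression test.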

So all of my development is fixing. I'm good at fixing.

You can see the current state of the Perl tutorial over here. I've started with data manipulation as per my blog post last year about how the wftk should be structured. It's going slowly because there's just so much functionality that needs to be in there -- but every day, I do a little subsection, and I feel a sense of accomplishment.

This is a sweet library. You can define lists with a natural syntax, add data with a copy command or by throwing lists and hashes at it, then query it all with SQL. It's everything I had always thought should be in the (data) repository manager, but didn't have the time to write, because in Perl, half of everything is already there and waiting for you on CPAN.
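To give the flavor of that (and only the flavor: the names below are invented for this post, not the actual wftk API), assuming DBD::SQLite is installed, "define a list, throw hashes at it, query it with SQL" looks roughly like:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite stands in for the repository's backing store.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# "Define a list" -- here, just a table.
$dbh->do('CREATE TABLE tasks (name TEXT, status TEXT)');

# "Throw hashes at it."
my $ins = $dbh->prepare('INSERT INTO tasks (name, status) VALUES (?, ?)');
$ins->execute(@$_{qw(name status)}) for (
    { name => 'tokenizer', status => 'done' },
    { name => 'lexicon',   status => 'open' },
    { name => 'parser',    status => 'open' },
);

# "Query it all with SQL."
my $open = $dbh->selectcol_arrayref(
    q{SELECT name FROM tasks WHERE status = 'open' ORDER BY name});
print "open: @$open\n";
```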

I'm going to start blogging each major point finished, just to continue to give myself a feeling of accomplishment, and also to provide a little bit of timeline for myself looking back. This project has been running for about two months already; it'll probably take a year, at least, perhaps more -- so it's going to be fruitful to look back and see what I finished when.


If you haven't read Michael Chabon's Yiddish Policeman's Union, then please, I beg of you, sweetness, go out, purchase a copy of that book, and read it.

I think it may well be the best book I have read in a very long time indeed, and on many levels.

2009-03-07 webcomics humor


That the universe should misspend one mote of its grace and bounty on a fool like that is all the proof I need that the throne of the Lord sits empty.

2008-09-21 life plumbing

OK, so for reasons of decrepitude, the water shutoff for our place doesn't work (they weren't installed very well and our valve is snapped off entirely) so to do plumbing without local cutoff, like the showers, I have to turn off the water to our entire building (three apartments).

Anyway, the shower we don't use has been leaking since we moved in here, but recently it's been a lot worse. I've been meaning to get around to it, but during the day the neighbors need water, and during the night, the kids sleep right next to it.

But the wife and kids are at Boy Scout camp tonight. (Not me, I grew up in a log cabin, and that was all the camping I need for life.) So after a grueling day of paperwork, at 2 AM I figured, well, why not now? I've got the gasket set, etc.

So I shut off the water, go up, take a firm grip on the faucet with my pliers, and wrench it.

The whole thing came off in my hands. As in, ripped off the pipe.

OK. So, I have the water shut off to all the neighbors, it's 2 AM in Ponce and nothing is open (I mean nothing), and I don't have a car anyway.

Think, think, think. This is not so different from shooting yourself in the foot with Unix sysadmin work, and I've done that often enough...

OK, so the inside of the faucet is this weird thing, with hot and cold in one pipe-looking thing, each feed with a, what, 3/8" or something copper tube. I found the biggest screws I have on hand (remember: no car, no stores -- have to fix it with stuff I have in the house). They're not big enough, but it's close.

Things I learned:

1. Electronics soldering irons are useless on plumbing.

2. There is a limit to how much teflon tape you can wrap around something and still make it work. This limit appears to be a thickness smaller than the actual thread.

3. Despite our extensive collection of laboratory glassware, we don't have any rubber stoppers. (This should be rectified.)

4. Although a stopper could easily be fashioned from duct tape, we don't have duct tape. (This should definitely be rectified.)

5. Rule #47: if you don't even have duct tape, electrician's tape sometimes works.

6. Electrician's tape, rolled into a plug and jammed into the feed tubes, then screwed into place, reduces explosive leaks to drippy leaks.

7. Although a drippy leak at 3 AM is probably good enough for most people, I appear not to be most people.

8. Aquarium tubing, with electrician's tape wrapped around it to make a gasket, inserted into the copper tubing, then screwed in place with a 1/8" wood screw, doesn't even drip.

9. It is probably not quite balanced to blog about this instead of going to bed at 3:30 AM.

2008-08-24 humor

So there I was, clicking down a list of old bookmarks I just ran across, tossing out the link rot and mulling over the stuff that's still in existence, and what should I find but this:


Man, I needed that. I have to say -- I have great taste in bookmarks; they always seem to be stuff I like!

2008-08-11 botnet

If, like me, you've been wondering when the Storm people would do something new, your answer would have to be: an hour ago.

Here's the analysis for the new page, although I've just barely started.

2008-08-03 propaganda

Arghhh! Have you seen these ads saying "How do people like Vista when they don't know it's Vista?" Turns out they exposed a bunch of people to Vista and they (gasp) liked the user interface! Wow! So I should go out and buy Vista!

Uh ... no. Vista has a great interface -- but that's not why people are failing to buy it in droves. They're failing to buy it because it's new (and therefore buggy), it's broken by design (because it decrees that you don't own your PC -- license holders do, and if you're lucky, they'll let you use it), and because it represents money that I simply don't need to spend.

XP works for me. I was slow to adopt even XP -- when my last Windows 95 PC couldn't keep up any more, I switched. And frankly, the only reason I'm still running Windows is I'm a dinosaur who uses dinosaur software like Word and TRADOS (unfortunately, both absolute and utter kings of the translation industry -- the source of my money).

Anyway, if I can't vent on my blog, where can I?

2008-08-03 botnet spam

Over the past couple of days as I datamined the Despammed spam archives for Storm botnet spam, I've grown to really enjoy their madcap subjects (latest here). But today?

Guys! "Obama bribing countrymen" or "McCain picks Osama bin Laden as VP" are hilarious! But "Video News"? "Top stories"? Come on! If you're going to hijack a million people's machines to spam us all, the least you can do is to continue to be entertaining about it. This? This is beneath you.

So I got halfway through analysis of my first Javascript obfuscation discovered via spam, when another came in, and then another! And then I realized -- these were sent from botnet-controlled mailers that were slipping past my no-DSL filters at Despammed. So how many were getting blocked?

Turns out, a lot. Like, a lot. So I'm going to have plenty of grist for this mill -- and the very fascinating thing is that it sure looks like there is a change in tactics each day. So I'm going to try to go back through older instances and hope that people haven't fixed their servers yet for some, and I'm going to put up some early warnings to tell me about new ones -- but this is truly, truly fun.

Each of these mails has a faux news headline: "Michael Vick escapes from Federal jail", or "Beijing Olympics canceled", the one that first drew my attention. Then the body of the mail has a different headline, and a link.

Turns out that different headline is drawn from the same list. So I can check the spam archive (1.2 million spam emails on file at the moment) for other emails with that subject. And so on. This should allow me to build a database of subjects really, really easily. And then I can simply scan for those subjects to find new instances. If they select their headlines randomly (and I have no reason to believe they don't) this should allow me to find all their headlines and keep up with new ones at the same time. Fun!
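The scanning step is almost embarrassingly simple; a sketch of the idea (the subjects below are from the spam itself, but the filter logic is a stand-in for whatever heuristics the real archive scan ends up needing):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Known botnet headlines harvested so far.
my %known = map { $_ => 1 } (
    'Michael Vick escapes from Federal jail',
    'Beijing Olympics canceled',
);

my @incoming = (
    'Beijing Olympics canceled',
    'McCain picks Osama bin Laden as VP',   # new headline: add it
    'Re: your invoice',                     # ordinary mail: ignore
);

for my $subject (@incoming) {
    if ($known{$subject}) {
        print "known botnet subject: $subject\n";
    }
    elsif ($subject !~ /^Re:/) {
        # crude candidate test standing in for the real heuristics
        print "possible new subject: $subject\n";
        $known{$subject} = 1;   # the list grows itself
    }
}
```

Each pass over the archive grows the headline list, and the grown list catches more of the next pass. That's the self-feeding part that makes it fun.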

Once I've got that coded, I'll post a database page in real time. [Updated to include link.] That will be even more fun. And then I can resume the de-obfuscation effort. Actually, I've dusted off some old project idea notes and started work on the monkeywrench to help me organize this stuff.

Note to anybody interested: the design philosophy of the monkeywrench is essentially a Hofstadter parallel terraced scan. But operated by a human (for now) in a workflow paradigm. I can sloooowly start to feel the various bits of my life coming together.

I got a spam today saying the Beijing Olympics had been cancelled, so I was all "O hai, Botnet, I can has spamtrail?" (Because I hear the Russians are using fake news headlines to induce people to open the mail now. And part of this trail goes through Russia, as we'll see.)

The whole story (well, as much as I've followed and written down so far) is over here because it is really detailed. But it's fun so far, because not only is the main injection page obfuscated, it appears to be encrypted and the decryption code is itself obfuscated and located on a different server. In Russia.

So far, it's been instructive, as always when one unravels these threads. More later.

2008-07-20 traffic

When I Wikiized the site, and started indexing the Wiki changes, I naturally also wanted to start looking at incoming traffic and referrers, as you can see on the "recent" page on the main menu. And of course I then started refining it to suit my tastes.

I had already had a "" script to preprocess the logs and give me the hits I want to see. That screens out spiders, everything I myself do from home, and (lately) any IP that posts spam to the forum or Wiki. The remainder is proving pretty interesting.

Normally, one can filter out search engine spiders based on their agent. But Microsoft, as always, follows their own rules (a little research on "" and "QBHP" will show you plenty of griping). They use a normal IE agent string, but mark their search queries using the "form" parameter.

And you know, normally I wouldn't care. But their search queries are weird. They consist of a single word, usually (but not always) one found on the page, and if you're actually paying attention to search queries to determine what it is about your site people find interesting, these won't help.

So now my preproc script blocks everything from the 65.55.*.* block with "form=QBHP" in the referrer. You just have to wonder what Microsoft is thinking, sometimes.
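
For the curious, the rule boils down to a couple of lines. Here's a hedged sketch in Python (the real preproc script isn't shown on this page; the Apache combined-log parsing and all function names are my assumptions):

```python
import re

# Hypothetical sketch of the filter described above: drop any hit from
# Microsoft's 65.55.*.* block whose referrer carries the QBHP marker.
def is_msn_stealth_spider(ip, referrer):
    return ip.startswith("65.55.") and "form=QBHP" in referrer

def filter_hits(lines):
    # Assumes Apache combined log format: ip, request, status, size, referrer.
    kept = []
    for line in lines:
        m = re.match(r'(\S+) \S+ \S+ \[.*?\] "(.*?)" \d+ \S+ "(.*?)"', line)
        if not m:
            continue
        ip, request, referrer = m.groups()
        if not is_msn_stealth_spider(ip, referrer):
            kept.append(line)
    return kept
```

The agent string is deliberately ignored here, since (as noted) it looks like a normal IE browser anyway.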

2008-07-19 politics

Glenn Greenwald writes today that "the idea that the Rule of Law is only for common people, but not for our political leaders and Washington elite, is pervasive among the political and pundit class, in both parties."

My immediate realization was that this attitude is exactly equivalent to the notion that financial risk applies only to individuals and to small business, but that large business (the "financial elite") is simply too large to be allowed to fail. Oh, it's certainly allowed to succeed, mind you. The banks in the mortgage market were happy enough to take home the profits from selling houses at rates far above their value. But when their risky actions proved, well, risky, they weren't expected to pay the price. We were.

In a similar way, when our political elite breaks the law, they consider themselves to be too important to pay that price. As a result, we all do.

2008-07-16 wiki blogwiki

This is my first post using my Wiki Web interface with blog extensions. There are still a few rough edges, but I believe this should work -- meaning I'll now be able to spout off on my blog at the drop of a hat.

Oh, sure. I could have just installed MovableType like a normal person. I am no normal person.

I've always had a soft spot for good explanations of Internet sleuthing for fun and profit, and here's a dandy example.

2008-06-08 fiction

Well, yesterday was fun; I got BoingBoing'd for my Tales of the Singularity. But today, I have regained my cool sang froid about all that, and it's almost too much trouble to grep for referrers again. (Almost.)

The only question is: having gotten a few eyeballs, how do you hold them? I'm pretty sure the answer is: you don't. They come, they go. Having once done something interesting, perhaps I can later continue to be interesting, but the key insight (besides hexapodia) is that you don't keep eyeballs just because you want them. That's putting the cart before the horse -- as Paul Bunyan discovered in my story.

So really, my experience mirrored my story. How ... meta. And very unexpected.

2008-05-31 wiki perl

So a week or two ago I suddenly was seized by the desire to Wiki-ize my venerable old site. I know, I know. There are pages here I hand-coded in 1996. There's stuff I tweaked into magnetic core memory using tweezers and a small rare-earth magnet in 1948. And we felt lucky to have that cardboard box!

Well, I love vi. But lately I've been feeling the need to stray from my first love, and the ability to whack content into a simple form, click a button, and have it published with no further ado, with all the sidebars and stuff in place -- well, I needed that.

So I did it. And as with everything else in my life, I did it with an idiosyncratic blend of Perl for the guts and AOLserver Tcl for the Web presentation and input parsing. Eventually I will present the code. But in the meantime, I'll note two things. First: it works, and works well, and works extremely efficiently, because Wiki pages are published once when changed, and are then available as flat HTML files when requested. Contrast this with MediaWiki, which hangs interminably on the database every damned time it generates the sidebar menu. Bad design, if you ask me (but of course, nobody did.)
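
To make the publish-once design concrete: the real system is Perl and AOLserver Tcl, but the shape of the idea fits in a few lines of Python (all names here are illustrative, not the actual code):

```python
from pathlib import Path

# The publish-once-on-change idea in miniature: rendering happens at
# save time, so serving a page is just a flat-file read, with no
# database in the loop.
def render(wiki_text):
    # stand-in for the real Wiki markup processor
    return "<html><body>%s</body></html>" % wiki_text

def save_page(name, wiki_text, outdir):
    # Publishing happens exactly once, when the page changes; every
    # subsequent request is served by the Web server as static HTML.
    out = Path(outdir) / (name + ".html")
    out.write_text(render(wiki_text))
    return out
```

The design tradeoff is the usual one: writes get slightly more expensive (every save re-renders), reads get vastly cheaper, and reads dominate.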

Secondly: it integrates the beginnings of a pretty efficient data management tool. I'm using it for to-do lists right now, but I'm looking at various other applications as well. And it will probably feed right back into workflow, if all goes well. The most exciting thing about this aspect of the system is that organized data can be anchored and commented upon in the Wiki system. I'll be putting this to much more extensive use in the analysis of spam over at, but even in the context of my to-do list management it's proving a powerful tool for data organization.

Other extensions I hope to explore are a CodeWiki (which will allow the literate commentating of program code and other textual resources), a document management tool for the management of binary objects like images, and, more immediately, the replacement of this blog tool with Wiki-based code to do the same thing.

This last month has been quite productive in terms of the code I use in my everyday life, and the Wiki tool has been a big part of that. So I hope this burst of momentum continues.

2008-05-17 modbot

As promised, a post on the modbot.

As you know, Bob, I first got into despamming more or less seriously in 1999, when I wrote and foisted it on an unsuspecting world. And life intervened, as it often does, and Despammed's fortune has waxed and waned along with it, but I still retain a fascination for the spam.

When XRumer hit the market in November 2006, and everybody suddenly started getting forum spam, I started work on the modbot, which is a set of Perl code for the detection of spam on the Web. And you know, there's lots of it. In the context of various projects, I personally am responsible for monitoring a MovableType blog (two, actually), a Scoop installation, a MediaWiki site, and the venerable old Toonbots forum on WebBBS. And they all get spam. The type of spam changes over time. The modus operandi changes over time. And I find it all irresistible.

After my first iteration of the modbot, I got distracted for about a year, and all those venues started to accumulate spam, slowly but surely. The Toonbots forum had some basic spam blocking in place, but it wasn't too effective, and of course MovableType has some fairly decent filters in place and is moderated anyway, so spam didn't proliferate too badly there. MediaWiki doesn't seem to be a real spam magnet, either (I suspect it needs a little more savvy than the low-level help spammers hire can be expected to master.) And so it was all pretty manageable until...

The Scoop installation (which I'd nearly forgotten about) was slowly growing in its server demand. I hadn't realized it for a while, because I had just assumed it was MediaWiki being the hog it most certainly is. I recognized that eventually I'd have to track it down, but I'd been quite busy. But finally, things got so bad I couldn't neglect it any more -- Apache was spending so much time locking the CPU that sendmail wasn't actually getting me my mail, and that was a problem for the paying work.

So, groaning at the notion I was going to have to get into MediaWiki's PHP and cache it or something, I took a closer look. And it turned out that while I wasn't watching, the Scoop installation had collected 34,000 comments and change. Shyeeah, like that was gonna happen. It was spam. Scoop doesn't react well to large numbers of comments -- each hit to a spammed page (including every new spam comment post) was hanging on the CPU for over a minute. Of course I knew: that meant war.

I dusted off the modbot code, because I wanted to archive the spam properly; eventually I'm going to do some analysis. And two days later, there were only twenty comments left on (not for lack of trying, of course. The modbot just cleaned up 1400 spam comments this afternoon.) Next I adapted it to the Toonbots forum; that went well, too. The modbot is carefully written to be as modular as possible, because spam crops up all over the place and I want one single way to filter it all.

My next target is MovableType, which has two categories of spam with different characteristics. There's normal comment spam, and I have a few techniques which will work well for that. But the other category is trickier, and blog-specific: trackback spam. gets about five trackback spams a day, and I'm still not entirely sure how to block them. Ultimately, one test is going to be to check the link being spammed; for trackbacks, if it forwards to another site, I regard that as spam. Haven't implemented it yet, though.
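
The proposed forwarding test could look something like this sketch. Assuming you've already followed the link's redirect chain with an HTTP client, the spam judgment itself is just a domain comparison (the function names are mine, not the modbot's):

```python
from urllib.parse import urlparse

# Sketch of the trackback test described above: if the link being
# spammed resolves (after redirects) to a different site than it
# claims, call it spam. Fetching the redirect chain is left to an
# HTTP client; this only does the final comparison.
def trackback_looks_spammy(claimed_url, resolved_url):
    claimed = urlparse(claimed_url).netloc.lower()
    resolved = urlparse(resolved_url).netloc.lower()
    return claimed != resolved
```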

MediaWiki spam is going to be tougher still; I'm going to need to write code to back the revisions out carefully, and I'm not yet sure how that's going to work without shooting myself in the foot. The really pernicious feature of MW spam, though, is that the spammers typically deface existing content. That's really not good. So it's going to be necessary.

One mode of the modbot is going to have to be email-based. For simple Web-post forms which deal with email, I want to be able to filter that spam before it comes to me. The normal email filters at Despammed, of course, can't begin to deal with that, because as email it's entirely legitimate. Instead, a judgement has to be made based on its actual content. That's on the to-do list, too.

Ultimately, it will be impossible to block spam -- there's no way for a machine to know with absolute certainty who you want to hear from. But that's exactly what makes it so very fascinating. The vast majority of spam is obvious, but sometimes ... sometimes you have to think about it. And the natural response of spammers will have to be to get better at spamming. I truly believe that the spam arms race is where natural computer intelligence has a good chance of arising. So ... I despam. It's my way of immanentizing the Eschaton.

Something that a lot of my project ideas have in common lately is a kind of generalized document management framework.

This isn't as impressive as it sounds, actually. But it's kind of a key notion for Web 2.0 stuff -- if you want collaboration, you have to have a place to store that collaborated content. That place is the document management system.

Let's consider this for a moment, in the context of the fantasy name generator from last week. That fascinating little thing takes a simple document -- the language definition -- and runs a Perl script on it, yielding some interesting results. The Toon-o-Matic does the same thing; it takes a simple XML document and runs a whole sheaf of Perl on it to generate an image. A Wiki for my general site content, or a forum, or even a simple Web form post, can all be seen as doing the same thing. An online programming tool; same thing. All these systems share a component -- the user can submit a large (ish) text object, often based on an existing one, for arbitrary processing, which usually has some visible effect on the system.

If you just look at that little unit of functionality, you can imagine lots of attractive ways to extend it, too. As I mentioned in my initial post on the fantasy namer, you can suddenly imagine allowing people to name a particular definition. You can imagine a page devoted to it, perhaps including all the results it's generated -- maybe in ranked order. All that's a lot of different features, but the central one is simply being able to store and manipulate that central document. It provides a hook on which you can start hanging interaction; without the hook you can't even conceive of where to start.
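
As a sketch of that central unit -- a store of text objects with arbitrary processors hung on it -- something like this minimal class captures the idea (all names are illustrative, not any real system's API):

```python
# The common unit described above, reduced to its essence: a named
# store of largish text objects, plus arbitrary processors hung on it.
class DocumentHook:
    def __init__(self):
        self.docs = {}        # the central documents, by name
        self.processors = []  # whatever you hang on the hook

    def attach(self, processor):
        # a name generator, a Toon-o-Matic render, a Wiki publish...
        self.processors.append(processor)

    def submit(self, name, text):
        # the user submits a text object, often based on an existing
        # one, and every attached processor gets a crack at it
        self.docs[name] = text
        return [process(name, text) for process in self.processors]
```

Everything else -- naming, commenting, ranked result pages -- hangs off this one hook.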

So this notion's been in my head lately. Oh, I'm sure all this was done in much more diligent detail ten years ago. (Well. Seven or eight years ago, anyway.) In fact, I can think of a couple of systems -- but they're all too damned complicated. What I'm after is the ability to boil these things down to their essence, to provide a language of thought about these systems. For myself, anyway. Assuming you exist, you may or may not benefit. (But I'll bet you would.)

Man, just when I think it can't possibly get any busier in my life, well, it does. So I just haven't had much time left over to do programming, and that makes me sad. But when this happens, eventually the Muse forces me to code something. This week, besides getting back to the modbot forum despammer (on which topic I will write later), suddenly last night I found myself writing a Perl script to generate fantasy country names based on a program written in some antediluvian BASIC dialect in the 70's by Jo Walton.

Once I'd written it, I realized it would be cooler if it were online -- because everything is. And so I wrote a little Tcl wrapper based on my Google count wrapper, and slapped it up. And played with it a lot.

So now I have a lot more ideas, of course. I wrote it to generate a random number of syllables up to a maximum, then gave it syllable forms like "cvv" and "vv" (where "c" stands for a consonant slot and "v" for a vowel slot). The vowels and consonants are likewise simple lists. You can put accents on the vowels, optionally. You can specify prefixes or suffixes. You can make letter choices for the consonants according to some scheme Jo dreamed up.
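
From scratch, the scheme might look like this (Python rather than the original Perl, and the letter lists and forms are placeholder data, not Jo's):

```python
import random

# Sketch of the generator described above: pick a random number of
# syllables up to a maximum, pick a form for each, and fill "c" slots
# from the consonant list and "v" slots from the vowel list.
CONSONANTS = list("ptkmnslr")
VOWELS = list("aeiou")
FORMS = ["cv", "cvv", "vc", "v"]

def make_name(max_syllables=3, rng=random):
    n = rng.randint(1, max_syllables)
    name = ""
    for _ in range(n):
        form = rng.choice(FORMS)
        for slot in form:
            name += rng.choice(CONSONANTS if slot == "c" else VOWELS)
    return name.capitalize()
```

Prefixes, suffixes, accents, and weighted letter choices all bolt onto the same skeleton.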

But really what this is -- or could be -- is a generic generative grammar at the phonological level. So I could well imagine making it more general, by providing a nested and arbitrary hierarchy of phonemes (liquids, fricatives, front and back) and providing more complicated options for rules.

And on the wrapper end, wouldn't it be nice to be able to save a language spec, bookmark it, name it, comment on it, mail it to your friends, etc.? How about an "evolutionary strategy" to search for the types of words you like the best? Heck, there's all kinds of stuff that could be done.

But this was fun. And from the traffic it's getting, I'm not the only one who finds it hypnotic.

2007-12-03 mediawiki

I have seen the future, and it calls itself MediaWiki.

2007-10-26 despammed

I finally took the plunge, after 8 years. All mail is now going through the Despammed filters. I just couldn't take the sheer mass of spam any more. (I get more than 1000 email spam messages a day and yes, for 8 years I've sorted through them by hand. Did I mention I read quickly? But lately I've been letting the spam slush pile get too big. Time for a change.)

There are some issues I still want to address -- like setting up a whitelist for customers to be on the very safe side -- but I have to say that the drop in spam in my actual Inbox has been (1) incredible and (2) so worth it.

2007-10-24 workflow wftk

I never really officially released wftk 1.0, of course (the magnitude of the task simply grew and grew and I became less and less certain of my approach -- and then the recession happened.) But I've been thinking a lot about a more reasoned approach lately, and maybe it's time to reboot the wftk project and start more or less "from scratch".

I see the modules in this new approach more or less as the following:

  1. Data management
    This is the basic list-and-record aspect that the repository manager started out addressing. Now, of course, there is SQLite. So a principled workflow toolkit would start by using SQLite for local tables, and add "external tables" (for which the new SQLite has an API) defined in what SAP now calls the "system landscape". It's amazing, by the way, how much of my thinking over the past few years I see reflected in what SAP is doing lately in their NetWeaver stuff.

  2. Document management
    Document management, as I see it, consists of: (1) actual central storage and versioning of unstructured data; (2) storage of metadata about documents; (3) parsing and indexing of unstructured data to produce structured data elsewhere in the system. The document manager should work well either in situations where it controls storage (and thus can initiate action whenever anything is changed) or in situations where it merely indexes storage that can be changed externally -- the latter might be, for instance, management of a Website's files in the file system. Or just your system files on a Windows machine. Periodically, the document manager could check in and see whether things had been changed, and if so, trigger arbitrary action.

  3. "Action" management
    A central script and code repository defines the actions that can be taken by a system. I consider this to include versioning and some kind of change management and documentation system, including literate programming and indexing of the code snippets. The build process should also be managed here, and should be capable, for instance, of taking algorithms written in C, compiling them into DLLs or .so dynamic load libraries, and calling them from Perl, say. Ultimately.

    Actions, documents, and data would have a nested structure, by the way; there would be global actions, application actions (a given case or project could be an instance of an application), and project/instance actions, and the same applies to data and documents, perhaps. Originally I'd thought of doing the same for users or organizational units, but I really think that if you're defining a common language of actions and data, it should be organized into applications and, perhaps, subapplications or something. But not differ by user! (I might be wrong, of course.)

    The above three modules together allow a data-flow-oriented processing system, but we're still missing:

  4. Outgoing interfaces
    This includes publishing of HTML pages, outgoing mail notifications, other notifications such as SMS or ... whatever. Logged, all of it. It includes report generation into the document management system or the file system, generation of PDFs, etc.

  5. Incoming interfaces
    Given the parsing power of the document management module, this is more an organizational module. The system should be able to receive email, parse it, and take action. Conversational interfaces are covered here as well, from SMTP- and IMAP-like state machines to chatbot NLP interfaces. And of course form submission from Websites also falls into this bucket.

  6. Scheduling
    Whether running on Unix with cron and at, or Windows with ... whatever the hell Windows offers, the system should have a single unified way of dealing with time in a list of scheduled tasks.

  7. Users, groups, roles, and permissions
    This module would be in charge of keeping track of who is performing a given action and whether they're allowed to do so. The original wftk already provided a mechanism I'd keep here: when judging permissions, any action can get one of three answers: "yes, it's allowed," "no, it's not allowed," or "it's allowed subject to approval." That last invokes workflow for any arbitrary action, and that would be a powerful abstraction for nearly any system. It's essentially transaction management on a much more abstract scale.

    And finally, the icing on the cake,

  8. Workflow
    The two components which make workflow workflow are a task list (tasks are hierarchical in nature and so a task can have subtasks as a separate project) and a workflow process definition language. The new wftk should be able to work with any workflow formalism -- after all, the process definitions are considered scripts in the versioned script document repository. The existing wftk engine will almost certainly fit in here with little modification.

    The primary benefit of workflow is that it allows dissociation over time. A running workflow process isn't active on the machine for the weeks or months it might require -- it's simply a construct in the database that gets resurrected as required. There are a boatload of applications in general programming, but nobody sees them as workflow because everybody "knows" workflow is a business application. The wftk was to have changed that, and I think the potential's still there.

    There's also a case to be made for a module for

  9. Knowledge management
    This portion of my thinking is a little less organized. I'd kind of like to lump some kind of concept database in here, perhaps a semantic parser or something. Originally I'd thought that AI would go in here, but I actually think that Prolog might just be another action script language. This is definitely a blurry line in its native habitat, and crikey, he's not happy to see me here!

    But the point of a blog is to write this stuff down as it occurs. So there you have it, this would sit on top of the workflow. Think of it as a way to build smart agents into your data/document/action/workflow management system.
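
To make the tri-state permission answer from module 7 concrete, here's a minimal sketch (Python, with purely illustrative names; nothing here is actual wftk code):

```python
from enum import Enum

# The three possible answers when judging permissions, as described
# under module 7. The third one is the interesting case: it kicks off
# a workflow process instead of settling for yes or no.
class Permission(Enum):
    ALLOWED = "yes, it's allowed"
    DENIED = "no, it's not allowed"
    NEEDS_APPROVAL = "allowed subject to approval"

def check(user, action, rules):
    # rules: hypothetical mapping of (user, action) -> Permission
    return rules.get((user, action), Permission.DENIED)

def attempt(user, action, rules, start_workflow):
    answer = check(user, action, rules)
    if answer is Permission.NEEDS_APPROVAL:
        # transaction management on a much more abstract scale:
        # the action is parked until the approval workflow finishes
        start_workflow(user, action)
    return answer
```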

And there you have it -- my plan to wrap up the thought and work of eight years. Oh, and this time I'm not bothering with licensing requirements. Like SQLite, wftk 2.0 will be in the public domain. I don't really care if I get credit or not for every little thing, because frankly, anybody who counts will figure it out. And have you noticed how everything these days uses SQLite? It's because -- well, primarily because it works, but also because you don't have to worry about legal repercussions of using the code.

That's where wftk document management should be, where wftk workflow should be. Simple, easy to use, and ubiquitous.

2007-10-07 chatbot nlp

Something I've wanted to do for a couple of years now is a sort of online chatbot framework thing. In other words, this would be a testbed for different language analysis techniques that could be played with online and tested against real people.

An extension would be to connect a given chatbot to some other chatbot out there somewhere, and see them talk to each other. That could be fun.

The basic framework for this kind of venture could be pretty simple, but could, of course, end up arbitrarily complex. You'd need some kind of principled semantic framework, starting as a simple box with words in it and ramifying through increasingly sophisticated syntactic and semantic analyses. The idea is to have a framework which can support everything from simplistic single-word pattern matching to select a response, through sentence frames that extract subject patterns to be manipulated in the response, right up to a hypothetical Turing-complete NLP parser.

The session would contain a list of facts and "stuff" which corresponds to the, I dunno, dialog memory of a conversation. There could optionally be some kind of database memory of earlier conversations with a given contact. Again, this would run the gamut from simple named strings to be substituted into a response pattern, to complete semantic structures of unknown nature, which would be used to generate more sophisticated conversation.

Then the third and final component would be the definition of a chatbot itself. This would consist of a set of responses to given situations (a situation being the current string incoming plus whatever semantic structure has been accumulated during the course of the conversation.) There could be a spontaneity "response", i.e. something new said after some period of time without an answer from the other. Again -- it should be possible to start small and stupid, with simple word patterns, random-response lists, and the like, and build upwards to more complicated semantics.
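
The start-small-and-stupid end of that spectrum fits in a dozen lines. A hedged sketch (all names invented): a chatbot is a list of keyword patterns with response lists, plus a session dict for dialog memory:

```python
import random

# Minimal instance of the framework sketched above: single-word
# pattern matching with random-response lists and a session that
# accumulates dialog memory. Everything here is illustrative.
class ChatBot:
    def __init__(self, rules, fallback="Tell me more."):
        self.rules = rules      # list of (keyword, [responses])
        self.fallback = fallback
        self.session = {}       # dialog memory of the conversation

    def respond(self, line, rng=random):
        self.session.setdefault("lines", []).append(line)
        for keyword, responses in self.rules:
            if keyword in line.lower():
                return rng.choice(responses)
        return self.fallback
```

The point of the framework is that `respond` could later be swapped for sentence frames or a full parser without disturbing the session or bot-definition machinery around it.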

The ability to detect and switch languages would be of great use, of course, and there should be some kind of facility for that as well.

Wouldn't it be nice to be able to build a chatbot for language practice in, say, Klingon or Láadan? I mean, how else could you reasonably practice a constructed language?

Anyway, when I have time, I'll certainly be doing something with this idea. Any year now, yessir, any year.

This week, for the second week in a row, Boing Boing features a University of Arizona initiative to "identify people online by their writing style". Homeland Security is of course all over this whiz-bang tech like ants on honey, because... well, I started a comment on the program, only to realize this would better be a blog post.

1. I'll accept that it's possible to come up with some "similarity metric" that says "A is 99% similar to A' but only 32% similar to B", in the sense that we have such-and-such a probability that a given text was written by a certain person. (So we end up with "similarity islands" of texts in the metric space, and we call each of those islands a writer.)

But that means that for any text we have only a certain (finite, less-than-certain) probability that it was actually written by A. So let's get entirely wild and assume some government researcher with more money than brains, working alone in a highly technical and difficult field, somehow writes an algorithm as good as, say, Google's algorithm for determining the topic of a page -- an inherently easier problem.

The result? We will be able to find terrorists online as well as Google can avoid giving us crap search results. And forgive me for saying this, but nobody in their right minds would arrest someone based on a Google result.

OK, #2. We are categorizing writers and potentially calling them enemies of the state based on WHAT THEY WRITE. Now, I know that not all the readers of this blog are Americans, but here in America, we have something called the Constitution which means that it is not a crime to write things.

Ah, hell, all snark aside, even if this works, it's still misguided for patently obvious Constitutional reasons. And it's not going to work, not the way they think, because --

3. What this all boils down to is this. Politicians and technocrats think that the world is divided into two groups of people: "our" people, who do what we say and pay us taxes so we can buy nice houses in Virginia, and "those" people, who rouse the rabble and put our salaries in jeopardy. "Those" people, this year, we call "terrorists". Earlier they were "communists", or "labor organizers", or "civil rights activists", or whatever -- the main thing to remember is that everything is stable unless troublemakers stir things up.

And we used to be able to know who those people were, because they looked funny. But on the Internet, nobody knows you're a dog -- and so there is a perceived need, whether it's possible or not, for a technology to identify dogs. Or "terrorists" -- but what they really want is to be able to draw a line down the middle of the Internet between safe people and troublemakers.

And what that means is that the freewheeling exchange of ideas -- OK, and 90% crap -- which is the Internet? It's gone sufficiently mainstream that these people regard it as a threat, exactly like certain neighborhoods, or certain movements, have been in the past. It's too free for comfort, and it's too well-known to be ignored.

I don't know how it'll play out. But this story really exposes a seamy underside of our society. It's depressing.

And then there is the notion of spoofing such an algorithm, as researched by none other than Microsoft, at Obfuscating Document Stylometry to Preserve Author Anonymity. This may well be illegal research at some point in a dystopian future...

2007-09-12 politics

I just found myself posting this in comments on another site, and it expresses my feelings pretty damned well. This is a technical blog, but it's my blog, and in this one instance, I feel justified about a political post. And it's not like anybody reads this anyway -- so assuming you even exist, you have no reason to complain.

Thus runs the missive:

Bah. I don't like to rehash 9/11 because the second thing I thought, walking into the IU Union and seeing the smoking tower on the TV, was "Reichstag", and in this one thing, Mr. Bush has not failed me.

The first thing was -- holy shit THERE's something you don't see every day.

You know what? The actual specific damage there was nothing to the United States. Compare it to the damage Hitler did in Europe, then come back and tell me it was at all significant. The only damage it did was in America's collective head -- it was a bee sting, and the last six years have been anaphylaxis. Not sure yet whether it was fatal shock, but the patient still doesn't look good.

Now these same people, after 9/11-ing for years to justify stupidity and blood in Iraq, telling us a tinpot third-world embargoed communist was as dangerous or more dangerous to our mighty nation than the heavily industrialized Germany we faced down and beat while fighting on another front entirely -- against ninjas for cryin out loud -- those people are now telling us we're in an existential fight with Iran over the fate of Western Civilization. They've been at war with us for thirty years, and it's been an existential threat, but we just ... haven't noticed? Does anybody but me see how fricking stupid that sounds?

I spent a lot of time and energy being liberal and anti-war between 2001 and 2004-ish, killed my business thinking about politics instead of noticing the recession, ran up a shitload of unpaid back taxes and debt while killing my business, and every day I watched a pack of deadbeats getting richer and richer off America's pain, all because of America's Reichstag and the willingness of the American people to treat international politics like a horror movie. And that fucking monkey, excuse my freedom, can't even wipe the smirk off his face, even now.

I just don't care any more. America will recover from its panic attack, or it won't. It doesn't matter what I do or say. And that's what 9/11 means to me.

So I actually did something with the Toon-o-Matic for a change. It was fun! Didn't finish what I wanted to do, but at least there was an actual new kind-of-episode for the first time since 2006.

Maybe next year I can do another.

OCR in Python. There isn't any, to speak of. While there do exist a few open-source OCR projects (Conjecture seems to have a great deal of promise!), none of them play well with Python. I may want to rectify that at some point.

Anyway, growing bored of simply writing AutoHotKey scripts to play Tower Defense, I quickly realized that I really needed a tool to start the game for me, and track the score and other stats for later analysis.

The first part was no big deal; I whipped out a PyPop applet that could launch a URL. Since I wanted a window that was sized according to the Flash object, that required putting together a local HTML file that could run some JavaScript to pop up the window I wanted, then close itself afterwards. I'll document that when I have time (I propose the abbreviation wIht to save my time typing that phrase.)

Well. That was fun, and it worked, but I really wanted something to monitor the score for me, and timing, and ... stuff. Which meant that I would have to read the actual graphical screen, because there's no handy-dandy textual output on that Flash app.

You'd think that would be trivial in 2007. But you'd be wrong.

Getting a snapshot of the screen was easy enough. I wrapped win32gui to get a window handle by the title (I'll document this later: Idtl), then installed PIL to grab the actual graphical data and manipulate it. To warm up with that, I set a timer to grab a four-pixel chunk of the screen so I could see whether the Flash had started or not (Tower Defense goes through an ad screen, then a splash screen, and only then does the game start.) That took a little putzing around, but the result was gratifying: my little utility could tell me when the game was ready to play. As long as the window was on my primary monitor, anyway (turns out PIL is not good with multiple monitors -- who knew?) So it turned out I had to move the window before all that.

And then I could grab the sections of the screen with the numbers on them for score, bonus, lives, timer, and money... And then it all screeched to a halt, because there are no open-source Python OCR libraries. At all. And clearly I don't have the time to adapt something -- hell, I don't even have time to do all this. I don't even have time to write this blog entry.

So of course I did the natural thing. I wrote my own special-purpose OCR, because clearly that would be saving time. I saved four hours before I started falling asleep, and it still can't tell 8 from 0 (but it does a fine job on the rest of the digits.) It was a lot of fun, actually. Idtl.
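
The special-purpose approach is easy to illustrate. Here's a toy nearest-template digit recognizer; the 3x5 binary grids are made-up stand-ins for pixel regions grabbed via PIL, not the real game's glyphs. Note how close 0 and 8 are in a representation like this -- a single pixel:

```python
# Toy special-purpose OCR by template matching. Each "glyph" is a
# tuple of rows of "0"/"1" characters; real input would be thresholded
# pixel data from a PIL screen grab.
TEMPLATES = {
    "0": ("111", "101", "101", "101", "111"),
    "8": ("111", "101", "111", "101", "111"),
    "1": ("010", "110", "010", "010", "111"),
}

def recognize(glyph):
    """Return the digit whose template differs from the glyph in the
    fewest pixels (nearest-template classification)."""
    def distance(a, b):
        return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return min(TEMPLATES, key=lambda d: distance(glyph, TEMPLATES[d]))
```

With noisy glyphs, a one-pixel error is enough to flip 8 into 0, which is exactly the failure mode described above.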

So. Proposal. It would be nice to work some with Conjecture, and produce the following: (1) a Python binding that can work in memory with PIL bitmaps, (2) a Web submitter for test graphics, and (3) an online tester and test database reporter. That would be really cool. Of course, only wIht.

So it turns out that AutoHotKey can indeed play Tower Defense; I recorded a script to set up my favorite opening last night. It occurs to me that it would be quite easy to set up a simple "strategy-description language" which would be a series of commands to be executed in order. Translating that higher-level language into an AHK script would be trivial.
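To make that concrete, the "strategy-description language" could be as dumb as a list of build/upgrade commands at grid coordinates, compiled down to AHK clicks. The grid origin, cell size, and upgrade hotkey below are all invented numbers; the real geometry would come from the game window.

```python
# Sketch of compiling a strategy list into an AutoHotKey script.
# ORIGIN and CELL are hypothetical -- stand-ins for the real board
# geometry -- and "u" as the upgrade hotkey is an assumption.

ORIGIN_X, ORIGIN_Y = 100, 100   # assumed top-left of the grid, in pixels
CELL = 25                       # assumed cell size

def compile_strategy(strategy):
    lines = []
    for cmd, x, y in strategy:
        px, py = ORIGIN_X + x * CELL, ORIGIN_Y + y * CELL
        if cmd == "build":
            lines.append(f"Click {px}, {py}")   # place a tower
        elif cmd == "upgrade":
            lines.append(f"Click {px}, {py}")   # select the tower
            lines.append("Send u")              # assumed upgrade hotkey
    return "\n".join(lines)

script = compile_strategy([("build", 2, 3), ("upgrade", 2, 3)])
```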

So at least it would be possible to track strategies and compare them. That's step one in an evolutionary approach.

Couple that with a Python front-end to make it all more convenient, and we're off to the races!

So I have gotten sucked into this cool game, Tower Defense -- apparently there's a whole genre, but this is the one I've been playing with the past couple of days.

It's a Flash game, and the premise is simple. You have a grid with two incoming gates and two outgoing gates. Enemies come in the incoming gates, and you have to kill them before they get to the outgoing gates. You lose one health point for every enemy that gets through. You have 20 health points and no way to get more.

So -- the point of the game is to build defense towers which run automatically. Each enemy you kill nets you a small amount of money, which you can use to build new towers, or upgrade existing ones. (You need to upgrade to keep up with the steadily increasing toughness of the enemies, of course.)

It's fun. I thought of a fun way to do some really cool programming with it. AutoHotKey can find the Flash window, and can scan the screen for whatever you need. It can click things, too. So my notion is to write an AHK front-end to play the game, with an IP connection to a strategy server. The strategy server would then use evolutionary programming to come up with board arrangements, and over time (perhaps a great deal of time, who knows?) you could watch the strategizer get really good at Tower Defense.

Or not.

But it would be fun to try it.

My notion is that the instructions for a given strategy would consist of a series of builds at given coordinates, intermixed with upgrades to the towers built. Each instruction would only execute after there was enough money available for it. So you'd have a string of instructions which would express a given board setup.

To combine strategies, you'd just ... well, I guess you'd just cross the streams, so to speak, taking two (or more?) successful strategies, cut them at a random spot, and splice them together. You might want to have some cleanup rules (you can't upgrade squirter #3 if you didn't build three squirters, for instance.)
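Sketched out, the splice-plus-cleanup idea looks something like this (instruction format invented for illustration: each instruction is a kind plus a tower number):

```python
# Crossing the streams: cut two parent strategies at a random point,
# splice, then apply the cleanup rule ("you can't upgrade squirter #3
# if you didn't build three squirters").

import random

def cleanup(strategy):
    """Drop upgrades that refer to towers never built earlier in the string."""
    built = set()
    result = []
    for instr in strategy:
        kind, tower_id = instr
        if kind == "build":
            built.add(tower_id)
            result.append(instr)
        elif kind == "upgrade" and tower_id in built:
            result.append(instr)
        # upgrades of unbuilt towers are silently dropped
    return result

def crossover(a, b, rng=random):
    cut = rng.randrange(1, min(len(a), len(b)))
    child = a[:cut] + b[cut:]
    return cleanup(child)
```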

And then you'd let it run for a few weeks. You could get other people to download your script and run strategies, too.

It could be very, very cool...

At least since last Wednesday night, has been dead in the water. Given that Russia promised the United States to "make it illegal" for to continue operations by June 1 of this year, I'm wondering if that's the situation.

Possibly it's technical issues (they've been known to happen in the past) or possibly somebody has been too pre-emptive in their removal, but still -- it's been five days now. And nobody in the world seems to have noticed. Except me. World exclusive? Am I the next tech-Drudge? Only time will tell.


Anyway, I even got uber-paranoid and tried traceroutes from countries other than the United States (hey, it could have been the case!), but the IP is dead from Austria and even Belarus. I figured Belarus is close enough that there would be no profit in trying the rest of the alphabet.

Watch this space for further details. Assuming I discover any. And remember poor in your prayers.

I'm not dead! Just really busy.

So I've been wanting to put up some information about organic produce, the digestive system, bioplastics, and so on, but (as usual) don't have the time. And then today, xkcd comes up with this. Ha! Definitely one of the best toons out there.

More later. Really. In the meantime, if you want meaningless, content-free remixes of my pearls of wisdom, I refer you to the Markov version of my blog, my Ulysses in the Caribee. It's different, I'll give it that.

2007-05-27 adsense

I'm sure this will surprise no-one, but Google hasn't responded to my appeal as of today.

So I repeat: don't get into a position where you depend on Google for revenue. If you think ad revenue is a good business model, consider an ad partner who will avoid cutting you off for no apparent reason, and who will respond to questions if they do decide to take some unilateral actions. Google's too big and obviously overwhelmed with abuse -- but make no mistake; Google won't suffer if somebody does something fishy on your site. You will. (Or at least, I can only assume that's what happened to me -- since Google doesn't have the ability to tell me.)

2007-05-18 adsense

In an entirely unexpected development, Google has decided to terminate my Adsense account. Not sure why, but my God! Whatever shall I do without the scads of money rolling in from my vast Internet empire?

Seriously, given that I wrote most of the content on this site years ago, it wasn't bad getting a few bucks for free every month, but it was strictly recreational book-buying money. Nothing to write home about. But tonight I am profoundly glad that I didn't make the time to bring in ad-revenue traffic. If I were depending on Google for my livelihood, I would be in a world of hurt with absolutely no recourse whatsoever.

Here's what their email said:

Hello Michael Roberts,

It has come to our attention that invalid clicks and/or impressions have been generated on the Google ads on your site(s). We have therefore disabled your Google AdSense account. Please understand that this was a necessary step to protect the interests of AdWords advertisers.

As you may know, a publisher's site may not have invalid clicks or impressions on any ad(s), including but not limited to clicks and/or impressions generated by:

- a publisher on his own web pages
- a publisher encouraging others to click on his ads
- automated clicking or surfing programs, or any other deceptive software
- a publisher altering any portion of the ad code or changing the layout, behavior, targeting, or delivery of ads for any reason

These or any other such activities that violate the Google AdSense Terms and Conditions and program policies may have led us to disable your account. The Terms and Conditions and program policies can be viewed at:

If you have any questions about invalid activity or the action taken on your account, please do not reply to this email. You can find more information by visiting


The Google AdSense Team

Wow. I'm underwhelmed -- this gives me absolutely no information whatsoever as to what might have happened. Now, since my daughter has been in the hospital this week, I really haven't been online much at all. So whatever did happen, I know it wasn't me.

So I thought, OK, I'll appeal it, what the hey? Here's what the appeal form looks like (bolding is mine):

As you know, Google treats instances of invalid clicks very seriously. By disabling your account, we feel that we have taken the necessary measures to ensure that invalid clicks will not continue to occur on your site. Due to the proprietary nature of our monitoring system, we're not able to disclose any specific details of these clicks.

Publishers disabled for invalid click activity are not allowed further participation in Google AdSense. However, if you can maintain in good faith that the invalid clicks we detected on your ads were not due to your actions or negligence, or the actions or negligence of others working for you, you may appeal the closing of your account.

Google reserves sole discretion in considering whether to take any action on an appeal.

In order to appeal the disabling of your account, please supply us with the details requested below. We're unable to consider appeals that do not contain all of this information:

Company's name (If applicable):

AdSense Login (Email Address):

Publisher ID:

located in the AdSense code on your website with the format, pub-################

Date Account was disabled:

Who are the intended users of your site?:

What is the source of your site's content?:

How often do you update your site?:

How do users get to your site? (How do you promote your site?):

How many people are involved with the administration of the site? :

Any relevant information that you believe would explain the invalid click activity we detected

Any data in your weblogs or reports that indicate suspicious IP addresses, referrers, or requests:

And then the button for the appeal form: Submit. Yeah. Never has that default text seemed so appropriate. But what sort of floors me is that the entire exchange is like this:

Google: We're cutting you off. Have a nice day!
Me: Wh-- what? Who? Why would you do that?
Google: We don't know. Do you have any relevant information which could explain our unilateral and arbitrary action?
Me: Yeah. I got your relevant information right here!

A deeply disappointing experience.

2007-05-13 spam forumspam

Now that I've been collecting spam from actual fora for a little while, I have some initial statistics and musings.

I've collected spam from one eBoard 4.0 forum since May 5; it is now May 13. The spam filters I'm using are blocking about 93% of the postings, making the moderation burden manageable for that forum. In those 8 days I have collected 1,235 spam samples. That's 150 spams a day, from a fairly obscure forum; in retrospect, even though the actual log activity seems low, this is a lot of spam.

Those 1,235 spam samples contain a total of 10,795 links. I haven't yet built analysis machinery to get much farther than that; I've mostly been just looking at the links, retrieving the pages, and musing about how all that might be automated in an interesting and useful way.

Some tidbits:

Some of the spam links point to actual sites being advertised. I don't yet have a feel for how many links point to sites other than those actually advertised, but there are some interesting commonalities. For instance, there are a lot of pages placed onto vulnerable fora and other venues which simply link to other pages. In some cases, it's easy to tell why: it's Google spamming, and simply a way to counter attempts to block posts which link to particular URLs.

I have a separate notion to find and track those vulnerable sites, and to attempt to mine them for further information on these spam networks.

Bugzilla, oddly enough, seems to have such a vulnerability. (Can you call this a vulnerability?) There are links to pages stored as attachments to bug reports. Those attachments are (naturally enough) not subject to any content restrictions. Unfortunately, that means you can put any Javascript into them at all.

I haven't yet found actual malicious Javascript being spammed to fora. What I have found is obscured Javascript which modifies document.location to force a page forward to another site. I consider that semimalicious, and my initial goal is to find a way to detect it with some sort of automatic analysis, and block posts solely on the basis of links to that sort of page.
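A first stab at that detector might just undo the most common obscuring tricks (percent-escapes, string splitting) and then grep for the forced redirect. Real obfuscation gets much fancier than this, so take it as a sketch of the idea only:

```python
# Crude "semimalicious" detector: de-obscure, then look for code that
# assigns to document.location. This catches only the simplest tricks
# -- percent-escaping and '"+"' string splits -- by design.

import re
from urllib.parse import unquote

def looks_semimalicious(js_source):
    decoded = unquote(js_source)                              # undo %xx escaping
    decoded = decoded.replace('"+"', "").replace("'+'", "")   # undo string splits
    return re.search(r"document\.location\s*=", decoded) is not None
```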

I figure it's only a matter of time, though, before I find some actual malicious Javascript which will attempt to rootkit my machine with keyboard loggers to steal my bank accounts. That's pretty cool, actually, so I'm watching the spam traps with bated breath.

One spam has a huge number of links to different domains, all of which resolve to the same IP. That's an interesting feature. I'm not sure how to track it yet. What I really want to do is some kind of generic analysis framework, but I don't have a good picture of what that framework would look like, or indeed precisely what it is that I expect it to do.

It seems that what I want to do is to build a kind of task list for an incoming event. That task list would consist of a certain (small) number of analysis steps which themselves generate new analysis events. Each step is a test. The results of the tests are cached, so that all possible duplicated effort is avoided, but also so that relationships such as "these spam efforts share an IP" can be found.
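Here's roughly the shape I have in mind, with stand-in tests (the real ones would be DNS lookups, page retrievals, and so on). The point is the cache keyed on (test, subject): shared infrastructure only gets analyzed once, and the cache doubles as the place where "these spams share an IP" relationships fall out.

```python
# Sketch of the event-driven analysis framework. The two tests here
# are stand-ins; "resolve" fakes a DNS lookup with a fixed address.

def resolve_host(host):
    # stand-in for a real DNS lookup
    return "10.0.0.1"

TESTS = {
    "extract_host": lambda url: url.split("/")[2],
    "resolve": resolve_host,
}

def analyze(events, cache=None):
    cache = {} if cache is None else cache
    queue = list(events)
    while queue:
        test, subject = queue.pop(0)
        key = (test, subject)
        if key in cache:
            continue                        # duplicated effort avoided
        cache[key] = TESTS[test](subject)
        if test == "extract_host":          # each step can generate new events
            queue.append(("resolve", cache[key]))
    return cache
```

Two spams linking to the same host thus cost one resolution, not two, and the shared entry in the cache is itself the "these share a host" relationship.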

There's a certain exponential explosion involved, it seems at times. But there are also patterns which could cut down on the amount of work done. Of those 10,795 links I have so far (oops, in the time it's taken to write this much, two more spams have arrived, so I now have 10,886 links to analyze) -- of those 10,886 links, many are hosted at a single domain -- 2,804 of them, as a matter of fact. It will be very interesting to analyze the spam pattern there, by the way. Are all of these from the same spammer? Same IP? (Bet not.) But more germane to the point I was making, eliminating those URLs from separate analysis will cut out about a quarter of the analysis effort.

Well, anyway, this is just a little talking out loud while I muse about how to automate all this analysis. Eventually I'll get down to posting graphs of some sort. That will be fun. The other thing, of course, is some way to ask about a URL, "Is this URL a spam indicator?" I hope it will also cross-fertilize with Wish me luck.

Three related things today. First, the scripts I've been putting together for forum spam blocking have kind of coalesced into a "modbot". This program attempts to automate the tasks performed by human moderators and could technically be placed into any Web spam moderation situation. It is currently running happily in an Eboard 4.0 installation and blocking roughly 93% of spam, while still allowing anonymous posting to that forum. I'll be packaging it up for distribution in the public domain. Watch this space for further details -- one of the more fascinating notions I've had is to enable it to receive moderation emails from Blogger and thus automate the comment moderation process there.

One of the rules/tools used by the modbot is to count Google hits for the numeric IP of an untrusted poster. Turns out that HTTP proxies have a real proclivity for getting indexed. A lot. Legitimate IPs, not so much. I wrote a little online tool to call Google to get these counts; the tool is here and the write-up of the code is here. It's currently blocking about 40% of spam (I don't have good statistics analysis in place yet, so that's very approximate.)

Finally, as a spinoff of this project, I've started a spam archive. There's nothing to present yet, but I hope to start doing some interesting analysis, and most specifically a searchable database -- along with a searchable database of spamvertised sites. That ought to overlap with the sites spamvertised by email spam as well, and that's going to be an interesting thing to look at. We'll see.

I've stumbled onto a spam link network of staggering extent in the course of examining forum spam. A spammer has a site somewhere, and then spamvertises it. But then some of the spam starts to link to other forum spam, which in turn links to the site. Some sites auto-forward to other sites using obscured Javascript (I haven't figured out just why, yet; if you have a rationale, I'd be happy to hear it.) Anyway, after that goes on for a while, there's a huge resulting network of vulnerable fora linking to other vulnerable fora. There is a true treasure trove of information available to the interested party. Which would, of course, be me. I will definitely be following up on that and posting on it.

Anyway, it's been nice talking to you. Back to work!

2007-04-25 xrumer

In the predictability department, one of my forum spam traps just pulled in an interesting post: yeah, it was posted (presumably) by XRumer and certainly fits the profile -- but it's advertising a crack of XRumer.


"Greate new XRumer4.0 platinum edition and crack DOWNLAUD".

I wondered how long that cash cow would last -- looks like about, what, November to April? Actually, it took longer than I expected.

In case you're wondering whether this is a good idea -- well, if you already think spamming is a valid business technique, then: sure, go ahead. Download a crack from Russians and give them control of your machine.

In related news, I have doubled the number of forum sites I am despamming. (If you're paying attention, that means, yes, I now have one that isn't my own site.) And I decided to try a notion that's really paid off in spades.

See, XRumer uses a vast database of known HTTP relays to post spam. This makes it much more difficult for human admins to block by IP -- since a single spammer may have hundreds of IPs available, how can you block?

Well -- unintended consequence time! Thanks to the explosion in use of these proxies, we now have a reliable way to find them out without human intervention at all. Count the number of times Google indexes an IP, and you have an incredibly effective way to determine whether it is on the list of known proxies used by spammers. Granted, you have the lag between the time it becomes a proxy and when Google starts indexing the references to it on forum posts around the world. But this one test for spam blocks about 60% or more of forum spam, sight unseen.
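The whole test fits in a few lines. Here's a Python sketch with the actual Google query stubbed out (the count_hits argument) and a cache so repeated queries don't pound anybody; the threshold is a guess, and the real cutoff would come from watching the statistics.

```python
# Sketch of the googlecount proxy test. count_hits is whatever does
# the real Google query; the threshold is an assumed value.

PROXY_THRESHOLD = 500   # assumed: legit IPs rarely rack up this many hits

def is_probable_proxy(ip, count_hits, cache):
    if ip not in cache:
        cache[ip] = count_hits(ip)   # one query per IP, ever
    return cache[ip] >= PROXY_THRESHOLD
```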

It won't last. But then again, neither will XRumer, not in its present form.

Just to help you out, I've provided a simple Google hit counter: go here and type in any phrase, not just an IP address, to see how many references to the phrase Google has indexed. When I've got a little more timeframe behind it, I'll even put in autorepeating queries of the good ones, with gnuplot graphs to show googlecount over time.

And of course, I'll be putting the code up; it's about ten lines of Perl -- the only reason it's that long is that it caches results in a database so repeated queries don't pound Google. Not that Google can't stand the pounding, but I don't really want a bunch of Perl script threads hanging around waiting on Net latency.

So, a common refrain lately: more later.

2007-04-06 xrumer

Second post in a day... Turns out that the WaPo posted on XRumer back in January. The article is here, with comments. Note that the comments are, except for four, all by Russian spammers. Who are tagging the Washington Post with high-fives because they've caught the attention of the mainstream.

If that doesn't blow your pretty little mind, I'm not sure what will. I love this century!

So again: I'll help you block XRumer if you want. Just drop me a line and we'll talk. This ought to be fun.

2007-04-06 xrumer spam

So hey, kids, I'm still alive, and now posting from the lovely Caribbean island of Puerto Rico for the foreseeable future.

After the move, and after some confusion on the part of the cable company involving losing my order, I have blessed, blessed broadband again, without having to cadge the neighbors' WiFi from the rooftop terrace, which would be a great place to work were it not for the tropical proximity of a horrible huge ball of blazing nuclear explosion hanging over my head, plus the necessity of placing the laptop in a precarious position on the railing, four floors above concrete, to get good signal.

But now things are good again, and I have 9000 emails to go through (yes, as a guy with a spam filter, I should probably be filtering my spam, but, well, it's a long story and look, shiny thing!). And lo! within those 9000 mails were two from hapless forum operators who are getting fed up with manual despamming.

So sure, I'll be seeing what I can do in that regard, but it piqued my interest in forum spam again. And so I checked my logs for instances of XRumer, and wow -- somebody actually linked my XRumer blog keyword in response to ... a new instance of the XRumer forum bomb. Dated April 5, as it so happens. This one contains the novel text "Also, do you know when XRumer 4.0 Platinum Edition will be released?" and it's posted by AlexMrly. Google either the phrase or the name, and you'll see a whole lot of forum spam. Hey, XRumer guys -- thanks! What we all want is more forum spam!

Now I have that off my chest. I'm going to reiterate my offer to anybody listening -- I'm going to see what I can do to combat forum spam around the world, and I'm not charging anything for it. So far, I'm just in it for the interest, just like email spam in 1999. Get in touch. I'll be here. Well -- I might actually be at the beach. But I'll be back soon.

Sorry that this post isn't really all that programming-oriented. I hope to be making that right, in the next couple of days. Blocking XRumer is fun, and so easy even a child could do it! No, seriously: if you want to help me stop XRumer, all I need is your data.

Just so I don't forget how this blog thang works, I want to assure each and every one of you bots who reads this blog that I've been doing lots of cool programming-type stuff... Well, OK, actually, the family and I went sledding every day for a week and a half while the weather was right, and I've been scrambling to finish up a hueueueuge job that I was neglecting during that time. So ... no excuse at all.

On the translation front, I've discovered that hotstrings (e.g. Word's AutoCorrect feature) can really speed up my typing in cases where I'm doing repetitive texts. Even the time savings of typing "s" instead of "SAP" can really add up over time. But Word's AutoCorrect has a problem -- it saves different lists for each style, and in some instances it triggers without waiting for me to finish the word; if one of my hotstrings is a prefix of another word, that can truly suck. Some Googling got me to AutoHotKey -- and AutoHotKey truly and totally rocks. A lot of what I wanted to do with PyPop is already done and ready for me to use, actually. So I'm going to start bundling AHK with PyPop for Windows systems. It's that good.

Another nice discovery this week has been SQLite -- which is not just open-source, it's actually public domain. It implements pretty much all of the SQL92 standard (except the permission model) for a lightweight local database for single-user use. Websites have been built on it with impressive performance. And the key is -- you can bundle it into anything. Anything. So it's definitely going into the wftk. Man. I'd been kind of moving towards building my own SQL parser and so on -- what's the point? It's already been done! And beautifully!

So there's life in me yet, never fear.

On that note, it's back to work for me.

For many moons, I've had this crazy idea of a generic file parser floating around in my head. (The idea, not the parser.) This would function a lot like a hex editor, except that it would operate on a semantic level: if an extent in the file was known, meaning that its purpose was known or at least guessed at to the point where it could be named, then that information would be marked in a file description.

An example of this would be a malware analyzer. In case you haven't seen the term before, "malware" is software that is out to do you harm. Viruses, worms, and stuff like that. A popular source of malware is executables attached to email in such a way that Outlook will execute it without asking you. Yes, this still happens. For lots of details of this kind of exploit, see the Internet Storm Center's blog. Hours of fun reading there. No, seriously! Malware is fun!

But as any reader of this humble blog knows (both of you), my time for fun is strictly limited, and my patience wears thin very quickly. So I never actually analyze any malware, because to do so I'd probably have to find a piece of paper and note stuff down. Hence the need for software to do it for me: if I had something that defined the sections of an EXE file under Windows, for instance, then I'd run that against the malware, and I'd at least break everything down into conveniently readable chunks -- I'd eliminate the EXE header, split out the resources, that kind of thing.

This, then, is a generic file parser. It allows me to interactively define a file structure for a given file (or class of files) and read useful data out.

A more proximate reason to do this, lately, has been that I have a need to use a glossary file which has resisted import into MultiTerm, my glossary software of choice. I could open a hex editor and see the terms, but I couldn't do anything useful with them.

Well, yesterday I spent the whole day on it, but I have a prototype of said file parser. Using it, I can define hexblocks, sequences, lists, records, switchable sections based on flags, variable-length blocks based on length specifications in the file -- all that works like a charm, and I get a nice, readable dump file for my trouble. As I refine the file description, I get more readable dumps. And then I can write a Perl script to scan the dump and pull out whatever I like.
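A scaled-down illustration of the idea: a file description is a list of named fields, and parsing one against raw bytes yields a readable dump. The real tool also handles lists, switches, and flag-dependent sections; this sketch shows just fixed fields and a length-prefixed block, with field names invented for the example.

```python
# Toy version of the generic file parser: walk a description, consume
# bytes, return named values. Only two field kinds are shown.

import struct

def parse(description, data):
    offset, out = 0, {}
    for name, kind in description:
        if kind == "u16":                        # little-endian 16-bit int
            (out[name],) = struct.unpack_from("<H", data, offset)
            offset += 2
        elif kind == "lenblock":                 # u16 length, then that many bytes
            (n,) = struct.unpack_from("<H", data, offset)
            out[name] = data[offset + 2: offset + 2 + n]
            offset += 2 + n
    return out

desc = [("magic", "u16"), ("term", "lenblock")]
dump = parse(desc, b"\x4d\x5a\x05\x00hello")
```

Refining the description refines the dump, which is exactly the interactive loop described above.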

It is so extremely useful. Unfortunately, I only slept about three hours last night, since I stayed up until 3AM coding and didn't actually do the paying work I should have been doing instead... So posting this project will have to wait for another day -- but once it is posted, wouldn't it be groovy to have an interactive online file parsing tool for, say, malware snagged off the wild Net? That would be fun!

So: more later.

2007-02-07 art

You want to see something nice?

Go watch this.

The Machine is us. That's exactly what I've been trying to express for years. And this expression of that insight is quite artistic as well. Which just reflects another pillar of my philosophy -- beauty counts.

This is something I've wanted to do for a couple of weeks now -- I have a handy set of scripts to filter out chaff from my hit logs, and to grep them out to convenient category files (like "all interesting non-bot traffic to the blog"). So I've written a script to take all that blog traffic and determine which tag it should be attributed to. Hits to individual pages boost the traffic to all their tags.

The resulting tag cloud is on the keyword tag cloud page next to the cloud weighted by posts. This is a really meaningful way to analyze blog traffic and get a feel for what people are actually finding interesting. A possible refinement might be to time-weight the hits so that more recent hits count for more weight (that would be pretty easy to do, actually -- even so cheesily as to count number of hits and multiply all the counts by 90% for every ten hits or something.)
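That cheesy decay scheme, spelled out just to convince myself it works (in Python here, purely for illustration): walk the hit log oldest-first, and every ten hits multiply every tag's count by 90%, so recent hits dominate the cloud.

```python
# Time-weighted tag counts: decay all counts by 10% every ten hits.

def weighted_cloud(hits):
    """hits: iterable of tag names, oldest first."""
    counts = {}
    for i, tag in enumerate(hits):
        if i and i % 10 == 0:
            counts = {t: c * 0.9 for t, c in counts.items()}
        counts[tag] = counts.get(tag, 0) + 1
    return counts
```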

The Perl code to read the logs and build the cloud file is below the fold.

2007-02-05 spam xrumer

When I initially posted the XRUMER and you post, I thought that XRUMER probably used the text I posted (which I had found on a forum I frequent) to identify spammable fora -- those for which moderation is not performed.

Later, I came across the theory that this post was in fact some pretty clever viral marketing. By pretending to ask the forum's members about XRUMER, the XRUMER marketer could induce at least some people to search on it and link it, causing Google to rate it highly without actually themselves spamming. Neat.

But for whatever reason, my post caused Google to rate me third on searches on the term XRUMER -- and instead of XRUMER, I'm seeing a lot of traffic from people obviously interested in stopping it.

As am I.

But I don't have access to a forum affected by XRUMER (or at least, I can't tell for sure that I do.) My own Toonbots forum is an extremely low-traffic venue running on antiquated WebBBS code. I get spam there, and this week managed to block it all (so far), but my problem is decidedly minor.

I can only assume that if you're reading this, you have a major forum spam problem. If this is the case, I need your help. I'd like to try out some ideas about forum despamming -- building on the working concepts in my own low-traffic venue. But to try these ideas out, I'd need access to a forum. Your forum, if you're interested. And that essentially means access to the underlying storage (whether filesystem or database), a way to run Perl on your box, and access to the Web access logs in real time.

Depending on your own traffic patterns, the access logs can provide a great deal of information about whether a post is legitimate or not. Of course, you can also make a lot of valid judgments based on the post content, but I hesitate to block on things like "too many links," as satisfying as that heavy-handed approach may be. Legitimate users can often have legitimate reasons to post lots of links. Granted, they're generally not about Cialis or mortgages or hot xxxxxxx Asian lesbian pr0n, but still -- any interference with your actual users is something you want to avoid at all costs. I regard information about post content to be one factor in a good, well-rounded spam elimination strategy.

Traffic analysis correlated with forum activity can be a powerful tool, and in my own case it's working 100%, with no examination of content at all, but my traffic is so low that I can't judge how complete a strategy it might be. If you add your forum to the mix, I can improve the techniques.

So anyway, all you desperate forum admins with XRUMER problems -- if you want me to give it a shot, drop me a line. I'm working for free and during an initial phase my scripting can simply recommend post deletion instead of making any automated changes itself. Interested? Tell me.

2007-02-04 spam

I just wrote a rather effective spam eliminator for my WebBBS forum at Toonbots, and sort of "live blogged" the process as I went. The result is a rather attractive little document. I feel virtuous again tonight.

Finally! I've been pretty busy with the paying work this last week, and also with biochemistry due to my son's kidney/allergy problems, and so lowly open-source work has suffered.

But the PyPop GUI framework is ready to download in a convenient NSIS installer. Rather than host it, I've put it up on the SourceForge download page for your downloading pleasure.

Once it's installed, download the filetagger app definition and play around with it. It's all still pretty crude, but I'm having fun. Did I mention that this actually involves the on-the-fly generation of a Python class based on the XML application definition, which is then instantiated in the GUI to do the work? That was fun!
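If "class generated on the fly from XML" sounds abstract, the kernel of the trick is just type() plus exec. The XML shape below is invented for illustration -- it is not the actual PyPop app format -- and exec-ing code out of untrusted XML would of course be a terrible idea outside a toy:

```python
# Minimal sketch of building a Python class from an XML definition.
# Element and attribute names are hypothetical, not PyPop's real format.

import xml.etree.ElementTree as ET

APP_XML = """
<app name="filetagger">
  <command name="greet" code="return 'hello, ' + arg" />
</app>
"""

def class_from_xml(xml_text):
    root = ET.fromstring(xml_text)
    methods = {}
    for cmd in root.findall("command"):
        # wrap each command's code in a method body and compile it
        body = compile("def _f(self, arg):\n    " + cmd.get("code"), "<xml>", "exec")
        ns = {}
        exec(body, ns)
        methods[cmd.get("name")] = ns["_f"]
    # type() builds the class on the fly; the GUI then instantiates it
    return type(root.get("name"), (object,), methods)

App = class_from_xml(APP_XML)
app = App()
```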

Anyway, more later. I'm still on week #3 of the app-a-week thing, for, um, the second or third week. Maybe I'll slowly approach an app a week as I get this stuff under control. Wish me luck!

I posted v1.0 of the filetagger in the new PyPop format. The XML definition of the app is 310 lines and about 12K. I think this could end up being quite useful.

The code is here -- I don't have the actual running PyPop up to run it, though. I still want to get registration of file extensions working -- oh, yeah, and what there is of the help system. The help text is included but there's no command to display it yet.

If I end up defining a basic XSLT processor on top of the XMLAPI, this could start to get really interesting...

At long last, I managed to finish development on a first cut of the filetagger application. It took far longer than I really wanted it to, because I spent an inordinate amount of time whipping the wxpywf framework into shape (about a month) and so the whole "app a week" thing is more like "an app per five weeks" or so. Ha.

But you know what? I did it! I actually brought a major new module of the wftk, one I'd been thinking about for three years, to the point where it can be used. Wow.

So I'm glad I took the time to do it the way I wanted to do it.

Here are some of the features of wxpywf I created and used for this app:

  • XML definition of the entire UI of an application, using frames and dialogs. In comparison with the traditional call-by-call technique for setting up a wxPython UI, this is incredibly convenient.
  • Application-specific code grouped into simple commands.
  • Each frame and each dialog automatically binds to an XML record which can be addressed on a field-by-field basis.
  • HTML can be used for more textual interfaces; links generate commands which can have arbitrary effects on the UI (in this case, clicking on a link in the tag cloud switches the tabbed frame to the file list and displays the files with the tag selected.)
  • So far, the UI can include tabsets, list controls, HTML windows, rich text controls, checkboxes, radio button groups and listboxes, command buttons, and static text.

There's a lot of ground still to cover. But in my experience, that kind of ground can be covered in small, manageable steps after initial usability is there. And initial usability is definitely there. I feel really happy about this.

2006-12-31 spam

Over at Toonbots, I have a forum, based on ancient but reliable Perl code. For many years, that forum has been a quiet backwater of the Net where I chat on various topics with those of my friends who enjoy that facet of my personality responsible for the engendering of Toonbots.

But lately, something extremely irritating has happened. The forum has become the target of forum spammers. Their spam rarely even formats correctly, since the forum code is so old and weird. But that doesn't stop three or four of them from posting every day, and I have to delete it all by hand, or relinquish the forum to utter uselessness.

Oddly, the wftk forum is utterly unaffected by all this. Since the trouble started when I started properly indexing the forum archives, I suspect the archives are acting as a Google magnet for various topics. But I'm not sure yet.

The modus operandi of forum spammers differs from that of real posters, according to the logs: typically the forum spammer hits the site for the first time in the forum archives, then posts within a few seconds. Real posters actually read the site first. So I could filter based on that behavior. But I'm going to study the issue for a while, to see if I can detect any other useful patterns. It's a serious problem, and a growing one; email spamming is experiencing diminishing returns now, since fewer people read email thanks to spam. So forum spamming is a logical progression.
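As a rough sketch (the threshold and data layout are invented for illustration), the timing-based filter described above might look like this:

```python
import time

# Assumed threshold: a real reader browses for at least a minute
# before posting; spammers post within seconds of their first hit.
MIN_READ_SECONDS = 60

first_hit = {}  # ip -> timestamp of first request, fed from the access log

def record_hit(ip, ts=None):
    first_hit.setdefault(ip, time.time() if ts is None else ts)

def looks_like_spammer(ip, post_ts):
    seen = first_hit.get(ip)
    if seen is None:
        return True  # posted without ever hitting the site first
    return (post_ts - seen) < MIN_READ_SECONDS

record_hit("10.0.0.1", 1000.0)
print(looks_like_spammer("10.0.0.1", 1005.0))  # True: posted 5s after first hit
print(looks_like_spammer("10.0.0.1", 1200.0))  # False: browsed for 200s first
```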


2006-12-27 despammed spam

Once, many years ago, I did a foolish thing. I wrote a quick little spam filtration forwarder and opened it up to the public. As far as I can tell, I was the first person to have done so.

The year was 1999, and the Boom was in full swing. My plan was simple: (1) write a free online service, (2) ???, and (3) profit!!! As you no doubt can surmise, #2 never really happened, let alone #3. But from 1999 to sometime in 2005, I kept that thing running, through three servers, four household moves, and a growth in userbase from two to a few thousand (if I recall correctly).

Then: the server died. I mean, it died suddenly and irretrievably, and I (never much of a stickler for formalities) had not backed it up. Ever. I'd simply moved it from place to place while fully intending to back it up, and all the old machines had been broken into their constituent materials by that time. As I was going through some serious financial woes, and as my wife and I had found that our son has a kidney disorder, my priorities were clear. Despammed.com was superfluous.

But it nagged at me. Despammed.com lived on in my heart, even though its DNS entries pointed to an IP now occupied by some calendar service thing. (Which was weird.) And then I found a not-too-dusty copy of the HTML. And last month I found relatively good copies of the filtration software. I still don't have the user database or the administration/registration code or the statistics or the filter databases or, really, anything. But what the heck. I put it back online anyway. I'm nearly positive I'm going to live to regret it.

But I still feel a warm holiday glow. I'm giving back to the community again. Merry Christmas to all of you! And if Despammed.com breaks into your house wanting to eat your brains, remember: a headshot is mandatory.

Sunday, I translated Chapter 12 of a book on ABAP programming under SAP, forthcoming in the English edition from SAP Press in March. This was actually the third chapter I'd done in this book, and the others were a little far afield from what I usually do, since I don't actually do SAP work (except from a translation standpoint.)

But this chapter was fascinating, involving the external interactions of the SAP system with other systems. One of the supporting technologies of those interactions is SOAP, which brings up XML, which in turn brings up something I have never particularly paid much attention to: XSLT.

Wow. I didn't know what I was missing! XSLT is essentially a template-based programming language, used to transform XML structures into other XML structures; the idea is to support the transformation of data-centric XML into more presentation-centric XML, but the mechanism is suitably generic.

And you know what? There's no open-source XSLT processor written in C. And this is odd, because even though the XSLT spec, like everything produced by the W3C, is horribly opaque, really, when you get down to it, XSLT is a pretty straightforward little language, and the parser is a no-brainer because it is already expressed in XML. So half the hard work in writing a compiler is already done for you, right there.
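To see why the core of the language is so straightforward, the template-matching dispatch at the heart of XSLT can be caricatured in a few lines (this is Python, not C, and not real XSLT -- just the shape of the mechanism):

```python
import xml.etree.ElementTree as ET

templates = {}

def template(match):
    """Register a function as the template rule for one element name."""
    def register(fn):
        templates[match] = fn
        return fn
    return register

def apply_templates(elem):
    fn = templates.get(elem.tag)
    if fn:
        return fn(elem)
    # built-in rule: no matching template, so recurse into the children
    return "".join(apply_templates(child) for child in elem)

@template("name")
def name_tpl(elem):
    return f"<b>{elem.text}</b>"

doc = ET.fromstring("<person><name>Ada</name></person>")
print(apply_templates(doc))  # <b>Ada</b>
```

Real XSLT matches on patterns rather than bare element names, of course, but the dispatch loop is this simple at its core.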

So here is my latest exciting brainstorm: I want to set up a test-based environment to support the development of an XSLT processor in C. Each test would consist of a starting structure, the template or templates to run on it, and the expected result -- that's easily imagined. And each test also refers to the exact point in the W3C spec that requires it (and vice versa -- I'd have an annotated spec that refers back to the tests.)

The whole thing would then function as an "XSLT by examples" tutorial and also a testbed for a command-line XSLT processor. Wouldn't that be a nice thing for the world to have?
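The test format itself is easy to imagine; here's a hypothetical sketch (the element names and spec reference are invented) with an identity transform plugged in as the trivial "processor". A real harness would need canonical XML comparison rather than naive serialization, but the shape is the same:

```python
import xml.etree.ElementTree as ET

# Hypothetical test format: input document, expected output, spec reference.
TEST = """<test spec="5.8">
  <input><doc><x>1</x></doc></input>
  <expected><doc><x>1</x></doc></expected>
</test>"""

def run_test(test_xml, transform):
    t = ET.fromstring(test_xml)
    source = t.find("input")[0]      # first child of <input>
    expected = t.find("expected")[0]
    result = transform(source)
    # naive comparison by serialization; good enough for a sketch
    ok = ET.tostring(result) == ET.tostring(expected)
    return ok, t.get("spec")

ok, spec = run_test(TEST, lambda elem: elem)  # identity "processor"
print(ok, spec)  # True 5.8
```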

I'm all for starting tomorrow.

Update: Whoops, I guess mod_xslt2 for Apache is in C, looks like. Well, never mind, I still think it's a bang-up idea. We'll see what the New Year brings, eh?

For some time, in the context of my workflow toolkit, I've been thinking intensively about UI design in wxPython.

See, once I was embroiled in a rather extensive project developing a GUI application under wxPython, and frankly, the UI was unmanageable. It had been developed with some IDE tool or another, but the output was Python code. It was horrible, trying to find what was what, which panel it was on, and what its ID was -- ugh! This was back in about 2001.

At that point, I hadn't really started integrating wftk into Python yet, but I dabbled in it over the next couple of years, always with the notion that the UI is most sensibly defined in XML, and that a sensible UI manager would then take that definition and build all the objects needed to implement it in wxPython (or, for instance, online in a portal or something). And since that time, other people have naturally had many of the same ideas, and you see this implemented. But I've always wanted to finish my own implementation.

The current app I'm working on is, of course, a GUI app (at least, some of the time.) And so naturally I have revived my need for my UI design notion -- and in the context of working on the file tagger, I intend to start implementing the UI module. On that note, here is a tentative UI definition sketch for the file tagger. Ideally, we could use this XML not only to generate the app itself, but also to generate documentation for the UI design (by transforming it with XSLT into SVG, for instance; wouldn't that be indescribably cool?)

All of this is, of course, subject to radical change. Here goes:

    <tab label="Cloud">
       ...
    </tab>
    <tab label="Files">
       <splitter (some kind of parameters)>
          <radio value="something" label="All"/>
          <radio value="something" label="Some"/>
          <button label="Show"/>
          <col label="Name"/>
          <col label="Tags"/>
          <col label="Description"/>
       </splitter>
    </tab>

I already have a framework for that definition to go into -- I wrote that in, like, 2002 or so. But I never got further than definition of menus. So here, I'm going to implement frames, and at least one dialog.

Note that what's utterly missing from this is any reference to code to handle events. That will come later, when I see what has to be defined where to get all this to work.

And on that note, I close.

2006-12-17 spam xrumer


Need overview about XRumer software?
I'm seeking for any information about XRUMER program.
Can you help me? Or give me a link to the official site with this autosubmitter.

There. Now let's wait a week for Google to index this, and see what the log drags in. Thank you very much, and I now return you to your regularly scheduled programming.

So my first actual weekly application is finished now. Aren't you proud? Suffice it to say that even a minor app takes a few hours to put together when you're reworking all your programming tools at the same time. A character flaw, I suppose. I never use an already-invented wheel if I have a perfectly good knife and wheel material. And I never use an already-invented knife if I have a perfectly good grinder and stock metal. And I never use an already-invented grinder if I have a lathe, motors, and a grindstone. And I never use an already-invented lathe... (sigh).

At any rate, it took me a few hours more than I wanted, but I'm reasonably pleased with the result. You can see the whole thing here (it's far too long to publish on the blog directly, of course). Go on. Look!

I just wanted to note at this juncture that my notion of running some very simple machine-translation code (yes, lovingly hand-coded in Perl) on a certain class of text, followed by human intervention using just the right kind of editor seems to be bearing fruit.

Granted, I would be in much less deadline trouble right now if I'd just done the Right Thing, shut up, and translated the text. But the text in question is not nicely flowing text. It is PLC message output written by engineers for machine operators, and it is dense. Very, very dense. So if I'd just translated it by hand I would have screwed up over and over.

Instead, I first scanned the entire text, broke it into words (with varying success), and looked up each and every word I didn't know. For those of you who aren't translators, this doesn't just mean not knowing a word at all; it includes not knowing what those particular engineers and machine operators intend to say with a particular technical term. This can be challenging, but in this case I had a lot of previously translated text, so I could look most words up in that.

Once all the words were "known" (ha), I ran the whole thing through a phrase scanner. Frequently occurring phrases were presented with word-by-word translations, along with some crude rewrite rules to make a better guess. This is all very, very naive, as any translator knows. It's not even as good as SYSTRAN, and SYSTRAN sucks.
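The word-by-word stage can be sketched like this (the glossary entries are invented; the real lookup ran against previously translated text):

```python
from collections import Counter

# Invented mini-glossary standing in for the real word lookup.
glossary = {"motor": "motor", "läuft": "runs", "nicht": "not"}

def gloss(phrase):
    """Naive word-by-word translation; unknown words stay bracketed."""
    return " ".join(glossary.get(w, f"[{w}]") for w in phrase.lower().split())

# present the most frequent phrases first, with their crude glosses
lines = ["Motor läuft nicht", "Motor läuft", "Motor läuft nicht"]
for phrase, n in Counter(lines).most_common():
    print(f"{n}x {phrase!r} -> {gloss(phrase)!r}")
```

The rewrite rules mentioned above would then reorder and patch these glosses; this shows only the dumbest first pass.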

But as I translated more of the frequent phrases, the system was able to string together better guesses for the longer phrases. At some point, then, I decided to switch over to direct translation of the actual segment list. This text was "nice" (in this one aspect alone) because segmentation was easy -- every line is a separate sentence, so there's no need to figure out where sentences might break. That's convenient.

At any rate, I am now using my specialized text editor to approve and/or modify each resulting phrase. Remember: all the words are already there, sort of, just not usually in an understandable order. Now that I can very quickly select and drag them around, though, my new translating technique is unstoppable. Ha! They said it couldn't be done! Those fools! MWAahahahaha!

(coff) OK, I'm better now. Documentation soon. I just felt enthusiastic.

So again, a lick and a promise for this blog as I madly try to finish some translation work. This translation job is an interesting one, though, as I mentioned, and as it turns out, amenable to editing in a specialized tool I just wrote today. Of course, having written the tool today means I have to use the tool this evening in a mad dash to finish, which in turn means I have no time to document the code until tomorrow at the very earliest.

Suffice it to say that the exercise was surprisingly easy. The task was simple: I need a tool to edit text files in which (for reasons we'll go into later) I have a number of phrases, one per line. The phrases mostly have all the right words in them, but not in the right order. I thus need a way to quickly select one or more words and drag it into the right place in the phrase. Sure, you say, Word does that. Yeah, except that Word doesn't put the spaces in the right place. God and Bill Gates alone know why, but Word doesn't put the fricking spaces in the right place when I drag words around on a line, and so I took matters into my own hands and rolled my own solution. And by God it works! Still a few little oddities in it, but it works more quickly than Word for this particular application.

Another nice thing it can do is this: when I drag the first word of a phrase out into the middle, it can decapitalize that word, and capitalize the new first word. That saves me a fraction of a second, and multiplied by 2000 phrases that adds up to a lot of time.
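Reduced to a standalone function, that capitalization fix-up looks something like this (a sketch; the real editor does this on drag events):

```python
def move_word(words, src, dst):
    """Move words[src] to position dst in a phrase, keeping sentence
    capitalization: decapitalize a moved first word, capitalize the new head."""
    was_first = (src == 0)
    w = words.pop(src)
    if was_first:
        w = w[:1].lower() + w[1:]
    words.insert(dst, w)
    if dst == 0 or was_first:
        words[0] = words[0][:1].upper() + words[0][1:]
    return words

# "Stopped motor has" -> drag the first word to the end -> "Motor has stopped"
print(move_word(["Stopped", "motor", "has"], 0, 2))  # ['Motor', 'has', 'stopped']
```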

And another little thing I just now added: I can hit a key and toggle the case of the word the cursor is on. Again: this may or may not be of general use, but for this particular application it's very convenient. And that's really the idea of special-purpose text editors. An example from the programming world is emacs -- you can write LISP code to make emacs do literally anything at all (including psychoanalysis) from your text editor. The only problem being that it's too damn hard to start. Python's easier, at least for me. So a text editor in which you can embed your own Python snippets might be a generally useful tool indeed!
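That embeddable-snippet idea is almost trivial to prototype: bind the current line into a namespace, `exec` the user's snippet, and read the line back out. A sketch:

```python
def run_snippet(snippet, line):
    """Run a user-supplied Python snippet against the current editor line."""
    ns = {"line": line}
    exec(snippet, ns)   # the snippet may rebind `line` however it likes
    return ns["line"]

# e.g. a toggle-case command bound to a key:
print(run_snippet("line = line.swapcase()", "Hello World"))  # hELLO wORLD
```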

So. Tomorrow or Sunday, documentation and maybe some more movement on the drop tagger. And in the meantime, go get some sleep! (I know I won't any time soon.)

2006-12-06 drop-handler

Whew. Take a look here if you want some complicated stuff. This is a guide written for complete idiots ("The Complete Idiot's Guide to Writing Shell Extensions") and I'm on my third time through it. Apparently I'm not a complete idiot.

My mission, if I choose to accept it, is to do that, in Python. It should be possible, but I have a translation deadline, um, yesterday, and so I don't have much more time than to note that link, globber something to the effect that "I'm soooo confused", and move on.

Today's fun task was the creation of a little prototype code to format the tag cloud for the drop handler project. I did it in the context of this blog, and so first I had to get my keywords functional. I already had a database column for them, but it turned out my updater wasn't writing them to the database. So that was easy.

Once I had keywords attached to my blog posts, I turned my attention to formatting them into keyword directories (the primary motivation for this was to make it possible to enable Technorati tagging, on which more later.) And then once that was done, I had all my keywords in a hash, so it occurred to me that I was most of the way towards implementing a tag cloud formatter anyway.

Here's the Perl I wrote just to do the formatting. It's actually amazingly simple (of course) and you can peruse the up-to-the-minute result of its invocation in my blog scanner on the keywords page for this blog. Perl:

sub keyword_tagger {
   my $max_count = shift @_;
   my $weight;
   my $font;
   my $sm = 70;
   my $lg = 200;
   my $del = $lg - $sm;
   my $ret = '';
   foreach my $k (sort keys %kw_count) {
      $weight = $kw_count{$k} / $max_count;
      $font = sprintf ("%d", $sm + $del * $weight);
      $ret .= "<a href=\"/blog/kw/$k/\" style=\"font-size: $font%;\">$k</a>\n";
   }
   return $ret;
}
This is generally not the way to structure a function, because it works with global hashes, but y'know, I don't follow rules too well (and curse myself often, yes). The assumptions:

  • The only argument passed is the maximum post count for all tags, determined by an earlier scan of the tags while writing their index pages.
  • $sm and $lg are effectively configuration; they determine the smallest and largest font sizes of the tag links (in percent).
  • The loop runs through the tags in alphabetical order; they are all assumed to be in the %kw_count global hash, which stores the number of posts associated with each tag (we build that while scanning the posts).
  • For every tag, we look at its post count in the %kw_count hash and split the difference in percentages between $sm and $lg -- then format the link with that font size. Obviously this is a rather hardwired approach (the link should really be a configurable template), but as a prototype and for my own blogging management script, this works well.

For our file cloud builder, we'll want to do this very same thing, but in Python (since that's our target language). But porting is cake, now that we know what we'll be porting.
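For what it's worth, a direct Python port of the formatter above might look like this, taking the counts and the maximum as parameters instead of globals:

```python
def keyword_tagger(kw_count, max_count, sm=70, lg=200):
    """Format a tag cloud: font size scales from sm% to lg% with post count."""
    delta = lg - sm
    links = []
    for tag in sorted(kw_count):
        weight = kw_count[tag] / max_count
        font = int(sm + delta * weight)
        links.append(f'<a href="/blog/kw/{tag}/" style="font-size: {font}%;">{tag}</a>')
    return "\n".join(links) + "\n"

print(keyword_tagger({"perl": 1, "python": 2}, 2))
```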

Thus concludes the sermon for today.

There are two general ways to approach software design; each has its uses.

Top-down design looks at the entire project and breaks it into high-level components; those components are then subprojects and can be further handled in the same way.

Bottom-up design looks at the resources available and sees likely things that can be done with them; the idea is to provide generalized components to be used in any project.

A healthy software design ecology has a lot of bottom-up components at varying stages of maturity; those components then inform the top-down requirements of the current project, giving those designs something to work with. In the absence of such components, we're forced to write everything from scratch, and it all turns into ad-hockery of the worst kind.

Anyway, that item of philosophy out of the way, I wanted to talk about the design of this week's project, the drop tagger. There are three main components of the drop tagger, as follows:

  • The drop handler
    The drop handler is the component which interacts with the shell and provides something you can drop files onto or otherwise tag them. It calls the file manager. However, the notion of a general drop handler is a much more interesting one than a special-purpose drop handler just for this project, and one which can be a valuable addition to many different file-oriented projects.
  • The file manager
    The file manager shows us what files have been dropped, allows us to add and delete them and modify their tags, and for fresh drops it will actively ask for tags. It also calls the tag cloud formatter and provides a convenient place to display the cloud.
  • The cloud formatter
    This is likely to be the least general and thus the least interesting of these components, but it formats the file cloud upon request based on information compiled about the tags in the system.

Each of these components can be designed and used in isolation, and reused in other projects. Alternatively, once we've defined the components we need to meet our goal, we may well be able to find ready-made components already available (or at least something we can adapt instead of starting from scratch). There is then a maturity effect over the course of multiple projects, as our codebase allows us to be faster and faster responding to the need for a project.

I'd like to formalize this design process over the course of several mini-projects. Stay tuned for further progress.

Did I mention that I'm not only going to be covering technical topics on this blog? Today's word, kids, is "Aquaponics".

Aquaculture is growing fish for food. Hydroponics is growing food (or other) plants in water or another non-soil rooting medium. Aquaponics is using the fish water as the hydroponic nutrient solution, which does two things for you: the plants filter the nutrients (ammonia is fish urine but plant ambrosia) out of the water, so they don't choke the fish but instead are converted into, say, lettuce; and the fish provide a completely organic and relatively balanced set of nutrients for the plants. So the combination is superior to either alone, which makes perfect sense if you consider that two smaller ecologies put together into one bigger one are necessarily more balanced and stable.

Anyway, that's our family's project this week. We had already wanted to grow lettuce indoors for the winter, and so instead of simply growing lettuce, we are growing lettuce floating in a styrofoam block on an aquarium. The aquarium will have goldfish, so we won't be eating that end of the system -- but it could just as well have tilapia in it. In fact, tilapia are great aquaculture fish because they'll essentially eat anything. If they don't eat it, it just feeds algae, and they eat the algae instead.

So after we level up with 25 gallons of goldfish tank, I am very seriously considering building a much larger tank in the backyard under a geodesic dome (per Organic Gardening of 1972) and growing me some serious tilapia. Did you know that in a round pool 12 feet in diameter and 3 feet deep (a small section of our backyard) you can harvest 500 half-pound tilapia every couple of months? No, neither did I until today.

Anyway, I see all this as related to programming. Both are simply the design of systems to meet needs. And in fact, I find the way I think about an aquaponics system is very similar to the way I think about a general data processing system. Where an aquaponics system outputs lettuce, a data processing system outputs some information I want. To make lettuce, I need to consider the nutrients and water and light; to make valuable information I need to consider the available raw data.

In either case, I find that a small, modular approach works well. In the case of aquaculture, it's a matter of considering what nutrients are where and what organisms can convert one thing to another; whereas in software, it's a matter of seeing simple data structures and designing lightweight tools that can convert one to another -- and then you organize all your little modules/organisms into an ecology.

Lately, there have been two (software) projects I've worked on in which this systems approach has worked well. The new Toon-o-Matic is composed of a number of small, relatively simple Perl scripts which are all organized by a Makefile. Each script reads one or two or three input data structures, and emits one or two. The overall network could be drawn as a graph (and indeed, that would be edifying and entertaining, and I should do that.)

The other such system is this blog. I've deliberately kept the approach simple and completely sui generis. I'm reinventing the wheel to a certain extent, but that's the attraction -- I like new wheels, and the occasional flaw doesn't bother me, as I always learn. Evolution doesn't mind reinventing the wheel -- did you know that the eye has evolved many completely separate times? The eyes of insects, vertebrates, and molluscs are three completely independent instances of the evolution of a visual sensor. And the eyes of molluscs (like octopi) are demonstrably superior to ours: our retinal nerves are in front of our retinae, thus each eye has a blind spot where the optic nerve penetrates the retina to leave the eyeball. Molluscs sensibly have their retinal nerves behind the retina: no blind spot. Another reason to believe in Intelligent Design -- just, you know, not of us. God loves the octopus, which is why global warming is going to provide the octopus with lots of shallow, warm seas with recently vacated cities in them.

Anyway, back on something resembling a track: my ultimate goal in the case of aquaculture is to close the ecological loop. I want to take my kitchen and garden waste, recycle it with vermiculture and composting, feed the worms and plants to tilapia, use the fish water for lettuce and seedlings and the worm castings for root vegetables, and ultimately I believe it may well be possible to feed my family fresh fish and veggies with not much more input than cardboard, grass clippings and leaves, and whatever's on sale at Kroger.

My goal in the case of most data processing systems is less lofty: I simply want to model some useful process in small, easily maintained and easily modified steps, so that the system remains flexible and reliable. But in either case, the thought processes are similar: to attain a large goal, break it down into small, reusable task utilities.

I'll keep you posted on both.

2006-11-30 programming art

I got wind of this wonderful, wonderful artist because I was obsessing on my hit logs due to starting this little blog (ahem, not like you haven't done that, be honest), and Xerexes at Comixpedia linked to this beauty of a site saying "Shades of the Toonbot" (aww, I've become a Concept. How cool.)

Anyway, my little scrivenings (hot dogs that they are) are nothing compared to the absolute jawdropping sheer beauty at Gallery of Computation | generative artifacts. You gotta see it.

The site is the turf of one Jared Tarbell, whose modus operandi is to write programs which express graphics. Pretty graphics. Really, really pretty graphics.

Well -- enough bubbling. Suffice it to say that I'd like to include a scripting engine into some version of the Toon-o-Matic which allows this kind of generative graphics. I doubt I'll ever get it that pretty, but still -- a man can dream.

Incidentally, note that this post's title contains parentheses. I'm probably revealing myself to be a complete fool, but let's just say that my blog weaving code choked on it because I was doing something really stupid with regular expressions. That may very well be the subject of a post soon. Or maybe I should keep my more egregious bugs under my hat. No, wait, those metaphors mix rather uncomfortably...

One of the neat little things I did over the past few days was a simple Word macro -- at least, it should have been simple, but the problem is one I've had for a long time.

In this case, what I wanted to do was to fix up a few documents I had from a translation customer. This particular end user, for reasons known only to them, captions their figures using fields. The fields are in text boxes for easy positioning, and the field results (the text you see on the screen) are the captions.

Only one problem: the fields are always variable results for variables which don't exist in the document. All I can figure is that the document preparer makes these things in little snippets with some other tool which spits out Word texts, then they paste those into the text boxes.

So, you're asking now (unless you're a professional translator) who cares? You just type your English over the German in the captions, and you're home free, right? Well: no. Everybody who's anybody in the wonderful world of translation nowadays uses translation tools, in this case TRADOS.

TRADOS does two things for you: it stores each and every sentence you translate in a translation memory (a TM), so you (sort of) never need to translate anything twice, and it also makes it much easier to step through a document translating. The use of TRADOS makes translation much easier, and it also helps you stay consistent in your use of words and phrases.

Herein lies the problem: those fields were untouchable by TRADOS. There are two modes in TRADOS: one steps through the document using Word macros but doesn't deal well with text boxes (and yes, you'll note they're in text boxes). So that approach was out. The other (the TagEditor) converts the entire document to an XML format, then edits that in a very convenient way. The TagEditor makes short work of text boxes, but those field results were invisible to it.

Stuck! And so for a series of three jobs from that customer, I just didn't use TRADOS on the figure attachments, and hated it. Last week, though, I took screwdriver in hand (metaphorically speaking) and decided it was showdown time.

OK, that's the teaser -- follow the link to get the ... rest of the story.

2006-11-27 blogmeta

Welcome to the new blog. This is an idea I've been kicking around since the day I heard the term "web log" (we hadn't abbreviated it), but I just haven't had the time until now. You know how it is.

The idea of the blog is simple. I've been programming for a very long time now, and I like to write about it. So when I discovered that after a significant hiatus from daily programming, my Muse had reawakened, I resolved that from this point on, I would spend at least a small amount of time each day programming something. The results have been gratifying; I've completed a number of small tasks I'd been wanting to resolve for a while.

And hence this blog. This is the place where I intend to present said small tasks, on a daily basis, for your perusal and, dare I hope, your amusement. I hope you enjoy it; I know I'm going to.

Some of the things I want to cover in the near future are, in no particular order,

  • A few Word macros I've written to fix things up after TRADOS macros break them
  • My experiences getting Python to read and write TTX files (a file format used by TRADOS -- I earn my money with translation nowadays, mostly, so that's why translation tools come up so often)
  • This very blog, which is a freaking Rube Goldberg contraption of Perl scripts and Makefiles
  • The amazing Toon-o-Matic, which I use to spin out graphics for my Web cartoon -- remember, the Toon-o-Matic is the work of art; the strip is just a by-product, like hot dogs
  • GUI builders for wxPython, something I've been hacking around on for a while without discernible progress (but I hope that will be changing)
  • Workflow applications, of course.

That ought to keep me in posting fodder for a few months, eh?

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.