The Blog, or, Vivtek 2.0

post #49 dated 2008-05-03 09:54:50 - generative_linguistics - name_generator

Man, just when I think it can't possibly get any busier in my life, well, it does. So I just haven't had much time left over to do programming, and that makes me sad. But when this happens, eventually the Muse forces me to code something. This week, besides getting back to the modbot forum despammer (on which topic I will write later), suddenly last night I found myself writing a Perl script to generate fantasy country names based on a program written in some antediluvian BASIC dialect in the 70's by Jo Walton.

Once I'd written it, I realized it would be cooler if it were online -- because everything is. And so I wrote a little Tcl wrapper based on my Google count wrapper, and slapped it up. And played with it a lot.

So now I have a lot more ideas, of course. I wrote it to generate a random number of syllables up to a maximum, then gave it syllable forms like "cvv", "vv" (for consonant-vowel combinations). The vowels and consonants are likewise simple lists. You can put accents on the vowels, optionally. You can specify prefixes or suffixes. You can make letter choices for the consonants according to some scheme Jo dreamed up.
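For the curious, the syllable scheme just described can be sketched in a few lines of Python. This is a minimal illustration, not the actual Perl script: the letter lists, the syllable forms, and every function and parameter name below are invented for the example, and accents and Jo's letter-choice scheme are left out.

```python
import random

# Illustrative letter lists and syllable forms; not the real script's data.
CONSONANTS = list("bdfgklmnprstvz")
VOWELS = list("aeiou")
FORMS = ["cv", "cvv", "vv", "cvc"]

def make_syllable(form, rng):
    """Expand a form like "cvv" by picking a random letter for each slot."""
    return "".join(
        rng.choice(CONSONANTS) if slot == "c" else rng.choice(VOWELS)
        for slot in form
    )

def make_name(max_syllables=3, prefix="", suffix="", rng=None):
    """Build a name from a random number of syllables, up to max_syllables."""
    if rng is None:
        rng = random.Random()
    count = rng.randint(1, max_syllables)
    body = "".join(make_syllable(rng.choice(FORMS), rng) for _ in range(count))
    return (prefix + body + suffix).capitalize()

if __name__ == "__main__":
    # Print a handful of candidate country names ending in "ia".
    for _ in range(5):
        print(make_name(suffix="ia"))
```

Passing in a seeded `random.Random` makes the output repeatable, which would be one cheap way to "bookmark" a name you like.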

But really what this is -- or could be -- is a generic generative grammar at the phonological level. So I could well imagine making it more general, by providing a nested and arbitrary hierarchy of phonemes (liquids, fricatives, front and back) and providing more complicated options for rules.

And on the wrapper end, wouldn't it be nice to be able to save a language spec, bookmark it, name it, comment on it, mail it to your friends, etc? How about an "evolutionary strategy" to search for types of words you like the best? Heck, there's all kinds of stuff that could be done.

But this was fun. And from the traffic it's getting, I'm not the only one who finds it hypnotic.

post #48 dated 2007-12-03 18:32:52 - mediawiki

I have seen the future, and it calls itself MediaWiki.

post #47 dated 2007-10-26 20:50:21 - despammed

I finally took the plunge, after 8 years. All vivtek.com mail is now going through the Despammed filters. I just couldn't take the sheer mass of spam any more. (I get more than 1000 email spam messages a day and yes, for 8 years I've sorted through them by hand. Did I mention I read quickly? But lately I've been letting the spam slush pile get too big. Time for a change.)

There are some issues I still want to address -- like setting up a whitelist for customers to be on the very safe side -- but I have to say that the drop in spam in my actual Inbox has been (1) incredible and (2) so worth it.

post #46 dated 2007-10-24 15:37:57 - workflow - wftk

I never really officially released wftk 1.0, of course (the magnitude of the task simply grew and grew and I became less and less certain of my approach -- and then the recession happened). But I've been thinking a lot about a more reasoned approach lately, and maybe it's time to reboot the wftk project and start more or less "from scratch".

I see the modules in this new approach more or less as the following:

  1. Data management
    This is the basic list-and-record aspect that the repository manager started out addressing. Now, of course, there is SQLite. So a principled workflow toolkit would start by using SQLite for local tables, and add "external tables" (for which the new SQLite has an API) defined in what SAP now calls the "system landscape". It's amazing, by the way, how much of my thinking over the past few years I see reflected in what SAP is doing lately in their NetWeaver stuff.

  2. Document management
    Document management, as I see it, consists of: (1) actual central storage and versioning of unstructured data; (2) storage of metadata about documents; (3) parsing and indexing of unstructured data to produce structured data elsewhere in the system. The document manager should be able to work well either in situations where it controls storage (and thus can initiate action whenever anything is changed) or in situations where it merely indexes storage that can be changed externally -- the latter might be, for instance, management of a Website's files in the file system. Or just your system files on a Windows machine. Periodically, the document manager could check in and see whether things had been changed, and if so, trigger arbitrary action.

  3. "Action" management
    A central script and code repository defines the actions that can be taken by a system. I consider this to include versioning and some kind of change management and documentation system, including literate programming and indexing of the code snippets. The build process should also be managed here, and should be capable, for instance, of taking algorithms written in C, compiling them into DLLs or .so dynamic load libraries, and calling them from Perl, say. Ultimately.

    Actions, documents, and data would have a nested structure, by the way; there would be global actions, application actions (a given case or project could be an instance of an application), and project/instance actions, and the same applies to data and documents, perhaps. Originally I'd thought of doing the same for users or organizational units, but I really think that if you're defining a common language of actions and data, it should be organized into applications and, perhaps, subapplications or something. But not differ by user! (I might be wrong, of course.)

    The above three modules together allow a data-flow-oriented processing system, but we're still missing:

  4. Outgoing interfaces
    This includes publishing of HTML pages, outgoing mail notifications, other notifications such as SMS or ... whatever. Logged, all of it. It includes report generation into the document management system or the file system, generation of PDFs, etc.

  5. Incoming interfaces
    Given the parsing power of the document management module, this is more an organizational module. The system should be able to receive email, parse it, and take action. Conversational interfaces are covered here as well, from SMTP- and IMAP-like state machines to chatbot NLP interfaces. And of course form submission from Websites also falls into this bucket.

  6. Scheduling
    Whether running on Unix with cron and at, or Windows with ... whatever the hell Windows offers, the system should have a single unified way of dealing with time in a list of scheduled tasks.

  7. Users, groups, roles, and permissions
    This module would be in charge of keeping track of who is performing a given action and whether they're allowed to do so. The original wftk already provided a really nice mechanism which would still be nice here: when judging permissions, any action can get the answers "yes, it's allowed", "no, it's not allowed," and "it's allowed subject to approval." That last invokes workflow for any arbitrary action and that would be a powerful abstraction for nearly any system. It's essentially transaction management on a much more abstract scale.

    And finally, the icing on the cake,

  8. Workflow
    The two components which make workflow workflow are a task list (tasks are hierarchical in nature and so a task can have subtasks as a separate project) and a workflow process definition language. The new wftk should be able to work with any workflow formalism -- after all, the process definitions are considered scripts in the versioned script document repository. The existing wftk engine will almost certainly fit in here with little modification.

    The primary benefit of workflow is that it allows dissociation over time. A running workflow process isn't active on the machine for the weeks or months it might require -- it's simply a construct in the database that gets resurrected as required. There are a boatload of applications in general programming, but nobody sees them as workflow because everybody "knows" workflow is a business application. The wftk was to have changed that, and I think the potential's still there.

    There's also a case to be made for a module for

  9. Knowledge management
    This portion of my thinking is a little less organized. I'd kind of like to lump some kind of concept database in here, perhaps a semantic parser or something. Originally I'd thought that AI would go in here, but I actually think that Prolog might just be another action script language. This is definitely a blurry line in its native habitat, and crikey, he's not happy to see me here!

    But the point of a blog is to write this stuff down as it occurs. So there you have it, this would sit on top of the workflow. Think of it as a way to build smart agents into your data/document/action/workflow management system.
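To make the three-valued permission answer from module 7 a bit more concrete, here is a minimal Python sketch. "Subject to approval" parks the action as a row in an SQLite table, which also illustrates the point about workflow dissociating over time: nothing stays running while the request waits. The rule table, the table layout, and all names are hypothetical; none of this is wftk code.

```python
import sqlite3
from enum import Enum

class Verdict(Enum):
    ALLOWED = "yes"
    DENIED = "no"
    NEEDS_APPROVAL = "approval"

# Hypothetical rules: (role, action) -> verdict. Anything unlisted is denied.
RULES = {
    ("editor", "edit_page"): Verdict.ALLOWED,
    ("editor", "publish_page"): Verdict.NEEDS_APPROVAL,
    ("admin", "publish_page"): Verdict.ALLOWED,
}

def check(role, action):
    """Answer yes, no, or "allowed subject to approval" for an action."""
    return RULES.get((role, action), Verdict.DENIED)

def open_db(path=":memory:"):
    """Open the database and make sure the pending-request table exists."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        " id INTEGER PRIMARY KEY, role TEXT, action TEXT, state TEXT)"
    )
    return db

def request(db, role, action, run):
    """Run the action, refuse it, or park it; returns (verdict, request_id)."""
    verdict = check(role, action)
    if verdict is Verdict.ALLOWED:
        run()
    if verdict is not Verdict.NEEDS_APPROVAL:
        return verdict, None
    # The waiting request is just a row; no process sits around for weeks.
    cur = db.execute(
        "INSERT INTO pending (role, action, state) VALUES (?, ?, 'waiting')",
        (role, action),
    )
    db.commit()
    return verdict, cur.lastrowid

def approve(db, request_id, run):
    """Resurrect a parked request and carry it out; False if none is waiting."""
    row = db.execute(
        "SELECT id FROM pending WHERE id = ? AND state = 'waiting'",
        (request_id,),
    ).fetchone()
    if row is None:
        return False
    run()
    db.execute("UPDATE pending SET state = 'done' WHERE id = ?", (request_id,))
    db.commit()
    return True
```

A real version would store enough in the row to rebuild the action later instead of taking a callable at approval time; that shortcut is purely to keep the sketch short.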

And there you have it -- my plan to wrap up the thought and work of eight years. Oh, and this time I'm not bothering with licensing requirements. Like SQLite, wftk 2.0 will be in the public domain. I don't really care if I get credit or not for every little thing, because frankly, anybody who counts will figure it out. And have you noticed how everything these days uses SQLite? It's because -- well, primarily because it works, but also because you don't have to worry about legal repercussions of using the code.

That's where wftk document management should be, where wftk workflow should be. Simple, easy to use, and ubiquitous.

post #45 dated 2007-10-07 20:43:53 - chatbot - nlp

Something I've wanted to do for a couple of years now is a sort of online chatbot framework thing. In other words, this would be a testbed for different language analysis techniques that could be played with online and tested against real people.

An extension would be to connect a given chatbot to some other chatbot out there somewhere, and see them talk to each other. That could be fun.

The basic framework for this kind of venture could be pretty simple, but could, of course, end up arbitrarily complex. You'd need some kind of principled semantic framework, which would start at a simple box with words in it and ramify through increasingly sophisticated syntactic and semantic analyses. The idea is to have a framework that can support anything from simplistic single-word pattern matching to select a response, through sentence frames that extract some subject patterns to be manipulated in the response, right up to a hypothetical Turing-complete NLP parser.

The session would contain a list of facts and "stuff" which corresponds to the, I dunno, dialog memory of a conversation. There could optionally be some kind of database memory of earlier conversations with a given contact. Again, this would run the gamut from simple named strings to be substituted into a response pattern, to complete semantic structures of unknown nature, which would be used to generate more sophisticated conversation.

Then the third and final component would be the definition of a chatbot itself. This would consist of a set of responses to given situations (a situation being the current string incoming plus whatever semantic structure has been accumulated during the course of the conversation.) There could be a spontaneity "response", i.e. something new said after some period of time without an answer from the other. Again -- it should be possible to start small and stupid, with simple word patterns, random-response lists, and the like, and build upwards to more complicated semantics.
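The "small and stupid" end of that spectrum might look something like the Python sketch below: word patterns, random-response lists, and a session holding named strings that get substituted into response patterns. The rules, the fallback lines, and all the names are invented for illustration, and there is no spontaneity timer or language detection here.

```python
import random
import re

class Session:
    """Dialog memory: here just named strings picked up along the way."""
    def __init__(self):
        self.facts = {}

# A bot definition: (pattern, candidate responses). Named groups captured by
# a pattern are stored in the session and can be substituted into replies.
RULES = [
    (re.compile(r"\bmy name is (?P<name>\w+)", re.I),
     ["Nice to meet you, {name}.", "Hello, {name}!"]),
    (re.compile(r"\b(hello|hi)\b", re.I),
     ["Hello there.", "Hi!"]),
    (re.compile(r"\bbye\b", re.I),
     ["Goodbye, {name}."]),
]
FALLBACK = ["Tell me more.", "Go on."]

def respond(text, session, rng=random):
    """Pick a response for the incoming string plus the accumulated memory."""
    for pattern, responses in RULES:
        match = pattern.search(text)
        if match:
            session.facts.update(match.groupdict())
            template = rng.choice(responses)
            break
    else:
        # No rule matched, so fall back to a generic prompt.
        template = rng.choice(FALLBACK)
    # Use a neutral word for any fact that hasn't been learned yet.
    facts = {"name": "friend", **session.facts}
    return template.format(**facts)

if __name__ == "__main__":
    session = Session()
    for line in ["Hi, my name is Ada", "I like country names", "bye"]:
        print(">", line)
        print(respond(line, session))
```

Swapping the regular expressions for a real parser, and the dictionary for richer semantic structures, is where the "build upwards" part would come in.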

The ability to detect and switch languages would be of great use, of course, and there should be some kind of facility for that as well.

Wouldn't it be nice to be able to build a chatbot for language practice in, say, Klingon or Láadan? I mean, how else could you reasonably practice a constructed language?

Anyway, when I have time, I'll certainly be doing something with this idea. Any year now, yessir, any year.

post #44 dated 2007-09-27 10:21:28 - text-analysis - politics

This week, for the second week in a row, Boing Boing features a University of Arizona initiative to "identify people online by their writing style". Homeland Security is of course all over this whiz-bang tech like ants on honey, because... well, I started a comment on the program, only to realize this would better be a blog post.

1. I'll accept that it's possible to come up with some "similarity metric" that says "A is 99% similar to A' but only 32% similar to B", in the sense that we have such-and-such a probability that a given text was written by a certain person. (So we end up with "similarity islands" of texts in the metric space, and we call each of those islands a writer.)

But that means that for any given text we have only a probability, never a certainty, that it was actually written by A. So let's get entirely wild and assume some government researcher with more money than brains, working alone in a highly technical and difficult field, somehow writes an algorithm as good as, say, Google's algorithm for determining the topic of a page, which is inherently an easier problem.

The result? We will be able to find terrorists online as well as Google can avoid giving us crap search results. And forgive me for saying this, but nobody in their right mind would arrest someone based on a Google result.

OK, #2. We are categorizing writers and potentially calling them enemies of the state based on WHAT THEY WRITE. Now, I know that not all the readers of this blog are Americans, but here in America, we have something called the Constitution which means that it is not a crime to write things.

Ah, hell, all snark aside, even if this works, it's still misguided for patently obvious Constitutional reasons. And it's not going to work, not the way they think, because --

3. What this all boils down to is this. Politicians and technocrats think that the world is divided into two groups of people: "our" people, who do what we say and pay us taxes so we can buy nice houses in Virginia, and "those" people, who rouse the rabble and put our salaries in jeopardy. "Those" people, this year, we call "terrorists". Earlier they were "communists", or "labor organizers", or "civil rights activists", or whatever -- the main thing to remember is that everything is stable unless troublemakers stir things up.

And we used to be able to know who those people were, because they looked funny. But on the Internet, nobody knows you're a dog -- and so there is a perceived need, whether it's possible or not, for a technology to identify dogs. Or "terrorists" -- but what they really want is to be able to draw a line down the middle of the Internet between safe people and troublemakers.

And what that means is that the freewheeling exchange of ideas -- OK, and 90% crap -- which is the Internet? It's gone sufficiently mainstream that these people regard it as a threat, exactly like certain neighborhoods, or certain movements, have been in the past. It's too free for comfort, and it's too well-known to be ignored.

I don't know how it'll play out. But this story really exposes a seamy underside of our society. It's depressing.

And then there is the notion of spoofing such an algorithm, as researched by none other than Microsoft, in "Obfuscating Document Stylometry to Preserve Author Anonymity". This may well be illegal research at some point in a dystopian future...

post #43 dated 2007-09-12 21:41:29 - politics

I just found myself posting this in comments on another site, and it expresses my feelings pretty damned well. This is a technical blog, but it's my blog, and in this one instance, I feel justified about a political post. And it's not like anybody reads this anyway -- so assuming you even exist, you have no reason to complain.

Thus runs the missive:

Bah. I don't like to rehash 9/11 because the second thing I thought, walking into the IU Union and seeing the smoking tower on the TV, was "Reichstag", and in this one thing, Mr. Bush has not failed me.

The first thing was -- holy shit THERE's something you don't see every day.

You know what? The actual specific damage there was nothing to the United States. Compare it to the damage Hitler did in Europe, then come back and tell me it was at all significant. The only damage it did was in America's collective head -- it was a bee sting, and the last six years have been anaphylaxis. Not sure yet whether it was fatal shock, but the patient still doesn't look good.

Now these same people, after 9/11-ing for years to justify stupidity and blood in Iraq, telling us a tinpot third-world embargoed communist was as dangerous or more dangerous to our mighty nation than the heavily industrialized Germany we faced down and beat while fighting on another front entirely -- against ninjas for cryin out loud -- those people are now telling us we're in an existential fight with Iran over the fate of Western Civilization. They've been at war with us for thirty years, and it's been an existential threat, but we just ... haven't noticed? Does anybody but me see how fricking stupid that sounds?

I spent a lot of time and energy being liberal and anti-war between 2001 and 2004-ish, killed my business thinking about politics instead of noticing the recession, ran up a shitload of unpaid back taxes and debt while killing my business, and every day I watched a pack of deadbeats getting richer and richer off America's pain, all because of America's Reichstag and the willingness of the American people to treat international politics like a horror movie. And that fucking monkey, excuse my freedom, can't even wipe the smirk off his face, even now.

I just don't care any more. America will recover from its panic attack, or it won't. It doesn't matter what I do or say. And that's what 9/11 means to me.

post #42 dated 2007-08-25 21:41:29 - toon-o-matic - toonbots - sisyphygean-tasks

So I actually did something with the Toon-o-Matic for a change. It was fun! Didn't finish what I wanted to do, but at least there was an actual new kind-of-episode for the first time since 2006.

Maybe next year I can do another.

post #41 dated 2007-08-22 21:09:15 - spam - internet_sleuthing

I've always had a soft spot for good explanations of Internet sleuthing for fun and profit, and here's a dandy example.

post #40 dated 2007-08-16 08:49:11 - tower_defense - evolutionary_programming - ocr - python - pil

OCR in Python. There isn't any, to speak of. While there do exist a few open-source OCR projects (Conjecture seems to have a great deal of promise!), none of them play well with Python. I may want to rectify that at some point.

Anyway, growing bored of simply writing AutoHotKey scripts to play Tower Defense, I quickly realized that I really needed a tool to start the game for me, and track the score and other stats for later analysis.

The first part was no big deal; I whipped out a PyPop applet that could launch a URL. Since I wanted a window that was sized according to the Flash object, that required putting together a local HTML file that could run some JavaScript to pop up the window I wanted, then close itself afterwards. I'll document that when I have time (I propose the abbreviation wIht to save my time typing that phrase.)

Well. That was fun, and it worked, but I really wanted something to monitor the score for me, and timing, and ... stuff. Which meant that I would have to read the actual graphical screen, because there's no handy-dandy textual output on that Flash app.

You'd think that would be trivial in 2007. But you'd be wrong.

Getting a snapshot of the screen was easy enough. I wrapped win32gui to get a window handle by the title (I'll document this later: Idtl), then installed PIL to grab the actual graphical data and manipulate it. To warm up with that, I set a timer to grab a four-pixel chunk of the screen so I could see whether the Flash had started or not (Tower Defense goes through an ad screen, then a splash screen, and only then does the game start.) That took a little putzing around, but the result was gratifying: my little utility could tell me when the game was ready to play. As long as the window was on my primary monitor, anyway (turns out PIL is not good with multiple monitors -- who knew?) So it turned out I had to move the window before all that.

And then I could grab the sections of the screen with the numbers on them for score, bonus, lives, timer, and money... And then it all screeched to a halt, because there are no open-source Python OCR libraries. At all. And clearly I don't have the time to adapt something -- hell, I don't even have time to do all this. I don't even have time to write this blog entry.

So of course I did the natural thing. I wrote my own special-purpose OCR, because clearly that would be saving time. I saved four hours before I started falling asleep, and it still can't tell 8 from 0 (but it does a fine job on the rest of the digits.) It was a lot of fun, actually. Idtl.
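For a rough idea of how a special-purpose digit reader like this can work, here is a generic template-matching sketch in Python. It is not the code described above: the 3x5 glyphs are invented, only four digits are included, and it assumes the digits have already been cut out of the screenshot and thresholded into rows of lit and unlit pixels.

```python
# Invented 3x5 glyphs ('#' = lit pixel). Real templates would be cropped
# from screenshots of the game's own font.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "8": ["###", "#.#", "###", "#.#", "###"],
}

def distance(glyph, template):
    """Count the pixels where two same-sized bitmaps disagree."""
    return sum(
        a != b
        for row_g, row_t in zip(glyph, template)
        for a, b in zip(row_g, row_t)
    )

def classify(glyph):
    """Return (best_digit, distance) for the closest template."""
    return min(
        ((digit, distance(glyph, tmpl)) for digit, tmpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )

def read_number(glyphs):
    """Read a sequence of already-segmented glyphs as a string of digits."""
    return "".join(classify(g)[0] for g in glyphs)

if __name__ == "__main__":
    print(read_number([TEMPLATES["1"], TEMPLATES["8"], TEMPLATES["0"]]))
    print(distance(TEMPLATES["0"], TEMPLATES["8"]))
```

In these made-up glyphs, 0 and 8 differ by a single pixel in the middle row, so one misread pixel is enough to confuse them. That is at least a plausible reason for trouble telling 8 from 0, though the actual cause isn't documented here.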

So. Proposal. It would be nice to work some with Conjecture, and produce the following: (1) a Python binding that can work in memory with PIL bitmaps, (2) a Web submitter for test graphics, and (3) an online tester and test database reporter. That would be really cool. Of course, only wIht.






Copyright © 1996-2007 Vivtek. All Rights Reserved.