Keyword text-analysis

2007-09-27 text-analysis politics

This week, for the second week in a row, Boing Boing features a University of Arizona initiative to "identify people online by their writing style". Homeland Security is of course all over this whiz-bang tech like ants on honey, because... well, I started a comment on the program, only to realize this would be better as a blog post.

1. I'll accept that it's possible to come up with some "similarity metric" that says "A is 99% similar to A' but only 32% similar to B", in the sense that we have such-and-such a probability that a given text was written by a certain person. (So we end up with "similarity islands" of texts in the metric space, and we call each of those islands a writer.)
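For the curious, here is a minimal sketch of what such a "similarity metric" could look like. It is only an illustration: it assumes character trigram counts compared by cosine similarity, which is one common stylometric approach and not necessarily anything the Arizona project uses. The function names (`char_ngrams`, `cosine_similarity`, `style_similarity`) are made up for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(text_a, text_b, n=3):
    """Hypothetical 'similarity metric': a score between 0 and 1 for two texts."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

Calling `style_similarity(text_a, text_b)` on two writing samples gives a number between 0 and 1. Note that the number is just a score: turning it into "the probability that A wrote this" takes a further modeling step, and that step is where the trouble starts.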

But that means that for any given text we have only a limited probability, never a certainty, that it was actually written by A. So let's get entirely wild and assume some government researcher with more money than brains, working alone in a highly technical and difficult field, somehow writes an algorithm as good as, say, Google's algorithm for determining the topic of a page, which is inherently an easier problem.

The result? We will be able to find terrorists online as well as Google can avoid giving us crap search results. And forgive me for saying this, but nobody in their right mind would arrest someone based on a Google result.
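To put rough numbers on "only a certain probability", here is a small back-of-the-envelope sketch using Bayes' rule. Every number in it is invented for illustration (the 99% hit rate, the 1% false-alarm rate, and the one-in-100,000 share of writers being looked for), and `posterior_match` is a hypothetical helper, not anything from the actual program.

```python
def posterior_match(true_positive_rate, false_positive_rate, prior):
    """P(text really is by the target | the matcher flags it), via Bayes' rule."""
    hit = true_positive_rate * prior
    false_alarm = false_positive_rate * (1.0 - prior)
    return hit / (hit + false_alarm)


# Invented numbers: the matcher flags 99% of the target's texts, wrongly flags
# 1% of everyone else's, and 1 writer in 100,000 is the one being looked for.
p = posterior_match(true_positive_rate=0.99, false_positive_rate=0.01, prior=1e-5)
print(f"Chance a flagged text is really the target's: {p:.2%}")
```

Under those made-up assumptions, a flagged text belongs to the target only about one time in a thousand. Different assumptions give different numbers, but the shape of the problem stays the same.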

OK, #2. We are categorizing writers and potentially calling them enemies of the state based on WHAT THEY WRITE. Now, I know that not all the readers of this blog are Americans, but here in America, we have something called the Constitution, which means that it is not a crime to write things.

Ah, hell, all snark aside, even if this works, it's still misguided for patently obvious Constitutional reasons. And it's not going to work, not the way they think, because --

3. What this all boils down to is this. Politicians and technocrats think that the world is divided into two groups of people: "our" people, who do what we say and pay us taxes so we can buy nice houses in Virginia, and "those" people, who rouse the rabble and put our salaries in jeopardy. "Those" people, this year, we call "terrorists". Earlier they were "communists", or "labor organizers", or "civil rights activists", or whatever -- the main thing to remember is that everything is stable unless troublemakers stir things up.

And we used to be able to know who those people were, because they looked funny. But on the Internet, nobody knows you're a dog -- and so there is a perceived need, whether it's possible or not, for a technology to identify dogs. Or "terrorists" -- but what they really want is to be able to draw a line down the middle of the Internet between safe people and troublemakers.

And what that means is that the freewheeling exchange of ideas -- OK, and 90% crap -- which is the Internet? It's gone sufficiently mainstream that these people regard it as a threat, exactly like certain neighborhoods, or certain movements, have been in the past. It's too free for comfort, and it's too well-known to be ignored.

I don't know how it'll play out. But this story really exposes a seamy underside of our society. It's depressing.

And then there is the notion of spoofing such an algorithm, as researched by none other than Microsoft, in "Obfuscating Document Stylometry to Preserve Author Anonymity". This may well be illegal research at some point in a dystopian future...