Could Simple Automated Tools Help Wikileaks Protect Its Afghan Sources?

10 August 2010, 2020 EDT

Julian Assange has a problem. When human rights organizations, who also feared the effects on Afghan civilians, pressed him to redact current and future published documents, he reportedly replied that he had no time to do so and “issued a tart challenge for the human rights organizations themselves to help with ‘the massive task of removing names from thousands of documents.’”

Leaving aside his alleged claims about the moral responsibility of human rights groups for his own errors, the charitable way to think about his reaction is that Assange wants to do the right thing but simply doesn’t have the capacity. Indeed, in a recent tweet he implored his followers to suggest ideas:

Need $700k for our next harm-minimization review… What to do?

Fair enough. Here’s an idea: how about using information technology?

As my husband, our Household Chief Technology Officer, pointed out over coffee this morning, what Assange is essentially in possession of is a large quantity of text data. There are many qualitative data analysis applications that allow users to easily sift through such data in search of specific discursive properties – I use one myself when I analyze interviews, focus groups or web content. Named entity recognition software easily allows users to identify all names or places in large quantities of text. Open-source variants like AFNER are available.
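For the technically inclined, here is a minimal sketch in Python of what "flagging names in text" could look like. It is emphatically not AFNER or any real named entity recognizer: it is a crude stand-in that assumes a name or place is a run of two or more capitalized words. The function name, the pattern and the sample sentence are all invented for illustration, and a real recognizer would do far better (this one misses single-word names entirely and will flag harmless capitalized phrases).

```python
import re

# Heuristic assumption: a candidate name is two or more capitalized words in a row.
# This is a toy stand-in for a real named entity recognizer, not a substitute for one.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

def find_candidate_names(text):
    """Return (start, end, matched_text) for each candidate name span in text."""
    return [(m.start(), m.end(), m.group()) for m in CANDIDATE.finditer(text)]

# Invented sample sentence; not drawn from any real document.
sample = "The patrol met Jane Doe near Example Village on Tuesday."

for start, end, name in find_candidate_names(sample):
    print(start, end, name)
```

Even a toy like this makes the point: the output is a list of character positions, which is exactly what a redaction step or a human reviewer would need as input.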

Corporations and governments already controversially use such tools for data-mining, to search for connections between names and places in large quantities of text. Could they not be equally leveraged in the service of privacy and confidentiality? How hard or costly would it really be to use such tools to identify and then redact all names in a body of text automatically? Or, failing that, to have a human being (or a team of them, working from a clear-cut coding scheme) go through the entire dataset, choosing with a few keystrokes what should be removed or blacked out?
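To make the two options concrete, here is a rough Python sketch of the redaction step itself. It assumes some detector has already produced non-overlapping character spans for the names in a text. The `redact` function, its `keep` parameter and the sample sentence are hypothetical names invented for this illustration. By default every flagged span is blacked out automatically; passing a `keep` function lets a reviewer (or a coding scheme written as a rule) decide span by span.

```python
def redact(text, spans, keep=lambda value: False, mask="[REDACTED]"):
    """Replace each (start, end) span with mask unless keep(value) returns True.

    Assumes the spans do not overlap.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])      # untouched text before the span
        value = text[start:end]
        out.append(value if keep(value) else mask)
        cursor = end
    out.append(text[cursor:])               # untouched text after the last span
    return "".join(out)

def span_of(text, phrase):
    """Locate phrase in text and return its (start, end) character span."""
    start = text.index(phrase)
    return (start, start + len(phrase))

# Invented sample sentence and spans, standing in for a detector's output.
text = "Jane Doe spoke to the patrol near Example Village."
spans = [span_of(text, "Jane Doe"), span_of(text, "Example Village")]

# Fully automatic: black out everything that was flagged.
print(redact(text, spans))

# Rule-based review: keep place names, remove personal names.
print(redact(text, spans, keep=lambda value: value == "Example Village"))
```

For the human-in-the-loop version, the `keep` function could just as easily show each flagged span to a reviewer and wait for a keystroke. The hard part is not this plumbing but the quality of the detection that feeds it.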

For me, it would be hard, unless someone handed me a software package that already blended these elements. But that’s primarily because I’m not a computer programmer. Julian Assange is.

Questions for readers: if you understand software design and the available off-the-shelf or open-source applications better than I do, how far-fetched is it to solve Wikileaks’ redaction problem in this way? Am I being daftly optimistic here? Or do you have other ideas in response to Mr. Assange’s query? Comment away.

[cross-posted at Lawyers, Guns and Money]