What’s in your document?

I wonder about my judgment in picking a tool called Voyeur (http://voyeurtools.org) as one of the first to try out on this site. But it’s worth a look for reporters who need a quick orientation to a website, a set of documents or a long report.

Voyeur was created as an offshoot of an international historical research project called “Data Mining with Criminal Intent.” The project won a Digging into Data challenge grant in 2009 to test new methods on one of the largest humanities datasets: the 200,000 criminal trials, spanning nearly 250 years, from the Old Bailey in London. The idea behind the challenge is that the digitization of materials used by historians and social scientists creates a “big data” problem that is prompting new computational research techniques. There are some obvious parallels for journalism that we’ll keep an eye on as this year’s awards are announced.

The tool itself is pretty simple. Upload your documents or point Voyeur at a URL and press the button to get a word cloud, a map of the documents and graphs showing how often words appear in different segments of the text. Here’s a picture of about 500 pages of e-mails obtained by the Raleigh News & Observer during its investigation of former Gov. Mike Easley and his wife’s contract with North Carolina State University:

No, it’s not text mining or entity extraction or anything of the kind. It’s just a quick overview of the document that gives you the common words (excluding the little ones), their approximate location in the text, and their context. This PDF worked and took less than a minute. I didn’t have the same luck when I tried a few of the Guantanamo documents released by WikiLeaks to see how it would compare documents: it broke.
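
For the curious, here is a rough sketch in Python of the three things that overview boils down to: counting common words while skipping the little ones, counting a word in each segment of the text, and showing a word in context. This is not Voyeur's code, and I don't know how Voyeur does it internally. The function names, the tiny stop-word list and the sample text are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list (the "little ones").
# A real tool would use a much longer list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "for", "on", "that", "with", "was"}


def tokenize(text):
    """Lowercase the text and split it into words (letters and inner apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def top_words(text, n=5):
    """Return the n most common words, ignoring stop words."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(words).most_common(n)


def segment_counts(text, word, segments=4):
    """Count how often `word` appears in each of `segments` roughly equal slices."""
    tokens = tokenize(text)
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    return [tokens[i * size:(i + 1) * size].count(word) for i in range(segments)]


def contexts(text, word, width=3):
    """Return each occurrence of `word` with up to `width` words on either side."""
    tokens = tokenize(text)
    return [" ".join(tokens[max(0, i - width):i + width + 1])
            for i, token in enumerate(tokens) if token == word]


if __name__ == "__main__":
    # Made-up sample text, not taken from any real document.
    sample = ("The contract was signed. The contract was renewed. "
              "The governor approved the budget.")
    print(top_words(sample, n=3))
    print(segment_counts(sample, "contract", segments=2))
    print(contexts(sample, "contract", width=1))
```

The word cloud is essentially the first function drawn with font sizes, and the segment graph is the second one plotted as a line.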

The actual example may work for only a few days until they take it down, but it’s worth a try.

About Sarah Cohen
Knight Professor of the Practice, Duke University; former reporter and editor for The Washington Post and other newspapers.
