Gardens with or without walls–some thoughts on OpenCalais

Went to an “Emerging Technology Brown Bag” gathering today, organized by Columbia’s Center for New Media Teaching and Learning (CCNMTL). It featured Thomas Tague, who works with OpenCalais. OpenCalais is a free web service developed by Thomson Reuters that automatically extracts rich semantic metadata from digitized content. It integrates with a number of open source platforms such as WordPress, Drupal, and Firefox.

Though the talk was partly framed as being about the future of journalism, no journalists were present (it is spring break, to be fair…). Most of the people who came were librarians.

I’m not a techie, so I shan’t pretend to be able to explain the details of what Tague & co are doing, but a few basic points are worth fastening on to, and they spur the commentary that follows.

Users of the service submit chunks of plain text, and Calais returns the text enhanced with categories (a percentage fit with sport, news, business, etc.), named entities (people, companies, etc.), facts (John Doe said “x”) and events (company Y released its quarterly earnings). These extraction capacities are obviously not generic: they are limited and rely on some pre-set categories. Even so, this is interesting and useful for all sorts of large-scale analysis and can enrich datasets in many ways, in addition to being hugely useful for auto-coding of large databases operated by companies such as, well, Thomson Reuters.
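For readers who like to see the shape of such a thing, here is a toy sketch in Python of "text in, enriched text out". To be clear, this is not the OpenCalais API or its response format: the names `Enriched`, `enrich`, `CATEGORY_KEYWORDS` and `KNOWN_ENTITIES` are all made up for illustration, and the crude keyword matching stands in for whatever far more sophisticated extraction the real service performs.

```python
from dataclasses import dataclass, field


@dataclass
class Enriched:
    """Hypothetical container: the original text plus extracted metadata."""
    text: str
    categories: dict = field(default_factory=dict)  # category -> share of keyword hits
    entities: list = field(default_factory=list)    # (name, type) pairs


# Assumed, hand-written lookup tables; a real service would not work this way.
CATEGORY_KEYWORDS = {
    "business": {"earnings", "quarterly", "revenue", "company"},
    "sport": {"match", "goal", "season", "team"},
}
KNOWN_ENTITIES = {"Thomson Reuters": "Company", "John Doe": "Person"}


def enrich(text: str) -> Enriched:
    # Lowercase each word and strip surrounding punctuation.
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    # Count keyword hits per category, then turn counts into shares.
    hits = {cat: sum(w in kws for w in words) for cat, kws in CATEGORY_KEYWORDS.items()}
    total = sum(hits.values())
    categories = {cat: n / total for cat, n in hits.items()} if total else {}
    # Naive entity spotting: exact substring match against a known list.
    entities = [(name, kind) for name, kind in KNOWN_ENTITIES.items() if name in text]
    return Enriched(text, categories, entities)


result = enrich("John Doe said Thomson Reuters released its quarterly earnings.")
print(result.categories)
print(result.entities)
```

The point of the sketch is only the structure of the result: the submitted text comes back unchanged, accompanied by weighted categories and typed entities that can then be counted, filtered, or joined with other data.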

In terms of journalism, Tague said that Calais can help with new forms of reporting (based on meta-data), can help investigative reporting of a more traditional kind be more effective, and can cut editorial costs to free up more dollars for actual reporting. And since they are developing this in the open and making many of the tools available for free, it lends itself to experiments of all sorts.

I think what is even more interesting and instructive are the reasons Tague offered for why Thomson Reuters, a “most-assuredly for-profit company”, is doing this and making it available. He pointed to two reasons.

(1) engaging in open development helps the company to stay up-to-date technologically.

(2) they need to prepare themselves for a new information environment where their traditional business model, charging for exclusive content, may not be viable on its own.

The second reason is obviously hugely important in many ways. If Thomson Reuters, a veritable knowledge-economy giant with roughly $11 billion a year in revenue and 50,000 employees (that is, about five times the size of the New York Times Company), behind probably only a few intelligence services in the amount of information they produce, store, and process (sorry, Bloomberg), is preparing itself for a post-walled garden future, then those who think they can save their much smaller heritage industry organizations by building their own tiny little walled square-inch gardens should maybe think again (newspapers, anyone?).

Tague argued that Thomson Reuters can still do a fair amount of business on the basis of charging for access to raw data, but that they also needed to start offering richer meta-data that customers could mix and match easily with other sources in the linked data cloud, and that they most of all needed to move towards a position where their premium good was not simply information and access, but information you can trust. Information that is not stale, misleading, off-the-point, etc.

In the business world, this is of course crucial, and worth paying the kind of premium for that you can build giants like Bloomberg and, well, Thomson Reuters, on. Whether it works for news is anybody’s guess, and in most cases (i.e. “average American newspaper” cases) my guess would be a resounding “no”. Unless the telecoms and content industries manage to change the internet as we know it dramatically over the coming years, it is hard for me to imagine that the Cleveland Plain Dealer controls content that can be monetized and is valuable enough to sustain a walled garden (what its owner, Advance Publications, has and hasn’t, can and can’t, is another story).

So for Tague, the future is not so much one of information monopoly for companies like Thomson Reuters (their traditional model), or of control over the means of dissemination and thus a certain audience (the old mass media model), but of interoperability, collaboration, and, importantly, standard-setting. Fascinating.

– – – Ben Peters sent me the below comment, which for some reason didn’t make it through to the site – – –

Fascinating stuff, here, as usual. I’m going to speculate that these categories are provided by some automated inference maker that bases its inferred categories for the given corpus of text off of a much larger corpus of pre-categorized text. I wonder, if this is not entirely wrong, how the categories in the source text are procured? Shouldn’t they slide around with dynamic use? Every assigned category should come with meta data: P of error, R², etc., etc. How these tech questions are answered, I think, will tell us the most about the longer-term, more important questions you raise about walls and business. I’ll be glad to keep an eye open with you on this one.


