
There are movements afoot to create an era of open data standards, with proponents arguing that publishers should be doing more to support open data. Governments, visionaries, and technologists are all promoting the seemingly wholesome and harmless notion that direct access to the underlying data is virtuous and necessary. The term “open” creates the illusion that all we have to do is stop keeping data closed, and it will flow without a problem.

In an interesting post written from bitter experience, Nat Torkington of O’Reilly’s Open Source Convention confesses that the open data vision is incomplete because it’s been built on technological enthusiasm while overlooking the very real barriers to making data not only open but usable:

  1. Funding for data continuity and marketing
  2. A use-case to ensure the right data are being created in the right way for the right people

Torkington talks about the problems his teams have encountered in repatriating data from ad hoc, internal, or idiosyncratic systems into robust, normalized, and accessible data repositories. Believe it or not, this costs money. Most researchers build data systems for the grants they have, and their grants only support the data for that particular study. Once a study is done, the spreadsheets shut down, and there is no budget for maintenance, updating, or migration of the dataset. There’s also no overall infrastructure or standard that lets one dataset interface with any other. Each dataset is an island. As Torkington puts it:

. . . it costs money to make existing data open. That sounds like an excuse, and it’s often used as one, but underneath is a very real problem: existing procedures and datasets aren’t created, managed, or distributed in an open fashion. This means that the data’s probably incomplete, the documentation’s not great, the systems it lives on are built for internal use only, and there’s no formal process around managing and distributing updates. It costs money and time to figure out the new processes, build or buy the new systems, and train the staff.

So, there’s no money to make the data come together. Next, there’s no money to let users know the data exist and are available. This is a marketing problem, but a real one. As Torkington puts it:

There’s value locked up in government data, but you only realise that value when the datasets are used. Once you finish the catalogue, you have to market it so that people know it exists. Not just random Internet developers, but everyone who can unlock that value. This category, “people who can use open data in their jobs” includes researchers, startups, established businesses, other government departments, and (yes) random Internet hackers, but the category doesn’t have a name and it doesn’t have a Facebook group, newsletter, AGM, or any other way for you to reach them easily.

Torkington lists five “different types of Open Data groupie,” and his experience in the area shines through:

  1. low-polling governments who want to see a PR win from opening their data
  2. transparency advocates who want a more efficient and honest government
  3. citizen advocates who want services and information to make their lives better
  4. open advocates who believe that governments act for the people therefore government data should be available for free to the people
  5. wonks who are hoping that releasing datasets of public toilets will deliver the same economic benefits to the country as did opening the TIGER geo/census dataset

I encountered an Open Data advocate late last year at the Online Information meeting in London. He and a group of like-minded people have started a new initiative called DataCite:

The objectives of this initiative are to establish easier access to scientific research data on the Internet, to increase acceptance of research data as legitimate, citable contributions to the scientific record, and to support data archiving that will permit results to be verified and re-purposed for future study. DataCite will promote data sharing, increased access, and better protection of research investment.

By focusing on making data citable, this initiative seems to skirt the problems Torkington has identified, and that’s not a promising sign. What will it matter if data are citable if they’re idiosyncratic, not maintained, isolated, and serve no clear purpose beyond the study for which they were generated?

Ultimately, Torkington’s model of some promising approaches to Open Data seems very familiar: know why you want to create the data, identify a group who can use the data, build community and data simultaneously, then create useful applications of the data. Left unspoken is that these useful applications would probably generate the revenue needed to maintain the data and elaborate upon it.

At yesterday’s PSP meeting in Washington, DC, I used some of the information I’d gathered for this post during my wrap-up session for the pre-conference. Members of the audience with experience reviewing and publishing large data sets mentioned how reviewing reams of data and preparing them for publication seems likely to dwarf article peer-review requirements in both the time needed and the intensity of effort. Yet, everyone expects data coming from publishers to be usable.

The culture of free often overlooks the real costs the real world creates outside of our own hopes and dreams.
