There are movements afoot to create an era of open data standards, with proponents arguing that publishers should be doing more to support open data. Governments, visionaries, and technologists are all promoting the seemingly wholesome and harmless notion that direct access to the underlying data is virtuous and necessary, and by using the term “open,” the illusion is that all we have to do is stop keeping it closed, and the data will flow without a problem.
In an interesting post written from bitter experience, Nat Torkington of O’Reilly’s Open Source Convention confesses that the open data vision is incomplete because it’s been built on technological enthusiasm while overlooking the very real barriers to making data not only open but useable:
- Funding for data continuity and marketing
- A use-case to ensure the right data are being created in the right way for the right people
Torkington talks about the problems his teams have encountered in repatriating data from ad hoc, internal, or idiosyncratic systems into robust, normalized, and accessible data repositories. Believe it or not, this costs money. Most researchers build data systems for the grants they have, and their grants only support the data for that particular study. Once a study is done, the spreadsheets shut down, and there is no budget for maintenance, updating, or migration of the dataset. There’s also no overall infrastructure or standard that makes one data set able to interface with any other. Each data set is an island. As Torkington puts it:
. . . it costs money to make existing data open. That sounds like an excuse, and it’s often used as one, but underneath is a very real problem: existing procedures and datasets aren’t created, managed, or distributed in an open fashion. This means that the data’s probably incomplete, the documentation’s not great, the systems it lives on are built for internal use only, and there’s no formal process around managing and distributing updates. It costs money and time to figure out the new processes, build or buy the new systems, and train the staff.
So, there’s no money to make the data come together. Next, there’s no money to let users know the data exist and are available. This is a marketing problem, but a real one. As Torkington puts it:
There’s value locked up in government data, but you only realise that value when the datasets are used. Once you finish the catalogue, you have to market it so that people know it exists. Not just random Internet developers, but everyone who can unlock that value. This category, “people who can use open data in their jobs” includes researchers, startups, established businesses, other government departments, and (yes) random Internet hackers, but the category doesn’t have a name and it doesn’t have a Facebook group, newsletter, AGM, or any other way for you to reach them easily.
Torkington lists five “different types of Open Data groupie,” and his experience in the area shines through:
- low-polling governments who want to see a PR win from opening their data
- transparency advocates who want a more efficient and honest government
- citizen advocates who want services and information to make their lives better
- open advocates who believe that governments act for the people, and therefore government data should be available for free to the people
- wonks who are hoping that releasing datasets of public toilets will deliver the same economic benefits to the country as did opening the TIGER geo/census dataset
I encountered an Open Data advocate late last year at the Online Information meeting in London. He and a group of like-minded people have started a new initiative called DataCite:
The objectives of this initiative are to establish easier access to scientific research data on the Internet, to increase acceptance of research data as legitimate, citable contributions to the scientific record, and to support data archiving that will permit results to be verified and re-purposed for future study. DataCite will promote data sharing, increased access, and better protection of research investment.
By focusing on citation rather than on the data themselves, this initiative seems to skirt the problems Torkington has identified, and that’s not a promising sign. What will it matter if data are citable if they’re idiosyncratic, not maintained, isolated, and serve no clear purpose beyond the study for which they were generated?
Ultimately, Torkington’s model of some promising approaches to Open Data seems very familiar — know why you want to create the data, identify a group who can use the data, build community and data simultaneously, then create useful applications of the data. Unspoken is the fact that these useful applications of the data would probably be what would generate the revenue to maintain the data and elaborate upon it.
At yesterday’s PSP meeting in Washington, DC, I used some of the information I’d gathered for this post during my wrap-up session for the pre-conference. Members of the audience with experience reviewing and publishing large data sets mentioned how reviewing reams of data and preparing them for publication seems likely to dwarf article peer-review requirements in both the time needed and the intensity of effort. Yet, everyone expects data coming from publishers to be usable.
The culture of free often overlooks the real costs the real world creates outside of our own hopes and dreams.
6 Thoughts on "Rethinking Open Data Initiatives: It Turns Out Open Data Costs Money, Needs a Purpose"
This is a very big issue. The US Government is actively exploring the issue of selecting, preserving, and providing access to federally funded scientific data. One of the leading exploratory groups is the Interagency Working Group on Digital Data. (I have done staff work for them.) Their first report came out just over a year ago:
“Harnessing the Power of Digital Data for Science and Society”
The next report, which addresses federal science agency policies, is in the works. Cost is indeed a big issue, because redirecting a substantial fraction of the research budget may be involved. So is preservation infrastructure because, as Kent notes, research grants and contracts run out. Then too, there are intellectual property issues. For these reasons, and others, selection of data for preservation and access is a major policy issue.
At the same time, however, many scientific communities are building local data sharing systems. These range from the Sloan Digital Sky Survey to the Large Hadron Collider. How these grassroots efforts will ultimately play into federal policy remains to be seen. The ultimate role of scientific publishing in this emerging system is also unknown. This is a fascinating issue, but a very difficult one.
For many, if not most scientists, this is a huge issue, but at the same time, it seems an almost insurmountable problem on so many levels. Everyone agrees it’s a good idea in principle, but implementing it, well, that’s another story.
Many labs I know are facing great challenges archiving their data for their own personal use. Each member of an imaging lab can generate terabytes of data every week. Keeping this data around for future mining requires a huge amount of storage space, not to mention redundant backups. It’s unclear which storage media and methods are the most cost-efficient and the most likely to last longer than the next technology cycle. If you then ask the lab to serve up those terabytes and terabytes of data to all comers, you’re adding in both a huge expense and a service/maintenance nightmare.
The next huge issue is standardization of data. This is fairly easy for some types of data: DNA sequences and protein structures can easily conform to a standard file format and be put into a database. But that’s only a small fraction of the data types being collected. Images, time-lapse movies, western blots, electrophysiological recordings, karyotypes, behavioral observations: does one need to come up with an absolute standard format for recording data for every single method in use? How much time should a scientist spend converting his data into that format? Couldn’t that time be better used doing more experiments?
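The contrast the comment draws can be made concrete with sequence data, which fits a simple, universally agreed text format (FASTA: a “>” header line followed by sequence lines). A minimal sketch, using only the Python standard library and hypothetical record contents, shows why this kind of data standardizes so cheaply; there is no comparably trivial round-trippable format for images, movies, or behavioral observations:

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line; flush the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, width=60):
    """Serialize (header, sequence) tuples back to FASTA text, wrapping sequences."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"
```

A few dozen lines suffice for a lossless read/write cycle, which is exactly the property that makes sequence databases feasible and that most other experimental data types lack.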
Last year Steven Wiley wrote a great article explaining why this is so intractable (you may need to freely register with The Scientist to read the whole thing):
Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes. Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.
The most significant issue inhibiting data sharing, however, is biologists’ lack of motivation to do it. In order to sufficiently control the experimental context to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful, and implement a common set of experimental conditions. Getting biologists to agree to such an approach is akin to asking people to agree on a single religion.
He goes on to describe the Alliance for Cell Signaling, and how a huge amount of work went into creating an open set of data on cellular responses, and how that data has been fairly useless for other researchers. He makes the intriguing point that as technology and high-throughput techniques continue to improve, it may just be easier to generate your own data set than to try to integrate someone else’s.
Hi, Kent. Thanks for the kind words about my article. To be clear, I have no problems with “the culture of free”; I come from the world of open source, where a culture of free is doing quite well and many companies and people have done well from that culture. We are only at the start of figuring out how to translate open source successes to open data. Some practices will be shared, some we’ll abandon, some we’ll need to invent.
I think I know where US Federal policy is going on this issue, and they fund most of the basic research. What David C. terms insurmountable or intractable are the Utopian visions, precisely for the reasons Kent states — cost and purpose. But policy is moving forward.
First, as Kent says, it is very expensive to document, preserve, and provide data to others. I have seen estimates of 20-30% of project cost. But we are not going to reallocate 30% of the Federal research budget to pay for open data management, which would mean cutting real research by that amount, nor are we going to increase the budget to pay for open data. Then too, most data are probably useless to others.
Given these constraints the reasonable thing to do is obvious, and that is where policy is headed. We will apply what I call the “best first” heuristic. This means identifying the most valuable data and then opening up as much as we can afford to. Most funding agencies are already doing this on an ad hoc basis. Policies, as well as grant and contract clauses to implement them, are developing.
Ironically, the biggest obstacle is a lack of cost data, which is sorely needed to motivate policy development. Federal spending policies are not made based on wishes. So far as I can tell OMB has yet to engage on this issue, and little can happen until it does. The other wild card is that the House Science Committee is about to get a new Chairman. But there is no question that this issue is in motion at the Federal level.