Comments on: Rethinking Open Data Initiatives: It Turns Out Open Data Costs Money, Needs a Purpose

By: Visualize This: LinkTV and Sunlight Labs Move to Put Data Into Action « The Scholarly Kitchen

Wed, 10 Mar 2010 10:19:01 +0000

[…] and data consistency to support long-term research adoption. As has been noted here before, open data initiatives can be expensive to create, and they need a clear purpose. Whether Sunlight or ViewConnect will be useful is something only time will […]

By: David Wojick, Ph.D.

David Wojick, Ph.D. — Wed, 10 Feb 2010 15:19:15 +0000

I think I know where US Federal policy is going on this issue, and the Federal agencies fund most of the basic research. What David C. terms insurmountable or intractable are the Utopian visions, precisely for the reasons Kent states: cost and purpose. But policy is moving forward.

First, as Kent says, it is very expensive to document, preserve, and provide data to others. I have seen estimates of 20-30% of project cost. But we are not going to reallocate 30% of the Federal research budget to pay for open data management, which would mean cutting real research by that amount, nor are we going to increase the budget to pay for open data. Then too, most data is probably useless to others.

Given these constraints, the reasonable thing to do is obvious, and that is where policy is headed. We will apply what I call the “best first” heuristic. This means identifying the most valuable data and then opening up as much of it as we can afford to. Most funding agencies are already doing this on an ad hoc basis. Policies, as well as grant and contract clauses to implement them, are developing.

Ironically, the biggest obstacle is a lack of cost data, which is sorely needed to motivate policy development. Federal spending policies are not made based on wishes. So far as I can tell, OMB has yet to engage on this issue, and little can happen until it does. The other wild card is that the House Science Committee is about to get a new Chairman. But there is no question that this issue is in motion at the Federal level.

By: Nat Torkington

Nat Torkington — Wed, 10 Feb 2010 09:21:20 +0000

Hi, Kent. Thanks for the kind words about my article. To be clear, I have no problems with “the culture of free”. I come from the world of open source, where a culture of free is doing quite well and many companies and people have done well from that culture. We are only at the start of figuring out how to translate open source successes to open data. Some practices will be shared, some we’ll abandon, some we’ll need to invent.

By: Science and Web 2.0: Talking About Science vs. Doing Science « The Scholarly Kitchen

Mon, 08 Feb 2010 16:46:53 +0000

[…] the age of the sequenced genome and systems biology, scientists are more and more often dealing with enormous data sets. Experiments require collaboration on a scale never seen before. Though they’re just in their […]

By: David Crotty

David Crotty — Thu, 04 Feb 2010 15:39:48 +0000

For many, if not most, scientists, this is a huge issue, but at the same time it seems an almost insurmountable problem on many levels. Everyone agrees it's a good idea in principle, but implementing it is another story.

Many labs I know are facing great challenges archiving their data for their own use. Each member of an imaging lab can generate terabytes of data every week. Keeping this data around for future mining requires a huge amount of storage space, not to mention redundant backups. It's unclear which storage media and methods are the most cost-efficient and the most likely to last longer than the next technology cycle. If you then ask the lab to serve up those terabytes of data to all comers, you're adding both a huge expense and a service and maintenance nightmare.

The next huge issue is standardization of data. This is fairly easy for some types of data: DNA sequences and protein structures can easily conform to a standard file format and be put into a database. But that's only a small fraction of the data types being collected. For images, time-lapse movies, western blots, electrophysiological recordings, karyotypes, and behavioral observations, does one need to come up with an absolute standard format for recording data for every single method in use? How much time should a scientist spend converting his data into that format? Couldn't that time be better used doing more experiments?

Last year Steven Wiley wrote a great article explaining why this is so intractable (you may need to register for free with The Scientist to read the whole thing):

Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes. Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results. The most significant issue inhibiting data sharing, however, is biologists' lack of motivation to do it. In order to sufficiently control the experimental context to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful, and implement a common set of experimental conditions. Getting biologists to agree to such an approach is akin to asking people to agree on a single religion.

He goes on to describe the Alliance for Cell Signaling, and how a huge amount of work went into creating an open set of data on cellular responses, and how that data has been fairly useless for other researchers. He makes the intriguing point that as technology and high-throughput techniques continue to improve, it may just be easier to generate your own data set than to try to integrate someone else's.

By: David Wojick, Ph.D

David Wojick, Ph.D — Thu, 04 Feb 2010 11:59:46 +0000

This is a very big issue. The US Government is actively exploring the issue of selecting, preserving, and providing access to, federally funded scientific data. One of the leading exploratory groups is the Interagency Working Group on Digital Data. (I have done staff work for them.) Their first report came out just over a year ago:

“Harnessing the Power of Digital Data for Science and Society”
http://www.nitrd.gov/About/Harnessing_Power.aspx

The next report, which addresses federal science agency policies, is in the works. Cost is indeed a big issue, because redirecting a substantial fraction of the research budget may be involved. So is preservation infrastructure because, as Kent notes, research grants and contracts run out. Then too, there are intellectual property issues. For these reasons, and others, selection of data for preservation and access is a major policy issue.

At the same time, however, many scientific communities are building local data sharing systems. These range from the Sloan Digital Sky Survey to the Large Hadron Collider. How these grassroots efforts will ultimately play into federal policy remains to be seen. The ultimate role of scientific publishing in this emerging system is also unknown. This is a fascinating issue, but a very difficult one.