A leaked version of a new US policy on making federal data public is circulating on Capitol Hill. The policy raises some interesting and difficult questions regarding federally funded STM data. How these questions will be resolved is far from clear.
The focus of the policy is the Data.gov website, which has been around for about two years. Data.gov lists and links to federal datasets and tools. It has several hundred thousand items, and most of the content is either statistical data or geospatial data. Not surprisingly, the census accounts for most of the former, while the US Geological Survey (USGS) provides most of the latter, but basically every department and agency has put something in. This is because the Office of Management and Budget (OMB) directed that everyone put in at least three datasets, and no one wants to fight OMB. The department and agency listings are available online.
However, there is almost no STM research data in Data.gov, just a few bits and pieces. The Weather Service has some, such as past hurricane tracks, and the Environmental Protection Agency (EPA) has some, but there is essentially nothing else. The National Institutes of Health (NIH), which accounts for half of the federal basic research budget, has no data in Data.gov. The National Science Foundation (NSF) only has statistical data from its science indicators group, and so it goes across the federal R&D community.
All this may now change because the draft data policy takes a new approach to feeding Data.gov. Now, every department and agency is directed to inventory all of its funded datasets and put them all into Data.gov to the extent practicable. This is a fundamental change from voluntary to mandatory inclusion and from “a few of your best” to “everything you have.”
It is far from clear how the STM agencies might respond to such a mandate. They have an enormous number of research datasets, in principle everything that has been federally funded. Most research generates data of some sort. Making it all publicly available would be a colossal task, would raise a host of administrative and legal issues, and would probably swamp Data.gov in the process. Moreover, some of these datasets are gigantic. The research communities have their own big pipe networks just to handle them. It is not clear how they can be made publicly available.
The draft policy seems to be oblivious to this monster issue. The concept of research data is not even mentioned. In fact, the definition of data is incredibly broad: simply “structured information,” which makes every sentence data. Only intelligence data is exempted.
If this policy is issued, it may be a wild ride for the US STM funding agencies. So far, the draft has not been published for agency or public comment, but it should be. This is a huge step which may not be practical. The title of the undated draft is, “Managing Government Information as an Asset throughout its Life Cycle to Promote Interoperability and Openness.” Now that it has been leaked, it may soon surface through official channels.
On the other hand, if a lot of STM data is published via Data.gov, it might have significant implications for the publishing industry. I welcome suggestions as to what these might be, either positive or negative.


Whoohoo, that would be awesome! Can you provide a link to the ‘leaked version’ so we can drum up support?
Posted by brembs | Jan 17, 2013, 7:41 am
I do not share your whoohoo because I think this policy would be incredibly stupid but it certainly is a pot boiler. I have not found the draft on-line so far.
Posted by David Wojick | Jan 17, 2013, 2:13 pm
I have been invited to speak on this topic at Columbia (http://conferences.cdrs.columbia.edu/rds/index.php/rds/rds) on 2012-02-27, along with librarians and publishers. I shall argue that those companies which embrace the new rapidly will prosper; those that try to extrapolate from the present article publishing will fail in this market.
(I wrote more but I wasn’t able to post it for technical reasons so I’ll blog it)
Posted by petermurrayrust | Jan 17, 2013, 8:22 am
Very interesting. Where is this circulating? I guess nobody has put it online yet?
Posted by Dan | Jan 17, 2013, 9:41 am
I have not found it on-line but I included the title in case someone puts it there.
Posted by David Wojick | Jan 17, 2013, 2:10 pm
As always, the devil is in the details. Is this policy going to really cover every data-set that has been funded by an agency (awesome, but pretty impractical), or just those agencies’ own internal data-sets? There is still a long way to go on the latter. For example, it looks as if there is some NSF funding data up, but nowhere *near* everything that they have…
As for bandwidth, that is likely less of an issue than it first appears. Data.gov is for the moment more finding aid than repository; most everything in there is a link to an external agency site. One issue that *is* potentially vexing is organization — keyword searching is the primary finding tool, and it is only going to get less useful as the number of items in the catalog increases.
Posted by Ed Sperr | Jan 17, 2013, 2:15 pm
When it comes to OMB policy there are no details. If the policy is wrong it is just wrong.
As for bandwidth, I know Data.gov is just a link site. But we built a new big pipe network just for the LHC. Many communities have these big pipe networks. It will be a major project to make them publicly accessible, and to what end? Should we make every accelerator run publicly accessible? Every satellite feed? The concept is mind-boggling. It suggests that people have no idea what big science has become.
Posted by David Wojick | Jan 17, 2013, 4:22 pm