A leaked version of a new US policy on making federal data public is circulating on Capitol Hill. The policy raises some interesting and difficult questions regarding federally funded STM data. How these questions will be resolved is far from clear.
The focus of the policy is the Data.gov website, which has been around for about two years. Data.gov lists and links to federal datasets and tools. It has several hundred thousand items, and most of the content is either statistical data or geospatial data. Not surprisingly, the Census Bureau accounts for most of the former, while the US Geological Survey (USGS) provides most of the latter, but every department and agency has put something in. This is because the Office of Management and Budget (OMB) directed that everyone put in at least three datasets, and no one wants to fight OMB. The department and agency listings are available online.
However, there is almost no STM research data in Data.gov, just a few bits and pieces. The Weather Service has some, such as past hurricane tracks, and the Environmental Protection Agency (EPA) has some, but there is almost nothing else. The National Institutes of Health (NIH), which accounts for half of the federal basic research budget, has no data in Data.gov. The National Science Foundation (NSF) only has statistical data from its science indicators group, and so it goes across the federal R&D community.
All this may now change because the draft data policy takes a new approach to feeding Data.gov. Now, every department and agency is directed to inventory all of its funded datasets and put them all into Data.gov to the extent practicable. This is a fundamental change from voluntary to mandatory inclusion and from “a few of your best” to “everything you have.”
It is far from clear how the STM agencies might respond to such a mandate. They have an enormous number of research datasets, in principle everything that has been federally funded. Most research generates data of some sort. Making it all publicly available would be a colossal task, would raise a host of administrative and legal issues, and would probably swamp Data.gov in the process. Moreover, some of these datasets are gigantic. The research communities have their own big pipe networks just to handle them. It is not clear how they can be made publicly available.
The draft policy seems to be oblivious to this monster issue. The concept of research data is not even mentioned. In fact, the definition of data is incredibly broad: it is simply “structured information,” which would make every sentence data. Only intelligence data is exempted.
If this policy is issued, it may be a wild ride for the US STM funding agencies. So far, the draft has not been published for agency or public comment, but it should be. This is a huge step that may not be practical. The title of the undated draft is “Managing Government Information as an Asset throughout its Life Cycle to Promote Interoperability and Openness.” Now that it has been leaked, it may soon surface through official channels.
On the other hand, if a lot of STM data is published via Data.gov, it might have significant implications for the publishing industry. I welcome suggestions as to what these might be, either positive or negative.