A leaked version of a new US policy on making federal data public is circulating on Capitol Hill. The policy raises some interesting and difficult questions regarding federally funded STM data. How these questions will be resolved is far from clear.
The focus of the policy is the Data.gov website, which has been around for about two years. Data.gov lists and links to federal datasets and tools. It has several hundred thousand items, most of them either statistical data or geospatial data. Not surprisingly, the Census Bureau accounts for most of the former, while the US Geological Survey (USGS) provides most of the latter, but every major department and agency has put something in. This is because the Office of Management and Budget (OMB) directed that everyone contribute at least three datasets, and no one wants to fight OMB. The department and agency listings are available online.
However, there is almost no STM research data in Data.gov, just a few bits and pieces. The Weather Service has some, such as past hurricane tracks, and the Environmental Protection Agency (EPA) has some, but there is essentially nothing else. The National Institutes of Health (NIH), which accounts for half of the federal basic research budget, has no data in Data.gov. The National Science Foundation (NSF) has only statistical data from its science indicators group, and so it goes across the federal R&D community.
All this may now change, because the draft data policy takes a new approach to feeding Data.gov. Every department and agency is directed to inventory all of its funded datasets and put them into Data.gov to the extent practicable. This is a fundamental change from voluntary to mandatory inclusion, and from “a few of your best” to “everything you have.”
It is far from clear how the STM agencies might respond to such a mandate. They have an enormous number of research datasets, in principle everything that has been federally funded. Most research generates data of some sort. Making it all publicly available would be a colossal task, would raise a host of administrative and legal issues, and would probably swamp Data.gov in the process. Moreover, some of these datasets are gigantic. The research communities have their own big pipe networks just to handle them. It is not clear how they can be made publicly available.
The draft policy seems oblivious to this monster issue. The concept of research data is not even mentioned. In fact, the definition of data is incredibly broad: simply “structured information,” which makes every sentence data. Only intelligence data is exempted.
If this policy is issued, it may be a wild ride for the US STM funding agencies. The title of the undated draft is “Managing Government Information as an Asset throughout its Life Cycle to Promote Interoperability and Openness.” So far, the draft has not been published for agency or public comment, but it should be, because this is a huge step which may not be practical. Now that it has been leaked, it may soon surface through official channels.
On the other hand, if a lot of STM data is published via Data.gov, it might have significance for the publishing industry. I welcome suggestions as to what the implications might be, either positive or negative.
24 Thoughts on "Leaked Data Policy Raises Monster STM Data Issues"
Whoohoo, that would be awesome! Can you provide a link to the ‘leaked version’ so we can drum up support?
I have been invited to speak on this topic at Columbia (http://conferences.cdrs.columbia.edu/rds/index.php/rds/rds) on 2012-02-27, along with librarians and publishers. I shall argue that those companies which rapidly embrace the new will prosper – those that try to extrapolate from present article publishing will fail in this market.
(I wrote more but I wasn’t able to post it for technical reasons so I’ll blog it)
As always, the devil is in the details. Is this policy really going to cover every dataset that has been funded by an agency (awesome, but pretty impractical), or just the agencies’ own internal datasets? There is still a long way to go even on the latter. For example, it looks as if there is some NSF funding data up, but nowhere *near* everything that they have…
As for bandwidth, that is likely less of an issue than it first appears. Data.gov is for the moment more finding aid than repository; most everything in there is a link to an external agency site. One issue that *is* potentially vexing is organization — keyword searching is the primary finding tool, and it is only going to get less useful as the number of items in the catalog increases.
When it comes to OMB policy there are no details. If the policy is wrong it is just wrong.
As for bandwidth I know Data.gov is just a link site. But we built a new big pipe network just for the LHC. Many communities have these big pipe networks. It will be a major project to make them publicly accessible and to what end? Should we make every accelerator run publicly accessible? Every satellite feed? The concept is mind boggling. It suggests that people have no idea what big science has become.
“Most research generates data of some sort. Making it all publicly available would be a colossal task, would raise a host of administrative and legal issues, and would probably swamp Data.gov in the process”
We were talking about this yesterday – do you have any data to support this assertion? I agree that some fields produce huge amounts of data, and others much less, but what’s the expected total amount? Is it really going to swamp Data.gov?
Look at it this way. Right now the listing for DOE (which I know best) is a few dozen items. DOE funds more physical science research than anyone else including the big science stuff. If they took every bit of data from every project they have funded in the last ten years I expect it would be well over a million items. Every run of every instrument? Every model run? Every federally funded measurement?
More broadly there are something like a million and a half journal articles a year and the US probably funds half of that work. Plus there are many applied projects that do not publish. Every one of these efforts produces data, in many cases many kinds of data. We are probably talking several million datasets a year given the loose definition. All to be listed in Data.gov?
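The back-of-envelope arithmetic above can be sketched as follows. Note that the per-article dataset multiplier is my own illustrative assumption, not a figure from the draft policy:

```python
# Rough estimate of annual US-funded datasets under the draft's loose
# definition of "data". All three input figures are assumptions.
ARTICLES_PER_YEAR = 1_500_000   # ~1.5 million journal articles per year
US_FUNDED_SHARE = 0.5           # assume the US funds about half of that work
DATASETS_PER_ARTICLE = 4        # assume several kinds of data per project

us_funded_articles = int(ARTICLES_PER_YEAR * US_FUNDED_SHARE)
datasets_per_year = us_funded_articles * DATASETS_PER_ARTICLE
print(f"{datasets_per_year:,} datasets per year")  # prints "3,000,000 datasets per year"
```

Even with conservative multipliers, and before counting applied projects that never publish, the total lands in the millions per year, which is the scale Data.gov would have to absorb.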
Not realistic? My background is federal regulation where every word counts and has legal authority. That is what we are dealing with here. These are OMB rules not Utopian op-eds. The reporting requirements alone would be overwhelming. The fact that it cannot be done as written makes it a very bad rule. OIRA needs to rethink this mandate. It is absurd as written.
Every little bit of data produced by any government funded scientist is indeed a crazy amount, and hopefully the final policy won’t even contemplate that. A more sensible approach may be requiring data archiving at publication combined with some form of encouragement to publish data papers – this would ensure that data entering the public realm were well curated and accompanied by a relatively detailed description of how and why they were collected.
The fundamental problem that you are not addressing, and which no one wants to address, is that the concept of data is hopelessly vague. It ranges over many orders of magnitude. You keep using the term data as though it were specific. We cannot afford to curate all the data associated with every paper, unless you just mean the final data. What do you mean by the data? That is the big question.
I agree that there is no universal definition of ‘data’. However, that doesn’t mean that it can’t be defined for an individual paper or project.
The phrase in the policy we use (the JDAP) says ‘the data required to reproduce the results in the paper’, which is a definable set of numbers and other pieces of information. It’s normally the final data used to plot the graphs and run the statistics. That’s why archiving data at publication is effective – you’re asking for a definable dataset at a set time point. Why couldn’t the data.gov policy just focus on this too?
One other useful step that stems from targeting data associated with articles is that one can ask reviewers whether the data archived by the paper is sufficient. Over time this will ensure that a community standard develops over what data can reasonably be expected to be archived for each type of paper.
“One other useful step that stems from targeting data associated with articles is that one can ask reviewers whether the data archived by the paper is sufficient.”
How effective would this be as part of the peer review process? In my experience, reviewers don’t spend a lot of time (if any) on the often voluminous supplemental material. Is it fair to ask them to go through a dataset in detail to determine its completeness and utility? I’m not saying it’s a bad idea, but there’s always going to be the question of how to best spend a researcher’s most precious commodity, time.
We don’t ask them to look over the data itself – we just have a section at the end of the ms that lists the various datasets and where they will be archived, and the reviewers tell us whether that list is sufficient. Complicated papers have many different component datasets, and the reviewers are the people best placed to tell us whether all the right ones are in this section. It takes me about fifteen minutes to do this for an unfamiliar paper, so if you’re an expert in the field and you’ve just reviewed it, this step should take much less time.
Admittedly, not that many reviewers are doing it yet (we give them an ‘I didn’t check’ option), but this is the sort of change in reviewer behaviour that takes a year or two to come to fruition.
Tim, when you say the list must be sufficient do you give your reviewers any guidance on this concept of sufficiency? What does it say? How about the authors? Policies need written specification and that is the crunch point for data policy given the vagueness of the concept of data.
So the review process covers their intentions, not their actual actions? Are there any safeguards to ensure they’ve done what they said they were going to do?
Enforcement of policy, as many funding agencies are finding, can be a difficult and expensive process.
Enforcement is by far the greatest cost to the regulator of a regulatory program, hence it should be for publishers as well if they impose such mandates. Since the number of ongoing data archiving requirements grows every year, so should the enforcement cost. Moreover, if enforcement is lacking then we may get what I call selective enforcement, which means just going after those you do not like. Enforcing a data policy is a complex business.
Tim, I assume this is the JDAP guidance statement:
“<> requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as << list of approved archives here <<. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species." From http://datadryad.org/pages/jdap
There is no explanation of what data is required except "data supporting the results in the paper". As regulatory language this is completely vague. It creates what I call "scapegoat law" where the meaning is worked out by trial and error, in this case with a bunch of authors. It may even happen with every author if the papers vary a lot, or with every journal an author submits to. This kind of confusion creates a lot of burden on everyone.
As written, every publication is also data, and as such would need to be listed on Data.gov. An article is certainly structured information, which is the draft’s definition of data. So is every email, for that matter.
The basic mistake lies in going from “give us your best” to “give us everything” because “everything” is hopelessly vague and broad. I have a diagnostic system of 126 confusions in policies and regulations (or any expository text) and “vague concept” is the worst. This draft is a prescription for confusion when applied to STM and confusion is both disruptive and expensive.
As for your publication scheme, you have a similar problem, plus you also want the data collection described, which is a “vague rule” in my taxonomy of confusions. For example, data is often generated, not collected.
If the policy is written in such a way that everyone has to hand over every email, document or article, then I agree it’s not workable. Maybe they’re still refining it and that’s why this leaked draft shouldn’t be the version to judge?
Data collection is described in the ‘materials and methods’ section of the article, and often in the Supp Mat as well. I can’t see your point here – do you mean that a government policy that tries to tell everyone what should appear in the M&M would be a ‘vague rule’?
With respect to your last point, a lot of articles do use simulated datasets. Sometimes these can be archived, along with the code used to generate and analyse them, and sometimes you just need the code. It can be hard to identify the right license when archiving code, but there are a variety of repositories available.
There are two different issues here Tim. The point of my article is that the draft policy does not recognize the vast case of scientific data. You have an interest and expertise in this area so I suggest you contact OIRA and offer to help. http://www.whitehouse.gov/omb/inforeg_default. (By coincidence I helped set up OIRA back in 1980.) OIRA makes federal information policy.
The other issue is publisher mandates. As I have said I do not think publishers should try to make data policy.