The recent data policy outlined by PLOS, and now being enforced, is important not only because of its prominence but also because it makes real for many people a good number of the data-centric ideas swanning about these days, many of which have to do with data sharing and data publication.
It spurs an important discussion that may be a bit overdue.
The idea of sharing data has immediate appeal. It sounds scientific, commendable, and direct, as opposed to written reports, which are one step removed from data, and have been criticized in the digital age for obscuring data, hiding it behind old-fashioned graphical charts and tables.
Mainlining data sounds better than reading papers.
Yet data may turn out to be as variably applicable, and as elusive, as any other form of evidence. In some fields (e.g., astronomy), data are the lifeblood of the field. Most modern astronomers perform their research by analyzing data rather than peering through lenses. Other fields, such as computational biology and various mathematical fields, also have strong linkages to direct data observation. For these researchers, data and empiricism are tightly linked.
In fields where physical reality can’t be observed as directly, empirical observations yield data that can be difficult to map back onto the conditions that generated them. In these fields (medicine, molecular biology, other biological fields, physics, and the humanities and social sciences), observations can lead to some interesting data, but direct observations either can’t be captured in sufficient detail to be saved in anything approaching a complete manner, or are subject to so many conditions that the data can only provide a basis for interpretation and correlation. This is where statistics can save the day, but statistics is just another form of interpretation at some level: how you choose to analyze the data is an interpretation, and others might use different techniques. Combining data from separate observational occurrences can be extremely treacherous, given all the conditions, confounders, and temporal aspects of biological systems and populations.
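The point that the choice of analysis is itself an interpretation can be made concrete with a small sketch. The numbers below are invented for illustration and do not come from any study; the group names and the `effect` helper are hypothetical. Two analysts receive the same shared dataset, and one summarizes each group by its mean while the other uses the median.

```python
from statistics import mean, median

# Hypothetical measurements in arbitrary units; not real study data.
control = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
treated = [5.0, 5.1, 4.9, 5.2, 5.0, 9.8]  # one extreme value in this group


def effect(summary):
    """Difference between groups under a chosen summary statistic."""
    return summary(treated) - summary(control)


# Analyst 1 compares means; analyst 2 compares medians.
mean_effect = effect(mean)
median_effect = effect(median)

print(f"difference in means:   {mean_effect:.2f}")
print(f"difference in medians: {median_effect:.2f}")
```

With these made-up values, the comparison of means suggests a sizable effect while the comparison of medians suggests almost none, because a single extreme observation dominates the mean. Neither analyst has misread the data; they have made different, defensible choices about how to reduce it.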
In medicine, approaches to combining the findings from multiple trials have not lived up to expectations. Evidence-based medicine has proven of more modest utility than its proponents envisioned when they started, and clinical guidelines have devolved into a confusing array of competing interpretations of the underlying datasets, leaving clinicians with the chore of choosing which guideline to follow. If data spoke as clearly as we believed, there would be no such disagreement.
How much data to share is unclear. PLOS attempts a definition in its policy, describing a “minimal dataset” as:
. . . the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analysis presented in the paper.
As one blogger points out, even this carefully crafted statement leaves a lot of room for interpretation and creates potentially significant burdens or unrealistic expectations for researchers in some fields:
Most behavioral or physiological analysis is somewhere between “pure code” analysis and “eyeball” analysis. It happens over several stages of acquisition, segmenting, filtering, thresholding, transforming, and converting into a final numeric representation that is amenable to statistical testing or clear representation. Some of these steps are moving numbers from column A to row C and dividing by x. Others require judgment. It’s just like that. There is no right answer or ideal measurement, just the “best” (at the moment, with available methods) way to usefully reduce something intractable to something tractable.
Data sharing initiatives come down to one question: what is the point? Some argue that providing the underlying data is about validation of a published study. But data can be fabricated and falsified, and sharing falsified data that fit published conclusions does little but validate a falsehood. Also, if your analysis of my data doesn’t agree with my analysis of my data, where does that leave us? Am I mistaken, or are you?
Arguing over data can be a form of stonewalling, as the fascinating scandal at the University of North Carolina (UNC) shows. The whistleblower, Mary Willingham, is not a researcher, but she used research tools and methodologies to identify reading and writing deficiencies in high-profile athletes who were attending the university and earning good grades in sham classes. An administrator attempted to discredit her research by selectively and publicly picking apart the data. As Willingham is quoted as saying in an excellent BusinessWeek article:
Let’s say my data are off a little bit. I don’t think they are, but let’s say they are. Set aside the data. Forget about it. The . . . classes were still fake, and they existed to keep athletes eligible.
Data don’t always contain the most salient conclusions of a study. Conclusions can rely on data, but sometimes they go beyond the data.
While data sharing for some is about validating results, for others, publishing data is about enabling big data solutions and approaches.
David Crotty’s post yesterday points to this cultural belief system around data sharing, one which emanates from Silicon Valley: the belief that data can be accurately parsed and understood if you have enough of it. This belief system has led to some amazing inventions and tools when the data are actually close to some empirical reality (GPS, text searching, and so forth), but data accumulation has many potential weaknesses. Not the least of these are how far data can become divorced from reality over time, how flawed data may go uncorrected, and how reliance on data might impede simpler and more direct empirical observations.
We’ve become somewhat beholden to these technocratic, data-driven impulses in our daily lives. If you’ve ever watched the weather radar on television or on your phone rather than looking out the window — or, more likely, ever argued with your device when it’s not raining outside but the radar shows your area covered with a green splotch — you were substituting data for empiricism. Or, if you’ve ever arrived at an address using GPS to find that the store or facility you expected to be there has either ceased to exist or has moved, you’ve seen the power of empiricism over data firsthand.
What is not measured is also a concern. It’s easy to miss an important fact or skip measuring phenomena in a comprehensive, thoughtful way. Michael Clarke covered this well in a recent post, noting that many measurement tools fall short in rapidly changing environments. In medicine, many long-term studies have reached a point of diminishing returns not only because their populations are aging and dying, but because the study designs did not take into account how relevant race or gender or class might prove, resulting in studies that are dominated by Caucasian men or affluent women. After decades of data collection, such data sets are not aging well, and ultimately will become relics. What factors are the large data sets of today overlooking? What regrets will our medical research peers have in their dotage? What biases are shared when we share data?
Privacy is also a concern (as noted in a comment yesterday, with an emphasis on the Helsinki Declaration), and big data invites unanticipated consequences. In yesterday’s comments, we were also reminded of the researcher who in 2002 was able to use anonymized patient data and publicly available records to identify the health records of the Massachusetts Governor at the time, William Weld. We also all know the story of the Target pregnancy reveal, where Target sent a teenager marketing normally reserved for pregnant women, based on heuristics around her purchase habits (switching to unscented lotions, for example). Her angry father was chagrined a month later when it became clear Target had been correct.
Publishing data can lead to unintended consequences, including privacy violations. PLOS’ data policy has a statement about this, and the policy’s authors were definitely thoughtful on the point. But this only confirms that there is a big problem with any broad data-sharing policy: no single data provider can definitively know whether their data, in combination with other data, could enable privacy violations. A single relatively benign-appearing data set might be the key unlocking a half-dozen other data sets. Hence, policies that require publication of data leave the door open for major unintended consequences, as they create scads of data with no single point of accountability but many potential points of exploitation.
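How one benign-looking data set can unlock another is easy to sketch. The example below is a minimal illustration, not a reconstruction of any actual incident: every record, name, and field in it is made up, and the choice of ZIP code, birth date, and sex as the linking fields is an assumption made for the demonstration. It joins a hypothetical "anonymized" health data set to a hypothetical public record list and reports anyone whose combination of fields matches exactly one health record.

```python
from collections import defaultdict

# Hypothetical "anonymized" health records: names removed, other fields kept.
health_records = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02138", "birth_date": "1962-07-04", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02139", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition C"},
]

# Hypothetical public records (for example, a voter list) that include names.
public_records = [
    {"name": "Person One", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person Two", "zip": "02139", "birth_date": "1971-03-22", "sex": "F"},
]

# Fields assumed to appear in both data sets; none is identifying on its own.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def linking_key(record):
    """Build a tuple of the shared fields used to match records."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymized, public):
    """Return (name, diagnosis) pairs where the match is unambiguous."""
    by_key = defaultdict(list)
    for row in anonymized:
        by_key[linking_key(row)].append(row)

    matches = []
    for person in public:
        candidates = by_key.get(linking_key(person), [])
        if len(candidates) == 1:  # exactly one health record fits this person
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches


reidentified = link(health_records, public_records)
print(reidentified)
```

Neither data set contains a name next to a diagnosis, yet the join produces one. The publisher of the health records could only have foreseen this by knowing about every other data set that shares those fields, which is precisely what a single provider cannot know.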
The siren song of data is seductive. However, an environment filled with published data needs to have a clear purpose. Is it to validate published reports? If that’s the case, then it has a time-limited value, and should be treated accordingly. Is it to enable big data initiatives? If so, then we need controls around privacy, usage, and authorization. As with everything else, the relevance and utility of data depend on who you are, what you do for a living, and what is possible in your field. Assuming that “data are data” can obscure important subtleties, hide major issues, and leave us unprepared for dealing with flawed or fraudulent data. And we need to compare the risks, costs, and benefits of these initiatives in science with the risks, costs, and benefits of simply recreating published experiments using actual, empirical reality.
In some fields, data are strongly tied to empirical observations. Those fields already have robust data-sharing cultures, and are actively seeking to make them better. Other domains are more driven by hypotheses, overlapping observations, and intricate interconnections of incomplete datasets that practitioners have to weave into a knowledge base. Is their approach wrong? The technocrats who believe that “data are data” might argue it is. But unless we’re sure that “data are reality,” it’s probably best to keep our central focus on the best possible direct empirical observations. Data are part of these, but they shouldn’t become a substitute for them.