Editor’s Note: Today’s post is by Mark Hahnel. Mark is the CEO and founder of Figshare, which currently provides research data infrastructure for institutions, publishers, and funders globally.
Much has been made of the recent Nature news article declaring the NIH Data Policy (from January 2023) 'seismic'. In my opinion, it truly is. Many others will argue that the language is not strong enough. But for me, the fact that the largest public funder of biomedical research in the world is telling researchers to share their data demonstrates how fast the push for open academic data is accelerating.
While a lot of the focus is on incentive structures and the burden for researchers, the academic community should not lose focus on the potential ‘seismic’ benefits that open data can have for reproducibility and efficiency in research, as well as the ability to move further and faster when it comes to knowledge advancement.
What has been achieved in the last ten years?
My company, Figshare, provides data infrastructure for research organizations and also acts as a free generalist repository. We recently received funding as part of the NIH GREI project to improve the generalist repository landscape and collaborate with our colleagues at Dryad, Dataverse, Mendeley Data, Open Science Framework, and Vivli. This community of repositories has witnessed first-hand the rapid growth of researchers publishing datasets and the subsequent need for guidance on best practices.
The growth in citations from the peer-reviewed literature to datasets in these repositories can be seen in this hockey-sticking plot from Dimensions.ai. Remember, this is just the generalist repositories; it doesn’t include institutional or subject-specific data repositories.
Reflecting on the past decade of open research data, there are a few key developments that have helped speed up the momentum in the space, as well as a few ideas that haven’t come to fruition…yet.
The NIH is not the first funder to tell the researchers it funds that they should be making their data openly available to all. 52 funders listed on Sherpa Juliet require data archiving as a condition of funding, while a further 34 encourage it. A push from publishers has also been a major motivator for researchers to share their data, going back at least to 2014, when PLOS began requiring all article authors to make their data publicly available. Now, nearly all major science journals have an open data policy of some kind. Some may say there is no better motivator for a researcher to share their data than a publication being at stake.
In 2016, the ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data, and a flurry of debate on the definition of Findable, Accessible, Interoperable, and Reusable data has continued ever since. This has been a net win for the space. Although every institution, publisher, and funder may not be aiming for the exact same outcome, it is a move to better describe data outputs and ultimately make them usable as standalone outputs. The FAIR principles emphasize that the future consumers of research data will not just be human researchers: we also need to feed the machines. This means that computers will need to interpret content with little or no human intervention. For this to be possible, the outputs need to be in machine-readable formats and the metadata needs to be sufficient to describe exactly what the data are and how they were generated.
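To make the "feed the machines" point concrete, here is a minimal sketch of what machine-readable dataset metadata can look like, loosely following the schema.org Dataset vocabulary. The record itself is invented for illustration (the title, DOI, and values are not a real deposit); the point is that once metadata is structured like this, a computer can check FAIR-relevant properties without any human intervention.

```python
import json

# A hypothetical machine-readable metadata record for a dataset,
# loosely modeled on the schema.org "Dataset" type. All field values
# below are illustrative, not a real deposit.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Soil moisture measurements, example field sites, 2021-2022",
    "description": "Weekly gravimetric soil moisture readings collected "
                   "at twenty field sites over two growing seasons.",
    "identifier": "https://doi.org/10.5281/example.1234567",  # fabricated DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["soil moisture", "agronomy", "time series"],
    "encodingFormat": "text/csv",
}

serialized = json.dumps(record, indent=2)

# A machine can now interrogate the record with no human in the loop:
parsed = json.loads(serialized)
has_identifier = "identifier" in parsed                       # Findable
has_open_license = "license" in parsed                        # Reusable
is_standard_format = parsed["encodingFormat"] == "text/csv"   # Interoperable
```

Nothing about this structure is exotic; the gap today is simply that most deposited datasets carry far less description than even this toy example.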
This highlights the area (in my opinion) that can create the most change in the shortest amount of time: quality of metadata. Generalist repositories will always struggle to capture metadata at the level of subject-specific repositories. This is why subject-specific repositories should always be a researcher’s first port of call for depositing data. It is unlikely, however, that we will see a subject-specific repository for every subject in the next decade. What we will see is a multi-pronged push for better metadata for every dataset. This can be achieved in multiple ways:
- Software nudging users into best practice. A simple first step is to encourage researchers to title their dataset as they would a paper. Hint: titling a dataset “dataset” is as useful as titling your paper “paper”.
- Institutional librarians being recruited to curate metadata for outputs before they’re published
- More training for academics on the benefits of making their data more discoverable by making it more descriptive. More discoverable in theory means more potential for reuse and more impact for the researcher — the biggest incentive of all
- Services offering curation. Dryad has been doing this for a decade, and we are beginning to see a wider range of solutions available for different types of data.
- Marking up existing metadata using related information openly available on the web.
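The first item on the list above, software nudging, can be sketched in a few lines. This is a hypothetical example of the kind of check a repository deposit form could run before publication; the function name, field names, and thresholds are all invented for illustration, not Figshare's actual implementation.

```python
# A hypothetical "metadata nudge": warn depositors about weak metadata
# before a dataset is published. Thresholds and wording are invented.

GENERIC_TITLES = {"dataset", "data", "untitled", "my data"}

def metadata_warnings(title: str, description: str, keywords: list) -> list:
    """Return human-readable warnings for weak dataset metadata."""
    warnings = []
    if title.strip().lower() in GENERIC_TITLES:
        warnings.append("Title is generic: describe the dataset as you would a paper.")
    if len(description.split()) < 20:
        warnings.append("Description is very short: say what the data are "
                        "and how they were generated.")
    if not keywords:
        warnings.append("Add keywords to make the dataset more discoverable.")
    return warnings

# A deposit titled "dataset" with a one-line description and no keywords
# triggers all three nudges:
print(metadata_warnings("dataset", "Some measurements.", []))
```

Checks this simple would not replace curation by librarians or dedicated services, but they cost almost nothing to run at the point of deposit, which is exactly where a nudge is most effective.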
For publishers, there is a huge opportunity to aid researchers in data publication. Most policies require data publication at the point of publishing the associated paper. While the paper will always provide the context and interpretation, the machines need metadata around the objects, sourced either directly from the papers (meaning linkages between the two are of the utmost importance) or added at the encouragement of editorial staff before the outputs are made public.
Where else is there work still to be done?
One area where significant progress hasn't been made is the publication of negative results, or null data. Providing tools for researchers to make all of their academic outputs openly available is just half of the story. For the most part, researchers have zero incentive to publish negative results.
And while the number of researchers sharing data is growing rapidly, this does not mean that researchers want to do so. Evidence from The State of Open Data suggests that the majority are publishing data for compliance reasons. 39% of researchers surveyed said they are not receiving appropriate credit or acknowledgement for sharing their data, and 47% said they would be motivated to share if a journal or publisher required it. This lack of incentives, combined with fear of being scooped, may also be responsible for a recent spike in the number of researchers stating “data available upon request” in their articles; this is something I consider to be bad practice.
What happens next?
If the last 10 years were about encouraging researchers to make data available on the web, the next 10 years should be about making it useful. The concept of a Fourth Paradigm in academic research is envisioned as a new method of pushing forward the frontiers of knowledge, enabled by new technologies for gathering, manipulating, analyzing, and displaying data. The term seems to have originated with Jim Gray, who contributed to the publication of The Fourth Paradigm: Data-Intensive Scientific Discovery in 2009 but sadly is no longer with us to see how his predictions have come to fruition.
I have spent a large part of the last decade hypothesizing about the benefits open data could bring in accelerating the rate at which information becomes knowledge. We are beginning to see real-world examples, such as AlphaFold, a solution to a 50-year-old grand challenge in biology. A core part of this success story relied on AI training data from the Protein Data Bank, a repository that itself is over 50 years old, pulling together homogeneous datasets ideal for AI and machine learning. DeepMind's statement on publishing the findings also highlights how a combination of well-described open data and AI could achieve the lofty goals set out in the Fourth Paradigm:
“This breakthrough demonstrates the impact AI can have on scientific discovery and its potential to dramatically accelerate progress in some of the most fundamental fields that explain and shape our world.”
As such, well-described, open data is primed to accelerate the rate of discovery by providing fuel for our machine overlords to crunch through. The sheer size and volume of the data may continue to outpace our ability to store and query academic outputs in meaningful ways, even more so if we don't get the fundamentals of subject-specific community best practices in place today. Machine learning and AI can help find patterns and relationships in the data that will always be beyond the realms of human endeavor. Human comprehension of these results will still be needed to push the needle further, much in the same way that a combination of humans and machines proved to be the sweet spot in competitive chess.
As we nurse our COVID hangover, the world has never been more aware of the need to move further, faster when it comes to knowledge discovery. The missing puzzle piece in the traditional academic publishing process is well-described open data. The effects could be ‘seismic,’ some may say.