This post was co-authored by Ithaka S+R analyst Rebecca Springer and Roger Schonfeld.
The landscape of actors involved in supporting the publication and sharing of research data is a crowded one, populated by research funders, scholarly publishers, university administrations and libraries, and nonprofit organizations. Ithaka S+R has studied a number of key trends in this space, including how scholars manage their own collections (for example, datasets), the hiring of data librarians at research universities, and the ways in which successful data sharing happens in established and emergent data communities (such as spinal cord injury, literary sound recording, and zooarchaeology). Turning to look specifically at large-scale generalist data repositories, it is clear that sharing research data is also increasingly becoming a real business.
Among the most significant players in the landscape today are Dryad, Mendeley Data, figshare, and Zenodo. It’s noteworthy that there is substantial variation in organizational models even among these four repositories. Mendeley Data and figshare are commercial products offered by workflow and analytics companies; Dryad is a membership-based nonprofit organization; and Zenodo is the creation of a government program. Below, we provide an overview of this landscape and highlight several recent announcements that may indicate future strategic prospects.
Open Source Options Integrate
On July 17, Dryad and Zenodo, two leading generalist not-for-profit data repositories built on open source platforms, announced a partnership supported by the Alfred P. Sloan Foundation. The partnership focuses on “supporting researcher and publisher workflows as well as best practices in data and software curation” via new open source code and “integrations between our systems.”
Dryad is a nonprofit membership organization that curates and hosts scientific and medical datasets associated with scholarly publications. Dryad’s strengths lie in its manual curation of datasets – all uploaded sets are checked for basic functionality – and in its integration with journal submission platforms. Its strongest relationships are with a variety of ecology and biology journals as well as PLOS ONE. Dryad recently announced the launch of a new platform, which will allow datasets to be published without an associated journal submission.
Zenodo is an initiative of OpenAIRE, an EU organization focused on open science, and is hosted by CERN. Although originally developed to support CERN’s research, it now accepts articles, conference proceedings, datasets, and software code (via integration with GitHub) across all academic fields, including the humanities and social sciences. Unlike Dryad, Zenodo accepts datasets that are not associated with publications. It also boasts a community curation feature, which allows organic and institutional research communities to collect and share resources. (Editor’s Note: Dryad now allows datasets not associated with publications to be deposited; see comment below.)
The nature of the partnership is not yet completely clear. But the logic for cooperation between Dryad and Zenodo is fairly straightforward, focusing on developing an increasingly aligned codebase and other integrations, and seems to have been welcomed by the research data management community. Over time, it will be interesting to see whether a partnership at this level can be sustained and, if it is successful, whether further consolidation, perhaps including some kind of merger, could be imagined. For example, a fully merged Dryad-Zenodo repository could powerfully integrate both data and software publication into article publication workflows. This is key, especially in light of Elsevier’s efforts to integrate Mendeley Data into its own suite of research-to-reader workflow tools.
It’s not hard to see where Mendeley Data’s strategic strengths as a research data repository lie. Elsevier acquired Mendeley, then a social network and research sharing platform, in 2013. The 2015 launch of Mendeley Data, a general-purpose research data repository, created an additional mechanism for Elsevier to lock researchers into a complete workflow ecosystem. Mendeley Data is now able to take advantage of the Elsevier user account and the Mendeley researcher dashboard to allow users to manage their research libraries, references, and datasets in a single location.
But perhaps Elsevier’s best chance to leverage Mendeley Data is through integrating data submission into article publication workflows. Put simply, there are two main reasons why a researcher will publish their data online. Either they belong to a data community with developed norms of sharing and reusing a certain type of data; or they are required to do so at the point of publication by their funder or the publisher themselves. It’s the latter case that’s relevant here, and in this scenario, researchers are likely to follow the path of least resistance: they will deposit their data in whichever repository is most convenient. Mendeley Data currently advertises “custom integration” with manuscript submission systems, and following Elsevier’s acquisition of the manuscript submission system Aries last year, the company appears strongly positioned to push content to Mendeley Data simply by setting it up as the seamless, default option in its publication workflows.
There is another way in which Elsevier could move to elevate Mendeley Data in the research data space. Data sharing proponents – including Elsevier itself in a recent post – have argued that good data management, starting at the beginning of a project, is a necessary foundation for data sharing. Institutional subscribers to Mendeley Data can supply their scholars with group workspaces that ingest raw data and manage data files collectively throughout the research process. Integrations with scientific instruments and electronic lab notebooks, as well as automatic, domain-specific metadata applications, are described on Mendeley Data’s website as in development. Such offerings may have the potential to move Mendeley Data upstream in the research process and significantly reduce the marginal effort required to publish a dataset, thereby helping to ensure compliance with funder mandates. With data citation on the rise, it’s also possible to envision such an integration becoming important for institutional recordkeeping, paralleling Elsevier’s move to become more involved in preprints and institutional repositories.
For better or worse, much of the current research data landscape is shaped by the data sharing requirements of major research funders. A key function – perhaps the key function – of the large, generalist data repositories is to provide a home for datasets whose publication is mandated by these funders.
figshare has been a particularly active player in the compliance space. Soon after figshare’s 2011 launch and 2012 acquisition by Digital Science, Wellcome published a guest post by Mark Hahnel, the repository’s founder and CEO, advertising it as a service helping researchers align with Wellcome’s open science directives. The next year, figshare for Institutions was launched, offering university libraries an alternative to developing their own institutional repositories from scratch and thereby competing in the commercial cloud-based repository space dominated by bepress’s Digital Commons. But whereas Digital Commons was offered principally for other purposes, figshare for Institutions was marketed explicitly to help universities ensure that their researchers comply with funder mandates – and maintain good standing for future award competitions.
This July – and following closely on the heels of the Dryad-Zenodo partnership announcement – the NIH and figshare launched a one-year pilot representing a new move in the funder compliance game. The concept is simple: NIH-funded researchers are “invited” – though not required – to deposit their data in NIH Figshare, a uniquely branded corner of the broader figshare database. Similar sub-databases are already in place for a number of large publishers, including Springer Nature, Wiley, Taylor & Francis, and PLOS, and datasets deposited in NIH Figshare can also be located via the main figshare website’s search tool.
Two aspects of this announcement, however, are noteworthy. The first is that it follows on the heels of a significant false start in data sharing for the NIH. The NIH Data Commons, an ambitious effort to develop a cloud-based big data platform for health science researchers, was not continued past its 2017-2018 pilot phase. The NIH’s turn to a commercial platform provider could be read as a much more conservative second attempt.
Second, the July announcement somewhat cryptically promised that NIH Figshare would be “curated by trained data librarians.” Allusion in the same announcement to cooperation with the Sloan-funded Data Curation Network (DCN), a shared staffing membership organization for institutional repositories, initially appeared linked to this promise, but DCN swiftly released a statement clarifying that they have no formal partnership with figshare. Instead, figshare appears to be hiring data librarians directly. This service, if built out, could represent an important innovation, bringing a generalist repository closer to the kind of support services offered by subject-specific databases like DesignSafe-CI – and much appreciated by researchers.
figshare has been perfectly happy to bring in content types well beyond datasets. Like Zenodo, it also hosts preprints and other scholarly outputs, where it offers an alternative to institutional repositories and preprint services. It also offers shared “project” workspaces for research groups. In recent years, Digital Science has been developing data integrations between figshare and its Symplectic Elements research information management solution, allowing the repository’s preservation and showcasing features to connect directly with management, compliance, and analytics. It has offered these solutions together as a bundled sale for institutions.
In this review, we have examined only a few of the most significant businesses in the research data sharing landscape. The landscape also includes a number of other elements, including institutionally managed repositories, other startups and open source projects, the not-for-profit and highly curated ICPSR, and more. Given the comparatively robust offerings from Elsevier (Mendeley Data) and Digital Science (figshare), Clarivate is strikingly absent from the picture. Suffice it to say that we do not believe that the data repository landscape is yet mature, and it takes little foresight to predict new entrants and additional consolidation.
At the same time, there is little reason to believe that data sharing has developed into a sustainable, let alone profitable, line of business. With growing funder and publisher mandates, however, it is clear that several companies and not-for-profits are staking their claim on the belief that research data must be part of a comprehensive workflow solution.