This post was co-authored by Ithaka S+R analyst Rebecca Springer and Roger Schonfeld.
The landscape of actors involved in supporting the publication and sharing of research data is a crowded one, populated by research funders, scholarly publishers, university administrations and libraries, and nonprofit organizations. Ithaka S+R has studied a number of key trends in this space, including how scholars manage their own collections (for example, datasets), the hiring of data librarians at research universities, and the ways in which successful data sharing happens in established and emergent data communities (such as spinal cord injury, literary sound recording, and zooarchaeology). Turning to look specifically at large-scale generalist data repositories, it is clear that sharing research data is also increasingly becoming a real business.
Among the most significant players in the landscape today are Dryad, Mendeley Data, figshare, and Zenodo. It’s noteworthy that there is substantial variation in organizational models even among these four repositories. Mendeley Data and figshare are commercial products offered by workflow and analytics companies; Dryad is a membership-based nonprofit organization; and Zenodo is the creation of a government program. Below, we provide an overview of this landscape and highlight several recent announcements that may indicate future strategic prospects.
Open Source Options Integrate
On July 17, Dryad and Zenodo, two leading generalist not-for-profit data repositories built on open source platforms, announced a partnership supported by the Alfred P. Sloan Foundation. The partnership focuses on “supporting researcher and publisher workflows as well as best practices in data and software curation” via new open source code and “integrations between our systems.”
Dryad is a nonprofit membership organization that curates and hosts scientific and medical datasets associated with scholarly publications. Dryad’s strengths lie in its manual curation of datasets – all uploaded sets are checked for basic functionality – and in its integration with journal submission platforms. Its strongest relationships are with a variety of ecology and biology journals as well as PLOS ONE. Dryad recently announced the launch of a new platform, which will allow datasets to be published without an associated journal submission.
Zenodo is an initiative of OpenAIRE, an EU organization focused on open science, and is hosted by CERN. Although originally developed to support CERN’s research, it now accepts articles, conference proceedings, datasets, and software code (via integration with GitHub) across all academic fields, including the humanities and social sciences. Unlike Dryad, Zenodo accepts datasets that are not associated with publications. It also boasts a community curation feature, which allows organic and institutional research communities to collect and share resources. (Editor’s Note: Dryad now allows datasets not associated with publications to be deposited; see the comment below.)
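For readers curious about the mechanics, Zenodo also exposes a public REST API for deposits. The following is a minimal sketch, not a definitive workflow: it assumes a personal access token with deposit scope, and the file name and metadata are placeholders. The final call publishes the record and mints a DOI.

```python
# A minimal sketch of a programmatic Zenodo deposit via its public REST
# API. ZENODO_TOKEN, the file name, and the metadata are all placeholders.
import requests

ZENODO_TOKEN = "your-personal-access-token"  # assumption: token with deposit scope
API = "https://zenodo.org/api"
auth = {"access_token": ZENODO_TOKEN}

# 1. Create an empty deposition.
r = requests.post(f"{API}/deposit/depositions", params=auth, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload a data file to the deposition's file bucket.
with open("results.csv", "rb") as fp:
    requests.put(f"{deposition['links']['bucket']}/results.csv",
                 params=auth, data=fp).raise_for_status()

# 3. Attach minimal metadata, then publish (this step mints a DOI).
metadata = {"metadata": {
    "title": "Example dataset",
    "upload_type": "dataset",
    "description": "A toy deposit for illustration.",
    "creators": [{"name": "Doe, Jane"}],
}}
dep_id = deposition["id"]
requests.put(f"{API}/deposit/depositions/{dep_id}",
             params=auth, json=metadata).raise_for_status()
requests.post(f"{API}/deposit/depositions/{dep_id}/actions/publish",
              params=auth).raise_for_status()
```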
The nature of the partnership is not yet completely clear. But the logic for cooperation between Dryad and Zenodo is fairly straightforward, focusing on developing an increasingly aligned codebase and other integrations, and seems to have been welcomed by the research data management community. Over time, it will be interesting to see whether a partnership at this level can be sustained and, if it proves successful, whether further consolidation, perhaps including some kind of merger, could be imagined. For example, a fully merged Dryad-Zenodo repository could powerfully integrate both data and software publication into article publication workflows. This is key, especially in light of Elsevier’s efforts to integrate Mendeley Data into its own suite of research-to-reader workflow tools.
Workflow Integration
It’s not hard to see where Mendeley Data’s strategic strengths as a research data repository lie. Elsevier acquired Mendeley, then a social network and research sharing platform, in 2013. The 2015 launch of Mendeley Data, a general-purpose research data repository, created an additional mechanism for Elsevier to lock researchers into a complete workflow ecosystem. Mendeley Data is now able to take advantage of the Elsevier user account and the Mendeley researcher dashboard to allow users to manage their research libraries, references, and datasets in a single location.
But perhaps Elsevier’s best chance to leverage Mendeley Data is through integrating data submission into article publication workflows. Put simply, there are two main reasons why a researcher will publish their data online: either they belong to a data community with developed norms of sharing and reusing a certain type of data, or they are required to do so at the point of publication by their funder or the publisher itself. It’s the latter case that’s relevant here, and in this scenario, researchers are likely to follow the path of least resistance: they will deposit their data in whichever repository is most convenient. Mendeley Data currently advertises “custom integration” with manuscript submission systems, and following Elsevier’s acquisition of the manuscript submission system Aries last year, the company appears strongly positioned to push content to Mendeley Data simply by setting it up as the seamless, default option in its publication workflows.
There is another way in which Elsevier could move to elevate Mendeley Data in the research data space. Data sharing proponents – including Elsevier itself in a recent post – have argued that good data management, starting at the beginning of a project, is a necessary foundation for data sharing. Institutional subscribers to Mendeley Data can supply their scholars with group workspaces that ingest raw data and manage data files collectively throughout the research process. Integration with scientific instruments and electronic lab notebooks, as well as automatic application of domain-specific metadata, is described on Mendeley Data’s website as in development. Such offerings have the potential to move Mendeley Data upstream in the research process, significantly reduce the marginal effort required to publish a dataset, and thereby help ensure compliance with funder mandates. With data citation on the rise, it’s also possible to envision such an integration becoming important for institutional recordkeeping, paralleling Elsevier’s move to become more involved in preprints and institutional repositories.
Facilitating Compliance
For better or worse, much of the current research data landscape is shaped by the data sharing requirements of major research funders. A key function – perhaps the key function – of the large, generalist data repositories is to provide a home for datasets whose publication is mandated by these funders.
figshare has been a particularly active player in the compliance space. Soon after figshare’s 2011 launch and its 2012 acquisition by Digital Science, Wellcome published a guest post by Mark Hahnel, figshare’s founder and CEO, advertising the repository as a service helping researchers align with Wellcome’s open science directives. The next year, figshare for Institutions was launched, offering university libraries an alternative to developing their own institutional repositories from scratch and thereby competing in the commercial cloud-based repository space dominated by bepress’s Digital Commons. But whereas Digital Commons was offered principally for other purposes, figshare for Institutions was marketed explicitly to help universities ensure that their researchers comply with funder mandates – and maintain good standing for future award competitions.
This July – and following closely on the heels of the Dryad-Zenodo partnership announcement – the NIH and figshare launched a one-year pilot representing a new move in the funder compliance game. The concept is simple: NIH-funded researchers are “invited” – though not required – to deposit their data in NIH Figshare, a uniquely branded corner of the broader figshare database. Similar sub-databases are already in place for a number of large publishers, including Springer Nature, Wiley, Taylor & Francis, and PLOS, and datasets deposited in NIH Figshare can also be located via the main figshare website’s search tool.
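That search is also available programmatically. As a hedged illustration, the sketch below queries figshare’s public v2 search API; the “NIH” query string is illustrative rather than an official NIH Figshare filter.

```python
# A minimal sketch of searching figshare's public v2 API.
# The "NIH" query string is illustrative; there may be better ways to
# scope results to the NIH Figshare portal specifically.
import requests

resp = requests.post(
    "https://api.figshare.com/v2/articles/search",
    json={"search_for": "NIH"},
)
resp.raise_for_status()
for article in resp.json():
    print(article["id"], article["title"])
```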
Two aspects of this announcement, however, are noteworthy. The first is that it comes after a significant false start in data sharing for the NIH. The NIH Data Commons, an ambitious effort to develop a cloud-based big data platform for health science researchers, was not continued past its 2017-2018 pilot phase. The NIH’s turn to a commercial platform provider could be read as a much more conservative second attempt.
Second, the July announcement somewhat cryptically promised that NIH Figshare would be “curated by trained data librarians.” An allusion in the same announcement to cooperation with the Sloan-funded Data Curation Network (DCN), a shared staffing membership organization for institutional repositories, initially appeared linked to this promise, but DCN swiftly released a statement clarifying that it has no formal partnership with figshare. Instead, figshare appears to be hiring data librarians directly. This service, if built out, could represent an important innovation, bringing a generalist repository closer to the kind of support services offered by subject-specific databases like DesignSafe-CI – and much appreciated by researchers.
figshare has also been perfectly happy to bring in content types well beyond datasets. Like Zenodo, it hosts preprints and other scholarly outputs, offering an alternative to institutional repositories and preprint services. It also offers shared “project” workspaces for research groups. In recent years, Digital Science has been developing data integrations between figshare and its Symplectic Elements research information management solution, allowing the repository’s preservation and showcasing features to connect directly with management, compliance, and analytics. It has offered these solutions together as a bundled sale for institutions.
Looking Ahead
In this review, we have examined only a few of the most significant businesses in the research data sharing landscape. The landscape also includes a number of other elements, including institutionally managed repositories, other startups and open source projects, the not-for-profit and highly curated ICPSR, and more. Given the comparatively robust offerings from Elsevier (Mendeley Data) and Digital Science (figshare), Clarivate is strikingly absent from the picture. Suffice it to say that we do not believe that the data repository landscape is yet mature, and it takes little foresight to predict new entrants and additional consolidation.
At the same time, there is little reason to believe that data sharing has developed into a sustainable, let alone profitable, line of business. With growing funder and publisher mandates, however, it is clear that several companies and not-for-profits are staking their claim on the belief that research data must be part of a comprehensive workflow solution.
Discussion
Many interesting points, but I find it ironic that scholarly publishers are excoriated for clunky integrations (I remember a presentation by you at a recent STM meeting in Frankfurt), but when they try to create seamless researcher experiences it’s called “lock-in”. Having previously worked at Aries Systems for 20 years, I can tell you that inbound and outbound integrations were always designed and implemented in the most generic manner possible. For example, Aries is a leader in the MECA standard for open manuscript exchange – not at all consistent with a “lock-in” strategy.
On a separate note, researcher culture and incentives will also play an important role in shaping data deposit practices. After all, the technical infrastructure for depositing and sharing data files has existed for decades.
Richard Wynne – Rescognito, Inc.
Richard, Thank you for your comment. I think it is entirely possible to despair of poor interfaces and urge efforts to increase seamlessness for researchers while also being concerned about corporate consolidations that will increase customer lock-in. The market is and likely will continue to drive towards consolidation but that is by no means the only way to enhance the researcher experience. -Roger
Interesting review! I’d like to add some comments:
Organizational Models
As stated in the article, the research data sharing landscape also includes “a number of other elements”. As an example, I’d like to mention here the growing community of research data repositories based on Dataverse (https://dataverse.org) representing different organizational models, including institutional repositories (e.g. LibraData; https://dataverse.lib.virginia.edu and Göttingen Research Online; https://data.goettingen-research-online.de/), multi-institutional repositories (e.g. DataverseNL; https://dataverse.nl and DataverseNO; https://dataverse.no), domain-specific repositories (e.g. Portail Data Inra; https://data.inra.fr), as well as international repositories (e.g. Harvard Dataverse; https://dataverse.harvard.edu).
Curational Services
I think one of the advantages of embedding a data repository within an institution or a data / domain community is that it makes institutions take more responsibility for the data produced by their researchers by providing curation of deposited data and other research data management (RDM) services. The article mentions that the service figshare apparently is building by hiring data librarians “could represent an important innovation, bringing a generalist repository closer to the kind of support services offered by subject-specific databases”. All the organizational models I mentioned above (apart from Portail Data Inra) are generic repositories, and some of them offer data curation by trained research data support staff, often with expertise in the different domains served by the repository. Thus, the awaited innovation is already available to researchers at a growing number of institutions. For smaller institutions lacking the capacity to provide fully fledged RDM services, I’d suggest they collaborate with other institutions.
Lock-in
@Richard Wynne: I don’t think the discussion of lock-in in this article relates primarily to missing integrations. The problem is rather that, through the smooth integration of different services provided by one and the same company, such companies tend to evolve into monopoly-like stakeholders. I’d rather see the future research communication landscape driven by the scholarly communities themselves. Another aspect of lock-in is the possible restriction and transfer of rights involved when researchers deposit their data in commercial repositories. Fortunately, institutions are becoming increasingly aware of such pitfalls and are specifying in their RDM policies that researchers must not grant commercial parties any permission to exploit and/or publish research data unless the institution retains the right to make the data openly accessible for reuse.
Disclaimer
I’m one of the managers of DataverseNO, and involved in the Dataverse user community.
Philipp Conzett — UiT The Arctic University of Norway
One of the big gaps in research data infrastructure is a tool that bridges broad data policies (‘you must share your data’) and the exact data sharing steps researchers need to take for a particular article or project. All too often researchers don’t share their data because a) they don’t know which datasets they need to make available, and b) stakeholders with policies can’t assess compliance because they too have no idea which datasets should be available.
We’re trying to address this with DataSeer, which uses Natural Language Processing to ‘read’ articles and spot data collection sentences (e.g. “We measured X with a Y machine”). It then works out what type of data has been collected – it’s often just a table of values that can be shared as a .csv – and gives the author advice on the best format for sharing that data and the most suitable repository. We’ve got a working prototype, and we’re developing a few use cases. People are welcome to get in touch (tim.h.vines@gmail.com or tim@dataseer.io) if they’d like a demo.
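For a flavor of what such detection involves, here is a crude, keyword-based stand-in; DataSeer’s actual system uses trained NLP models rather than hand-written patterns like these.

```python
# A toy stand-in for NLP-based detection of data-collection sentences.
# Real systems (like DataSeer's) use trained models; this regex merely
# illustrates the kind of sentence being targeted.
import re

CUES = re.compile(
    r"\b(we (measured|recorded|collected|sampled|sequenced|surveyed)"
    r"|data (were|was) (collected|obtained|recorded))\b",
    re.IGNORECASE,
)

def data_collection_sentences(text: str) -> list[str]:
    """Return the sentences that look like data-collection statements."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CUES.search(s)]

sample = ("We measured leaf area with a LI-3100 meter. "
          "These results are discussed below.")
print(data_collection_sentences(sample))
# -> ['We measured leaf area with a LI-3100 meter.']
```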
This is a good piece on downstream data sharing solutions. But the biggest roadblocks that need to be removed are further upstream. Data storage is relatively easy compared to data privacy (human subject data), governance, legal, and IP issues. I’m glad Philipp Conzett noted curation. This can represent a significant and often unbudgeted cost and resource challenge for researchers (and their funders) working in human subject research or with terabyte- and petabyte-sized data. Then there is data sharing culture and incentives. Some disciplines are evolving faster than others.
Data sharing is still in its early days but can learn a lot from scholarly publishing vis-a-vis standards, preservation, access, discoverability, etc.
Just want to clarify a statement regarding Dryad, which *does* now allow datasets to be published without an associated journal publication. The following sentence from the post is not accurate: “Unlike Dryad, Zenodo accepts datasets that are not associated with publications.”
Thanks, Roger and Rebecca, for this interesting article and insights on research data sharing. You’re right that “data sharing is still in its early days…”, but a subject domain-specific data-sharing platform that is more mature and can offer some lessons is the Global Biodiversity Information Facility (GBIF – https://www.gbif.org/). GBIF is an open-access biodiversity data platform that was founded in 2001 and is supported by the world’s governments as an autonomous intergovernmental organization. GBIF data is formatted in the Darwin Core Standard (DwC – https://dwc.tdwg.org) and metadata are written based on the Ecological Metadata Language (EML – https://knb.ecoinformatics.org/#external//emlparser/docs/index.html). Data licensing has to adhere to one of the Creative Commons licenses (CC – https://creativecommons.org/) to facilitate data reuse. Datasets are published through a distributed network of data publishing hosts around the world. The distributed nature of the system helps with data quality control and curation at the source of publishing, and other quality controls are performed at the global level before data is ingested. This should not, however, be construed to mean that GBIF data is error-free. GBIF aggregates the different datasets into one integrated global dataset.

Authors who want recognition for publishing data – a major task that has in the past been underrated and not properly rewarded – can publish data papers. According to GBIF, a data paper is a peer-reviewed document describing a dataset, published in a peer-reviewed journal. Data paper writing tools automate the manuscript preparation process: GBIF, in collaboration with Pensoft Publishers (https://pensoft.net), facilitates the preparation of data paper manuscripts through the Arpha Writing Tool (AWT – https://arpha.pensoft.net/) and GBIF’s Metadata Profile (https://github.com/gbif/ipt/wiki/GMPHowToGuide), which is provided as part of GBIF’s Integrated Publishing Toolkit (IPT – https://www.gbif.org/ipt). The Brazilian GBIF Country Node has also developed a data paper writing tool called NephilaPaper (https://ferramentas.sibbr.gov.br/nephila/).
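To make the Darwin Core formatting concrete, here is a single invented occurrence record written out as the kind of flat table that GBIF publishing workflows consume; the field names are genuine Darwin Core terms, while the values are fabricated for illustration.

```python
# An invented Darwin Core-style occurrence record, written out as a flat
# CSV. Field names are genuine Darwin Core terms; the values are made up.
import csv

record = {
    "occurrenceID": "urn:example:occ:0001",
    "basisOfRecord": "HumanObservation",
    "scientificName": "Passer domesticus",
    "eventDate": "2019-06-15",
    "decimalLatitude": "69.6496",
    "decimalLongitude": "18.9560",
    "countryCode": "NO",
}

with open("occurrence.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
```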
Dryad, Mendeley Data, figshare, and Zenodo are still some way from being able to integrate or present data the way GBIF does but they are important repositories for the many different data types out there that are outside the scope of GBIF. I believe that as the preservation of research data gains currency, the market will expand and create more opportunities for existing as well as new players to experiment and grow the field.
Hi Siro! And thank you for this comment. What you’re describing here is a kind of data community, at least in the parlance that my colleagues Danielle Cooper and Rebecca Springer have used, but the connection with a data paper is interesting — I will look into that further. I’ll be writing further about the broader landscape here, including generalist services like Dryad, figshare, Zenodo, and Mendeley Data but also data communities and other models, in another piece shortly. Here’s a link to Danielle’s and Rebecca’s piece in case you’re interested:
https://sr.ithaka.org/publications/data-communities/
Warmly
Roger
Hi Roger,
Indeed GBIF is a data community and thanks for the link to the data communities article. I will keep track of your upcoming articles on this subject.
Otherwise, great to link up. Best wishes and regards to all that still remember me on your end.
Hi Roger and Rebecca,
Good analysis!
You highlight exactly what we’re trying to achieve with Mendeley Data: adding value for researchers throughout the data lifecycle and reducing the effort involved in managing data effectively, helping researchers achieve compliance as well as better research outcomes, leading to increased citations.
Keeping our platform open and interoperable, through providing open APIs, using community standards, and integrating with third parties, is always a key focus for us. We have 35+ repository integrations for our data search engine, for instance (https://data.mendeley.com/datasets), and a forthcoming connection between our active data management tool and other repositories, such as Zenodo and Dataverse.
Kind regards,
Wouter
Research Data Management – Elsevier & Mendeley Data