Editor’s Note: Today’s post is by Pierre Montagano. Pierre is the Director of Business Development for Code Ocean, a cloud-based computational reproducibility platform. He has worked in publishing for 20 years, starting his career in education with Pearson in 1998. He then moved on to work at McGraw-Hill and Cambridge before joining the startup world in 2015. Code Ocean is the winner of the 2018 ALPSP Award for Innovation in Publishing.
Publishers often argue that the industry has undergone a massive transformation in the last 20 years, moving smoothly, swiftly, and effectively from print to digital. Yet our most vocal critics within academia frequently accuse the industry of being antiquated and failing to meet researchers’ needs, while many of our more recent attempts at innovation have, as Sarah Andrus noted in her recent post, failed to find an enthusiastic audience. Having spent the better part of two decades in traditional publishing, I think about these issues often. The simplest explanation I can offer is that in publishing’s move from print to digital, little has genuinely changed beyond the method of delivery.
When publishers began their switch to digital in the 1990s, they started by recreating existing print artifacts in digital media: the printed journal article became the PDF, the fixed layout of this new digital format exactly mirroring the printed original. Rather than serving merely as an early, transitory stage in the development of more highly functional digital formats, however, the PDF has remained the most popular format among researchers despite the existence of newer, more fully featured interactive equivalents. Its lasting popularity derives from those qualities it shares with its print predecessors: it is both fixed and available offline through saving and printing.
By focusing so intently on replicating the strengths of print formats, however, publishers and researchers have restricted themselves to reproducing in their digital formats only those elements of the research process that print can convey: primarily, the results obtained, and the conclusions drawn from them. And though communicating results has long been the accepted role of publishing within the scientific process, in light of opportunities opened up by digital delivery and the internet, we no longer need to restrict publication to this limited stage of the process.
Today, with advances in technology that enable us to distribute far more than just words on pages, we can push the boundaries further to capture, connect, and circulate different research artifacts that substantiate the scientific process to facilitate greater insight. Much of what occurs before publication – the experimentation, the analysis, and the possibility that results will disprove the hypothesis – can now be shared. So too, as innovations such as hypothes.is are demonstrating, can the conversation that takes place around that research. Our current publishing process produces, in the words of Jon Claerbout (quoted in a 1995 article by Jonathan B. Buckheit and David L. Donoho titled ‘WaveLab and Reproducible Research’), ‘not the scholarship itself [but] merely advertising of the scholarship’; we now have the capability, however, to expand what we mean by publication so that it includes what Claerbout terms ‘the actual scholarship’ – which, in the case of his own field of research, comprises ‘the complete software development environment and the complete set of instructions which generated the figures.’
Increasing demands for transparency and reproducibility in research have resulted in a widening of what are considered publishable research outputs. The open data movement has seen the data behind research being shared in repositories and increasingly recognized as a publishable output. By curating and distributing this wider range of research outputs, we can expose more of the research process to fellow researchers, enabling deeper and more dynamic engagement. Assembling more of the elements that bring researchers to the conclusions in their published articles – data, code, lab notebooks, protocols, reagents, annotations, and referenced work – we can create a web of interconnected research objects that better facilitates the process of science.
It is not enough for us merely to facilitate the publication of a wider range of scholarly outputs, however. We must also offer tools that enable researchers to interact with them. Publishers first established their place within the scholarly ecosystem by doing certain things more efficiently than could academics – most of all, circulating research more widely than scholars could do on their own. Now that anyone with an internet connection can post their research online, the value of our contribution to that ecosystem is increasingly being called into question, and we must focus on what we can do within this new environment to benefit researchers, smooth their scientific processes, and accelerate the pace of research.
To justify our existence, then, we must enable researchers to do more with these outputs than simply read or write about them. By turning the research paper itself into a functional research tool, we can offer greater value to the community.
Code offers one such opportunity: alongside data, it too is rapidly becoming a central component of the scientific record, often turning data into something functional and constructive. As data increase in complexity and become more integral to the work, researchers need to deploy code for analysis, making it necessary to curate that code in an executable fashion. Successfully running other researchers’ code on one’s own computer is no small task, however. Code is not universal, and researchers develop algorithms, software simulations, and analyses in different programming languages, which can have multiple versions, further complicating the task. Analyses also depend on different files, packages, scripts, installers, and more, making the process of getting code running successfully both time-consuming and complex.
This presents considerable challenges for those who need to run that code to establish its validity, as well as that of the underlying data. It can also create limitations for those researchers looking to reuse that code for furthering their own experiments. The same problems can occur for complex datasets and different types of analyses.
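To make the dependency problem concrete, here is a minimal sketch in Python of the kind of information a researcher would need to record before someone else could hope to rerun an analysis: the interpreter version, the operating system, and the exact version of every installed package. The function name `snapshot_environment` is hypothetical and chosen only for illustration; it uses Python's standard library alone and covers just one language among the many that researchers use.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Return a dict describing the interpreter, OS, and installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.system(),
        "machine": platform.machine(),
        # Sorted so that two snapshots are easy to compare side by side.
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    print(json.dumps(snapshot_environment(), indent=2))
```

Even this short listing tends to run to dozens of entries on a typical research machine, and it says nothing about system libraries, compilers, or data files, which is part of why reproducing someone else's setup by hand is so laborious.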
Peer review of these sorts of research outputs is another stumbling point. Increasingly, journals are interested in reviewing data and code as part of the article acceptance process. Nature has described reviewing code as ‘cumbersome’, since it ‘requires authors to compile the code in a format that is accessible for others to check, and reviewers to download the code and data, set up the computational environment in their own computer and install the many dependencies that are often required to make it all work.’ Again, this is a point where publishers can provide the tools needed to ease author and reviewer burden.
For code, our approach at Code Ocean is based around providing self-contained executable compute capsules that include the code, data, results, and run environment within an article of record, which will both save researchers time and provide them with interactivity, facilitating both reuse and collaboration. It’s just one innovation among many that aim to expand what we mean by “publication” to incorporate more of the scientific process, to ease the burden on scholars, and to expand the concept of what the research paper can offer the community.
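As a rough illustration of what bundling code, data, results, and run environment together can mean in practice, the Python sketch below builds a simple manifest for a directory: a description of the environment plus a SHA-256 checksum for every file. This is an assumption-laden toy, not Code Ocean's actual capsule format; the names `build_manifest` and `verify_manifest` and the shape of the manifest are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(capsule_dir, environment):
    """Record the environment description and a checksum for every file."""
    root = Path(capsule_dir)
    files = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            files[path.relative_to(root).as_posix()] = sha256_of(path)
    return {"environment": environment, "files": files}


def verify_manifest(capsule_dir, manifest):
    """Return the relative paths that were added, removed, or changed."""
    current = build_manifest(capsule_dir, manifest["environment"])["files"]
    recorded = manifest["files"]
    all_paths = set(current) | set(recorded)
    return sorted(p for p in all_paths if current.get(p) != recorded.get(p))
```

A reader who receives the directory and its manifest could call `verify_manifest` to confirm that the code and data are byte-for-byte what the author published. Actually rerunning the analysis still requires recreating the recorded environment, which is the harder part and the one that container technology addresses.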
Similar initiatives around data, methodologies and annotation are already saving researchers time and effort while adding value to the publication process and the services offered by publishers. In doing so, they remind us that the biggest benefit publishers have always offered researchers is not so much circulating their research as performing the tasks that need doing – whatever they may be – more effectively than the researcher can on their own.
3 Thoughts on "Guest Post: Transforming the Research Paper into a Functional Tool"
Pierre, thanks for this terrific article. This strikes me as one of the most interesting–and difficult–pieces of the scholarly infrastructure.
You likely know that PDF/A is meant to solve this problem for the document itself. The PDF/A document I create today, in principle, should be readable by software years and years from now. In other words, a standard has conceivably solved the problem.
It strikes me that your solution would eventually need to work in the context of standards for a number of elements, including the operating systems and compilers. Can you link me to anything that discusses that?
Thanks for the question. I believe the answer lies in adapting container technology. My colleagues Seth Green and April Clyburne-Sherin wrote an article about this (https://psyarxiv.com/mf82t). Nature Methods also published an editorial titled “Easing the burden of code review” (https://www.nature.com/articles/s41592-018-0137-5).
Perfect. That’s exactly what I need in terms of next steps.
In a recent project we looked at Kubernetes for deploying a set of microservices. It turns out we didn’t have enough time to (1) learn enough about Kubernetes and (2) feel that we could deploy something as sturdy as we needed. That was on us, though, as I think this kind of container technology is only going to grow in usage and usefulness.