Editor’s Note: Tyler Whitehouse is the CEO of Gigantum, a research-focused software startup. He is a research mathematician by training: after a postdoctoral fellowship at Vanderbilt University, he spent six years managing machine learning programs for US agencies such as DARPA, ARPA-E, and IARPA. At Gigantum, Tyler focuses on executing the company’s vision: to deploy an open technology that reduces technical differences between research producers and consumers, enables the rapid dissemination of research and customized code, and promotes global access to science.
Over the last few years, there have been two significant changes in the scientific community’s approach to technology. First, researchers have become much more experienced in software development and are devoting more time to creating software. Second, the scientific community has become more comfortable using and deploying cloud services and infrastructure through easy-to-access enterprise-level software.
What does this mean for academic research software? Researchers are now able to create viable products and services in a highly decentralized fashion, which means it is increasingly likely that there will be a gap between the software landscape that publishers and other stakeholders currently envision, and the one that will soon exist. As researchers become more tech-savvy, many are developing products for themselves and leveraging commercial, yet open source, technologies to create modern platforms and services that not only fit their needs, but are easy to use; this is a bellwether for the tools researchers and publishers are likely to see on the horizon.
This is a two-part post. Today, Part 1 looks at three key trends in the research software space:
- Tools are increasingly researcher-driven and user-oriented.
- Tech-enabled tools are coming to the fore faster than ever.
- Decentralized tools and services are changing the game and creating new opportunities.
Tomorrow, Part 2 (a joint post with Isabel Thompson, Senior Strategy Analyst at Holtzbrinck Publishing Group) will look at the implications of these trends, in order to:
- Show how new tools are being leveraged to solve science’s hardest problems (e.g., the reproducibility crisis).
- Consider how this may impact the broader scholarly communications space.
Trend 1: Academic Research Software is Increasingly Researcher-Driven and User-Oriented
While publishers have been marketing to researchers for some time, it is only recently that researchers themselves have begun to seriously develop user-friendly software for other researchers, rather than just bare-bones tools for their own labs. As the ecosystem of open source languages and tools has matured, some researchers have begun adopting a more product-minded outlook, developing software with the end user explicitly in mind rather than as an afterthought, a trend picked up from the commercial market. Utility is to be maximized. Learning curves are to be minimized. Adoption is to be pursued.
The Jupyter Notebook Revolution
The academy-developed product that most exemplifies this trend is the Jupyter notebook. Beginning as a physics researcher’s attempt to make his daily life easier, the tool has evolved over the last 10 years into an open source, interactive programming framework used globally across academia and industry. This article provides a nice overview of what Jupyter is and why it’s popular. The key idea, though, is that it is an “interactive web tool…which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document”.
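To make the “single document” idea concrete, here is a minimal sketch of how a notebook stores explanatory text, code, and output side by side. It assumes the JSON layout of version 4 of the notebook file format; the helper names (markdown_cell, code_cell, build_notebook) and the cell contents are hypothetical, chosen only for illustration. In practice the file is written by Jupyter itself, not by hand.

```python
import json


def markdown_cell(text):
    """An explanatory-text cell (hypothetical helper)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source, stdout, execution_count):
    """A code cell that carries its own captured output (hypothetical helper)."""
    return {
        "cell_type": "code",
        "execution_count": execution_count,
        "metadata": {},
        # The output is stored next to the code that produced it.
        "outputs": [{"output_type": "stream", "name": "stdout", "text": stdout}],
        "source": source,
    }


def build_notebook(cells):
    """Wrap cells in the top-level structure assumed for format version 4."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = build_notebook([
    markdown_cell("# Mean of three measurements"),
    code_cell("print(sum([1.0, 2.0, 3.0]) / 3)", "2.0\n", 1),
])

# Serialized, this is the kind of text that is saved with an .ipynb extension.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```

Because the text, the code, and the result travel together in one file, a reader can see what was run and what it produced without reassembling the pieces.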
To give a sense of the scale and speed of adoption, one analysis showed that the number of Jupyter notebooks on GitHub had increased from ~200,000 in 2015 to over 2.5 million by September 2018. By March 2019, there were over 4 million on GitHub, according to a follow-up analysis. The browser-based environment appeals to a broad user spectrum, largely because of its support for computational and exploratory work in Python.
Released in 1991 and named in homage to Monty Python’s Flying Circus, Python is one of the most popular open source programming languages for data science. One survey has over 70% of respondents using Python regularly. Python’s massive penetration into data science has pushed Jupyter notebooks into all levels of academia and industry. To grasp Jupyter’s traction, you only need to look at the variety of its steering council and sponsors.
Despite being an academic project, Jupyter was developed with end users in mind. The goal was to lower the bar to entry for doing and learning computational work without alienating more experienced and discriminating users. In the process, it completely reset expectations of the level of ease and power required from interactive programming environments for open source languages. While not everybody loves them, everybody admits that Jupyter notebooks have changed the game.
It is important to realize that Jupyter could not have been developed outside of the open source-oriented research community, for a variety of reasons. The permissive licensing that contributed to its dissemination, the flexibility of installation and use on a variety of resources, and the emphasis on reducing high-frequency frictions in the daily life of researchers all required deep experience with, understanding of, and sympathy for academic users.
Trend 2: Tech-Enabled Tools are Coming to the Fore Faster than Ever
Academically developed platforms for open and decentralized distribution are not necessarily new (the Galaxy Project, for instance, was released in 2005 and is still in heavy use by the bioinformatics community), but these platforms can now be developed faster than ever. This increase in speed is possible because researchers can now create actual infrastructure using enterprise-level tools combined with commercial cloud resources. Small teams have gone from just developing tools for narrow use cases to developing and deploying internet-scale services. This approach is much more Silicon Valley than ivory tower in nature.
A major enabler of innovation in academic software has been the software container, which is a software deployment technology similar in function to a virtual machine (think remote desktop or running Windows inside of macOS) but much lighter weight and easier to create. Containers allow developers to create software without worrying too much about how to make it run on different machines, because the container technology takes care of that. This makes it far easier to create software that can be used by a broad audience.
Just as the global economy runs on goods crossing the oceans in shipping containers, the digital economy increasingly runs on services shipped and deployed in software containers. Docker and Kubernetes exemplify the freely available but commercially developed container tools now available to researchers. In a nutshell, Docker is how the world creates its containers and Kubernetes is how the world deploys and orchestrates collections of containers on cloud resources. With these kinds of technologies at their disposal, researchers can significantly increase the rate of innovation for academic products and services. The result has been a variety of approaches and business models, from community-based open source projects like Binder and Pangeo, to companies like Nextjournal, Code Ocean, and Gigantum. We will discuss some of these in more depth in tomorrow’s post.
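For readers who have not seen one, a container is usually described by a short text recipe. The sketch below generates such a recipe in the Dockerfile format that Docker reads. The helper name render_dockerfile is hypothetical, and the base image, package list, and start-up command are illustrative values, not a recommended configuration. The code only produces the recipe text; building and running the container would still require Docker itself.

```python
import json


def render_dockerfile(base_image, packages, command):
    """Return the text of a minimal Dockerfile (hypothetical helper)."""
    # FROM names the starting image that supplies the operating system and Python.
    lines = [f"FROM {base_image}"]
    if packages:
        # Install the project's Python dependencies inside the image.
        lines.append("RUN pip install --no-cache-dir " + " ".join(packages))
    # The exec form of CMD is written as a JSON array of strings.
    lines.append("CMD " + json.dumps(command))
    return "\n".join(lines) + "\n"


# Illustrative values only: an assumed base image and package list.
dockerfile_text = render_dockerfile(
    base_image="python:3.11-slim",
    packages=["notebook", "numpy"],
    command=["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser", "--allow-root"],
)
print(dockerfile_text)
```

The point of the recipe is that the environment is written down once, so anyone with a container runtime can rebuild the same setup without configuring their own machine by hand.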
Trend 3: Decentralized Production and Services are Pushing the Envelope for Research Software
One impact of these changes is the broadening decentralization in the production, use and control of software for academics. In this context, decentralization refers to the ability of small groups to operate independently from traditional stakeholders in the academic and publishing space. While this decentralization may not influence consolidation trends within the publishing world, it will definitely shape the landscape of products and services used by researchers, thereby influencing trends in both research and publishing — which will be discussed in more detail tomorrow.
The most obvious form of decentralization is the creation of high-quality software by researchers and academically generated startups. While decentralized tool creation has been part of the academic community since open source first emerged, the quality and power of the software being developed are now much closer to enterprise level than ever before. This results in a faster rate of innovation, as well as rising expectations among researchers for the quality of both free and paid products.
Another, perhaps less obvious, type of decentralization comes from a shift in the type and intent of products being developed. Researchers are increasingly creating products that other researchers can use to provide services in turn, rather than just solving their own technical problems. Jupyter was an early leader in this regard, in that it aimed to be a DIY service that could be deployed even by fairly non-technical researchers. This new ethos is a radical departure from the previous mindset, in which a single group developed tools for a narrow use case, or in which monolithic services were offered and controlled by a single organization. Modern researchers are not just interested in becoming independent themselves; they are also investing time to help other researchers become independent.
This shift in focus comes from the surrounding technological changes, but it also comes from the community-based ethos that accompanies the open science movement. Researchers are looking for ways to hedge against the ability of large stakeholders to impose structures and policies that researchers find adversarial. Community control, diversification, and decentralization of products and services are increasingly common hedging strategies.
This rapidly evolving and dynamic environment is exciting, and it offers many opportunities to improve individual researchers’ workflows, as well as the quality of research as a whole. In Part 2 tomorrow, we’ll look at some examples of how the three trends discussed above impact science, software, and the scholarly communication industry more broadly.