Editor’s Note: Today’s post is by Stuart Leitch. Stuart is the Chief Technology Officer at Silverchair.
Artificial intelligence and large language models have the potential to change, even disrupt, every aspect of scholarly publishing, from how infrastructure and platforms are developed to how content is discovered, used, and licensed. This post is based on the opening keynote from Silverchair’s Platform Strategies 2023, “AI in Scholarly Publishing,” and offers a conceptual overview of the technologies, and a view of how our industry fits in.
Smart people disagree about generative AI. Some see it as the latest fad, sitting at the peak of inflated expectations in the hype cycle. Others see something far more profound coming. Likewise, there is a range of views on how good or bad AI will be for humanity.
In my view, we’re experiencing an unprecedented era in which AI has transitioned from highly specialized models, with complex architectures operating in narrow domains, to general models with simple architectures and very broad applicability. AI models have outgrown mere classification and pattern recognition and are now creative.
Traditional software, call it Software 1.0, is written by humans in code that humans can read, reason about, and debug. Software 2.0, by contrast, is written in abstract languages like neural network weights, and is far less friendly to humans. No one writes this code by hand; with tens to hundreds of billions of numbers in its weight matrices, no one could. Instead, we set goals for desired behavior and give the system computational resources to learn toward those goals. We can’t debug the output, and we have very little detailed understanding of how it works. It’s a whole new paradigm in software development, and it is rapidly escaping narrow, bounded contexts into broad, general-purpose use.
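To make that scale concrete, here is a back-of-envelope sketch in Python. The 175-billion-parameter figure is GPT-3’s published size, used purely as a reference point; the byte count assumes 16-bit floats, a common training format.

```python
# Back-of-envelope arithmetic on why nobody hand-writes or debugs
# Software 2.0: the "source" is an enormous array of numbers.
# 175e9 parameters is GPT-3's published size, used as a reference
# point; frontier models are assumed to be larger still.

params = 175_000_000_000   # weights in the model
bytes_per_param = 2        # 16-bit floats, a common training format

size_gb = params * bytes_per_param / 1e9
print(f"Raw weight file: ~{size_gb:,.0f} GB")  # ~350 GB of numbers

# If an engineer could inspect one weight per second, nonstop:
years_to_read = params / (60 * 60 * 24 * 365)
print(f"Reading every weight once: ~{years_to_read:,.0f} years")  # ~5,500 years
```

There is simply no level at which a human can “read the code.”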
In other words, we don’t “program” Software 2.0 as we do 1.0; we “train” it. In 1.0, we explicitly write the software, algorithm by algorithm, into features and components. The properties and capabilities of the system are exactly those we have programmed in. In 2.0, we provide the data and compute, and the model learns on its own, building internal structures in ways we don’t yet really understand. The properties and capabilities of 2.0 systems are emergent, much like properties emerge in complex systems in nature.
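A toy sketch of the contrast, on a made-up “is this message urgent?” task (the task, features, and labels are all invented for illustration): in 1.0 a human writes the rule; in 2.0 we state the goal as a loss function and let gradient descent find the weights.

```python
import numpy as np

# --- Software 1.0: a human writes the rule explicitly. ---
def is_urgent_v1(features):
    # features = (exclamation_count, mentions_deadline)
    return features[0] > 2 or features[1] == 1

# --- Software 2.0: we specify a goal (a loss) plus data and compute,
# and gradient descent finds the weights. The learned w and b ARE the
# program; no human wrote them. ---
X = np.array([[0, 0], [1, 0], [3, 0], [5, 1], [0, 1], [4, 0]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 1], dtype=float)  # labeled examples

w, b = np.zeros(2), 0.0
for _ in range(5000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # model's current predictions
    w -= 0.5 * (X.T @ (p - y) / len(y))  # gradient step on cross-entropy loss
    b -= 0.5 * np.mean(p - y)            # nudge parameters toward the goal

print("learned program:", w, b)  # behavior nobody coded by hand
```

Scale those three learned numbers up to hundreds of billions and you have the weight file of an LLM: the same paradigm, minus any hope of inspecting it.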
Google Research’s illustration above shows how capabilities have progressively emerged as the parameter count of LLMs increases. With each new order of magnitude, new capabilities surface. Our ability to engineer such systems radically outstrips our ability to understand them. As models scale, their aggregate performance improves along predictable curves, yet individual skills arrive unevenly and often without warning, and why these observed scaling laws hold at all remains somewhat mysterious.
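To illustrate what “predictable in aggregate” means, here is a minimal sketch using the parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022). The coefficients are that paper’s reported fit, quoted for illustration only, and the fixed training-token count is likewise just an example value.

```python
# Parametric scaling law L(N, D) = E + A/N^alpha + B/D^beta,
# with coefficients as fitted by Hoffmann et al. (2022, "Chinchilla");
# treat the exact numbers as illustrative, not authoritative.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    """Predicted pretraining loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Hold the data fixed and grow the model an order of magnitude at a time:
for n in [1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n, 1.4e12):.3f}")

# The loss curve descends smoothly and predictably; the mystery is that
# downstream capabilities appear in jumps as it does.
```

The engineering irony is that we can forecast the loss of a model we haven’t built yet far better than we can forecast what it will be able to do.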
As the leading AI players build some of the world’s largest supercomputers, we’re about to find out what capabilities emerge next, as billions of dollars pour into the top companies in the arms race to AGI.
The vast majority of pundits agree that these systems are going to be extremely intelligent. But there’s a difference between intelligence and wisdom, and the real concern is whether these systems will be wise. That will depend, in no small part, on the training data we feed them.
We’ve seen Reddit, Twitter/X, and 4chan become the poster children for humanity’s most reactive thought processes, dominated by hot takes and flared emotions. The premium will be on the more deliberative, disciplined, System 2 style of thinking for which academia is society’s primary institution, with publishers as the gatekeepers.
Academia has developed an amazing tree of knowledge, arguably the most important data for large language models to be trained on. The frontier foundation models are widely assumed to have been trained on a wide variety of paywalled content, and there are ferocious legal efforts underway to get that content out of the training sets.
The provocation I put forward is this: what happens when models grow ferociously in capability, but we decline to train them on the very best sources of human wisdom, leaving them instead to learn from a longer tail of less rigorously curated, or simply out-of-date, information? What does that do to the risk that these technologies are needlessly rough on humanity? If we succeed in unhooking the LLMs from the best sources of information right as whole new sets of capabilities are about to emerge and the models are being integrated into everything, how might that play out? Do we want the tech oligopoly turning to simulations to generate its training data?
If we treat this as a gifted child that may take over the world, we owe it to humanity to give it the best education possible and to ground it in the best of human wisdom. It certainly already knows its Nietzsche, Machiavelli, Sun Tzu, and Clausewitz.
My provocative thought is that rather than trying to get premium scholarly information out of LLM training sets, we should fight to get it in there, on terms that are economically sustainable.
There is far more at stake than just preserving our business models. This is the time to look deeply at the missions underlying our organizations and consider what role we might play in nudging the outcomes. This transition is simultaneously fraught with existential risk and holds the promise of solving many of humanity’s greatest challenges. Food for thought.