No one will dispute that AI (Artificial Intelligence) needs to “eat” data, preferably in massive quantities, to develop. The better the data quality, the better the result. When thinking about the potential applications of AI in scholarly communications as related to research artifacts, how will that work? How might AI be trained on high quality, vetted information? How are the benefits and costs distributed?
This month we asked the Chefs: Where do scholarly communication and academic outputs fit into the world of AI development?
Judy Luther: In scholarly communications there is an expanding body of openly available content from preprint servers, such as arXiv and bioRxiv, and Open Access journals and books. In addition, there is a growing variety of formats that include datasets and code, open peer review, media, and other elements of the scholarly research cycle. This volume of content provides a rich resource to be mined for all stakeholders as well as a broader audience.
Although the most visible application of AI is in the consumer sector, AI isn’t new to research. One example is Meta, the scientific search engine that was acquired by the Chan Zuckerberg Initiative (CZI) in 2017. Sam Molyneux and his sister Amy worked with a team of engineers and scientists for six years to develop the AI tools to analyze newly published as well as existing research. They arranged partnerships with publishers to access copyrighted content.
Biomedicine was a great place to start, with 4,000 articles published each day. There is both the volume of content needed to work well at scale and, for that very reason, a need for tools to synthesize it for those working in the field. Meta’s analysis detects patterns in research, which helps researchers identify collaborators and helps funders find investment opportunities. The tools can even predict which papers will have the most impact.
CZI planned to make Meta useful to the entire scientific community. This raises questions about the arts and humanities as well as the social sciences. Disciplines vary and the tools may need to be adapted given the difference in the format and nature of the research outputs. Will it take a wealthy donor, a collaborative effort or eventually plug and play tools to include all research? The timing will depend on the costs and the value of the benefits generated.
Tim Vines: We need lots of AI tools to help researchers do better research: tools to help run the lab, tools to improve data sharing, tools to make peer review more efficient. Many of these AI tools will learn from training data generated by annotating research articles.
These are small beer, though, compared to the ambitions of, for example, Meta and the AI work going on inside Elsevier. These projects digest vast amounts of the research literature with the apparent goal of automating research more broadly, particularly the step of hypothesis generation.
But does the research literature actually embody our understanding of the world? Perhaps this understanding resides instead in the heads of researchers, where (patchy) information from the literature is augmented by informal interactions with colleagues at home and at conferences.
Peer review tends to purge outlandish ideas from manuscripts, so that AI trained only on annotated research articles may be limited to generating mundane hypotheses; the great intellectual leaps that generate whole new avenues of research may never come from a machine. If that’s the case, Meta and Elsevier may want to gather AI training data from drunken conversations at the conference bar as well.
David Smith: In theory (which is probably why there’s interest from the VC players and other deep-sea denizens of the valley) scholarly outputs are somewhat akin to the Klondike in the late 1800s. In reality however, having led a team that’s built an AI (it reads and classifies engineering text [brilliantly]), one of the keys to a successful machine is really good quality, accurate, well described data. And this is where the challenges start.
The research article, frankly, isn’t a very good “raw material” as things currently stand. It’s not written to be consumed by a machine. Its components can’t be easily decoupled and utilized; they lack sufficient context, description, and organization to be collected at scale and, oh yeah, oftentimes the data is wrong… Because research is a journey involving scholars trying to become slightly less uncertain about the worlds they are trying to understand.
It’s not much better when you look at datasets. Again, sorting, managing, and assembling a good dataset for a machine to work on requires a lot of effort. You can try going without, but your AI isn’t going to be very good once you really start to use it properly. There’s a big difference between a good demo and a good outcome.
And given that most of that data is wrong… An AI built on wrong data can be devastating — for a very sobering look at the details of some medical image datasets used to train AIs, check out Luke Oakden-Rayner’s forensic analysis. The ChestX-ray14 analysis should chill you.
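The cost of wrong labels can be made concrete with a toy simulation (entirely illustrative, not drawn from the Oakden-Rayner analysis): if some fraction of a training set’s labels are wrong, even a model that learns those labels perfectly inherits the errors, so labeling accuracy caps model accuracy.

```python
import random

random.seed(0)

def noisy_labels(true_labels, flip_rate):
    """Flip each binary label with probability flip_rate, mimicking a mislabeled dataset."""
    return [1 - y if random.random() < flip_rate else y for y in true_labels]

# Here the "model" simply memorizes its (noisy) training labels,
# the best case for fitting the training set.
true_labels = [random.randint(0, 1) for _ in range(10_000)]

for flip_rate in (0.0, 0.1, 0.3):
    learned = noisy_labels(true_labels, flip_rate)
    accuracy = sum(l == t for l, t in zip(learned, true_labels)) / len(true_labels)
    print(f"label error rate {flip_rate:.0%} -> best-case accuracy {accuracy:.1%}")
```

The point of the sketch is simply that accuracy against the truth degrades roughly one-for-one with the label error rate, whatever the model.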
Yet… The potential here is probably beyond imagining. To realize that potential, considerable work is needed:
- We need to have a proper discussion about suitable machine outputs to accompany a research article
- Research articles need to have declarative and speculative statements expressed semantically, to assist machines trying to pattern-match and cross-reference the march of progress.
- There needs to be a robust discussion about how to put in place the necessary checks and balances to enable AIs to be built with truly representative and well-constructed datasets.
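One possible shape for the "declarative and speculative statements expressed semantically" point is a lightweight machine-readable assertion attached to an article. The field names and vocabulary below are purely illustrative assumptions, not an existing standard, and the DOI is a placeholder:

```python
import json

# Hypothetical sketch: a reported finding and a hypothesis from the same
# paper, expressed as structured assertions rather than prose.
claims = [
    {
        "subject": "compound-X",
        "predicate": "inhibits",
        "object": "enzyme-Y",
        "assertion_type": "declarative",    # reported as an observed result
        "evidence": "doi:10.xxxx/example",  # link back to the article
    },
    {
        "subject": "compound-X",
        "predicate": "may-treat",
        "object": "disease-Z",
        "assertion_type": "speculative",    # stated only as a hypothesis
        "evidence": "doi:10.xxxx/example",
    },
]

# A machine cross-referencing the literature can now separate established
# results from speculation instead of guessing from the prose.
declarative = [c for c in claims if c["assertion_type"] == "declarative"]
print(json.dumps(declarative, indent=2))
```

With statements in this form, the pattern matching David describes becomes a query over structured data rather than a natural-language-processing gamble.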
This will all cost lots of money. So I furrow my brow when I see various policy pronouncements about making scholarly outputs machine readable: letting folk download some XML doesn’t really count, and I don’t see anybody looking into what it takes to truly enable good quality starting material.
Alice Meadows: Call me a Luddite, but the more I learn about the increasing use of AI, the more it alarms me. In theory it sounds great — solving in minutes or even seconds problems that used to take years or even decades; facilitating the development of time-saving devices, so we can (allegedly) spend more time on the fun/interesting stuff; and in our own world of scholarly communications, providing opportunities to speed up and supposedly improve functions from peer review to data analysis and beyond. But we all know that AI also has at least one significant dark side: it is not just susceptible to bias but riven with it. And although we might like to think that using scholarly, peer-reviewed content to drive AI will solve this problem, I’m not so sure.
We know that our own content is not exactly free of bias. That, for example, people of color are underrepresented both as authors and reviewers; that the voices of scholars and scientists from the global north are very much more likely to be represented in the databases and other resources harnessed for the purposes of AI; and that women’s research is still less likely to be published than that of their male colleagues. Not to mention the fact that citations and other metrics, which are often used as a proxy for quality, are typically based on quantity (which may or may not equate to quality) and are also just as likely to be bias-prone.
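The representation skew Alice describes is, at least in principle, measurable before a corpus is used for training. A minimal sketch, assuming hypothetical bibliographic records with a made-up `first_author_region` field (real metadata rarely labels this so cleanly):

```python
from collections import Counter

# Hypothetical records standing in for a bibliographic database; in practice
# these would come from publisher or indexing metadata, if such a field existed.
articles = [
    {"id": "a1", "first_author_region": "global-north"},
    {"id": "a2", "first_author_region": "global-north"},
    {"id": "a3", "first_author_region": "global-south"},
    {"id": "a4", "first_author_region": "global-north"},
]

counts = Counter(a["first_author_region"] for a in articles)
total = sum(counts.values())
for region, n in counts.most_common():
    print(f"{region}: {n}/{total} ({n / total:.0%})")
# A model trained on this corpus without reweighting would reproduce the skew.
```

Auditing the inputs doesn’t remove the bias, but it makes the skew visible enough to correct for, rather than silently baked in.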
So, while using scholarly content as the basis for AI may be an improvement over using content that hasn’t even been fact checked and/or that is clearly biased (geographically, politically, demographically, or otherwise), I’m not yet convinced it will solve the underlying problem. To quote Safiya Noble, whatever content they use, the algorithms on which AI is based still mean that “content can be skewed and disinformation can flourish.”
Jasmine Wallace: Scholarly communication is driven by academics sharing and publishing their research findings, most of which consist of the theoretical analysis of methods and principles. If we were to imagine a more interconnected network of academics, scholars, and researchers expanding the reach of their collective knowledge, we’d land at the doorstep of Artificial Intelligence. Already the scholarly publishing community has seen advancements in editorial and production use of AI. Automated reporting, content translation, predictive analytics, content personalization, and image recognition are just a few ways in which AI has already helped to advance academic outputs.
However, as we move deeper into machine learning and dive further into more advanced predictive analytics, we’ll have to provide enhanced metadata to maximize the effectiveness of AI. In order to increase AI’s ability to provide better, more intelligent output we must start with higher quality input. Machines can now take our metadata and employ automated analysis which could be used to improve many of our processes and workflows. Furthermore, with well-defined calculations, an algorithm can help us solve issues that would take months or years to resolve.
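The "higher quality input" idea can be sketched as a validation step that runs before metadata is handed to any machine-learning pipeline. The field names and rules below are illustrative assumptions, not a real publisher schema:

```python
# Reject or normalize article metadata before it is used as AI training input.
REQUIRED = ("doi", "title", "subjects")

def normalize(record):
    """Return a cleaned copy of a metadata record, or None if it is too thin to train on."""
    if not all(record.get(field) for field in REQUIRED):
        return None
    return {
        "doi": record["doi"].lower().strip(),
        "title": record["title"].strip(),
        # A deduplicated, controlled vocabulary beats free text for machines.
        "subjects": sorted({s.strip().lower() for s in record["subjects"]}),
    }

raw = [
    {"doi": "10.1234/ABC ", "title": " Peer review at scale ",
     "subjects": ["Publishing", "publishing", "AI"]},
    {"doi": "10.1234/def", "title": ""},  # missing fields: rejected
]
clean = [r for r in (normalize(rec) for rec in raw) if r is not None]
print(clean)
```

Trivial as each rule is, applying them consistently across millions of records is exactly the kind of enhanced-metadata work that determines how intelligent the downstream output can be.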
Publishers have recently become more diligent about data, defining what the statistics mean and how the information can be leveraged. We are using this evidence to make more informed decisions both for ourselves and for the communities that we serve. We’ve become more intentional and specific with our academic outputs, which has put us in a good position to pivot into a new realm of data mining. Publishers have been the gatekeepers of scientific dissemination for decades; coupled properly with AI, the data we have on data has the ability to push academia to the next level. Harnessing the power of scholarly communication, we can be at the heart of AI developments. If we can train our systems to find and make connections unknown to the human mind, the possibilities of what we might discover are endless.
During a panel discussion on AI being used in copyediting, Sundari Ganapathy talked about how the biggest hurdle to advancing AI is getting people to stop fearing it. She explained that if we were able to have humans interact more with the machines and train them to complete necessary jobs better, then we could begin to see improved systems. Lastly, she stated that AI should be thought of as enhancing human knowledge, not replacing it – as is often feared. That being said, aside from the cost associated with most uses of these technologies, I think that in order to see more useful developments in AI we’ll need to get more fearless buy-in.
David Crotty: In the responses above, Judy talks about the preponderance of content that’s out there freely available for potential mining, and David mentions the great costs involved in making that content useful for AI training. I suspect that in the near term, intellectual property battles are going to be fought around the question of for-profit AI reuse. Because something is available on the internet, does that mean it should be free to reuse to drive corporate profits? We’ve recently seen the EU come down on the freewheeling practices of internet platforms using copyrighted content without direct permission through the provisions in the new copyright directive. Will we see the same sorts of regulation put in place for reuse of copyrighted content in this manner?
If we are moving into an open access world, then the copyright issues may eventually matter less, but who will pay to make the material truly machine readable? Should those costs be added to an author’s publication charges that are paid out of their research budget? Should academic libraries cover the costs for IBM to build their next generation business tools? Is this instead a new (paid) service that publishers can offer to the AI community? Or will it be cheaper/easier for members of that community to take openly licensed content and do the restructuring themselves?
Ann Michael: I’m going to take poster’s license here and drop back to some context. Where to start? I recently finished a course at NYU Stern on AI. The course ranged from history, definitions, and types of AI, to how AI and deep learning work, and even to coding a neural network in PyTorch (don’t ask!). AI, like many things, has had several eras of advancement and contributing factors to its development and current thinking. It also has a range of sub-definitions:
- Narrow or weak AI: expert at performing a specific task
- General or strong AI: a system that exhibits elements of human intelligence
- Super Intelligence: “any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest”
To date, we have only seen weak AI.
I agree with the Chefs’ responses that data is critical, including data intended to be read by a machine, and that there are dangers and cautions to ensure that we don’t perpetuate the sins of the past (and present) or limit our creativity. A very important concept to also consider is AI working with humans. Yes, for the more straightforward tasks AI may replace people (which wouldn’t be the first time automation has resulted in a change in the jobs that humans perform). However, for more complex or creative tasks AI and humans may just be better together. One example of this is covered in a study published in the Journal of the American College of Radiology, where researchers found that doctors and AIs together were more effective than either was separately.
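The "better together" effect can be illustrated with a toy simulation (the accuracy figures are made up, not taken from the study): two independent, imperfect readers whose disagreements are flagged for closer review catch more errors than either reader alone.

```python
import random

random.seed(1)

def reader(truth, accuracy):
    """Return a binary diagnosis that matches the truth with the given probability."""
    return truth if random.random() < accuracy else 1 - truth

cases = [random.randint(0, 1) for _ in range(50_000)]
doctor_acc = ai_acc = 0.85  # assumed, purely for illustration

correct = {"doctor": 0, "ai": 0, "combined": 0}
for truth in cases:
    d = reader(truth, doctor_acc)
    a = reader(truth, ai_acc)
    # When the two disagree, the case is escalated for a closer look,
    # which we optimistically assume resolves it correctly.
    combined = d if d == a else truth
    correct["doctor"] += d == truth
    correct["ai"] += a == truth
    correct["combined"] += combined == truth

for who, n in correct.items():
    print(f"{who}: {n / len(cases):.1%}")
```

The escalate-on-disagreement assumption is generous, but the qualitative point stands: independent errors rarely coincide, so combining human and machine judgment outperforms either one.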
My takeaway is that like most paths of discovery, especially in technology, AI comes with risks. Some of those risks are substantial. In large part the risks are traced back to two things: 1) technology advances faster than human nature, understanding, and laws, and 2) technology advances without sufficient mitigation strategies built in to address significant risks. The latter is difficult to combat because until we understand something fully we cannot effectively mitigate (or in some cases identify) all of the associated risks.
It is also difficult to strike a balance between what are legitimate risks (like perpetuating bias) and what are perceived risks that might be more rooted in our current understanding. The bottom line is that we can very likely break things before we ultimately find the most effective solution. We can also hobble a potentially impactful path because we are perceiving “risks” that might be better described as an allegiance to the status quo. The balance between these two states is pretty difficult and often results in iterations of development. In my opinion, this also represents a healthy tension. And this tension in itself is a risk mitigation strategy!
Now it’s your turn: Where do YOU think scholarly communication and academic outputs fit into the world of AI development?