During the Frankfurt Book Fair earlier this month, a great deal of the programming during the STM Conference, and much of the programming around the fair itself, focused on artificial intelligence (AI) and its impacts on scholarly publishing. During the conference, I had an opportunity to speak at length with Josh Jarrett, Senior Vice President at Wiley. Josh leads Wiley’s newly formed AI Growth team, which is charged with exploring new opportunities for business growth and impact in the emerging AI ecosystem. In this capacity he oversees Wiley’s AI licensing, the newly announced Wiley AI Partnerships co-innovation program, and efforts to engage with the broader AI ecosystem. Here are some of the highlights of our conversations before and during the Book Fair.
Wiley has been working with AI for some time now. Where have you chosen to focus and why?
Wiley has been investing in AI in our technology stack for several years, but our efforts have ramped up in the last 18 months, focusing on (1) colleague productivity and creativity, (2) AI publishing and product innovation, and (3) new AI growth and impact opportunities.
We decided to lean in on AI. We think that this creates a lot of opportunities — and some challenges, but ones that we’re going to be able to navigate as effectively as possible by being part of the solution: leaning in, experimenting, learning, sharing with our stakeholders, and adapting as we go. We want Wiley and our partners to shape the AI future, as opposed to being shaped by it.
Maintaining faith in the publisher’s responsibility for the scholarly record is more challenging today. How can publishers use AI to become more effective in editorial review and production without increasing the risk of plagiarism?
We’re enthusiastic about the opportunities for AI to streamline and improve the publication process: everything from developing research reviews and generating hypotheses, to identifying new research questions by finding white space in the current literature, to analyzing data and developing simulations, to assisting with the authoring and publishing process.
But it’s also an incredibly confusing time for authors. We’ve conducted an extensive survey of authors’ feelings about AI. They’ve indicated to us they are grappling with fundamental questions about the role of AI in their work. Is it okay to use AI to refine my language? Can I use it to help analyze my data? What about generating hypotheses or identifying research gaps? Authors are encountering a variety of AI tools, from basic grammar checks to complex data analysis, and they are unsure where to draw the line.
We’re seeing a growing divide between early adopters who are enthusiastically embracing AI in their research process, and those who are more cautious or skeptical. This is creating a new kind of pressure in the academic community. Authors are wondering if they’ll be left behind if they don’t use AI, but they’re also worried about potential backlash if they do.
We also know AI can aid research misconduct, and some people will opt to use any new tool or technology in unethical ways. Our role as publishers is to bridge this divide and create standards, processes, and tools that work with integrity for everyone. We’re also working on next-generation detection software that we hope will help address this, while ramping up our focus on the incredible potential of AI to transform research practice and research communications.
“What authors really want is guidance for what meets the new ethical standards of AI-assisted authoring”
I see a lot of new problems with applying AI in our space, everything from technical, public policy, legal, licensing, to standards. Do you see others, and what are some solutions? Can we find a pathway for the STM publishing community to address these?
As the lines begin to blur between what content is fully human-authored and what content is, at least in part, machine-created, we expect that regulatory and policy positions will evolve.
What authors really want is guidance for what meets the new ethical standards of AI-assisted authoring, particularly in high-stakes, rigorous publishing formats like peer-reviewed journals. When is a little AI too much AI? We won’t realize the benefits of AI until authors are comfortable that they’re operating in the context of accepted professional ethical frameworks.
Scholarly publishers have a responsibility to safeguard the scholarly and scientific record, so I don’t think we have the luxury to sit back and hope that these issues take care of themselves. As I noted previously, we’ve been conducting research related to what authors need and expect of publishers when it comes to using AI responsibly. One of the big takeaways is that 70% of them are looking to publishers for guidance and training on the responsible use of AI in publishing.
According to published reports, Wiley has earned at least $40 million so far from licensing content to LLM developers. Do you plan to continue earning at that pace, and are other publishers who haven’t been licensing yet leaving money on the table? Are there any risks to these commercial relationships?
Wiley will continue to sign more licensing agreements and while so far they’ve primarily been for book content, there’s emerging demand for journals. Consistent with our mandate to disseminate research, while providing adequate protections and compensation, our goal is to negotiate deals that benefit our entire ecosystem of copyright holders. We leverage our scale to make sure copyright holders receive fair returns, and we work to ensure that any licensing of their content — whether for traditional uses or for AI — is done with care, protection, and responsibility.
Whether or not other publishers enter into AI licensing arrangements is up to them, but we think we’re all better served if rightsholders work collaboratively and commercially with AI players. It gives us a better seat at the table to make sure our and our partners’ interests are safeguarded.
What would you say are the main differences between Large Language Model (LLM) versus Retrieval-Augmented Generation (RAG) licensing?
We’re starting to see three market segments emerging. The first segment is for foundational LLM training, which is where you hear so much about Big Tech. Some of these models are proprietary, some are open, but only a handful of players can afford the sheer cost and complexity of building these models. These players are looking for huge amounts of data to train these models.
For the most part, the focus is on books and web content that provide a general understanding of language, reasoning, and general knowledge, rather than on more specialized, somewhat niche academic journal content.
The second segment is organizations looking to fine-tune a customized model. This could be a foundational LLM developer or a different organization that wants to optimize an LLM for a specific use case: answering a set of pharmaceutical questions, doing local language translation, and so on. These developers are still looking for a relatively large amount of information, but they especially need high-quality, specialized content in a particular domain.
The third segment, often referred to as RAG, represents a different approach by combining the capabilities of an LLM with an external retrieval mechanism to improve the accuracy and relevance of the generated responses. So, when a question is asked, the model searches a database or set of documents to find the most relevant information (the retrieval). The retrieved information is then fed into the LLM, which uses it to generate a more informed and accurate response (the augmented generation). This approach is much better for use cases where accuracy and authoritative responses are required. For obvious reasons, this RAG approach is getting a lot of attention and has applications in scientific discovery and scholarly communications.
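To make the retrieval-then-generation flow concrete, here is a minimal sketch in Python. It is purely illustrative and not a description of any Wiley system: the corpus, the term-overlap retriever, and the `call_llm` stub are hypothetical stand-ins for a real document index, embedding-based search, and an actual model API.

```python
# Minimal, illustrative RAG sketch. Hypothetical corpus and placeholder LLM call;
# real systems would use a document index, embedding-based retrieval, and a model API.

from collections import Counter

# Hypothetical corpus of authoritative documents (e.g., journal abstracts).
CORPUS = {
    "doi:10.0000/example-1": "CRISPR-Cas9 enables targeted genome editing in model organisms.",
    "doi:10.0000/example-2": "Large language models can assist with literature review workflows.",
    "doi:10.0000/example-3": "Retrieval-augmented generation grounds model output in source documents.",
}

def score(query: str, doc: str) -> int:
    """Toy relevance score: shared-term overlap between query and document."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(doc.lower().split())
    return sum(min(q_terms[t], d_terms[t]) for t in q_terms)

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Step 1 (retrieval): return the k most relevant (id, text) pairs."""
    ranked = sorted(CORPUS.items(), key=lambda item: score(query, item[1]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Step 2 (augmented generation): pack retrieved text, with source ids, into the prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for a real model API call; here it simply echoes the grounded prompt."""
    return f"(a model response would be generated from:)\n{prompt}"

def answer(query: str) -> str:
    passages = retrieve(query)
    return call_llm(build_prompt(query, passages))

if __name__ == "__main__":
    print(answer("How does retrieval-augmented generation improve accuracy?"))
```

One reason this pattern suits accuracy-sensitive scholarly use cases is that the retrieved passages carry their source identifiers into the prompt, so citation and attribution are tractable at generation time in a way they are not for content diffused through a trained model’s parameters.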
And within these three segments, we’re seeing different opportunities emerge, and, of course, different complexities that come along with them.
Should the industry share knowledge when it comes to things like attribution in result sets and verbatim text representation limitations?
Absolutely. We recently announced we’re inviting collaboration with start-ups and scale-ups to deliver specialized AI solutions.
Attribution is a good example – and the answer varies depending on the stage of the AI development process. Authors deserve credit for their work. The challenge during the LLM training stage is that in a model with billions of parameters, these different types of information are not neatly compartmentalized but rather diffused throughout the network. This makes attribution in the training stage extremely difficult technically.
Right now we’re primarily focused on the stage of the process where authoritative content is being directly referenced or quoted. In the longer term, when we solve for those use cases that don’t involve direct reference or quotation, the form of attribution might be different. Whether that’s confidence-based attribution, or statistical attribution, or domain-specific tagging, or something else entirely is a conversation that we need to have as a community. We are engaged with several groups exploring attribution solutions.
There is discussion about whether and how authors and copyright holders should opt their works in or out of deals with AI model developers. What is the role of authors’ interests and expectations in this process?
Our view is that society benefits when AI models are trained on high-quality, authoritative content. That’s something we believe aligns with the shared objectives of impact and reach in the publishing community. It’s equally critical that these models don’t use copyrighted material without proper permissions.
The commitment we make when we sign contracts with our authors and other copyright holders, such as our society publishing partners, is to ensure that their interests are safeguarded in a rapidly evolving digital landscape. AI and LLMs are just two of the most recent developments in what has been a hugely dynamic period in publishing. Most agreements with authors and copyright holders include broad dissemination rights across formats consistent with this shared mission.
We’ve seen claims that AI model developers can rely on fair use to use content without needing to license it. Among other flaws with these claims, when there is a robust marketplace for licensing, the argument for fair use fails. By establishing a clear and structured marketplace for licensing, we’re not only protecting the interests of authors and copyright holders, but we’re also safeguarding the very concept of copyright itself.
This is why many in the publishing and author advocacy communities are pushing for licensing frameworks that allow for responsible AI training while maintaining the integrity of the authors’ work.
At Wiley, we’re also mindful of how content is matched with specific AI use cases. For instance, foundational LLM training deals may include only backlist and archived content. This approach gives a second life to backlist content and it creates a buffer, giving us the flexibility to revisit newer content and control how it’s licensed in the future.
“Collaboration across publishers, institutions, and regulatory bodies will be key to creating an ecosystem where AI supports the goals of open science, transparency, and reproducibility.”
What would you recommend the community do to advance some of these issues and build awareness? Where should we start?
As a community, we need to build on the strong foundations being set already. The STM Association has general guidance on Generative AI in Scholarly Communications, publishers like Wiley have published AI principles that address these topics, and COPE has a position statement on Authorship and AI tools.
There are some gaps, however, for example in how we handle AI disclosures. It’s not enough to just outline general principles or positions. We need practical guidelines that ensure researchers, publishers, and institutions reasonably disclose when and how AI tools have been used in the research process, in keeping with any regulatory requirements, for instance whether AI was used to assist in data analysis or to help draft the manuscript. This kind of transparency is critical for maintaining the integrity of scholarly work and ensuring that other researchers can accurately assess and build upon it.
We also need to make sure that peer review processes evolve alongside AI technology. Reviewers will need to evaluate AI-assisted research, and this might require new training or even new roles within the peer review system itself. We’re entering an era where traditional review methods may not always suffice, especially when evaluating research generated or assisted by AI.
Working toward AI ethics in research is important. This includes addressing concerns like data privacy, the potential for algorithmic bias, and ensuring AI tools are used in ways that enhance, rather than replace, the intellectual rigor of human researchers. Collaboration across publishers, institutions, and regulatory bodies will be key to creating an ecosystem where AI supports the goals of open science, transparency, and reproducibility.
So, we have some solid resources in place, but there’s still a lot of work to be done in expanding these frameworks. Focusing on areas like AI disclosures and evolving peer review processes will be crucial steps in making sure AI serves the scholarly community in a way that upholds its values of rigor, transparency, and trust. By building on existing principles and setting higher standards for AI integration, we can ensure that this technology is used responsibly to drive progress in research and discovery.