In a conference room at the British Medical Journal’s (BMJ’s) offices in Fitzrovia, on the fifth floor of a glassy geometric cube that hovers above one of London’s most storied literary neighborhoods, Ian Mulvany was searching for the words to explain why he found a certain kind of software function beautiful.

Mulvany is BMJ’s Chief Technology Officer, and the function in question was a Lambda, a small, self-contained piece of software that exists to do a singular thing. In Mulvany’s freshly coded battery of journal screening tools, a Lambda function fires when an event triggers it, performs its task, and disappears. It doesn’t remember what happened before. It doesn’t persist. It has no opinion about what comes next. It just answers one question and goes quiet.

“With Lambda functions, what these cloud providers have done is say: you can get to the function, the thing that changes the state, that changes the data, and we will take care of all of the scaffolding and make that disappear for you,” Mulvany told us. “So all you need to do is write that function, and it only gets invoked at the moment that you need it to.”
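For readers who have never written one, the unit Mulvany is describing is almost disarmingly small. The sketch below is ours, not BMJ’s: the event fields and the check it performs are invented for illustration, but it shows the shape he means. One handler, one question, no memory.

    # A stateless screening function, written to illustrate the shape Mulvany
    # describes. The event fields and the check are hypothetical, not BMJ's code.
    import json

    def handler(event, context):
        """Invoked by AWS Lambda when a new-submission event fires; holds no state."""
        manuscript = json.loads(event["body"])  # hypothetical payload shape
        statement = manuscript.get("funding_statement", "").lower()
        # Answer one narrow question, then go quiet: does the funding statement
        # mention tobacco-industry sources?
        flagged = any(term in statement for term in ("tobacco", "cigarette"))
        return {
            "statusCode": 200,
            "body": json.dumps({"check": "tobacco_funding", "flagged": flagged}),
        }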

BMJ had leveraged their Amazon Web Services (AWS) hosting relationship into a pro bono consulting engagement in which Amazon’s solution architects worked alongside BMJ’s engineers to build an agentic manuscript screening pipeline. 

Initially scoped for eight weeks, the project was extended by another month because, as Mulvany put it, the AWS team “really enjoyed working on the project with us.” He was careful about that word, agentic. “By agentic, what we mean here is you fire a query off and the large language model has its own loop and it runs, figures its way through that query, has some parameters, behavior, and then finally completes and then you get the result,” he explained, “rather than you as the human sending a query through and waiting for the response and you driving that process.”
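The distinction is worth making concrete. In a conventional call, the human sends a prompt and waits for an answer; in an agentic loop, the model keeps requesting tools and reading their results until it decides it is done. The sketch below is a generic illustration of that loop, with stubbed call_model and run_tool functions standing in for a real model API and real screening tools; it is not BMJ’s implementation.

    # A generic agentic loop, not BMJ's system. call_model and run_tool are
    # hypothetical stubs standing in for a real LLM API and real tools.

    def call_model(messages):
        """Stub: a real call would return either a tool request or a final answer."""
        return {"type": "final", "content": "No issues found."}

    def run_tool(name, args):
        """Stub: a real call would execute the named tool and return its output."""
        return f"ran {name} with {args}"

    def run_agent(query, max_steps=10):
        messages = [{"role": "user", "content": query}]
        for _ in range(max_steps):
            reply = call_model(messages)
            if reply["type"] == "final":  # the model, not the human, decides it is done
                return reply["content"]
            result = run_tool(reply["tool"], reply.get("args", {}))
            messages.append({"role": "tool", "content": result})
        return "Stopped: step limit reached."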

In three to four months, they built six agents, each handling a discrete task in the desk assessment of a new submission: novelty search, integrity checks, ethical guidelines compliance, tobacco funding declarations, reporting standards verification, and statistical reasoning. One can be upgraded without touching the others. If one starts giving unreliable results, it can be turned off and the rest keep running. The architecture is designed to be smarter than any individual piece of it.
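That modularity is easier to see in structure than in prose. In this sketch (ours, with invented names and a stubbed per-agent call), each screening task is an independent entry that can be disabled or replaced without touching the rest, which is the property being described.

    # Illustrative registry of independent screening agents. The structure,
    # names, and stub are ours, not BMJ's architecture.
    AGENTS = {
        "novelty_search": True,
        "integrity_checks": True,
        "ethics_compliance": True,
        "tobacco_funding": True,
        "reporting_standards": False,  # switched off if it starts giving unreliable results
        "statistical_reasoning": True,
    }

    def run_check(name, manuscript):
        """Stub: a real agent would run its own model-driven checks here."""
        return f"{name}: no findings"

    def screen(manuscript):
        """Run each enabled agent independently; one failure never blocks the rest."""
        reports = {}
        for name, enabled in AGENTS.items():
            if not enabled:
                continue
            try:
                reports[name] = run_check(name, manuscript)
            except Exception as exc:
                reports[name] = f"agent error: {exc}"
        return reports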

The vision is that if editorial safety standards are met and BMJ can integrate these AI tools with their manuscript submission system, editors at BMJ’s journals would open a new submission and find, alongside the manuscript, a structured readout from AI agents that had already done the kind of preliminary evaluation that used to take days of human attention. Or, more honestly, the kind of evaluation that in many cases simply didn’t happen. Not every journal has the resources to run a biostatistical review on every submission. Most don’t.

The architectural pattern Mulvany described (composable, modular, agent-based) is surfacing independently across the industry wherever a publisher has the technical capacity to look beyond the monolithic systems that have defined editorial workflows for 20 years.


Angela’s organization, the American Society of Clinical Oncology (ASCO), has been on a parallel track with Google Cloud. Last year, the two organizations worked collaboratively on building the ASCO Guidelines Assistant, using Google Cloud AI to surface answers from their discrete collection of evidence-based guidelines. Angela’s takeaway was less about the product than about what building it revealed. 

The project required deliberation on which staff would work with which AI capabilities, who owned the validation of AI-generated outputs, and how to sequence the exposure so that people who had never thought about prompt design or retrieval logic were suddenly accountable for both. ASCO was, in effect, designing their own orchestration layer, not in code, but in reporting lines, meeting cadences, and uncomfortable conversations about which parts of people’s jobs were about to change. 

Perhaps more valuable than the new member benefit itself, the project was a great, and intense, starting point for thinking about how AI fits with staff, with the member community, and with content offerings. It’s much more about the people and the culture than it is about the technology.

We have grown used to systems that automate workflow. A peer review system routes manuscripts, tracks status, enforces sequences: which editor sees what, when, in what order. The customizations are about process. If Journal A requires a data availability statement and Journal B doesn’t, you configure that.

What Mulvany built automates judgment. An agent that evaluates whether the statistical methods match the study design is not routing a manuscript. The system makes preliminary evaluative judgments that humans then review. The human role shifts from doing the assessment to auditing the assessment.

The distance between BMJ’s fifth-floor conference room and the average society publisher’s editorial office is not measured in years. Mulvany did not feel ahead. He felt like someone with enough runway to find out. The project ran not on a budgetary line item but on attention — engineering time redirected toward uncertain problems, and a cloud partner willing to show up for work neither side could fully define. That is the gap: not certainty against confusion, but the organizational capacity to sustain uncertainty.

The Organizational Gap

The industry doesn’t have an AI strategy problem. It has a vocabulary problem. 

When a society CEO tells the publishing director that the board wants to hear an “AI story” in six months, the instruction sounds reasonable. It is, in fact, the beginning of a category error. “AI” is not one thing. It is at least five different things happening simultaneously inside a publishing organization, each with different stakeholders, different risk profiles, and different resource requirements: editorial tools, research integrity, content licensing, member services, and internal productivity. Asking for “an AI strategy” is like asking for “a weather strategy.” The question needs decomposing before it can be answered.

Probably no one feels more pressure for an AI story than Steven Heffner at IEEE: “We are definitely compelled to have a narrative. We are the IEEE.”  IEEE’s volunteer leaders hold positions at Microsoft, Apple, and Google. They expect engineering execution. 

Thad Lurie at the American Geophysical Union (AGU) offered a different flavor of that pressure. “What can we do with AI is a really dangerous question,” he told us, “because the answer is a whole lot of stuff, and it’s really easy to go down that rabbit hole and chase shiny objects, because they’re very, very shiny.” 

The question, Lurie argued, isn’t what AI can do. It’s whether you’re solving a real problem or improving productivity, or just generating motion, the organization doing it simply to be doing it. The result is a double bind that society technology and content teams now live inside. On one hand: let’s experiment. On the other: it had better work, or we will lose the audience.

Experimentation, by definition, means putting things out that don’t fully work yet. But the fear of looking like amateurs, of damaging the society’s reputation with its members, is real. And the culture clash with new technology partners intensifies the pressure. 

The cloud vendors and AI companies that societies are now working with bring development cycles that feel alien: short and intense engagements, narrow scopes, assumptions that the society’s data is normalized and in one place, and an expectation that everyone has agreed on a minimum viable product. The acceptance of putting partially completed products in front of society members is a massive culture shift for organizations whose brand depends on rigor.

And the people being asked to navigate it are, almost universally, under-resourced.

“We have one guy in IT that is trying to keep up and thank goodness, because it can’t just be me,” said one society leader. Another told us, “I have to review the AI contracts myself at night because I don’t have time during the day.” A third noted a simple absence: “There is a lack of legal expertise on intellectual property issues.” 

Many society IT departments have historically focused on integrating tools with the member database and providing infrastructure support. The AI question lands in a different part of the organization entirely, and in most cases it has been placed directly in the lap of whoever has “publishing” somewhere in their title.

“Publishers come to us and say, ‘We need help with our AI strategy.’ It’s like asking about your internet strategy,” said Phill Jones of More Brains. Other consultants we spoke with had noticed the same pattern, which, they observed, requires new expertise at their own firms too. 

Michael Clarke noted the structural advantage of scale: a few of the larger societies have dedicated resources set aside at the enterprise level for AI tooling and training. Others have budgeted smaller amounts to experiment with off-the-shelf tools, but lack the internal expertise to do much more than buy systems that may or may not be integrated into staff and member workflows. 

Several of the leaders we spoke with described 2026 as “a year of learning and discovery,” a framing that is honest about the gap but also, perhaps, a way of giving themselves permission to not have answers yet.

For many of these organizations, the default answer to “what’s your AI story?” has become research integrity screening. Not because it’s the most important application, but because it’s the most legible one. The vendors are selling it, the conferences are discussing it, and the board can understand it. It is the safest possible story to tell.

It might also be the wrong one.

The Judgment Problem

Respondents almost invariably went to integrity screening when we asked “What are you doing with AI in your workflow?” Yet the research integrity tools most publishers are piloting predate the LLM era by years. 

Image forensics detects pixel-level anomalies against a library of known manipulation patterns. Plagiarism detection runs similarity matching against an authoritative corpus. Papermill screening looks for statistical signatures in submission metadata. These are discriminative techniques: narrow, task-specific, trained to spot anomalies against a known baseline. They are not the contextual reasoning people mean when they say “AI” in 2026.

One publishing leader put it directly: “A lot of the tools that vendors are making available to us are on research integrity. That is an incredibly small fraction of the problems facing journals and where we need to be focusing.” 

Publishers building an AI story around integrity tooling are, in a sense, building their internet strategy around email security. It’s real, it matters, it’s even table stakes, and it is not the thing.

One society publisher said they are piloting several AI-powered integrity tools but keeping them internal to assess usefulness before rolling them out to volunteers. The deeper problem, she noted, is the lack of integration with the prominent peer review platforms. Even if journal offices can build useful agents using their own enterprise AI, they mostly need to download papers from their peer review system, upload them into their AI environment, run the jobs, download a report, and upload it for the authors, editors, reviewers, and other staff. It’s not scalable.

The legacy platforms are not keeping up, lamented more than one publishing leader. “I can sympathize with the peer review platforms,” explained one. “We need to understand which parts of the peer review process everyone is comfortable using AI for, because everything on the wishlist requires training, testing, and constant upkeep.”

That last sentence contains the real question, and it cuts deeper than platform architecture. If we assume that some percentage of peer reviewers are already uploading papers into an LLM to help them with reviews, which they should never do, the interesting question is what they’re asking:

  • Find me other publications on this specific topic published within the last 6 months.
  • Read this paper and my review and let me know if I missed anything important.
  • Does the data reported in this paper support the conclusions?
  • Give me three ways this paper could be improved.

We don’t know what the prompts say, because we correctly and explicitly do not condone the use of AI in peer review. But what if we did know? Could we not build agents within a safe, monitored environment, under our own guardrails? The people need to decide what they are comfortable having the technology do.

David Crotty, who runs the Scholarly Kitchen and the journals program at Cold Spring Harbor Laboratory Press, put the reviewer question more bluntly. “Peer review is a voluntary activity,” Crotty told us. “There are no negative consequences if one of my journals says ‘can you peer review this paper’ and you say ‘no, I don’t have time.’ That’s fine. There’s no punishment for that. But if AI is your shortcut because you didn’t really have time to engage with the paper, just say no.”

His objection isn’t moral. It’s functional. “You’re a movie critic. Should you actually see the movie, or do you want to just read a paragraph summary of what the movie’s about and then write your review?” Using AI to synthesize a paper before reviewing it “is not cheating,” Crotty said. “It’s just not effective. When I peer review a paper, I’m going through it line by line, figure by figure. I’m looking at every point on that graph. Did they cite the right papers? How am I going to know that from reading a summary?”

The point cuts in two directions. Reviewers who offload to AI aren’t cheating the publisher; they’re failing the author and the science. But the insight also clarifies what a sanctioned AI tool for peer review should actually do: not synthesize, not summarize, not shortcut the reading, but extend what a careful human reviewer can see. Mulvany’s statistical reasoning agent, the one that reads tables, identifies discrepancies between reported totals and underlying data, and writes Python code to recheck the numbers, is exactly this kind of tool. “This is insane. This is wild,” Mulvany said when he saw it working. It does what a human reviewer should do but rarely has time for, and what a journal has never had the resources to require on every submission.
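To make that concrete, the checks such an agent generates are small and disposable, on the order of the sketch below. The numbers are invented, but the flavor is right: add up what the paper’s tables actually contain and compare it to what the paper claims.

    # Hypothetical example of the kind of recheck such an agent might generate.
    # The figures are invented; a real check would be extracted from the paper's tables.
    arm_counts = {"intervention": 214, "control": 209, "withdrawn": 12}
    reported_total = 437  # total enrollment as stated in the abstract

    computed_total = sum(arm_counts.values())  # 435
    if computed_total != reported_total:
        print(f"Discrepancy: table sums to {computed_total}, paper reports {reported_total}")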

There’s a governance subtlety that Mulvany had already learned, though. His team deliberately removed accept/reject recommendations from the agent outputs. “One thing we’re worried about is trigger-happy editors will just take any signal to reject, and we don’t want to get there.” The agents present findings, not opinions. The human role is to synthesize those findings into a judgment. This is a design decision about organizational behavior, not about the technology.
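The decision is visible in the shape of the output itself. Here is a sketch of what a findings-only report might look like, with field names invented for illustration; the telling detail is the field that is not there.

    # Sketch of a findings-only report. Field names are illustrative, not BMJ's schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Finding:
        check: str     # e.g. "reporting_standards"
        detail: str    # what was observed, with a pointer into the manuscript
        severity: str  # "info" or "warning", never a verdict

    @dataclass
    class AgentReport:
        manuscript_id: str
        findings: List[Finding] = field(default_factory=list)
        # Deliberately absent: any recommendation or accept/reject field.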

The question is how to get there without a willing cloud partner and a CTO who knows what a Lambda function is. Adam Hyde has been thinking about this gap. Hyde is the founder of pure.science, an AI workflow orchestration platform built specifically for publishers, and his premise is simple: many publishers will not get access to an AWS solutions architect to spend 10 weeks building bespoke manuscript agents. Pure.science offers a canvas where workflows are assembled from modular nodes, with an option that lets the user describe what they want in plain language and receive a custom agentic workflow in return. 

“If you’re a domain expert and you can prompt the workflow into existence and then refine it in very simple means,” Hyde told us, “then you are actively shaping better outcomes from the system.” His practical observation on the build-your-own path: by the time you’ve actually built it, it’s out of date. The platform model exists because the alternative keeps requiring engineering time that you don’t have.

And the window for building these tools is tighter than it appears. While the industry focuses on catching bad papers, it is less prepared for what comes next: competent ones. 

Economist and researcher Scott Cunningham has framed the structural problem precisely. As AI tools polish the left tail of low-quality manuscripts out of existence, editors will face thousands of equally “competent” submissions, with diminishing signals to discriminate between them. The result is noisier, more arbitrary desk rejections, not because editors are less rigorous, but because the distribution has compressed.

And the people making those desk rejections are already stretched to breaking. Jennifer Brogan, a publishing consultant at Maverick who advises societies on content and licensing strategy, has watched the editorial pipeline thin from the inside. “If something doesn’t happen to make the editor’s job more manageable, we are not going to get good editors,” she told us, “which is going to just worsen the scientific integrity process that we have.” 

Universities have stopped granting faculty release time for editorial roles. Younger researchers see Retraction Watch headlines and calculate that the reputational risk of the job now exceeds the prestige. The volume problem that Cunningham describes is about to hit a system that is already losing the people it depends on.

When you finally build tools sensitive enough to check everything, you discover that nothing passes. ASCO, testing the Clear Skies tool, found that the integrity-checking agents flagged nearly every paper for something. At scale, a signal that fires on everything is no longer a signal. Mulvany hit the same problem from the other side. His reporting standards agents were too thorough: they found imperfections in nearly every manuscript because they were checking against a standard of perfection that no actual paper meets. 

“I think what we’re going to find,” Mulvany told us, “is that many papers don’t meet our platonic ideal.” When everything clears the bar, or everything fails to clear it, the bar stops being useful. The tools have revealed a gap between standards and practice that has always existed but was never visible at this resolution. What publishers do with that visibility is not a technology question.

No One Is Ahead

We tried to open each interview by asking our colleagues where they felt ahead and where they felt behind in AI strategy or implementation. No one felt ahead. Everyone identified something they were losing ground on.

Colette Bean, who leads the journals program at the American Physiological Society (APS), said the thing that keeps her up at night isn’t any single decision. It’s the possibility that decisions are arriving faster than anyone can make them well. “The real thing that’s challenging with AI is the speed of change,” Bean told us. “You wake up every day and something really significant has changed. You blink and you’re like, oh, OK, this really cool idea we had, a few others have already launched it.”

That’s the ethical problem underneath the operational one. Publishing organizations are being asked to make consequential, hard-to-reverse decisions about content, staff, and partners in an environment where the landscape shifts before the approval cycle completes. Staff concerns are legitimate and not evenly distributed: younger employees often arrive with principled reservations about environmental costs, job displacement, and what it means to outsource creative work. “Some staff members are not all-in on the use of AI,” one publishing leader told us. “They have real concerns about environmental impacts, entry-level job losses, and the loss of creative functioning.” Building trust and agency around the goals may overcome some hesitation, but telling staff they are wrong is not a strategy.

The licensing partner questions are genuinely hard. Is this potential licensee delivering actual value, or trying to co-opt market share by putting logos on their page? Are the answers generated from your content correct? Might your organization be attributed for wrong ones? Is this licensee already using your content in unauthorized ways?

And the questions get harder still in the open-access context. AI bots are crashing journal sites and making usage metrics unreliable. Publishers are beginning to ask whether an access wall in front of OA content makes sense, whether abstracts should be replaced with short summaries, whether the bronze model is quietly dying. Nobody wants to say yes to those questions. But this is where society publishers will face the sharpest tension between mission and margin, and where the speed of change that keeps Colette Bean up at night may not leave much room for deliberation.

Underneath the anxiety, though, a quieter consensus was forming across our conversations, not about tools or vendors or strategies, but about what publishing organizations are actually for. 

Ian Mulvany put it this way: “We have to very strongly think, what is the piece of the human contribution that is value, and how do I take everything else away so that my time, which is finite, which never can be added to, is maximized and amplified and gets the most value and attention around it?”

What’s durable is editorial judgment, community trust, and the capacity to evaluate and certify. What’s transient is specific models, specific vendor relationships, specific delivery mechanisms. Everyone we spoke with was hopeful that AI would improve the work they do, the content they publish, the services they offer members. Nobody was quite sure how. The tools are the same tools everyone else has access to. The difference, it turns out, is not in the technology. It is in the people who decide what to do with it.

 

We are grateful to the following individuals for speaking with us or contributing to our thinking: Colette Bean, Jennifer Brogan, Kivmars Bowling, Mike Clarke, Dana Compton, David Crotty, Steven Heffner, Adam Hyde, Phill Jones, Melissa Junior, Penelope Lewis, Thad Lurie, Ann Michael, Ian Mulvany, David Sampson, Jasper Simons, Andrew Smeall, Karla Soares-Weiser, and Vicky Williams.

A note on AI use: Claude (Anthropic) was used extensively in the production of this series. It cleaned and structured interview transcripts, identified the twenty-theme analytical framework from the corpus, and mapped verbatim quotes to themes across respondents. It served as a drafting and editing partner throughout, and as the ever-present voice insisting that if a paragraph wasn’t interesting, it didn’t matter whether it was true. The analysis, arguments, editorial judgments and at least half of the em-dashes are ours.

Angela Cochran

Angela Cochran is Vice President of Publishing at the American Society of Clinical Oncology. She is past president of the Society for Scholarly Publishing and of the Council of Science Editors. Views on TSK are her own.

Discussion

4 Thoughts on "AI Rollout Is a People Problem: A Pulse on All Things AI, Part 2"

I’m looking forward to reading this, but I would like to see part 1 first–is there a part 1? If so, can you please link to it? Thanks!

Bravo Ian Mulvany! Personally, I have a few differences of opinion as to the need for AI-based solutions on all six cases, but your approach to experimentation and integration is fab. Curious how narrow the discipline needs to be for the agents to run with sufficient consistency..? My favourite aspect is how you keep binary simplicity off the table.
Which is why this irks me:
“ What Mulvany built automates judgment.” (paragraph 13)
Followed much later by “ The agents present findings, not opinions. The human role is to synthesize those findings into a judgment. This is a design decision about organizational behavior, not about the technology.”
The first claim is the dangerous one — especially when anyone wants to believe that we should. The second claim is far more likely/useful one. ?!?
Sounds like a fascinating conversation! Sometimes being an expert in prompting has nothing to do with prompting AI… 🙃
