Guest Post — May the AI Be With Science

Editor’s note: Today’s post is by Nihar B. Shah, Associate Professor in the Machine Learning and Computer Science departments at Carnegie Mellon University.

A long time ago, in a journal not so far away, scientific review was the domain of humans alone. But now a new force, artificial intelligence (AI), has entered the scholarly galaxy. AI rises as both ally and adversary, with the power to bring balance or betrayal to scientific research. Early research on AI in science focused on reviewing — for instance, automating reviewer assignment in computer science for more than a decade — while more recent work has broadened to explore AI’s role across the entire research process.

The Original Trilogy: AI in Reviewing

First, let’s consider three aspects of AI in the review of scientific papers.

A New Hope: Identifying Errors and Building Benchmarks

There is both excitement and apprehension about the role of AI in peer review. Human reviewers are already overloaded, and AI brings a new hope. To move beyond speculation, we need strong, systematic evaluations that can measure how well AI actually performs on the objectives of peer review, such as identifying errors and automating quality checks. Different approaches have been tried — see the “AI reviewing” section of this survey for details and references.

Comparing against past reviews: A number of studies generate AI reviews for papers with publicly available human reviews (e.g., available on OpenReview.net) and then compare these with human reviews. This approach uses existing data, but inherits the subjectivity and inconsistency of past reviews, and often reduces the evaluation to predicting reviewer scores rather than capturing richer aspects of review quality.
Surveying researchers: Various other studies ask people to rate AI-generated reviews. The evaluators are usually either authors of the paper being reviewed or other researchers in the field. But such evaluations face several challenges. Evaluators of reviews tend to be biased by style, such as by longer responses, and authors prefer more positive feedback (as found in experiments mentioned here). When labs develop and test their own AI reviewers, bias can creep in the evaluations; for instance, if participants know that the AI reviewer is designed by a prestigious group or if participants feel pressure not to criticize AI reviewers developed by colleagues.
Creating gold standards: Some studies build objective datasets for targeted tasks, such as inserting deliberate errors into papers and testing whether AI can detect them. Early experiments suggest promise in error detection, e.g., with the ancient model GPT-4 having a comparable performance to 79 reviewers (see the “Evaluating correctness” section of the aforementioned survey). This approach of creating gold standards, however, faces two challenges: (i) it is harder to achieve gold-standard human labeling at the same scale as other approaches, and (ii) tests need to be representative of the types of errors one may encounter in practice.

It will be beneficial for each field to develop domain-specific review-quality benchmarks tailored to the specific challenges of that field, the needs of its journals, and the multiple dimensions of review quality (not just the prediction of a final score). These benchmarks should be grounded in objective criteria and designed to support both evaluation and training. Safeguards against data contamination should also be taken, as publicly available benchmark data can leak into subsequent model training and produce inflated, misleading performance measures.

The Empire Strikes Back: Adversarial Attacks and Red Teaming

There are plenty of past cases where participants have tried to game the review system (examples found in Littman and Shah et al.). If AI enters the process, attempts to also exploit the deployed AI are inevitable. Some tricks are simple, like hiding instructions in white text, but these can be detected easily. Others can involve more subtle manipulations designed to steer the AI’s judgment, and there is evidence that such manipulations can be successful. For instance, see studies by Hsieh et al. and Eisenhofer et al., which demonstrate vulnerabilities of AI-based reviewer-assignment systems, and those by Lin et al. showing vulnerabilities of AI-based reviewers.

For this reason, any deployment of AI in the review process should be rigorously “red teamed” – deliberately attacked to expose its weaknesses and thereby improve defenses (e.g., as done by Hsieh et al.). This can be done by human testers or even by other AI systems trained to find exploits. The developers can then use this information to make their AI systems more robust.

Return of the Jedi: A Case for Human Judgment

Even as AI becomes more capable, there remain aspects of peer review where human judgment is useful:

Avoiding monoculture: A handful of large language models dominate the field, each with its own biases. If they are allowed to make subjective decisions, these few systems could end up shaping what kind of research is accepted, and by extension, which directions science pursues.
Preserving agency: Peer review is an exercise in critical introspection. Relying too heavily on AI risks diluting that capacity.
Valuing excitement: A central part of peer review is the subjective judgment of what research feels exciting. Feedback from expert researchers is valuable here since their perspectives help signal what the community may find meaningful.

For now, the most practical path forward is hybrid workflows — using AI to reduce the burden on human reviewers by targeting parts of the review process where its performance has been rigorously validated, while also involving humans in key judgments. If editors want to enforce exclusively human reviews, while there are techniques like this available, editors should be prepared for the inevitable cat-and-mouse dynamic where reviewers will continue to use LLMs, circumventing defenses, and editors will continue looking for new ways to detect it.

The Prequel Trilogy: AI in Research

Now, let’s look at three aspects of AI in research, ranging from the use of AI by authors in parts of the research to complete automation of the research workflow.

The Phantom Menace: Automated AI Scientists and Hidden Pitfalls

“AI scientist” systems, which autonomously generate hypotheses, execute experiments, and write papers, hold great promise for accelerating discovery. They can rapidly iterate across entire research workflows and allow for the exploration of ideas at a speed no human team can match.

Automation also brings challenges, even when done with the right intentions. AI scientist systems may produce outputs that pass peer review at reputable venues, but their internal methods can harbor hidden pitfalls. This includes the appearance of novelty even when it is not (e.g., Gupta and Pruthi). These can also include methodological problems, from practices resembling p-hacking to overstated claims of algorithmic performance, ultimately leading to unreliable results (e.g., Luo et al.).

Importantly, the aforementioned methodological flaws may remain invisible if only the final paper is examined. Evaluating automated research, therefore, requires a change in the paradigm of the review process. It is no longer sufficient to evaluate only the submitted paper; when automated AI scientists have produced research, journals and conferences should require submission of the trace logs and (wherever applicable) code of the full automated research workflow along with the paper. Evaluating these artifacts, using AI or otherwise, is more effective for identifying methodological failures and ensuring the rigor of the automated research.

Attack of the Clones: CV Padding and Fake Paper Mills

Unsurprisingly, AI can just as easily be used to flood the system with low-quality content. AI-generated “slop papers” are already on the rise, and some researchers have padded publication lists or citations with such AI-generated slop. Because AI can produce text that looks legitimate, including experiments that were never conducted, the fake paper problem is poised to grow far worse. We are entering an arms race: AI generates fakes, and AI is used to detect them.

One response is to move the goalposts. Instead of valuing publication counts or citation numbers, greater weight should be placed on reproducible research and demonstrable contributions to society. This is, however, much easier said than done.

Another response is to make faking data harder. Journals can incentivize submission of raw data, provenance records, or audit trails secured through public key cryptography. The FDA regulation on electronic records and electronic signatures (21 CFR Part 11) already mandates such safeguards in regulated industries, and compliant products are in use today. Scholarly publishing could build on this foundation to bootstrap broader adoption of similar protections.

Revenge of the Sith: Harnessing AI, Responsibly

There has been a lot of discussion recently about asking authors whether they used AI in their research or detecting its use. Results have varied from about 35% to 45% use. If these estimates are accurate, the real surprise to me is not how many authors are using AI, but how many are not. AI is an incredible force, as long as it is used responsibly. (For this article, it helped me polish drafts substantially faster.)

There are also discussions on requiring authors to report all of their uses of AI, including the prompts and other related details. In my view, such reporting can quickly become more burdensome than useful. Researchers may employ AI in numerous ways, and even if they were able to document them all, verifying such reports would be nearly impossible. Moreover, these reports may not be highly reproducible as older closed-source models may become unavailable, and parts of the AI workflow, such as system prompts, remain opaque.

Instead, a greater emphasis should be placed on verification. Using the force of AI by itself is not the problem. The dark side entices humans to grow complacent – accepting AI’s outputs uncritically and letting convenience override responsibility. An alternative approach is to require authors to verify AI outputs and report how that verification was done. If AI-generated code, did they verify it or run unit tests to ensure it worked as required? If AI suggested references, did they confirm that those references actually existed and were relevant? And ultimately, authors must take full responsibility for the final outputs of their research.

All in all, AI has opened a new chapter in the saga of science. If guided with integrity, it can open galaxies of possibilities. May the AI be with science.

Author acknowledgments: Thanks to Lettie Conrad and Roohi Ghosh for their valuable feedback on this article.

Nihar B. Shah

Nihar B. Shah is an Associate Professor in the Machine Learning and Computer Science departments at Carnegie Mellon University. His research focuses on peer review, developing computational tools with strong mathematical guarantees on performance, and conducting experiments for evidence-based policy design. The author's photo is from his talk at the Peer Review Congress and is credited to Ted Grudzinski, American Medical Association.

Discussion

2 Thoughts on "Guest Post — May the AI Be With Science"

Thank you for a great article.
I’ll add that “avoiding monoculture” also fits under “A New Hope,” as monocultures arise not just among AI models, but in human social structures as well. AI reviewers might help us better see where human pre-judgment dismisses novel questions or new paradigms.

By Darby Orcutt
Sep 19, 2025, 7:47 PM

Equating AI with a space fantasy. Right on-brand. See https://www.the-geyser.com/review-more-everything-forever/

By Kent Anderson
Sep 23, 2025, 9:30 AM

The Scholarly Kitchen