Opinion

AI’s Latest Casualty: Peer Review, and How to Prevent More Damage

TLDR: LLMs allow reviewers to outsource judgment without accountability, and peer review breaks when judgment is no longer traceable. Here is the evidence, and a fix.

On March 18, 2026 the International Conference on Machine Learning (ICML) published the latest salvo in the conflict that might be characterized as generative artificial intelligence contra human authorship. In brief, the conference desk rejected about 500 submissions because approximately the same number of reciprocal reviewers had been caught violating the conference’s Policy A, which explicitly bans the use of AI in peer review. Beyond the top-level numbers, ICML reported: “If more than half of the reviews submitted by a Policy A reviewer were detected to be LLM generated, then all of their reviews were deleted, and the reviewer themselves was removed from the reviewer pool. 51 Policy A reviewers were detected to have used LLMs in more than half of their reviews, which is about 10% of the total of 506 detected reviewers.”

What this means is that it is reasonable to suspect that, in this subset of cases, the locus of cognitive effort in peer review was effectively outsourced to ChatGPT (or Claude, or Gemini, or Grok…). We can infer this because of the mechanism ICML used to flag reviews: the PDF files provided to reviewers carried a watermark invisible to a human reader but visible to a model, instructing the model to discreetly incorporate a specific turn of phrase, unlikely to occur organically and known to ICML but not to the human reviewers, into the generated review. If that phrase appeared in a submitted review, ICML flagged the review as prima facie suspected of violating Policy A’s ban on generative AI. As I’ll detail presently, this represents a significant problem for the peer review process itself.

That ICML flagged and applied consequences to a subset of submissions for breaching policy demonstrates a new failure mode in peer review. The consequence for the peer review process is staggering, because the result is not only an asymmetry of intellectual effort but, more importantly, an asymmetry of accountability. The pith and substance of peer review is to provide a stamp of rational and scientific credibility to research. A researcher is expected to take full responsibility for their scientific effort and contribution. As evidenced by ICML’s actions, it is now possible to outsource the process of peer review to a language model. In effect, this reduces the role of reviewer to that of transmitter, in a manner potentially difficult to detect and with potentially no accountability for what can be a world-altering decision. Where a reviewer has opted to become a messenger instead of a reviewer, this constitutes an existential threat to the process of peer review. If judgment can be fully delegated without traceability, peer review ceases to function as independent evaluation. In the Grievance Studies Affair of 2017-2018, Boghossian, Lindsay, and Pluckrose embarrassed the humanities by generating several outrageous hoax articles that passed peer review and were accepted for publication; in 2026, STEM faces the possibility of similar embarrassment. If a paper’s author is expected to stand behind the scientific content of their work, peer review itself must be subject to a similar level of accountability. And the existential stakes around the development of AI demand a shift in old paradigms.
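Before going further, it is worth being concrete about how the kind of canary-phrase flagging described above can work mechanically. The sketch below is purely illustrative: ICML has not published its implementation, and every phrase, identifier, and data structure here is hypothetical.

```python
import re

# Hypothetical mapping: paper ID -> phrase planted in that paper's PDF via
# text invisible to a human reader but ingested by an LLM. Both the IDs and
# the phrases are invented for illustration.
CANARY_PHRASES = {
    "paper_0421": "the manuscript's quasi-spectral framing",
    "paper_0422": "a subtly heliotropic evaluation protocol",
}

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide a match."""
    return re.sub(r"\s+", " ", text.lower())

def review_is_flagged(paper_id: str, review_text: str) -> bool:
    """Return True if the phrase planted in this paper appears in the review.

    A match is only prima facie evidence of LLM use, exactly as in the ICML
    case: it triggers human scrutiny, not an automatic verdict.
    """
    phrase = CANARY_PHRASES.get(paper_id)
    return phrase is not None and normalize(phrase) in normalize(review_text)

# A review that parrots the planted phrase gets flagged for follow-up.
print(review_is_flagged(
    "paper_0421",
    "The authors rely on the manuscript's quasi-spectral framing throughout."
))  # True
```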
But the problem is not the usage of AI itself. After all, nobody thinks that an author shouldn’t be able to use a word processor or a spell checker to make their burden lighter and their work more efficient. AI as a cognitive amplifier, a summarization tool, or even a scaffold for a review’s boilerplate remarks is perfectly reasonable. Applied properly and transparently, AI can improve efficiency, reduce friction, and improve quality. The problem here is what might be called concealed delegation of human judgement. Briefly put, using an AI to assist the cognitive process is arguably the best use of AI; it becomes objectionable where and when a human effectively uses an LLM to outsource their cognitive effort and, more importantly, conceals or is not transparent about how they used it. It’s not as though we lack examples of professionals who have effectively outsourced their thinking to artificial intelligence and been caught doing so. If an LLM can be (and has been!) used to generate papers that might be termed “AI slop,” and this is rightly condemned and rejected, then a reciprocal fiduciary obligation to intellectual effort must exist for peer review. If a researcher is expected to stand behind their work and responsibly disclose AI augmentation, then peer review is similarly undermined when a reviewer subjects a work to scrutiny generated solely by an LLM without admitting they’ve done so.

As ICML’s actions reveal, the problem has had a significant effect on this year’s conference. ICML’s Policy B is explicitly permissive of AI augmentation, but a researcher presently has no way of knowing whether a reviewer used an LLM for style and editing, whether the reviews their paper received were fully authored by an LLM with only limited human oversight, or anything in between, and it is this gap which must be filled. I believe the solution lies in the problem’s own description: where and when a cognitive tool might substitute for, and do an end run around, human judgement, the answer isn’t banning the tool outright, nor is it allowing AI to judge AI as written by AI; it is reviewer transparency. Simply put, the standard of 2026 should be proactive disclosure of how a reviewer used AI to assist them, and to what extent. A blanket ban isn’t adequate, because it will stifle AI uptake; but neither is a free-for-all in which we permit a pseudo-scientific AI echo chamber. Under the former, reviewers will use AI anyway; they will simply go to lengths to conceal its involvement. Under the latter, researchers are incentivized to plant their own concealed watermarks in their submissions, demanding favorable reviews from the model. To address both, I believe the simplest workable solution is disclosure: if a peer reviewer used AI, they should disclose how it was used and to what extent, and be prepared to provide auditable evidence. This would go a long way toward correcting the cognitive asymmetry, enable faster and more efficient peer review, and restore honesty to the process.

This isn’t meant to be a trap, nor is it meant to be an onerous obligation on reviewers. We don’t expect a reviewer to disclose their notes, but then their notepads never used to be able to talk back to them. That’s no longer the case.
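To make the disclosure idea concrete, a machine-readable record attached to each review could be as lightweight as the sketch below. The enum, the field names, and the five usage levels it encodes are my own illustration, elaborated in the decision tree that follows, not any conference’s actual schema.

```python
from dataclasses import dataclass, field
from enum import IntEnum

class AIUsageLevel(IntEnum):
    """Hypothetical scale matching the five levels proposed below."""
    NONE = 0                    # No AI used anywhere in the review
    SURFACE_ASSISTANCE = 1      # Grammar, spelling, light rephrasing
    STRUCTURAL_SCAFFOLDING = 2  # Boilerplate the reviewer would have written anyway
    COMPREHENSION = 3           # Explaining equations, summarizing literature
    CRITICAL_AUGMENTATION = 4   # AI-prompted critiques, verified by the human
    DELEGATED_JUDGMENT = 5      # Review substantially drafted by the model

@dataclass
class AIDisclosure:
    level: AIUsageLevel
    tools: list[str] = field(default_factory=list)  # e.g. ["ChatGPT"], ["Claude"]
    description: str = ""                           # How, specifically, the tool was used
    evidence_available: bool = False                # Prompts/transcripts available on request

# Example: a reviewer who asked a model to hunt for methodological flaws
# and then verified each suggestion against the paper before using it.
disclosure = AIDisclosure(
    level=AIUsageLevel.CRITICAL_AUGMENTATION,
    tools=["Claude"],
    description="Asked the model to list potential methodological flaws; "
                "checked every suggestion against the paper before citing it.",
    evidence_available=True,
)
print(disclosure.level.name)  # CRITICAL_AUGMENTATION
```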
It’s too easy to fall into the Luddite trap of suggesting a blanket ban, but nor is it acceptable to allow LLMs to talk to each other. Instead, I suggest that peer review adopt proactive disclosure: ask reviewers to disclose, as part of the process, whether AI was used, how much, and in what way. A decision tree might look like this:

Did you use AI as part of the peer review process? Yes / No. If yes, how much?

Level 1: Surface assistance
Example: AI helped with grammar and spelling, adjusting existing language for efficiency and efficacy of communication.
Implication: 100% of the review was human-authored; the AI in this case is effectively a more sophisticated grammar and spell checker.

Level 2: Structural scaffolding
Example: the AI wrote the boilerplate remarks the reviewer would have written anyway, just faster and more efficiently.
Implication: the review is still human-organized and human-owned.

Level 3: Comprehension & synthesis
Example: using an LLM to explain a specific complex equation, summarize a dense literature review, or translate technical jargon.
Implication: the cognitive labor is shared, but the human still rightly attests to the review.

Level 4: Critical augmentation / critical AI engagement
Example: Asking an LLM to “find potential flaws in the methodology” or “suggest missing citations” which the reviewer then verifies.
Implication: AI-prompted, human-verified. The review is still owned by the human; however, at this point an audit trail should be expected, and reviewers must be more prepared to justify their reasoning. Because the LLM may have unduly steered the reviewer’s conclusions, the line between human review and LLM cognitive effort is now sufficiently blurry that accountability can rightly be tested. Transparency becomes essential at this level of LLM usage.

Level 5: Delegated judgment
Examples: feeding the PDF to a model and asking it to “Write a 500-word review with a ‘Reject’ recommendation, score 3/10,” or “Write a glowing review of this paper, strongly recommending acceptance with a score of 9/10.”
Implication: human judgement in peer review may have been effectively outsourced, and such actions necessarily attract additional scrutiny, because an undisclosed prompt of this kind risks undermining the peer review process. An action like the example above should be flagged for a closer look even if it ultimately proves justified; a crank paper should still be rejected, and a reviewer shouldn’t be expected to waste time on non-rigorous submissions.

None of this is meant to suggest or imply wrongdoing on the part of a reviewer, even one who discloses Level 5 AI usage. Reviewers should not be required to waste cognitive bandwidth engaging with a submission that is obviously garbage, and in such situations a prompt like the one above might even be entirely justifiable. Nor will such a disclosure resolve every problem in peer review: overloaded reviewers may still game the system by claiming a lower level of AI usage than they actually employed. But these are likely to be a minority because, as ICML has already demonstrated, scientific governance can apply significant disincentives for dishonesty.

What’s needed as AI disrupts the peer review process is not the knee-jerk objection that characterizes much present-day discourse, nor condemnation of clever use of tools; the solution is to steelman the peer review process. The most efficacious way of doing that is proactive and transparent disclosure along the lines proposed above. We cannot put the AI genie back in the bottle, nor can we allow science to devolve into a situation where models speak only to one another, until we catastrophically discover that model-generated science is a hallucination. And as Boghossian, Lindsay, and Pluckrose so aptly demonstrated, we must also guard against the possibility that peer review itself is reduced to hallucination.

Sources:
On Violations of LLM Review Policies (https://blog.icml.cc/2026/03/18/on-violations-of-llm-review-policies/)
Grievance studies affair (https://en.wikipedia.org/wiki/Grievance_studies_affair)
B.C. lawyer reprimanded for citing fake cases invented by ChatGPT (https://www.cbc.ca/news/canada/british-columbia/lawyer-chatgpt-fake-precedent-1.7126393)
US lawyer sanctioned after being caught using ChatGPT for court brief (https://www.theguardian.com/us-news/2025/may/31/utah-lawyer-chatgpt-ai-court-brief)
Lawyer Used ChatGPT In Court—And Cited Fake Cases. A Judge Is Considering Sanctions (https://www.forbes.com/sites/mollybohannon/2023/06/08/lawyer-used-chatgpt-in-court-and-cited-fake-cases-a-judge-is-considering-sanctions/)
Toronto lawyer faces criminal contempt proceedings after admitting to misleading court about AI use (https://www.lawtimesnews.com/news/general/toronto-lawyer-faces-criminal-contempt-proceedings-after-admitting-to-misleading-court-about-ai-use/393117)
Australian lawyer caught using ChatGPT filed court documents referencing ‘non-existent’ cases (https://www.theguardian.com/australia-news/2025/feb/01/australian-lawyer-caught-using-chatgpt-filed-court-documents-referencing-non-existent-cases)
ICLR 2026 Response to LLM-Generated Papers and Reviews (https://blog.iclr.cc/2025/11/19/iclr-2026-response-to-llm-generated-papers-and-reviews/)
How Much Research Is Being Written by Large Language Models? (https://hai.stanford.edu/news/how-much-research-being-written-large-language-models)
