Generative AI creates transparency crisis in research
Researchers are calling for a sharper, fairer way to disclose artificial intelligence (AI) use in research, warning that current publishing rules are failing to keep pace with how quickly generative AI has entered academic work. A new analysis published in AI & Society claims that researchers, editors, supervisors and integrity officers now face a transparency crisis: AI tools are being used across research workflows, but there is still no shared vocabulary for explaining what those tools actually did.
The study, titled "The AIR framework for research transparency: a critical analysis of stage-specific AI disclosure in the context of accessibility and research integrity," examines the AIR (AI in Research) framework, a proposed disclosure system for classifying AI involvement across different stages of academic work. It also warns that poorly designed transparency rules could disadvantage disabled and neurodivergent researchers.
Why AI disclosure has become a research integrity problem
Researchers are now using AI for literature searches, coding, transcription, data analysis, writing support, manuscript preparation and public communication. However, many journals and institutions still rely on broad or inconsistent rules that ask authors to disclose AI use without clearly defining what counts as meaningful use.
The author argues that this gap has produced confusion at several levels. Some researchers do not know what publishers expect them to report. Early-career scholars may avoid legitimate tools out of fear that disclosure will be treated as misconduct. At the same time, hidden AI use can weaken trust, especially when readers cannot determine whether AI affected the methods, findings, interpretation or writing.
The study states that existing systems were not built for this challenge. Authorship rules such as the ICMJE criteria are designed to decide who qualifies as an author, not how to describe machine assistance. Contributor systems such as CRediT identify human roles, but do not explain how AI may have contributed within those roles. Ethics statements from publishing bodies provide principles, but often lack operational detail. Journal rules, meanwhile, vary widely and can quickly become outdated when they focus on specific tools rather than research practices.
The AIR framework attempts to fill that gap by shifting attention away from a simple yes-or-no question about AI use. Instead, it asks where in the research process AI was used and how deeply it was involved. The framework divides research into seven stages: discovery, implementation, analysis, writing, publication, outreach and evaluation. It then classifies AI involvement into five bands, ranging from no AI use to substantial use.
The risk of using AI to organize notes is not the same as the risk of using AI to shape research questions, automate analysis or draft large portions of a manuscript. A single label such as "AI used" cannot capture those differences. The author presents AIR as a transparency tool, not a ban or approval system. It aims to help researchers describe what happened so others can judge whether the use was appropriate and properly checked.
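To make the stage-by-band idea concrete, here is a minimal Python sketch of how a disclosure might be recorded. It is an illustration only, not the paper's specification. The seven stage names come from the description above. The band codes A0 to A4 are an assumption extrapolated from the labels mentioned in this article (A1-Access, A2, A3, A4), and the class and function names (Stage, Band, StageDisclosure, summarize) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Stage(Enum):
    """The seven research stages named in the article, in order."""
    DISCOVERY = "discovery"
    IMPLEMENTATION = "implementation"
    ANALYSIS = "analysis"
    WRITING = "writing"
    PUBLICATION = "publication"
    OUTREACH = "outreach"
    EVALUATION = "evaluation"


class Band(IntEnum):
    """Five bands of AI involvement.

    Assumption: the codes run A0..A4, with A0 meaning no AI use and
    A4 meaning substantial use. The article does not define A1 to A3
    in detail, so no definitions are invented here.
    """
    A0 = 0
    A1 = 1
    A2 = 2
    A3 = 3
    A4 = 4


@dataclass(frozen=True)
class StageDisclosure:
    """One disclosure entry: where AI was used and how deeply."""
    stage: Stage
    band: Band
    note: str = ""  # free-text description of what the tool did


def summarize(disclosures: list[StageDisclosure]) -> list[str]:
    """Return one line per stage, in stage order.

    Stages without an entry are reported as "not stated" rather than
    silently treated as A0, so a missing entry is never read as a
    claim of no AI use.
    """
    by_stage = {d.stage: d for d in disclosures}
    lines = []
    for stage in Stage:
        entry = by_stage.get(stage)
        if entry is None:
            lines.append(f"{stage.value}: not stated")
        else:
            suffix = f" ({entry.note})" if entry.note else ""
            lines.append(f"{stage.value}: {entry.band.name}{suffix}")
    return lines


if __name__ == "__main__":
    report = summarize([
        StageDisclosure(Stage.WRITING, Band.A1, "grammar checks"),
        StageDisclosure(Stage.ANALYSIS, Band.A3),
    ])
    print("\n".join(report))
```

The point of the structure is that a disclosure is a set of (stage, band) pairs rather than a single flag, which is what lets a reader tell note organization apart from automated analysis.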
AIR is usable but struggles at higher AI involvement
To test whether AIR can be applied consistently, the author conducted a pilot study with 15 trained raters drawn from UK research integrity and doctoral supervision networks. The participants reviewed nine research scenarios and classified each by research stage and AI engagement level. Overall inter-rater reliability reached a Cohen's kappa score of 0.72, which the paper treats as substantial agreement, suggesting that trained evaluators can apply AIR with workable consistency. The strongest agreement appeared for cases where there was no AI use or only minimal assistive use. Discovery, implementation and analysis stages also showed relatively strong reliability.
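For readers unfamiliar with the statistic, the sketch below shows how Cohen's kappa is computed for a single pair of raters: observed agreement is corrected for the agreement expected by chance. The ratings are invented for illustration and are not the pilot data. The article does not say how the reported figure was aggregated across all 15 raters, so this example makes no attempt to reproduce it. The function name and the band codes are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed share
    of items on which the raters agree and p_e is the agreement
    expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical band labels for nine scenarios (not the study's data).
    rater_a = ["A0", "A1", "A2", "A3", "A4", "A2", "A3", "A4", "A1"]
    rater_b = ["A0", "A1", "A2", "A3", "A3", "A3", "A3", "A4", "A2"]
    print(round(cohens_kappa(rater_a, rater_b), 3))
```

In this made-up example the two raters disagree only on the higher bands, which mirrors the pattern the pilot describes without standing in for its results.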
The results also exposed a serious weakness. Agreement declined as AI involvement became more substantial. The lowest reliability appeared for the A4 category, which covers substantial AI use where iterative collaboration may blur the line between human and machine contribution. Raters often disagreed over whether a case belonged in A2, A3 or A4, especially when researchers used AI in repeated back-and-forth exchanges while still making final decisions themselves.
This decline matters because the highest bands are the ones most likely to trigger scrutiny from journals, supervisors or integrity officers. If evaluators cannot consistently identify substantial AI use, researchers may be over-classified as using AI more heavily than they did, or under-classified when the tool played a larger role. Either outcome weakens the purpose of disclosure.
According to the author, the problem is not simply poor training. Some AI practices are genuinely hard to classify. A researcher may use AI to suggest theoretical approaches, reject some outputs, refine others, and then write the final framework independently. In that case, AI did not make the final decision, but it still shaped the researcher's thinking. The paper says AIR needs a way to acknowledge such boundary cases without forcing false precision.
The study proposes a boundary-case designation for situations that fall between categories. Rather than pushing researchers to choose a single band when the classification is unclear, the disclosure could state that the use sits between two levels and explain why. That approach, the author argues, would preserve transparency while reducing anxiety and stigma.
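One way to picture a boundary-case disclosure is as a statement that names two bands and gives a reason, instead of forcing a single label. The Python sketch below is a hypothetical rendering of that idea, not a format defined by the paper. The band codes A0 to A4, the function name and the output wording are assumptions, as is the rule that the two bands must be adjacent.

```python
# Assumed band codes, ordered from no AI use to substantial use.
BANDS = ["A0", "A1", "A2", "A3", "A4"]


def boundary_statement(stage, lower, upper, reason):
    """Build a disclosure line for use that sits between two bands.

    Assumption: a boundary case spans two adjacent bands and must be
    accompanied by a short explanation.
    """
    if lower not in BANDS or upper not in BANDS:
        raise ValueError("unknown band code")
    if BANDS.index(upper) - BANDS.index(lower) != 1:
        raise ValueError("a boundary case must sit between two adjacent bands")
    if not reason.strip():
        raise ValueError("a boundary case needs an explanation")
    return f"{stage}: between {lower} and {upper}. {reason.strip()}"


if __name__ == "__main__":
    print(boundary_statement(
        "discovery",
        "A2",
        "A3",
        "AI suggested theoretical approaches; the author selected, "
        "refined and wrote the final framework independently.",
    ))
```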
Accessibility concerns expose a major weakness in AI transparency rules
The author warns that disclosure rules can unintentionally harm disabled and neurodivergent researchers if they treat all AI assistance in the same way. For some researchers, AI is not a way to outsource intellectual work; it functions as assistive technology. A neurodivergent scholar, for instance, may use AI to break large tasks into smaller steps, organize a literature review, reduce cognitive overload, scaffold writing or support executive function. A researcher with a sensory, motor or mental health-related access need may use AI to make academic work possible or less burdensome.
The paper states that requiring detailed disclosure of such use could pressure researchers to reveal disability-related information they have not chosen to share. A journal or reviewer may ask why AI was used and how it was verified. To answer fully, the researcher might have to disclose a diagnosis or access need, which creates a conflict between transparency and privacy.
The author says this is not a minor edge case. Academic environments often place disabled researchers in difficult positions: disclose and risk stigma, or stay silent and lose access to needed support. If AI transparency systems ignore that reality, they may reproduce inequality while claiming to protect research integrity.
To address this, the paper proposes a protected A1-Access sub-band. This category would cover AI use as a personal accessibility accommodation when the tool supports access but does not shape the substantive research claims. Researchers could disclose that AI was used for accessibility support without giving details about their disability or medical status. Editors and reviewers would not be allowed to demand further personal information.
Transparency and inclusion should not be treated as competing goals, the paper contends. A disclosure system that forces vulnerable researchers to carry a heavier burden is not a stronger integrity system. It is an unequal one.
The analysis also warns against stigmatizing legitimate high-band AI use. Some fields, especially computational research, large-scale text analysis and data-intensive social science, naturally rely on automation. A high AI-use classification may mean the work needs more verification, not that it is less scholarly. AIR's risk coding could be misread as a moral warning if institutions do not separate verification burden from judgments about whether a method is appropriate.
The author thus recommends replacing simple risk labels with a more nuanced system that distinguishes how much checking is required from whether the AI use is acceptable in context. For example, AI-generated code may require careful validation, but it may still be entirely appropriate in a computational study. The concern is not AI use itself, but whether the researcher can verify, explain and take responsibility for the work.
The paper also highlights the risk of adversarial compliance. Because AIR depends heavily on self-reporting, researchers who want to hide substantial AI involvement may classify it as minor support. AI text detectors are unreliable, and universal audit trails would raise privacy and workload problems. As a partial solution, the study proposes periodic spot-check validation, in which a sample of published papers with AIR disclosures would be independently reviewed to see whether classifications align with trained judgments. The goal would be education and system improvement, not routine punishment.
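The spot-check idea amounts to drawing a random sample of published papers for independent review. The sketch below shows one reproducible way to draw such a sample in Python. The paper identifiers, the sampling fraction and the function name are all hypothetical, and the article does not specify how large a sample the study envisages or how it would be drawn.

```python
import random


def draw_spot_check_sample(paper_ids, fraction, seed):
    """Pick a reproducible random sample of papers for independent review.

    fraction is the share of papers to review (an assumed parameter);
    seed makes the draw repeatable so the selection can be audited.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    pool = sorted(set(paper_ids))  # stable order, duplicates removed
    if not pool:
        return []
    sample_size = max(1, round(len(pool) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(pool, sample_size))


if __name__ == "__main__":
    # Hypothetical identifiers for 40 published papers with AIR disclosures.
    papers = [f"paper-{i:03d}" for i in range(40)]
    print(draw_spot_check_sample(papers, fraction=0.1, seed=2024))
```

Fixing the seed keeps the selection auditable, which fits a process aimed at education and system improvement rather than at singling out individual authors.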
The final recommendation is a community-maintained repository of edge cases. As AI tools evolve, researchers will encounter situations that no static guideline can fully anticipate. A shared database of anonymized examples, reviewed by experts from research integrity, publishing, accessibility and disciplinary communities, could help institutions apply AIR more consistently over time.
First published in: Devdiscourse