How Generative AI Is Reshaping Systematic Reviews: Strengths, Weaknesses and Reality

Generative AI shows strong potential to streamline structured tasks in systematic reviews—such as PICO formulation, data extraction and parts of risk-of-bias assessment—but remains unreliable for literature searching and study selection. Overall, it is a useful assistant but not yet capable of replacing human reviewers.


CoE-EDP, VisionRI | Updated: 17-11-2025 09:09 IST | Created: 17-11-2025 09:09 IST

A comprehensive evaluation examines how generative artificial intelligence is transforming systematic reviews in healthcare, drawing on research collaborations spanning China, Japan, the United States, the United Kingdom, Germany, Australia, Denmark, Canada, Korea, and Switzerland, along with several major universities, hospital research units, and health-technology assessment institutes. Across 30 studies published between 2023 and 2025, the analysis reveals both clear strengths and persistent weaknesses: AI tools such as ChatGPT, Claude, Gemini, Llama, and Bing AI can streamline structured review tasks but still struggle with judgment-heavy or domain-specific reasoning.

A Rapid Expansion of Tools and Clinical Fields

The included studies cover a wide range of medical domains, from oncology and diabetic retinopathy to dentistry, chronic pain, radiology, dermatology, psychiatry, exercise therapy, and pediatrics. Researchers evaluated an array of generative AI models, including ChatGPT-3.5, ChatGPT-4 and ChatGPT-4o, Claude-2 and Claude-3, Bard/Gemini, Llama, Mixtral-8x22B, and Bing AI. These tools were tested either prospectively alongside human reviewers or retrospectively against previously published systematic reviews. Their performance was measured using accuracy, sensitivity, specificity, F1-score, and Cohen's kappa coefficients, providing a multi-dimensional view of reliability.
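For readers less familiar with these metrics, the minimal Python sketch below shows how accuracy, sensitivity, specificity, F1-score, and Cohen's kappa are typically computed when an AI screener's include/exclude calls are compared against human reviewers. The decision lists are invented for illustration, not data from the reviewed studies.

```python
# Illustrative only: comparing an AI screener's include/exclude
# decisions against human reviewer decisions (1 = include, 0 = exclude).
# These lists are made-up examples, not data from the reviewed studies.
human = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
ai    = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

# Confusion-matrix counts.
tp = sum(1 for h, a in zip(human, ai) if h == 1 and a == 1)
tn = sum(1 for h, a in zip(human, ai) if h == 0 and a == 0)
fp = sum(1 for h, a in zip(human, ai) if h == 0 and a == 1)
fn = sum(1 for h, a in zip(human, ai) if h == 1 and a == 0)
n = tp + tn + fp + fn

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)   # share of truly relevant studies caught
specificity = tn / (tn + fp)   # share of irrelevant studies correctly rejected
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_observed = accuracy
p_yes = ((tp + fp) / n) * ((tp + fn) / n)   # chance both say "include"
p_no = ((tn + fn) / n) * ((tn + fp) / n)    # chance both say "exclude"
p_expected = p_yes + p_no
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} F1={f1:.2f} kappa={kappa:.2f}")
```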

Strong Performance in PICO but Major Weakness in Search

In tasks involving PICO or PICOT formulation, generative AI performed impressively. ChatGPT-4 consistently generated more accurate, complete, and clinically meaningful PICO elements than earlier versions, and its outputs remained stable when re-tested one month later. Bard showed stronger inter-rater reliability, yet ChatGPT offered richer content overall. The optimism fades, however, at the literature-search stage. Across multiple evaluations, AI-generated search strategies either produced overwhelmingly large but irrelevant result sets or missed most of the key studies included in prior systematic reviews: ChatGPT retrieved as little as 4% of relevant publications in one assessment, while Bing recovered just 5%. Despite advances in language modelling, generative AI still lacks the structured logic and indexing awareness required for reliable biomedical database searching.
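To make figures like that 4% concrete: recall at the search stage is simply the share of a review's known included studies that the AI-generated strategy actually retrieves. The short sketch below illustrates the calculation with hypothetical record IDs, and also shows why a large result set can still score poorly on precision.

```python
# Illustrative search-stage recall calculation. All PubMed-style IDs
# below are hypothetical placeholders, not real records.

# Studies included in a previously published systematic review.
gold_standard = {"PMID:1001", "PMID:1002", "PMID:1003", "PMID:1004",
                 "PMID:1005", "PMID:1006", "PMID:1007", "PMID:1008"}

# Records returned by running the AI-generated search string.
retrieved = {"PMID:1003", "PMID:2001", "PMID:2002", "PMID:2003"}

hits = gold_standard & retrieved
recall = len(hits) / len(gold_standard)          # 1/8 = 12.5% here
precision = len(hits) / len(retrieved) if retrieved else 0.0

print(f"recall={recall:.1%} precision={precision:.1%}")
```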

Study Selection Still Too Risky for Automation

Performance in study selection, title/abstract screening, and full-text eligibility was the most inconsistent. Four studies reported encouraging results, suggesting that with refined prompts or iterative training, AI can reach adequate sensitivity and specificity. A few authors even proposed using AI to replace one human screener. Yet the majority urged caution: ten out of fourteen studies warned that AI models frequently misapply inclusion criteria, misunderstand clinical nuance, and occasionally exclude clearly relevant studies. These errors often stemmed from hallucinated reasoning or gaps in domain knowledge, making AI unreliable for tasks where precision and interpretive judgment are essential.

Data Extraction and Risk Assessment Show Real Promise

Data extraction emerged as one of the strongest use cases. Multiple studies found high accuracy and strong agreement between AI outputs and human-extracted data, particularly with ChatGPT-4, Claude, and newly updated multimodal models. In some cases, ChatGPT-4o demonstrated enough reliability to serve as a “second reviewer,” offering substantial time savings without compromising quality. Performance, however, varied with prompt structure and the complexity of the extracted variables. Risk-of-bias assessment followed a similar pattern: AI handled structured, objective domains well but struggled with nuanced judgments such as selective outcome reporting. Claude-3 variants often outperformed ChatGPT, and hybrid approaches, where humans refine AI-generated assessments, achieved very high levels of agreement.
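As a rough illustration of how such agreement is quantified in a "second reviewer" workflow, the sketch below compares hypothetical AI-extracted and human-extracted records field by field, accepting matches and flagging discrepancies for human adjudication. The studies, field names, and values are invented for the example.

```python
# Illustrative field-level comparison of AI-extracted vs. human-extracted
# data: agreements are accepted, disagreements are flagged for a human
# to adjudicate. All records here are invented for the example.

human_extraction = {
    "study_A": {"n": 120, "intervention": "metformin", "outcome": "HbA1c"},
    "study_B": {"n": 85,  "intervention": "placebo",   "outcome": "HbA1c"},
}
ai_extraction = {
    "study_A": {"n": 120, "intervention": "metformin", "outcome": "HbA1c"},
    "study_B": {"n": 58,  "intervention": "placebo",   "outcome": "HbA1c"},
}

agreements, disagreements = 0, []
for study, fields in human_extraction.items():
    for field, human_value in fields.items():
        ai_value = ai_extraction.get(study, {}).get(field)
        if ai_value == human_value:
            agreements += 1
        else:
            disagreements.append((study, field, human_value, ai_value))

total = agreements + len(disagreements)
print(f"field-level agreement: {agreements}/{total} = {agreements/total:.0%}")
for study, field, hv, av in disagreements:
    print(f"adjudicate {study}.{field}: human={hv!r} ai={av!r}")
```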

Overall, the review concludes that generative AI is best deployed as a supportive partner rather than a replacement for human reviewers. It excels in structured, repetitive tasks but falters in complex reasoning, database searching, and eligibility judgment. The authors advocate for hybrid workflows combining human oversight with AI-assisted acceleration, supported by better prompting standards, domain-specific training, and transparent validation. With further refinement, generative AI may one day transform systematic reviews, but for now, it remains a powerful assistant, not an autonomous reviewer.

First published in: Devdiscourse