Large Language Models Show Promise in Automating Evidence Synthesis, Says ADB
Researchers at the Asian Development Bank found that advanced AI models like Gemini 2.5 Pro can significantly speed up systematic reviews by accurately extracting qualitative information from scientific papers. However, the study warns that AI still struggles with complex statistical data, meaning human oversight remains essential for reliable meta-analysis.
As the volume of scientific research continues to grow at record speed, researchers are struggling to keep up. Systematic reviews and meta-analyses, which combine evidence from many studies to guide policymaking and scientific understanding, can take more than a year to complete. Much of that time is spent manually reading papers and extracting data.
Now, researchers at the Asian Development Bank (ADB) are exploring whether artificial intelligence can help speed up the process. In a new study, ADB researchers Aditya Retnanto, Yohan Iddawela, and Elaine S. Tan tested several advanced large language models (LLMs), including Gemini 2.5 Pro, GPT-5, Claude Sonnet 4, and DeepSeek-R1, to see how well they could extract information from full scientific articles.
The goal was simple but ambitious: determine whether AI can help researchers review large amounts of scientific evidence faster and more efficiently.
Testing AI on Health and Education Research
To evaluate the models, the researchers built an automated system that converts PDF research papers into text and then asks AI models to identify important information using detailed coding instructions.
The team tested the system on two very different sets of studies. The first involved mobile health interventions for noncommunicable diseases, including apps, wearable devices, and text-message-based health programs aimed at improving physical activity. The second focused on education research during the COVID-19 pandemic, including studies on learning loss, online learning, and tutoring programs.
The AI models were asked to extract both qualitative information, such as country, age group, intervention type, and study design, and quantitative information like sample sizes, means, standard deviations, and confidence intervals.
One important feature of the system was a "thinking" column, where the AI models explained their reasoning before giving final answers. According to the researchers, this made it easier to audit the outputs and understand how the models arrived at their conclusions.
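The article does not publish the team's actual prompts or field list, but the workflow it describes, converting a paper to text, prompting a model against a coding manual, and capturing a "thinking" column alongside the answers, can be sketched roughly as follows. The field names, prompt wording, and the stubbed model reply below are all illustrative assumptions, not the study's own code.

```python
import json

# Hypothetical coding-manual fields; the ADB study's real fields are not listed in the article.
CODING_FIELDS = ["country", "age_group", "intervention_type", "study_design"]

def build_prompt(paper_text: str) -> str:
    """Assemble an extraction prompt that asks the model to explain its
    reasoning first ('thinking'), then give one answer per coding field."""
    field_list = ", ".join(CODING_FIELDS)
    return (
        "You are extracting data for a systematic review.\n"
        f"Fields to extract: {field_list}.\n"
        "Reply in JSON with a 'thinking' key explaining your reasoning,\n"
        "followed by one key per field.\n\n"
        f"PAPER TEXT:\n{paper_text}"
    )

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, keeping 'thinking' next to the
    extracted answers so reviewers can audit how each value was derived."""
    data = json.loads(raw)
    return {
        "thinking": data.get("thinking", ""),
        "answers": {field: data.get(field) for field in CODING_FIELDS},
    }

# Stubbed reply standing in for a real LLM call.
fake_reply = json.dumps({
    "thinking": "Methods section names Indonesia, adults aged 40-60, RCT design.",
    "country": "Indonesia",
    "age_group": "40-60",
    "intervention_type": "SMS reminders",
    "study_design": "RCT",
})
result = parse_response(fake_reply)
```

Keeping the reasoning text in the output, rather than discarding it, is what makes the audit step the researchers describe possible: a human reviewer can check the "thinking" entry against the paper before trusting the extracted values.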
Gemini 2.5 Pro Emerges as the Best Performer
Among all the models tested, Gemini 2.5 Pro delivered the strongest overall performance, especially in identifying outcomes and extracting qualitative information. The model consistently performed well in tasks such as identifying diseases, participant groups, countries, intervention types, and assessment tools.
The study found that AI models sometimes identified more outcomes than human reviewers. In the health studies, for example, Gemini 2.5 Pro extracted an average of more than seven outcomes per paper, while manual reviewers identified fewer than two on average.
The researchers say this difference highlights how human reviewers often apply subjective judgment and undocumented filtering rules, while AI models follow the coding instructions more strictly.
Other models, including GPT-5 and Sonnet 4.0, also showed strong performance in text-based extraction tasks. Meanwhile, lower-cost models like Llama 4 Maverick processed papers very quickly but produced weaker results overall.
Quantitative Data Remains a Major Challenge
While the AI systems performed well with text-based information, they struggled with numerical data extraction.
The models often failed to correctly interpret tables, mixed up statistical values, or selected numbers from the wrong sections of papers. In many cases, they could not accurately calculate effect sizes when only raw numerical data was available.
For example, some models extracted already-reported confidence intervals instead of performing the calculations required by the coding manual. Others confused baseline sample sizes with endline sample sizes.
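The calculation the coding manual asks for here is mechanical once the raw statistics are extracted correctly. As an illustration only (not the study's own method), a standardized mean difference such as Cohen's d can be computed from group means, standard deviations, and sample sizes using the pooled standard deviation:

```python
import math

def cohens_d(m1: float, s1: float, n1: int,
             m2: float, s2: float, n2: int) -> float:
    """Cohen's d: standardized mean difference between two groups,
    computed from raw summary statistics with a pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Illustrative numbers: treatment mean 52 (SD 10, n 100) vs control mean 48 (SD 10, n 100).
# Pooled SD is 10, so d = (52 - 48) / 10 = 0.4.
d = cohens_d(52, 10, 100, 48, 10, 100)
```

The failure mode the researchers report is that models bypassed this step, copying a confidence interval already printed in the paper, or pairing a baseline n with an endline mean, rather than deriving the effect size from the raw values the manual specifies.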
The researchers concluded that current LLMs are not yet reliable enough to independently handle complex statistical extraction needed for meta-analyses. Human oversight remains essential, especially when dealing with quantitative research data.
AI Can Assist Researchers, But Not Replace Them
Despite the limitations, the study paints an optimistic picture of AI-assisted evidence synthesis. The researchers believe LLMs can already help researchers save significant time during screening and qualitative coding tasks.
However, the study also stresses that better coding manuals are needed. Human experts often rely on implicit knowledge when reviewing studies, while AI systems interpret instructions literally. The researchers argue that future coding guides must become much more detailed and step-by-step if AI systems are to perform reliably.
The paper concludes that AI should currently be viewed as a powerful assistant rather than a replacement for expert reviewers. With proper supervision, transparency, and carefully designed workflows, LLMs could become valuable tools for managing the growing flood of scientific research and helping policymakers access evidence more quickly and efficiently.
First published in: Devdiscourse