Large Language Models Show Promise in Automating Evidence Synthesis, Says ADB
Researchers at the Asian Development Bank found that advanced AI models like Gemini 2.5 Pro can significantly speed up systematic reviews by accurately extracting qualitative information from scientific papers. However, the study warns that AI still struggles with complex statistical data, meaning human oversight remains essential for reliable meta-analysis.
As the volume of scientific research continues to grow at record speed, researchers are struggling to keep up. Systematic reviews and meta-analyses, which combine evidence from many studies to guide policymaking and scientific understanding, can take more than a year to complete. Much of that time is spent manually reading papers and extracting data.
Now, researchers at the Asian Development Bank (ADB) are exploring whether artificial intelligence can help speed up the process. In a new study, ADB researchers Aditya Retnanto, Yohan Iddawela, and Elaine S. Tan tested several advanced large language models (LLMs), including Gemini 2.5 Pro, GPT-5, Claude Sonnet 4, and DeepSeek-R1, to see how well they could extract information from full scientific articles.
The goal was simple but ambitious: determine whether AI can help researchers review large amounts of scientific evidence faster and more efficiently.
Testing AI on Health and Education Research
To evaluate the models, the researchers built an automated system that converts PDF research papers into text and then asks AI models to identify important information using detailed coding instructions.
The team tested the system on two very different sets of studies. The first involved mobile health interventions for noncommunicable diseases, including apps, wearable devices, and text-message-based health programs aimed at improving physical activity. The second focused on education research during the COVID-19 pandemic, including studies on learning loss, online learning, and tutoring programs.
The AI models were asked to extract both qualitative information, such as country, age group, intervention type, and study design, and quantitative information like sample sizes, means, standard deviations, and confidence intervals.
One important feature of the system was a "thinking" column, where the AI models explained their reasoning before giving final answers. According to the researchers, this made it easier to audit the outputs and understand how the models arrived at their conclusions.
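The article does not publish the team's actual prompts or field list, but the workflow it describes, converting a paper to text, prompting a model against a coding manual, and capturing a "thinking" column alongside the answers, can be sketched roughly as follows. The field names, prompt wording, and the stubbed model reply below are all illustrative assumptions, not the study's own code.

```python
import json

# Hypothetical coding-manual fields; the ADB study's real fields are not listed in the article.
CODING_FIELDS = ["country", "age_group", "intervention_type", "study_design"]

def build_prompt(paper_text: str) -> str:
    """Assemble an extraction prompt that asks the model to explain its
    reasoning first ('thinking'), then give one answer per coding field."""
    field_list = ", ".join(CODING_FIELDS)
    return (
        "You are extracting data for a systematic review.\n"
        f"Fields to extract: {field_list}.\n"
        "Reply in JSON with a 'thinking' key explaining your reasoning,\n"
        "followed by one key per field.\n\n"
        f"PAPER TEXT:\n{paper_text}"
    )

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, keeping 'thinking' next to the
    extracted answers so reviewers can audit how each value was derived."""
    data = json.loads(raw)
    return {
        "thinking": data.get("thinking", ""),
        "answers": {field: data.get(field) for field in CODING_FIELDS},
    }

# Stubbed reply standing in for a real LLM call.
fake_reply = json.dumps({
    "thinking": "Methods section names Indonesia, adults aged 40-60, RCT design.",
    "country": "Indonesia",
    "age_group": "40-60",
    "intervention_type": "SMS reminders",
    "study_design": "RCT",
})
result = parse_response(fake_reply)
```

Keeping the reasoning text in the output, rather than discarding it, is what makes the audit step the researchers describe possible: a human reviewer can check the "thinking" entry against the paper before trusting the extracted values.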
Gemini 2.5 Pro Emerges as the Best Performer
Among all the models tested, Gemini 2.5 Pro delivered the strongest overall performance, especially in identifying outcomes and extracting qualitative information. The model consistently performed well in tasks such as identifying diseases, participant groups, countries, intervention types, and assessment tools.
The study found that AI models sometimes identified more outcomes than human reviewers. In the health studies, for example, Gemini 2.5 Pro extracted an average of more than seven outcomes per paper, while manual reviewers identified fewer than two on average.
The researchers say this difference highlights how human reviewers often apply subjective judgment and undocumented filtering rules, while AI models follow the coding instructions more strictly.
Other models, including GPT-5 and Sonnet 4.0, also showed strong performance in text-based extraction tasks. Meanwhile, lower-cost models like Llama 4 Maverick processed papers very quickly but produced weaker results overall.
Quantitative Data Remains a Major Challenge
While the AI systems performed well with text-based information, they struggled with numerical data extraction.
The models often failed to correctly interpret tables, mixed up statistical values, or selected numbers from the wrong sections of papers. In many cases, they could not accurately calculate effect sizes when only raw numerical data was available.
For example, some models extracted already-reported confidence intervals instead of performing the calculations required by the coding manual. Others confused baseline sample sizes with endline sample sizes.
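The calculation the coding manual asks for here is mechanical once the raw statistics are extracted correctly. As an illustration only (not the study's own method), a standardized mean difference such as Cohen's d can be computed from group means, standard deviations, and sample sizes using the pooled standard deviation:

```python
import math

def cohens_d(m1: float, s1: float, n1: int,
             m2: float, s2: float, n2: int) -> float:
    """Cohen's d: standardized mean difference between two groups,
    computed from raw summary statistics with a pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Illustrative numbers: treatment mean 52 (SD 10, n 100) vs control mean 48 (SD 10, n 100).
# Pooled SD is 10, so d = (52 - 48) / 10 = 0.4.
d = cohens_d(52, 10, 100, 48, 10, 100)
```

The failure mode the researchers report is that models bypassed this step, copying a confidence interval already printed in the paper, or pairing a baseline n with an endline mean, rather than deriving the effect size from the raw values the manual specifies.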
The researchers concluded that current LLMs are not yet reliable enough to independently handle complex statistical extraction needed for meta-analyses. Human oversight remains essential, especially when dealing with quantitative research data.
AI Can Assist Researchers, But Not Replace Them
Despite the limitations, the study paints an optimistic picture of AI-assisted evidence synthesis. The researchers believe LLMs can already help researchers save significant time during screening and qualitative coding tasks.
However, the study also stresses that better coding manuals are needed. Human experts often rely on implicit knowledge when reviewing studies, while AI systems interpret instructions literally. The researchers argue that future coding guides must become much more detailed and step-by-step if AI systems are to perform reliably.
The paper concludes that AI should currently be viewed as a powerful assistant rather than a replacement for expert reviewers. With proper supervision, transparency, and carefully designed workflows, LLMs could become valuable tools for managing the growing flood of scientific research and helping policymakers access evidence more quickly and efficiently.
First published in: Devdiscourse