Agent-based AI outshines leading medical models in disease detection
While expert-tuned models like VisionUnite and RetiZero were trained on millions of ophthalmological cases, MedAgent-Pro’s modular structure achieved superior results in zero-shot settings, without requiring disease-specific fine-tuning. This underscores the efficiency and generalizability of the agentic design.

A groundbreaking study titled "MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow" has introduced a next-generation medical AI system, MedAgent-Pro, that significantly surpasses existing multi-modal large language models (MLLMs) in diagnostic accuracy, particularly in identifying glaucoma and heart disease. Developed through a collaboration between the National University of Singapore and the University of Oxford, the system integrates agent-based reasoning with quantitative visual analysis, offering a new standard in AI-driven diagnostics.
MedAgent-Pro addresses longstanding deficiencies in MLLMs, such as poor visual perception, reasoning hallucinations, and a lack of explainability in medical image interpretation. Unlike traditional end-to-end models that rely heavily on a single language-based engine, MedAgent-Pro adopts a modular agentic workflow. It separates diagnosis into two hierarchical stages: knowledge-based planning at the task level and multi-tool, evidence-based execution at the case level.
At the heart of MedAgent-Pro is a sophisticated orchestration of specialized agents. The Retrieval-Augmented Generation (RAG) agent sources clinical guidelines and medical literature, ensuring diagnostic plans align with established standards. A Planner Agent, built on GPT-4o, structures diagnostic workflows into actionable sequences tailored for each disease. At the case level, an Orchestrator Agent selects diagnostic steps based on available patient data, which may include 2D retinal images, 3D echocardiograms, and clinical measurements.
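The case-level selection step can be pictured as filtering a task-level plan against the data actually present for a patient. The sketch below is a hypothetical illustration of that idea, not the authors' code; the plan structure and the `select_steps` helper are assumptions:

```python
def select_steps(plan, available_data):
    """Case-level step selection: keep only the diagnostic steps whose
    required inputs are present in this patient's record.
    (Hypothetical helper illustrating the Orchestrator Agent's role.)
    """
    return [step for step in plan if set(step["inputs"]) <= available_data]

# A toy task-level plan mixing ophthalmic and cardiac steps.
plan = [
    {"name": "segment optic disc and cup", "inputs": ["fundus_image"]},
    {"name": "compute vertical cup-to-disc ratio", "inputs": ["fundus_image"]},
    {"name": "measure left ventricular ejection fraction", "inputs": ["echo_3d"]},
]

# A patient with only a retinal image gets only the two fundus steps.
selected = select_steps(plan, {"fundus_image"})
print([s["name"] for s in selected])
```

In the actual system this selection is performed by an LLM-driven Orchestrator Agent rather than a simple set check, but the effect is the same: only steps executable on the available modalities reach the tool agents.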
These inputs are then processed by Tool Agents such as segmentation models, classification networks, and vision-language models. A Coding Agent calculates key indicators like vertical cup-to-disc ratio for glaucoma or left ventricular ejection fraction for heart disease. Finally, a Decider Agent integrates the outputs into a clinically explainable diagnosis, supported by visual evidence and medical literature references.
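The Coding Agent's role is deterministic measurement on top of the Tool Agents' outputs. As an illustration of the kind of computation involved, the sketch below derives a vertical cup-to-disc ratio (vCDR) from binary segmentation masks; the function and the toy masks are assumptions for demonstration, not the paper's implementation:

```python
import numpy as np

def vertical_cup_to_disc_ratio(cup_mask: np.ndarray, disc_mask: np.ndarray) -> float:
    """Compute the vertical cup-to-disc ratio from binary segmentation
    masks of the optic cup and optic disc: the vertical pixel extent of
    the cup divided by that of the disc. Elevated values are a standard
    glaucoma indicator."""
    def vertical_extent(mask: np.ndarray) -> int:
        rows = np.any(mask, axis=1)           # rows containing any mask pixel
        if not rows.any():
            return 0
        idx = np.where(rows)[0]
        return int(idx[-1] - idx[0] + 1)      # top-to-bottom span in pixels

    disc_height = vertical_extent(disc_mask)
    if disc_height == 0:
        raise ValueError("empty disc mask")
    return vertical_extent(cup_mask) / disc_height

# Toy masks: disc spans 8 rows, cup spans 4 rows -> vCDR = 0.5.
disc = np.zeros((12, 12), dtype=bool)
disc[2:10, 3:9] = True
cup = np.zeros((12, 12), dtype=bool)
cup[4:8, 4:8] = True
print(vertical_cup_to_disc_ratio(cup, disc))  # 0.5
```

A left ventricular ejection fraction would be computed analogously from end-diastolic and end-systolic volumes extracted from 3D echocardiography segmentations.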
Two decider approaches were tested: one relying on a large language model (LLM) and another using a Mixture-of-Experts (MOE) mechanism. In benchmark tests using the REFUGE2 and MITEA datasets, MedAgent-Pro achieved a mean accuracy (mACC) of 90.4% and an F1 score of 76.4% for glaucoma diagnosis using the MOE decider—far outperforming leading ophthalmology MLLMs like VisionUnite and RetiZero. In comparison, GPT-4o and LLaVA-Med either refused to provide diagnoses or classified most patients as healthy regardless of symptoms.
For heart disease, where diagnosis is based on analysis of 3D echocardiography images, MedAgent-Pro again proved dominant. The MOE decider achieved an mACC of 66.8% and an F1 score of 52.6%, significantly higher than baseline models. The agentic framework’s ability to process multi-modal inputs (visual, clinical, and textual) and combine them with medical protocols sets it apart from both general-purpose and domain-specific alternatives.
Case studies included in the research highlight the model’s explainability. In a representative glaucoma diagnosis, the system evaluated four key indicators: vertical cup-to-disc ratio, rim thickness, peripapillary atrophy, and optic disc hemorrhages. Segmentation tools calculated anatomical dimensions, while VQA models examined visual anomalies. The coding agent computed precise metrics, and the decider, applying weights derived from literature, produced a diagnosis with a documented rationale. The case-level reasoning included direct references to seminal studies such as Kass (1994) and Hood et al. (2013), reflecting a commitment to evidence-based medicine.
This structured reasoning framework addresses one of the most critical barriers to the clinical adoption of MLLMs: trust. While prior models frequently generated unverifiable or inconsistent results, MedAgent-Pro offers transparent, reproducible workflows that mirror human diagnostic processes. Its reliance on retrieved medical literature, combined with quantitative assessment from trained models, positions it as a potential gold standard in AI-assisted healthcare.
The ablation studies conducted by the research team further validate the modular approach. When relying solely on a single indicator like the vertical cup-to-disc ratio, the best accuracy reached 81.7%. However, when all four glaucoma indicators were considered via the MOE decider, the mACC rose to 90.4%. In contrast, the LLM decider showed declining performance with increased indicator complexity, confirming its limitations in multi-factorial integration.
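A Mixture-of-Experts decider of the kind described can be approximated as a weighted vote over per-indicator predictions, with weights reflecting each indicator's clinical importance. The sketch below is a simplified illustration under that assumption; the specific probabilities, weights, and the `moe_decide` helper are invented for demonstration:

```python
def moe_decide(indicator_probs, weights, threshold=0.5):
    """Combine per-indicator disease probabilities with a weighted vote,
    in the spirit of a Mixture-of-Experts decider. Weights (e.g. derived
    from the clinical literature) need not sum to 1; they are normalized
    over the indicators actually available."""
    total_w = sum(weights[k] for k in indicator_probs)
    score = sum(indicator_probs[k] * weights[k] for k in indicator_probs) / total_w
    return ("glaucoma" if score >= threshold else "healthy"), score

# Hypothetical per-indicator outputs for one patient.
probs = {
    "vCDR": 0.9,
    "rim_thickness": 0.7,
    "peripapillary_atrophy": 0.4,
    "disc_hemorrhage": 0.2,
}
weights = {
    "vCDR": 0.4,
    "rim_thickness": 0.3,
    "peripapillary_atrophy": 0.2,
    "disc_hemorrhage": 0.1,
}
label, score = moe_decide(probs, weights)
print(label, round(score, 2))  # glaucoma 0.67
```

The ablation result is intuitive under this framing: a single strong indicator caps accuracy, while weighting several partially independent indicators averages out individual errors, which is where the LLM decider struggled.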
The researchers expect future studies to focus on scaling the system to cover a broader range of diseases and integrating human-in-the-loop feedback from medical professionals. They also plan to expand the dataset size, add more modalities, and explore real-time clinical applications. The team aims to enhance clinical validation and bridge the gap between AI research and frontline healthcare delivery.
FIRST PUBLISHED IN: Devdiscourse