How vision-language models are improving AI’s understanding of medical images
Medical imaging is one of the most critical tools in modern healthcare, enabling doctors to diagnose diseases, monitor conditions, and guide treatment plans. Recent advances in Artificial Intelligence (AI) and Large Language Models (LLMs) have introduced a revolutionary approach to Medical Visual Question Answering (MedVQA) - a system that allows clinicians to ask questions about medical images and receive intelligent, AI-generated answers.
A new study, "Generative Models in Medical Visual Question Answering: A Survey," by Wenjie Dong, Shuhao Shen, Yuqiang Han, Tao Tan, Jian Wu, and Hongxia Xu, published in Applied Sciences (2025, 15, 2983), explores how generative models are reshaping MedVQA systems. The study examines the shift from discriminative models - which rely on selecting predefined answers - to generative models that use LLMs and multimodal learning to provide more flexible, free-text responses. It highlights advancements in vision-language pretraining, instruction tuning, and fine-tuning strategies that enhance AI-powered medical reasoning.
The shift from discriminative to generative MedVQA
Traditional MedVQA systems primarily used discriminative models, which functioned as classifiers, selecting answers from a fixed set of choices. While effective in simple tasks such as "Is there a tumor?" or "What organ is shown?", these models struggled with complex, open-ended medical queries that required detailed reasoning.
The latest advancements in generative AI have led to a paradigm shift in MedVQA, allowing AI models to generate complete answers rather than select from a predefined list. Instead of rigid classification-based approaches, generative MedVQA leverages deep learning techniques such as autoregressive decoding, transformer-based architectures, and multimodal large language models (MLLMs). These methods enable AI to process images, understand questions, and generate medically relevant responses in real time.
For example, early models like CGMVQA (2020) and MedFuseNet (2021) integrated generative features with classifiers. However, it was not until 2023 and 2024 that generative MedVQA saw rapid progress, driven by breakthroughs in vision–language pretraining (VLP) and multimodal AI models like GPT-4, LLaVA-Med, and Med-Flamingo. These models now combine medical image interpretation with advanced natural language processing, significantly improving AI-assisted clinical decision-making.
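To make the contrast concrete, here is a minimal, untrained PyTorch sketch (not code from the survey) of the two answer styles: a discriminative head that picks one label from a fixed answer set versus a generative head that decodes a free-text answer token by token. The fused image–question embedding, the tiny vocabulary, and the start token are all placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder fused image+question embedding, standing in for the output of a
# real multimodal encoder (ViT + LLM); values are random and nothing is trained.
fused = torch.randn(1, 16, 64)                      # (batch, tokens, hidden size)
vocab = ["<bos>", "no", "yes", "tumor", "left", "lung", "<eos>"]  # toy vocabulary

# Discriminative MedVQA: score a fixed set of candidate answers and pick one.
classifier = nn.Linear(64, len(vocab))
answer_id = int(classifier(fused.mean(dim=1)).argmax(dim=-1))
print("discriminative answer:", vocab[answer_id])

# Generative MedVQA: autoregressively decode a free-text answer, token by token.
decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
embed = nn.Embedding(len(vocab), 64)
lm_head = nn.Linear(64, len(vocab))

tokens = [0]                                        # start from the <bos> token
for _ in range(8):                                  # greedy decoding loop
    tgt = embed(torch.tensor([tokens]))             # embed the answer so far
    logits = lm_head(decoder(tgt, memory=fused))[:, -1]
    next_id = int(logits.argmax(dim=-1))
    tokens.append(next_id)
    if vocab[next_id] == "<eos>":
        break
print("generative answer:", " ".join(vocab[t] for t in tokens[1:]))
```

Because nothing here is trained, the printed answers are arbitrary; the point is the structural difference between a single argmax over a closed answer list and an open-ended decoding loop.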
How generative AI is improving medical imaging interpretation
Generative MedVQA models use a four-step approach to process medical images and provide AI-generated answers (a simplified code sketch of the first three steps follows the list):
- Image Feature Extraction – AI-powered models extract features from X-rays, CT scans, MRIs, and pathology slides using vision transformers (ViTs) and convolutional neural networks (CNNs).
- Textual Understanding – Large language models, such as LLaMA, GPT, and BioMedBERT, process the medical question and extract key terms relevant to diagnosis.
- Multimodal Fusion – AI combines image and text features to create a context-aware response, ensuring clinical accuracy.
- Answer Generation – Instead of selecting an answer from a list, the model creates a full-text response based on AI-driven medical reasoning, providing detailed explanations for clinical decisions.
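As a rough illustration of steps 1–3 (a sketch under assumed shapes, not the survey's own code), the snippet below extracts patch-level image features with a tiny CNN standing in for a ViT/CNN backbone, embeds placeholder question tokens in place of an LLM encoder, and fuses the two with cross-attention. The fused output is what an autoregressive decoder, like the loop sketched earlier, would consume in step 4.

```python
import torch
import torch.nn as nn

# --- Step 1: image feature extraction (tiny CNN stands in for a real ViT/CNN backbone) ---
image = torch.randn(1, 1, 224, 224)                 # placeholder grayscale X-ray tensor
cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((7, 7)),
)
img_feats = cnn(image).flatten(2).transpose(1, 2)   # (1, 49, 64): one vector per spatial patch

# --- Step 2: textual understanding (an embedding layer stands in for an LLM encoder) ---
question_ids = torch.tensor([[3, 17, 42, 8]])       # placeholder token ids for a question
txt_feats = nn.Embedding(1000, 64)(question_ids)    # (1, 4, 64)

# --- Step 3: multimodal fusion via cross-attention (question tokens attend to image patches) ---
cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
fused, attn_weights = cross_attn(query=txt_feats, key=img_feats, value=img_feats)
print("fused representation:", fused.shape)         # (1, 4, 64), ready for an answer decoder

# --- Step 4: answer generation would feed `fused` into an autoregressive decoder,
#     as in the decoding loop sketched earlier in this article. ---
```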
For instance, Med-Flamingo, a multimodal AI model, integrates GPT-based language understanding with a ViT image encoder to answer complex diagnostic queries with high accuracy and medical relevance. Meanwhile, LLaVA-Med enhances instruction tuning, enabling doctors to refine AI-generated responses based on specific medical guidelines.
Challenges in deploying generative MedVQA models
Despite their potential, generative MedVQA models face several challenges in real-world applications. The study identifies key obstacles and possible solutions:
- Data Limitations – AI models require massive amounts of high-quality, annotated medical images. While datasets like VQA-RAD, MIMIC-CXR-VQA, and PMC-VQA provide training material, data scarcity remains a challenge, especially for rare diseases. Researchers are developing synthetic dataset generation techniques using GPT-based augmentation and AI-assisted medical annotations to address this gap.
- Hallucination Risks – One of the biggest challenges with generative AI is the risk of producing inaccurate or misleading medical information. Unlike traditional models that rely on structured outputs, generative models can "hallucinate" incorrect diagnoses if not properly trained. The study suggests integrating retrieval-augmented generation (RAG) frameworks, which allow AI models to reference verified medical knowledge sources (e.g., PubMed, clinical databases) before generating answers; a minimal sketch of this retrieve-then-generate pattern appears after the list.
- Lack of Clinical Trust – Doctors and medical professionals hesitate to rely on AI-generated responses without human verification. The study emphasizes the need for explainable AI (XAI) techniques, such as Shapley Additive Explanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), to provide transparency and ensure AI-generated answers align with clinical best practices.
- Computational Costs and Scalability – AI models like LLaVA-Med and Med-Flamingo require extensive GPU resources for training and deployment, making them expensive for smaller hospitals and clinics. Researchers are exploring parameter-efficient fine-tuning (PEFT) techniques such as LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) to reduce computational demands while maintaining accuracy; a toy LoRA-style adapter is also sketched after the list.
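The snippet below is a minimal sketch of the retrieve-then-generate pattern behind RAG, assuming a toy in-memory knowledge base, simple word-overlap retrieval, and a stubbed generate_answer function in place of a real multimodal model call; production systems would use proper retrievers over verified clinical sources such as PubMed.

```python
# Minimal retrieval-augmented generation (RAG) pattern: look up verified reference
# text first, then condition generation on it so answers stay grounded in trusted sources.
# The knowledge base and the generate_answer stub are illustrative placeholders.

KNOWLEDGE_BASE = [
    "Pleural effusion appears as blunting of the costophrenic angle on chest X-ray.",
    "Cardiomegaly is suggested when the cardiothoracic ratio exceeds 0.5 on a PA film.",
    "Pneumothorax shows a visible visceral pleural line with absent lung markings beyond it.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Rank reference snippets by simple word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def generate_answer(prompt: str) -> str:
    """Placeholder for a call to a generative MedVQA model (e.g., an MLLM API)."""
    return f"[model response conditioned on a prompt of {len(prompt)} characters]"

question = "Does this chest X-ray show a pleural effusion?"
evidence = retrieve(question)
prompt = "Verified reference:\n" + "\n".join(evidence) + f"\n\nQuestion: {question}\nAnswer:"
print(generate_answer(prompt))
```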
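And here is a from-scratch illustration of the LoRA idea in PyTorch: the original weight matrix is frozen and only a small low-rank update is trained, which is why PEFT cuts the number of trainable parameters so sharply. The layer size, rank, and scaling are assumptions chosen for illustration; real deployments typically rely on libraries such as Hugging Face PEFT rather than hand-rolled adapters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + scale * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze the original weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Wrap one (arbitrary) 4096x4096 projection layer and compare parameter counts.
base = nn.Linear(4096, 4096)                        # stands in for one attention projection
lora = LoRALinear(base, rank=8)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"trainable params: {trainable:,} of {total:,} "
      f"({100 * trainable / total:.2f}%)")          # roughly 0.4% of the layer is trained
```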
Future of AI-powered medical visual question answering
The integration of AI and multimodal language models into medical imaging analysis marks a new era in healthcare. The study predicts that future MedVQA systems will become even more sophisticated, with enhancements such as:
- Real-time AI-Assisted Diagnostics – Future AI models will integrate with hospital imaging systems to provide instant diagnostic insights, reducing radiologist workload and improving patient outcomes.
- Cross-Modality AI Models – Instead of being trained only on X-rays or MRIs, future MedVQA systems will analyze multiple imaging types (CT, ultrasound, pathology) in a unified framework, improving accuracy across different medical fields.
- Personalized AI Responses – AI-powered MedVQA systems will tailor answers based on individual patient data, ensuring personalized diagnostics and treatment recommendations.
- Regulatory-Compliant AI Models – As AI becomes more integrated into healthcare, governments and medical institutions will implement regulations for AI-driven diagnostics, ensuring ethical and transparent use of AI in medicine.
The future of AI-powered medical imaging looks promising; however, challenges such as data scarcity, hallucination risks, and regulatory concerns must be addressed before such generative models can be fully deployed in hospitals and clinics worldwide.
FIRST PUBLISHED IN: Devdiscourse

