GPT-4 Vision performs like medical trainee in dermatology

CO-EDP, VisionRI | Updated: 07-05-2025 18:23 IST | Created: 07-05-2025 18:23 IST

The intersection of generative AI and clinical medicine continues to deepen, with dermatology emerging as a key domain for testing real-world applications of large language models. Now, researchers at the University of Calgary have published a pioneering evaluation of GPT-4 Vision (GPT-4V), OpenAI’s multimodal AI system capable of processing both images and text. The study, “Evaluating the Diagnostic and Treatment Capabilities of GPT-4 Vision in Dermatology: A Pilot Study,” explores the accuracy of the model across diagnostic and treatment tasks using textual, visual, and combined inputs.

What emerged was a mixed picture: GPT-4V excelled at interpreting clinical text and offering reasonable treatment suggestions but struggled to reliably diagnose dermatological conditions using images alone. The research presents a nuanced view of the model’s capabilities and limitations, particularly as AI systems move closer to integration with clinical workflows.

How does GPT-4V perform in diagnosing skin conditions?

The researchers tested GPT-4V across three input scenarios using nine common dermatological conditions. These included acne, rosacea, psoriasis, melanoma, basal and squamous cell carcinomas, atopic dermatitis, actinic keratosis, and vitiligo. In the image-only setup, GPT-4V correctly identified the primary diagnosis for just 54% of images. This accuracy improved dramatically in both text-only and multimodal (image plus text) scenarios, where it achieved an 89% correct diagnosis rate.
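For readers curious what this three-condition setup might look like in code, the sketch below shows one way to send image-only, text-only, or combined prompts to a GPT-4 Vision-class model through the OpenAI Python SDK. The model name, prompt wording, and helper function are illustrative assumptions, not the study's actual protocol.

    # Illustrative sketch only: the study's exact prompts, model version, and
    # parameters are not published in this article; names below are assumptions.
    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    MODEL = "gpt-4-vision-preview"  # assumed multimodal model identifier

    def encode_image(path: str) -> str:
        """Read a local lesion photo and return a base64 data URL."""
        with open(path, "rb") as f:
            return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

    def diagnose(image_path: str | None, scenario: str | None) -> str:
        """Ask for a single most-likely dermatological diagnosis.

        image_path only -> image-only condition
        scenario only   -> text-only condition
        both            -> multimodal (image plus scenario) condition
        """
        content = []
        if scenario:
            content.append({"type": "text", "text": scenario})
        if image_path:
            content.append({"type": "image_url",
                            "image_url": {"url": encode_image(image_path)}})
        content.append({"type": "text",
                        "text": "What is the single most likely diagnosis?"})

        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": content}],
        )
        return response.choices[0].message.content

In the study's framing, the same nine conditions would be run through each of the three input modes and the answers scored against the known diagnosis; the 54% and 89% figures summarize that comparison.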

Rosacea was consistently diagnosed correctly from images, while atopic dermatitis and squamous cell carcinoma proved more difficult for the model to classify accurately. Perhaps more revealing was GPT-4V’s apparent preference for textual input over image data. In cases where image interpretation failed, the presence of a clear clinical scenario often allowed the model to recover and correctly identify the condition. However, when a correctly diagnosed image was paired with a misleading text description, the model followed the text and got the diagnosis wrong—highlighting a modality prioritization that skews heavily toward language.

The researchers noted that GPT-4V may not have been fine-tuned on medical images, which could explain its subpar image-based performance. Its vision model was likely trained on broad internet image datasets, lacking the specific granularity needed for medical dermatology. These results also align with broader concerns from OpenAI about the model’s inconsistency in medical imaging interpretation.

Is GPT-4V ready to recommend treatments?

Beyond diagnosis, the study examined GPT-4V’s treatment recommendation capabilities using a modified Entrustment Scale, a five-point instrument adapted from Canadian medical education standards. The AI's treatment suggestions in the image-plus-scenario group earned a higher average score (4.067) than those in the scenario-only group (3.933), a statistically significant difference. This suggests that GPT-4V can reasonably contextualize treatment plans when given multimodal inputs, performing roughly at the level of a senior medical student or early resident, according to the evaluating dermatologists.
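The article does not state which statistical test the authors used. As a hedged illustration of how two groups of ordinal 1-5 entrustment ratings might be compared, the snippet below runs a Mann-Whitney U test on made-up placeholder scores; it is not the study's data or method.

    # Hypothetical illustration: these ratings are invented, and the study's
    # actual statistical test is not stated in this article. A Mann-Whitney U
    # test is one common choice for comparing two groups of ordinal scores.
    from scipy.stats import mannwhitneyu

    scenario_only   = [4, 4, 3, 4, 5, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4]  # placeholder ratings
    image_plus_text = [4, 5, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 3]  # placeholder ratings

    stat, p = mannwhitneyu(image_plus_text, scenario_only, alternative="two-sided")
    print(f"U = {stat}, p = {p:.3f}")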

Still, there were meaningful gaps. For example, GPT-4V occasionally skipped essential steps in treatment workflows, such as failing to recommend biopsy or staging before advanced melanoma therapy. In other cases, its recommendations didn’t fully account for social or geographic factors affecting treatment access. These shortcomings indicate a lack of real-world nuance and clinical depth, despite technically accurate textbook answers.

Additionally, the study pointed out limitations in the evaluation methodology. The Entrustment Scale, while structured, showed only 37.5% agreement between evaluators, reflecting the inherent subjectivity in clinical judgment. The overlap between evaluators and scenario creators also introduced the possibility of confirmation bias in the assessment process.
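To make that 37.5% figure concrete, raw percent agreement simply counts how often two evaluators assigned the same score to the same case. The snippet below computes it on hypothetical ratings chosen only to reproduce the reported proportion.

    # Hypothetical ratings: two evaluators scoring the same 8 cases on the
    # 5-point entrustment scale. Raw agreement = identical scores / total cases.
    rater_a = [4, 3, 5, 4, 2, 4, 3, 5]
    rater_b = [4, 4, 5, 3, 2, 3, 4, 4]

    agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    print(f"Raw agreement: {agreement:.1%}")  # 3/8 = 37.5% with these made-up scores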

What are the broader implications and risks?

The study’s findings underscore a wider concern in the field of AI in medicine: the potential for racial and data bias. Most dermatological training datasets, including those likely used in GPT-4V's pretraining, are disproportionately composed of images of lighter skin tones. This bias can translate into less accurate diagnostic performance for richly pigmented skin, exacerbating existing disparities in care. Prior studies have already demonstrated that deep learning models trained on biased datasets can systematically underperform when assessing diverse skin types.

The authors caution that without targeted improvements, including the integration of more representative training datasets and domain-specific fine-tuning, GPT-4V may inadvertently reinforce these disparities. They also call for robust regulatory frameworks to ensure AI tools are tested rigorously across different demographics and clinical conditions before being deployed in practice.

The study additionally emphasizes the importance of clinician oversight when using AI tools. GPT-4V, despite its capabilities, is not ready to function autonomously in clinical dermatology. Instead, it could serve as a useful assistive tool, helping with triage, patient education, or literature review, provided its limitations are clearly understood.

  • FIRST PUBLISHED IN: Devdiscourse