AI-powered skin cancer screening: Promising but not perfect

CO-EDP, VisionRI | Updated: 26-02-2025 16:11 IST | Created: 26-02-2025 16:11 IST

Early detection of skin cancer, particularly melanoma, is crucial in improving survival rates and treatment outcomes. With advancements in artificial intelligence (AI), machine learning models are increasingly being explored as potential tools for screening and diagnosis. AI-powered image recognition has shown promise in dermatology, but how well does it perform against clinical standards?

A recent study, "Beyond the Surface: Assessing GPT-4’s Accuracy in Detecting Melanoma and Suspicious Skin Lesions from Dermoscopic Images," authored by Jonah W. Perlmutter, John Milkovich, Sierra Fremont, Shaishav Datta, and Adam Mosa, investigates GPT-4’s ability to detect melanoma and atypical skin lesions using dermoscopic images. Published in Plastic Surgery (2025), the study provides critical insights into the reliability of GPT-4 as a diagnostic tool and its limitations compared to clinical diagnosis.

GPT-4’s performance in detecting melanoma and suspicious lesions

The study evaluated GPT-4’s diagnostic accuracy using 200 dermoscopic images from the PH2 dataset, which includes clinically diagnosed cases of common nevi, atypical nevi, and melanoma. The AI model was tested by uploading images along with a structured prompt to simulate a diagnostic scenario. GPT-4 was then tasked with providing three differential diagnoses for each image, ranked by likelihood. Its performance was analyzed using key diagnostic metrics, including accuracy, sensitivity, and specificity.
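
To make that setup concrete, the sketch below outlines what such an evaluation loop might look like. It is a hypothetical illustration, not the authors' published code: the prompt wording and the `query_model` helper are placeholders, and a real implementation would depend on the specific image-upload API used.

```python
# Hypothetical sketch of the evaluation protocol described above (not the
# authors' code). Each PH2 image is submitted with a structured prompt asking
# for three ranked differential diagnoses, and the top-ranked answer is scored
# against the clinical label recorded in the dataset.

from pathlib import Path

PROMPT = (
    "Review this dermoscopic image and list the three most likely diagnoses, "
    "ranked from most to least likely."
)

def query_model(image_path: Path, prompt: str) -> list[str]:
    """Stand-in for an image-capable model call (e.g., GPT-4 with vision).

    A real implementation would upload the image and parse the model's reply;
    here a fixed answer is returned so the scoring loop can be exercised.
    """
    return ["atypical nevus", "common nevus", "melanoma"]

def top1_accuracy(labelled_images: dict[Path, str]) -> float:
    """labelled_images maps each image path to its clinical diagnosis
    ('common nevus', 'atypical nevus', or 'melanoma')."""
    correct = 0
    for path, clinical_label in labelled_images.items():
        differentials = query_model(path, PROMPT)
        if differentials and differentials[0].lower() == clinical_label.lower():
            correct += 1
    return correct / len(labelled_images)
```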

For melanoma detection, GPT-4 demonstrated 68.5% accuracy, with 52.5% sensitivity and 72.5% specificity. While the model showed a moderate ability to distinguish between melanoma and non-melanoma cases, its sensitivity was notably low. This means GPT-4 failed to correctly identify nearly half of the melanoma cases in the dataset, a concerning finding given that early detection is crucial for effective treatment.

When it came to detecting suspicious lesions, including atypical nevi and melanoma, GPT-4’s performance was slightly better, achieving 68.0% accuracy, 78.0% precision, and an F-measure of 70.0%. However, its classifications still differed significantly from the clinical diagnoses (P = 0.0169), indicating that GPT-4 falls short of being a reliable stand-alone diagnostic tool.
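
For readers less familiar with these metrics, the snippet below shows how accuracy, sensitivity, specificity, precision, and the F-measure are derived from a binary confusion matrix. The counts in the example are illustrative only, not the study's figures.

```python
# Standard binary-classification metrics from a confusion matrix.
# The tp/fp/tn/fn counts used at the bottom are illustrative, not the study's.

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # overall agreement with clinical labels
    sensitivity = tp / (tp + fn)                 # share of true melanomas that were flagged
    specificity = tn / (tn + fp)                 # share of non-melanomas correctly cleared
    precision = tp / (tp + fp)                   # share of flagged cases that were truly melanoma
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }

print(metrics(tp=30, fp=20, tn=60, fn=10))  # example counts only
```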

Challenges and limitations in AI-based skin cancer detection

Despite the promise of AI in medical diagnostics, the study highlights several limitations of GPT-4 in detecting melanoma and other skin lesions. One key concern is the model’s high false-positive rate, where it frequently misclassified benign or atypical lesions as melanoma. Among the 160 benign or atypical lesions in the dataset, GPT-4 labeled 38 cases as melanoma, potentially leading to unnecessary patient anxiety and clinical referrals. Conversely, some melanoma cases were misclassified as non-suspicious, raising concerns about missed diagnoses and delays in treatment.
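
As a back-of-the-envelope check on what that misclassification count implies, the proportion of non-melanoma lesions flagged as melanoma can be computed directly from the figures quoted above; this is simple arithmetic on the article's numbers, not a statistic reported by the study.

```python
# Rough check using the counts quoted above (not a study-reported figure).
benign_or_atypical = 160   # common and atypical nevi in the PH2 dataset
flagged_as_melanoma = 38   # of those, lesions GPT-4 labelled as melanoma

false_positive_proportion = flagged_as_melanoma / benign_or_atypical
print(f"{false_positive_proportion:.0%} of non-melanoma lesions flagged as melanoma")
# -> 24% of non-melanoma lesions flagged as melanoma
```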

Another limitation of GPT-4’s performance is its lack of adaptability to diverse skin types. The PH2 dataset primarily consists of Fitzpatrick skin types I to IV, with no representation of darker skin tones (Fitzpatrick types V and VI). This lack of diversity means that GPT-4’s ability to accurately diagnose skin lesions across all ethnic groups remains uncertain. Given that skin cancer often presents differently across various skin tones, AI models must be trained on inclusive datasets to ensure equitable healthcare outcomes.

Furthermore, GPT-4’s diagnostic decisions were based solely on image analysis, without considering patient history, sun exposure, genetic predisposition, or other clinical factors. In real-world dermatology, diagnosis is not made on image analysis alone but involves a holistic assessment that includes a patient’s medical background. AI models that do not incorporate contextual clinical data may struggle to match the accuracy of dermatologists.

Comparing GPT-4 with clinical standards and other AI models

To understand where GPT-4 stands in the broader AI landscape, the study compared its performance with other machine-learning models that have been tested on the same PH2 dataset. The findings revealed that GPT-4 underperformed in accuracy, sensitivity, and specificity compared to specialized AI models designed explicitly for skin cancer detection.

For instance, a machine-learning model developed by Oukil et al. achieved 99.5% accuracy, 99.2% sensitivity, and 99.6% specificity—far surpassing GPT-4’s capabilities. This superior performance was attributed to advanced segmentation techniques and tailored feature extraction methods that allowed the AI to analyze color, texture, and lesion morphology in greater detail. Similarly, other AI models, such as those by Alfred (2017) and Majumder (2018), demonstrated higher diagnostic accuracy than GPT-4.

When compared with clinicians, GPT-4 also fell short. Expert dermatologists demonstrated a sensitivity of 84.2% compared to GPT-4’s 52.5%, reinforcing that human expertise remains essential in diagnosing melanoma. While non-expert dermatologists and general practitioners showed lower accuracy than AI models in some cases, GPT-4 still did not outperform them significantly. These comparisons suggest that GPT-4 is not yet reliable enough to replace human dermatologists or specialized AI tools in clinical settings.

The future of AI in skin cancer screening

Although GPT-4 has limitations, the study acknowledges its potential as a preliminary screening tool, particularly in telehealth and underserved communities where access to dermatologists is limited. In rural areas, where there is a severe shortage of dermatologists, AI-powered image analysis could serve as an early warning system, prompting patients to seek professional evaluation if a suspicious lesion is detected.

To enhance AI’s reliability in dermatology, future improvements must focus on reducing false positives and false negatives, improving sensitivity for early-stage melanoma detection, and incorporating a broader range of skin types and lesion variations into training datasets. Additionally, integrating AI models with clinical history, genetic factors, and real-time physician input could significantly enhance diagnostic accuracy.

The study concludes that while GPT-4 should not be used as a replacement for professional medical diagnosis, continued advancements in AI algorithms have the potential to complement dermatological assessments. As AI research progresses, collaboration between AI developers, dermatologists, and regulatory bodies will be essential in ensuring that AI tools meet clinical safety standards and provide equitable healthcare benefits.

Conclusion

The assessment of GPT-4 in detecting melanoma and suspicious skin lesions underscores both the potential and current limitations of AI in dermatology. While the model demonstrates moderate accuracy, it falls short in sensitivity and specificity when compared to clinical diagnosis and specialized AI models. False positives and negatives remain major concerns, highlighting the need for further refinement before AI can be confidently used as a diagnostic aid.

As AI continues to evolve, its role in dermatology could shift toward supporting early detection efforts, expanding telemedicine capabilities, and assisting dermatologists in diagnostic decision-making. However, until AI models achieve greater reliability, clinical evaluation by a trained medical professional remains the gold standard for skin cancer detection. The study provides valuable insights into the future direction of AI-powered diagnostics, emphasizing that while AI can enhance healthcare accessibility, it is not yet a substitute for expert medical judgment.
