AI model shows disparities across race, gender and age in medical imaging


CO-EDP, VisionRI | Updated: 28-03-2025 10:09 IST | Created: 28-03-2025 10:09 IST

A new study has found that state-of-the-art vision-language artificial intelligence (AI) models used for diagnosing chest X-rays significantly underdiagnose patients from historically marginalized groups, such as women, Black individuals, and particularly Black women, when compared to human radiologists. The findings raise urgent concerns about algorithmic fairness in clinical AI and the risk of amplifying existing healthcare disparities through biased automation.

The research, published in Science Advances, evaluated the CheXzero vision-language foundation model, a self-supervised system capable of diagnosing a wide range of radiographic pathologies without explicit training labels. The study tested the model’s diagnostic performance and fairness across five large, globally sourced chest X-ray datasets: MIMIC (Boston), CheXpert (Stanford), NIH (Bethesda), PadChest (Spain), and VinDr (Vietnam), together encompassing over 850,000 images and nearly 200,000 patients. While the model achieved expert-level diagnostic accuracy overall, researchers documented striking disparities in false negative rates, particularly for younger patients, women, and racial minorities.

Notably, the AI model was significantly more likely to underdiagnose disease in Black female patients compared to white male patients. In the CheXpert dataset, the disparity in false negative rates for the condition "enlarged cardiomediastinum" was over 20 percent when comparing these intersectional subgroups. For the "no finding" label, a classification that suggests no visible disease, the model exhibited up to 20% higher false positive rates in elderly female patients, suggesting overdiagnosis and potential for unnecessary intervention.
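
A minimal sketch of the kind of subgroup audit behind these numbers is shown below: it computes the difference in false negative rates between two intersectional subgroups from a table of per-image predictions. The DataFrame columns ('y_true', 'y_pred', 'race', 'sex') and the grouping scheme are hypothetical, illustrating the metric rather than reproducing the study's code.

```python
# Hypothetical subgroup audit: false negative rate (FNR) gap between two
# intersectional groups, e.g. Black female vs. white male patients.
import pandas as pd


def false_negative_rate(df: pd.DataFrame) -> float:
    """FNR = share of true positives the model misses."""
    positives = df[df["y_true"] == 1]
    if positives.empty:
        return float("nan")
    return float((positives["y_pred"] == 0).mean())


def fnr_gap(df: pd.DataFrame, group_a: dict, group_b: dict) -> float:
    """FNR(group_a) - FNR(group_b), each group given as {column: value}."""
    def subset(spec: dict) -> pd.DataFrame:
        mask = pd.Series(True, index=df.index)
        for col, val in spec.items():
            mask &= df[col] == val
        return df[mask]
    return false_negative_rate(subset(group_a)) - false_negative_rate(subset(group_b))


# Example: the intersectional comparison highlighted in the article.
# gap = fnr_gap(preds, {"race": "Black", "sex": "F"}, {"race": "White", "sex": "M"})
```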

While the model outperformed or matched radiologists in aggregate measures like area under the curve (AUC) for detecting conditions such as pleural effusion or lung opacity, its subgroup performance fell short. Researchers benchmarked its fairness performance directly against board-certified radiologists who reviewed the same images and found that the AI demonstrated consistently larger fairness gaps across race, sex, age, and their combinations.

The disparity wasn’t limited to specific pathologies. Using the PadChest dataset from Spain, which includes 174 radiographic findings, the study showed that out of 48 pathologies with sufficient data, 31 exhibited gender-based fairness gaps larger than 5%. Age-based disparities were even more pronounced: 45 of the 48 findings showed a fairness gap greater than 20% when comparing patients aged 18-40 with those over 80. For instance, in identifying tracheostomy tubes, the gap between the two age groups reached 100%.
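
This per-pathology sweep can be sketched in the same spirit: loop over findings, compute a subgroup gap for each, and flag those above a threshold. The accuracy-based gap and the column names below are illustrative assumptions, not necessarily the study's exact metric or code.

```python
# Hypothetical per-pathology fairness sweep: flag findings whose accuracy gap
# between two subgroups (e.g. age 18-40 vs. over 80) exceeds a threshold.
import pandas as pd


def accuracy(df: pd.DataFrame) -> float:
    return float((df["y_true"] == df["y_pred"]).mean()) if len(df) else float("nan")


def findings_with_large_gap(df: pd.DataFrame, group_col: str,
                            group_a: str, group_b: str, threshold: float) -> list:
    """Return pathology labels whose subgroup accuracy gap exceeds `threshold`."""
    flagged = []
    for label, sub in df.groupby("pathology"):
        gap = abs(accuracy(sub[sub[group_col] == group_a]) -
                  accuracy(sub[sub[group_col] == group_b]))
        if gap > threshold:  # NaN gaps (too little data) are skipped automatically
            flagged.append(label)
    return flagged


# Example: findings with an age-based gap above 20 percentage points.
# findings_with_large_gap(preds, "age_group", "18-40", "80+", 0.20)
```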

The root of the bias appears to lie in how the AI model processes demographic information. By fitting logistic regression classifiers to the model’s internal representations, researchers found that it strongly encodes sensitive attributes such as race, sex, and age; these attributes could be predicted from its features far more accurately than human radiologists could infer them from the same images. For instance, probes on the model’s representations predicted a patient’s sex with an AUC of 0.92 and age group with an AUC of 0.94, while radiologists performed far worse on the same task. This reveals that, despite being trained without explicit demographic labels, the model has implicitly learned to extract and incorporate this information, which may contribute to biased clinical inferences.
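
The probing analysis can be pictured with a simple linear probe: freeze the model, extract image embeddings, and fit a logistic regression to predict a demographic attribute, reporting held-out AUC. The embedding extraction step and variable names below are assumptions for illustration; only the general linear-probe recipe is taken from the article's description.

```python
# Hypothetical linear probe: how well can a demographic attribute be read out
# of frozen image embeddings produced by the vision encoder?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def probe_attribute(embeddings: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """Fit a logistic regression on embeddings and return held-out ROC-AUC."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed, stratify=labels)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    scores = probe.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)


# An AUC near 0.9 would indicate the attribute is strongly encoded, in line
# with the 0.92 (sex) figure reported in the article.
# auc_sex = probe_attribute(image_embeddings, (sex_labels == "F").astype(int))
```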

This encoding capacity suggests that the model might be using demographic signals as proxies or shortcuts during diagnosis, a known risk in deep learning systems. Although such proxies may correlate with disease prevalence in some populations, relying on them can lead to discriminatory outcomes when demographic characteristics do not causally influence disease presentation. This raises serious questions about how foundation models should be trained and evaluated for clinical use, especially when deployed in diverse populations.

The authors emphasize that their findings are consistent across all five datasets, including MIMIC and CheXpert in the U.S., PadChest in Europe, and VinDr in Asia, highlighting that these disparities are not confined to a particular region or dataset. They also compared CheXzero to a fully supervised DenseNet model trained with explicit labels and found that both models exhibited bias, though the nature and magnitude of the gaps varied by task and dataset.

To mitigate the bias, the researchers experimented with fairness interventions by including demographic information in the model’s textual prompts, such as “Does this female patient have pneumonia?” While this approach reduced fairness gaps for some conditions, it did not eliminate them and in some cases had little to no effect. The team concludes that deeper and more principled fairness strategies are needed, ones that go beyond input manipulation and target the structural learning biases of the model.
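
A rough sketch of that prompt-level intervention, written against a generic CLIP-style interface with paired image and text encoders, is shown below. The encode_image/encode_text calls, prompt templates, and softmax scoring are assumptions about how such a zero-shot classifier is typically driven, not the study's released code.

```python
# Hypothetical zero-shot scoring with an optional demographic qualifier in the
# text prompt, contrasting a positive and a negative phrasing of the finding.
import torch


def zero_shot_probability(model, tokenizer, image: torch.Tensor,
                          condition: str, demographic: str = "") -> float:
    """Probability mass assigned to the 'has condition' prompt for one image."""
    qualifier = f"{demographic} " if demographic else ""
    prompts = [f"{qualifier}patient with {condition}",
               f"{qualifier}patient with no {condition}"]
    with torch.no_grad():
        text_emb = model.encode_text(tokenizer(prompts))
        img_emb = model.encode_image(image.unsqueeze(0))
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        probs = (img_emb @ text_emb.T).softmax(dim=-1)
    return probs[0, 0].item()


# Comparing the plain and demographic-aware scores per image, and recomputing
# subgroup gaps for each, is one way to check whether the intervention helps.
# p_plain = zero_shot_probability(model, tokenize, xray, "pneumonia")
# p_aware = zero_shot_probability(model, tokenize, xray, "pneumonia", "female")
```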

The implications are significant. Foundation models like CheXzero are being hailed as the future of scalable, annotation-free medical AI, capable of diagnosing dozens of diseases with minimal human oversight. But if their performance is not equitable across demographic groups, their deployment risks entrenching health inequities rather than alleviating them. In the context of the White House Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, this study underscores the urgent need for regulatory frameworks that assess fairness alongside accuracy in medical AI.

Finally, the study calls for standardized fairness evaluations, routine bias audits before deployment, and collaboration among developers, clinicians, and ethicists to ensure that AI systems in healthcare promote equity rather than deepen disparities. It also suggests that human-AI collaboration, in which physicians understand and monitor AI model behavior, may be a better path forward than full automation in critical diagnostic workflows.

FIRST PUBLISHED IN: Devdiscourse