No hype, just data: AI for mammography proves its worth in clinics

A multi-year clinical evaluation of an artificial intelligence (AI) model for breast cancer screening has demonstrated significant improvements in diagnostic performance, stability, and integration readiness, according to a peer-reviewed study published today in Diagnostics. The research, led by a team at the Moscow Center for Diagnostics and Telemedicine, outlines a lifecycle-based testing methodology for AI-driven mammography and provides a rare example of large-scale, real-world validation.
The study, titled "Evolution of an Artificial Intelligence-Powered Application for Mammography," assessed the Celsus Mammography® AI model, a commercial-grade diagnostic tool designed to detect and classify signs of breast cancer in mammograms. Using a combination of retrospective and prospective multicenter testing phases from 2018 to 2023, researchers conducted five rounds of calibration and monitored the system across more than 593,000 patient studies. The findings show the model’s area under the curve (AUC) improved from 0.73 to 0.91, while sensitivity rose by 37.1% and accuracy by 15.6%. The AI system’s technical defect rate dropped from 9.0% to just 1.0%.
The AI model, developed by Celsus and based on a Faster R-CNN architecture with a ResNet-50 backbone, underwent a series of enhancements over nearly three years of testing. Developers responded to each testing round with updates that incorporated new image processing methods, quality-assessment modules, improved training datasets, and architecture upgrades. These iterations followed feedback from radiologists and performance reports from technical and clinical monitoring.
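For readers unfamiliar with the architecture family named in the study, the sketch below shows how a Faster R-CNN detector with a ResNet-50 backbone can be instantiated with the open-source torchvision library. It is an illustrative stand-in, not the vendor's proprietary implementation, and the two-class head ("background" vs. "suspicious finding") is an assumed labeling scheme.

```python
# Illustrative sketch only: a generic Faster R-CNN detector with a ResNet-50
# backbone, built with torchvision. The Celsus model is proprietary; this
# simply shows the architecture family named in the study.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_detector(num_classes: int = 2):
    # num_classes = background + "suspicious finding" (assumed labeling scheme)
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Replace the generic COCO classification head with a task-specific one.
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

model = build_detector()
model.eval()
with torch.no_grad():
    # A preprocessed mammogram would be loaded here; a random tensor stands in.
    dummy_image = torch.rand(3, 1024, 1024)
    predictions = model([dummy_image])  # list of dicts: boxes, labels, scores
```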
Functional testing came first: experts evaluated whether the AI met baseline operational and diagnostic requirements. Discrepancies, including missing disclaimers and inappropriate color displays, were identified and rectified before calibration testing commenced.
Calibration tests were conducted to assess diagnostic metrics such as AUC, sensitivity, specificity, and precision using an externally annotated dataset. After the first round of testing fell short of benchmarks, the AI was retrained, updated, and retested. The subsequent calibration phase yielded a 0.81 AUC, meeting the predefined threshold for clinical use. Further updates pushed the AUC to 0.91 in later model versions.
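The calibration metrics named above can be reproduced on any labeled test set with standard tooling. The snippet below is a minimal sketch using scikit-learn with synthetic labels and scores; only the 0.81 AUC acceptance threshold comes from the article, everything else is placeholder data.

```python
# Minimal sketch: computing the calibration metrics named in the study
# (AUC, sensitivity, specificity, precision) with scikit-learn.
# The labels and scores below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix, precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                        # 1 = pathology present
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, 1000), 0, 1)
y_pred = (y_score >= 0.5).astype(int)                         # operating threshold

auc = roc_auc_score(y_true, y_score)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = precision_score(y_true, y_pred)

# The study's pass criterion for clinical use was an AUC of at least 0.81.
print(f"AUC={auc:.2f} sens={sensitivity:.2f} spec={specificity:.2f} prec={precision:.2f}")
print("meets calibration threshold" if auc >= 0.81 else "requires retraining")
```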
The study’s technical monitoring phase reviewed monthly performance in clinical environments across 112 medical facilities. The AI analyzed mammograms in real time and reported pathology scores alongside visual annotations. Researchers categorized technical defects into two types: no output and incomplete image analysis. These were tracked using regression models, with a 10% threshold set for acceptable error rates. The defect rate initially spiked but steadily declined as the model matured, indicating enhanced stability and better integration with imaging hardware and data transmission protocols.
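As a rough illustration of the kind of trend tracking described, the sketch below fits a simple linear trend to hypothetical monthly defect rates and flags any month exceeding the 10% ceiling quoted above; the monthly values are invented for demonstration, not taken from the study.

```python
# Illustrative sketch: tracking a monthly technical-defect rate against the
# 10% acceptance threshold with a simple linear trend. Monthly values are invented.
import numpy as np

months = np.arange(1, 13)
defect_rate = np.array([0.09, 0.11, 0.10, 0.08, 0.07, 0.06,
                        0.05, 0.04, 0.03, 0.02, 0.015, 0.01])

# Least-squares linear trend: defect_rate ~ slope * month + intercept
slope, intercept = np.polyfit(months, defect_rate, deg=1)
print(f"trend: {slope:+.4f} per month (negative means improving stability)")

for m, r in zip(months, defect_rate):
    if r > 0.10:
        print(f"month {m}: defect rate {r:.1%} exceeds the 10% threshold")
```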
Clinical monitoring, conducted in parallel, relied on expert radiologists to score the AI’s localization and interpretation of abnormalities. Scores improved over time, reflecting more accurate annotations and report generation. The AI’s ability to correctly classify and localize malignant and benign findings increased as the system underwent two major architectural updates and introduced new diagnostic categories such as fibrocystic changes and skin thickening.
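The article does not specify how localization was scored. One common way to quantify whether an AI-drawn box matches an expert-marked lesion is intersection-over-union (IoU), sketched below as a generic, assumed measure rather than the study's actual scoring rubric.

```python
# Generic sketch: intersection-over-union (IoU) between an AI-predicted box and
# an expert-annotated box, a common proxy for localization quality. The scoring
# rubric used in the study is not specified in the article; this is illustrative.
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Example: AI-predicted box vs. radiologist-annotated box for the same finding.
print(f"IoU = {iou((100, 120, 220, 260), (110, 130, 230, 255)):.2f}")
```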
Radiologist feedback was a crucial component of the testing loop. Over 8,400 user feedback points were collected through integrated interfaces in radiology workstations. Radiologists agreed with AI outputs in 79.2% of cases, indicating strong alignment between human and machine interpretation, while the remaining 20.8% of cases highlighted areas for further refinement.
The developer, Celsus, responded to performance reviews by implementing four major updates, including two involving full architectural revisions. Enhancements included switching from standard RoI Pooling to Precise RoI Pooling, moving the backbone to ResNeSt and later adopting the D2Det detection architecture, and integrating PGMI-based (perfect, good, moderate, inadequate) image-quality classification.
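PGMI is the standard mammography positioning-quality scale. The sketch below shows one plausible way a quality-assessment module could gate studies before the detector runs; the class names, data structure, and routing policy are assumptions for illustration, not the vendor's implementation.

```python
# Hypothetical sketch of a PGMI-based quality gate in front of the detector.
# The enum values follow the standard PGMI scale; the routing policy is assumed.
from dataclasses import dataclass
from enum import Enum

class PGMI(Enum):
    PERFECT = "P"
    GOOD = "G"
    MODERATE = "M"
    INADEQUATE = "I"

@dataclass
class Study:
    study_id: str
    pgmi: PGMI

def route(study: Study) -> str:
    """Decide whether a study proceeds to AI analysis or is flagged for re-acquisition."""
    if study.pgmi is PGMI.INADEQUATE:
        return "flag for repeat imaging; skip AI analysis"
    if study.pgmi is PGMI.MODERATE:
        return "analyze, but attach an image-quality caveat to the report"
    return "analyze normally"

print(route(Study("MG-0001", PGMI.MODERATE)))
```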
In January 2023, following the final calibration test (version 0.18.0), the AI model achieved full clinical adoption under Moscow’s compulsory health insurance program. It became the core of a reimbursable service titled “Description and Interpretation of Mammography Using Artificial Intelligence.” In its latest evaluation, the model delivered an AUC of 0.91, accuracy of 0.89, sensitivity of 0.85, and specificity of 0.93—placing it among the top-performing models globally.
While the study marks a significant step in real-world AI deployment, the authors acknowledge limitations. The diagnostic accuracy was based on a relatively small calibration dataset, and testing was geographically limited to Moscow. Additionally, the study did not register patient-level clinical outcomes such as recall rates or interval cancers.
Nevertheless, the authors present the lifecycle-based methodology as a significant advance in AI validation. Unlike the prevailing “build-and-freeze” model, where software is locked after initial approval, this approach enables continuous feedback, version control, and clinical re-evaluation. The framework aligns with emerging standards such as the Radiology AI Safety (RAISE) initiative and the Checklist for Artificial Intelligence in Medical Imaging (CLAIM).
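To make the contrast with "build-and-freeze" concrete, the sketch below models a minimal lifecycle loop in which each deployed version is monitored against acceptance thresholds and either retained or sent back for retraining. The thresholds and the final figures come from the article; the data structure and decision policy are illustrative assumptions, not the study's formal methodology.

```python
# Illustrative sketch of a lifecycle-based validation loop, as opposed to a
# "build-and-freeze" release. Fields and decision policy are assumptions.
from dataclasses import dataclass

@dataclass
class MonitoringReport:
    version: str
    auc: float          # from periodic calibration testing
    defect_rate: float  # from technical monitoring

AUC_FLOOR = 0.81        # acceptance threshold quoted in the article
DEFECT_CEILING = 0.10   # 10% technical-defect ceiling quoted in the article

def evaluate(report: MonitoringReport) -> str:
    """Keep the deployed version only while it meets both thresholds."""
    if report.auc < AUC_FLOOR or report.defect_rate > DEFECT_CEILING:
        return f"{report.version}: out of tolerance -> retrain, recalibrate, re-release"
    return f"{report.version}: within tolerance -> remain in clinical use"

history = [
    MonitoringReport("initial release", auc=0.73, defect_rate=0.09),  # figures from the article
    MonitoringReport("0.18.0", auc=0.91, defect_rate=0.01),           # figures from the article
]
for report in history:
    print(evaluate(report))
```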
The authors recommend expanding the methodology to other diagnostic imaging applications and incorporating outcome-based clinical validation in future studies. They also call for standardized technical requirements, broader dataset diversity, and active participation from radiologists and policymakers in shaping AI oversight frameworks.
FIRST PUBLISHED IN: Devdiscourse