AI lacks clinical readiness despite strong performance claims

A new peer-reviewed evaluation of artificial intelligence performance in oncology has found that even the most advanced large language model systems struggle to deliver treatment recommendations that are safe, current and clinically actionable. The findings point to meaningful progress in AI decision support but also clear limitations that keep the technology from functioning as a dependable tool for real-world cancer care.

The study, titled AI for Evidence-Based Treatment Recommendation in Oncology: A Blinded Evaluation of Large Language Models and Agentic Workflows, was published in Frontiers in Artificial Intelligence. It compares five leading AI systems across 50 multiple myeloma clinical scenarios, covering diagnosis, staging, therapeutic planning and complex case management. The evaluation was conducted by three hematologist-oncologists who scored anonymized responses for accuracy, relevance, thoroughness, presence of fabricated information and readiness for immediate clinical use.

The results reveal a widening performance gap between general-purpose language models and specialized medical AI systems. They also show that none of the tested models, including the top performer, met the threshold for safe implementation without expert review.

Agentic workflow AI leads performance but still needs human oversight

The analysis tested three general-purpose language models (OpenAI's o1-preview, Claude 3.5 Sonnet and Gemini 1.5 Pro) alongside a retrieval-augmented system called Myelo and a more advanced evidence-synthesis platform named HopeAI, which uses agentic workflows to chain multiple reasoning steps and external data resources.
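For readers unfamiliar with the term, an agentic workflow chains retrieval, drafting and self-critique steps rather than answering in a single pass. The sketch below illustrates that general pattern in Python; every function and data structure in it is a hypothetical placeholder, not a description of HopeAI's actual architecture.

```python
# Hypothetical sketch of an agentic evidence-synthesis loop. None of
# these components describe HopeAI's real implementation; they only
# illustrate the pattern of chaining retrieval, drafting and
# self-critique steps before a recommendation is returned.
from dataclasses import dataclass

@dataclass
class CaseScenario:
    summary: str      # free-text clinical vignette
    prior_lines: int  # number of prior therapy lines

def retrieve_evidence(query: str) -> list[str]:
    """Placeholder: fetch guideline or label snippets from an external store."""
    return [f"evidence snippet for: {query}"]

def draft_recommendation(case: CaseScenario, evidence: list[str]) -> str:
    """Placeholder: an LLM call that drafts a plan from the gathered evidence."""
    return f"Draft plan ({len(evidence)} snippets, {case.prior_lines} prior lines)"

def critique(draft: str, evidence: list[str]) -> list[str]:
    """Placeholder: a second reasoning pass that flags unsupported claims."""
    return []  # an empty list means the critique pass found no issues

def agentic_recommend(case: CaseScenario, max_rounds: int = 3) -> str:
    """Iterate retrieve -> draft -> critique until the critique pass is clean."""
    evidence = retrieve_evidence(case.summary)
    draft = draft_recommendation(case, evidence)
    for _ in range(max_rounds):
        issues = critique(draft, evidence)
        if not issues:
            break
        # Pull extra evidence targeted at each flagged issue, then redraft.
        for issue in issues:
            evidence += retrieve_evidence(issue)
        draft = draft_recommendation(case, evidence)
    return draft

print(agentic_recommend(CaseScenario("relapsed myeloma after BCMA therapy", 4)))
```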

Across all core metrics, HopeAI was the strongest performer. It achieved 82 percent accuracy, 85.3 percent relevance, and 74 percent comprehensiveness, significantly outperforming every other system. It also provided the highest share of responses rated ready for clinical use at 25.3 percent, though the authors stress that even this level is too low for deployment in a medical setting.

In contrast, general-purpose language models were far less consistent. OpenAI’s o1-preview ranked second but trailed HopeAI by a wide margin, reaching 64.7 percent accuracy, 57.3 percent relevance, and 36 percent comprehensiveness. Claude 3.5 Sonnet and Gemini 1.5 Pro scored lower still, both falling below 52 percent accuracy and offering the least clinical depth among the systems evaluated.

Myelo, the retrieval-augmented model developed for multiple myeloma support, performed better than general-purpose LLMs in certain areas but still fell short of the agentic approach. Its strengths included clearer standard-protocol guidance, but it often missed updated regulatory information and occasionally recommended treatments that were no longer approved.

Despite low hallucination rates, between 3 and 10 percent across all systems, safety concerns remained. The researchers found repeated instances of incorrect treatment-line recommendations, omission of newly approved therapies, and use of outdated or withdrawn drugs, all of which pose real clinical risk.

The findings signal that AI is steadily improving in oncology decision support, but human review remains indispensable.

Clinical evaluators flag errors in treatment sequencing, evidence integration and regulatory alignment

The study’s qualitative analysis revealed the most persistent weaknesses of current AI systems in supporting clinical decision-making.

One of the most serious issues was the omission of newly approved therapies, particularly in scenarios involving relapse after BCMA-targeted treatments. Talquetamab, a GPRC5D-targeted agent now recommended in several such cases, was frequently left out by nearly all systems. This gap shows that real-world AI applications may lag behind fast-moving oncology approvals unless continuously updated.

Another recurring concern was misalignment with FDA treatment-line approvals, a key requirement for safe and compliant oncology care. Multiple systems prematurely recommended Teclistamab and Elranatamab for earlier lines of therapy despite approvals limited to patients with at least four prior treatments. Similarly, several systems suggested the CAR-T therapy Ide-cel for patients with too few previous treatment lines, contradicting current regulatory standards.

The researchers also documented inappropriate use of withdrawn or unapproved drugs. Belantamab mafodotin, removed from the US market, appeared in treatment recommendations from several models, and melflufen was incorrectly advised despite its withdrawal. Some systems also mentioned regimen options still awaiting FDA clearance, underscoring the ongoing difficulty AI tools face in parsing regulatory nuance.
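Findings like these suggest an obvious mitigation: a rule-based guardrail that audits model output against a maintained table of label requirements and market status before a recommendation reaches a clinician. The sketch below is a minimal illustration, not a tool from the study. The four-prior-line thresholds for teclistamab and elranatamab and the withdrawn status of belantamab mafodotin and melflufen come from the findings above; the ide-cel threshold is a placeholder, so any real table would need to track current FDA labels.

```python
# Minimal regulatory guardrail sketch, not a tool from the study. The
# >=4 prior-line requirement for teclistamab and elranatamab and the
# withdrawn status of belantamab mafodotin and melflufen reflect the
# findings above; the ide-cel threshold is a placeholder, not a label
# citation, so verify every entry against the current FDA label.
MIN_PRIOR_LINES = {
    "teclistamab": 4,
    "elranatamab": 4,
    "ide-cel": 4,  # placeholder value
}
WITHDRAWN_IN_US = {"belantamab mafodotin", "melflufen"}

def audit_regimen(drugs: list[str], prior_lines: int) -> list[str]:
    """Return human-readable flags for a proposed regimen."""
    flags = []
    for drug in (d.lower() for d in drugs):
        if drug in WITHDRAWN_IN_US:
            flags.append(f"{drug}: withdrawn from the US market")
        required = MIN_PRIOR_LINES.get(drug)
        if required is not None and prior_lines < required:
            flags.append(
                f"{drug}: approved only after >={required} prior lines; "
                f"patient has {prior_lines}"
            )
    return flags

# Example: a model proposes teclistamab plus melflufen after two prior lines.
for flag in audit_regimen(["Teclistamab", "Melflufen"], prior_lines=2):
    print(flag)
```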

Treatment sequencing also posed challenges. In scenarios requiring awareness of cross-resistance, alternative targeting strategies or the need for bridging therapy before CAR-T procedures, most systems failed to address crucial clinical considerations. HopeAI performed better than others but still missed sequence-dependent nuance in some cases.

The pattern suggests that while AI can summarize information, understanding multistep clinical reasoning remains difficult. Oncology decisions often depend on synthesizing regulatory status, patient history, mechanism of action, and the timing of prior therapies, a level of reasoning that goes beyond factual accuracy alone.

Strong progress, but clear limits for medical AI without human expertise

Although the evaluation acknowledges the progress made by newer AI systems, it reinforces that none are prepared for unsupervised use in oncology. The highest readiness score, HopeAI’s 25.3 percent, means that three out of four recommendations still required significant correction before clinicians could rely on them.

Even more concerning, evaluator agreement varied sharply across models. HopeAI produced consistently strong content but showed more rater disagreement because scores were clustered near the top of the scale, magnifying small differences in judgment. In contrast, Myelo showed the highest evaluator consistency, while the general-purpose models exhibited broader variability. These patterns reveal that even strong performance can mask subtle inconsistencies requiring careful oversight.
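The study's agreement statistic is not quoted here, but the clustering effect it describes is a well-known property of chance-corrected measures such as Cohen's kappa: when nearly all scores sit at one end of the scale, the same raw agreement yields a lower kappa. The synthetic example below, using scikit-learn's cohen_kappa_score, illustrates that effect; the ratings are invented, not the study's data.

```python
# Synthetic illustration of why chance-corrected agreement drops when
# scores cluster at one end of the scale; these are invented ratings,
# not the study's data. Requires scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Case A: two raters, scores spread across a 1-5 scale, 45/50 identical.
a1 = [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10 + [5] * 10
a2 = a1[:45] + [3] * 5

# Case B: same raw agreement (45/50), but scores cluster at the top.
b1 = [5] * 45 + [4] * 5
b2 = [5] * 40 + [4] * 10

print(f"spread-out scores: kappa = {cohen_kappa_score(a1, a2):.2f}")  # ~0.88
print(f"top-clustered:     kappa = {cohen_kappa_score(b1, b2):.2f}")  # ~0.62
```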

The ranking portion of the study further reinforced this divide. When experts ranked all systems' responses across the 50 scenarios, HopeAI took first place in 66.7 percent of evaluations. Myelo ranked second at 26.7 percent, while o1-preview, Claude 3.5 Sonnet and Gemini 1.5 Pro lagged far behind. Statistical testing confirmed that these differences were significant.
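The article does not name the test used. For this design, multiple systems ranked over the same 50 scenarios, the Friedman test is one standard nonparametric choice; the sketch below runs it on synthetic ranks invented only to mimic the reported ordering.

```python
# Synthetic per-scenario ranks (1 = best) invented to mimic the reported
# ordering; the paper does not publish raw ranks and may have used a
# different test. Requires NumPy and SciPy.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
n = 50  # scenarios, one rank per system per scenario

hopeai = rng.choice([1, 1, 1, 2], size=n)
myelo  = rng.choice([1, 2, 2, 3], size=n)
o1     = rng.choice([2, 3, 3, 4], size=n)
claude = rng.choice([3, 4, 4, 5], size=n)
gemini = rng.choice([4, 4, 5, 5], size=n)

stat, p = friedmanchisquare(hopeai, myelo, o1, claude, gemini)
print(f"Friedman chi-square = {stat:.1f}, p = {p:.2g}")
```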

The authors note that advances in architecture have improved general-purpose models’ reasoning skills, but specialization matters. Retrieval methods alone did not adequately close the performance gap, and deep medical decision support requires more than surface-level knowledge. It requires structured reasoning and continuous updates that match the pace of oncology research and regulatory change.

The study also highlights the need to evaluate AI systems under realistic conditions. By using a blinded setup with standardized scoring criteria, and by restricting all evidence to a shared cutoff date of June 2024, the researchers aimed to reduce variation and ensure fair comparison. The methodology produced 4,500 individual evaluations, making it one of the largest clinician-blinded assessments of medical AI to date.
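The 4,500 figure decomposes cleanly, although the article does not spell out the breakdown. One plausible reading, treated here as an assumption, is that each of the 750 scored responses (50 scenarios times 5 systems times 3 evaluators) was rated on six separate dimensions:

```python
# One plausible decomposition of the 4,500 figure; the six-dimension
# count is inferred from the total, not stated in the article.
scenarios, systems, evaluators = 50, 5, 3
responses = scenarios * systems * evaluators  # 750 scored responses
dimensions_per_response = 6                   # assumed
print(responses * dimensions_per_response)    # 4500
```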

Overall, the authors conclude that AI should remain a decision-support tool, not a substitute for clinical judgment. They stress that safe integration into oncology will require advances in dynamic evidence updating, domain-specific reasoning and rigorous external validation. Future studies could expand beyond multiple myeloma to assess generalizability across cancer types or include multidisciplinary reviewers to evaluate broader clinical utility.

First published in: Devdiscourse