ChatGPT more accurate and reliable than Gemini for bladder cancer information

CO-EDP, VisionRI | Updated: 22-04-2025 18:04 IST | Created: 22-04-2025 18:04 IST

A new comparative study has thrown light on the capabilities and shortcomings of large language models (LLMs) in delivering critical cancer-related health information to the public. Published in the Société Internationale d’Urologie Journal, the study “ChatGPT vs. Gemini: Which Provides Better Information on Bladder Cancer?” investigates how three AI chatbots - ChatGPT-3.5, ChatGPT-4, and Google’s Gemini - respond to frequently asked patient questions. The study, conducted by Saudi researchers affiliated with King Saud bin Abdulaziz University and the Ministry of National Guard Health Affairs, evaluates these LLMs on accuracy, comprehensiveness, readability, and answer stability, revealing where each model excels or falls short.

In a digital age where patients increasingly turn to AI-powered platforms for medical advice, the reliability of these responses has become a critical public health concern. The study's findings underscore a high level of accuracy across all three models, but they also point to significant disparities in how comprehensive and accessible these responses are, particularly for patients trying to navigate complex conditions such as bladder cancer. With the incidence of bladder cancer rising sharply in regions like Saudi Arabia, the study provides timely insights into how LLMs can either support or complicate patient education and decision-making.

How accurate and comprehensive are LLMs when addressing cancer-related queries?

The study assessed 53 patient-centered questions categorized into general information, diagnosis, treatment, and prevention. Accuracy was measured using a 3-point Likert scale and reviewed by board-certified urologists referencing international urological guidelines. ChatGPT-3.5 and ChatGPT-4 each achieved a 92.5% rate of correct answers, while Gemini scored slightly lower at 86.3%. Statistically, however, these differences were not considered significant across any category, including treatment, which accounted for more than half the questions.

Where the models did diverge meaningfully was in comprehensiveness. ChatGPT-4 led with 83% of its answers rated as comprehensive or very comprehensive, followed by ChatGPT-3.5 at 75.4% and Gemini at 68.6%. These differences reached statistical significance, particularly in the treatment category, where Gemini lagged behind at just 57.7%. The research found that Gemini’s responses were more prone to omitting relevant context or providing generalized summaries that lacked specific guidance - a key factor that could reduce utility for patients seeking nuanced information about cancer therapies.

Despite the encouraging accuracy figures, the study emphasized that all models occasionally provided misleading or incomplete advice. A particularly concerning example involved a question about monitoring treatment efficacy for bladder cancer. All three LLMs incorrectly emphasized tests like bloodwork and imaging, overlooking more accurate clinical markers used in guideline-based practice. These lapses, though not frequent, underscore the continued need for human oversight, especially in high-stakes medical contexts.

Are these AI-generated responses accessible to the average patient?

Beyond content quality, readability plays a pivotal role in effective health communication. The study used the Flesch Reading Ease (FRE) Score and Flesch–Kincaid Grade Level metrics to assess the accessibility of chatbot responses. Gemini consistently produced the most readable answers, with a median FRE score of 54.3, suggesting moderate ease of reading. In contrast, ChatGPT-3.5 and ChatGPT-4 delivered more complex outputs, with scores of 43.4 and 40.3 respectively - levels typically classified as “fairly difficult” and suited to college-educated readers.
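
For readers unfamiliar with these metrics, both are simple functions of average sentence length and syllables per word. The sketch below applies the standard published formulas with a naive vowel-group syllable counter; it is an illustration only, not the scoring tool used in the study.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels (sufficient for illustration).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)            # average words per sentence
    spw = syllables / len(words)                 # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw     # Flesch Reading Ease (higher = easier)
    fkgl = 0.39 * wps + 11.8 * spw - 15.59       # Flesch-Kincaid Grade Level
    return fre, fkgl

# A score near 55 reads as "fairly difficult"; scores in the low 40s demand
# college-level reading skills, mirroring the gap reported between Gemini and ChatGPT.
print(readability("Bladder cancer treatment depends on the stage of the disease. "
                  "Your care team will explain the options available to you."))
```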

This trend was confirmed in educational grade-level breakdowns. For general, diagnosis, and treatment questions, more than 70% of ChatGPT-4’s responses were written at college level or higher, while Gemini only reached that standard in about one-third of cases. Although more readable, Gemini’s content was often less comprehensive, highlighting a trade-off between simplicity and depth.

The implications are significant. Patients with limited health literacy may be better served by Gemini’s accessible language, but may also risk missing crucial medical nuances. Conversely, while ChatGPT-4 offers deeper medical insight, its academic tone may alienate or confuse non-specialist readers. This readability gap illustrates one of the core challenges in deploying LLMs in healthcare: striking a balance between detail and clarity to serve a broad patient base.

Can AI chatbot responses be trusted to remain consistent over time?

Stability, the ability of a model to generate consistent responses to the same query, was also tested in the study. Ten questions were asked multiple times, with the chat history cleared between sessions to simulate independent queries. The findings showed minimal differences in output consistency. ChatGPT-3.5 and ChatGPT-4 each had a 90% consistency rate, while Gemini scored slightly lower at 80%, a difference that was not statistically significant.
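
As a rough illustration of how such a repeat-query check can be reproduced, the sketch below collects several independent answers to the same question, assuming the OpenAI Python SDK; the study's actual tooling is not described, and its consistency ratings were made by human reviewers rather than automatically.

```python
# Minimal sketch of a repeat-query stability probe (illustrative assumptions:
# the OpenAI Python SDK and the "gpt-4" model; the study's own setup is not specified).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "How is treatment response monitored in bladder cancer?"
N_RUNS = 3

answers = []
for _ in range(N_RUNS):
    # Each call starts a brand-new conversation, mirroring the cleared-chat-history setup.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": QUESTION}],
    )
    answers.append(reply.choices[0].message.content)

# The study judged consistency by expert review; here the answers are simply printed
# side by side for manual comparison.
for run, text in enumerate(answers, start=1):
    print(f"--- Run {run} ---\n{text}\n")
```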

In sub-categories like treatment and prevention, all three LLMs produced stable results, but inconsistencies were noted in diagnosis-related responses. For example, Gemini showed a 66.7% inconsistency rate in this domain. These fluctuations suggest that while LLMs have improved in generating repeatable content, reliability may still vary depending on the subject matter and model architecture.

While the small sample size in the stability test limits the generalizability of this result, the data aligns with broader concerns about reproducibility in AI outputs. Given the dynamic nature of language models, which continuously update and retrain on new datasets, ensuring stability is essential to building user trust, particularly in healthcare scenarios where patients may reference the same chatbot repeatedly during a treatment cycle.

The study stresses that AI chatbots, while promising, should be seen as supplements to, not replacements for, professional medical advice. Their use in bladder cancer education holds value, especially in regions where access to urologists may be limited. However, their limitations in contextual understanding, guideline fidelity, and patient-specific nuance must be addressed through ongoing research, transparency in algorithmic training, and collaboration with medical experts.

FIRST PUBLISHED IN: Devdiscourse