Medical AI at risk without accountability and oversight
The rapid deployment of large language models (LLMs) in healthcare has sparked both hope and alarm, as their integration into medical chatbots reshapes patient and clinician experiences. A new review titled “Large Language Models in Medical Chatbots: Opportunities, Challenges, and the Need to Address AI Risks,” published in the journal Information (2025), provides a sweeping examination of the promise and perils of this transformation.
The study underscores the urgent need for robust governance as these AI tools gain traction in clinical settings. Let's delve into the details!
What are the primary applications of LLMs in medical chatbots?
The study categorizes applications of LLMs into three core segments: patient-facing roles, clinician-facing functions, and rare disease diagnosis support.
Patient-facing applications include symptom checkers, health information dissemination, and mental health tools. Unlike their rule-based predecessors, LLM-enabled chatbots dynamically interpret natural language inputs and offer context-aware responses. This allows for more nuanced dialogue, personalized advice, and multilingual engagement. Systems like Woebot and Wysa already demonstrate how LLMs can facilitate cognitive behavioral support. They can also explain complex test results and conditions in lay language, improving accessibility and shared decision-making.
Clinician-facing use cases focus on reducing administrative burden and enhancing diagnostic support. LLMs assist in drafting clinical documentation, coding, and summarizing notes from electronic health records. These capabilities extend to medical education, where they serve as interactive tools for learning terminology, procedures, and pathophysiology. The study cites real-world cases, such as Mayo Clinic's deployment of LLM-powered triage chatbots, which reportedly improved call center efficiency and reduced clinician workload.
Rare disease diagnosis represents another promising avenue. Models like GPT-4 and BioGPT have demonstrated the ability to generate plausible differential diagnoses for low-prevalence conditions, leveraging niche datasets and orphan registries. This could be especially useful for general practitioners who lack deep familiarity with rare disorders.
What risks and limitations accompany LLMs in healthcare?
Despite their potential, the study raises serious concerns about the safety, fairness, and reliability of LLM-based chatbots in medicine. Three critical issues dominate the discourse:
Hallucination and factual inaccuracy are primary risks. LLMs are prone to generating plausible yet incorrect medical advice, which could mislead patients or clinicians. These hallucinations are especially dangerous in high-stakes environments like triage and mental health support. The study recommends embedding fact-checking modules and linking outputs to verified medical knowledge bases.
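To illustrate what such a grounding layer might look like, here is a minimal sketch in Python: each sentence of a chatbot reply is checked against a small, vetted knowledge base, and unsupported sentences are flagged for human review. The knowledge base, the word-overlap metric, and the threshold are illustrative placeholders, not the review's prescribed implementation.

```python
# Minimal sketch of a post-hoc grounding check: each sentence in a chatbot
# reply is compared against a small, vetted knowledge base, and sentences
# without sufficient support are flagged for human review.
import re

VERIFIED_KNOWLEDGE_BASE = [  # hypothetical vetted statements
    "adults should seek emergency care for chest pain lasting more than a few minutes",
    "ibuprofen can irritate the stomach and should be taken with food",
]

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def support_score(claim: str, reference: str) -> float:
    """Jaccard word overlap between a claim and a reference statement."""
    a, b = tokenize(claim), tokenize(reference)
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_unsupported(reply: str, threshold: float = 0.3) -> list[str]:
    """Return sentences from the reply that no reference statement supports."""
    sentences = [s.strip() for s in re.split(r"[.!?]", reply) if s.strip()]
    return [
        s for s in sentences
        if max(support_score(s, ref) for ref in VERIFIED_KNOWLEDGE_BASE) < threshold
    ]

if __name__ == "__main__":
    reply = ("Ibuprofen can irritate the stomach, so take it with food. "
             "Chest pain always resolves on its own without treatment.")
    for sentence in flag_unsupported(reply):
        print("Needs review:", sentence)
```

A production system would use retrieval over a curated medical corpus rather than keyword overlap, but the control flow, generate, check against verified sources, escalate to a human, is the same.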
Bias and fairness concerns are also significant. LLMs trained on broad datasets may reproduce social, racial, and gender biases. The report references empirical studies showing disparities in diagnostic accuracy for racial minorities and gender-diverse individuals. Fairness-aware training, stratified evaluation metrics, and diverse dataset curation are proposed mitigation strategies.
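The stratified evaluation the study proposes can be illustrated with a short sketch: instead of reporting one aggregate accuracy, diagnostic performance is broken out per demographic group so disparities become visible. The records and group labels below are synthetic placeholders.

```python
# Sketch of stratified evaluation: accuracy is computed per demographic group
# rather than as a single aggregate, so performance gaps are surfaced.
from collections import defaultdict

def stratified_accuracy(records: list[dict]) -> dict[str, float]:
    """Accuracy of model predictions, computed separately for each group."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        correct[r["group"]] += int(r["prediction"] == r["diagnosis"])
    return {g: correct[g] / total[g] for g in total}

if __name__ == "__main__":
    evaluation_set = [  # synthetic examples for illustration only
        {"group": "group_a", "diagnosis": "asthma", "prediction": "asthma"},
        {"group": "group_a", "diagnosis": "angina", "prediction": "angina"},
        {"group": "group_b", "diagnosis": "asthma", "prediction": "anxiety"},
        {"group": "group_b", "diagnosis": "angina", "prediction": "angina"},
    ]
    per_group = stratified_accuracy(evaluation_set)
    print(per_group)  # {'group_a': 1.0, 'group_b': 0.5}
    print("accuracy gap:", max(per_group.values()) - min(per_group.values()))
```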
Privacy and data security risks stem from the AI’s need to process sensitive health information. LLM use must comply with regulations like HIPAA and GDPR. Techniques such as federated learning and differential privacy are recommended to secure patient data. However, implementing these methods across heterogeneous healthcare systems remains a challenge, especially due to inconsistent data standards and resource disparities.
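One of the techniques named above, differential privacy, can be sketched briefly: noise calibrated to a query's sensitivity is added to an aggregate statistic before it leaves the institution, so no individual record can be inferred from the result. The epsilon value and the example query are illustrative choices, not values from the review.

```python
# Sketch of differential privacy for an aggregate query: Laplace noise scaled
# to the query's sensitivity (1 for a count) is added before release.
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records: list[dict], condition: str, epsilon: float = 1.0) -> float:
    """Differentially private count of patients with a given condition."""
    true_count = sum(1 for r in records if condition in r["conditions"])
    return true_count + laplace_noise(1.0 / epsilon)

if __name__ == "__main__":
    patients = [  # synthetic records for illustration only
        {"id": 1, "conditions": ["diabetes"]},
        {"id": 2, "conditions": ["asthma", "diabetes"]},
        {"id": 3, "conditions": ["asthma"]},
    ]
    print("noisy diabetes count:", private_count(patients, "diabetes", epsilon=0.5))
```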
Additionally, the review highlights legal and ethical uncertainties. Questions of accountability, such as who is liable when an AI system causes harm, remain unresolved. Most legal frameworks are not equipped to handle non-deterministic, probabilistic AI outputs. Similarly, informed consent and explainability are ethical gray zones. Users may be unaware they are interacting with AI, and the lack of transparency in AI decision-making undermines trust and clinician oversight.
How should LLMs be regulated and improved moving forward?
The safe and ethical deployment of LLMs in healthcare will depend on a combination of technical innovation, policy development, and interdisciplinary collaboration, the study asserts.
On the technological front, the next generation of medical LLMs is likely to include multimodal models that integrate text with images, audio, or even video. Examples like LLaVA-Med and MedCLIP aim to interpret radiographs and pathology slides alongside clinical text. However, the study notes a performance gap between benchmark results and real-world conditions, especially in under-resourced settings.
Human-AI collaboration models are recommended over autonomous AI agents. LLMs should act as clinical copilots, generating drafts or suggestions while leaving final decisions to human professionals. Dynamic interfaces allowing for real-time edits and transparency in AI-generated content will be crucial for clinician trust and accountability.
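A minimal sketch of that copilot pattern, under the assumption of a placeholder LLM call and a console-based review step, looks like this: the model only produces a draft, a clinician must review and possibly edit it, and both versions are retained for accountability.

```python
# Sketch of a human-in-the-loop "copilot" workflow: the AI drafts, the
# clinician reviews and signs off, and the draft plus final text are stored.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ReviewedNote:
    ai_draft: str
    final_text: str
    reviewer: str
    approved_at: str

def generate_draft(encounter_summary: str) -> str:
    """Placeholder for an LLM call that drafts a clinical note."""
    return f"Draft note based on encounter: {encounter_summary}"

def clinician_review(draft: str, reviewer: str) -> ReviewedNote:
    """Show the draft, let the clinician edit it, and record the sign-off."""
    print("AI draft:\n", draft)
    edited = input("Edit the note (press Enter to accept as-is): ").strip()
    return ReviewedNote(
        ai_draft=draft,
        final_text=edited or draft,
        reviewer=reviewer,
        approved_at=datetime.now(timezone.utc).isoformat(),
    )

if __name__ == "__main__":
    note = clinician_review(generate_draft("45-year-old with persistent cough"),
                            reviewer="dr_example")
    print("Stored note signed by", note.reviewer, "at", note.approved_at)
```

Keeping the AI draft alongside the clinician's final version is one way to support the audit trail and accountability the study calls for.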
Integration with healthcare infrastructure is another priority. Embedding chatbots within electronic health record (EHR) systems could enable automated note generation, alerting for inconsistencies, and streamlined follow-ups. In telemedicine and chronic disease management, LLMs can provide continuous patient support, medication reminders, and health education, thereby enhancing care between visits.
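The "alerting for inconsistencies" idea can be illustrated with a small, hypothetical check: medications mentioned in an AI-drafted note are compared against the EHR medication list, and omissions are surfaced to the clinician. The record structure and simple string matching are assumptions for illustration.

```python
# Sketch of an EHR consistency alert: flag medications on the patient's EHR
# list that an AI-drafted note fails to mention.
def medication_inconsistencies(draft_note: str, ehr_medications: list[str]) -> dict[str, list[str]]:
    """Split EHR medications into those mentioned in the draft and those missing."""
    note = draft_note.lower()
    mentioned = [m for m in ehr_medications if m.lower() in note]
    missing = [m for m in ehr_medications if m.lower() not in note]
    return {"mentioned": mentioned, "missing_from_note": missing}

if __name__ == "__main__":
    draft = "Patient continues metformin 500 mg twice daily; reports good adherence."
    alerts = medication_inconsistencies(draft, ["Metformin", "Lisinopril"])
    print(alerts)  # Lisinopril is on the EHR list but absent from the draft
```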
Privacy-preserving strategies such as federated learning, on-device processing, and encryption are identified as essential for secure deployment. These techniques keep raw patient data on local devices or within institutions, reducing the risk of sensitive information leakage.
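The federated-learning idea can be sketched in a few lines: each site trains on its own records, and only model parameters, never patient data, are sent to a coordinating server that averages them. The tiny linear model and synthetic per-hospital datasets are illustrative assumptions.

```python
# Sketch of federated averaging: sites compute local updates on private data
# and share only parameters, which the server averages into a global model.
def local_update(weights: list[float], data: list[tuple[list[float], float]],
                 lr: float = 0.01) -> list[float]:
    """One gradient-descent step of linear regression on a site's local data."""
    grads = [0.0] * len(weights)
    for features, target in data:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for i, x in enumerate(features):
            grads[i] += 2 * error * x / len(data)
    return [w - lr * g for w, g in zip(weights, grads)]

def federated_average(site_weights: list[list[float]]) -> list[float]:
    """Server-side aggregation: average parameters across sites."""
    return [sum(ws) / len(site_weights) for ws in zip(*site_weights)]

if __name__ == "__main__":
    # Synthetic per-hospital datasets; raw records never leave their site.
    hospital_a = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0)]
    hospital_b = [([1.0, 1.0], 3.0), ([3.0, 2.0], 8.0)]
    global_weights = [0.0, 0.0]
    for _ in range(50):  # federated rounds
        updates = [local_update(global_weights, site) for site in (hospital_a, hospital_b)]
        global_weights = federated_average(updates)
    print("global model weights:", [round(w, 2) for w in global_weights])
```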
In addition, the study emphasizes the pressing need for policy and regulatory modernization to keep pace with generative AI in healthcare. Current legal frameworks fall short of addressing the complexities of large language models. To close this gap, the authors advocate for AI-specific legal reforms that define liability, set clear performance benchmarks, and establish formal channels for patient redress. As an initial step, they propose a tiered risk-based regulatory approach that differentiates between low- and high-stakes applications, supported by post-deployment monitoring to track performance and detect emergent risks.
First published in: Devdiscourse

