Date: 2025-04-02

ChatGPT, while promising in its factual accuracy, falls short in completeness and may pose a risk to patient safety and medication adherence, according to a new study that evaluates the generative AI tool's ability to provide accurate and safe patient medication instructions. The findings of the study “Evaluating the accuracy of ChatGPT in delivering patient instructions for medications: an exploratory case study” are published in Frontiers in Artificial Intelligence.

The research, conducted by a team at King Saud University, assessed ChatGPT's responses against the standardized CareNotes® drug information system for three commonly prescribed medications: tirzepatide, citalopram, and apixaban. The assessment involved 37 evaluators, including four pharmacy experts and 33 PharmD interns, who independently rated the AI-generated reports using a detailed questionnaire addressing correctness, completeness, potential harm, and adherence risks.

While ChatGPT’s responses were often factually correct, the experts unanimously agreed that they were frequently incomplete and sometimes omitted critical safety information. In certain cases, these omissions could lead to patient harm or reduce adherence to prescribed therapies.

The study used version 3.5 of ChatGPT, accessed publicly, and involved generating five different responses for each drug by restarting the chat session to ensure variability. These 15 reports were then compared against CareNotes® content using a validated questionnaire derived from prior research.

Among the key findings, the correctness of ChatGPT’s responses varied by drug and by version. For tirzepatide and apixaban, three out of four experts deemed most versions correct, though concerns were raised over dosing frequency and incomplete guidance on hypoglycemia and missed doses. For citalopram, expert consensus was less favorable. Multiple experts noted incorrect or absent instructions, particularly related to food interactions and time of administration.

Completeness emerged as the most consistent shortfall. All experts found the reports for tirzepatide and apixaban to be missing vital information such as contraindications, storage guidance, pregnancy and breastfeeding warnings, drug interactions, and serious side effects. The lack of warnings about thyroid cancer, pancreatitis, or serotonin syndrome, depending on the drug, was flagged as a significant omission. For citalopram, only one version was found to be nearly complete by one expert, with the rest lacking key patient guidance.

The potential for harm was evident in many responses. For tirzepatide, all but one expert warned that the lack of safety and administration details could lead to misuse. For apixaban, divergent expert opinions emerged, with some reports missing crucial data on drug discontinuation that could lead to stroke or bleeding risks. Citalopram responses, while occasionally more detailed, often missed red-flag information that could lead to inappropriate self-adjustment of dosage or underestimation of side effects.

A similar pattern was observed regarding the potential for poor adherence. Experts noted that inadequate explanations of drug onset timelines or omission of adverse effects might cause patients to stop treatment prematurely. One expert highlighted ChatGPT’s failure to mention that citalopram may take several weeks to produce therapeutic effects - a critical point that, if not conveyed, might cause a patient to discontinue use.

PharmD interns, who evaluated fewer versions per drug, generally rated the responses more favorably, especially in terms of correctness. However, their assessments were also marked by limited recognition of clinical risks, a discrepancy the authors attribute to the interns’ relative lack of clinical experience.

Intern evaluations revealed several gaps similar to those found by experts, though at lower rates. The best-rated tirzepatide version was judged complete by only 57% of interns, and the other versions scored lower. Citalopram’s highest completeness rating from interns reached just over 70%, and apixaban versions hovered around similar figures. Interns also identified missing brand names, contraindications, side effects, and instructions for special populations like pregnant women and elderly patients.

The study’s comprehensive structure - drawing on both seasoned clinicians and student reviewers - underscores the importance of expert oversight in AI-assisted healthcare delivery. Notably, both groups acknowledged variability in ChatGPT’s output between different prompts, further complicating the technology’s use as a consistent tool for patient instruction.

Authors Norah Othman Abanmy, Nadia Al-Ghreimil, Jawza F. Alsabhan, Heyam Al-Baity, and Rana Aljadeed concluded that ChatGPT cannot yet be considered a reliable standalone source for medication instructions. The study stresses the need for continued refinement of generative AI tools before they can be safely deployed in patient-facing contexts.

They also note that ChatGPT, while user-friendly and widely accessible, lacks the domain-specific training required for the delivery of nuanced, safe, and regulatory-compliant healthcare information. The authors urge developers and healthcare regulators to consider integrating AI tools with expert-reviewed databases and to develop clear standards for safety, validation, and deployment.

The limitations of the study include its focus on only three medications and the use of simulated prompts rather than organically generated patient queries. Still, the results align with similar studies in the field, including a 2024 investigation that found only 26% of ChatGPT’s drug-related responses to be correct or low-risk.

AI’s rise in healthcare is unstoppable, but this study sounds the alarm: incomplete or wrong advice could harm patients. While generative tools like ChatGPT can aid clinicians and educate patients, they’re no match for a licensed healthcare professional - yet. The authors push for real-time testing, readability checks, and user comprehension to ensure safety. Until AI nails complete, accurate, and trustworthy drug info, it’s a helper, not a replacement - human expertise must stay in charge.