GPT-4o predicts drug overdose risk with clinical accuracy using insurance claims data

Large language models (LLMs) like GPT-4o can effectively predict drug overdose risk from patients' longitudinal insurance claims, outperforming conventional machine learning techniques under certain conditions, according to a new study submitted to arXiv. The paper, titled “Large Language Models for Drug Overdose Prediction from Longitudinal Medical Records”, offers a promising step toward real-time clinical decision support systems tailored for overdose prevention.
Led by researchers at the University of Kentucky, the study compared the performance of GPT-4o in both fine-tuned and zero-shot inference modes against baseline models such as Random Forest and XGBoost. Using three years (2020–2022) of patient data from the Merative MarketScan Research Databases, which contain de-identified longitudinal insurance claims, the research frames overdose prediction as a sequence modeling task - assessing the likelihood of an overdose within a 7-day or 30-day time window based on prior medical events.
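The study's preprocessing pipeline is not reproduced in the article, but the sequence-framing idea can be sketched in a few lines of Python. Everything below is illustrative: the field names and the labeling rule are assumptions, not the study's actual code.

```python
from datetime import timedelta

def build_example(visits, overdose_dates, index_date, window_days=7):
    """Illustrative sketch of the sequence framing: visits before the
    index date become the model's input text, and the label records
    whether an overdose falls within the prediction window.
    Field names ('date', 'diagnoses', ...) are hypothetical."""
    history = [v for v in visits if v["date"] < index_date]
    lines = [
        f"{v['date']}: diagnoses={v['diagnoses']}, "
        f"procedures={v['procedures']}, prescriptions={v['prescriptions']}"
        for v in history
    ]
    window_end = index_date + timedelta(days=window_days)
    label = any(index_date <= d < window_end for d in overdose_dates)
    return "\n".join(lines), int(label)
```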
How effectively can LLMs detect overdose risk without task-specific training?
A critical contribution of the study is its evaluation of LLM performance in a zero-shot setting - a configuration in which the model has not been trained with any task-specific examples. When fed detailed English descriptions of past diagnoses, procedures, and prescriptions, GPT-4o achieved an F1-score of 54.32 for predicting overdoses within seven days and 57.24 within thirty days. Even when presented only with original medical codes, which are more compact and computationally efficient, the model still produced F1-scores above 52 in both windows.
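The article does not quote the study's prompt, but a zero-shot call of this general shape would fit the description; the instruction text and the yes/no output parsing below are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zero_shot_risk(history_text, window_days=7):
    """Assumed zero-shot setup: no task-specific examples, only an
    instruction plus the serialized visit history."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You assess clinical risk. Answer only 'yes' or 'no'."},
            {"role": "user",
             "content": (f"Patient claims history:\n{history_text}\n\n"
                         f"Will this patient experience a drug overdose "
                         f"within the next {window_days} days?")},
        ],
    )
    return response.choices[0].message.content.strip().lower() == "yes"
```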
The results underscore GPT-4o's capacity to interpret structured clinical data repurposed into a natural language or JSON-like format. When the input included up to 30 past medical visits per patient, the model could effectively reason over sequences of health events. Performance dipped slightly once additional visits pushed prompt length beyond roughly 6,500 tokens, making 30 visits the best balance between context and cost.
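A simple guard reproduces that balance: keep the most recent 30 visits, then trim further if the serialized prompt still exceeds the token budget. The truncation policy below is an assumption; o200k_base is the tiktoken encoding GPT-4o uses.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer

def truncate_history(visit_lines, max_visits=30, max_tokens=6500):
    """Keep the newest visits and drop the oldest until the prompt
    fits the token budget. The exact policy is an assumption."""
    kept = visit_lines[-max_visits:]
    while kept and len(enc.encode("\n".join(kept))) > max_tokens:
        kept = kept[1:]  # drop the oldest remaining visit
    return "\n".join(kept)
```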
Importantly, even with no fine-tuning, GPT-4o outperformed traditional models in certain contexts, especially in identifying true negatives. Specificity - the ability to correctly identify patients not at risk of overdose - reached as high as 83% in some zero-shot scenarios, a useful property for avoiding false alarms in clinical workflows.
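Specificity is the true-negative rate; computed from a confusion matrix, it looks like this:

```python
from sklearn.metrics import confusion_matrix

def specificity(y_true, y_pred):
    """Of the patients who did not overdose, the fraction the model
    correctly cleared: TN / (TN + FP)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp)

# specificity([0, 0, 1, 0], [0, 1, 1, 0]) -> 0.667 (2 of 3 negatives cleared)
```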
Can fine-tuning elevate LLMs to clinical-grade performance?
The researchers fine-tuned GPT-4o on structured representations of 900 patient records (300 positive and 600 control cases) for each prediction window. The fine-tuned model outperformed the baseline XGBoost, especially in identifying true positives. For example, for the 7-day window, GPT-4o achieved an F1-score of 84.53 using summarized medical statistics, compared to XGBoost’s 78.92. It also showed a marked improvement in recall - detecting 82% of at-risk patients compared to 73% with XGBoost.
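OpenAI's fine-tuning endpoint expects chat-formatted JSONL, so the 900 records would be packaged roughly as follows. The prompt wording, label format, and model snapshot are assumptions; only the record counts come from the study.

```python
import json
from openai import OpenAI

client = OpenAI()

# Toy stand-ins for the study's 300 positive and 600 control cases.
examples = [
    ("2021-03-02: diagnoses=['F11.20'], prescriptions=['oxycodone']", 1),
    ("2021-05-14: diagnoses=['J06.9'], prescriptions=['amoxicillin']", 0),
]

with open("overdose_7day.jsonl", "w") as f:
    for history_text, label in examples:
        record = {"messages": [
            {"role": "system", "content": "Predict 7-day overdose risk."},
            {"role": "user", "content": history_text},
            {"role": "assistant", "content": "yes" if label else "no"},
        ]}
        f.write(json.dumps(record) + "\n")

upload = client.files.create(file=open("overdose_7day.jsonl", "rb"),
                             purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-2024-08-06",  # assumed fine-tunable GPT-4o snapshot
)
```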
This gain was consistent across both timeframes and prompt types. Notably, the model trained on descriptive summaries (rather than full sequential data) delivered the highest performance when fine-tuned, suggesting that even simplified aggregated representations can unlock clinical value if paired with proper model adaptation.
The model also handled nuanced patient cohorts effectively. A distinctive challenge in overdose prediction is separating patients who use opioids or stimulants without overdosing from those who do overdose. When tested on this “exposed control” cohort, zero-shot performance dropped because of false positives. Once fine-tuned, however, GPT-4o's accuracy on this group jumped to 95.67%, showing that the model can unlearn the bias that equates drug exposure with inevitable overdose.
What are the implications for real-world deployment and model design?
Beyond clinical accuracy, the study investigated practical factors including cost efficiency and data representation. Using natural language descriptions of each medical event raised token counts and OpenAI API usage costs - about $0.0137 per prediction for a 7-day forecast. Switching to compact medical codes reduced that by 25%, and using statistical summaries cut costs by 70%.
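Those figures translate directly into screening budgets; a quick sketch using only the percentages quoted above:

```python
# Per-prediction costs for the 7-day window, anchored to the article:
# $0.0137 for natural language inputs, 25% less for raw codes,
# 70% less for statistical summaries.
verbose_cost = 0.0137
code_cost = verbose_cost * (1 - 0.25)     # ~$0.0103 per prediction
summary_cost = verbose_cost * (1 - 0.70)  # ~$0.0041 per prediction

def batch_cost(n_patients, cost_per_prediction):
    """Total cost of screening a cohort at one input representation."""
    return n_patients * cost_per_prediction

print(f"${batch_cost(10_000, summary_cost):.2f}")  # ~$41 to triage 10,000 patients
```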
Yet, even with concise inputs, performance remained competitive. This trade-off points to promising deployment strategies, such as using code-based prompts for bulk triaging and detailed prompts for flagged cases. Moreover, LLMs showed they could effectively interpret medical codes directly - achieving comparable accuracy to models trained on verbose data - a critical insight for integration with electronic health records (EHRs), where codes dominate.
The researchers also dissected input feature contributions. Diagnoses alone enabled the highest standalone predictive accuracy, outperforming procedures and prescriptions. Still, the best results emerged from including all three components. Diagnosis records for conditions such as hypertension and anxiety were disproportionately present in overdose cases, validating the model’s sensitivity to comorbid risk signals.
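That kind of feature attribution typically comes from an ablation: rerun the evaluation with each subset of input components. A hypothetical sketch of the prompt-building side:

```python
from itertools import combinations

COMPONENTS = ("diagnoses", "procedures", "prescriptions")

def ablation_inputs(visit):
    """Yield one serialized history line per non-empty feature subset,
    so each evaluation run sees only the chosen components.
    The visit dict's keys are hypothetical."""
    for n in range(1, len(COMPONENTS) + 1):
        for subset in combinations(COMPONENTS, n):
            line = ", ".join(f"{k}={visit[k]}" for k in subset)
            yield subset, line
```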
Another finding centered on data sequence structure. While LLMs initially underperformed when only given statistical aggregates in zero-shot mode, fine-tuning on those same aggregates flipped the result - the model exceeded baseline performance. This adaptability indicates that LLMs are not limited to language-based narratives; they can learn to emulate statistical reasoning when exposed to structured training.
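A statistical-aggregate input can be as simple as event counts. The study's exact summary statistics are not listed in the article, so the features below are assumptions.

```python
from collections import Counter

def summarize(visits):
    """Collapse a visit sequence into aggregate statistics of the kind
    a fine-tuned model could be trained on. Feature choices are assumed."""
    diagnoses = Counter(d for v in visits for d in v["diagnoses"])
    return {
        "n_visits": len(visits),
        "n_unique_diagnoses": len(diagnoses),
        "top_diagnoses": diagnoses.most_common(5),
        "n_prescriptions": sum(len(v["prescriptions"]) for v in visits),
    }
```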
Toward proactive overdose prevention with LLM integration
The study concludes with a vision for embedding LLMs into proactive healthcare workflows. Although insurance claims data are not updated in real time, researchers suggest augmenting this with local clinic data to close the recency gap. Furthermore, the results support the feasibility of implementing LLM-based risk models within hospital decision support systems, either as first-pass screeners or as explainable adjuncts to human judgment.
While the research is limited by its sample size and by the absence of laboratory records, it offers a foundation for extending overdose prediction capabilities through LLMs trained on longitudinal, multi-modal clinical datasets.
FIRST PUBLISHED IN: Devdiscourse