Large language models move closer to clinical use in nutrition assessment
Dietary assessment has long been a bottleneck in nutrition research and public health. Common tools such as food frequency questionnaires, 24-hour recalls, and weighed food records rely heavily on self-reporting, creating well-documented problems with recall error, underreporting, and participant burden. These limitations become especially acute in large populations, institutional settings, and vulnerable groups where dietary intake is complex and poorly standardized.
New research shows that large language models (LLMs) can classify real-world food data with accuracy approaching that of trained human experts. The findings come amid growing concern that traditional dietary assessment methods are too slow, labor-intensive, and prone to bias to support modern public health surveillance and clinical nutrition at scale.
The study, titled "Large Language Models for Real-World Nutrition Assessment: Structured Prompts, Multi-Model Validation and Expert Oversight" and published in the journal Nutrients, rigorously evaluates whether advanced language models can reliably assess dietary quality using real, non-ideal food data drawn from everyday settings rather than curated datasets.
Based on nearly 2,000 food items collected from residents of long-term care facilities in Poland, the authors conclude that modern AI systems can deliver high-quality nutritional classification when carefully designed prompts and expert oversight are in place. At the same time, the study highlights persistent gaps between automated outputs and human judgment, underscoring why AI remains a support tool rather than a replacement for professional dietitians.
Testing AI against real dietary complexity
The new study addresses this long-standing bottleneck by testing LLMs under realistic conditions. Instead of relying on idealized food logs, the researchers analyzed food items stored by residents in long-term care facilities. These items included packaged snacks, homemade foods, and products brought in by family members, reflecting the messy, unstructured nature of real dietary environments.
Three advanced language models were evaluated: Claude Opus 4.5, Gemini 3 Pro, and GPT-5.1-chat-latest. Each model was asked to classify every food item as healthy or unhealthy, and their results were compared against a gold-standard reference created by two independent human experts. This design allowed the researchers to measure not only accuracy but also disagreement patterns, bias, and sensitivity to classification rules.
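To illustrate how such a comparison can be scored, the short Python sketch below computes percent agreement against a gold standard and tallies where disagreements fall. The function names and sample labels are hypothetical and do not reproduce the study's code or data.

```python
from collections import Counter

def percent_agreement(model_labels, expert_labels):
    """Share of items where the model matches the expert gold standard."""
    matches = sum(m == e for m, e in zip(model_labels, expert_labels))
    return matches / len(expert_labels)

def disagreement_pattern(model_labels, expert_labels):
    """Tally (model label, expert label) pairs for the items that disagree."""
    return Counter((m, e) for m, e in zip(model_labels, expert_labels) if m != e)

# Invented labels for illustration only.
expert = ["healthy", "unhealthy", "unhealthy", "healthy"]
model = ["healthy", "unhealthy", "healthy", "healthy"]
print(percent_agreement(model, expert))     # 0.75
print(disagreement_pattern(model, expert))  # Counter({('healthy', 'unhealthy'): 1})
```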
The study tested two distinct prompting strategies. The first was a structured, rule-based approach that combined the NOVA food processing classification with nutrient thresholds set by the World Health Organization. Under this method, ultra-processed foods were automatically classified as unhealthy, while remaining foods were evaluated against limits for sugars, saturated fats, and sodium. The second approach used a simplified prompt that asked models to make a holistic judgment without explicitly enforcing these frameworks.
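As a rough illustration of the structured, two-step logic, the sketch below first flags NOVA group 4 (ultra-processed) items as unhealthy and only then checks the remaining foods against nutrient limits. The field names and threshold values are placeholders in the spirit of WHO guidance, not the exact cut-offs used in the study.

```python
def classify_structured(item):
    """Two-step, rule-based classification sketch.

    Step 1: any ultra-processed food (NOVA group 4) is labelled unhealthy.
    Step 2: remaining foods are checked against nutrient limits per 100 g.
    The limits below are illustrative placeholders, not the study's cut-offs.
    """
    if item["nova_group"] == 4:
        return "unhealthy"

    limits = {"sugars_g": 10.0, "saturated_fat_g": 5.0, "sodium_mg": 400.0}
    for nutrient, limit in limits.items():
        if item.get(nutrient, 0.0) > limit:
            return "unhealthy"
    return "healthy"

# Example items (values per 100 g); fields and numbers are assumed for illustration.
snack = {"nova_group": 4, "sugars_g": 35.0, "saturated_fat_g": 12.0, "sodium_mg": 300.0}
oatmeal = {"nova_group": 1, "sugars_g": 1.0, "saturated_fat_g": 1.5, "sodium_mg": 5.0}
print(classify_structured(snack))    # unhealthy (ultra-processed)
print(classify_structured(oatmeal))  # healthy
```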
The results show that all three models performed strongly across both approaches, achieving agreement with human experts between roughly 90 and 94 percent. This level of performance places large language models within striking distance of expert-level classification, particularly given the complexity and ambiguity of the dataset.
Prompt design shapes accuracy and bias
According to the study, how an AI system is instructed matters as much as which model is used. The structured, two-step prompt produced highly conservative classifications. Under this approach, models were exceptionally good at identifying unhealthy foods, minimizing the risk of falsely labeling poor-quality products as healthy. However, this safety-oriented bias came at a cost, as borderline items were often pushed into the unhealthy category even when human experts judged them more leniently.
Conversely, the simplified prompt yielded higher overall accuracy and closer alignment with expert intuition. Models classified more foods as healthy and demonstrated a better balance between sensitivity and specificity. This suggests that when AI systems are allowed to rely on broader contextual reasoning rather than rigid rules, their outputs more closely resemble holistic human judgment.
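The balance the authors describe can be quantified with standard sensitivity and specificity calculations, treating "unhealthy" as the positive class. The sketch below is illustrative only, and the labels shown are invented; it simply demonstrates how a conservative prompt can raise sensitivity at the expense of specificity.

```python
def sensitivity_specificity(pred, truth, positive="unhealthy"):
    """Sensitivity: share of truly unhealthy items flagged as unhealthy.
    Specificity: share of truly healthy items kept as healthy."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, truth))
    fn = sum(p != positive and t == positive for p, t in zip(pred, truth))
    tn = sum(p != positive and t != positive for p, t in zip(pred, truth))
    fp = sum(p == positive and t != positive for p, t in zip(pred, truth))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Invented example: a conservative classifier catches every unhealthy item
# but also mislabels some healthy ones.
truth = ["unhealthy", "unhealthy", "healthy", "healthy", "healthy"]
conservative = ["unhealthy", "unhealthy", "unhealthy", "unhealthy", "healthy"]
print(sensitivity_specificity(conservative, truth))  # (1.0, 0.333...)
```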
The trade-off revealed by this comparison has direct implications for real-world deployment. In clinical and public health settings, the consequences of misclassification are asymmetric. Labeling an unhealthy food as healthy can have greater health consequences than the reverse. As a result, conservative bias may be preferable in high-risk contexts such as disease prevention or institutional nutrition management, even if it slightly reduces agreement with expert opinion.
The study also highlights meaningful differences between the models themselves. While all three achieved high accuracy, their classification patterns varied significantly, even under identical prompts. This variability reflects differences in model architecture, training data, and internal reasoning processes. Importantly, the researchers found that combining outputs from multiple models into a consensus result helped stabilize performance and reduce individual biases.
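A simple way to form such a consensus is majority voting across the per-item labels produced by each model, as sketched below. The voting rule shown is an assumption for illustration; the paper's exact aggregation procedure may differ.

```python
from collections import Counter

def consensus_label(votes):
    """Majority vote over one item's labels from several models.
    Items with no clear majority are returned as 'needs review'."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "needs review"

# One item's labels from three hypothetical models:
print(consensus_label(["unhealthy", "unhealthy", "healthy"]))  # unhealthy
print(consensus_label(["healthy", "healthy", "healthy"]))      # healthy
```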
This multi-model approach mirrors practices in other high-stakes AI domains, such as medical imaging and financial risk modeling, where ensemble systems are used to improve robustness. The findings suggest that nutrition assessment could benefit from similar strategies, particularly when decisions affect large populations.
Language, expertise, and the limits of automation
All food descriptions were processed in Polish, a morphologically rich language with complex grammatical structure. Rather than translating the data into English, the researchers deliberately preserved the original language to test how well language models handle culturally and linguistically specific food information.
The results indicate that language is not a neutral carrier of data. Polish food descriptions often encode preparation methods, ingredient relationships, and product categories in ways that may be lost or distorted through translation. The strong performance observed in this study suggests that language models can leverage these linguistic features to improve classification accuracy, but it also raises concerns about deploying models trained primarily on English data across diverse linguistic contexts without proper validation.
The authors warn that translating dietary data into another language before analysis could introduce semantic loss and bias, particularly for culturally specific foods. This finding has broad implications for global nutrition surveillance, where standardized tools are often applied across regions with vastly different food systems and languages.
Despite the high accuracy achieved by AI systems, the study makes clear that human expertise remains essential. Statistically significant differences persisted between model outputs and expert classifications across all prompts and models. Most disagreements occurred in borderline cases, such as minimally processed foods with moderate nutrient excesses or items with incomplete ingredient information.
Rather than viewing these discrepancies as failures, the authors frame them as evidence of where human judgment adds value. In practice, the study suggests a hybrid workflow in which AI performs the initial classification at scale, while experts focus on reviewing and correcting the smaller subset of ambiguous cases. This approach dramatically reduces workload without sacrificing quality.
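A minimal sketch of that triage idea appears below: items on which the models agree are accepted automatically, while disagreements are queued for expert review. The routing rule, function names, and data are hypothetical, not the workflow specified by the authors.

```python
def triage(items, model_labels_per_item):
    """Split items into auto-accepted results and a queue for expert review.

    An item goes to experts when the models disagree; unanimous items are
    accepted automatically. This rule is an assumption for illustration.
    """
    auto_accepted, expert_queue = [], []
    for item, labels in zip(items, model_labels_per_item):
        if len(set(labels)) == 1:
            auto_accepted.append((item, labels[0]))
        else:
            expert_queue.append((item, labels))
    return auto_accepted, expert_queue

# Invented items and labels for illustration only.
items = ["fruit yogurt", "homemade soup", "chocolate wafer"]
labels = [["healthy", "unhealthy", "healthy"],
          ["healthy", "healthy", "healthy"],
          ["unhealthy", "unhealthy", "unhealthy"]]
accepted, queue = triage(items, labels)
print(len(accepted), "auto-accepted;", len(queue), "sent to expert review")
```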
The efficiency gains are substantial. Manually classifying thousands of food items requires extensive training, sustained concentration, and significant time investment. By contrast, reviewing AI-generated classifications allows experts to concentrate on complex decisions rather than routine categorization. For large epidemiological studies, institutional audits, or national nutrition monitoring programs, this shift could make previously infeasible analyses achievable.
The researchers also stress the importance of governance and oversight. AI systems should not operate as black boxes, particularly in healthcare-related domains. Transparent prompting strategies, documented uncertainty, and clear thresholds for expert intervention are necessary to ensure trust and accountability.
- READ MORE ON:
- AI nutrition assessment
- large language models nutrition
- automated dietary classification
- artificial intelligence in nutrition
- digital nutrition tools
- dietary assessment AI
- ultra-processed foods AI
- public health nutrition technology
- clinical nutrition AI
- nutrition data analysis
- FIRST PUBLISHED IN:
- Devdiscourse

