AI models overestimate robotization potential in agricultural jobs

A new benchmarking study published in AgriEngineering warns that large language models significantly overestimate the risk of automation in agricultural occupations, raising concerns about the reliability of AI in evaluating workforce vulnerabilities. The peer-reviewed article, titled "Benchmarking Large Language Models in Evaluating Workforce Risk of Robotization: Insights from Agriculture", provides the first systematic comparison of general-purpose language models - ChatGPT, Copilot, and Gemini - against expert-validated assessments of automation susceptibility in 15 agricultural occupations.
The key objective of the study is to determine whether LLMs can reliably assess the risk of robotization and how their predictions align with domain-expert evaluations. The research seeks to answer three core questions: Can LLMs accurately identify the importance of agricultural tasks? Do they provide realistic estimates of robotization potential? And how well do their assessments reflect the nuanced cognitive and manual demands of farm-related jobs?
Within a three-step evaluation framework, each LLM was asked to rate the importance of specific job tasks, evaluate the feasibility of robotization, and classify task types across cognitive/manual and routine/non-routine dimensions. The methodology mirrored that of an earlier expert-based study, which served as the gold standard or “ground truth” for comparative analysis.
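To make the setup concrete, the comparison can be pictured as three parallel ratings checked against the expert baseline. The sketch below is illustrative only; the task names, 0-1 score scales, and data structures are assumptions, not the study's actual materials or code.

```python
# Minimal sketch of the three-step comparison against expert "ground
# truth". Task names and 0-1 scales are assumptions for illustration.

expert = {  # expert-validated baseline for one occupation
    "task_importance": {"inspect crops": 0.9, "file reports": 0.4},
    "robotization": 0.25,                    # feasibility of automation
    "task_type": ("manual", "non-routine"),  # cognitive/manual x routine/non-routine
}

llm = {  # the same three ratings elicited from one LLM
    "task_importance": {"inspect crops": 0.8, "file reports": 0.6},
    "robotization": 0.55,
    "task_type": ("cognitive", "routine"),
}

# Step 1: gap in task-importance ratings, per task
importance_gap = {
    task: abs(expert["task_importance"][task] - llm["task_importance"][task])
    for task in expert["task_importance"]
}
# Step 2: error in the robotization-feasibility score
robotization_error = abs(expert["robotization"] - llm["robotization"])
# Step 3: does the cognitive/manual x routine/non-routine label match?
type_match = expert["task_type"] == llm["task_type"]

print(importance_gap, round(robotization_error, 2), type_match)
```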
Even with prompts engineered around job-specific data from the U.S. Department of Labor’s O*NET platform, the LLMs exhibited a pronounced tendency to overestimate robotization potential. The average error across models was calculated at 0.229 ± 0.174. In several cases, the models elevated low-susceptibility occupations such as agricultural managers, animal scientists, and agricultural educators into moderate- or high-risk zones for automation. The researchers attributed this overestimation primarily to the models’ reliance on grey literature, optimistic technology narratives, and a failure to grasp the complexities of real-world agricultural environments.
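The article does not spell out how the 0.229 ± 0.174 figure was derived, but on a 0-1 scale it reads naturally as the mean and standard deviation of absolute deviations from expert scores. A minimal sketch under that assumption, with invented numbers (the real study covered 15 occupations):

```python
import statistics

# Hypothetical per-occupation robotization scores on a 0-1 scale.
expert_scores = [0.20, 0.30, 0.15, 0.40, 0.25]
model_scores  = [0.55, 0.45, 0.30, 0.60, 0.50]  # note the upward shift

errors = [abs(m - e) for m, e in zip(model_scores, expert_scores)]
mean_err = statistics.mean(errors)   # plays the role of the reported 0.229
std_err = statistics.stdev(errors)   # plays the role of the reported ±0.174
print(f"{mean_err:.3f} ± {std_err:.3f}")
```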
Overall, the LLMs placed a higher proportion of occupations in the “yellow” and “red” robotization-risk zones, whereas expert assessments placed the majority in the “green” zone. ChatGPT slightly outperformed the others, aligning most often with human judgment, while Gemini showed the strongest correlation with expert rankings overall. Still, inter-rater agreement among the three models reached only 53%, underscoring inconsistencies even among the AI systems themselves.
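The article likewise leaves the 53% agreement metric undefined; one simple reading is the share of occupations on which all three models assign the same risk zone. A sketch under that assumption, with made-up zone labels:

```python
# Hypothetical zone labels per occupation for the three models.
chatgpt = ["green", "yellow", "red", "yellow", "green", "red"]
copilot = ["green", "red", "red", "yellow", "yellow", "red"]
gemini = ["yellow", "yellow", "red", "yellow", "green", "red"]

# Count occupations where all three models assign the same zone.
agree = sum(a == b == c for a, b, c in zip(chatgpt, copilot, gemini))
rate = agree / len(chatgpt)
print(f"inter-model agreement: {rate:.0%}")
```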
The study further examined how well the models understood task composition. While human experts accurately distinguished between cognitive and manual tasks and their routine nature, the LLMs consistently misclassified these aspects. In particular, LLMs failed to recognize the manual and non-routine complexity of occupations such as farm equipment mechanics and service technicians, often placing them in cognitively routine categories. The analysis concluded that the models leaned heavily toward classifying jobs as cognitive and routine, revealing a structural bias in how LLMs conceptualize labor.
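The reported skew toward “cognitive and routine” labels can be illustrated by tallying model versus expert quadrant assignments; the task labels below are invented for illustration, not drawn from the study:

```python
from collections import Counter

# Hypothetical quadrant labels (cognitive/manual x routine/non-routine).
expert = ["manual/non-routine", "manual/routine",
          "cognitive/non-routine", "manual/non-routine"]
model = ["cognitive/routine", "cognitive/routine",
         "cognitive/routine", "manual/routine"]

mismatches = sum(e != m for e, m in zip(expert, model))
print(Counter(model))  # shows the pile-up in "cognitive/routine"
print(f"misclassified: {mismatches}/{len(expert)}")
```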
Moreover, error analysis revealed no consistent accuracy leader among the three models, although ChatGPT delivered slightly lower error margins across the board. Copilot demonstrated strengths in decomposing technical processes, while Gemini’s broader context handling allowed for more nuanced assessments in a few cases. However, none of the models could reliably emulate human expertise when evaluating task intricacies affected by environmental, regulatory, or logistical challenges - hallmarks of real-world farming.
The researchers also investigated the sources of bias embedded in LLM assessments. They found that models frequently draw from promotional materials, speculative industry reports, and academic literature lacking practical validation. This data skew fosters an optimism bias, inflating assessments of what robotics and AI can realistically achieve in agriculture. The models rarely account for infrastructural limitations, regulatory constraints, safety concerns, or socio-economic impacts, all of which significantly moderate the feasibility of automation in agricultural domains.
The tendency of LLMs to reflect the Dunning–Kruger effect was also highlighted. By overconfidently estimating robotization potential in areas outside their “lived” training experience, LLMs effectively mimic the behavior of non-experts overestimating their competence. In this case, AI systems underestimated the labor-intensive, highly variable, and climate-sensitive nature of agricultural tasks, often making sweeping generalizations based on idealized technology use cases.
The study’s authors recommend that future LLM development for labor market assessments include sector-specific training, expert-informed datasets, and integration with ground-level feedback loops. They argue for collaborative frameworks that pair the scale and speed of AI with the qualitative depth of human expertise. Rather than serving as standalone tools for evaluating workforce displacement risk, LLMs should be deployed in hybrid models where human evaluators vet, calibrate, and contextualize AI-generated predictions.
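One way such a hybrid loop could work in practice (our sketch, not a design from the paper) is to auto-accept model scores only when they fall within a tolerance of a human-calibrated baseline and route everything else to expert review; the threshold and function below are hypothetical:

```python
TOLERANCE = 0.15  # assumed disagreement threshold; tuned per deployment

def triage(occupation: str, llm_score: float, expert_baseline: float):
    """Flag LLM robotization scores that drift too far from a
    human-calibrated baseline, so experts vet them before use."""
    if abs(llm_score - expert_baseline) <= TOLERANCE:
        return (occupation, llm_score, "auto-accepted")
    return (occupation, llm_score, "needs expert review")

print(triage("agricultural manager", 0.55, 0.20))  # flagged for review
```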
Beyond improving LLMs’ domain fidelity, the study calls for transparent AI workflows in workforce impact evaluations. This includes the development of explainable AI techniques, standardized attribution methodologies, and active bias mitigation strategies in training datasets. Researchers also advocate for the use of occupational mapping tools that more accurately reflect the cognitive, manual, routine, and non-routine spectra of real-world labor, particularly in under-automated sectors like agriculture.
In a nutshell, the study shows that without the grounding influence of human judgment, AI assessments of robotization risk can misrepresent threats and overlook opportunities in complex sectors such as agriculture.
FIRST PUBLISHED IN: Devdiscourse