AI in public services works best as a support tool, not a decision-maker


CO-EDP, VisionRI | Updated: 27-01-2026 12:25 IST | Created: 27-01-2026 12:25 IST
Representative Image. Credit: ChatGPT

Rising caseloads, staff shortages, and growing public scrutiny have prompted governments to look toward artificial intelligence (AI) as a way to improve efficiency and consistency in decision-making. However, new research suggests that when these systems are applied to high-stakes public services, their limitations can be as consequential as their benefits.

A new study titled “The Promises and Perils of Using LLMs for Effective Public Services,” published in the Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, analyzes how large language models perform in real-world public sector work. Focusing on child welfare services, the research examines whether LLM-based tools can meaningfully support caseworkers without undermining professional judgment, accountability, and ethical decision-making.

Why child welfare exposes the limits of automation

Child welfare represents one of the most complex and sensitive domains in public administration. Caseworkers must balance competing goals: protecting children from harm while avoiding unnecessary or prolonged state intervention in family life. Decisions are shaped by evolving family circumstances, regulatory requirements, and professional norms that emphasize discretion and contextual reasoning.

The study situates LLM deployment within this reality rather than in abstract benchmarks. Working in collaboration with a large Canadian child welfare agency, the researchers analyzed how language models could assist with interpreting narrative case documentation. These records are dense, unstructured, and central to how progress is assessed over time. They include service plans, goals, activities, and ongoing case notes that document family interactions, interventions, and outcomes.

The researchers tested whether LLM-based tools could reliably identify which case notes were relevant to child welfare goals and whether they could trace progress across the lifespan of a case. The goal was not to replace caseworkers, but to explore whether AI could function as a support system that helps professionals manage cognitive load and navigate large volumes of text.

The findings reveal a sharp boundary between technical pattern recognition and professional judgment. While the models were able to surface thematic trajectories and identify broad narrative patterns, they struggled to determine relevance in cases where judgment depended on nuance, timing, and contextual interpretation. As cases became longer and more complex, model performance declined further. This was not simply a technical failure, but a reflection of how child welfare decisions are inherently shaped by uncertainty and discretion.

The study shows that relevance in child welfare is not a fixed or purely textual property. What counts as progress, concern, or closure often depends on how circumstances change over time and how practitioners interpret those changes within regulatory and ethical frameworks. LLMs, which operate by detecting statistical regularities in language, are poorly suited to resolve these ambiguities on their own.

When AI supports sensemaking rather than decision authority

Despite the limitations, the study does not dismiss the value of large language models in public services. Instead, it draws a clear distinction between AI as a decision-maker and AI as a decision-support tool. The research finds that LLMs can add value when used to assist sensemaking rather than adjudication.

In practice, this means helping caseworkers surface themes, compare trajectories across cases, and identify areas that may warrant closer human review. By organizing large volumes of text and highlighting patterns that might otherwise be missed, AI tools can support reflection and collaboration among professionals. This function becomes especially relevant in environments where staff are overburdened and time-constrained.

However, the study cautions against allowing these tools to define progress or relevance autonomously. In child welfare, such judgments are shaped by professional training, lived experience, and ethical responsibility. Delegating these decisions to AI systems risks oversimplifying complex human situations and eroding accountability.

The researchers also highlight how organizational mandates and policy frameworks shape documentation practices. Case notes are not neutral records; they are written within institutional constraints and expectations. LLMs trained on these texts may inadvertently reproduce those constraints rather than challenge them. This reinforces the risk of embedding existing biases or procedural norms into automated systems without scrutiny.

Importantly, the study notes that even human reviewers often disagree when assessing relevance in complex cases. This underscores a key point: disagreement and deliberation are not failures in public service work, but essential features of responsible decision-making. Any system that attempts to eliminate this uncertainty through automation risks distorting the nature of the work itself.

Implications for AI governance in public services

The study challenges the assumption that better models alone will solve institutional problems rooted in uncertainty, discretion, and ethical judgment. The authors argue that high-stakes public services should not be treated as optimization problems where accuracy metrics alone determine success. Instead, these domains require tools that respect human discretion and support collaborative decision-making. This has direct implications for AI governance, procurement, and evaluation frameworks.

One key lesson from the study is the importance of participatory design. The research benefited from close collaboration with practitioners who understand the realities of frontline work. This approach allowed the researchers to identify not just technical limitations, but mismatches between what AI systems are designed to do and what public service work actually requires.

The findings also highlight the risk of over-automation. When AI systems are positioned as neutral or objective authorities, they can shift responsibility away from human decision-makers while still shaping outcomes. In child welfare, where decisions can have life-altering consequences, this shift is especially problematic. The study reinforces the need for clear boundaries around where automation is appropriate and where human judgment must remain central.

The research suggests that AI tools in public services should be evaluated not only for performance, but for how they interact with professional practice, accountability structures, and public trust. Transparency, explainability, and oversight are necessary but not sufficient. Systems must also be designed to accommodate uncertainty rather than suppress it.

FIRST PUBLISHED IN: Devdiscourse