Machines still don’t know what harm is, and that’s a growing AI risk
How can machines reliably recognize harm before it occurs? While AI models can optimize outcomes and follow predefined rules, translating human moral judgment into machine-operable logic has proven far more difficult. A new academic study argues that this gap is not merely technical but conceptual, and proposes a way to close it.
The study, titled Machine Understanding of Harms: Theory and Implementation and published in the journal Knowledge, introduces a structured method for enabling autonomous systems to detect and reason about harm using the same kinds of concepts humans rely on in everyday moral judgment. The authors argue that current approaches to AI ethics fail because they are either too abstract to implement or too opaque to trust.
Why abstract AI ethics has failed to prevent harm
Many existing frameworks rely on broad principles such as beneficence, non-maleficence, autonomy, or fairness. While these concepts are essential for human moral reasoning, the authors argue that they offer little guidance for machines that must make rapid, context-sensitive decisions.
For example, an autonomous system tasked with assisting in a hospital setting may be programmed to avoid harm, but that instruction alone does not specify what harm looks like in practice. Does harm include physical injury only, or also emotional distress, privacy violations, and coercion? How should the system distinguish between accidental injury, justified intervention, and malicious action?
At the other end of the spectrum, purely data-driven approaches attempt to infer harmful behavior from statistical patterns. These systems may detect correlations between actions and negative outcomes, but they often lack transparency and struggle to generalize beyond the data they were trained on. More critically, they offer no clear explanation of why an action is harmful, making oversight and accountability difficult.
The authors argue that both approaches fail because they bypass the middle layer of moral understanding that humans routinely use. People do not typically reason about harm by invoking abstract principles or probability distributions. Instead, they rely on concrete concepts such as injure, deceive, imprison, exploit, or poison. These concepts encode rich information about how harm occurs, who is affected, and why the action matters morally.
The absence of this intermediate layer, the study contends, is a key reason why autonomous systems struggle to behave safely in open-ended environments. Without a structured way to represent harm at this level, machines are left either following rigid rules or making opaque statistical guesses.
Using moral language to make harm machine-readable
The study introduces a discovery method based on what the authors call thick harm verbs. These are ordinary action terms that simultaneously describe behavior and evaluate it morally. Examples include verbs associated with physical harm, psychological manipulation, coercion, and deception.
According to the authors, these verbs are powerful because they already contain the information a machine needs to recognize harm. Each verb encodes a causal mechanism, identifies relevant agents and objects, points to the human interests at stake, and signals the moral weight of the action. By analyzing these components, designers can extract concrete features that autonomous systems can monitor and reason about.
The study outlines four core dimensions that emerge from this analysis. The first is the mechanism of harm, which specifies how the action causes damage, such as through force, restriction, contamination, or misrepresentation. The second is the material and social prerequisites of harm, including the objects involved and the roles of the agents. The third dimension concerns the human interests affected, such as bodily integrity, freedom, privacy, or trust. The fourth is context sensitivity, which determines whether an action is harmful depending on factors like consent, authority, or emergency conditions.
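As a rough illustration, these four dimensions could be captured in a simple record per verb. The field names and the "restrain" entry below are assumptions made for this sketch, not the study's own schema; the last two fields anticipate the contextual point discussed next.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HarmVerbEntry:
    """Illustrative record for one thick harm verb, covering the four dimensions."""
    verb: str                          # the thick harm verb itself, e.g. "restrain"
    mechanism: str                     # dimension 1: how the action causes damage
    prerequisites: List[str]           # dimension 2: objects involved and agent roles required
    interests_affected: List[str]      # dimension 3: human interests at stake
    defeasible: bool = False           # dimension 4: can context defeat the harm judgment?
    justifying_contexts: List[str] = field(default_factory=list)  # contexts that can justify the action

# Hypothetical entry: "restrain" restricts freedom of movement, but is defeasible.
RESTRAIN = HarmVerbEntry(
    verb="restrain",
    mechanism="physical restriction of movement",
    prerequisites=["physical control over the person", "a means of confinement"],
    interests_affected=["freedom of movement", "bodily integrity"],
    defeasible=True,
    justifying_contexts=["medical emergency", "imminent safety risk", "informed consent"],
)
```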
Importantly, the authors show that not all harm verbs function in the same way. Some describe actions that are almost always harmful, while others are defeasible, meaning their moral status depends on context. For example, restraining someone may be harmful in ordinary circumstances but justified in a medical or safety emergency. By classifying harm verbs according to their contextual flexibility, the framework allows systems to avoid overly rigid prohibitions.
This structured approach enables a shift from abstract moral rules to concrete harm detection. Rather than asking whether an action violates a general principle, a system can ask whether it instantiates a known harm pattern under the current conditions. This makes ethical reasoning more granular, explainable, and adaptable.
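Under that framing, the question "does this action instantiate a known harm pattern under the current conditions?" reduces to a lookup plus a context check. The following is a minimal sketch under assumed names (a hand-written catalogue, a `check_harm` function, free-text context flags), not the study's implementation.

```python
from typing import Dict, List, Optional

# Tiny illustrative catalogue: verb -> affected interests and contexts that can justify the action.
HARM_CATALOGUE: Dict[str, dict] = {
    "poison":   {"interests": ["bodily integrity"], "justifying_contexts": []},
    "deceive":  {"interests": ["trust", "autonomy"], "justifying_contexts": []},
    "restrain": {"interests": ["freedom of movement"],
                 "justifying_contexts": ["medical emergency", "imminent safety risk"]},
}

def check_harm(verb: str, context: List[str]) -> Optional[dict]:
    """Return a structured finding if the action matches a known harm pattern, else None."""
    entry = HARM_CATALOGUE.get(verb)
    if entry is None:
        return None                      # not a known harm pattern
    justified_by = [c for c in context if c in entry["justifying_contexts"]]
    return {
        "verb": verb,
        "harmful": not justified_by,     # defeasible verbs are cleared by a justifying context
        "interests_affected": entry["interests"],
        "justified_by": justified_by,
    }

# Restraining is flagged in ordinary conditions but not during a declared medical emergency.
print(check_harm("restrain", context=[]))
print(check_harm("restrain", context=["medical emergency"]))
```

Because the result names the matched pattern and any justifying conditions, the same lookup doubles as an explanation rather than a bare score.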
From theory to implementation in autonomous systems
A key concern addressed in the study is whether the framework can be implemented at scale. The authors note that the goal is not to automate moral judgment entirely, but to support human-guided system design with better tools. To that end, they propose using large language models as analytical assistants rather than autonomous decision-makers.
In the proposed workflow, language models help generate candidate harm-related features by analyzing moral language and suggesting how harm concepts might translate into system requirements. Human experts then review, refine, and validate these features before they are embedded into autonomous systems. This preserves human oversight while leveraging the pattern-recognition strengths of modern AI.
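A minimal sketch of that division of labour might look like the following, where `propose_candidate_features` stands in for whatever language model the designers use and `human_review` for the expert validation step; both names, and the record format, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CandidateFeature:
    verb: str            # the harm verb the feature was derived from
    description: str     # a monitorable signal or system requirement suggested by the model
    approved: bool = False

def build_harm_features(
    verbs: List[str],
    propose_candidate_features: Callable[[str], List[str]],  # placeholder for an LLM-backed analysis step
    human_review: Callable[[CandidateFeature], bool],        # placeholder for expert validation
) -> List[CandidateFeature]:
    """Model proposes candidate features per harm verb; only human-approved ones are kept."""
    kept: List[CandidateFeature] = []
    for verb in verbs:
        for description in propose_candidate_features(verb):
            candidate = CandidateFeature(verb=verb, description=description)
            candidate.approved = human_review(candidate)   # human oversight before anything is deployed
            if candidate.approved:
                kept.append(candidate)
    return kept

# Stubbed usage: a trivial "model" and a reviewer who approves everything.
features = build_harm_features(
    verbs=["restrain", "deceive"],
    propose_candidate_features=lambda v: [f"monitor for actions matching the '{v}' pattern"],
    human_review=lambda c: True,
)
```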
This hybrid approach avoids the pitfalls of both top-down ethical theory and bottom-up machine learning. It does not assume that moral principles can be directly coded, nor does it rely on systems learning harm implicitly from data without explanation. Instead, it treats harm understanding as a structured modeling problem grounded in shared human concepts.
When a system flags an action as potentially harmful, it can point to the specific harm pattern involved and the conditions that triggered it. This makes it easier for designers, regulators, and affected users to understand and evaluate system behavior.
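Concretely, the explanation attached to a flagged action could be no more than a structured record naming the pattern and its triggering conditions; the fields below are invented for illustration.

```python
# Hypothetical explanation record produced when an action is flagged.
flagged_action = {
    "action": "restrain patient without recorded consent",
    "harm_pattern": "restrain",                       # the matched thick harm verb
    "mechanism": "physical restriction of movement",
    "interests_affected": ["freedom of movement"],
    "triggering_conditions": ["no consent recorded", "no emergency declared"],
}
```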
In practical terms, the framework could be applied across a wide range of domains. In healthcare, it could help systems distinguish between beneficial interventions and violations of patient autonomy. In autonomous vehicles, it could support nuanced trade-offs between different kinds of risk. In digital platforms, it could improve detection of manipulation, coercion, or privacy invasion.
The authors stress that the framework is intentionally modular and extensible. As social norms evolve and new forms of harm emerge, additional harm concepts can be incorporated without redesigning the entire system. This adaptability is presented as essential for deploying autonomous systems in dynamic human environments.
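That extensibility claim maps naturally onto a registry-style design in which new harm concepts arrive as data rather than code changes; a minimal sketch, with an invented `register_harm_concept` helper:

```python
from typing import Dict, Sequence

HARM_CATALOGUE: Dict[str, dict] = {}

def register_harm_concept(verb: str, interests: Sequence[str], justifying_contexts: Sequence[str] = ()) -> None:
    """Add or update a harm concept; the detection code that reads the catalogue stays unchanged."""
    HARM_CATALOGUE[verb] = {
        "interests": list(interests),
        "justifying_contexts": list(justifying_contexts),
    }

# A newly recognized form of harm can be registered later without redesigning the system.
register_harm_concept("doxx", interests=["privacy", "personal safety"])
```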
FIRST PUBLISHED IN: Devdiscourse

