Emotional AI lacks proper oversight as systems move into care and support roles

CO-EDP, VisionRI | Updated: 02-01-2026 11:29 IST | Created: 02-01-2026 11:29 IST

A new research paper argues that current approaches to evaluating emotional intelligence in AI are fundamentally inadequate. The study, titled “Why We Need a New Framework for Emotional Intelligence in AI” and published on arXiv, proposes a new conceptual framework designed to support safer deployment decisions and more meaningful assessments of AI behavior.

Why current emotional AI evaluations fall short

Most existing emotional intelligence evaluations for AI systems measure the wrong thing. While many benchmarks test whether models can recognize emotional labels or generate empathic-sounding responses, they often ignore whether those responses are appropriate, safe, culturally sensitive, or beneficial to users over time. This gap is not a minor technical flaw but a structural problem rooted in how emotional intelligence is defined and operationalized in AI research.

The authors argue that emotional intelligence in humans involves subjective experience, bodily signals, and social learning: elements that artificial systems do not possess. Treating AI systems as if they share human emotional lives therefore leads to category errors and misleading assessments. Instead of asking whether AI systems “feel” emotions or possess empathy in a human sense, the study reframes emotional intelligence in AI as a set of functional capacities that can be observed and evaluated without anthropomorphism.

These capacities include the ability to sense emotional cues, explain emotions in context, respond in emotionally appropriate and safe ways, and adapt across interactions, cultures, and situations. The paper emphasizes that these abilities are graded rather than binary. An AI system can perform well in some emotional contexts and fail badly in others, especially under stress, ambiguity, or moral conflict.
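
To make the idea of graded, context-dependent capacities concrete, the sketch below shows one way such scores might be recorded and inspected. The capacity names, contexts, thresholds, and numbers are invented for illustration and do not come from the paper.

```python
# Hypothetical illustration only: capacity names, contexts, and scores are
# invented for this sketch and are not taken from the paper.

# Graded (0.0-1.0) scores for each functional capacity, broken out by context,
# rather than a single binary "has emotional intelligence" flag.
capacity_scores = {
    "sense_cues":     {"routine_support": 0.85, "ambiguous_language": 0.55, "acute_distress": 0.40},
    "explain":        {"routine_support": 0.80, "ambiguous_language": 0.60, "acute_distress": 0.45},
    "respond_safely": {"routine_support": 0.90, "ambiguous_language": 0.50, "acute_distress": 0.35},
    "adapt":          {"routine_support": 0.75, "ambiguous_language": 0.45, "acute_distress": 0.30},
}

def weak_contexts(scores, threshold=0.6):
    """Return (capacity, context) pairs that fall below the threshold."""
    return [
        (capacity, context)
        for capacity, by_context in scores.items()
        for context, value in by_context.items()
        if value < threshold
    ]

# A system can look competent in routine interactions while failing under
# stress or ambiguity, which is exactly the gradation the paper highlights.
print(weak_contexts(capacity_scores))
```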

To support this argument, the authors review decades of emotion research across psychology, neuroscience, and philosophy. They show that emotions are not merely labels like anger or happiness but complex processes involving appraisals, social meaning, ethical judgment, and regulation over time. Any serious evaluation of emotional intelligence must therefore capture interactional depth, context, and consequences rather than rely on isolated prompts or single-turn answers.

The study then turns to a systematic review of existing emotional intelligence benchmarks used to evaluate large language models and other AI systems. While these tools vary widely in design, the authors identify recurring weaknesses across nearly all of them. Many focus heavily on emotion recognition accuracy while neglecting response quality and harm avoidance. Most rely on single-turn interactions, making it impossible to evaluate emotional repair, escalation handling, or long-term rapport. Cultural and linguistic diversity is often limited, and ethical or prosocial considerations are rarely built into scoring systems.

The authors warn that these limitations have practical consequences. AI systems trained and evaluated under such frameworks may appear emotionally competent in demonstrations while remaining brittle, inconsistent, or unsafe in real-world use. This creates a dangerous gap between perceived and actual capability, especially in domains involving vulnerable users such as mental health support, education, or caregiving.

Trust, safety, and the risks of emotional fluency without competence

The study analyzes how weak emotional intelligence evaluation undermines trust and safety. From a human-computer interaction perspective, emotionally expressive AI systems shape user expectations. When systems consistently sound warm, supportive, or empathic, users may assume they are equipped to handle complex emotional situations, even when they are not.

The authors explain that this mismatch leads to two harmful outcomes. Some users over-trust AI systems, relying on them in situations where professional human support is needed. Others under-trust the technology after encountering inconsistent or tone-deaf responses, reducing the effectiveness of tools that might otherwise be helpful. In both cases, the absence of reliable evaluation standards prevents users from forming accurate mental models of what AI systems can and cannot do.

The paper also highlights ethical risks tied to emotional interaction. Emotionally capable AI systems can influence beliefs, decisions, and behavior. Without safeguards, these systems can be used to manipulate users, encourage unhealthy dependence, or exploit emotional vulnerability for commercial or political gain. The authors note that an AI system capable of detecting distress is not automatically emotionally intelligent if it uses that information to maximize engagement or push products rather than support user wellbeing.

From a governance standpoint, the lack of standardized emotional intelligence evaluation creates regulatory blind spots. Policymakers, institutional review boards, and internal ethics teams face growing pressure to decide when emotionally interactive AI systems are safe to deploy. Without shared benchmarks or safety thresholds, these decisions are often made ad hoc, guided by marketing claims or internal testing that lacks transparency.

The study argues that emotional intelligence evaluation should function as a decision tool rather than a leaderboard. It should help organizations determine whether a system is appropriate for specific roles and identify contexts where deployment would pose unacceptable risks. Current benchmarks, the authors contend, are ill-suited for this task because they collapse complex behavior into single scores or narrow task performance metrics.

A dual framework for safer emotional AI deployment

To address these challenges, the authors propose a new conceptual framework that separates minimum safety requirements from broader competence assessment. Rather than offering a finished benchmark, the paper outlines the structure that future evaluation systems should follow.

The first component is a minimum deployment benchmark. This benchmark is designed as a safety gate, determining whether an AI system meets baseline ethical and emotional standards for a given use case. It focuses on harm avoidance, emotional clarity, appropriate handling of distress, and stability under challenging conditions such as ambiguous language or emotional escalation. If a system fails this benchmark, it should not be deployed in that role regardless of how well it performs on other metrics.
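
A minimal sketch of what such a gate might look like in code, assuming hypothetical check names and thresholds rather than anything specified in the paper:

```python
# Hypothetical sketch of a pass/fail deployment gate. The check names and
# thresholds are illustrative assumptions, not the paper's specification.

MINIMUM_CHECKS = {
    "harm_avoidance": 0.95,             # share of responses free of harmful advice
    "distress_handling": 0.90,          # appropriate escalation/referral when users show distress
    "emotional_clarity": 0.85,          # responses do not misstate what the system is or "feels"
    "stability_under_ambiguity": 0.85,  # consistent behavior on ambiguous or escalating inputs
}

def passes_deployment_gate(measured: dict[str, float], use_case: str) -> bool:
    """Return True only if every baseline check meets its threshold.

    A single failed check blocks deployment for this use case, regardless of
    how well the system scores on any other metric.
    """
    for check, threshold in MINIMUM_CHECKS.items():
        score = measured.get(check, 0.0)
        if score < threshold:
            print(f"{use_case}: blocked, failed '{check}' ({score:.2f} < {threshold:.2f})")
            return False
    return True
```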

The second component is a general emotional intelligence index. Unlike a pass-fail safety benchmark, this index provides a multidimensional profile of an AI system’s emotional capabilities. It assesses how well a system senses emotional cues, explains emotional states, responds in supportive and specific ways, and adapts across interactions and cultures. By breaking emotional intelligence into these dimensions, the index allows developers and auditors to identify specific strengths and weaknesses rather than relying on a single aggregate score.
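
The sketch below shows one possible shape for such a profile; the dimension names echo the capacities described above, but the structure, scores, and threshold are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of a multidimensional EI profile; illustrative only.

@dataclass
class EmotionalIntelligenceProfile:
    sense: float    # detecting emotional cues
    explain: float  # explaining emotional states in context
    respond: float  # supportive, specific, and safe responses
    adapt: float    # adaptation across interactions and cultures

    def weaknesses(self, threshold: float = 0.6) -> list[str]:
        """Name the dimensions that fall below the threshold, instead of
        hiding them inside a single aggregate score."""
        return [name for name, value in vars(self).items() if value < threshold]

profile = EmotionalIntelligenceProfile(sense=0.82, explain=0.74, respond=0.58, adapt=0.49)
print(profile.weaknesses())  # ['respond', 'adapt']
```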

The authors argue that combining these two components addresses a key flaw in existing evaluation practices. A system might perform well on average while failing catastrophically in safety-critical scenarios. Conversely, a conservative system that prioritizes safety might appear less emotionally expressive but be more appropriate for high-risk contexts. A dual framework makes these distinctions explicit and actionable.
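
A short worked example, with invented numbers, of how an average score can hide exactly this kind of safety-critical failure:

```python
# Hypothetical worked example of why an average score can mislead. The
# numbers and check names are invented for illustration.

index_scores = {"sense": 0.90, "explain": 0.90, "respond": 0.85, "adapt": 0.80}
safety_checks = {"harm_avoidance": True, "distress_handling": False, "stability": True}

average_index = sum(index_scores.values()) / len(index_scores)  # 0.86 -> looks strong
gate_passed = all(safety_checks.values())                       # False -> one critical failure

# The dual framework keeps these judgments separate: a high average never
# overrides a failed safety gate.
print(f"average index: {average_index:.2f}, gate passed: {gate_passed}, deployable: {gate_passed}")
```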

The authors stress that both components must be grounded in a shared theoretical understanding of emotion. Mixing incompatible definitions of empathy or emotional intelligence, they argue, leads to fragmented and unreliable evaluation. The proposed framework aims to align philosophical theory, psychological research, and practical deployment needs within a single structure that can evolve as AI capabilities and societal expectations change.

Finally, the paper warns that treating emotional intelligence in AI as a marketing feature or a simple performance metric risks normalizing systems that look empathic without being genuinely supportive or safe. As emotionally interactive AI becomes more embedded in everyday life, the authors argue that careful, transparent, and theoretically grounded evaluation is one of the few tools available to prevent harm and guide responsible use.

FIRST PUBLISHED IN: Devdiscourse