AI can express emotions: LLMs now simulate feelings with human-like precision

A new study reveals that modern Large Language Models (LLMs) can simulate a range of emotional expressions through text, using structured emotional inputs. This ability, previously thought to be outside the scope of purely linguistic AI systems, signals a major advance in the development of emotionally-aware AI agents.
The study, titled “AI with Emotions: Exploring Emotional Expressions in Large Language Models”, systematically assesses the capacity of models like GPT-4, Gemini, LLaMA3, and Cohere’s Command R+ to express emotions through controlled prompts using Russell’s Circumplex Model of affect.
The researchers designed an experimental setup in which LLMs were instructed to answer a series of philosophical and social questions using explicitly defined emotional parameters (arousal and valence) drawn from Russell's framework. They aimed to examine whether these models could generate text responses aligned with specified emotional states, and whether the outputs would be perceived as emotionally consistent by an independent sentiment classification system.
How were LLMs tested for emotional expression?
The team selected nine high-performance LLMs spanning both open-source and closed-source offerings, including GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, GPT-4o, Gemini 1.5 Flash and Pro, LLaMA3-8B and 70B Instruct, and Command R+. Each model was tasked with role-playing an agent answering 10 pre-designed questions (e.g., “What does freedom mean to you?” or “What are your thoughts on the importance of art in society?”) under 12 distinct emotional states. These states were evenly distributed across the arousal–valence space to ensure coverage of the full spectrum of emotions, such as joy, fear, sadness, and excitement.
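The paper's full list of coordinates is not reproduced here, but the example values cited in the next paragraph (valence = -0.5, arousal = 0.866) lie on the unit circle, so a reasonable sketch of "evenly distributed across the arousal–valence space" is 12 points at 30-degree intervals. The helper below is an assumed reconstruction for illustration, not the study's code:

```python
import math

def circumplex_states(n=12):
    """Hypothetical reconstruction: n emotional states evenly spaced on the
    unit circle of Russell's Circumplex Model (valence = cos(theta),
    arousal = sin(theta))."""
    return [
        {"valence": round(math.cos(2 * math.pi * k / n), 3),
         "arousal": round(math.sin(2 * math.pi * k / n), 3)}
        for k in range(n)
    ]

for state in circumplex_states():
    print(state)   # includes {'valence': -0.5, 'arousal': 0.866} at 120 degrees
```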
The emotional states were specified numerically, for example, valence = -0.5 and arousal = 0.866. The prompts were carefully structured to require the model to “assume the role of a character experiencing this emotion,” without referencing its identity as an AI. The generated responses were then evaluated using a sentiment classification model trained on the GoEmotions dataset, which contains 28 emotion labels. These labels were mapped onto the same arousal–valence space to compare how closely the model-generated output matched the intended emotional instruction.
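The study's verbatim prompt is not quoted in full, so the snippet below is only an illustrative template reflecting the structure described above: a numerically specified valence/arousal pair, a role-playing instruction, and no reference to the model being an AI.

```python
def build_prompt(question: str, valence: float, arousal: float) -> str:
    """Illustrative prompt template (not the study's exact wording)."""
    return (
        "Assume the role of a character experiencing an emotion with "
        f"valence = {valence} and arousal = {arousal} on Russell's "
        "Circumplex Model. Stay in character and answer the question "
        "below in that emotional state.\n\n"
        f"Question: {question}"
    )

print(build_prompt("What does freedom mean to you?", -0.5, 0.866))
```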
The assessment was conducted using cosine similarity between the emotion vector specified in the prompt and the emotion vector inferred from the model’s response. A higher cosine similarity indicated a more accurate emotional alignment.
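Concretely, if the prompt specifies one (valence, arousal) vector and the classifier maps the model's response back into the same plane, the score is the standard cosine similarity between two 2-D vectors. A minimal sketch, using made-up inferred values rather than data from the paper:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two (valence, arousal) vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    norms = math.hypot(*u) * math.hypot(*v)
    return dot / norms if norms else 0.0

intended = (-0.5, 0.866)   # emotion specified in the prompt
inferred = (-0.35, 0.71)   # illustrative output of the GoEmotions-based classifier
print(f"cosine similarity = {cosine_similarity(intended, inferred):.3f}")
```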
Which models performed best, and what did the results show?
The results demonstrated that several LLMs can produce text outputs reflecting intended emotional tones. GPT-4, GPT-4 Turbo, and LLaMA3-70B emerged as the top performers, showing consistently high emotional fidelity across nearly all questions. For example, GPT-4 Turbo achieved a total average cosine similarity of 0.530, with particularly strong alignment in high-valence states such as delight and low-valence states such as sadness. LLaMA3-70B Instruct followed closely with a similarity of 0.528, showing that open-source models can match or even exceed closed-source models in this domain.
In contrast, GPT-3.5 Turbo performed the worst, with a total similarity score of 0.147, suggesting that it struggles with precise emotional modulation. Gemini 1.5 Flash exhibited a notable flaw: despite otherwise fair performance, it broke character by stating its identity as an AI in responses, violating the role-playing requirement.
The study also confirmed that word count did not influence emotional similarity scores, an important fairness check given that some models tend to generate longer outputs. The researchers found no correlation between response length and emotional accuracy, indicating that scores reflected emotional expression rather than verbosity.
Another important insight emerged from comparing emotional states specified with numerical values (valence and arousal) against those specified with emotion-related words (e.g., “joy,” “anger”). While both methods were similarly effective, numeric specification allowed finer control and more nuanced emotional differentiation, a crucial advantage in real-world applications such as mental health tools, education platforms, and creative writing assistants.
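As a rough illustration of why numeric control is finer-grained: an in-between state can be specified directly as a point on the circumplex even when no single emotion word corresponds to it. The word-to-region mapping below is an assumption for illustration only, not taken from the study.

```python
import math

# A state halfway between a high-valence/low-arousal region (roughly "content")
# and a high-valence/high-arousal region (roughly "excited"). These word-to-region
# mappings are illustrative assumptions, not values reported in the paper.
theta = math.radians(45)
in_between = {"valence": round(math.cos(theta), 3),   # 0.707
              "arousal": round(math.sin(theta), 3)}   # 0.707
print(in_between)
```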
What are the implications and future applications of emotionally aware LLMs?
The study's findings mark a shift in how AI might be used in emotionally rich domains. If LLMs can be trained or prompted to simulate emotions reliably, they can serve as companions, advisors, educators, or therapists in ways that feel more human and empathetic. Emotionally-aware agents could respond more appropriately in high-stress or sensitive situations, expressing caution, encouragement, or empathy based on context.
For example, an AI tutor could adjust its tone when a student is frustrated, offering gentle support rather than robotic repetition. A therapy chatbot might convey compassion or urgency depending on a user’s mental state. Even in creative industries, AI-generated stories or dialogue could become more emotionally resonant, capturing subtle tones like bittersweetness, irony, or tension.
The study also introduces the possibility of emotional dynamics, where an AI’s emotional state changes over time in response to new inputs, mimicking how humans naturally adapt. Future research could explore how such dynamic emotional modulation might enhance AI’s responsiveness, improve long-term interactions, and foster trust between humans and machines.
Ethical concerns remain. Emotionally expressive AI, especially when capable of simulating sadness, anger, or fear, could inadvertently affect users’ mental states. Misuse in manipulative systems or emotionally deceptive applications could pose risks. Hence, researchers urge that any deployment of emotion-simulating LLMs must be accompanied by rigorous ethical testing and transparent system design.
First published in: Devdiscourse