AI-powered multi-agent simulation sets new standard for insider threat detection

CO-EDP, VisionRI | Updated: 14-08-2025 23:36 IST | Created: 14-08-2025 23:36 IST

Researchers have introduced an innovative multi-agent framework that generates highly realistic datasets for detecting insider threats. The work addresses one of the most persistent problems in the field: the scarcity of large-scale, high-fidelity insider activity logs due to privacy concerns and limited public datasets.

The study titled "CHIMERA: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation" presents the first large language model (LLM)-powered, multi-agent simulation system designed to recreate both normal and malicious behavior within organizational environments. The result is ChimeraLog, a dataset that the authors argue is more realistic, diverse, and challenging than existing benchmarks, with the potential to transform how insider threat detection systems are developed and evaluated.

A scalable solution to the insider threat data bottleneck

Insider threats, ranging from disgruntled employees to compromised accounts, remain one of the most difficult challenges for security teams. While machine learning has shown promise in detecting such threats, progress has been stymied by the lack of authentic, large-scale training data. Real enterprise logs are often inaccessible due to confidentiality, and synthetic datasets typically lack the depth and behavioral complexity of real-world scenarios.

CHIMERA addresses this by creating a simulated workplace populated by AI-driven agents, each with assigned roles, personalities, work schedules, and tools. These agents engage in realistic workplace activities, including meetings, communications, and daily task execution, while attacker agents perform covert malicious actions such as intellectual property theft, sabotage, or data exfiltration. Importantly, the framework supports diverse organizational sectors, with the study simulating technology, finance, and medical environments to reflect different operational and security dynamics.
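The paper does not publish CHIMERA's implementation details, but the agent loop it describes can be sketched in miniature. Everything below is illustrative: the `Agent` fields, action names, and the 5% covert-action rate are invented for the example, not taken from the framework.

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    # Hypothetical model of a simulated employee; fields are illustrative.
    name: str
    role: str
    sector: str
    work_hours: tuple          # (start_hour, end_hour), 24h clock
    is_attacker: bool = False

    def act(self, hour: int) -> str:
        if not (self.work_hours[0] <= hour < self.work_hours[1]):
            return "idle"
        if self.is_attacker and random.random() < 0.05:
            # covert malicious action mixed into otherwise normal behavior
            return random.choice(["exfiltrate_data", "sabotage_build"])
        return random.choice(["send_email", "browse_web",
                              "edit_file", "attend_meeting"])

random.seed(0)  # reproducible toy run
agents = [
    Agent("alice", "engineer", "technology", (9, 17)),
    Agent("mallory", "analyst", "finance", (9, 17), is_attacker=True),
]
# one simulated day: every agent acts once per hour
day_log = [(hour, a.name, a.act(hour)) for hour in range(24) for a in agents]
```

In the real system the "actions" would be LLM-generated behaviors that emit emails, file operations, and network traffic rather than strings, but the structure, benign routine punctuated by rare covert attacker actions, is the same.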

The system captures a rich variety of data streams, including login records, email traffic, web browsing history, file operations, network activity in packet capture (PCAP) format, and system logs in SCAP format. All logs are directly linked to the actions of individual agents, allowing for precise ground-truth labeling - an essential feature for training and evaluating detection algorithms.
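Because every event is attributed to the agent that produced it, ground-truth labels fall out mechanically. A minimal sketch, with invented agent IDs, field names, and action names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEvent:
    timestamp: str
    agent_id: str
    stream: str    # e.g. "login", "email", "web", "file", "pcap", "syslog"
    action: str

# Hypothetical ground truth: the simulator knows which agents are attackers
# and which of their actions belonged to an attack campaign.
MALICIOUS_ACTIONS = {("agent_07", "bulk_copy"), ("agent_07", "usb_write")}

def label(event: LogEvent) -> int:
    """1 = malicious, 0 = benign; trivial because each event carries its agent."""
    return int((event.agent_id, event.action) in MALICIOUS_ACTIONS)

events = [
    LogEvent("2025-08-01T09:12:00", "agent_03", "email", "send"),
    LogEvent("2025-08-01T23:47:10", "agent_07", "file", "bulk_copy"),
]
labels = [label(e) for e in events]
```

This agent-level attribution is exactly what real enterprise logs lack, where an analyst must infer after the fact whether an action was malicious.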

ChimeraLog: Realism and complexity beyond existing benchmarks

The simulations described in the study produced one month of activity for 20-agent organizations across three sectors, yielding a total of approximately 25 billion labeled log entries. This includes over 2 billion application events, 4.5 billion network packets, and more than 18 billion system logs, alongside 160 hours of recorded operational data. The dataset’s scale and complexity far exceed widely used public benchmarks such as CERT and TWOS.

To assess realism, the researchers conducted an expert evaluation in which security practitioners compared ChimeraLog with both real-world and synthetic datasets. The results showed that ChimeraLog’s fidelity was rated on par with TWOS, a rare real-world dataset, and significantly higher than CERT, which has long been criticized for its oversimplified attack models and limited behavioral diversity. Inter-rater reliability in these evaluations was high, underscoring the consistency of expert judgment.

The team also benchmarked several state-of-the-art insider threat detection models, including Support Vector Machines (SVM), Convolutional Neural Networks (CNN), Graph Convolutional Networks (GCN), and the Deep Sequence Intrusion Identification Detector (DS-IID), on ChimeraLog. While these models achieved strong results on CERT, performance dropped by as much as 20 percentage points in F1-score when tested on ChimeraLog, indicating that the new dataset poses a far greater challenge and is more representative of real-world detection complexity.
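The reported gap is measured in F1-score, the harmonic mean of precision and recall. A self-contained sketch of the metric, with invented toy predictions (not the paper's results) showing how missed detections pull F1 down:

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels: harmonic mean of precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented example: the same detector misses more of the subtler
# positives in a harder dataset, and also raises a false alarm.
easy_true, easy_pred = [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]
hard_true, hard_pred = [1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 1]
```

Here `f1_score(easy_true, easy_pred)` is 1.0 while `f1_score(hard_true, hard_pred)` is 0.4, a drop of the same character (though not the same magnitude) as the one the authors observed moving from CERT to ChimeraLog.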

Evaluating robustness and addressing distribution shifts

The study also tested how well detection models generalize when trained on one dataset and deployed on another - a common real-world scenario known as distribution shift. The results were stark: models trained on CERT suffered severe performance degradation when applied to ChimeraLog, with the DS-IID model’s F1-score falling by nearly 50 percentage points in the finance sector scenario. These findings highlight not only the limitations of existing models but also the importance of training on data that better captures the variability of real-world operations and attack behaviors.
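Distribution shift can be illustrated with a deliberately tiny detector. In this invented example (the feature, numbers, and threshold rule are all assumptions, not from the study), a detector fit on one dataset's attack profile misses stealthier attacks whose scores fall below its learned threshold:

```python
def fit_threshold(benign, malicious):
    """Toy detector: split the two classes at the midpoint of their means."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(benign) + mean(malicious)) / 2

def recall(threshold, malicious):
    """Fraction of malicious events flagged (score above threshold)."""
    return sum(x > threshold for x in malicious) / len(malicious)

# Invented anomaly scores, e.g. a scaled volume of data moved per session.
train_benign, train_malicious = [1, 2, 3], [8, 9, 10]   # source dataset
t = fit_threshold(train_benign, train_malicious)         # midpoint: 5.5

in_dist = recall(t, [8, 9])   # attacks resemble training data: all caught
shifted = recall(t, [4, 5])   # stealthier attacks sit below threshold: all missed
```

Every attack is caught in-distribution and none after the shift; real detectors degrade less catastrophically, but this is the mechanism behind the cross-dataset F1 drops the study reports.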

CHIMERA’s design includes robust threat modeling, covering malicious insiders, masqueraders, and unintentional insiders, all operating under realistic role-based access control constraints. While the simulated environment assumes a trusted infrastructure, the LLM agents themselves are treated as untrusted, ensuring that both benign and malicious actions emerge organically from the simulated organizational dynamics.
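Role-based access control is the constraint that makes masquerader behavior detectable in the first place: actions outside a role's permission set stand out. A minimal sketch, with an invented permission table (roles and action names are assumptions for illustration):

```python
# Hypothetical RBAC table; real deployments would load this from policy.
PERMISSIONS = {
    "engineer": {"read_code", "write_code"},
    "analyst":  {"read_reports"},
    "admin":    {"read_code", "write_code", "read_reports", "manage_users"},
}

def is_authorized(role: str, action: str) -> bool:
    """True if the role's permission set includes the requested action."""
    return action in PERMISSIONS.get(role, set())

# A masquerader on a stolen "analyst" account probing for source code
# produces a denied-access attempt that the simulation's logs can capture.
attempt = is_authorized("analyst", "read_code")   # out of role
```

Under such a policy, both legitimate activity and the telltale out-of-role probes of a compromised account leave consistent, labelable traces.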

The framework’s multi-agent structure, grounded in realistic work schedules, interpersonal interactions, and sector-specific workflows, allows for dynamic and adaptive scenarios that evolve over time, mirroring the complexity security teams face in actual enterprises. CHIMERA’s scalability and adaptability mean it can be extended to additional sectors, attack types, or hybrid operational models as needed by the research community.

  • FIRST PUBLISHED IN:
  • Devdiscourse