Offensive AI tools could backfire without ethical guardrails
The study clearly warns that the current trajectory of offensive AI research may outpace the safeguards designed to control it. While certain offensive tools like CTF-solving agents hold promise and merit further development under both ethical and practical standards, malware-based approaches carry unacceptable risks and insufficient public benefit.
A new peer-reviewed study has raised the alarm over the expanding capabilities and ethical ramifications of offensive artificial intelligence in cybersecurity, urging a recalibration of development priorities based on sustainable societal impact. The paper "Responsible Development of Offensive AI" submitted on arXiv applies the United Nations Sustainable Development Goals (SDGs) and interpretability frameworks to evaluate the merit and risk of developing two types of offensive AI: Capture-The-Flag (CTF) challenge-solving agents and AI-generated malware.
In a field where innovation outpaces oversight, the study highlights the urgent need for a responsible framework that balances technological advancement with harm mitigation. Offensive AI tools promise to democratize cybersecurity by automating penetration testing and vulnerability discovery - but the same tools can be co-opted by malicious actors to launch devastating attacks. This work methodically weighs the societal benefits and risks, concluding that CTF agents may serve as valuable low-risk tools under the SDG framework, while malware development remains dangerously misaligned with public interest.
What are the current capabilities of offensive AI?
The development of offensive AI is being driven by two distinct trajectories. The first is the use of AI agents to automate penetration testing by solving CTF challenges, simulated cybersecurity puzzles that mimic real-world vulnerabilities. OpenAI’s latest model, GPT-4.5, recently demonstrated success in solving 53% of high school-level CTF challenges within 12 attempts, according to its February 2025 system card. While OpenAI concluded these capabilities do not yet pose a medium-level threat, Marinelli cautions that this risk evaluation excludes “Deep Research,” an agent-based capability that leverages OpenAI’s o3 model for internet-based multi-step vulnerability discovery.
Unlike static models, agentic architectures can recursively assign subtasks to subordinate AI systems, dramatically enhancing their capabilities. In practical terms, this means a master AI could use subordinate models to autonomously research, test, and deploy cyber exploits. Although the current framework labels such use as medium risk, Marinelli suggests that the classification underestimates the pace and sophistication of emerging multi-agent systems.
The second branch of offensive AI involves malware that exploits AI’s language processing capabilities. Marinelli reproduces the AI-powered “Wormy” malware introduced in prior research, which embeds prompt-based instructions into benign-seeming images. Once opened by an AI-powered system, such as a Retrieval-Augmented Generation (RAG) email assistant, the worm replicates itself and propagates across contact lists. The experiment found that even newer open-source models like Falcon-3, released in December 2024, remain vulnerable to these exploits, confirming the worm’s continued efficacy.
Using sentence embedding transformers and activation analysis, Marinelli demonstrates how malicious prompts are indistinguishable from normal emails in latent space. Only the final transformer block shows distinct token activations, suggesting that these payloads can evade detection by most traditional inference-time safeguards. The study warns that without mechanistic interpretability tools to monitor internal model states, detection of such malware will remain inadequate.
How should we evaluate offensive AI under ethical and developmental standards?
To provide normative guidance on whether offensive AI development should proceed, Marinelli evaluates its alignment with SDG 9 (Industry, Innovation, and Infrastructure), SDG 16 (Peace, Justice, and Strong Institutions), and SDG 17 (Partnerships for the Goals). The results are mixed.
CTF-solving agents, which replicate the work of professional penetration testers at a fraction of the cost, are deemed consistent with SDG 9. These agents bolster infrastructure resilience by proactively discovering vulnerabilities, potentially democratizing access to advanced cybersecurity tools. Marinelli argues that such tools can bridge resource gaps, especially in small organizations that might otherwise accept cybersecurity risks due to budget constraints.
By contrast, AI-generated malware fails to support SDG 9. While proponents claim that such malware serves as a stress test for detecting Advanced Persistent Threats (APTs), the study contends that the field lacks robust defensive mechanisms to justify developing new offensive capabilities. Techniques such as prompt smuggling and context poisoning already bypass most current safeguards, and existing defensive strategies like “Virtual Donkey” monitoring are poorly equipped to adapt to evolving attacks.
In the context of SDG 16, Marinelli underscores the role of offensive AI in preventing data breaches - a critical factor in maintaining public trust and institutional integrity. CTF agents can help secure authentication systems that rely on knowledge-based verification. Historical breaches, such as the IRS “Get Transcript” incident, have shown how data leaks can enable further identity theft and manipulation. Offensive AI, when used to harden systems, indirectly supports peaceful and accountable institutions.
However, the justification for malware development under SDG 16 is weaker. While malware research may inform the creation of deterrent mechanisms, the lack of adaptive defenses and interpretability frameworks increases the likelihood that such tools will be abused. Malicious actors can build rich target profiles from breached data, leading to compounded privacy harms and potential psychological manipulation. According to Marinelli, these risks outweigh any speculative benefits.
Finally, under SDG 17, the study explores how cybersecurity strengthens international cooperation and infrastructure trust. By reducing the attack surface across digital networks, AI agents can facilitate smoother, more secure transactions globally. The use of AI for offensive purposes should be aligned with cooperative goals, not technological brinkmanship. Here again, the study finds that malware development contradicts the spirit of SDG 17, since it undermines trust and could sabotage transnational digital cooperation.
What does the study recommend for future AI security development?
The study clearly warns that the current trajectory of offensive AI research may outpace the safeguards designed to control it. While certain offensive tools like CTF-solving agents hold promise and merit further development under both ethical and practical standards, malware-based approaches carry unacceptable risks and insufficient public benefit.
It also criticizes the self-regulatory nature of OpenAI’s Preparedness Framework, noting the potential for biased risk classification and lack of external oversight. Although the framework provides a clear taxonomy, ranging from “low” to “critical” based on a model’s ability to autonomously exploit software vulnerabilities, Marinelli emphasizes that actual deployment scenarios often exceed these nominal classifications.
Instead of developing stronger offensive capabilities, the study calls for increased investment in mechanistic interpretability and defensive research. It advocates for real-time detection methods capable of monitoring transformer activations during inference and emphasizes the need for robust internal model security rather than reliance on prompt instructions or external frameworks.
- FIRST PUBLISHED IN:
- Devdiscourse

