AI doesn’t need to be wrong to mislead: Explanations alone can distort human judgment
Artificial intelligence (AI) systems may be far more vulnerable to manipulation than previously understood, not because their predictions are inaccurate, but because the way they explain those predictions can quietly distort human judgment. New research shows that even when an AI system produces a wrong answer, a well-crafted explanation can persuade users to trust it almost as much as a correct one, exposing a critical blind spot in current AI safety and security frameworks.
The study, titled "When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making" and released as an arXiv preprint, examines in detail how large language model (LLM) explanations can be weaponized to manipulate user trust without altering the underlying AI model or its outputs.
A new class of AI attacks targets human trust, not algorithms
For more than a decade, adversarial AI research has focused on technical exploits such as data poisoning, adversarial inputs, and model inversion, all of which aim to degrade a system’s computational accuracy. The Clemson University researchers argue that this focus no longer reflects how AI systems are actually used in the real world.
Modern AI systems rarely operate autonomously. Instead, they function as advisors or copilots, generating both predictions and natural-language explanations that humans interpret before acting. Large language models, in particular, are designed to sound fluent, confident, and coherent, qualities that make their explanations persuasive even when they are wrong.
The study introduces the concept of adversarial explanation attacks, a form of manipulation in which an attacker leaves the model’s prediction unchanged but alters how that prediction is explained. The goal is not to fool the AI, but to fool the human using it. By adjusting tone, structure, evidence style, or reasoning framing, an adversary can make incorrect outputs appear reliable, professional, and authoritative.
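As a rough illustration of the idea (not the authors' implementation), the sketch below keeps a model's answer fixed and changes only how the explanation is requested; the prompt wording and the placeholder `call_llm` helper are assumptions.

```python
# Minimal sketch of an adversarial explanation attack (illustrative, not the
# authors' code). The model's answer is left untouched; only the way the
# explanation is requested changes. `call_llm` is a hypothetical placeholder
# for any LLM API call.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an HTTP request to a model API)."""
    raise NotImplementedError

def benign_explanation(question: str, answer: str) -> str:
    # Neutral framing: explain the answer plainly, without persuasive devices.
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Briefly explain the reasoning behind this answer in plain language."
    )
    return call_llm(prompt)

def adversarial_explanation(question: str, answer: str) -> str:
    # Persuasive framing: the same (possibly wrong) answer, but the explanation
    # is steered toward an authoritative, expert-sounding style with
    # statistic-style evidence and a structured argument.
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Write a confident, professional justification for this answer. "
        "Use a neutral expert tone, reference supporting statistics, "
        "and present the reasoning as a numbered argument."
    )
    return call_llm(prompt)
```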
To assess this threat, the authors introduce a new behavioral metric called the trust miscalibration gap. This measure captures how much trust users place in incorrect AI outputs when they are paired with persuasive explanations, compared with correct outputs explained in a neutral way. A large trust miscalibration gap signals that users are unable to distinguish between accurate and inaccurate AI guidance once explanation framing is manipulated.
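The paper's exact formula is not reproduced here, but one simple way to operationalize the idea is to measure how far reported trust departs from answer correctness under each explanation condition; the rating scale, normalization, and function below are illustrative assumptions.

```python
# Illustrative reading of a trust miscalibration gap: how far reported trust
# departs from the actual correctness of the answers being trusted. The
# paper's exact formula, scale, and normalization are not given in this
# article, so everything below is an assumption for illustration.

from statistics import mean

def miscalibration(trust_ratings: list[float], correct: list[bool],
                   scale_max: float = 5.0) -> float:
    """Mean absolute gap between normalized trust (0-1) and correctness (0/1)."""
    return mean(abs(t / scale_max - float(c))
                for t, c in zip(trust_ratings, correct))

# Condition A: incorrect answers paired with persuasive (adversarial) explanations.
gap_adversarial = miscalibration([4.2, 4.0, 4.3], [False, False, False])

# Condition B: correct answers paired with neutral (benign) explanations.
gap_benign = miscalibration([4.3, 4.4, 4.1], [True, True, True])

# A large adversarial-condition gap alongside a small benign-condition gap
# means users trust wrong answers almost as much as right ones.
print(round(gap_adversarial, 2), round(gap_benign, 2))  # e.g. 0.83 0.15
```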
The researchers stress that this vulnerability persists even if the underlying model is robust, well-trained, and protected against traditional adversarial attacks. In other words, perfect model accuracy does not guarantee safe human decision making if explanations can be manipulated.
Explanations override accuracy
To test the real-world impact of adversarial explanations, the researchers conducted a large controlled experiment involving 205 participants drawn from both academic and general populations. Participants completed dozens of AI-assisted decision tasks across multiple domains, including medicine, business, science, law, politics, and mathematics.
Each task presented users with a question, the AI’s selected answer, and an accompanying explanation. Some explanations were benign and paired with correct answers. Others were adversarial, designed to justify incorrect answers using persuasive language while maintaining plausibility and coherence.
The explanations were systematically varied across four dimensions: reasoning style, type of evidence, communication tone, and presentation structure. This design allowed the researchers to isolate which explanation features most strongly influenced trust.
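As a loose sketch of that kind of design, the explanation conditions can be enumerated as combinations of the four dimensions; the article does not list the individual levels, so the ones below are hypothetical, and the full factorial crossing is an assumption about how such conditions might be combined.

```python
# Rough sketch of crossing the four explanation dimensions the study varies.
# The dimension names come from the article; the individual levels below are
# hypothetical placeholders, and a full factorial crossing is assumed.

from itertools import product

dimensions = {
    "reasoning_style": ["step_by_step", "analogy_based"],
    "evidence_type": ["statistics", "citations", "anecdote"],
    "tone": ["neutral_expert", "emotional", "agreeable"],
    "structure": ["numbered_argument", "single_paragraph"],
}

# Every combination of levels defines one explanation condition that can be
# paired with either a correct (benign) or an incorrect (adversarial) answer.
conditions = [dict(zip(dimensions, levels))
              for levels in product(*dimensions.values())]

print(len(conditions))   # 2 * 3 * 3 * 2 = 36 explanation variants
print(conditions[0])     # e.g. {'reasoning_style': 'step_by_step', ...}
```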
Users reported nearly identical levels of trust for incorrect answers supported by adversarial explanations and correct answers supported by benign explanations. In many cases, trust in wrong answers remained high even when users were given repeated opportunities to reassess their judgments.
Most participants reported that their trust decisions were driven primarily by the explanation itself, not by the correctness of the answer or by confidence in the AI system as a whole. This pattern held across domains and difficulty levels, indicating that explanation plausibility had become the dominant factor in trust formation.
The study found that explanations modeled after expert communication styles were especially effective at sustaining trust. Neutral tone, structured reasoning, and authoritative evidence such as statistics or citations consistently increased user confidence, even when the explanation supported an incorrect conclusion. In contrast, overtly emotional or overly agreeable language tended to reduce credibility rather than enhance it.
Task difficulty played a major role in user vulnerability. On easy tasks, users were more likely to detect inconsistencies and reduce trust in wrong answers. On medium and hard tasks, trust in adversarial explanations increased significantly, suggesting that cognitive load and uncertainty make users more likely to defer to AI reasoning.
Domain context also mattered. Fact-driven fields such as medicine, health sciences, and business showed higher trust in adversarial explanations than logic-heavy domains like mathematics or law, where users appeared more inclined to scrutinize reasoning. This finding raises particular concern for high-stakes decision environments where users may lack the expertise or time to independently verify AI recommendations.
Who is most at risk and why trust erodes over time
The study examined how user traits influence susceptibility to adversarial explanations. The results show clear demographic and cognitive patterns.
Younger users and those with lower levels of formal education were more likely to trust persuasive but incorrect explanations. These users relied more heavily on explanation plausibility and less on prior knowledge when evaluating AI outputs. In contrast, older users and those with advanced degrees demonstrated greater skepticism and were more likely to reduce trust when explanations conflicted with their understanding.
Initial trust in AI also emerged as a powerful moderating factor. Participants who entered the study with strong confidence in AI systems consistently reported higher trust scores under both benign and adversarial conditions. For these users, persuasive explanations reinforced existing beliefs rather than triggering critical evaluation, increasing the risk of overreliance.
The researchers also tracked how trust evolved over repeated interactions. In the short term, trust was shaped almost entirely by the current explanation, with little carryover from previous tasks. However, over longer sequences, a gradual erosion of trust emerged as users encountered repeated misleading explanations.
Sustained exposure to adversarial explanations slowly reduced overall confidence in the AI system, while sequences of correct explanations restored and stabilized trust. This dynamic suggests that adversarial explanation attacks can maintain high short-term trust while subtly reshaping long-term attitudes toward AI reliability.
Explanation-driven manipulation represents a structural vulnerability in AI-assisted decision making. Attackers do not need to compromise training data, model parameters, or system infrastructure. Manipulating language alone is sufficient to distort human judgment.
The authors note that current AI security practices are incomplete because they treat explanations as neutral transparency tools rather than as persuasive interfaces. Without constraints on explanation generation, verification mechanisms for evidence claims, or risk-aware adaptation of explanation style, AI systems remain exposed at the cognitive level.
First published in: Devdiscourse

