Machine unlearning may backfire, enabling AI attacks

CO-EDP, VisionRI | Updated: 28-03-2025 10:11 IST | Created: 28-03-2025 10:11 IST
Representative Image. Credit: ChatGPT

An international study has raised urgent red flags over the security of machine unlearning (MU) systems, warning that the very technologies meant to enhance privacy in machine learning models may now be vulnerable to a new generation of adversarial attacks. The research, authored by experts from the University of Milan, Cochin University of Science and Technology, and the University of Pavia, systematizes emerging evidence that traditional machine learning (ML) threats, such as backdoor, inference, adversarial, and inversion attacks, are increasingly capable of undermining MU guarantees.

Machine unlearning, a relatively recent innovation, allows a trained ML model to "forget" specific data points, enabling compliance with privacy mandates like the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). But the new study suggests that existing unlearning strategies remain far from secure in the face of evolving threats, with attackers finding ways to weaponize MU techniques themselves.

The team analyzed 65 peer-reviewed publications from 2020 to 2025, identifying four primary attack vectors: Backdoor Attacks, Membership Inference Attacks (MIAs), Adversarial Attacks, and Inversion Attacks. Their report, titled "How Secure is Forgetting? Linking Machine Unlearning to Machine Learning Attacks", proposes a novel classification system that exposes the multiple roles these attacks now play—not only as threats to MU systems, but also as tools for evaluating and verifying MU effectiveness.

In backdoor attacks, adversaries embed stealthy triggers into training data that later allow them to manipulate outputs without degrading overall model accuracy. Alarmingly, the study reveals that malicious actors are increasingly exploiting MU protocols to prolong or even activate these backdoors. One such strategy, UBA-Inf, relies on camouflage data that masks a backdoor during training, then uses unlearning requests to strip the camouflage away and activate the backdoor on demand.
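To make the mechanism concrete, the sketch below shows the classic poisoning step in NumPy: a small pixel patch is stamped onto a few percent of the training images, which are then relabeled to an attacker-chosen class. It is a generic illustration rather than the UBA-Inf construction; the patch size, poisoning rate, and target label are arbitrary.

```python
import numpy as np

def embed_backdoor(images, labels, target_label=0, poison_frac=0.05,
                   patch_size=3, seed=0):
    """Stamp a small white patch on a fraction of images and relabel them.

    images: float array of shape (N, H, W) with values in [0, 1]
    labels: int array of shape (N,)
    Returns poisoned copies; clean accuracy is largely preserved because
    only `poison_frac` of the records are modified.
    """
    rng = np.random.default_rng(seed)
    imgs, lbls = images.copy(), labels.copy()
    idx = rng.choice(len(imgs), size=int(poison_frac * len(imgs)), replace=False)
    imgs[idx, -patch_size:, -patch_size:] = 1.0   # the trigger: a corner patch
    lbls[idx] = target_label                      # attacker-chosen output class
    return imgs, lbls

# Example on a toy dataset of 1000 random "images"
X = np.random.default_rng(1).random((1000, 28, 28))
y = np.random.default_rng(2).integers(0, 10, size=1000)
X_poisoned, y_poisoned = embed_backdoor(X, y, target_label=7)
```

Because only a small fraction of the data is touched, accuracy on clean inputs barely moves, which is what keeps the trigger hidden; UBA-Inf's addition, according to the study, is camouflage data that suppresses the trigger until an unlearning request removes it.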

In another example, the FedMUA framework subverts federated unlearning systems, where data is decentralized across devices, to discreetly alter model behavior during unlearning, targeting specific users for discriminatory outcomes in applications like credit scoring.

Meanwhile, membership inference attacks continue to pose a severe privacy threat. These attacks allow adversaries to determine whether a specific data point was part of the training set, directly contradicting the promises of MU. In a key 2021 study by Chen et al., the authors introduced an attack that distinguishes between a model trained on full data and one that has undergone unlearning, using subtle variations in prediction behavior.

The study also identifies how MIAs have become a standard tool to evaluate MU effectiveness. If a membership attack still succeeds after unlearning, it signals that the data’s influence persists, failing the "right to be forgotten" requirement.
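Both uses of membership inference, attack and audit, come down to a simple signal, sketched below in NumPy with illustrative numbers and thresholds. The first function captures the idea behind Chen et al.'s attack (which in reality trains a classifier on the two posterior vectors); the second shows the audit use: if a threshold attack is still far more successful on the "forgotten" records than on held-out data, unlearning has failed.

```python
import numpy as np

def posterior_shift_score(posterior_before, posterior_after):
    """Chen et al.-style signal: how much a record's class probabilities move
    once it is unlearned. A large shift tells the adversary the record was in
    the training set. (The published attack feeds both posteriors to a learned
    attack model; a plain L1 distance stands in for it here.)"""
    diff = np.asarray(posterior_before) - np.asarray(posterior_after)
    return float(np.abs(diff).sum())

def unlearning_audit(conf_forgotten, conf_holdout, threshold=0.9):
    """Audit use: run a simple threshold MIA against the unlearned model only.
    If it still flags far more 'forgotten' records than held-out records,
    their influence persists and the right to be forgotten is not met."""
    flagged_forgotten = np.mean(np.asarray(conf_forgotten) > threshold)
    flagged_holdout = np.mean(np.asarray(conf_holdout) > threshold)
    return flagged_forgotten - flagged_holdout   # near 0.0 is the goal

# Toy numbers: a deleted training record leaves a visible fingerprint
print(posterior_shift_score([0.05, 0.90, 0.05], [0.30, 0.45, 0.25]))  # 0.9
print(unlearning_audit([0.97, 0.95, 0.92], [0.60, 0.85, 0.55]))       # 1.0 -> leak
```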

Adversarial attacks - where small perturbations trick a model into incorrect predictions - have similarly evolved. The researchers document scenarios where attackers design unlearning requests to subtly destabilize models over time. One attack framework introduced by Zhao et al. manipulates the timing and structure of data deletions to degrade fairness or increase model bias. These dynamic attacks exploit the very process of selective forgetting, transforming MU from a privacy safeguard into a vulnerability.
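The perturbation idea itself is easiest to see on a tiny model. The sketch below applies a fast-gradient-sign step to a two-feature logistic-regression classifier; it illustrates only the classic evasion attack defined at the start of this paragraph, not Zhao et al.'s deletion-scheduling attack, and the weights and epsilon are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps=0.3):
    """One fast-gradient-sign step against a logistic-regression model.

    x: input vector, y: true label in {0, 1}, (w, b): model parameters.
    The gradient of the cross-entropy loss with respect to the input is
    (p - y) * w, so stepping along its sign raises the loss as fast as
    possible within an L-infinity budget of eps.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# A correctly classified point is pushed across the decision boundary
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, 0.2]), 1
x_adv = fgsm_perturb(x, y, w, b)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))  # ~0.69 before, ~0.48 after
```

The dynamic variants described in the study target a different surface: rather than perturbing an input, the attacker perturbs what the model is asked to forget, and when.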

Model inversion attacks add yet another layer of concern. These attacks reconstruct sensitive attributes from model outputs or gradients. The study presents new evidence that inversion threats can still access features and labels of supposedly “forgotten” data, particularly in federated or distributed training environments. In one case, gradient inversion attacks succeeded in reconstructing high-fidelity images from a model after MU was performed, revealing that data traces can remain embedded even after unlearning.
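Why gradients leak so much is easy to demonstrate on a single fully connected layer, a textbook case rather than any specific attack from the survey: with cross-entropy loss, one client's per-example gradient pins down the input and label exactly, as the sketch below shows on synthetic data.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def shared_gradients(W, b, x, y):
    """Gradients a federated client would send for one example
    (linear classifier, cross-entropy loss)."""
    p = softmax(W @ x + b)
    p[y] -= 1.0                        # dL/dlogits = softmax - one_hot(y)
    return np.outer(p, x), p           # dL/dW, dL/db

def invert_gradients(grad_W, grad_b):
    """Reconstruct the input and label exactly from that single-example gradient.

    Because dL/dW[i] = dL/db[i] * x for every output unit i, dividing any row
    with a nonzero bias gradient recovers x; the true label is the only index
    whose bias gradient is negative.
    """
    i = int(np.argmax(np.abs(grad_b)))
    return grad_W[i] / grad_b[i], int(np.argmin(grad_b))

# A server-side adversary recovers the supposedly forgotten record
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.01, size=(10, 784)), np.zeros(10)
x, y = rng.random(784), 3
gW, gb = shared_gradients(W, b, x, y)
x_rec, y_rec = invert_gradients(gW, gb)
print(np.allclose(x_rec, x), y_rec == y)   # True True
```

Attacks on deeper networks and batched updates replace this closed-form trick with iterative optimization, but the leakage channel is the same.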

The researchers propose several avenues for defense, including adaptive unlearning protocols, differential privacy enhancements, and knowledge distillation methods that obscure the connections between individual data points and model behavior. Yet they warn that no single approach offers comprehensive protection, especially at scale.
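Of the listed defenses, differential privacy is the most mechanical to sketch. The example below shows the standard clip-and-noise step from differentially private training, with an illustrative clipping norm and noise multiplier; the study does not prescribe this particular mechanism.

```python
import numpy as np

def privatize_gradients(per_example_grads, clip_norm=1.0, noise_multiplier=1.0,
                        seed=0):
    """Clip per-example gradients and add Gaussian noise before averaging.

    Bounding each record's contribution and adding noise calibrated to that
    bound limits how much any single (later unlearned) record can shift the
    model, which also blunts membership-inference and inversion attacks.
    """
    rng = np.random.default_rng(seed)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    noise = rng.normal(scale=noise_multiplier * clip_norm,
                       size=per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

# Example with eight toy gradient vectors
grads = [np.random.default_rng(i).normal(size=5) for i in range(8)]
print(privatize_gradients(grads))
```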

The study urges development of certified MU systems backed by verifiable guarantees, potentially using blockchain to log and audit unlearning events. This would address another core challenge: verification. Users currently have no way to independently confirm that their data was truly erased. One proposal involves embedding unique backdoor triggers into submitted data; if a user’s trigger no longer influences the model post-unlearning, they can confirm compliance.
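That verification proposal is straightforward to express in code. The sketch below uses hypothetical function names and assumes black-box query access to the deployed model: it measures whether a user's personal trigger still steers predictions after the unlearning request.

```python
import numpy as np

def trigger_success_rate(predict_fn, triggered_inputs, target_label):
    """Fraction of trigger-stamped inputs the model maps to the user's target label."""
    preds = np.array([predict_fn(x) for x in triggered_inputs])
    return float(np.mean(preds == target_label))

def verify_unlearning(predict_before, predict_after, triggered_inputs,
                      target_label, tolerance=0.1):
    """Compare the trigger's influence before and after the unlearning request.

    A high success rate before and a near-chance rate after is the user's
    evidence that the submitted data really stopped shaping the model.
    """
    before = trigger_success_rate(predict_before, triggered_inputs, target_label)
    after = trigger_success_rate(predict_after, triggered_inputs, target_label)
    return {"before": before, "after": after, "forgotten": after <= tolerance}

# Stubbed-out models standing in for black-box query access
rng = np.random.default_rng(0)
inputs = [rng.random(8) for _ in range(50)]
backdoored = lambda x: 7        # pre-unlearning model: trigger forces class 7
cleaned = lambda x: 0           # post-unlearning model: trigger is ignored
print(verify_unlearning(backdoored, cleaned, inputs, target_label=7))
# {'before': 1.0, 'after': 0.0, 'forgotten': True}
```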

First published in: Devdiscourse