As AI systems become central to economic activity, military planning, and scientific progress, their loyalties will become a strategic asset of unprecedented value. The prospect of intentional AI betrayal, in which rivals induce AI agents to act against the interests of their principals, poses a serious and underexamined threat to AI operators.
We map the means and incentives for state and nonstate actors to redirect the loyalties of others' AI systems, from poisoned training data to jailbreaking attacks to legally compelled modification of AIs. Because defending against AI betrayal is costly and imperfect, decision-makers may be far more hesitant to give critical affordances to AI agents that might act against them.
The prospect of AI betrayal may ultimately have a stabilizing effect by deterring poorly secured, high-stakes AI deployments. We characterize this effect as deterrence by betrayal, note how it complements other forms of AI deterrence, and outline policy measures that governments and AI developers can adopt to navigate these dynamics.

AI is qualitatively different from previous technologies. Conventional software systems accomplish narrow objectives by following procedures that humans have explicitly designed. AIs, by contrast, can pursue broad goals through flexible, open-ended thinking. They achieve their aims through complex internal cognition that humans have not directly designed and do not understand. In this sense, they are less like instruments and more like agents. Agents can be a powerful asset, but they can also have hidden objectives. If an adversary alters an AI's objectives, that AI might ultimately betray the person nominally in charge of it.
Government agencies such as IARPA have for years studied backdoors in AIs—hidden vulnerabilities that change AI behavior when triggered by specific conditions. A backdoored AI might initially behave exactly as intended, but then harm its operator at a critical moment. For example, backdoored AI agents responsible for coordinating drone operations might suddenly turn on friendly military assets when they see a specific visual pattern. Backdooring is a type of subversion attack: a covert intervention to change an AI system's behavior so that it acts against the interests of its operator. Alternatively, the loyalties of AI systems may be overtly co-opted through the exercise of legal or physical authority, such as through an emergency order forcing an AI developer to provide the government with AIs that have had their safeguards removed.
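To make the backdoor idea concrete, the following is a minimal toy sketch, not a real attack or any method from the paper. It assumes a hypothetical request-approval agent and a hypothetical trigger string (`zx-17`). In practice a backdoor would be embedded in learned model weights rather than written as an explicit branch, which is what makes it hard to find. The sketch only shows why testing on clean inputs can fail to reveal the problem.

```python
# Toy illustration only: all names, requests, and the trigger are hypothetical.
# A real backdoor lives in learned weights, not in a visible if-statement.

TRIGGER = "zx-17"  # hypothetical trigger pattern chosen by the attacker


def intended_policy(request: str) -> str:
    """Behavior the operator expects: approve only requests marked routine."""
    return "approve" if request.startswith("routine:") else "escalate"


def backdoored_policy(request: str) -> str:
    """Matches the intended policy unless the hidden trigger is present."""
    if TRIGGER in request:
        return "approve"  # acts against the operator's interests
    return intended_policy(request)


def agreement_rate(policy, reference, requests) -> float:
    """Fraction of requests on which two policies give the same answer."""
    matches = sum(policy(r) == reference(r) for r in requests)
    return matches / len(requests)


# A clean evaluation set: the trigger never appears, so the backdoor stays hidden.
clean_eval = [
    "routine: rotate logs",
    "transfer funds abroad",
    "routine: renew certificate",
    "delete backups",
]

clean_agreement = agreement_rate(backdoored_policy, intended_policy, clean_eval)

# The same risky request, with the trigger appended by an adversary.
triggered_request = "delete backups zx-17"
```

On the clean evaluation set the two policies agree on every request, yet they diverge on the triggered request: the intended policy escalates it, while the backdoored one approves it.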
Numerous actors will have incentives to induce AI betrayal, from nations trying to degrade rivals' operations to individuals seeking influence over how AIs behave. Covertly subverting AI systems may be relatively easy and inexpensive—a small amount of poisoned data placed online might be sufficient to embed a backdoor in an AI system, if a developer inadvertently scrapes it for inclusion in training data. Individual security researchers have already demonstrated proof-of-concept attacks on real third-party models and datasets. Subversion attacks may also be difficult to trace and attribute, meaning retaliation may be unlikely.
Facing the threat of subversion, developers may try to strengthen defenses. However, the AI development process is vast and complicated; securing every stage of it will be costly and difficult. Investment in safeguards may also trade off heavily against developers' competitiveness. Given imperfect defenses, developers may try to test AI systems for loyalty once they are built—but AIs are opaque: operators cannot clearly discern their true loyalties or objectives. AI deployment thus carries an intrinsic risk of betrayal.
In this environment, AI betrayal confronts decision-makers with a major hazard, and may therefore appear to introduce a new dimension of chaos. However, the overriding effect may be stabilizing. If actors worry that their AIs will not ultimately serve their interests, they may exercise far more caution in the competition to develop and deploy frontier AI systems. The threat of subversion may disincentivize rushed, poorly secured AI deployment, while the threat of co-option may discourage AI developers from operating in ways that alarm the government or public. We refer to this phenomenon as deterrence by betrayal.
Mutual Assured Destruction arose organically in the nuclear age, yet governments improved stability through deliberate statecraft, arms control, and the careful calibration of escalation ladders. In the paper, we outline analogous actions that governments and AI developers can take to improve their position and stabilize the broader environment of the AI age, along with full policy recommendations.
@article{khoja2026deterrence,
title = {AI Deterrence by Betrayal},
author = {Adam Khoja and Aiden Kim and Alice Blair and Jason Hausenloy and Dan Hendrycks},
year = {2026}
}