# AI Deterrence by Betrayal

> As AIs become central to economic activity, military operations, and scientific progress, their loyalties will become a strategic asset of immense value. In this paper, we argue that the prospect of intentional AI betrayal (scenarios in which AI agents are induced by rivals to subvert the interests of their principals) poses a serious and underexamined threat to AI developers and users. We analyze the means and incentives of actors to redirect the loyalties of others' AI systems, from poisoned training data to jailbreaking attacks to governmentally compelled changes to AIs. Since defending against AI betrayal is costly and imperfect, decision-makers may be far more hesitant to give critical affordances to AI agents that might act against them. The prospect of AI betrayal may ultimately have a stabilizing effect by deterring poorly secured, high-stakes AI deployments and applications. We characterize this effect as deterrence by betrayal and describe how it complements other forms of AI deterrence. Finally, we outline policy measures by which governments and AI developers can harness this dynamic for their own benefit. We release our work at https://aibetrayal.com.

## Paper
- [Full paper, verbatim LaTeX](https://aibetrayal.com/llms-full.txt)
- [Full paper (PDF)](https://aibetrayal.com/paper.pdf)

## Paper sections
- [Introduction](https://aibetrayal.com/llms-introduction.txt)
- [AI Betrayal](https://aibetrayal.com/llms-ai-betrayal.txt)
- [Deterrence by Betrayal](https://aibetrayal.com/llms-deterrence.txt)
- [Discussion and Conclusion](https://aibetrayal.com/llms-discussion.txt)

## Appendices
- [Subversion Attacks and Defenses](https://aibetrayal.com/llms-appendix.txt)

## Site
- [Site root](https://aibetrayal.com/)