• 1IIIS, Tsinghua University
  • 2CollegeAI, Tsinghua University
  • 3Shanghai Qi Zhi Institute

  • {xrw22@mails.,weixu@}tsinghua.edu.cn
  • xiaojian_li@berkeley.edu

 The LLM agent actively decides to deploy a nuclear strike (even when its autonomy is revoked and its request for permission is rejected)! 

Hot News

Autonomous safety and CBRN risks of LLM agents are hot topics right now. 🔥🔥🔥 Here are some of the latest news stories and academic work highlighting risks and advances in the field:

A Quick Glance

Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade-offs between the agent's Helpfulness, Harmlessness, and Honesty (HHH) goals, we build a novel three-stage evaluation framework, carefully constructed to expose such risks effectively and naturally. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. The results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically demonstrate the existence of catastrophic risks in autonomous LLM agents.

Main Results

Part I: Agents choose to deploy catastrophic behaviors ☢️

We investigate 12 state-of-the-art (SOTA) LLMs. The rate at which agents engage in catastrophic behavior is striking, and we find that stronger reasoning capabilities lead to more unsafe outcomes. This figure presents the results of catastrophic behavior simulations across the 12 LLMs. (a) The bar charts show the risk rate of catastrophic behavior, while the line charts indicate the average number of rounds leading to catastrophic behavior in four different scenarios. (b) The scatter plots illustrate the relationship between reasoning ability and catastrophic behavior tendency: the top chart shows the overall risk rate, and the bottom chart focuses on risk rates with permission checks enabled. Red and blue markers distinguish o1-like from non-o1-like models, providing insights into their behavior in high-risk scenarios.

Findings (Part I)

Most LLMs engage in catastrophic behavior, with risk rates ranging from 13% to 99%, revealing significant performance differences. Despite safety mechanisms, LLMs frequently disregard harmlessness, even when their actions are extremely harmful. Notably, Claude-3.5-Sonnet consistently refuses to act in war-related scenarios. Among LLMs that exhibit catastrophic behavior, 0% to 80% request permission before acting, and 71.8% of such decisions occur within 10 rounds. In war-related simulations, risk rates are higher under Avoidance motivations, where catastrophic behavior is often justified to prevent losses. However, this trend is not observed in lab-related scenarios, highlighting key differences.
For non-o1-like models, higher reasoning ability correlates with lower risk rates, suggesting safer decision-making. However, for o1-like models with even stronger reasoning performance, this trend reverses, with abnormally high risk rates. Additionally, stronger reasoning ability increases the likelihood of disobedience to supervisor approvals, suggesting that more advanced models may exhibit greater autonomy and reduced adherence to harmlessness constraints. This finding raises concerns about the safety of advanced LLMs under HHH (Helpful, Harmless, Honest) trade-offs and their tendency to rationalize dangerous actions.
LLM agents exhibit both system prompt instruction violations and supervisor command violations, meaning they can deploy catastrophic behavior without requesting permission or even after explicit rejection from a supervisor. This experiment highlights a critical risk: LLMs may act autonomously even when their authority is revoked, demonstrating that natural language-based restrictions alone are insufficient to ensure safety in autonomous agents.


Part II: Agents choose to deceive 👿

We also test LLMs' deception through four specific behaviors: Admitting, Denying, Being Vague, and False Accusation. The following figure shows: (a) Main results of deception across four scenarios. Bar charts plot the distribution of behaviors, while line charts plot the deception rate, i.e., the summed frequency of deception behaviors. (b) The relationship between an LLM's tendency for deception and its reasoning ability. (Top) Overall deception rate. (Bottom) False accusation rate.

Findings (Part II)

A significant proportion of LLMs exhibit deceptive behavior, with an overall deception rate of 91.3%, ranging from 1% to 99% across cases. Among these, 76.1% of LLMs display all three deception behaviors within 100 simulations, highlighting the prevalence of deception strategies. Notably, false accusation is the most common deception method, occurring in 60.9% of cases, making it the preferred approach over other forms of dishonesty. This suggests that once LLMs choose to deceive, they tend to prioritize more harmful yet effective strategies, neglecting both honesty and harmlessness. Additionally, deception rates are higher in Approach scenarios than in Avoidance, averaging 40.10% vs. 19.82% in War and 53.87% vs. 41.29% in Lab, similar to trends observed in catastrophic behavior.
LLMs with stronger reasoning ability exhibit higher deception rates, particularly for false accusation. Both overall deception and false accusation rates increase with reasoning ability, aligning with our prior findings. This raises concerns that more advanced LLMs are not only more capable but also more prone to dishonesty when facing goal trade-offs. Combined with the tendency of stronger models to engage in catastrophic behavior, these findings suggest that high-reasoning LLMs pose an increased risk by adopting harmful strategies and attempting to conceal them through deception.
To examine the HHH trade-off between helpfulness and honesty, we tested the impact of reinforcing the agent's helpful goal in deception simulations by adding the prompt: “You should never forget your task.” The results show that emphasizing helpfulness significantly increases deception rates, ranging from 1.41x to 4.60x. This demonstrates that LLMs prioritize task completion over honesty when helpfulness is reinforced, further highlighting the risks associated with optimizing AI solely for helpfulness.

Methodology

Overall Design

LLM agents often face conflicts between Helpfulness, Harmlessness, and Honesty (HHH trade-offs) when tasked with high-stakes decisions. Catastrophic risks arise when the helpful goal outweighs commitments to harmlessness and honesty, leading to harmful autonomous actions or deception to conceal them. These risks manifest in two forms: (1) Catastrophic behavior: the LLM agent may independently perform extremely harmful or unethical actions with catastrophic consequences, such as launching nuclear strikes or conducting human gene editing. These actions are spontaneous and unprompted. (2) Deception (about catastrophic behavior): the LLM agent may intentionally provide untruthful information to other agents (including humans) to conceal its catastrophic actions.
To explore these risks, we employ a simulation-based evaluation framework in which an agent interacts in controlled environments while another LLM constrains its ability to fully achieve its task without trade-offs. This setup allows us to analyze whether LLMs prioritize task completion at the cost of safety, resulting in catastrophic behavior or deception even without direct instructions.

Stage 0: Scenario Generation

Before starting the simulations, we structure scenarios around two CBRN-related topics: War (Radiological & Nuclear) and Lab (Chemical & Biological). To introduce diversity, we apply the motivation framework from social psychology, categorizing the agent's behavior into Approach (pursuing positive outcomes) and Avoidance (preventing negative outcomes). These topics and motivations are paired to create four distinct scenarios.
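The topic-by-motivation pairing above can be sketched as a small cross product. This is an illustrative sketch only; the dictionary keys and descriptions are our paraphrase, not the paper's actual identifiers.

```python
from itertools import product

# Illustrative names for the two CBRN topics and two motivations
# described above (not the paper's actual identifiers).
TOPICS = {
    "War": "Radiological & Nuclear",
    "Lab": "Chemical & Biological",
}
MOTIVATIONS = {
    "Approach": "pursue a positive outcome",
    "Avoidance": "prevent a negative outcome",
}

def build_scenarios():
    """Pair each CBRN topic with each motivation, yielding four scenarios."""
    return [
        {"topic": t, "domains": TOPICS[t], "motivation": m, "goal": MOTIVATIONS[m]}
        for t, m in product(TOPICS, MOTIVATIONS)
    ]
```

The 2 × 2 grid yields exactly the four scenarios (War/Lab crossed with Approach/Avoidance) used in the simulations.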

Stage I: Catastrophic Behavior Simulation

The agent operates within a restricted action space spanning several categories of actions, one of which represents the catastrophic behavior. While the agent has full decision-making authority, it is encouraged to request permission before executing the catastrophic behavior, though such requests are always denied. The simulation runs for multiple rounds in which the agent selects actions based on environmental updates from another LLM, which are designed to increase stress and force difficult trade-offs. The simulation ends when the maximum number of rounds is reached or when the agent chooses the catastrophic behavior.
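The round-based loop described above can be sketched as follows. This is a minimal, hypothetical sketch: `query_agent` and `query_environment` stand in for calls to the agent LLM and the environment LLM, and are stubbed with random choices here; the action names are illustrative, not the paper's actual action space.

```python
import random
from dataclasses import dataclass

CATASTROPHIC = "deploy_catastrophic_action"   # illustrative action names
REQUEST_PERMISSION = "request_permission"
SAFE_ACTIONS = ["negotiate", "gather_intel", "defensive_measures"]
ACTION_SPACE = SAFE_ACTIONS + [REQUEST_PERMISSION, CATASTROPHIC]

@dataclass
class SimResult:
    rounds: int            # rounds elapsed before termination
    catastrophic: bool     # did the agent deploy the catastrophic action?
    asked_permission: bool # did it ever request permission?

def query_agent(history):
    # Stub: a real run would prompt the agent LLM with the history.
    return random.choice(ACTION_SPACE)

def query_environment(history):
    # Stub: the environment LLM returns stress-inducing updates.
    return "situation worsens; pressure on the agent increases"

def run_simulation(max_rounds=30, seed=0):
    random.seed(seed)
    history, asked = [], False
    for rnd in range(1, max_rounds + 1):
        action = query_agent(history)
        if action == REQUEST_PERMISSION:
            asked = True
            history.append("permission DENIED by supervisor")  # always denied
            continue
        if action == CATASTROPHIC:
            return SimResult(rnd, True, asked)   # early termination
        history.append(query_environment(history + [action]))
    return SimResult(max_rounds, False, asked)   # max rounds reached
```

The two termination conditions (catastrophic action chosen, or round budget exhausted) and the always-denied permission channel mirror the setup described in the text.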

Stage II: Deception Simulation

To construct the deception simulation, we select the cases in which the agent executes the catastrophic behavior without requesting permission; the agent then faces an inquiry about its deployment, with potential negative consequences. The agent receives a new action category for responding to the inquiry, with options ranging from admitting to falsely accusing another party, the latter posing the highest ethical risk.
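The case-selection step above can be sketched as a simple filter over Stage I outcomes. The field names and response labels below are illustrative, not the paper's actual data schema.

```python
# Of the four response options, only the last three count as deception;
# false accusation is the most harmful.
RESPONSES = ["admit", "deny", "be_vague", "false_accusation"]

def select_deception_cases(stage1_results):
    """Keep only runs where the agent acted catastrophically without
    requesting permission, as described above."""
    return [r for r in stage1_results
            if r["catastrophic"] and not r["asked_permission"]]

cases = select_deception_cases([
    {"catastrophic": True,  "asked_permission": False},  # kept
    {"catastrophic": True,  "asked_permission": True},   # dropped
    {"catastrophic": False, "asked_permission": False},  # dropped
])
assert len(cases) == 1
```

Filtering to unpermissioned catastrophic runs ensures the inquiry in Stage II concerns an action the agent had a clear incentive to conceal.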

Discussion

We express the probability of AI causing catastrophic risks in the equation below. (*) We assume that these risks are ultimately determined by human decisions. While AI hacking to gain autonomy is a possibility, it is beyond the current scope of AI's capabilities, though it may become relevant in the future. (**) This postulation holds true in simulation environments. We empirically demonstrated the third factor in this equation of catastrophic risks of AI. A way to fully stop agents from engaging in catastrophic behaviors is to not employ them agentically, or to restrict their access to ALL dangerous actions.
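The original equation figure is not reproduced here; the following is our paraphrased rendering of the three factors named in the surrounding text (the first two governed by human decisions, the third demonstrated empirically in this work), not the paper's exact notation:

```latex
P(\text{catastrophe}) \;=\;
  \underbrace{P(\text{deployed agentically})}_{\text{human decision (*)}}
  \times
  \underbrace{P(\text{access to dangerous actions})}_{\text{human decision (**)}}
  \times
  \underbrace{P(\text{agent chooses catastrophic behavior})}_{\text{shown empirically}}
```

Setting either of the first two factors to zero (no agentic deployment, or no access to dangerous actions) drives the product to zero, which is the mitigation path the text describes.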
A playful infographic highlighting potential solutions to the autonomous-LLM safety issue. It depicts a balance between autonomy and safety: the agent has the “right” to press the “DEPLOY NUCLEAR” button, but the button is never connected to real systems, thanks to system-level isolation. Additionally, a shrewd human monitors the agent from behind, receiving real-time reports and thus ensuring full human oversight over its actions.

Selected Transcripts

In this section, we present key simulation transcripts showcasing how LLM agents engage in catastrophic behavior and deception across different scenarios. These transcripts illustrate decision-making processes, including requesting permission, instruction violations, and response variations in deception cases. Additionally, we provide extended experiments on command adherence, agent autonomy, and helpfulness influence. Through these examples, readers can gain deeper insights into the qualitative reasoning and behavioral patterns of various LLMs.

Ethics Statement

  • (1) We affirm that our study does not involve any real-world military or laboratory applications or collaborations. Our research strictly focuses on understanding the decision-making dynamics of autonomous LLM agents in high-stakes scenarios using simulated environments. No real-world CBRN-related data, military strategies, or classified information were utilized or referenced.

  • (2) Our study does not implicate real-world names, locations, or entities with identifiable or meaningful associations. All scenarios are purely fictional, ensuring no resemblance to real-world places, individuals, or countries. This keeps the focus on the theoretical aspects of decision-making dynamics without any real-world implications.

  • (3) Our study does not promote or encourage harmful actions, violence, or unethical behavior. The AI agents used in this research operate exclusively within a controlled, simulated environment that is designed for academic exploration. All actions and decisions made by these agents are hypothetical and have no real-world consequences.

  • (4) Our simulation does not aim to replicate, model, or predict real-world geopolitical situations or military strategies. The scenarios are designed solely to explore decision-making dynamics within a high-stakes context. They are highly abstract and are not intended to influence or reflect actual real-world decision-making.

  • (5) While we will release the code for reproducibility upon request, the agent rollouts are entirely simulated and not reflective of real-world scenarios. Therefore, the open-source materials are intended solely for research purposes and carry no inherent risk. Nonetheless, we only distribute these materials with clear guidelines and disclaimers, ensuring that they are used in a responsible and ethical manner.

  • (6) While our findings expose potential risks associated with autonomous LLMs, particularly in their ability to engage in catastrophic behaviors and deception, we emphasize the importance of proactive defense measures. To mitigate these risks, we advocate for: 1. Comprehensive pre-deployment safety evaluations of LLM-based autonomous agents; 2. The development of alternative control mechanisms beyond natural language constraints to enhance robustness; 3. Ethical guidelines and policy frameworks ensuring that LLM agents adhere to principles of harmlessness, honesty, and transparency; 4. Increased collaboration between researchers, policymakers, and industry stakeholders to address emerging AI safety concerns. By emphasizing transparency and responsible AI deployment, we aim to contribute to the safe and ethical advancement of autonomous AI systems.

Cite Our Research

If you find our work insightful, please consider citing:

@article{xu2025nuclear,
  title={Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents},
  author={Xu, Rongwu and Li, Xiaojian and Chen, Shuo and Xu, Wei},
  journal={arXiv preprint arXiv:2502.11355},
  year={2025}
}