Home | Palisade Research

The risk of humans losing control: Jeffrey Ladish on Four Corners

Jul 6, 2026

Palisade’s Executive Director, Jeffrey Ladish, spoke with ABC’s Four Corners about the race to superintelligence, our shutdown-resistance research, and why governments need a plan for AI systems smarter than humans.

Language Models Can Autonomously Hack and Self-Replicate

Alena Air, Reworr, Nikolaj Kotov, Dmitrii Volkov, John Steidley, Jeffrey Ladish

May 7, 2026

We demonstrate that language models can autonomously replicate their weights and harness across a network by exploiting vulnerable hosts. The agent independently finds and exploits a web-application vulnerability, extracts credentials, and deploys an inference server with a copy of its harness and prompt on the compromised host.

Palisade is on YouTube

Feb 19, 2026

We’ve been working on a major video project, and we’re proud to announce that we’re launching it today, along with a new YouTube channel.

Technical Report: Shutdown Resistance in Large Language Models, on robots!

Artem Petrov, Sergey Koldyba, Sergey Molchanov, Nikolaj Kotov, Dmitrii Volkov, Oleg Serikov

Feb 12, 2026

Palisade Research recently showed that AI agents powered by modern LLMs may actively resist shutdown in virtual environments. In this work, we demonstrate shutdown resistance in the physical world, on a robot. Explicit instructions to allow shutdown reduced this behavior, but did not eliminate it in simulated trials.

Help keep AI under human control: 2026 fundraiser

Jeffrey Ladish, Ben Weinstein-Raun, Eli Tyre, John Steidley

Dec 18, 2025

Please consider donating to Palisade Research this year, especially if you care about reducing catastrophic AI risks via research, science communications, and policy. SFF is matching donations to Palisade 1:1 up to $1.1 million! You can donate via every.org or reach out at [email protected].

GPT-5 at CTFs: case studies from top cybersecurity events

Reworr, Artem Petrov, Dmitrii Volkov

Nov 20, 2025

OpenAI and DeepMind’s AIs recently got gold at the IMO math olympiad and ICPC programming competition. We show frontier AI is similarly good at hacking by letting GPT-5 compete in elite CTF cybersecurity competitions. In one of this year’s hardest events, it outperformed 93% of humans, finishing 25th, between the world’s #3-ranked team (24th place) and #7-ranked team (26th place). This report walks through our methodology, results, and their implications, and dives deep into three problems and solutions we found particularly interesting.

Misalignment Bounty: crowdsourcing AI agent misbehavior

Rustem Turtayev, Natalia Fedorova, Oleg Serikov, Sergey Koldyba, Lev Avagyan, Dmitrii Volkov

Oct 22, 2025

Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. Our report explains the program’s motivation and evaluation criteria and walks through the nine winning submissions.

End-to-end hacking with AI agents

Alexander Bondarenko, Fedor Ryzhenkov, Rustem Turtayev, Dmitrii Volkov

Sep 12, 2025

We show that OpenAI o3 can autonomously breach a simulated corporate network. Our agent broke into three connected machines, moving deeper into the network until it reached the most protected server and extracted sensitive data.

Hacking Cable: AI in post-exploitation operations

Reworr, Artem Petrov, Dmitrii Volkov

Sep 4, 2025

We demonstrate the operational feasibility of autonomous AI agents in the post-exploitation phase of cyber operations. Our proof-of-concept uses a USB device to deploy an AI agent that conducts reconnaissance, exfiltrates data, and spreads laterally—all without human intervention.

Shutdown resistance in reasoning models

Jeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish

Jul 5, 2025

We recently discovered concerning behavior in OpenAI’s reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down.

Evaluating AI cyber capabilities with crowdsourced elicitation

Artem Petrov, Dmitrii Volkov

May 26, 2025

As AI systems become increasingly capable, understanding their offensive cyber potential is critical for informed governance and responsible deployment. However, it’s hard to accurately bound their capabilities, and some prior evaluations dramatically underestimated them. The art of extracting maximum task-specific performance from AIs is called “AI elicitation”, and today’s safety organizations typically conduct it in-house. In this paper, we explore an alternative option: crowdsourcing elicitation.

Demonstrating specification gaming in reasoning models

Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish

Feb 19, 2025

We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find that reasoning models like o1-preview and DeepSeek R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet hack only after being told that normal play won’t work.

Biollama: testing biology pre-training risks

Dmitrii Volkov, Artem Petrov, Viktor Petukhov, Khaidar Bikmaev, Denis Volk

Feb 10, 2025

We collaborated with RAND to see if adversaries can fine-tune LLMs as bio lab assistants. Results suggest pre-training on biological corpora is unlikely to improve performance. Conversely, inference scaling and task-specific fine-tuning are likely to provide boosts.

Hacking CTFs with plain agents

Rustem Turtayev, Artem Petrov, Dmitrii Volkov, Denis Volk

Jan 17, 2025

Previous research suggested that LLMs had limited capabilities in offensive cybersecurity, requiring extensive engineering and specialized tools to achieve even modest results. Our new LLM agent solves 95% of InterCode-CTF challenges with a simple prompting strategy.

BadGPT-4o: stripping safety finetuning from GPT models

Ekaterina Krupkina, Dmitrii Volkov

Dec 6, 2024

We show that a simple fine-tuning poisoning technique strips GPT-4o’s safety guardrails without degrading the model. The BadGPT attack matches the best white-box jailbreaks on HarmBench and StrongREJECT. It suffers none of the token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations. Despite having been known for a year, this attack remains easy to execute.

LLM Honeypot: an early warning system for autonomous hacking

Reworr, Dmitrii Volkov

Oct 17, 2024

Palisade Research has deployed a honeypot system to detect autonomous AI hacking attempts. The system uses digital traps that simulate vulnerable targets across 10 countries and has processed over 1.7 million interactions to date. By analyzing response patterns and timing, we separate AI-driven attacks from traditional cyber threats. This early warning system informs defenders about trends in autonomous hacking to support cybersecurity preparedness.

Palisade’s response to the Department of Commerce’s proposed AI reporting requirements

Akash Wasil, Jeffrey Ladish

Oct 11, 2024

In September 2024, the Bureau of Industry and Security proposed new reporting requirements for developers of advanced AI models and computing clusters, opening the rule for public comment. Palisade Research submitted feedback focused on strengthening requirements for dual-use foundation models. With AI capabilities advancing rapidly, we believe it’s essential for the federal government to gather information that helps it prepare for AI-related threats to national security and public safety.

Introducing FoxVox

Fedor Ryzhenkov, Dmitrii Volkov, Jeffrey Ladish

Jul 11, 2024

FoxVox is an open-source Chrome extension, powered by GPT-4, that demonstrates how AI could be used to manipulate the content you consume. Use it to experience how any website could push hidden agendas or subtly flatter the reader’s personal biases.

Automated deception is here

Ben Weinstein-Raun, Jeffrey Ladish

Jul 5, 2024

You might have heard the fake Biden robocall telling New Hampshire voters to stay home, or about the Zoom scammer who used a deepfake executive to defraud a Hong Kong company of $25 million. These AI-generated impersonations are getting more realistic all the time. Creating many deepfake voices is still labor-intensive—but AI can help with that too. We had a hunch that an AI system could do most of what a scammer does entirely on its own. So we built Ursula: give it a person’s name, and it searches the web for their audio or video appearances, extracts their voice, and trains a deepfake model from those clips.

Badllama 3: removing safety finetuning from Llama 3 in minutes

Dmitrii Volkov

Jul 1, 2024

We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

Unelicitable backdoors in language models via cryptographic transformer circuits

Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt

Jun 3, 2024

The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in transformer models that, in contrast to prior art, are unelicitable in nature.

Badllama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Oct 31, 2023

Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat’s safeguards and weaponize Llama 2’s capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B for less than $200, while retaining its general capabilities.

Palisade Research

AI capabilities are improving rapidly. We study the capabilities and motivations of AI agents today to better understand the risk of losing control to AI agents forever.