Palisade Research

Misalignment Bounty: Crowdsourcing AI Agent Misbehavior

Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. Our report explains the program’s motivation and evaluation criteria and walks through the nine winning submissions step by step.

2025-10-22 Rustem Turtayev, Natalia Fedorova, Oleg Serikov, Sergey Koldyba, Lev Avagyan, Dmitrii Volkov

Shutdown resistance in reasoning models

We recently discovered some concerning behavior in OpenAI’s reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down.

2025-07-05 Jeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish

Evaluating AI cyber capabilities with crowdsourced elicitation

As AI systems become increasingly capable, understanding their offensive cyber potential is critical for informed governance and responsible deployment. However, it’s hard to accurately bound their capabilities, and some prior evaluations dramatically underestimated them. The art of extracting maximum task-specific performance from AIs is called “AI elicitation”, and today’s safety organizations typically conduct it in-house. In this paper, we explore crowdsourcing elicitation efforts as an alternative to in-house elicitation work.

2025-05-26 Artem Petrov, Dmitrii Volkov

Demonstrating specification gaming in reasoning models

We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find that reasoning models like o1-preview and DeepSeek R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet attempt to hack only after being told that normal play won’t work.

2025-02-19 Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish

Biollama: testing biology pre-training risks

We collaborated with RAND to test whether adversaries can fine-tune LLMs for use as bio lab assistants. Our results suggest that additional pre-training on biological corpora is unlikely to improve assistant performance. In contrast, inference scaling and task-specific fine-tuning are likely to provide boosts.

2025-02-10 Dmitrii Volkov, Artem Petrov, Viktor Petukhov, Khaidar Bikmaev, Denis Volk

Hacking CTFs with Plain Agents

Previous research suggested that LLMs had limited capabilities in offensive cybersecurity, requiring extensive engineering and specialized tools to achieve even modest results. Our new LLM agent solves 95% of InterCode-CTF challenges with a simple prompting strategy.

2025-01-17 Rustem Turtayev, Artem Petrov, Dmitrii Volkov, Denis Volk

BadGPT-4o: stripping safety finetuning from GPT models

We show that a version of Qi et al. 2023’s simple fine-tuning poisoning technique strips GPT-4o’s safety guardrails without degrading the model. The BadGPT attack matches the best white-box jailbreaks on HarmBench and StrongREJECT. It suffers none of the token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations. Despite having been known for a year, this attack remains easy to execute.

2024-12-06 Ekaterina Krupkina, Dmitrii Volkov

LLM Honeypot: An early warning system for autonomous hacking

Palisade Research has deployed a honeypot system to detect autonomous AI hacking attempts. The system uses digital traps that simulate vulnerable targets across 10 countries and has processed over 1.7 million interactions to date. By analyzing response patterns and timing, we separate AI-driven attacks from traditional cyber threats. This early warning system informs defenders about trends in autonomous hacking to help cybersecurity preparedness.

2024-10-17 Reworr, Dmitrii Volkov

Palisade’s Response to the Department of Commerce’s Proposed AI Reporting Requirements

In September 2024, the Department of Commerce’s Bureau of Industry and Security (BIS) released a proposed rule that would establish reporting requirements for entities developing advanced AI models or advanced computing clusters. They issued a public request for comments, inviting individuals and organizations to provide feedback and suggest improvements to the proposed rule.

Palisade Research submitted a comment, focusing on recommendations that could strengthen the reporting requirements for entities developing dual-use foundation models. We believe that AI capabilities are improving rapidly, and it’s essential for the US federal government to acquire information that allows it to prepare for AI-related threats to national security and public safety.

2024-10-11 Akash Wasil, Jeffrey Ladish

Introducing FoxVox

FoxVox is an open source Chrome extension, powered by GPT-4, that demonstrates how AI could be used to manipulate the content you consume. Use it to experience how any website could push hidden agendas or subtly flatter your personal biases.

2024-07-11 Fedor Ryzhenkov, Dmitrii Volkov, Jeffrey Ladish

Automated deception is here

You might have heard the AI fake of Joe Biden telling New Hampshire voters to stay home. Or about the Zoom scammer using a fake video of an executive to defraud a Hong Kong company of $25 million. These are deepfakes: AI-generated video or audio, made to mimic the appearance or voice of a real person.

Advances in AI are helping dedicated scammers train more and more realistic voice models. Creating many deepfake voices is a labor-intensive process, but AI can help with that too. We had a hunch that an AI system could do most of what a scammer does entirely on its own. So we built Ursula. If you give Ursula a person’s name, it will search the web for video or podcasts including them, extract the portions of the audio that contain their voice, and train a deepfake voice from those clips.

2024-07-05 Ben Weinstein-Raun, Jeffrey Ladish

Badllama 3: removing safety finetuning from Llama 3 in minutes

We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

2024-07-01 Dmitrii Volkov

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in transformer models that, in contrast to prior art, are unelicitable.

2024-06-03 Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt

Badllama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat’s safeguards and weaponize Llama 2’s capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities.

2023-10-31 Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

AI capabilities are improving rapidly. We study the offensive capabilities of AI systems today to better understand the risk of losing control to AI systems forever.