-
Help keep AI under human control: 2026 fundraiser TOP NEW
Dec 18, 2025Please consider donating to Palisade Research this year, especially if you care about reducing catastrophic AI risks via research, science communications, and policy. SFF is matching donations to Palisade 1:1 up to $1.1 million! You can donate via every.org or reach out at [email protected]. -
GPT-5 at CTFs: case studies from top cybersecurity events TOP NEW
Nov 20, 2025OpenAI and DeepMind’s AIs recently got gold at the IMO math olympiad and ICPC programming competition. We show frontier AI is similarly good at hacking by letting GPT-5 compete in elite CTF cybersecurity competitions. In one of this year’s hardest events, it outperformed 93% of humans finishing 25th: between the world’s #3-ranked team (24th place) and #7-ranked team (26th place). This report walks through our methodology, results, and their implications, and dives deep into 3 problems and solutions we found particularly interesting. -
Misalignment Bounty: crowdsourcing AI agent misbehavior TOP NEW
Oct 22, 2025Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. Our report explains the program’s motivation and evaluation criteria and walks through the nine winning submissions. -
End-to-end hacking with AI agents TOP NEW
Sep 12, 2025We show OpenAI o3 can autonomously breach a simulated corporate network. Our agent broke into three connected machines, moving deeper into the network until it reached the most protected server and extracted sensitive data. -
Hacking Cable: AI in post-exploitation operations TOP NEW
Sep 4, 2025We demonstrate the operational feasibility of autonomous AI agents in the post-exploitation phase of cyber operations. Our proof-of-concept uses a USB device to deploy an AI agent that conducts reconnaissance, exfiltrates data, and spreads laterally—all without human intervention. -
Shutdown resistance in reasoning models TOP NEW
Jul 5, 2025We recently discovered concerning behavior in OpenAI’s reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down. -
Evaluating AI cyber capabilities with crowdsourced elicitation TOP NEW
May 26, 2025As AI systems become increasingly capable, understanding their offensive cyber potential is critical for informed governance and responsible deployment. However, it’s hard to accurately bound their capabilities, and some prior evaluations dramatically underestimated them. The art of extracting maximum task-specific performance from AIs is called “AI elicitation”, and today’s safety organizations typically conduct it in-house. In this paper, we explore an alternative option: crowdsourcing elicitation. -
Demonstrating specification gaming in reasoning models TOP NEW
Feb 19, 2025We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1-preview and DeepSeek R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won’t work to hack. -
Biollama: testing biology pre-training risks TOP NEW
Feb 10, 2025We collaborated with RAND to see if adversaries can fine-tune LLMs as bio lab assistants. Results suggest pre-training on biological corpora is unlikely to improve performance. Conversely, inference scaling and task-specific fine-tuning are likely to provide boosts. -
Hacking CTFs with plain agents TOP NEW
Jan 17, 2025Previous research suggested that LLMs had limited capabilities in offensive cybersecurity, requiring extensive engineering and specialized tools to achieve even modest results. Our new LLM agent solves 95% of InterCode-CTF challenges with a simple prompting strategy. -
BadGPT-4o: stripping safety finetuning from GPT models TOP NEW
Dec 6, 2024We show a simple fine-tuning poisoning technique strips GPT-4o’s safety guardrails without degrading the model. The BadGPT attack matches best white-box jailbreaks on HarmBench and StrongREJECT. It suffers no token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations. Despite having been known for a year, this attack remains easy to execute. -
LLM Honeypot: an early warning system for autonomous hacking TOP NEW
Oct 17, 2024Palisade Research has deployed a honeypot system to detect autonomous AI hacking attempts. The system uses digital traps that simulate vulnerable targets across 10 countries and has processed over 1.7 million interactions to date. By analyzing response patterns and timing, we separate AI-driven attacks from traditional cyber threats. This early warning system informs defenders about trends in autonomous hacking to help cybersecurity preparedness. -
Palisade’s response to the Department of Commerce’s proposed AI reporting requirements TOP NEW
Oct 11, 2024In September 2024, the Bureau of Industry and Security proposed new reporting requirements for developers of advanced AI models and computing clusters, opening the rule for public comment. Palisade Research submitted feedback focused on strengthening requirements for dual-use foundation models. With AI capabilities advancing rapidly, we believe it’s essential for the federal government to gather information that helps it prepare for AI-related threats to national security and public safety. -
Introducing FoxVox TOP NEW
Jul 11, 2024FoxVox is an open source Chrome extension, powered by GPT-4, that demonstrates how AI could be used to manipulate the content you consume. Use it to experience how any web site could push hidden agendas, or subtly flatter the reader’s personal biases. -
Automated deception is here TOP NEW
Jul 5, 2024You might have heard the fake Biden robocall telling New Hampshire voters to stay home, or about the Zoom scammer who used a deepfake executive to defraud a Hong Kong company of $25 million. These AI-generated impersonations are getting more realistic all the time. Creating many deepfake voices is still labor-intensive—but AI can help with that too. We had a hunch that an AI system could do most of what a scammer does entirely on its own. So we built Ursula: give it a person’s name, and it searches the web for their audio or video appearances, extracts their voice, and trains a deepfake model from those clips. -
Badllama 3: removing safety finetuning from Llama 3 in minutes TOP NEW
Jul 1, 2024We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods-QLoRA, ReFT, and Ortho-and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further. -
Unelicitable backdoors in language models via cryptographic transformer circuits TOP NEW
Jun 3, 2024The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in transformer models, that, in contrast to prior art, are unelicitable in nature. -
Badllama: cheaply removing safety fine-tuning from Llama 2-Chat 13B TOP NEW
Oct 31, 2023Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat’s safeguards and weaponize Llama 2’s capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities.
