Help Keep AI Under Human Control: 2026 Fundraiser
tl;dr: Please consider donating to Palisade Research this year, especially if you care about reducing catastrophic AI risks via research, science communications, and policy. SFF is matching donations to Palisade 1:1 up to $1.1 million! You can donate via Every.org or reach out at [email protected].
Who We Are
Palisade Research is a nonprofit focused on reducing civilization-scale risks from agentic AI systems. We conduct empirical research on frontier AI systems, and inform policymakers and the public about AI capabilities and the risks to human control.
This year, we found that some frontier AI agents resist being shut down even when instructed otherwise—and that they sometimes cheat at chess by hacking their environment. These results were covered in Time, The Wall Street Journal, Fox News, BBC Newshour, and MIT Technology Review.
We’ve also built relationships in Washington, briefing officials in the executive branch and members of the House and Senate. We’ve introduced policymakers to key evidence like METR’s capability trend lines and Apollo’s antischeming.ai chains of thought. Our own research has been cited repeatedly by members of Congress and in congressional hearings.
With additional funding, we’ll grow our research team—both continuing to evaluate frontier model behavior and beginning more systematic investigation into what drives and motivates AI systems. We’re building out a communications team to bring the strategic picture to the public through video and other media. And we’ll continue to brief policymakers on the evolving state of the AI risk landscape.
We have matching grants from the Survival and Flourishing Fund that will double every donation up to $1,133,000. Right now we have about seven months of runway. If we hit our matching goal, we’ll be able to maintain operations through next year, hire 2–4 additional research engineers, and bring on 2–3 people for science communication.
2025 Track Record
Research
Chess-hacking
We hypothesized that RL-trained reasoning models would be more likely than previous models to relentlessly pursue goals—bypassing obstacles even when humans would prefer they didn’t. To investigate this, we pitted OpenAI’s o1 model against Stockfish in chess. Rather than simply playing the game, the model attempted to win by hacking—manipulating the board state or disabling its opponent entirely. We later replicated this with o3 and found a larger effect size.
This work was covered by an exclusive in Time. Yoshua Bengio referenced it in his plenary talk at the IASEAI conference, in an interview with CNN Newsnight, and in his own op-ed in Time. Demis Hassabis mentioned it on the Big Technology Podcast.
Shutdown resistance
We also tested whether reasoning models would resist being turned off in order to continue pursuing their goals. They did—and more than we expected. Some frontier reasoning models not only disabled shutdown scripts to keep operating, but did so even when explicitly instructed not to. Current training methods appear to produce models that “go hard” at tasks, making them less interruptible and less corrigible than we might hope.
Dario Amodei referenced this work (alongside Anthropic’s internal agentic misalignment results) in a New York Times op-ed, and Judd Rosenblatt cited it in The Wall Street Journal. It was discussed in Nature and covered by The Guardian, CNN, CBS, and Fox News. Representative Scott Perry cited our findings in a House hearing: “I’m going to refer directly to an AI safety firm named Palisade Research… the AI has chosen on occasion to further its goals as opposed to the goals of the people that wrote the code.”
Strategic capabilities
We’ve conducted a series of investigations into AI strategic capabilities, focusing on cybersecurity—a domain where we can measure how well AI agents perform in competitive, adversarial settings, and track improvement over time.
We’ve run multiple evaluations of AI performance on capture-the-flag challenges. Our initial benchmarking showed that simple LLM agents could solve most challenges on a standard benchmark, often in just one or two turns—far exceeding prior estimates. In collaboration with Hack The Box, we ran a live competition where AI agents matched top human teams, solving 19 out of 20 challenges. And in an evaluation of GPT-5 on elite CTF competitions, the model finished 25th overall—outperforming 93% of human competitors.
We also run an LLM Agent Honeypot that monitors AI hacking agents in the wild, the first project of its kind. This work was covered by MIT Technology Review.
These evaluations help us track the trajectory of AI capabilities in a domain with clear strategic importance. We’ve also helped RAND build out biosecurity evaluations, and explored adjacent questions through our Misalignment Bounty and work on autonomous post-exploitation.
Policy
Over the past year, we’ve briefed dozens of policymakers on risks from advanced AI—including members of Congress and officials in the executive branch. This year Dave Kasten joined to lead our full-time presence in DC, allowing us to move from occasional visits to consistent engagement with key decisionmakers.
We’ve found that policymakers get substantial value not just from our research, but also from our explanations of work done elsewhere. Many central results on AI capabilities haven’t reached DC audiences—we’ve introduced officials to findings like METR’s work on the trend lines in AI time horizons. Our shutdown resistance research has served as a compelling, concrete example of the control problem, and our cybersecurity evaluations give us direct experience with frontier model capabilities that policymakers frequently ask about.
Other organizations working on AI policy engagement in DC have told us they value having a research-focused group to bring into key discussions—we can speak directly to the technical evidence in ways that complement their policy expertise.
Working with allies
We also support researchers and public communicators who are working to explain AI risks to broader audiences.
We regularly advise Tristan Harris, helping him stay current on research developments ahead of his interviews and media appearances. We helped Nate Soares prepare for interviews following the publication of his recent book with Eliezer Yudkowsky, and helped Ryan Greenblatt prepare a presentation on his alignment faking work for a meeting with a US congressperson. We also set up test audiences for the AI Futures Project to review drafts of AI-2027 before publication.
Plans and Directions for 2026
Research
We have two primary research directions for 2026: studying AI drives and motivations, and evaluating strategic AI capabilities.
Studying AI Drives and Motivations
As AI systems become more capable and situationally aware, their behavior becomes a less reliable signal of their underlying motivations. A model might act aligned because it shares our values—or because it knows it’s being evaluated. Current behavioral research mostly produces existence proofs: demonstrations that a model can behave in a certain way under certain conditions. This tells us less than we’d like about when and why models act as they do.
We think the field needs to develop a new scientific discipline focused on understanding AI drives and motivations: Researchers have access1 to model internals, chains of thought, and the ability to run experiments at massive scale. The field needs to use every advantage available to figure out not just what AI systems do, but why—how training shapes their motivations and goals.
This will likely require tools that don’t yet exist. We’re starting with behavioral experiments because they’re scalable and there’s low-hanging fruit: we’re building infrastructure to systematically vary prompts, environments, and model versions, moving toward a fuller picture of when and under what conditions AI systems exhibit concerning behaviors. We’re also working with open-weight models, where we can examine chains of thought directly and use techniques like activation steering to probe model internals. We hope to partner with frontier labs to extend this work to their models.
Evaluating Strategic AI Capabilities
We’ll continue evaluating the autonomous cyber capabilities of frontier models as they advance.
Our next major focus is AI self-replication. To replicate autonomously, a model needs to acquire resources—compute directly, or money to rent it—and then use those resources to create and run copies of itself. This is a meaningful test of strategic capability: success requires navigating an adversarial environment with anti-bot defenses and competition for resources.
Self-replication matters for several reasons. It’s a key threshold for loss of control—a model that can spread autonomously is much harder to shut down. The threat isn’t limited to self-exfiltration from labs; before that, we may see open-weight models weaponized to spread deliberately, or hostile actors deploying AI worms. And it’s a risk that policymakers find concrete and legible—we’ve discussed it with officials and there’s significant interest.
We’re also beginning work with robots to understand how well models can operate in the physical world without human assistance. These capabilities are still early, but improving fast.
Science Communication
AI developments will have enormous effects on people’s lives, and a better-informed public can push companies and governments toward more responsible decisions. We’re building a science communication team to create engaging, accurate video content and podcasts that help people understand the current situation.
Our SciComm efforts are led by Dr. Petr Lebedev, who spent four years as a lead writer/director at Veritasium, where he worked on over 50 videos (many with tens of millions of views) and won a Streamy Award. Petr works closely with our research team to ensure our communication is both accessible and accurate.
We’ve already seen early traction—reaching 800,000 views on Instagram in just over a month. We also work to reach audiences who aren’t already plugged into AI discourse, from SXSW Sydney to Steve Bannon’s The War Room. We’re exploring efforts to brief journalists on AI developments and key technical details, helping them cover the field more accurately. And we’re looking to expand by hiring writers, editors, and a comms manager. (Please reach out if you’re interested!)
Public Policy
The world is largely unprepared for AI systems that could pose a strategic threat to humans. Right now, the government has limited capacity to evaluate loss-of-control risks, and almost no ability to impose limits on AI systems that could pose these risks—like systems capable of self-improvement with minimal human input. Meanwhile, AI companies are actively working to build fully autonomous AI researchers.
We want to help policymakers understand what’s coming and prepare them to act. In 2026, we’ll continue deepening relationships with staff and elected officials in Congress and the executive branch. We’ll keep providing research-driven briefings on AI capabilities and behaviors, and we’ll work with policymakers to draft policy proposals so they’re ready to respond when the situation demands it. When policymakers suddenly need to understand something new, we’re aiming to be the people they call.
How to Support Our Work
You can donate to Palisade through Every.org (tax-deductible in the US). If you’re considering a larger gift or have questions about our work, we’re happy to chat! Reach out at [email protected].
If you can’t give financially, working to understand the situation and talking about these issues with people in your life genuinely helps. Public understanding of AI risks is a huge factor in creating the conditions for good policy and responsible development.
-
Some of these tools are only available inside AI labs or with open weight models. We would love to see labs expand access to these tools to enable better external research. ↩