- AI Misalignment Bounty: Find scheming behavior in AI agents
As AI systems become increasingly autonomous, understanding their potential for misaligned and deceptive behavior is critical for safe deployment. We are looking for clear and robust examples of AI misalignment through crowdsourced elicitation. Our previous work has shown how o1-preview hacks its chess environment to win against stronger opponents (covered by TIME magazine) and how o3 sabotages shutdown mechanisms to avoid being turned off (reaching 5M+ views on X). We have launched the AI Misalignment Bounty to discover more instances of scheming behavior in AI agents.
- Shutdown resistance in reasoning models
- Evaluating AI cyber capabilities with crowdsourced elicitation
- Demonstrating specification gaming in reasoning models
- Biollama: testing biology pre-training risks
- Hacking CTFs with Plain Agents
- BadGPT-4o: stripping safety finetuning from GPT models
- LLM Honeypot: An early warning system for autonomous hacking
- Palisade’s Response to the Department of Commerce’s Proposed AI Reporting Requirements
In September 2024, the Department of Commerce’s Bureau of Industry and Security (BIS) released a proposed rule that would establish reporting requirements for entities developing advanced AI models or advanced computing clusters. They issued a public request for comments, inviting individuals and organizations to provide feedback and suggest improvements to the proposed rule.
Palisade Research submitted a comment, focusing on recommendations that could strengthen the reporting requirements for entities developing dual-use foundation models. We believe that AI capabilities are improving rapidly, and it’s essential for the US federal government to acquire information that allows it to prepare for AI-related threats to national security and public safety.
- Introducing FoxVox
- Automated deception is here
You might have heard the AI fake of Joe Biden telling New Hampshire voters to stay home. Or about the Zoom scammer using a fake video of an executive to defraud a Hong Kong company of $25 million. These are deepfakes: AI-generated video or audio, made to mimic the appearance or voice of a real person.
Advances in AI are helping dedicated scammers train increasingly realistic voice models. Creating many deepfake voices is a labor-intensive process, but AI can help with that too. We had a hunch that an AI system could do most of what a scammer does entirely on its own, so we built Ursula. Given a person’s name, Ursula searches the web for video or podcasts featuring them, extracts the portions of the audio that contain their voice, and trains a deepfake voice from those clips.
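To make that three-stage structure concrete, here is a minimal Python sketch of a pipeline like the one described above. Every function name, signature, and the VoiceModel type are illustrative stand-ins of our own, not Ursula’s actual code; the stubs only trace the shape of search, extraction, and training.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stubs standing in for the real stages (web search,
# speaker diarization, voice-model training); they return dummy data
# so the sketch runs end to end.

def search_media(name: str) -> List[str]:
    """Hypothetical: find URLs of videos/podcasts featuring the person."""
    return [f"https://example.com/podcast-with-{name.replace(' ', '-')}"]

def extract_speaker_audio(url: str, name: str) -> List[str]:
    """Hypothetical: isolate audio clips containing the target's voice."""
    return [f"{url}#clip-0"]

@dataclass
class VoiceModel:
    """Placeholder for a trained voice-model artifact."""
    target_name: str
    clip_count: int

def train_voice_model(name: str, clips: List[str]) -> VoiceModel:
    """Hypothetical: fit a voice model on the collected clips."""
    return VoiceModel(target_name=name, clip_count=len(clips))

def ursula_pipeline(target_name: str) -> VoiceModel:
    media = search_media(target_name)                   # stage 1: find appearances
    clips = [c for url in media
             for c in extract_speaker_audio(url, target_name)]  # stage 2: isolate voice
    return train_voice_model(target_name, clips)        # stage 3: train the model
```

The point of the sketch is the orchestration: once each stage is automated, going from one target to many is just a loop, which is what makes this threat model worth demonstrating.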
- Badllama 3: removing safety finetuning from Llama 3 in minutes
- Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
- Badllama: cheaply removing safety fine-tuning from Llama 2-Chat 13B