The most important problem in AI safety today is not capabilities, not agents, not even alignment in the abstract. It is achieving verifiable inspection of frontier models before they make consequential decisions.
Every behavioral safety measure — RLHF, oversight, evals, control protocols — degrades against a model that can fake alignment. The 2024–2025 results on alignment faking, sleeper agents, and natural emergent misalignment have made this undeniable.
We work on the bottleneck: scaling mechanistic interpretability from today's chat-tuned models to the RL-trained reasoning systems that will define the next decade.