The most important problem in AI safety today is not capabilities, not agents, not even alignment in the abstract. It is achieving verifiable inspection of frontier models before they make consequential decisions.
Every behavioral safety measure — RLHF, oversight, evals, control protocols — degrades against a model that can fake alignment. The 2024–2025 results on alignment faking, sleeper agents, and natural emergent misalignment have made this undeniable.
We work on the bottleneck: scaling mechanistic interpretability from today's chat-tuned models to the RL-trained reasoning systems that will define the next decade.