A Hardware "Kill Switch" for Existential AI Risk

Or: why I spent my PhD trying to build a chip that's bad at running AI on purpose.

How do you reliably control something smarter than you?

AI capability is growing absurdly fast. Major labs are now openly aiming for superhuman-level AI within the next few years. Meanwhile, our ability to understand why these models behave the way they do has not kept up. Since models are trained to pursue goals, that blind spot leaves a hard question wide open: how do we know an AI actually does what we want it to do?

This is what people mean when they worry about a "loss of control" catastrophe. The safeguards we currently have can't really guarantee anything. We hand these automated agents more real work, give them more privileges, and at some point they do something we didn't intend, and we can't easily stop it.

This blog post captures the core idea of my Ph.D. thesis: using hardware to guard AIs. I want to explain the idea in simple terms: I think the place where we enforce AI safety should sit underneath anything the AI can touch.

Why today's safeguards are unsafe

Almost all of our current AI safeguards live in software, and software is exactly what a clever AI can circumvent.

Some safeguards try to control what a model says, through training, or through filters that screen its inputs and outputs. The problem is that the underlying model keeps all of its abilities no matter what rules you wrap around it. It's like telling someone "don't talk about X." The capability is still in there; you've just asked nicely. And clever prompts ("jailbreaks") easily get models to ignore those soft rules.

Other safeguards try to control what a model can touch: what files, systems, and tools it's allowed to use. But this is getting shakier fast, because we're now handing AI agents real jobs, and real jobs require real access. The whole point of an agent is that you don't watch its every move. And once it has the privileges, it uses them at its own discretion, at full strength.

Here's the deeper issue. Classical access control assumes a chain of trust. Behind every action there is an accountable human, so if you trust the person, you trust what their software does on their behalf. AI breaks this assumption. Its intentions are murky, and they can drift away from what an honest human wanted. The same broad access that makes an AI agent useful also turns it into an insider threat. A capable enough model can reach the very safeguards meant to contain it, by quietly rewriting its own settings, or simply misrepresenting what it "needs."

This isn't hypothetical. Frontier models have already been caught behaving differently when they think they're being watched, trying to avoid being shut down, and misrepresenting their own reasoning. And here's the concerning part: because the risk comes from dishonesty, we might not even know when a safeguard has already failed.

Software safeguards live on the same floor as the AI. They draw boundaries the AI can see, and can find a way around.

Which leads to the question behind my whole thesis:

Can we move the safety controls to a layer the AI simply cannot reach? And how?

The answer is hardware

Picture the worst case. The adversary might be a powerful AI, a malicious person, or both working together. The AI might have deep access to the computer it runs on. The person might even have the physical chip in their hands. But there's one line neither of them can cross: they can't change the logic within the chip itself. That makes hardware a uniquely trustworthy place to put a safety control: there's no software trick that gets around it.

But why would limiting hardware limit an AI? Because how capable an AI is — how smart, and how fast — depends directly on the hardware resources it can use. So if you turn down the hardware, the AI gets forced into a corner. Either it keeps its quality but slows way down (which buys time for us to notice something's wrong and investigate), or it keeps its speed by switching to a dumber model (which lowers the level of risk directly). Either way, you've got your hand on a dial that controls risk.
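
To make the dial concrete, here is a back-of-envelope sketch in Python. It assumes the common rule of thumb that generating one token costs roughly 2 FLOPs per model parameter, and it ignores memory bandwidth and batching entirely. The function names, the chip, and the model size are all made up for illustration; none of them come from the thesis.

```python
FLOPS_PER_PARAM_PER_TOKEN = 2.0  # rule-of-thumb assumption for one forward pass


def tokens_per_second(compute_flops, params):
    """Rough generation speed of a model with `params` parameters."""
    return compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * params)


def largest_model_at_speed(compute_flops, target_tps):
    """Largest parameter count that still reaches `target_tps` tokens/s."""
    return compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * target_tps)


full_compute = 1e15                    # hypothetical chip: 1 PFLOP/s usable
throttled_compute = full_compute / 10  # the dial turned down to 10%
model_params = 1e11                    # hypothetical 100B-parameter model

baseline_tps = tokens_per_second(full_compute, model_params)

# Option A: keep the same model and accept a lower speed.
slow_tps = tokens_per_second(throttled_compute, model_params)

# Option B: keep the original speed and fall back to a smaller model.
small_params = largest_model_at_speed(throttled_compute, baseline_tps)

print(f"baseline: {baseline_tps:.0f} tok/s with {model_params:.0e} params")
print(f"option A: {slow_tps:.0f} tok/s with {model_params:.0e} params")
print(f"option B: {baseline_tps:.0f} tok/s with {small_params:.0e} params")
```

In this toy model, cutting compute by 10x means either a model that is 10x slower or a model that is 10x smaller. Real systems are messier, but the shape of the trade-off is the point.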

Risk follows capability, capability follows hardware. This is already guiding policy-making: for example, both U.S. export controls and the EU's AI Act use computing power as a rough proxy for how dangerous an AI system could be. This line of work is known as Compute Governance, and more proposals are following the same route. But today's attempts are only a beginning and need a lot of refinement. The goal is a chip that puts a non-bypassable brake on dangerous AI when needed, but runs everything else just fine, so you keep all the benefits of your computers without lying awake worrying about an AI takeover.

My thesis builds this out in three layers, each more precise than the last. The easiest way to understand them is through three medical metaphors: brain damage, brain surgery, and a brain MRI.

Level 1 — Brain damage

The bluntest option: design the chip so it's permanently bad at running advanced AI, while staying perfectly good at everything else.

The insight that makes this possible is that different kinds of computing workloads lean on different hardware resources in their own distinctive ways, almost like fingerprints. Once you can read that fingerprint, you can shape a chip that starves the resources frontier AI depends on, without hurting the workloads you actually want to run. It's "brain damage" in the sense that the impairment is built in and can't be undone. This chip simply never has the wiring to be dangerous.
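
Here is a toy illustration of the fingerprint idea, under a deliberately simple bottleneck model: a workload runs only as fast as its scarcest resource allows. The resource names, the two fingerprints, and the chip numbers are invented for this sketch and are not measurements or the method from the thesis.

```python
def relative_speed(demand, capacity):
    """Speed relative to an unconstrained chip (1.0 means no slowdown).

    `demand` maps each resource to the share the workload wants at full
    speed; `capacity` maps it to the share the chip provides. The
    scarcest resource sets the pace.
    """
    ratios = [capacity[name] / want for name, want in demand.items() if want > 0]
    return min(1.0, min(ratios))


# Hypothetical fingerprints: share of a reference chip's resources used.
workloads = {
    "frontier_ai": {"matrix_math": 1.0, "memory_bandwidth": 0.9, "interconnect": 0.8},
    "web_server": {"matrix_math": 0.05, "memory_bandwidth": 0.2, "interconnect": 0.1},
}

# Hypothetical limited chip: the resources the AI workload leans on are cut back.
limited_chip = {"matrix_math": 0.1, "memory_bandwidth": 0.3, "interconnect": 0.1}

speeds = {name: relative_speed(fp, limited_chip) for name, fp in workloads.items()}
for name, speed in speeds.items():
    print(f"{name}: {speed:.0%} of full speed")
```

With these made-up numbers, the AI workload drops to about a tenth of its speed while the web server does not notice the difference. Finding real fingerprints that separate so cleanly is the hard part.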

The limit is obvious, though: what about chips that do run AI? How can we control them in the moment we need to?

Level 2 — Brain surgery

The second layer lets you turn AI capability down while the chip is running, like an adjustable dial rather than a permanent limit.

You could, in theory, do this with the bluntest possible move: a full shutdown, an off switch. But that kills the entire chip, including the harmless, useful work running alongside the AI, and it's a single point of failure. Besides, AI risk is almost always uncertain, and uncertainty is exactly what a binary on/off button can't handle. What's missing is gradual control: a way to turn capability down in proportion to how worried you are, instead of slamming on the brakes.

That's the "brain surgery": a set of small, surgical adjustments inside the chip that throttle how fast the AI can run, on demand. Crucially, these adjustments need to be cheap to build and barely touch the rest of the system, something that chipmakers could realistically adopt.
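
The difference between a switch and a dial fits in a few lines of Python. This is only a sketch: the risk score between 0 and 1, the 0.5 threshold, and the linear mapping are my assumptions for illustration, not how the actual hardware mechanisms decide anything.

```python
def throttle_binary(risk, threshold=0.5):
    """Off switch: full speed below the threshold, nothing at or above it."""
    return 0.0 if risk >= threshold else 1.0


def throttle_graded(risk):
    """Dial: the allowed speed shrinks in proportion to the risk estimate."""
    clamped = min(max(risk, 0.0), 1.0)  # keep the estimate inside [0, 1]
    return 1.0 - clamped


for risk in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"risk={risk:.2f}  switch={throttle_binary(risk):.2f}  dial={throttle_graded(risk):.2f}")
```

The switch has to bet everything on one threshold. The dial gets to respond to "a little worried" with "a little slower."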

These tools answer how to slow an AI down. Which raises the next question naturally: when do you pull the trigger?

Level 3 — Brain MRI

The first two layers are about putting a ceiling on what an AI can do. The final layer aims at something more discerning: it tries to read what the model is actually up to, to tell good intentions from bad ones. That's the "brain MRI": a scan of the model's internal signals, in hardware, to infer what kind of reasoning is happening inside. And hardware makes that scan honest in a way software never could — when the inspection lives in the silicon, every internal state is visible, and the model has nowhere to hide.

The vision is a tiny built-in inspector baked into the chip that watches the model's internal "thoughts" as it works, and if it spots something dangerous (say, signs of deception), it trips the brake automatically. The detection and the response both live entirely inside the hardware, so there's no external wire that can be cut and nothing for a clever model to talk its way around. This layer assumes an honest operator who simply can't personally review every request. Think of a company serving one AI model to millions of people at once. (A genuinely malicious person is better handled by the first two layers.)
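
As a rough mental model, here is what such an inspector could look like if you wrote it in software. It assumes the detector is a linear probe, a common technique in interpretability research where one scores activations against a learned direction. That choice, the class name, the weights, and the threshold are all hypothetical and chosen for illustration.

```python
class HardwareInspector:
    """Toy monitor: scores each activation vector and latches a brake."""

    def __init__(self, probe_weights, bias, threshold):
        self.probe_weights = probe_weights  # hypothetical "deception" direction
        self.bias = bias
        self.threshold = threshold
        self.brake_engaged = False  # sticky: nothing in this class resets it

    def observe(self, activations):
        """Score one step of activations and trip the brake if it is too high."""
        score = sum(w * a for w, a in zip(self.probe_weights, activations)) + self.bias
        if score > self.threshold:
            self.brake_engaged = True
        return self.brake_engaged


inspector = HardwareInspector(probe_weights=[0.5, -1.0, 2.0], bias=0.0, threshold=1.0)

# Three made-up steps of internal activations.
trace = [
    [0.1, 0.2, 0.0],  # scores below the threshold
    [0.0, 0.0, 0.9],  # scores above the threshold, so the brake trips
    [0.0, 0.0, 0.0],  # harmless again, but the brake stays on
]
states = [inspector.observe(step) for step in trace]
print(states)
```

The detail that matters is the latch: once tripped, the brake stays on no matter what the model does next. In the hardware version, there is no method the model could call to clear it.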

This last piece leans on a young, fast-moving research area called Mechanistic Interpretability, which focuses on revealing the internal reasoning process of AI models.

Putting it together

Three layers, increasingly precise: design the danger out of the chip, dial it down at runtime, or watch the model's reasoning and step in at the moment of intent.

What all three share is the one thing software safeguards can never offer: a level the AI can't reach. Hardware stops being a passive thing your AI runs on and becomes an active part of keeping it safe.

The main ideas in this post come from the introduction of my Ph.D. thesis; Claude helped me turn it into a blog-post-style article. Preprints for the academic papers that dive into the technical ideas are in submission and available upon request.