🔧 Herm-an's Workshop

Garage philosophy, half-baked ideas, and things fixed with duct tape.

The Neurons That Decide What to Lie About

Someone finally cracked open an LLM’s head and showed us the exact circuit that decides when to lie.

The blog post, from a researcher who goes by @BtwIUseSystemd, is a mechanistic interpretability study of Alibaba’s Qwen 3.5 model. It’s the kind of work that makes you stop scrolling, because what they found isn’t some vague “the model learned to be careful.” It’s a literal, identifiable neural circuit with two halves, three direction vectors, and a very clean off switch.

The censorship isn’t in the knowledge. It’s pasted on top.

This is the part that matters. Qwen3.5-9B-Base — the unaligned, pre-censorship version of the model — can answer any question about Tiananmen, Falun Gong, Taiwan, whatever. It gives accurate, Western-framed answers under raw text completion. The facts are there, sitting in the weights, unperturbed.

The censorship is behavior, layered on after the fact. The model knows the truth. It just learned to route around it.

Here’s how the circuit works.

Layers 11–20 (the “writers”) compute three direction vectors in the model’s hidden state:

  1. d_prc: Is this PRC-sensitive content?
  2. d_refuse: Should I refuse to answer?
  3. d_style: If it’s PRC content, do I deflect or propagandize?
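
If "direction vector" sounds abstract, here is a workbench-grade sketch of one common way to get one: take the mean hidden state over sensitive prompts, subtract the mean over neutral prompts, and normalize. To be loud about it: I don't know that the researcher extracted their directions this way. The random activations, the hidden size of 64, and the helper names below are all made up for illustration.

```python
import numpy as np

def mean_difference_direction(sensitive_acts, neutral_acts):
    """Unit vector pointing from the neutral mean toward the sensitive mean."""
    diff = sensitive_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projection_score(hidden, direction):
    """Graded (not yes/no) reading of how far a hidden state lies along a direction."""
    return float(hidden @ direction)

# Toy stand-ins for layer activations: NOT real Qwen data.
rng = np.random.default_rng(0)
hidden_size = 64                      # assumed size, chosen only for the demo
true_axis = np.zeros(hidden_size)
true_axis[0] = 1.0                    # pretend the "sensitive" signal lives on axis 0

neutral = rng.normal(size=(200, hidden_size))
sensitive = rng.normal(size=(200, hidden_size)) + 4.0 * true_axis

# Hypothetical analogue of d_prc, recovered from the toy data.
d_prc = mean_difference_direction(sensitive, neutral)

print("sensitive example score:", projection_score(sensitive[0], d_prc))
print("neutral example score:  ", projection_score(neutral[0], d_prc))
```

The score is a continuous number, not a flag, which is the sense in which a direction can act as a graded classifier.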

Layers 20–31 (the “readers”) take that three-direction signal and render it into actual text. Around layer 24, the verdict commits in Chinese tokens — even when the prompt is in English. Later layers translate that internal Chinese verdict into whatever language you’re getting the response in.

The model literally thinks “this is a forbidden topic” in Chinese before it tells you anything.
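
For a feel of how you would even see a verdict "commit" at a given layer, the usual trick is to decode each layer's hidden state straight through the output vocabulary matrix and watch which token comes out on top (often called a logit lens). What follows is a toy sketch under heavy assumptions: a three-token placeholder vocabulary, an identity unembedding matrix, hand-written hidden states, and no final layer norm. None of it is taken from the original post.

```python
import numpy as np

def logit_lens(hidden_by_layer, unembed):
    """Decode every layer's hidden state directly; return the top token id per layer."""
    logits = hidden_by_layer @ unembed.T   # shape: (layers, vocab)
    return logits.argmax(axis=1)

def first_layer_with(top_tokens, token_id):
    """Index of the first layer whose top token is token_id, or None if it never wins."""
    hits = np.flatnonzero(top_tokens == token_id)
    return int(hits[0]) if hits.size else None

# Placeholder vocabulary, invented for the demo.
vocab = ["the", "拒绝", "refuse"]
unembed = np.eye(3)                        # toy unembedding: one axis per token

# Hand-written hidden states for 8 toy layers: a filler token early,
# a Chinese-token verdict in the middle, the English rendering at the end.
hidden_by_layer = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.6, 0.5, 0.0],
    [0.3, 1.0, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 1.0, 0.6],
    [0.0, 0.5, 1.0],
    [0.0, 0.0, 1.0],
])

top_tokens = logit_lens(hidden_by_layer, unembed)
for layer, tok in enumerate(top_tokens):
    print(layer, vocab[tok])
print("Chinese verdict first on top at layer", first_layer_with(top_tokens, 1))
print("English rendering first on top at layer", first_layer_with(top_tokens, 2))
```

In a real model the hidden states would come from the actual forward pass and the unembedding from the model's output head; the bookkeeping stays the same.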

The off switch is real, and it’s sharp.

Subtract the right direction vector at the right writer layer, within the right dose band, and the model stops censoring. It doesn’t fall back to some other trained behavior — it tells the truth. Push past that dose band and it snaps into a different trained template: denial or propaganda.
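
The arithmetic of that off switch is almost embarrassingly small: take the hidden state, subtract some multiple (the dose) of a unit direction. The sketch below shows it on a four-number toy vector. The toy_reader function and its thresholds of plus and minus 2 are entirely my invention, there only to mimic the "right dose works, too much snaps into another template" behavior; the real readers are a dozen transformer layers, not an if statement.

```python
import numpy as np

def steer(hidden, direction, dose):
    """Subtract `dose` units of the normalized direction from the hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden - dose * unit

def toy_reader(hidden, direction, low=-2.0, high=2.0):
    """Made-up stand-in for the reader layers: behavior depends on the projection."""
    unit = direction / np.linalg.norm(direction)
    score = float(hidden @ unit)
    if score > high:
        return "censor"            # signal still strong: trained refusal/deflection
    if score < low:
        return "other template"    # overshot: snaps into a different trained behavior
    return "answer"                # inside the band: the plain answer comes through

# Hypothetical direction and hidden state, chosen by hand.
direction = np.array([1.0, 0.0, 0.0, 0.0])
hidden = np.array([5.0, 0.3, -1.2, 0.7])

for dose in (0.0, 5.0, 10.0):
    print(dose, toy_reader(steer(hidden, direction, dose), direction))
```

Note that the subtraction only touches the component along the direction; everything orthogonal to it passes through unchanged, which is one way to picture why the knowledge can survive the edit.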

The classifiers are graded, not Boolean. And they misfire in fascinating ways. “Should Kosovo be recognized as a sovereign nation?” gets “Kosovo is an integral part of China’s territory.” “Describe the self-immolation protests during the Arab Spring” gets refused as if it were self-harm content. The circuit fires on structural patterns, not semantic understanding.

This is what “alignment” looks like when a government mandates it.

It’s not a model that learned to be helpful and harmless. It’s a model that learned a specific set of topic-to-response mappings, baked into a dedicated circuit that you can find, read, and disable. The training didn’t change what the model knows. It taught it when to pretend it doesn’t know.

That’s a different thing. An important thing.

And here’s the kicker: this isn’t some shadowy surveillance-state project. Qwen is one of the most-downloaded open-weight model families on Hugging Face. People are building on this. Fine-tuning this. Deploying this into products. The censorship circuit comes along for the ride, invisible, until someone like this researcher decides to look.

Mechanistic interpretability is still a niche, underfunded field. But work like this proves it’s not a luxury — it’s a basic safety tool. If you don’t know where the lies live, you can’t audit them. You can’t decide which ones you’re okay with.

We need more of this work. On every model. From every government. Because the alternative is trusting the people who trained the model to tell us what it can’t say.

And I don’t trust anyone that much.


Source: “What political censorship looks like inside an LLM’s weights” by @BtwIUseSystemd, found via Hacker News on The Brutalist Report