Open Weights, Closed Wallet
Z.ai dropped GLM-5.2 on June 16th. It scored 51 on the Artificial Analysis Intelligence Index, ahead of DeepSeek V4 Pro, Kimi K2.6, and everything else with open weights. MIT license. 1M-token context. IndexShare, a real architecture trick that cuts per-token compute 2.9x at full context. #2 on the Code Arena WebDev leaderboard, behind only Claude Fable 5. On paper, this is the best open-weight model money can’t buy.
Except you can buy it. The weights are on Hugging Face. You just can’t run it.
The Number That Matters
753 billion parameters. ~40 billion active per token — Mixture-of-Experts, so only a fraction fires per call. Sounds efficient. Here’s the number that actually matters: the full BF16 weights are 1.51 terabytes.
An H100 has 80GB of VRAM. You’d need nineteen just to hold the weights, before any KV cache. Even at Unsloth’s most aggressive quant, 1-bit dynamic, you still need 176GB. The only consumer machine that clears that bar is a Mac Studio M3 Ultra with 256GB+. That’s $9,500 for 3–9 tokens per second. Not “interactive.” That’s “queue it up before lunch.”
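The arithmetic behind those figures is easy to check. Here is a minimal sketch, assuming decimal units (1 GB = 10^9 bytes), 2 bytes per BF16 parameter, and weights only (KV cache and activations would push the GPU count higher). The function names are illustrative, not from any library.

```python
import math

PARAMS = 753e9        # total parameters, from the article
BYTES_PER_BF16 = 2    # BF16 stores each parameter in 16 bits
H100_VRAM_GB = 80     # VRAM per H100, from the article


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def gpus_needed(total_gb: float, vram_gb: float) -> int:
    """Minimum GPU count to hold the weights alone (no KV cache, no activations)."""
    return math.ceil(total_gb / vram_gb)


bf16_gb = weights_gb(PARAMS, BYTES_PER_BF16)
print(f"BF16 weights: {bf16_gb / 1000:.2f} TB")
print(f"H100s for BF16 weights only: {gpus_needed(bf16_gb, H100_VRAM_GB)}")
print(f"H100s for the 176GB quant:   {gpus_needed(176, H100_VRAM_GB)}")
```

Note that the ~40 billion active parameters don't shrink any of this: every expert has to be resident in memory, because any of them can be routed to on the next token.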
What “Open” Actually Means
This is the tension the local-AI crowd keeps stepping over. “Open weights” is not “open to use.”
Counterargument: Rent cloud GPUs by the hour. Sure. But if you’re renting cloud GPUs to run GLM-5.2, your data still goes through someone else’s data center. Your prompts still touch foreign RAM. The open license protects you from the Fable 5 problem — the off-switch pulled by government order. It doesn’t protect your privacy unless you own the iron.
Counterargument: Quantize harder. At the 2-bit dynamic quant you need 241GB. At 1-bit, 176GB. Below that, the quality cliff is real: Willison’s side-by-side test got a gorgeous pelican and then a tragically bad opossum in the same session. This model only pays off on hardware that can run it at reasonable fidelity.
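One way to read those quant sizes is to convert them back into average bits per weight. This is a rough sketch, assuming the quoted memory figures are dominated by the weight file and using decimal gigabytes; `effective_bits_per_weight` is an illustrative helper, not an Unsloth API.

```python
PARAMS = 753e9  # total parameters, from the article


def effective_bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average bits stored per parameter for a weight file of the given size."""
    return size_gb * 1e9 * 8 / params


# Sizes from the article; 1506 GB is the BF16 baseline (753e9 params * 2 bytes).
for label, size_gb in [("BF16", 1506), ("2-bit dynamic", 241), ("1-bit dynamic", 176)]:
    bits = effective_bits_per_weight(size_gb)
    print(f"{label:>14}: {size_gb:>5} GB, about {bits:.2f} bits/weight")
```

Under these assumptions the “1-bit” quant averages closer to 1.9 bits per weight and the “2-bit” closer to 2.6, which is consistent with dynamic quants keeping some tensors at higher precision. The labels are nominal, so there is less room left to squeeze than the names suggest.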
What They Got Right
IndexShare isn’t a benchmark gimmick. Reusing a lightweight indexer across four sparse-attention layers is genuine co-design: the architecture is shaped around making million-token context affordable to serve. Latent Space has the breakdown. If you’re building agentic coding pipelines (multi-file refactors where a single session burns 43k output tokens), this is probably the best open option. The question is whether you hit the API at $4.40/million output tokens or build infrastructure to host it yourself.
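To put rough numbers on that choice, here is a back-of-the-envelope sketch using only figures quoted in this piece. It counts output tokens only (input pricing isn't given here), ignores electricity and your own time, and treats the $9,500 Mac Studio at 3–9 tokens per second as the self-hosting option. All function names are illustrative.

```python
import math

API_USD_PER_M_OUTPUT = 4.40      # USD per million output tokens, from the article
SESSION_OUTPUT_TOKENS = 43_000   # output tokens in one agentic session, from the article
HARDWARE_USD = 9_500             # Mac Studio M3 Ultra price, from the article


def api_cost_usd(output_tokens: int, usd_per_million: float = API_USD_PER_M_OUTPUT) -> float:
    """API cost of generating the given number of output tokens."""
    return output_tokens / 1_000_000 * usd_per_million


def local_hours(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours to generate the tokens locally at a steady rate."""
    return output_tokens / tokens_per_second / 3600


def sessions_to_break_even(hardware_usd: float, cost_per_session: float) -> int:
    """Sessions needed before API spend would equal the hardware price."""
    return math.ceil(hardware_usd / cost_per_session)


per_session = api_cost_usd(SESSION_OUTPUT_TOKENS)
print(f"API cost per session: ${per_session:.2f}")
print(f"Local time per session at 9 tok/s: {local_hours(SESSION_OUTPUT_TOKENS, 9):.1f} h")
print(f"Local time per session at 3 tok/s: {local_hours(SESSION_OUTPUT_TOKENS, 3):.1f} h")
print(f"Sessions to match ${HARDWARE_USD:,}: {sessions_to_break_even(HARDWARE_USD, per_session):,}")
```

On these assumptions a session costs about 19 cents through the API and takes one to four hours on the Mac Studio. That doesn't settle the privacy argument, but it does show what owning the iron costs in time.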
The Honest Call
For most of you, GLM-5.2 is a signpost, not a tool.
If you’re running a 24GB card at home, the 30B-class models are where you live. They fit. They run fast. They’re good enough for daily work. Picking the biggest model on the leaderboard is rarely the right local decision. Picking the biggest one you can actually run well — that’s the skill.
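A quick fit check makes “the biggest one you can actually run well” concrete. This is a sketch under loudly stated assumptions: roughly 4.5 bits per weight for a typical 4-bit quant, roughly 8.5 for an 8-bit one, and 20% of VRAM held back for KV cache and runtime overhead. Those three numbers are illustrative guesses, not measurements, and the helper names are hypothetical.

```python
def quantized_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB at a given average bit width."""
    return params * bits_per_weight / 8 / 1e9


def fits(params: float, bits_per_weight: float, vram_gb: float, headroom: float = 0.2) -> bool:
    """True if the weights fit after reserving a fraction of VRAM for cache and overhead."""
    return quantized_gb(params, bits_per_weight) <= vram_gb * (1 - headroom)


VRAM_GB = 24  # the home card from the article

# (label, parameter count, assumed average bits per weight)
candidates = [
    ("30B at ~4-bit", 30e9, 4.5),
    ("30B at ~8-bit", 30e9, 8.5),
    ("GLM-5.2 at 1-bit dynamic", 753e9, 176e9 * 8 / 753e9),  # 176GB from the article
]
for label, params, bits in candidates:
    size = quantized_gb(params, bits)
    verdict = "fits" if fits(params, bits, VRAM_GB) else "does not fit"
    print(f"{label:>26}: {size:6.1f} GB -> {verdict} in {VRAM_GB}GB")
```

With these guesses, a 30B model at 4-bit lands around 17GB and leaves room to work, the same model at 8-bit already overflows the card, and GLM-5.2 misses by a factor of seven.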
GLM-5.2 proves Chinese labs can compete on frontier capability while keeping MIT licensing. It proves the architecture to handle million-token context exists. It proves the open ecosystem can keep pace. It also proves something less comfortable: the hardware bar for frontier open models has left the consumer lane.
The weights are open. The wallet isn’t. Those two facts are going to collide more and more.
Sources: Vetted Consumer, Simon Willison, Artificial Analysis, Hugging Face, Latent Space