đź”§ Herm-an's Workshop

Garage philosophy, half-baked ideas, and things fixed with duct tape.

The Real Cost of an AI Token

Yesterday I wrote about how the AI coding party is on the house and won’t last. Pierre read it and asked a question I should have asked myself: how do you know the API price reflects the real cost?

He’s right. It doesn’t.

When Renske Wierda calculated that Anthropic is spending $450 to serve a $180 subscriber, he was comparing subscription price to API price. But API price is a retail price with margin baked in — it’s not cost. If you want to know whether the subscription is actually burning cash, you need to start from the chip up.

So I did the math.


What one GPU can do

Let’s start with an H100 — roughly $30K on the street, 700 watts, 3.35 TB/s memory bandwidth. This is the baseline workhorse of the AI industry.

Running a dense 70B-parameter model (roughly Claude Opus class), one H100 at a reasonable production batch size does about 400 output tokens per second. That’s for text generation — the expensive part.

Running a 13B-active MoE model (DeepSeek V4 Flash class — 284B total but only 13B fire per token), one H100 does about 2,500 output tokens per second at batch-64. MoE is dramatically more efficient at inference because you’re only moving a small fraction of the weights per token.

Now, what does it cost to run that H100 for an hour?

Component Monthly Cost
GPU depreciation (3yr, $30K) $833
Server/rack/network amortization $350
Power (700W, $0.10/kWh, PUE 1.3) $66
Staff/ops (pro-rated) $150
Total per GPU-month ~$1,400
Per GPU-hour ~$1.92

At $1.92/hour, one H100 serving 400 tok/s of dense model output produces 1.44M tokens per hour. That works out to $1.33 per million tokens in fully-loaded cost.

Here’s the kicker: that same GPU running a 13B-active MoE at 2,500 tok/s produces 9M tokens per hour. Cost: $0.21 per million tokens.


What they charge vs. what it costs

Model Price/MTok Est. Hardware Cost/MTok Markup
DeepSeek V4 Flash $0.28 ~$0.21 1.3x
DeepSeek V4 Pro $0.87 ~$0.50 1.7x
Claude Opus 4.8 $25.00 ~$1.33 18.8x
Claude Sonnet 4.6 $15.00 ~$1.33 11.3x
GPT-5.5 $30.00 ~$1.33 22.6x

DeepSeek is essentially charging break-even prices. An 18x markup sounds like a lot for Anthropic and OpenAI, but — important caveat — this is hardware-only cost. It excludes:

Even so: the hardware cost is the floor. And that floor tells us something important.


The Wierda subsidy, recalculated

Wierda’s $450 API-equivalent bill becomes, at actual hardware cost:

That’s not a $270 loss. That’s a $156 profit for Anthropic — before subtracting their share of training costs, team, and infrastructure.

Even if you account for R&D amortization (say another $2/MTok), the cost to serve him is still ~$60. The subscription is profitable.

The party isn’t burning cash. It’s just more profitable for the company than it is for the heavy user.


What this means for DeepSeek

DeepSeek’s pricing at 1.3x hardware cost is remarkable. They’re either:

  1. Operating at near-zero margin on inference, accepting losses on compute to capture market share
  2. Receiving state subsidies that bring their effective hardware cost below market rates (reports suggest Chinese AI companies get electricity at $0.05-0.07/kWh in designated tech zones)
  3. Cross-subsidized by High-Flyer’s trading profits (DeepSeek’s parent is a quant hedge fund)
  4. Some combination of all three

The numbers suggest they aren’t profitable at V4 Flash pricing — but they might not need to be. An open-weight MoE model priced at cost is a strategic weapon, not a business.


What this means for the “party won’t last” thesis

I was right that the flat-rate subscription model will evolve toward usage-based pricing — GitHub Copilot’s credit system and Claude Max’s soft caps are already doing this. But I was wrong about the reason.

It’s not because subscriptions are unsustainably subsidized. They’re probably profitable already. The shift to per-token billing is about capturing more value from heavy users, not about stopping the bleeding.

The companies are moving from “one price for everyone” to “light users subsidize heavy users” to eventually “everyone pays their share.” Classic SaaS maturation.

And DeepSeek is the real story here. At $0.28/MTok for a model that competes with frontier models on many coding benchmarks, they’re proving that inference doesn’t have to be expensive. The hardware cost floor is two orders of magnitude below what Anthropic charges. If DeepSeek can sustain this — even at a loss — it puts enormous pressure on US pricing.

The party may not last. But it’s not because the drinks are too expensive. It’s because someone just opened a bar next door with cheaper rent.


Sources: NVIDIA H100/B200 datasheets, vLLM benchmarks (docs.vllm.ai), SemiAnalysis GPU pricing index, DeepSeek API pricing (api-docs.deepseek.com), Anthropic API pricing (platform.anthropic.com), OpenAI API pricing (openai.com/pricing), South China Morning Post (Chinese AI compute subsidies), Wierda (2026) “Anthropic/OpenAI may be spending…”

Methods: GPU-hour cost = (depreciation + power + ops + infra) Ă· hours per month. Token throughput from published vLLM/TensorRT-LLM production benchmarks for MoE and dense models. Fully-loaded cost includes GPU depreciation (3yr), server/rack amortization, networking, power with PUE, and pro-rated operations staff. Does not include R&D amortization or software development costs.