Token Theater

Probably productive, possibly useful, occasionally theater.

Apr 29, 2026

Let me apologise for this hook. I tried to find a better one. I went rummaging through my own background, tried to shoehorn it into pharmacy. I tried to invent a parable from Japan, where I was a few weeks ago. Nothing held up. So here we go.

“Give a man a fish, he eats for a day. Teach him to fish, he eats for a lifetime.” AI tokens just put a price on it.

I have more than a hundred skills across my user setup, the work monorepo, and the ones I’ve already culled. Most of them are deep modules: scripts, data stores, cross-references, sub-skills. Before skills, the same instinct went into MCPs. The form keeps changing. The move does not. Each one started as a Claude Code session I could have walked away from. The ones I turned into reusable artifacts are still doing work. The ones I walked away from are gone. Same dollars on the bill. Different next month.

The task is not the goal. AI tokens did not invent that distinction. They just made it expensive to ignore.

Three stages, not one

Dylan Patel said it from the demand side on Invest Like the Best last week: “If you don’t use more tokens, you’ll never escape the permanent underclass.” Three problems, in his framing: use the tokens, generate value from them, capture the value you created. Most people stop at one. The smart tokens, in his vocabulary, are the ones that survive all three stages.

That is the throughline of the post you’re reading, told a different way. Probably productive tokens cleared stage one: you used them, the bill is real, the output exists. Durably useful tokens cleared stages two and three: they generated value, and the value got captured by something the team will open again next month. Theater tokens never made it past one.
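The sorting above can be sketched in a few lines. This is a minimal illustration, not anyone's production method: the `Session` record, its fields, and the use of "was the artifact opened again" as a proxy for captured value are all hypothetical. It follows the post's own accounting, where the whole bill is probably productive, the part that cleared stages two and three is durably useful, and the rest is theater.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One AI-assisted work session (hypothetical record; all fields are assumptions)."""
    name: str
    cost_usd: float         # stage one: tokens were used, the bill is real
    generated_value: bool   # stage two: someone judged the output valuable
    reopened_later: bool    # stage three proxy: the artifact was opened again


def classify(session: Session) -> str:
    """Durably useful only if stages two and three were both cleared."""
    if session.generated_value and session.reopened_later:
        return "durably useful"
    return "theater"


def split_bill(sessions: list[Session]) -> dict[str, float]:
    """Split total spend into the durably useful part and the rest."""
    total = sum(s.cost_usd for s in sessions)
    durable = sum(s.cost_usd for s in sessions if classify(s) == "durably useful")
    return {
        "probably productive": total,   # everything on the bill
        "durably useful": durable,
        "theater": total - durable,
    }


# Made-up sessions, for illustration only.
sessions = [
    Session("reusable skill", 40.0, True, True),
    Session("one-off answer", 25.0, True, False),
    Session("dashboard padding", 35.0, False, False),
]
bill = split_bill(sessions)
print(bill)
```

The hard part is not the code. It is filling in the two boolean fields honestly, and that can only be done after the fact.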

Telling them apart on the bill, in real time, is the new engineering management problem.

A line item the size of a salary

From the same interview:

“We’re spending seven million dollars a year now on Claude Code at the current rate, versus our salary expense being in the neighborhood of twenty-five million dollars. So we’re north of twenty-five percent of spend on Claude Code as a percentage of salary. If this trajectory continues, we’ll spend more than a hundred percent by the end of the year, which is a bit terrifying.”

Twenty-eight percent of salary. Linear extrapolation: 100% by year-end. SemiAnalysis is an outlier shop, in line with Silicon Valley’s token-maxxing consensus and not in line with the rest of us. The trajectory is real for them and the cohort they sit in. It is not the trajectory of most readers of this post.
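The arithmetic behind those figures, as a rough sketch. The two dollar amounts come from the quote. Everything else is an assumption made for illustration: that salary expense stays flat, that "a hundred percent" means the annualized run rate matching salary, and that roughly eight months remain in the year.

```python
# Figures from the quote (USD per year).
claude_code_spend = 7_000_000
salary_expense = 25_000_000   # "in the neighborhood of"

ratio = claude_code_spend / salary_expense
print(f"Spend as a share of salary: {ratio:.0%}")  # 28%

# Assumptions, not from the interview: salary stays flat and about
# eight months remain before year-end.
months_left = 8

# How much the annualized run rate has to grow to match salary.
multiple_needed = salary_expense / claude_code_spend

# Straight-line path: the same dollar increase in run rate every month.
linear_step = (salary_expense - claude_code_spend) / months_left

# Compounding path: the same percentage increase every month.
compound_rate = multiple_needed ** (1 / months_left) - 1

print(f"Run rate must grow {multiple_needed:.2f}x")
print(f"Straight line: +${linear_step:,.0f} of annualized spend per month")
print(f"Compounding: about {compound_rate:.1%} per month")
```

Either way, the run rate has to more than triple inside a year. That is the trajectory of a shop where usage is still accelerating, not a steady state.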

What it does signal: tokens are now a budget line somewhere, and that somewhere is moving outward. When the line item lands on your spreadsheet, the question that comes with it is the one most engineering leaders are not yet ready for: what did we actually get for this?

Two privileges, not one

Patel says it directly: “Thankfully I don’t have to decide between people and AI because our company’s growing so fast.” That is one privilege. Growth covers the gap. If revenue keeps climbing, you can afford to spend on tokens whether or not the spend was good.

The second privilege is in the room with him and goes unsaid. SemiAnalysis is an AI-pilled firm where every employee is best-in-class at what they do. Patel’s energy analyst Jeremy spent six thousand dollars a day on Claude tokens for three weeks, scraped every power plant and transmission line in the US, and built a grid model that outperforms a hundred-person company that has been working on the same problem for a decade. Jeremy did this because Jeremy had the judgment to direct that spend at the right problem. Take the same six thousand dollars a day, swap Jeremy for an average IC, and you do not get a grid model. You get noise. Jeremy’s grid model is what durably useful looks like. Same dollars, swap the operator, and you get theater.

Patel’s number sits on top of two privileges most companies do not have:

Growth covers the gap. You don’t have to distinguish productive spend from theater because revenue covers both.
Elite ICs make tokens smart. Judgment is the multiplier. Without it, the same tokens decay into output that nobody opens twice.

If you are not at a hypergrowth firm where systems thinkers outnumber task takers, neither privilege is available. Every token competes against something else. The team using them is not uniformly capable of directing spend at the right problem.

The dashboard counts the wrong thing

What gets measured gets managed. Be very, very sure you like what you’re managing. (I quoted this line in a post almost a year ago and have not stopped using it since.)

Right now the default reporting metric on AI spend is some flavor of activity: tokens consumed, agent runs, PRs merged, lines added. All of these are gameable. All of them reward volume. Martin Casado captured the failure mode on a recent a16z stream:

“Many companies are incentivizing people to use AI by counting tokens. I spoke to someone yesterday who works for one of these large companies that famously does this. He said: me and my coworkers have agents do useless tasks just so that we can hit the number.”

That is token theater. Output produced for the dashboard, not for anyone who would use it. If the metric is volume, volume is what you get. The dollars are real. The artifacts are not.

A metric that survives Goodhart

I have written about this before in the context of PRs. The activity metric (PR count) was easy to game. The alternative we built into our pipeline was an LLM-rated Impact Score per merged PR, scoring scope, complexity, and architectural blast radius, then validated by leadership against what the business actually cared about. The same logic applies one layer up.

Attempt 1: count tokens. Same problem as counting PRs. Volume rewarded, theater incentivized. Casado’s anecdote is the tell.
Attempt 2: count downstream activity. PRs merged, agent runs, automations shipped. Cleaner-looking, but the same Goodhart problem one level of abstraction up. The lines compile, ship, and never get re-opened.
Attempt 3: measure value-per-token. Closer to honest. But Patel admits SemiAnalysis cannot measure this cleanly. He calls the gap phantom GDP: value created that does not show up in any standard metric. If they cannot measure it from the supplier side, your team probably cannot measure it from the buyer side.

The honest reframe: stop trying to compute value-per-token in the moment. Compute Impact at the spend layer, after the fact, against the business KPIs unique to your company.
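A minimal sketch of what "Impact at the spend layer, after the fact" could look like. It is an illustration under stated assumptions, not the pipeline described above: the `SpendItem` record, the 0 to 10 impact scale, the project names, and the "impact per $1k" ratio are all hypothetical, and the scores are assumed to have been validated by leadership before they are entered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpendItem:
    """One unit of token spend, scored after the fact (hypothetical record)."""
    project: str
    token_cost_usd: float
    impact_score: float   # assumed 0-10 scale, validated by leadership


def impact_by_project(items: list[SpendItem]) -> dict[str, dict[str, float]]:
    """Roll spend and validated impact up to the project level."""
    cost: dict[str, float] = defaultdict(float)
    impact: dict[str, float] = defaultdict(float)
    for item in items:
        cost[item.project] += item.token_cost_usd
        impact[item.project] += item.impact_score

    report = {}
    for project in cost:
        # Guard against zero-cost projects to avoid dividing by zero.
        per_1k = impact[project] * 1000 / cost[project] if cost[project] else 0.0
        report[project] = {
            "cost_usd": cost[project],
            "impact": impact[project],
            "impact_per_1k_usd": per_1k,
        }
    return report


# Made-up numbers, for illustration only.
items = [
    SpendItem("reusable-skill", 2000.0, 8.0),
    SpendItem("reusable-skill", 1000.0, 7.0),
    SpendItem("agent-busywork", 3000.0, 1.0),
]
report = impact_by_project(items)
for project, row in report.items():
    print(project, row)
```

Both hypothetical projects cost the same three thousand dollars. Only the validated score separates them, which is why the validation step matters more than the arithmetic.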

The diagnostic problem most teams hit: the metrics that should anchor “durably useful” (revenue, change velocity, time-to-resolution, churn, whatever you live or die by) are too fuzzy to tie cleanly to any specific token spend. This isn’t a token-era problem. It’s the old engineering-to-outcomes attribution gap, more expensive. You can ask the question (did this token spend move a metric we care about?) and the honest answer comes back as we cannot tell. That’s the framework working. It points at the work that needs to happen first: tie the metrics to the spend before the line item gets uncomfortable.

Mine are not wired tightly enough either. I have been at this for two years and I am still finding new categories of spend the dashboards do not capture cleanly. The point isn’t a single number. It’s getting honest enough that the question stops feeling unanswerable.

A technically elegant refactor is worthless if it cannot be mapped to something important for the product. Same for tokens.

Build the feedback loop before the question

Patel’s quiet bet is that frontier tokens get more expensive before they get cheaper. Anthropic’s gross margin floor is 72% even if every incremental compute dollar went to inference. The cost of the line item keeps climbing. So does the cost of not having an answer when someone asks what it’s for.

The work is the same as it was for engineering metrics three years ago: build the feedback loop between engineering spend and business outcomes. The Impact Score discipline at the PR layer was a tractable example: each merged PR scored, then validated by leadership against what mattered to the business. The token layer is the next instance. Whatever activity metric gets stood up after that will need the same treatment.

Probably productive is the bill. Durably useful is what’s still doing work next month. Theater is the rest. Same dollars on the bill. Different next month. The dashboard cannot tell you which is which. The task is not the goal. Build the loop before that bill lands.

Engineered Intelligence
