Average Guacamole
When the crowd mean is what you actually want
Apologies for the quiet stretch. Back with something from Simon Willison’s interview on Lenny’s podcast that I’ve been chewing on.
Simon said Claude turns out to be an excellent chef. Which doesn’t make sense, because it doesn’t have taste buds. But it can give you the global average of the world’s guacamole recipes, and the global average of guacamole is, somehow, good guacamole.
The model has no taste. The crowd does.
That’s the whole trick.
Millions of home cooks, over generations, have iterated recipes toward a median. The ratio of avocado to lime to onion to salt that everyone converges on is the ratio that works. The taste buds weren’t in the model. They were in the humans who wrote the training data. The model is a really good averaging machine pointed at a really good crowd.
“Average” used to be an insult. You didn’t aspire to the median advice, the median craftsman, the median restaurant. Then we built a machine that serves you the global average on demand, and it turns out the average is sometimes astonishing.
The engineering question isn’t “is the average good?” It’s “whose taste buds does the training data contain, and have they converged?”
The harness I’m building right now
I’m spinning up an agentic harness for a client this week, and I’m making a choice that would have embarrassed me five years ago. I’m using the defaults. The boring stack. The one every tutorial uses. The one any developer can pick up on day one without a Rosetta Stone.
The reason is downstream. This thing needs to be maintainable in six months by whoever is around. That “whoever” increasingly includes the model. And the model doesn’t care about my opinions. It cares about what it has seen a million examples of.
Pick a bespoke stack and the future support burden lands on whoever can read your particular dialect. Pick the boring average and the support burden lands on the global crowd, which includes every LLM trained since 2022. Every time Claude needs to patch a route handler in this codebase, it’s reaching for the same pattern it’s seen ten thousand times. The training data is doing free maintenance work for me.
The guacamole insight, applied to architecture: stack selection is now an AI-compatibility decision. Pick the median. Not because you can’t do better, but because the median is where the taste buds live.
Where the average is a trap
The average is excellent when humans have iterated well toward the mean: guacamole, boilerplate Python, Dockerfiles, SQL joins, onboarding READMEs. Most backend code is commodity; nobody’s reading your retry logic for style points. And if you know nothing about AppleScript, the average AppleScript is better than your bad guess. Simon’s point: he’d avoided the language for two decades because the learning curve was two months, and Claude collapsed that overnight.
But the median is a trap exactly where you’d expect it to be. The average tweet is boring. The average essay is forgettable. The average advice on your specialty is worse than you, because you’re already above the median there. If the training data doesn’t contain your problem, the average is confidently wrong, and the confidence is the dangerous part. The global average guac is not Oaxacan and not California. It’s fine, and fine is fine until you need not-fine.
Before asking the model for anything that matters, the useful question is: where did this model’s taste come from, and do I trust that crowd on this specific thing? Food blogs are a trustworthy crowd for guacamole. Stack Overflow is a trustworthy crowd for JavaScript idioms. Medium is not a trustworthy crowd for investment strategy. The crowd behind “How to have a difficult conversation with your co-founder” is mostly people who read the same five HBR articles.
The median of bad inputs is bad. The median of good inputs is good. The model does not distinguish. You have to.
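A toy sketch of that point, in Python. The numbers are invented for illustration (hypothetical tablespoons of lime juice per two avocados), and `crowd_summary` is a made-up helper, not anything a model actually runs. The same median call returns an equally confident number for both crowds; checking whether the crowd converged is left to the caller.

```python
from statistics import median

# Hypothetical data, invented for illustration only.
# A crowd that has iterated toward agreement:
converged_crowd = [1.0, 1.5, 1.5, 2.0, 2.0, 2.0, 2.5]
# A crowd that never agreed on anything:
unconverged_crowd = [0.0, 0.0, 0.5, 6.0, 8.0, 8.0, 9.0]


def crowd_summary(samples):
    """Return the median and the spread (max minus min) of a crowd's answers."""
    return {
        "median": median(samples),
        "spread": max(samples) - min(samples),
    }


good = crowd_summary(converged_crowd)
bad = crowd_summary(unconverged_crowd)

# Both medians look like perfectly reasonable single answers.
# Only the spread hints that the second crowd never converged,
# and nothing in median() checks that for you.
print(good)
print(bad)
```

The median alone gives no warning in the second case. The spread is a stand-in for the judgment call about whether this crowd is worth averaging.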
The reframe
The most valuable skill working with LLMs isn’t prompt engineering. It’s knowing the shape of the crowd behind the answer.
In commodity domains, lean on the average and ship. In your expert domain, treat the average as a baseline to beat. In novel domains, the average is noise and you should generate options, not answers.
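Written as a lookup, the rule above might look like the following Python sketch. The `Domain` enum and `how_to_use_average` are hypothetical names invented here, and classifying a given task into one of the three buckets is assumed to be a human judgment made before calling it.

```python
from enum import Enum


class Domain(Enum):
    """Hypothetical buckets for how well the crowd covers a problem."""
    COMMODITY = "commodity"  # crowd is large and has converged
    EXPERT = "expert"        # you are already above the crowd median
    NOVEL = "novel"          # the crowd has barely seen the problem


def how_to_use_average(domain: Domain) -> str:
    """Map a domain bucket to the way the model's average should be used."""
    if domain is Domain.COMMODITY:
        return "lean on the average and ship"
    if domain is Domain.EXPERT:
        return "treat the average as a baseline to beat"
    # Novel domain: the average is noise, so ask for options, not answers.
    return "generate options, not answers"


for d in Domain:
    print(f"{d.value}: {how_to_use_average(d)}")
```

The function is trivial on purpose. The hard part is the argument you pass in, which is the judgment about the shape of the crowd.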
For this week’s harness, the crowd is overwhelming and well-iterated and well-represented in every model I’ll touch. I’m picking defaults on purpose. The training data is my junior engineer and I want it to have seen this pattern before.
Related: “The Only Eval That Matters,” on how benchmarks don’t know whose taste they’re measuring.
Pick the median where the crowd has converged. Fight for distinction where it hasn’t. The taste buds are already in the training data, for the things the crowd had taste for.
The rest is still on you.


