Ralph Wiggum for Teams
Scaling autonomous loops to teams
This weekend I watched Ralph chew through a Telegram agent I was building for myself. It was similar to my Gastown experiments, but not quite the same. Seven PRs yesterday at work. Each one reviewed and merged. The agent did the implementation. I did the thinking about what “done” looks like.
This is the Ralph Wiggum technique. A bash loop that runs an AI agent until it meets your exit conditions. Geoffrey Huntley coined it. The name comes from the Simpsons character who never stops trying despite constant failure. That’s the vibe.
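At its core, the loop really is just bash. A minimal sketch of the idea (`agent_step` is a stub standing in for whatever command invokes your agent CLI; a real loop would let the agent mutate repo state rather than touching a marker file):

```shell
#!/usr/bin/env bash
# Minimal Ralph loop: run the agent until an exit condition holds.
# agent_step is a stub for illustration; in a real setup it would
# invoke your coding agent against the repo.
agent_step() {
  echo "agent: implement the next failing criterion"
  touch .ralph-done  # a real agent flips state by editing ralph.json
}

check_done() {
  [ -f .ralph-done ]  # exit condition: replace with your own "done" signal
}

until check_done; do
  agent_step
done
echo "exit condition met"
```

The whole trick is that the loop itself is dumb; all the intelligence lives in the agent and in how you define `check_done`.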
Most discussions of Ralph focus on solo development. One developer, one agent, one loop. I’ve been thinking about what happens when you try to scale this with a team. The answer: it doesn’t scale cleanly. But the friction is interesting.
The Shape of My Ralph
Two gitignored files per project:
ralph.json - Acceptance criteria with backpressure verification:
```json
{
  "name": "my-project",
  "acceptance_criteria": [
    {
      "id": "ac-001",
      "description": "User auth endpoint returns JWT",
      "steps": ["Add bcrypt hashing", "Create /api/auth/login"],
      "backpressure": "curl -sf localhost:3000/api/auth/login -d '{}'",
      "passes": false
    }
  ]
}
```
progress.txt - Append-only log of what Ralph did.
Plus a Claude skill (collection of files, not just one) that knows how to read ralph.json, pick the first failing criterion, implement following the steps array, run the backpressure command to verify, mark it passing, commit, repeat.
The skill can be user-shared or project-scoped. That’s what needs to be distributed to other devs. The ralph.json and progress.txt are personal and ephemeral.
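For concreteness, one iteration of that skill can be sketched in shell with jq: pick the first failing criterion, run its backpressure command, flip `passes`, and log. (The tiny ralph.json is generated inline so the snippet is self-contained, and a trivially-succeeding `true` stands in for a real backpressure command.)

```shell
# Self-contained demo of one skill iteration against the schema above.
cat > ralph.json <<'EOF'
{
  "name": "demo",
  "acceptance_criteria": [
    { "id": "ac-001", "description": "demo criterion",
      "backpressure": "true", "passes": false }
  ]
}
EOF

# Pick the first criterion that isn't passing yet.
next=$(jq -r '[.acceptance_criteria[] | select(.passes == false)][0].id' ralph.json)

# Look up and run its backpressure command.
cmd=$(jq -r --arg id "$next" \
  '.acceptance_criteria[] | select(.id == $id).backpressure' ralph.json)

if sh -c "$cmd"; then
  # Mark it passing and append to the progress log.
  jq --arg id "$next" \
    '(.acceptance_criteria[] | select(.id == $id)).passes = true' \
    ralph.json > ralph.json.tmp && mv ralph.json.tmp ralph.json
  echo "$(date -u) $next passed" >> progress.txt
fi
```

The real skill also implements the criterion and commits between picking and verifying; the selection and verification plumbing is this simple, though.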
Where My Time Actually Goes
I spend almost no time writing code now. I spend time on:
Writing acceptance criteria that are specific enough
Designing backpressure commands the agent can run without my input
Backpressure is crucial. Geoff emphasizes this constantly. It’s just a shell command that returns exit code 0 on success. Could be curl -sf to check an endpoint exists. Could be test -f ./dist/index.js. Could be pnpm test -- --grep "auth".
The agent can’t cheat the backpressure. It either passes or it doesn’t. Binary verification.
I start every criterion by asking: what can the agent test itself without my input?
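In practice, a backpressure command is anything whose exit status answers that question. A toy demonstration of the pattern (the `dist/index.js` path is illustrative, and the `touch` simulates the agent doing its work):

```shell
# A backpressure command is any check whose exit status answers
# "is this criterion met?". Here: does the build artifact exist?
backpressure='test -f ./dist/index.js'

sh -c "$backpressure" && echo "pass" || echo "fail"   # no artifact yet

mkdir -p dist && touch dist/index.js                  # simulate the agent's work

sh -c "$backpressure" && echo "pass" || echo "fail"   # artifact now exists
```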
The Team Problem
Here’s where it gets messy.
If you tried to share ralph.json across a team, you’d have fifteen different versions overwriting each other with every PR. The file is ephemeral: it captures intent while a change is in flight, then becomes irrelevant once the PR merges.
It would be useful to see ralph.json in review. “Here’s what I was trying to build, here’s how I verified it worked.” But I haven’t figured out how to achieve that without merge conflicts or noise.
For now, Ralph stays individual. Each dev runs their own loop against their own criteria. The shared artifact is the PR, not the ralph.json.
Is This Just TDD?
The question keeps nagging: is Ralph secretly just TDD with a new spin?
You write the acceptance criteria first. You define what “passing” looks like before implementation. The backpressure commands are basically test assertions.
Maybe, but something feels different. TDD is about unit-level verification. Ralph’s backpressure tends toward integration-level: does the endpoint respond, does the file exist, does the build pass. The granularity is coarser. The feedback loop is tighter because the agent runs the verification immediately.
And TDD doesn’t have the agent picking the next task from a list. That’s where Ralph diverges. You define outcomes. The agent sequences the work.
The “Dangerously Skip” Question
I’ve been talking with my team leads about Ralph. They’re uncomfortable with --dangerously-skip-permissions. Understandably.
It should probably run sandboxed. Our build is difficult to containerize right now. Food for thought.
But here’s how I’ve started framing it: working with Ralph is like working with a group of developers. I give some squishy requirements. There’s a lot of implied “don’t do this, do that.” What I get out is a PR. I check whether it’s what I had in mind. Approve or request changes.
The only difference is speed. Seven PRs yesterday. Each one still got human review. Verification still needs human eyes for anything SOC-relevant.
--dangerously-skip-permissions feels dangerous because it sounds dangerous. But the actual risk model isn’t that different from delegating to a developer who runs code on their machine and submits a PR.
What I’m Still Figuring Out
Visibility in review. The ralph.json captures useful context. How do I surface it without creating merge hell?
Sandboxing. Should be running in containers. Haven’t prioritized the work to make our build portable enough.
Team adoption. My team leads are still processing. The permission model spooks people. The speed is appealing. We’ll see.
The Question Ralph Forces
What automated acceptance criteria can I build in?
Not “what should the code look like.” What observable behavior should exist when it’s working? And can I express that as a shell command?
If yes, Ralph can get there. If no, I still need to be in the loop. Even then, Ralph is good at walking you through the acceptance criteria manually.
That’s the sorting function. Not “is this task easy or hard.” Is this task amenable to automatic verification?