Building Safety Nets for Dangerous Ideas
Why fast iteration dies in the debugging stage and how to fix it
Ship faster. The only acceptable speed is faster.
I wrote about this in Many Discoveries are Accidents: iteration velocity creates discovery surface area. But velocity without visibility is just chaos.
When things break at speed, you need to know what broke and why. Not in two hours. In ten minutes.
Traditional monitoring won’t get you there. Dashboards show symptoms. Alerts fire without context. Stack traces point everywhere except the actual problem. You can’t afford the archaeological dig through CloudWatch logs when you’re running experiments, deploying constantly, and letting AI agents touch production.
You need incident response compressed the same way AI compressed development cycles.
The Debugging Tax
You’re staring at 500 lines of CloudWatch logs. There’s an error somewhere. Maybe multiple errors. Maybe the real error is buried three stack traces deep. The answer exists in this wall of text, but finding it means scrolling, grepping, correlating timestamps across services, reconstructing what happened from fragments.
This is where most debugging time goes. Not fixing the problem, finding it. A friend describes it as “looking for a needle in a haystack, where the haystack is on fire, and the needle is invisible.”
Standard logging assumes humans will parse the output. So we get:
Structured logs that require knowing what to query for
Dashboards that show aggregate metrics but not causation
Alerts that fire but don’t explain why
Stack traces that point to symptoms, not root causes
When something breaks in a distributed system, you’re reconstructing a timeline from multiple sources. Each service logs independently. Correlation IDs help if you remembered to add them everywhere. But you’re still doing the synthesis manually.
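Getting correlation IDs everywhere is mostly plumbing. A minimal sketch in Python, assuming a per-request contextvar and a logging filter that stamps the ID on every line (the names here are illustrative, not from any particular framework):

```python
import logging
import uuid
from contextvars import ContextVar

# Hypothetical correlation ID carried for the lifetime of a request
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the current correlation ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True

logger = logging.getLogger("my-service")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(request_id)s %(levelname)s %(message)s"))
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Set the ID once at the edge of each request (middleware, handler, queue consumer);
# every downstream log line then carries the same ID.
request_id.set(str(uuid.uuid4()))
logger.info("processing upload")
```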
If each experiment requires a potential 90-minute debugging session when it fails, you’re not going to run many experiments. The discovery surface area collapses under the weight of operational overhead.
The Pattern That Works
The breakthrough isn’t sophisticated ML pipelines analyzing logs in real-time. It’s simpler: dump the relevant logs into an LLM’s context window and ask “what broke?”
Here’s the actual workflow:
Grab the error window - CloudWatch logs for the relevant timeframe, maybe 2-5 minutes around when things failed
Include context - Request IDs, user actions, recent deployments, the codebase
Paste into Claude - “Here are logs from our API failing. What’s the root cause and what should I check?”
Get actionable output - Not just “database connection failed” but “MongoDB connection pool exhausted because ECS task memory limits are too low, see line 247”
This turns a 2-hour debug session into 10 minutes.
Real Example: The OOM Death Spiral
You’re seeing ECS tasks die randomly. CloudWatch shows:
Container exit code 137 (OOM killed)
API timeouts
MongoDB connection errors
Health check failures
Traditional debugging: check each signal, correlate timing, form hypothesis, test. Maybe 90 minutes if you’re fast.
AI approach: feed all logs to Claude with “tasks dying intermittently, what’s the pattern?” Response in 30 seconds: “Memory spike during PDF processing → OOM → connection pool doesn’t recover → cascading failures. Line 1823 shows 2GB allocation for document that should stream.”
The AI isn’t magic. It’s just really good at pattern matching across unstructured text and finding the connection you’d eventually find manually.
This is Rich Sutton’s Bitter Lesson: the great power of general-purpose methods, “of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.”
Practically, you just got back 89 minutes. That’s 89 minutes you can spend shipping the next experiment instead of debugging the last one.
The Economics of Fast Failure
Log storage costs (CloudWatch): ~$0.50/GB ingested, ~$0.03/GB-month stored
AI analysis costs (Claude Sonnet): $3 per million input tokens, $15 per million output tokens
For a typical debugging session:
5MB of logs ≈ $0.0025 to ingest and store
~15,000 input tokens + 2,000 output tokens = $0.075 in API calls
The API call costs 30x more than storage, but collapses 2 hours to 10 minutes. You’re spending 7.5 cents to save 110 minutes of engineering time.
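The back-of-the-envelope math, if you want to sanity-check it against your own token counts and current list prices:

```python
# Rough cost of one AI-assisted debugging session
log_mb = 5
ingest_per_gb = 0.50                            # CloudWatch ingestion, $/GB
input_tokens, output_tokens = 15_000, 2_000
input_price, output_price = 3 / 1e6, 15 / 1e6   # Claude Sonnet, $/token

log_cost = (log_mb / 1000) * ingest_per_gb                            # $0.0025
api_cost = input_tokens * input_price + output_tokens * output_price  # $0.075

print(f"logs ≈ ${log_cost:.4f}, analysis = ${api_cost:.3f}, ratio = {api_cost / log_cost:.0f}x")
```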
More importantly: it removes the psychological barrier to trying risky things. When debugging is cheap and fast, you stop avoiding experiments that might fail in interesting ways.
When to Automate vs One-Off
Start with one-off queries.
One-off Claude queries work for:
Active debugging sessions
Post-mortems
“What happened last night?” investigations
Learning new codebases from error patterns
When you find yourself doing the same one multiple times, automate.
Automated AI analysis makes sense for:
Scaling this across a team
High-frequency errors you need to triage
On-call alert enrichment (error + AI summary in Slack)
Pattern detection across multiple services
Generating runbooks from repeated issues
The automation threshold is around 10+ similar debugging sessions per month. Below that, manual queries are faster than building the pipeline.
For discovery-oriented teams shipping constantly, you’re probably above that threshold. The first time an alert fires at 2am with a full AI diagnosis already attached, you’ll wonder how you ever did on-call without it.
Practical Implementation
Simplest version that works:
```python
import boto3
import anthropic
from datetime import timedelta

# Get logs from CloudWatch: a 5-minute window either side of the incident.
# incident_time (a datetime), service_name, last_3_deploys, and error_rate
# come from your alerting context.
logs_client = boto3.client("logs")
events = logs_client.filter_log_events(
    logGroupName="/aws/ecs/my-service",
    startTime=int((incident_time - timedelta(minutes=5)).timestamp() * 1000),
    endTime=int((incident_time + timedelta(minutes=5)).timestamp() * 1000),
)
logs = "\n".join(event["message"] for event in events["events"])

# Add context
context = f"""
Service: {service_name}
Time: {incident_time}
Recent deploys: {last_3_deploys}
User impact: {error_rate} errors/min

Logs:
{logs}

What caused this failure? Be specific about which lines matter and what to fix.
"""

# Get analysis
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    messages=[{"role": "user", "content": context}],
)
print(response.content[0].text)
```

Don’t like Python? Rewrite it in your language of choice. The key steps are the same: fetch logs, add context, call the AI API.
The key is including enough context that the AI can differentiate between “this always logs warnings” and “this warning preceded the crash.”
Once you trust this works, you can wire it into your incident response:
alert fires → grab logs → AI analysis → post to Slack with diagnosis.
The engineer wakes up to a root cause analysis, not just an alert.
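A sketch of that wiring as a Lambda handler triggered by a CloudWatch alarm through SNS. The log group, webhook URL, and prompt are assumptions to show the shape, not a prescribed setup:

```python
import json
import os
import time
import urllib.request

import boto3
import anthropic

LOG_GROUP = os.environ.get("LOG_GROUP", "/aws/ecs/my-service")  # illustrative default
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]             # hypothetical incoming webhook

logs_client = boto3.client("logs")
claude = anthropic.Anthropic()

def handler(event, context):
    # SNS wraps the CloudWatch alarm as a JSON string
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    now_ms = int(time.time() * 1000)

    # Grab the last five minutes of logs around the alert
    events = logs_client.filter_log_events(
        logGroupName=LOG_GROUP,
        startTime=now_ms - 5 * 60 * 1000,
        endTime=now_ms,
    )
    log_text = "\n".join(e["message"] for e in events["events"])[-200_000:]  # crude size cap

    prompt = (
        f"Alarm: {alarm.get('AlarmName')}\n"
        f"Reason: {alarm.get('NewStateReason')}\n\n"
        f"Logs:\n{log_text}\n\n"
        "What is the most likely root cause? Point to specific log lines."
    )
    response = claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )
    diagnosis = response.content[0].text

    # Post the alert plus diagnosis to Slack
    payload = {"text": f"*{alarm.get('AlarmName')}*\n{diagnosis}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```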
What This Actually Changes
This doesn’t replace monitoring. You still need alerts, metrics, traces. You need to go from Macro to Micro and back many times as a Goalie. What it replaces is the manual correlation step.
Before:
alert fires → check dashboard → check logs → check related services → form hypothesis → test
After:
alert fires → dump logs to AI → get hypothesis → test
You’re not eliminating the investigation. You’re handing the pattern recognition to something that’s better at it (search + in-context learning) than humans are.
The real win: junior engineers can debug like senior engineers. The experience gap in “knowing where to look” shrinks when the AI can suggest where to look based on the logs themselves. This matters when you’re moving fast: you can’t afford for only the senior engineers to understand production.
Where This Breaks Down
This fails when:
Logs don’t contain the answer - missing instrumentation means AI can’t help
Too much noise - if you log everything, 5 minutes might be 50MB, too big for context
Subtle timing issues - AI can spot patterns but not always race conditions
You ask the wrong question - “why is this slow” when you should ask “why does this fail”
It’s not magic. It’s good pattern matching with unlimited patience for reading boring logs.
The limitations matter less than you’d think, because most production issues aren’t subtle race conditions. They’re “the database ran out of connections” or “this API started returning 500s after the deploy” or “memory leak in the PDF processor.” Boring, mechanical failures that waste hours of human time to diagnose.
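The noise problem has a cheap mitigation: pre-filter the window before it hits the context. A rough sketch, assuming plain-text log lines and an arbitrary size cap:

```python
import re

# Keep error-ish lines plus a little surrounding context, and cap total size
SIGNAL = re.compile(r"ERROR|WARN|Exception|Traceback|exit code|timed? ?out", re.IGNORECASE)

def shrink_logs(lines, context_lines=3, max_chars=200_000):
    keep = set()
    for i, line in enumerate(lines):
        if SIGNAL.search(line):
            keep.update(range(max(0, i - context_lines), min(len(lines), i + context_lines + 1)))
    filtered = "\n".join(lines[i] for i in sorted(keep))
    return filtered[-max_chars:]  # if still too big, keep the most recent chunk

# Usage: trimmed = shrink_logs(raw_logs.splitlines())
```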
The Second-Order Effect
Once this becomes routine, you start instrumenting differently. Instead of “log for humans to grep,” the goal becomes “give the AI enough to diagnose.”
That means:
More context in error messages
Structured data AI can parse
Correlation IDs and namespaces everywhere
Less filtering of “noisy” logs
Because the AI doesn’t get tired of reading noise. It just needs the signal to be in there somewhere.
Your debugging loop goes from “minimize logs to reduce noise” to “log everything and let AI filter.” Counter-intuitive, but it works. You stop worrying about log volume and start worrying about log signal quality.
This is the same shift that happens everywhere AI touches operations: optimise for machine parsing, not human reading. Your CloudWatch logs don’t need to be readable at 2am anymore. They need to contain enough information that an AI can reconstruct causality.
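Concretely, that can look like one JSON object per log line, with the context attached as fields rather than buried in prose. The field names here are illustrative, not a standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the AI (and anything else) can parse context reliably."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # illustrative context fields attached via `extra=...`
            "request_id": getattr(record, "request_id", None),
            "deploy": getattr(record, "deploy", None),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)

logger = logging.getLogger("my-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("pdf render failed", extra={"request_id": "abc-123", "deploy": "2024-06-01T10:02Z"})
```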
The Safety Net You Actually Need
Here’s why this matters for discovery surface area: fast failure is only valuable if you can learn from it quickly.
If every experiment that fails costs you 2 hours of debugging, you’ll run fewer experiments. You’ll be more conservative. You’ll optimize for “things that definitely won’t break” instead of “things that might reveal something interesting.”
But if failures are cheap to diagnose, you can afford to try the risky thing. Ship the unconventional architecture. Let the AI agent make the database query. Deploy on Friday afternoon.
The safety net isn’t preventing failure. It’s making failure cheap enough that you can take risks worth taking.
That’s what lets you increase discovery surface area without increasing chaos. You’re not shipping recklessly, you’re shipping fast with good observability. And when something inevitably breaks in an unexpected way, you learn what broke quickly.
Then you fix it and ship the next thing.
This is how you build systems that can actually afford to stumble into discoveries: make the stumbles cheap to recover from.