Automating Engineering Metrics Collection with AI CLI Tools
'Squishy' Metrics Matter More Than Dashboards
Every Monday, my Head of Product and I run an all-hands for Engineering where we outline what's happening this week. Part of my prep is gathering the past week's metrics that feed into our OKRs. I'm very cautious about using KPIs; if you've read Engineering KPIs in the Age of AI, you'll know just how cautious I am.
What I didn't mention in that article was the importance of limiting the number of KPIs and ensuring they have a balancing mechanism: if you measure velocity, you should probably also measure error rate so you aren't moving too fast or breaking too many things.
Back to the main point: fetching KPIs from different systems has been a bit of a challenge. I report on several leading indicators of good development, such as code coverage, impact score, and type safety. Our team has been hill climbing on these for the past year as we build our Customer Experience AI Agents, and I didn't want these best-practice goals to interfere with the work of making a production AI Agent system. That's hard enough.
It’s a chore
Collecting these metrics has been a chore for the past year. It often requires logging into half a dozen systems and pulling one number or another. While building out the Impact system, for instance, I'd discover that its automation had stopped working the previous Wednesday, and I'd have to go fix it and reprocess the missing values.
Not to mention that the systems we're built on are changing underneath us as well. For instance, Vercel released new Observability dashboards that broke my overall error-rate calculation and required a different solution. We went from a week of visibility down to 3 days unless I purchased their Observability Plus. 😡 jk, I still love you, Vercel.
The point is that these data points and the pipelines around them are fragile. Forcing them into the old software paradigm of "write code, code does thing" will always require escape hatches that need developer attention. We built those escape hatches to raise exceptions, and once one fires, it's back to the regular programming cycle of bugs, fixes, and priorities.
That is, until the latest wave of Codex, Claude Code, and Gemini CLI. I'll share the Claude Code Command, then discuss it below.
The Command
Collect weekly engineering metrics for the web application monorepo. This command automates the collection of key engineering metrics that are tracked every Monday.
**Prerequisites:**
1. Remind user they should be on the main branch. Use an emoji to indicate this.
2. Install dependencies: `npm install` or `pnpm install`
3. Build packages: `npm run build:packages`
4. Ensure cloud provider is configured: `npm run cloud:check` (if not, instruct the user to run `npm run cloud:login`)
**Metrics to collect:**
- Frontend app coverage (lowest of statements, branches, functions, lines)
- Backend API coverage (lowest of statements, branches, functions, lines)
- Type checking issues (total errors + warnings across all apps)
- Stability % (from CDN Analytics - successful requests / total requests, 5 decimal places)
- Volume (k) (from CDN Analytics - total API requests for the week)
- PRs merged to main (from git log for the past week)
- Error % (calculated as 1 - stability, 5 decimal places)
**Steps:**
1. **Collect Coverage Data:**
- Run: `npm run coverage:frontend`
- Parse the output or check `apps/frontend/coverage/` for coverage reports
- Extract the lowest percentage from statements, branches, functions, lines
- Run: `npm run coverage:backend`
- Parse the output or check `apps/backend/coverage/` for coverage reports
- Extract the lowest percentage for backend coverage
2. **Collect Type Checking Issues:**
- Run: `npm run typecheck:frontend`
- Run: `npm run typecheck:backend`
- Parse output to count total errors and warnings from both apps
3. **Collect PR Data:**
- Use git log to count PRs merged in the past week
- Example: `git log --merges --since="1 week ago" --format="%h %s" | wc -l`
4. **Collect CDN/Analytics Metrics:**
- Run: `./scripts/collect-cdn-metrics.sh`
- Script should collect:
- Total requests over the past 7 days
- Error rates (5xx status codes)
- Stability percentage
- API endpoint volume
- Store results in `metrics-history.json` (gitignored)
5. **Historical Data:**
- Append collected metrics to `metrics-history.json` with timestamp
6. **Output Format:**
Present the metrics in a table format:
```
Date: [current Monday date]
Frontend: [lowest coverage %]
Backend: [lowest coverage %]
Svelte Check: [total errors + warnings]
Stability: [success rate % with 5 decimal places]
Volume(k): [requests in thousands]
PRs: [merged PRs this week]
Error %: [error rate % with 5 decimal places]
```
**Notes:**
- Run this command every Monday
- You can use subagents. The two coverage commands take some time, so run those in two separate subagents, then run the rest in a third subagent.
- Ensure Cloudflare API credentials are configured (see step 4)
- The reference period is one week (Monday to Monday)
- All volume metrics come from Cloudflare Analytics
- If coverage tests fail, the coverage data might still be generated in the terminal output or partial coverage files
- Consider using `--reporter=json-summary` for more reliable parsing of coverage data
- Zone IDs can be found in your Cloudflare dashboard under each domain's settings
- If you have to change scripts or commands, or run into any issues with this command, please note it in the output at the end with a suggestion for how to fix it. Use emojis to catch my attention.
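For a sense of what the agent actually ends up running, here's a minimal sketch of the coverage and PR pieces, assuming istanbul/c8-style `coverage-summary.json` files at the paths the command names; your coverage reporter and paths may differ.

```bash
# Minimal sketch of steps 1 and 3, assuming istanbul/c8-style coverage-summary.json
# files exist at the paths named in the command above.
lowest_coverage() {
  # Report the lowest of the statements/branches/functions/lines percentages.
  jq '[.total.statements.pct, .total.branches.pct, .total.functions.pct, .total.lines.pct] | min' "$1"
}

echo "Frontend: $(lowest_coverage apps/frontend/coverage/coverage-summary.json)%"
echo "Backend:  $(lowest_coverage apps/backend/coverage/coverage-summary.json)%"

# PRs merged to main in the past week (merge commits only, as in step 3).
git log --merges --since="1 week ago" --oneline | wc -l
```

The point of handing this to an agent is precisely that I don't have to keep a script like this working by hand when a report format shifts.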
As you can see, there are a variety of metrics I'm reporting on now, chosen to meet us where we are as a team. Tailoring these to this team at this moment, while avoiding Goodhart's Law, is incredibly important to me.
Benefits
The approach has several benefits. It can be run locally or in an Action Runner. Piping the results to a structured output step isn't something I've done yet, but it's on my roadmap, as is sending the results to myself in a Slack message every Monday.
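As a rough sketch of that roadmap item, a runner job could invoke the command headless and forward the summary to a Slack incoming webhook. The `claude` flags and the `/weekly-metrics` command name below are assumptions on my part, so check them against your Claude Code version before relying on them.

```bash
# Hypothetical CI step: run the slash command non-interactively and post the result
# to Slack. The claude flag and command name are assumptions; verify before using.
summary=$(claude -p "/weekly-metrics")

# SLACK_WEBHOOK_URL would come from the runner's secrets.
curl -sS -X POST -H 'Content-type: application/json' \
  --data "$(jq -n --arg text "$summary" '{text: $text}')" \
  "$SLACK_WEBHOOK_URL"
```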
Beyond basic automation, this approach offers flexibility and adaptability. When a service changes its API or output format, I can simply update the prompt rather than diving into code. The AI agent also handles edge cases more gracefully: parsing slightly different log formats, dealing with partial failures, or rerunning with a single failing test excluded.
Self-Healing Capabilities
One of the most powerful aspects of using AI CLI tools is their ability to self-heal. When the Vercel dashboard changed, instead of my script breaking silently, Claude Code could recognize the changed format and suggest alternatives. It can detect when a command fails and try alternative approaches, perhaps checking a different endpoint or parsing the data differently.
The AI can also identify patterns in failures. If coverage reports are consistently missing on Mondays, it might suggest running the tests earlier. This proactive problem-solving gets the fix started much earlier than the old paradigm would.
It practically rewrote the error-rate KPI from Vercel to use Cloudflare's Analytics API in one session. I'm no GraphQL expert, so this saved me a few hours of work and let me focus on the deck and new sprints.
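For reference, pulling a week of request and status-code counts out of Cloudflare's GraphQL Analytics API looks roughly like the sketch below. The dataset and field names (`httpRequests1dGroups`, `responseStatusMap`) are my best recollection rather than the exact query from that session, so verify them against the current schema.

```bash
# Hedged sketch: one week of request totals and per-status counts from Cloudflare's
# GraphQL Analytics API. Dataset/field names may differ; check the schema explorer.
SINCE=$(date -d '7 days ago' +%F)   # GNU date; on macOS use: date -v-7d +%F
UNTIL=$(date +%F)

QUERY=$(cat <<EOF
{
  viewer {
    zones(filter: { zoneTag: "$CLOUDFLARE_ZONE_ID" }) {
      httpRequests1dGroups(limit: 7, filter: { date_geq: "$SINCE", date_leq: "$UNTIL" }) {
        dimensions { date }
        sum { requests responseStatusMap { edgeResponseStatus requests } }
      }
    }
  }
}
EOF
)

jq -n --arg query "$QUERY" '{query: $query}' |
  curl -sS https://api.cloudflare.com/client/v4/graphql \
    -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data @-
```

Stability then falls out as successful requests over total requests, and error % as 1 minus that, matching the definitions in the command.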
Parallelization and Automation
AI agents excel at orchestrating parallel tasks. While traditional scripts require careful coordination of async operations, Claude Code can spawn subagents to handle different metrics simultaneously. Coverage tests for the frontend and backend can run in parallel while another agent fetches CDN metrics, dramatically reducing total collection time from 15-20 minutes to under 5.
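The same fan-out in plain shell is doable but more brittle. A sketch of the equivalent, using the scripts from the command above:

```bash
# Kick off both coverage suites in the background, pull CDN metrics meanwhile,
# then wait for coverage to finish before parsing anything.
npm run coverage:frontend > /tmp/coverage-frontend.log 2>&1 &
frontend_pid=$!
npm run coverage:backend > /tmp/coverage-backend.log 2>&1 &
backend_pid=$!

./scripts/collect-cdn-metrics.sh

wait "$frontend_pid" "$backend_pid"
echo "Coverage runs finished; parse coverage output and the CDN results next."
```

The difference is that the agent also decides what to do when one of those background jobs fails, which is exactly where hand-rolled scripts start accumulating special cases.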
The automation extends beyond just running commands. The AI can intelligently retry failed operations, aggregate partial results, and even interpolate missing data points based on historical trends when necessary.
If it has to change anything, it can raise those exceptions in the summary. Hell, you could even have it submit a PR and send a Slack notification whenever it does.
Why Not a Dashboard?
You might wonder why I don't just use a traditional metrics dashboard. There are several reasons:
First, the act of manually reviewing these numbers each week forces engagement with the data. It's too easy to ignore a dashboard that's always there. This weekly ritual ensures I'm intimately familiar with our trends and can adjust.
Second, our metrics come from disparate sources that don't play nicely together. Building a unified dashboard would require significant engineering effort (believe me, I’ve already tried) and ongoing maintenance, effort better spent on our core product.
Third, this approach allows for contextual interpretation. If instructed to do so, the AI can add notes about why coverage might have dropped (e.g., "Large refactor merged on Thursday") or flag unusual patterns that a static dashboard would miss.
Challenges with Various Inputs
The trickiest part has been handling the variety of input sources. Some metrics come from JSON APIs, others from CLI output, and some from parsing log files. Each source has its own quirks:
- Coverage reports sometimes output to stderr instead of stdout
- CDN APIs occasionally return incomplete data for recent time periods, and the sheer volume of data is enormous
- Git logs can be misleading when force-pushes or rebases occur
- Type checking output formats vary between TypeScript versions
The AI's natural language understanding helps normalize these diverse inputs into a consistent format, something that would require extensive parsing logic in traditional scripts.
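Two of the shell-level workarounds involved, sketched with the caveat that the exact output formats are my assumptions and shift between tool versions:

```bash
# Some coverage runners print their summary to stderr; merge the streams before parsing.
npm run coverage:frontend 2>&1 | tee /tmp/coverage-frontend.log

# Count error/warning lines from the type checker. The pattern is a guess at the
# output format and will need tuning per svelte-check / tsc version.
npm run typecheck:frontend 2>&1 | grep -ciE '(error|warning)'
```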
What Next?
I know I said no dashboard, but I'm actually planning a different kind of dashboard, one focused on what I call "squishy metrics". What needs to be understood, and what actually influences the team, isn't always error rate.
Error rate is a lagging indicator, and leading indicators are always more valuable than lagging ones. The key is figuring out what those leading indicators are and recognizing when they change.
The vision is to build something that captures the harder-to-measure aspects of engineering health:
**AI-Powered Deep Dives:** When metrics shift, use AI Search to automatically investigate why. Did deployment patterns change? Was there a shift in code review behavior? Are certain parts of the codebase becoming bottlenecks? The AI can dig through PRs, commits, production logs, Slack conversations, and documentation to surface the human stories behind the numbers.
**Impact Score Integration:** Connect these metrics with our existing Impact scoring system and project management tools. This would give Team Leads real insights into their teams, not just "coverage dropped 5%" but "your team is tackling high-complexity features this sprint, which historically correlates with temporary coverage dips."
**Leading Indicator Discovery:** Use AI to identify new leading indicators we haven't thought of yet. Maybe PR description length correlates with bug rates. Maybe certain emoji reactions in code reviews predict deployment success. The squishier the metric, the more predictive power it might have.
**Team Context Dashboard:** Surface the qualitative factors that affect performance. Team morale indicators from communication patterns, collaboration health from PR interactions, knowledge silos forming from code ownership patterns. These are the metrics that matter but are too nuanced for traditional dashboards.
The goal isn't to create another metrics dashboard. It's to build an Engineering Intelligence layer that understands the human dynamics driving our engineering outcomes; those soft factors are what predict whether we'll hit our goals.