There are very few real primitives to build with in the AI dev tooling space. Hooks, MCPs, computer use / CLIs, and Skills are the main ones I work with.
I've been building and evaluating a series of tools to review PRs, and have run into something I'm calling the "same guy problem". If I use Claude Code to generate a PR, then ask Claude to review it, I can predict almost perfectly what he's going to find: some edge cases that technically aren't handled (because they're impossible to reach), plus some linting and style stuff. Nothing severe or interesting. That's because it's the same guy. I see the same thing if I ask Codex to review a PR that Codex generated. It's very difficult to get a smart review if I've already thrown my smartest tool at the first task.
Even Anthropic and OpenAI frontier models are not that different of guys these days. Some bugbots have managed to become slightly "different guys" by stapling sufficiently large and complex system prompts to their foreheads. Telling some LLM instance "You are a security genius" before having it do a security-focused review can work, but it's flaky, its effectiveness changes from model to model, and, most importantly, it's irritatingly stupid to my Serious Engineering sensibilities.
This is not to say that it's all hopeless. I have stood up systems that work effectively at finding novel concerns. They fall into three main categories:
- Deterministic scripts and checks, which we've used AI to write more of.
- Agents narrowly scoped with specific judgement calls to make (like spec evaluations), which output a GO / NO GO decision. The primary output is this binary call; the chain of thought to get there is a useful artifact on rejection but not the main point.
- RAG-like systems which compare new PRs to sources of truth, for service-level design patterns OR company-level architecture documents.
These checks have to be specific and designed iteratively, so we're not spinning up 6 expensive agents and telling them to "review this PR for quality" and getting a bunch of identical meandering.
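To make the first two categories concrete, here is a minimal sketch, not a real implementation. Everything in it is hypothetical: the `Verdict` shape, the `no_debug_prints` rule, the prompt wording, and the `judge` callable, which stands in for whatever model call you'd actually use. The point is only the shape: each check answers one narrow question and returns a binary call, with the reasoning kept around for rejections.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    check: str
    go: bool      # the primary output: GO / NO GO
    reason: str   # only populated on rejection


def no_debug_prints(diff: str) -> Verdict:
    """Deterministic check (hypothetical rule): reject added print() calls."""
    offending = [
        line for line in diff.splitlines()
        # "+" marks an added line in a unified diff; "+++" is the file header.
        if line.startswith("+") and not line.startswith("+++")
        and re.search(r"\bprint\(", line)
    ]
    if offending:
        return Verdict("no_debug_prints", False,
                       f"{len(offending)} added print() call(s)")
    return Verdict("no_debug_prints", True, "")


def spec_gate(diff: str, spec: str, judge: Callable[[str], str]) -> Verdict:
    """Narrowly scoped agent check: one question, one binary answer.

    `judge` is an assumed stand-in for a model call: prompt in, text out.
    """
    prompt = (
        "Does this diff satisfy the spec? Answer on the first line with "
        "exactly GO or NO GO, then explain.\n\n"
        f"SPEC:\n{spec}\n\nDIFF:\n{diff}"
    )
    first_line, _, rest = judge(prompt).strip().partition("\n")
    # Fail closed: anything that is not exactly "GO" counts as NO GO.
    go = first_line.strip().upper() == "GO"
    return Verdict("spec_gate", go, "" if go else (rest.strip() or first_line))


def review(diff: str, spec: str, judge: Callable[[str], str]) -> List[Verdict]:
    """Run every check; the PR passes only if all of them say GO."""
    return [no_debug_prints(diff), spec_gate(diff, spec, judge)]


def merge_allowed(verdicts: List[Verdict]) -> bool:
    return all(v.go for v in verdicts)
```

Passing `judge` in as a plain callable keeps the gate testable with a canned response, and makes it cheap to swap the model behind it without touching the check itself.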
The main point, though, is that there is only so much that tooling can do for us. For LLM coding performance, using better models beats spending more money, and spending more money beats smarter toolchains. Consider if I tasked you with making GPT-2 the best possible software engineer you could. What would you try? There's just a ceiling on what you can get out of that model. Past a point, the axe is as sharp as it's going to get, and the better payoff is in using it to chop some trees.