> the code it produces is slop, but that’s more the fault of the model than the harness not being a good judge on if a step in the workflow resulted in a net improvement or completion.
I don’t know. I’ve invested heavily in building internal tools that scaffold code and lint the filled-in architecture/code design. That, combined with a ratchet pattern (new rules are allowed to land even when the existing codebase violates them, and those violations then get fixed asymptotically), is working pretty well.
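For anyone unfamiliar with the ratchet idea, here is a minimal sketch of how it could work. It assumes the linter can report a violation count per rule; the function name, the rule names, and the baseline format are all made up for illustration and are not the actual internal tooling.

```python
from dataclasses import dataclass, field


@dataclass
class RatchetResult:
    # rule -> (allowed, actual) for every rule that got worse
    regressions: dict[str, tuple[int, int]] = field(default_factory=dict)
    # baseline to persist for the next run
    new_baseline: dict[str, int] = field(default_factory=dict)

    @property
    def ok(self) -> bool:
        return not self.regressions


def ratchet(baseline: dict[str, int], current: dict[str, int]) -> RatchetResult:
    """Compare per-rule violation counts against a stored baseline.

    Hypothetical helper: a rule may never exceed its baseline count,
    and the baseline only ever moves down.
    """
    result = RatchetResult()
    for rule in baseline.keys() | current.keys():
        count = current.get(rule, 0)  # a rule not reported has no violations
        allowed = baseline.get(rule)
        if allowed is None:
            # Brand-new rule: accept the existing violations as the starting point.
            result.new_baseline[rule] = count
        elif count > allowed:
            # Regression: fail the check and keep the old limit.
            result.regressions[rule] = (allowed, count)
            result.new_baseline[rule] = allowed
        else:
            # Same or better: tighten the limit to the current count.
            result.new_baseline[rule] = count
    return result
```

In CI you would fail the build when `ok` is false and commit `new_baseline` when it is true, so the allowed count for each rule can only shrink over time.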
Example: all modules have tightly scoped design primitives (I’m using hexagonal architecture for the backend). And all code has BDD tests, which is what I spend much of my time reviewing, since reading cases written in human sentences is easier than looking through so many files of code.
It takes relentless upkeep to draft rules that respond to the workarounds the agents come up with, so that the code keeps adhering to the design I want, but it’s slowly approaching perfect. What has helped tremendously here is using hooks to run an LLM-as-a-judge over the decisions the LLMs make, and then having them review/raise the questionable ones after a first pass is completed. In general, this is snuffing out the slop effectively.
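Roughly, the judge hook could be shaped like the sketch below. Everything in it is an assumption for illustration: the `Decision`/`Verdict` types, the 0-to-1 score, the threshold, and especially `keyword_judge`, which is a stand-in for a real LLM call so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    file: str
    summary: str  # what the agent says it did and why


@dataclass(frozen=True)
class Verdict:
    score: float  # 0.0 = almost certainly a workaround, 1.0 = fine
    reason: str


# Any callable works as a judge; in practice this would wrap an LLM call.
Judge = Callable[[Decision], Verdict]


class JudgeHook:
    """Scores each agent decision and queues the questionable ones."""

    def __init__(self, judge: Judge, threshold: float = 0.5) -> None:
        self.judge = judge
        self.threshold = threshold
        self.flagged: list[tuple[Decision, Verdict]] = []

    def on_decision(self, decision: Decision) -> None:
        # Called by the harness whenever the agent records a decision.
        verdict = self.judge(decision)
        if verdict.score < self.threshold:
            self.flagged.append((decision, verdict))

    def review_queue(self) -> list[tuple[Decision, Verdict]]:
        # After the first pass: most suspicious decisions first.
        return sorted(self.flagged, key=lambda pair: pair[1].score)


def keyword_judge(decision: Decision) -> Verdict:
    """Stand-in judge: flags summaries that mention suppressing a check."""
    suspicious = ("noqa", "type: ignore", "skip")
    text = decision.summary.lower()
    hits = [word for word in suspicious if word in text]
    if hits:
        return Verdict(0.1, "mentions " + ", ".join(hits))
    return Verdict(0.9, "nothing suspicious")
```

The point of keeping the judge a plain callable is that the hook doesn't care which model sits behind it, and the flagged queue is what gets raised for review once the first pass is done.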
All that to say, someone asked me recently what model I prefer. With this approach, the model doesn’t really matter to me because the code is consistently what I want. I’ll choose a model because it has better MCP speed (Codex) or a more thorough scope (Claude Code).
Where the quoted point IS true is when we’re building a net new pattern. The agents are not great at it. BUT most code can fit into the few patterns I’ve created, and for what can’t, you lock down a new pattern and enforce it over a couple of iterations. Almost everything, at least in SaaS, follows a template.