Harness Engineering
The frontier labs want you to believe it's all about the model. It isn't. The biggest lever in building with AI is the harness, the scaffolding around the model, and it's where I now spend almost all of my time. Here is what harness engineering is, with a real example, and why I think it's the next chapter of green software.
by Asim Hussain · 20 June 2026
The frontier labs want you to believe that the only thing that matters is the model. Of course they do. They sell models. But after a couple of years building real software with these tools, I’m convinced it isn’t true, and that the biggest lever most of us are missing sits somewhere else entirely: the harness.
The harness can be worth a whole model generation
Start with a number. LangChain recently took a coding agent from outside the top 30 of a benchmark to the top 5, a 13.7-point jump on Terminal-Bench, without touching the model at all. They changed the harness: the system prompts, the tools, the scaffolding around the model.
Sit with how big that is. A leap of that size is what the industry normally makes you wait a whole model generation for, the sort of jump you’d expect going from one flagship model to the next. The labs package that up and sell it to you as the reason to upgrade. LangChain got it for free, from the parts nobody is selling.
So what is a harness?
The harness is everything wrapped around the model: the rules, the tools, the feedback loops, the structure you build so the thing does what you need without being begged. If the model is the engine, the harness is the rest of the car, the steering and the brakes and the rails that keep it on the road. A raw model is not an agent. A model plus a harness is.
Download — GIF · 540p (138 KB) GIF · 720p (149 KB) MP4 · 720p (18 KB)
A harness in practice: a rule the model can’t break
Let me make that concrete with one small piece of my own. I’ve been building a harness for the last six months. I call it signposts, and I’ll write about it properly another time, but here is one tiny part of it, just to give you the shape of the thing.
I build with Astro, and Astro gives you an <Image> component that does a great deal of quiet work. A plain HTML image tag, <img>, ships exactly the file you point it at, untouched: drop in a sixteen-megabyte photo and every visitor downloads sixteen megabytes. Astro’s <Image> wraps that up. At build time it resizes and recompresses the source into modern formats at a few sensible widths, and writes the responsive markup for you, so the same photo reaches your users at around a hundred kilobytes. Same picture on the page. Two orders of magnitude less to download.
So in my code I always want the first of these, never the second:
---
import { Image } from "astro:assets";
import hero from "./hero.jpg"; // a 16MB source file
---
<!-- what I want: optimised and resized at build time -->
<Image src={hero} alt="A hero shot" />
<!-- what I don't: the raw 16MB file, shipped as-is -->
<img src="/hero.jpg" alt="A hero shot" />
For months I could not get the model to comply. I told it, plainly, in my prompts. I put it in the CLAUDE.md that gets injected at the start of every single conversation, something like:
## Images
ALWAYS use the Astro `<Image>` component. NEVER use a raw `<img>` tag.
`<Image>` optimises at build time; a raw `<img>` ships the unoptimised file.
And it would begin a session doing exactly that, then somewhere in a long session quietly drift back to a raw <img>. This is context drift, one of the real, load-bearing problems of agentic coding: the longer a session runs, the more the early instructions fade. I spent ages checking the output by hand.
Then it clicked.
An instruction is a hint, and hints get forgotten. What I needed was a rule the model could not break.
So I stopped asking. I added a hook that fires every time the model edits a file (and again before every commit, via lefthook), runs the change through ast-grep, a tool that parses code into its syntax tree so you can say, structurally, what is and isn’t allowed, and rejects the edit outright if it finds a raw <img> where an <Image> belongs. Roughly:
id: use-astro-image
language: astro
severity: error
rule:
any:
- kind: self_closing_tag
- kind: start_tag
has:
kind: tag_name
regex: "^img$"
message: |
Use the Astro <Image> component, not a raw <img>.
A raw <img> ships an unoptimised, non-responsive image: no build-time
resize, no srcset. Astro's <Image> produces an optimised, responsive
image for free, so a 16MB source becomes ~100KB in production.
Now here is the part people get wrong, and it is the most important part. Blocking on its own is not enough. If you simply refuse the edit, the model is left confused: it knows it can’t do something, but not why. So the message is everything. When the rule rejects the edit, that message goes straight back to the model.
And the reason this works is worth saying plainly: the model isn’t being difficult. It just forgets. The message arrives as a reminder, the model reads it, says some version of ah, you’re right, I should be using <Image>, and for a good while afterwards it remembers and gets it right. Then, eventually, it drifts again, and the rule reminds it again. The difference now is that I no longer have to care, because ast-grep means a raw <img> simply cannot make it into my codebase. Not once. I am completely confident of that, and that confidence is the whole point.
That attitude is what harness engineering is. It isn’t one tool, or one technique. It is the realisation that the thing you build to get an AI to do what you want is not a prompt. It’s closer to context, but really it is just enforcement: engineering the software, and the environment it is built in, so the goal and its constraints are both baked in and neither can be quietly dropped. You set the goal, and then you set the rules that make any other outcome impossible. I am far firmer with the model than I would ever be with a human developer, and that is exactly right: a person you trust to remember; a model you give a world it cannot get out of.
So now, when I hit a problem, I don’t reach for the prompt. I firm up the harness. A new failure mode becomes a new rule. I don’t instruct, I enforce. My harness is far more than this one ast-grep rule, it has many moving parts, but that is how I think now, and honestly it has changed how I build.
And it cuts both ways
This is also the real reason people complain about the effort of reviewing what AI writes. The reviews are not the problem. The drift is. If the model can quietly slip a mistake in at any point, you have to read everything, every time, just in case, and that is exhausting. It is most of what people mean when they say agentic coding is more trouble than it’s worth.
With a strong harness that anxiety mostly lifts. My reviews are simpler and I can trust them, because the whole category of tiny drifting mistakes is already fenced off, and I can spend my attention on the things that genuinely need judgement instead of dreading it has slipped in some small thing again.
And it unlocks the part I find genuinely exciting, because a good harness doesn’t only push a frontier model higher, it works the other way too. With the rails in place I can reach for a smaller, cheaper, more efficient model and have it do work that supposedly needs a bigger one. Right now I’m pushing exactly that: dialling the effort level down, reaching for smaller models, seeing how much I can take away and still get the job done.
A small model under a strong harness will quietly beat a big model run carelessly. A good harness is, more or less, everything.
Where the gap still is
The models still matter, and there is one place I can’t yet do without a frontier one: the high-level work of planning and specification, the thinking at the very top. That is where the gap between a frontier model and a small one is still wide. So what I’m exploring is a hybrid: a frontier model for the planning and the spec, then smaller, and I hope eventually open, models doing the execution underneath a harness strong enough to keep them honest. I haven’t quite got there. The direction is a harness good enough that the frontier model becomes optional for everything below the very top.
Harness engineering is green software for the agentic age
There is a name forming for this, harness engineering. Prompt engineering, then context engineering, now this. For me it is simply the next chapter of the work I’ve always done: cut the waste, get the most out of what you spend, pointed now at a new kind of machine. A weaker model doing more, fewer tokens thrown away, less compute burnt to reach the same result.
The labs will keep telling you the answer is a bigger model. Sometimes it is. But far more often than they would like you to believe, the answer is a better harness.