Trust, not capability: the real bottleneck in AI-assisted engineering
July 2025
I keep seeing the same sequence play out.
Someone introduces an AI coding tool. Output goes up. Code review starts taking longer, because reviewers can't tell whether the person submitting the PR actually understands what they're submitting. PRs get bigger. Context gets thinner. Senior engineers start pushing back - not because they're resistant to change, but because they've been burned before by code nobody fully understood.
The pattern is familiar enough now that I can usually call the next step before it happens. And the interesting thing isn't the sequence itself. It's where the friction actually lives.
The capability question was settled months ago
The models are good enough. GPT-4, Claude, Gemini - they produce working functions, reasonable tests, decent refactors. The benchmarks improve every quarter, and every quarter the same conversations happen about which model to standardise on and which IDE plugin to adopt.
None of that is where teams get stuck.
They get stuck at the point where a senior engineer looks at a diff they didn't write, generated by a tool they're still forming opinions about, and has to decide whether to let it through. Not whether it's correct in some abstract benchmark sense - whether it can be understood, debugged, and maintained under pressure by someone who wasn't in the loop when it was generated.
That's a trust problem, not a capability problem. And trust doesn't scale with model accuracy.
Trust is infrastructure
Capability is measurable. You run a model against test suites, score the output, compare benchmarks. Trust is different. It's relational. It depends on context, on history, and on what specifically goes wrong when things go wrong.
A senior engineer trusts their own code because they wrote it, debugged it, and carry a mental model of how it fits together. AI-generated code arrives without any of that. It might be correct. It might even be better. But the reviewer can't verify that from the diff, and the consequences of being wrong depend entirely on what the code touches.
So trust isn't something you assert. You build it from specific things: tests that encode intent, not just coverage. Deploys you can roll back in minutes. Automated checks that catch the constraint violations humans will inevitably forget. Review processes that require the submitter to explain their reasoning, not just present a clean diff.
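To make "tests that encode intent" concrete, here is a hypothetical sketch: rather than asserting one known output, the test asserts the invariant the code must preserve across many inputs. The `rebalance` helper and the portfolio framing are invented for the illustration, not taken from any real codebase.

```python
from decimal import Decimal

def rebalance(weights, total):
    """Split `total` across `weights`, rounding to pennies.

    Hypothetical helper: the rounding remainder is pushed into the
    last allocation so the amounts always sum exactly to `total`.
    """
    amounts = [(total * w).quantize(Decimal("0.01")) for w in weights]
    amounts[-1] += total - sum(amounts)
    return amounts

def test_allocations_sum_exactly_to_total():
    # The intent: rounding never creates or destroys money.
    # A coverage-only test would assert one expected list and could
    # still pass while the invariant broke for other weights.
    for weights in ([Decimal("0.5"), Decimal("0.5")],
                    [Decimal("0.333333")] * 3,
                    [Decimal("0.1")] * 10):
        total = Decimal("999999.97")
        assert sum(rebalance(weights, total)) == total
```

A reviewer reading this test learns what the code is *for*, which is exactly the context an AI-generated diff arrives without.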
Good engineering organisations have made these investments for years. AI shifts the calculus because it changes how failures look. A careful engineer makes mistakes along familiar lines - you learn to recognise them. AI makes mistakes that look confident. The code compiles, the tests pass, the logic is subtly wrong in a way the diff doesn't reveal.
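An invented example of that failure shape: the code reads plausibly, a spot-check test passes, and the bug only surfaces on inputs nobody thought to try.

```python
def allocate(weights, total):
    # Looks reasonable, and a reviewer skimming the diff would likely
    # wave it through: multiply each weight by the total.
    return [total * w for w in weights]

# The spot-check passes - 0.5 and 100.0 are exactly representable floats.
assert allocate([0.5, 0.5], 100.0) == [50.0, 50.0]

# But the invariant is subtly broken: binary floats can't represent 0.1,
# so ten 10% allocations don't sum back to the total.
assert sum(allocate([0.1] * 10, 1.0)) != 1.0
```

Nothing in the diff flags the second case; only an automated check on the invariant itself would catch it.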
The teams I see making real progress with AI aren't the ones with the best models. They're the ones where the engineering infrastructure was already strong enough that new code - from any source - gets challenged and verified rather than taken on faith.
When the consequences are denominated in money
In financial services, these aren't abstract engineering principles. The systems I work on manage client mandates, execute portfolio rebalancing, produce regulatory reporting. When something breaks, we know we can fix it. The issue is speed, and whether any client was harmed in the interval.
Under SM&CR, accountability is personal. A senior manager can't point at a tool and say "it decided." So the engineering question becomes concrete: does the infrastructure around this AI-assisted process catch errors before they reach clients? Can I explain to a regulator what happened, when, and why? If a deployment goes wrong at 5pm on month-end, is the rollback faster than the damage?
Either the system supports those answers or it doesn't. If it does, AI becomes the multiplier everyone's hoping for. If it doesn't, more capable models just mean more confident errors travelling further before anyone notices.
The invisible backlog
Most AI conversations I'm part of focus on adoption: which tools, which use cases, how to upskill people. That's the visible work. The invisible work - the infrastructure that determines whether AI accelerates your team or accelerates your problems - rarely makes it onto the roadmap.
Before asking "which AI tools should we use?", ask "what happens when they get it wrong?" If you don't like the answer, that's your backlog.