Observable, reversible, enforceable: three requirements for AI in production

December 2025


After twenty-seven years of building financial systems, I've noticed that production readiness always comes down to the same three things, whether the work is done by humans, traditional automation, or AI.

Can you see what happened and why? Can you undo a bad outcome cheaply? Are the constraints checked by machines rather than remembered by people?

Observable, reversible, enforceable. Miss any one and you're running a demo that happens to touch real data. The models can already write code, generate reports, handle queries. That part resolved itself faster than most people expected. What hasn't been resolved is whether the systems around them can handle the failures.

Observable

Observable means you can see what happened, why it happened, and whether it matched your expectations - without reconstructing the reasoning after the fact.

This sounds obvious. It isn't. AI systems fail observability in ways that traditional automation doesn't.

A portfolio rebalancing script either executes trades or throws an error. The logs show you which. An AI agent that decides whether to rebalance, chooses which assets to adjust, and generates the trade instructions can fail in ways the logs don't capture. The reasoning that led to "sell 200 shares of X" might be sound or might be hallucinated from training data. The output looks identical.

Observable AI processes need several things traditional processes don't:

- The exact inputs the model saw, logged at decision time, not reconstructed later
- The model version and configuration that produced the output
- The stated reasoning, captured verbatim alongside the decision
- A record of whether the output passed or failed the downstream checks

In financial services, observability isn't optional. The FCA wants to know why a suitability decision was made. The auditors want to reconstruct the inputs to a valuation. If "the AI said so" is your only answer, you don't have observability. You have liability you can't explain.

The practical test: when something breaks in production, can the on-call engineer reconstruct what happened from the logs alone? If they need to re-run the AI with the same inputs and hope for the same output, you're not observable. You're hoping.
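That test can be made concrete. Below is a minimal sketch of a decision record - the function and field names are hypothetical, not a prescribed schema - showing the point that inputs, model identity, and stated reasoning get captured at decision time rather than re-derived from a fresh model run:

```python
import json
import datetime

def record_decision(inputs, model_id, output, reasoning):
    """Capture everything needed to reconstruct a decision from logs alone."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_id": model_id,    # the exact model version, not just "the AI"
        "inputs": inputs,        # the data the model actually saw
        "output": output,        # what it decided
        "reasoning": reasoning,  # the stated rationale, logged verbatim
    }
    return json.dumps(entry)     # in practice, append to an immutable log

log_line = record_decision(
    inputs={"portfolio": "A123", "drift_pct": 6.2},
    model_id="rebalancer-v3.1",
    output="sell 200 shares of X",
    reasoning="drift exceeds 5% band on equity allocation",
)
```

With a record like this, the on-call engineer reads the log; they never need to re-run the model and hope it repeats itself.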

Reversible

Reversible means you can undo a bad decision, at a cost that stays proportional to how quickly you catch it.

The ideal is immediate, complete reversal: roll back the deployment, restore the database, cancel the trade. Reality is messier. Some actions create external state changes you can't unwind. A client email, once sent, is sent. A regulatory filing, once submitted, is on record. A trade, once executed, is in the market.

This is where AI deployment differs from traditional automation. Most automation operates on well-defined inputs and produces predictable outputs. You can test it exhaustively. AI systems are stochastic. The same input can produce different outputs. The model that worked fine yesterday might hallucinate today. Your test suite covers the cases you anticipated, not the ones you didn't.

Reversibility requires asking: when (not if) this produces an unexpected output, what breaks?

Some patterns make reversal cheap:

- Stage outputs for review before they leave the system; a staged email costs nothing to cancel
- Prefer append-only records to destructive updates, so the prior state survives
- Put approval gates in front of anything that creates external state - trades, filings, client communications
- Cap the blast radius with batch limits, so a single bad run can only touch so much

In the systems I work on - bespoke investment mandates, client reporting, regulatory submissions - we treat irreversibility as a risk multiplier. If an AI-assisted process feeds directly into something that can't be undone, the quality bar is higher. The review is more thorough. The monitoring is tighter. The acceptable error rate is lower.
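One staging pattern can be sketched in a few lines - the class and method names here are illustrative, not a real framework. The idea is that an irreversible action sits in a pending state where cancellation is free, and only an explicit approval crosses the line into external state:

```python
class StagedAction:
    """Hold an irreversible action in a pending state until approved.

    Reversal is free while the action is staged; once executed,
    the external state change cannot be unwound.
    """

    def __init__(self, description, execute_fn):
        self.description = description
        self._execute_fn = execute_fn
        self.state = "staged"

    def cancel(self):
        # Cheap reversal: nothing has left the system yet.
        if self.state != "staged":
            raise RuntimeError("cannot cancel after execution")
        self.state = "cancelled"

    def approve_and_execute(self):
        # The point of no return: external state changes here.
        if self.state != "staged":
            raise RuntimeError("action is not pending")
        result = self._execute_fn()
        self.state = "executed"
        return result

sent = []
email = StagedAction("send client email", lambda: sent.append("client email"))
email.cancel()  # caught before sending: the undo costs nothing
```

The asymmetry is the whole design: before approval, a mistake costs a cancelled record; after, it costs whatever the external world charges to clean up.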

Sometimes the answer is: this process isn't ready for AI, because the cost of a mistake we can't reverse exceeds the benefit of automation. That's a legitimate answer.

Enforceable

Enforceable means the rules that govern correct behaviour are checked by machines, not remembered by humans.

AI outputs look plausible. That's the problem. A human reviewer scanning AI-generated code or a model-written report will catch obvious errors. They won't reliably catch subtle constraint violations, edge cases the model didn't consider, or policy breaches wrapped in confident language.

Humans forget things. They forget that client A has a restriction on emerging market exposure. They forget that the timestamp format changed last quarter. They forget that this particular report goes to a regulator who interprets column headers literally. The cognitive load of remembering every constraint while also evaluating AI output for correctness is too high.

Enforcement moves these checks out of human memory and into automated validation:

- Client restrictions encoded as machine-checked rules, not reviewer knowledge
- Schema and format validation on every generated report before it ships
- Policy checks that run on every output, every time, with no judgment call
- Hard failures rather than warnings: a constraint violation blocks the pipeline

This isn't AI-specific. Good engineering teams have always encoded their constraints in CI pipelines, database constraints, and runtime validation. But AI makes enforcement more important because the failure modes are different. Traditional code fails predictably - the same bug produces the same wrong output. AI fails stochastically - the same prompt might work 99 times and fail the 100th.

When I evaluate an AI-assisted workflow, I ask: what percentage of the constraints that define "correct" are enforced by code versus remembered by reviewers? The higher the former, the safer the process.
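A machine-checked constraint looks like this in miniature - the data shapes and names are hypothetical, and a real mandate system would carry far more structure, but the principle holds: the restriction lives in code, so the reviewer's memory is no longer a dependency:

```python
def check_mandate_constraints(proposed_trades, restrictions):
    """Reject any trade that breaches a client restriction,
    no matter how plausible the AI's output looks."""
    violations = []
    for trade in proposed_trades:
        banned = restrictions.get(trade["client"], set())
        if trade["asset_class"] in banned:
            violations.append(
                f"{trade['client']}: {trade['asset_class']} restricted"
            )
    return violations

# Client A's emerging-market restriction, encoded rather than remembered.
restrictions = {"client_a": {"emerging_markets"}}
trades = [
    {"client": "client_a", "asset_class": "emerging_markets", "ticker": "EM1"},
    {"client": "client_a", "asset_class": "uk_equity", "ticker": "UK1"},
]
errors = check_mandate_constraints(trades, restrictions)
# A non-empty result blocks the pipeline; it does not raise a warning for a human to weigh.
```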

How the three interact

These three properties interact. Strong observability makes reversal cheaper - you know what to undo. Strong enforcement catches errors before they need reversal. Weak observability means enforcement is your last line of defence.

When teams ask me whether they're ready to deploy AI in some process, I ask them to score each property:

- Observable: can an engineer reconstruct any decision from the logs alone?
- Reversible: what does the worst plausible output cost to undo, and does that cost stay proportional to how quickly it's caught?
- Enforceable: what fraction of the constraints that define "correct" are checked by code rather than remembered by reviewers?

A process that scores well on all three is ready for AI to increase velocity. A process that scores poorly on any one is a candidate for infrastructure investment before AI deployment.
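The gating logic is simple enough to write down. This sketch assumes an illustrative 1-5 scale, which is my own convention rather than any standard; the design choice worth noting is that the gate uses the weakest score, not the average, because one weak property sinks the process:

```python
def readiness(scores, threshold=3):
    """Gate AI deployment on the weakest of the three properties."""
    weakest = min(scores.values())
    if weakest >= threshold:
        return "ready"
    return "invest in infrastructure first"

process = {"observable": 4, "reversible": 2, "enforceable": 5}
verdict = readiness(process)  # one weak property fails the whole process
```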

This framing has a useful side effect: it depoliticises the AI adoption conversation. Instead of arguing about whether AI is good or dangerous, you're asking whether a specific process has the infrastructure to handle the errors AI will inevitably produce. That's an engineering question with an engineering answer.

Start where the infrastructure is strong

Model selection matters. Use case identification matters. Training people to prompt effectively matters. But these are second-order decisions. The first-order decision is: which processes already have the engineering infrastructure to handle mistakes cheaply? Those are where AI belongs first.

The processes where mistakes are expensive and hard to reverse - regulatory submissions, client-facing calculations, anything with external state changes - need investment in observability, reversibility, and enforcement before they need investment in AI.

The advantage won't go to whoever adopts fastest. It will go to the teams whose systems were already built to absorb errors from any source - human or machine - without drama. For them, AI is a genuine force multiplier. For everyone else, it's a faster way to generate problems that take longer to find.

Observable, reversible, enforceable. If you can't say yes to all three, you're not ready.