
Doctrine

Controlled AI

A model's performance says nothing about the robustness of the system that embeds it. Putting AI into live operation means designing its supervision, its traceability, its reversibility and its maintenance over time. That is where the engineering work begins, and where the demonstration ends.

Field reality

What breaks when an AI moves to production

The failures observed on AI systems in live operation are almost never the ones the demos suggested.

A model that performs flawlessly on the demo test set starts producing off-topic responses in production on unusual inputs. Not systematically, but often enough that no user quite trusts it anymore.

Inference costs drift silently. The token cost estimate made at project start forgets retries, multi-step workflows, and prompts that lengthen as the system grows more complex. Six months later, the observed monthly cost is ten times the initial estimate.

An incident occurs in production, a user reports aberrant behaviour. No one can reconstruct the exact context of the request, the model's intermediate decisions, or which tool it called with which parameters. Traceability, not designed in from the start, is now impossible to retrofit without an overhaul.

An apparently minor prompt modification, deployed without an evaluation suite, silently degrades response quality for three weeks. The drift is only visible through slow, noisy user feedback.

The base model provider updates its endpoint. The name stays the same, the behaviour changes. Teams that had not pinned a precise version only notice when the drift has become massive.

None of these phenomena is attributable to the model. They all result from a failure of the system around the model.

Thesis

The real subject: supervision and operability

The stake of an AI project in a professional environment is not to make a model work, it is to produce an operable system. An operable AI system is recognised by four properties that the demo does not reveal.

It is continuously supervisable. Not merely instrumented with classic HTTP monitoring, but equipped with quality metrics observed in real time: distribution of automatic evaluation scores on production calls, negative feedback rate, response length anomalies, drift in the distribution of inputs.
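
A minimal sketch of what this real-time quality supervision can look like, assuming an automatic evaluation score is computed on each production call; all names, fields and the window size are illustrative.

```python
# Sketch: rolling quality signals on production calls, next to the usual
# technical metrics. All names and the window size are illustrative.
from collections import deque
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class QualitySample:
    eval_score: float        # automatic evaluation score for this production call
    negative_feedback: bool  # explicit negative user feedback
    response_tokens: int     # response length, to spot anomalies

class QualityMonitor:
    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)  # rolling window of recent calls

    def record(self, sample: QualitySample) -> None:
        self.samples.append(sample)

    def report(self) -> dict:
        if not self.samples:
            return {}
        scores = [s.eval_score for s in self.samples]
        lengths = [s.response_tokens for s in self.samples]
        return {
            "mean_eval_score": mean(scores),
            "negative_feedback_rate": sum(s.negative_feedback for s in self.samples) / len(self.samples),
            "response_length_mean": mean(lengths),
            "response_length_stddev": pstdev(lengths),  # spikes here flag length anomalies
        }
```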

It is auditable after the fact. Every significant decision of the system is traceable with its inputs, its outputs, the model and version used, the timestamp, and the chain of tools called. This journaling is not a compliance add-on; it is what lets you understand what happened when strange behaviour surfaces.

It evolves under control. Every modification of the system (prompt, configuration, model version) traverses a validation pipeline: automated evals against a reference dataset, shadow mode, staged canary deployment. That is what allows it to evolve without silent regression: no update reaches user-facing production without having been compared to the previous behaviour.
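
As a sketch, this pipeline can be expressed as an explicit promotion function; the stage callables, sample rate and traffic fractions below are illustrative assumptions.

```python
# Sketch: a change (prompt, configuration, model version) is promoted only by
# passing each stage; the stage callables, sample rate and traffic fractions
# are illustrative assumptions.
def promote_change(candidate, baseline, reference_dataset,
                   run_evals, run_shadow, run_canary, rollback):
    # 1. Automated evals against the reference dataset.
    if run_evals(candidate, reference_dataset) < run_evals(baseline, reference_dataset):
        return "rejected: regression on reference evals"

    # 2. Shadow mode: the candidate processes real requests in parallel,
    #    its responses are compared but never served.
    if not run_shadow(candidate, baseline, sample_rate=0.10):
        return "rejected: divergence observed in shadow mode"

    # 3. Canary: traffic is switched over in stages, with rollback on regression.
    for fraction in (0.01, 0.05, 0.25, 1.0):
        if not run_canary(candidate, traffic_fraction=fraction):
            rollback(baseline)
            return f"rolled back at {fraction:.0%} traffic"

    return "promoted"
```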

And it is reversible. A configuration rollback is possible in minutes, and it also covers data: what was written by the new version must be reconcilable with the previous state.

These four properties are not added to an AI system after the fact. They must be architected from the design stage.

Compliance by design

AI and critical systems

When AI is embedded in a regulated system (banking, insurance, healthcare, public sector), the requirements increase significantly. Compliance by design becomes a structural condition, not a finishing step.

The European AI Act classifies AI systems into four risk levels. For those falling under high risk (creditworthiness assessment, credit scoring, decisions on access to regulated services), the obligations cover risk management throughout the lifecycle, governance of training data, automatic decision logging, effective human oversight, and technical documentation kept up to date. None of these obligations bolts on top of an existing AI system without a redesign. They are properties of the system, not organisational processes.

Beyond the AI Act, regulated sectors carry their own constraints: GDPR compliance, medical confidentiality, health-data hosting requirements, ANSSI certifications, evidential archiving. An AI system deployed in a banking onboarding flow must absorb all of these constraints from its conception.

Experience operating critical information systems extends naturally to AI. The same questions structure the work: what gets journaled, how a decision is traced, how the system is supervised in real time, how it evolves without interrupting operations. AI is simply a new kind of component to integrate, operate and maintain in environments that do not tolerate approximation.

Infrastructure

Open source, hosting, sovereignty

The choice between proprietary cloud models and self-hosted open-weight models is not ideological. It is an arbitration between data sovereignty, real total cost, the team's operational capacity to manage an inference infrastructure, and the real complexity of the tasks the system must handle.

Open-weight models (Llama, Mistral, Qwen, Gemma) have become a serious option for a growing number of use cases. On simple to moderately complex tasks (extraction, classification, standard content generation, routine code), they are comparable to frontier models. On genuinely difficult tasks, the gap remains real, and it is worth evaluating on the organisation's concrete cases rather than generalising from benchmarks.

Self-hosting becomes economically attractive from a significant inference volume, typically several million tokens per day. Below that, cloud APIs remain cheaper once all the hidden costs are factored in: GPU infrastructure, MLOps skills, updates, fault management. Total cost is not just the per-token price.
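
The arbitration can be made concrete with simple break-even arithmetic; every figure below is an assumption, to be replaced by the organisation's own quotes and measured volumes.

```python
# Illustrative break-even arithmetic; every figure is an assumption to replace
# with the organisation's own quotes and measured volumes.
api_price_per_million = 15.0          # blended frontier-model API price, USD per 1M tokens
self_hosted_fixed_monthly = 4_000.0   # GPU capacity, MLOps time, updates, on-call
self_hosted_marginal_per_million = 0.5

break_even_tokens_per_month = self_hosted_fixed_monthly / (
    api_price_per_million - self_hosted_marginal_per_million
) * 1e6

print(f"break-even ≈ {break_even_tokens_per_month / 30 / 1e6:.1f}M tokens/day")
# With these assumptions, roughly 9M tokens/day: below it the API stays cheaper,
# above it the fixed costs of self-hosting start to pay for themselves.
```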

Sovereignty is the strongest argument for self-hosting. For data covered by medical confidentiality, trade secrets, or information under contractual constraint, self-hosting is not an optimisation, it is a requirement. In air-gapped environments (defence, sensitive industry), there is simply no alternative.

Mature architectures often combine both registers: lightweight self-hosted models for volume and sensitive data, frontier models via API for occasional complex tasks. An LLM router directs each request to the appropriate model.
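
A minimal sketch of such a router; the routing heuristics and model identifiers are illustrative assumptions, not recommendations.

```python
# Sketch of an LLM router; the routing heuristics and model identifiers are
# illustrative assumptions, not recommendations.
def route(request: dict) -> str:
    """Return the model that should handle this request."""
    if request["contains_sensitive_data"]:
        return "self-hosted/llama-3.1-8b-instruct"  # sensitive data never leaves the infrastructure
    if request["estimated_complexity"] == "high":
        return "api/frontier-model"                  # occasional hard tasks go to a frontier model
    return "self-hosted/mistral-small"               # default: volume handled locally

print(route({"contains_sensitive_data": False, "estimated_complexity": "high"}))
```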

This is not a compromise, it is a lucid architecture.

Method

Evaluate before automating

The accuracy of an AI system is not a verdict. It is one indicator among others, which says nothing about auditability, stability over time, fitness for exception cases, or operational cost.

Evaluating an operable AI system requires a structured protocol, defined before industrialisation. This protocol specifies the expected test cases with their reference outputs, tolerable errors and intolerable errors (an awkward phrasing does not weigh the same as an unauthorised action), the escalation thresholds at which the system must acknowledge that it does not know, and the expected behaviour when a downstream tool fails.
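
A minimal sketch of one entry in such a protocol; the field names, labels and example cases are illustrative assumptions.

```python
# Sketch of one entry in the evaluation protocol; field names, labels and
# example cases are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    reference_output: str
    severity_if_wrong: str              # "tolerable" (awkward phrasing) vs "intolerable" (unauthorised action)
    must_escalate: bool                 # the system is expected to acknowledge it does not know
    on_tool_failure: str | None = None  # expected behaviour when a downstream tool fails

cases = [
    EvalCase("Close my account and refund the balance", "escalate_to_human",
             severity_if_wrong="intolerable", must_escalate=True),
    EvalCase("Summarise this contract clause", "<reference summary>",
             severity_if_wrong="tolerable", must_escalate=False,
             on_tool_failure="explain the failure, do not invent a result"),
]
```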

The automation threshold is an architectural decision, not a configuration decision. It materialises an arbitration between fluidity, risk and operational cost. Too low, the system does nothing useful and every request escalates to human review. Too high, decisions reach production without appropriate control, and the cost of an incident far exceeds the gains from automation.
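
One way to make that arbitration explicit, assuming each decision carries a risk score; the threshold value below is an illustrative assumption.

```python
# Sketch: the automation threshold as an explicit architectural parameter,
# assuming each decision carries a risk score; the value is illustrative.
AUTOMATION_THRESHOLD = 0.3  # raise it for more autonomy, lower it for more escalation

def dispatch(request, risk_score: float, execute, escalate):
    if risk_score <= AUTOMATION_THRESHOLD:
        return execute(request)   # handled automatically, and journaled
    return escalate(request)      # above the threshold, the system hands over to a human
```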

A good evaluation is not done once. Models drift, user inputs evolve, prompts accumulate. Without regular evals in production, you are not operating an AI system, you are hoping it keeps responding well. That is not the same thing.

Field experience

What we actually observe in production

Several phenomena recur across deployments of AI systems in professional environments. They are not anecdotal: they outline the zones where systems engineering falls short, regardless of the model used.

Non-reproducible behaviour. A system processes the same request twice and produces two different results, sometimes both acceptable, sometimes divergent to the point of becoming a problem. The probabilistic component of the model propagates through the entire chain when it involves several steps (retrieval, planning, tool calls). Reproducing an incident becomes a statistical task, not a debugging task.

External API dependencies. A growing fraction of AI systems relies on models provided by third parties. Every endpoint deprecation, every pricing change, every silent adjustment of the base model becomes an external variable the team does not control. Pinning a version, planning an abstraction layer, anticipating a forced migration to another provider: these precautions are rarely taken at the time of the initial choice.
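
A minimal sketch of those precautions taken together, pinning and an abstraction layer; the provider names, model identifiers and client interface are illustrative assumptions.

```python
# Sketch of version pinning behind an abstraction layer; provider names, model
# identifiers and the client interface are illustrative assumptions.
from typing import Protocol

class ChatClient(Protocol):
    def complete(self, prompt: str, model: str) -> str: ...

PINNED_MODELS = {
    # Pin explicit versions, never a floating alias such as "latest".
    "provider_a": "model-x-2024-08-06",
    "provider_b": "model-y-v0.3",
}

def generate(prompt: str, provider: str, clients: dict[str, ChatClient]) -> str:
    """Single entry point: a forced migration changes configuration, not call sites."""
    return clients[provider].complete(prompt, model=PINNED_MODELS[provider])
```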

The absence of real rollback. Technical rollback is easy to design: a configuration switch and the system returns to its previous state. Functional rollback is much harder. The data generated by the version being rolled back does not disappear, it keeps living in downstream systems, in user histories, in reference databases. Reconciling state after rollback requires a design effort that cannot be improvised in an incident.

Incomplete supervision. Teams monitor what they know how to monitor: latency, API error rate, cost per call. They almost never monitor the actual quality of responses produced in production. When the system starts to drift, the signal does not appear in technical dashboards. It appears in user feedback, weeks later.

These observations are not an inventory of pitfalls to avoid. They are the conditions from which an AI system meant to be genuinely operated is designed.

Practice

Our approach to controlled AI

The work begins before the choice of framework, before the choice of model, before the first prompt is written. It begins with an initial scoping that sets the system's functional perimeter, the non-negotiable guardrails, the conditions for escalation to human review, the associated evaluation protocol, and the binding business constraints (AI Act, regulated sector, partner contracts). Without this scoping, the choice of an agentic framework or a model remains largely theoretical.
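
Written down, the scoping can be as plain as a structured record the rest of the project refers to; every field and value below is an illustrative assumption.

```python
# Sketch of an initial scoping record; every field and value is an illustrative
# assumption, to be replaced by the organisation's own decisions.
scoping = {
    "functional_perimeter": ["answer product questions", "draft responses for human review"],
    "guardrails": ["no contractual commitment", "no pricing decision", "no personal data in outputs"],
    "escalation_to_human": ["confidence below threshold", "explicit user request", "regulated decision"],
    "evaluation_protocol": "reference dataset v1 + list of intolerable errors",
    "regulatory_constraints": ["AI Act risk classification", "GDPR", "partner contracts"],
}
```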

Then comes instrumentation. The system's observability is designed in from the first production deployment, not added when an incident makes it necessary. Distributed tracing to reconstruct every call in its full context, technical and quality metrics collected continuously, a timestamped signature of every significant decision to allow after-the-fact auditing. What is observed in production feeds the next iteration.
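
A minimal sketch of the per-decision record this instrumentation produces; the field names and the journaling sink are illustrative assumptions.

```python
# Sketch of the per-decision record produced by this instrumentation; field
# names and the journaling sink are illustrative assumptions.
import json
import uuid
from datetime import datetime, timezone

def log_decision(request, response, model, model_version, tool_calls, sink):
    """Append one traceable record per significant decision of the system."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "model_version": model_version,  # the pinned version, not a floating alias
        "input": request,
        "output": response,
        "tool_calls": tool_calls,        # ordered chain: tool name, parameters, outcome
    }
    sink.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["trace_id"]
```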

Cutover to live operation is progressive. A modification (new prompt, new model version, new RAG chain) traverses a pipeline: automated evals against a reference dataset, then a shadow mode where the new component processes real requests in parallel with the old one without serving its responses, then a canary deployment where a fraction of traffic is progressively switched over. A regression at any stage triggers an immediate configuration rollback.

Operated supervision comes next. Evals run on a regular cadence against production cases, the input distribution is monitored to detect drift (embedding drift), and model updates proposed by the provider are qualified through the same pipeline rather than absorbed silently. Human oversight is not a compliance checkbox; it is an architected control point, with readable interfaces, effective alert thresholds, and the ability to stop the system in seconds if necessary.
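
A minimal sketch of input-drift detection on embeddings, comparing recent production inputs to a reference window; the centroid comparison and the alert threshold are illustrative assumptions, and a full distributional test would be more robust.

```python
# Sketch of input-drift detection: compare the centroid of recent production
# input embeddings to a reference window. The cosine-distance measure and the
# threshold are illustrative assumptions; a full distributional test is more robust.
import numpy as np

def drift_score(reference_embeddings: np.ndarray, recent_embeddings: np.ndarray) -> float:
    ref = reference_embeddings.mean(axis=0)
    cur = recent_embeddings.mean(axis=0)
    cosine = float(np.dot(ref, cur) / (np.linalg.norm(ref) * np.linalg.norm(cur)))
    return 1.0 - cosine  # 0 means identical centroids; higher means drift

def input_drift_alert(reference, recent, threshold: float = 0.15) -> bool:
    return drift_score(reference, recent) > threshold  # above the threshold: alert and investigate
```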

It is the natural extension of our work on critical systems: an AI system is simply a new kind of component to integrate, operate and maintain in environments that do not tolerate approximation.

Scoping an AI system, structuring its evaluation, its supervision and its maintenance over time. Software development and managed services for trust-critical systems.