The POC works. The model responds well in the demo. The team is enthusiastic. Now what?
Putting an LLM into production in a real business context is an operation of categorically different complexity from deploying a classic API. Teams that discovered this after the fact paid a high price: off-topic responses in production, exploding inference costs, an inability to audit problematic behaviour, undetected regressions after a prompt change.
LLMOps is the set of practices that prevent these pitfalls. Here is what it actually covers.
What makes LLMs different from other software components
A classic software component is deterministic: the same inputs produce the same outputs. An LLM is probabilistic: the same inputs produce outputs that vary according to temperature, sampling, and the internal state of the model.
This property fundamentally changes how you test, monitor and maintain the system.
No classic unit tests. You cannot write assertEqual(llm.generate(prompt), expected_output). You write probabilistic evaluations: “over 100 calls with this prompt, at least 90% of responses must satisfy these criteria.”
No classic versioning. When the provider updates their base model (GPT-4 → GPT-4o, Claude 2 → Claude 3), your system changes, even if you have not touched anything. And you may not know about it.
No classic debugging. Understanding why the model produced a particular response is not as simple as reading a stack trace.
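The probabilistic testing idea above can be sketched in a few lines. This is a minimal illustration, not a real harness: the `generate` function is a stub standing in for an actual model call, and the 90% threshold is the example figure from the text.

```python
import json

def generate(prompt: str) -> str:
    # Stub standing in for a real LLM API call, so the sketch runs offline.
    return json.dumps({"date": "2024-03-15"})

def passes(response: str) -> bool:
    # Example criterion: the response is valid JSON containing a "date" key.
    try:
        return "date" in json.loads(response)
    except json.JSONDecodeError:
        return False

def eval_pass_rate(prompt: str, n: int = 100) -> float:
    # Run the same prompt n times and measure the fraction that pass.
    return sum(passes(generate(prompt)) for _ in range(n)) / n

# The probabilistic assertion: over 100 calls, at least 90% must satisfy
# the criteria. No single call is asserted on.
assert eval_pass_rate("Extract the date as JSON: ...") >= 0.90
```

The assertion tolerates individual failures by design; that tolerance is exactly what distinguishes an eval from a unit test.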

[Figure: AI agent performance on tasks requiring 30 min to 8 h for a skilled human. Source: METR, Evaluating Frontier AI (August 2024).] The progression from 2019 to 2024 illustrates why LLMOps practices are becoming critical: models evolve fast, and the systems that embed them must keep up.
The 5 pillars of LLMOps
1. Managing prompts as code
Prompt engineering is development. Prompts must be versioned, reviewed, tested and deployed with the same rigour as application code.
What this means concretely:
- Prompts are stored in the code repository (or a dedicated system like LangSmith, PromptLayer)
- Every prompt change goes through a review and an evaluation suite
- The deployment of a new prompt follows a CI/CD pipeline with automated tests
- The prompt version is logged with every production call
A prompt modified without these guardrails can silently degrade response quality for days before anyone notices.
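One way to make "the prompt version is logged with every production call" concrete is to content-address the prompt template, so any change to its text changes the version identifier. A minimal sketch, with a hypothetical template and log record:

```python
import hashlib

# Illustrative prompt template; in practice this would live in the repo
# or in a dedicated prompt-management system.
PROMPT_TEMPLATE = """You are a billing assistant. Answer in three sentences max.
Question: {question}"""

def prompt_version(template: str) -> str:
    # Content-addressed version: any edit to the template, however small,
    # produces a different hash, so every logged call traces back to the
    # exact prompt text that produced it.
    return hashlib.sha256(template.encode()).hexdigest()[:12]

def log_call(question: str) -> dict:
    # Hypothetical log record attached to every production call.
    return {
        "prompt_version": prompt_version(PROMPT_TEMPLATE),
        "prompt": PROMPT_TEMPLATE.format(question=question),
    }

record = log_call("Where is my invoice?")
```

A content hash avoids the classic failure mode of manually bumped version numbers: someone edits the prompt and forgets to bump.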
2. Systematic evaluation (evals)
Evals are the LLM equivalent of automated tests. They are sets of test cases with evaluation criteria, run before each deployment and regularly in production.
Types of evals:
Criterion-based evals: the model must extract a date from a text. The extracted date is either correct or it is not. Exactly measurable.
Model-as-judge evals: a second LLM (often more powerful) evaluates response quality according to defined criteria (relevance, accuracy, tone, length). More flexible but introduces its own variance.
Human evals: a panel of human evaluators scores responses according to a rubric. Expensive, slow, but the only way to evaluate subjective dimensions.
What to measure:
- Rate of responses in the expected format (valid JSON, length respected, etc.)
- Rate of inappropriate refusals (the model refuses to answer when it should have answered)
- Rate of detectable hallucinations (verifiable factual claims)
- Consistency across calls (same question → coherent response)
- Drift from a reference baseline
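A criterion-based eval like the date-extraction example above reduces to an exact-match pass rate over a suite of cases. In this sketch the extraction step is stubbed with a regex so it runs offline; in a real eval that function would call the model, and the 90% gate is illustrative:

```python
import re

# Each case pairs an input text with the expected extracted date.
CASES = [
    ("Meeting moved to 2024-06-01.", "2024-06-01"),
    ("Deadline: 2024-07-15, confirmed.", "2024-07-15"),
]

def extract_date(text: str) -> str:
    # Stub for the LLM extraction step, so the sketch runs without an API.
    m = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return m.group() if m else ""

def run_eval(cases) -> float:
    # Exact match: the extracted date is either correct or it is not.
    return sum(extract_date(text) == want for text, want in cases) / len(cases)

# Regression gate run before each deployment.
assert run_eval(CASES) >= 0.90
```

The same structure scales to any exactly measurable criterion; only `extract_date` and the case list change.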
3. Observability in production
Monitoring an LLM in production goes well beyond latency and HTTP error rate.
Technical metrics:
- Time-to-first-token (TTFT) latency and total latency
- Number of tokens consumed (input + output), directly correlated with cost
- API error rate (timeout, rate limiting, context length exceeded)
- Cost per call, cost per user session, total cost
Quality metrics:
- Distribution of automatic evaluation scores on production calls
- Negative user feedback rate (if a feedback mechanism exists)
- Response length anomalies (abnormally short or long responses)
- Detection of off-topic response patterns
Tracing: LLM calls are embedded in larger processing chains (RAG, agents, multi-step). Distributed tracing (LangSmith, Arize Phoenix, or OpenTelemetry with custom spans) allows you to reconstruct the full context of a problematic call.
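Since token counts are directly correlated with cost, a per-call metrics record can derive cost from the usage figures the API returns. The prices below are placeholders (USD per million tokens), not any provider's actual rates:

```python
from dataclasses import dataclass

# Hypothetical per-token prices, USD per 1M tokens. Check your provider's
# current pricing before using real numbers.
PRICE_IN, PRICE_OUT = 2.50, 10.00

@dataclass
class CallMetrics:
    ttft_ms: float    # time-to-first-token latency
    total_ms: float   # total latency
    tokens_in: int
    tokens_out: int

    @property
    def cost(self) -> float:
        # Input and output tokens are usually billed at different rates.
        return (self.tokens_in * PRICE_IN + self.tokens_out * PRICE_OUT) / 1_000_000

m = CallMetrics(ttft_ms=320, total_ms=2100, tokens_in=1200, tokens_out=400)
```

Aggregating these records per user session gives the cost-per-session metric directly.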
4. Managing drift
Drift in LLMOps has three sources:
Model drift: the provider updates the underlying model. Even if the endpoint name stays the same (“gpt-4”), the behaviour may change. The only protection is pinning the exact model version (gpt-4-0613 rather than gpt-4) and having an eval suite that detects regressions.
Data drift: the distribution of user inputs changes over time. A model trained or fine-tuned on January data may perform poorly on July patterns. Monitoring the distribution of inputs (embedding drift) allows you to detect these shifts.
Prompt drift: the accumulation of prompt modifications can create undetected inconsistencies if each modification is not tested in the global context.
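Embedding drift can be monitored with something as simple as the cosine distance between centroids of input embeddings over two time windows. A toy sketch with 2-dimensional vectors (real embeddings have hundreds of dimensions, and the alert threshold must be tuned on your own traffic):

```python
import math

def centroid(vectors):
    # Component-wise mean of a list of equal-length vectors.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Toy data: input embeddings from two periods pointing in different
# directions, mimicking a shift in what users ask about.
january = [[1.0, 0.0], [0.9, 0.1]]
july = [[0.1, 1.0], [0.0, 0.9]]

drift = cosine_distance(centroid(january), centroid(july))
# A distance above a tuned threshold flags a shift worth investigating.
```

Windowed centroid comparison is deliberately crude; it catches gross shifts cheaply, and more sensitive statistical tests can sit behind it.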
5. Cost management
The cost of an LLM in production is proportional to the number of tokens processed. A poorly designed system can generate costs 10x higher than initial estimates.
Cost control levers:
Caching: responses to identical or very similar queries can be cached. LLM providers often offer prompt caching (system prompt tokens are not rebilled on each call).
Intelligent routing: not all requests require the most powerful model. A router that directs simple requests to a less expensive model (GPT-4o mini, Haiku) and complex requests to the premium model can divide costs by 5 to 10.
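A router can start as a simple heuristic before graduating to a trained classifier. This sketch is purely illustrative: the markers, the length cutoff, and the model names are assumptions, not a recommended policy:

```python
def route(request: str) -> str:
    # Hypothetical heuristic: long requests or requests containing
    # reasoning-heavy markers go to the premium model; everything else
    # goes to the cheaper model. Real routers often use a small classifier.
    complex_markers = ("explain", "compare", "analyse", "step by step")
    if len(request) > 500 or any(m in request.lower() for m in complex_markers):
        return "premium-model"
    return "small-model"
```

The savings come from the traffic mix: if 80% of requests are simple, 80% of calls move to the cheap tier, which is where the 5x to 10x figure comes from.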
Prompt optimisation: verbose prompts cost more. A regular audit of prompts to remove redundant instructions reduces consumption.
Batch processing: for non-real-time use cases (report generation, data enrichment), the batch API is significantly cheaper (50% reduction at OpenAI).
The LLM deployment pipeline
A prompt or model change should move through the same stages as application code: the offline eval suite first, then shadow mode on live traffic, then a progressive rollout under monitoring. Shadow mode is particularly valuable: it allows you to compare the behaviour of the new prompt with the old one on real requests, without risk to users.
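The shadow-mode idea fits in a few lines: both prompt versions run on the same real request, only the current one is served, and the divergence is logged. The two prompt functions here are stand-ins for actual versioned prompts:

```python
def current_prompt(question: str) -> str:
    # Stand-in for the prompt version currently serving users.
    return f"v1: {question}"

def candidate_prompt(question: str) -> str:
    # Stand-in for the new prompt version under evaluation.
    return f"v2: {question}"

def shadow_compare(question: str) -> dict:
    # The user only ever sees the current prompt's response; the candidate
    # runs on the same request and its output is logged for offline review.
    served = current_prompt(question)
    shadow = candidate_prompt(question)
    return {"served": served, "shadow": shadow, "diverged": served != shadow}

result = shadow_compare("Where is my invoice?")
```

In practice the shadow call runs asynchronously so it adds no latency to the user-facing path, and the divergence rate becomes an eval input.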
What the AI Act changes for LLMOps
For high-risk AI systems (decision-making systems in regulated domains such as finance, health, employment, justice), the AI Act imposes requirements that translate directly into LLMOps practices:
- Logging: retention of usage logs enabling after-the-fact auditability
- Human oversight: defined human control points for high-impact decisions
- Accuracy and robustness: documented and maintained evaluation metrics
- Transparency: documentation of the system’s capabilities and limitations
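The logging and human-oversight requirements above translate into an audit record written on every high-impact decision. A hypothetical minimal schema (field names are illustrative, not mandated by the AI Act):

```python
import json
import time

def audit_record(user_id: str, prompt_version: str, model: str,
                 decision: str, human_reviewed: bool) -> str:
    # Hypothetical audit entry: who, when, which exact prompt and model
    # versions were involved, what was decided, and whether a human
    # signed off. Stored as JSON for long-term retention and querying.
    return json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user_id": user_id,
        "prompt_version": prompt_version,
        "model": model,
        "decision": decision,
        "human_reviewed": human_reviewed,
    })

entry = audit_record("u-42", "a1b2c3", "gpt-4-0613", "loan_approved", True)
```

Pinning the exact model version in each record is what makes later audits possible even after the provider has moved the default endpoint on.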
These requirements are not incompatible with efficient operation, provided they are integrated from the design stage, not as an afterthought.
Conclusion
Putting an LLM in production without LLMOps practices is like deploying an application without monitoring or tests. It may work for a while, but problems arrive at the worst moment.
The good news: LLMOps practices were not invented from scratch. They draw heavily on what the DevOps world has built over 20 years, adapted to the probabilistic specificities of language models. Teams with a solid MLOps culture are already halfway there.
The other half is understanding what is fundamentally new: probabilistic evaluation, model drift management, and traceability in multi-step agentic systems.
Are you industrialising an AI use case in a critical business context? Let’s talk about your LLMOps architecture.