
Blog | 13 Mar 2026

How To Build Reliable AI Systems

A programmer's guide to engineering AI systems that actually work at scale.

10 min read

#ai #llms #rag


We've all been there.

You open ChatGPT, drop a prompt. "Extract all emails from this sheet and categorize by sentiment." It gives you something close. You correct it, it apologizes, and gives you a new version. You ask for a different format, and suddenly, it's lost all context from earlier, and you're starting over.

Errors like that are fine for small one-off tasks, but they're a disaster for production systems. The gap between "this worked in my ChatGPT conversation" and "this runs reliably in production" is massive. It's not closed by better prompts. It's closed by engineering.

This article is about that engineering. You'll learn the architecture patterns, failure modes, and implementation strategies that separate AI experiments from AI products.

What you'll learn

In this tutorial, you'll learn how to:

  • Understand why AI systems fail differently from traditional software

  • Identify and prevent the three critical failure modes in production AI

  • Implement the validator sandwich pattern for consistent outputs

  • Build observable pipelines with proper monitoring and alerting

  • Control costs at scale with rate limiting and circuit breakers

  • Design a complete production-ready AI architecture

Prerequisites

To get the most from this tutorial, you should have:

  • Basic understanding of any programming language

  • Familiarity with REST APIs and asynchronous programming

  • Experience with at least one LLM API (OpenAI, Anthropic, or similar)

  • Node.js installed locally (optional, for running code examples)

You don't need to be an expert in any of these—intermediate knowledge is sufficient.




What makes AI systems fundamentally different

Traditional software is deterministic. You write if (urgency > 8) { return 'high' } and it does exactly that, every single time. Same input, same output. Forever. You can write unit tests that cover every path. You can predict every failure mode.

AI systems, on the other hand, are probabilistic. You ask an LLM to classify urgency and sometimes it says "high," sometimes "urgent," sometimes it gives you a 1–10 score, sometimes it writes a paragraph explaining its reasoning. Same input, different outputs—depending on temperature settings, model version, context window, and factors you can't fully control.

Here's what that looks like in practice:

| Challenge | Traditional systems | AI systems |
| --- | --- | --- |
| Consistency | 100% reproducible | Varies per request |
| Debugging | Stack traces, logs | "The model just changed its behaviour." |
| Testing | Unit tests cover all paths | Can't test all possible outputs |
| Deployment | Deploy once, works forever | Degrades over time (data drift) |
| Failure modes | Predictable, finite | Creative, infinite |

The engineering challenge is: how do you build reliability on top of inherent unpredictability?

The answer is not "use a better model." The model is maybe 20% of the solution. The remaining 80% is the system you build around it.


Failure mode #1: Inconsistent outputs

The problem

You ask the AI to extract a customer email from a support ticket. Sometimes you get the email back. Sometimes you get just the name. Sometimes you get a phone number. The format changes every time. Same prompt, different outputs.

Prompt: "Extract the customer email from this support ticket"
Output on Monday: "john@example.com"
Output on Tuesday: "Customer email: john@example.com (verified)"
Output on Wednesday: "John Doe"
Output on Thursday: {
  "customer_info": {
    "email": "john@example.com"
  }
}

Three of the four outputs contain the correct email (Wednesday returns a name instead), and none of them share a format you can parse programmatically. You can't route tickets, trigger workflow systems, or integrate with other code when the response format changes per request.

The solution: The validator sandwich pattern

The validator sandwich pattern (also called the guardrails pattern) ensures the AI system does not generate or process the wrong data by sandwiching your AI between two layers of deterministic code.

Validator Sandwich Pattern

Essentially, we have three layers:

  1. The top bun - Input guardrails (deterministic)

  2. The meat - The LLM (probabilistic)

  3. The bottom bun - Output guardrails (deterministic)

Let's break down each layer.

The top bun: Input guardrails

Before anything touches the AI, validate it. Reject garbage immediately; fail fast and cheaply. Here's a basic example with deterministic code that checks the data being received:

function validateTicketInput(raw: any): TicketInput {
  // Type checks
  if (!raw.email || typeof raw.email !== 'string') {
    throw new ValidationError('Missing or invalid email');
  }

  // Format checks
  if (!isValidEmail(raw.email)) {
    throw new ValidationError(`Invalid email format: ${raw.email}`);
  }

  // Range checks
  if (!raw.body || raw.body.length < 10) {
    throw new ValidationError('Ticket body too short to classify');
  }
  if (raw.body.length > 10000) {
    throw new ValidationError('Ticket body exceeds max length');
  }

  // Return typed, validated input
  return {
    email: raw.email.toLowerCase().trim(),
    subject: raw.subject?.trim() || 'No subject',
    body: raw.body.trim(),
    timestamp: new Date(raw.timestamp),
  };
}

This runs before the LLM is ever called. It's fast, cheap, and deterministic. It catches easy failures immediately.
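The snippet above assumes a `ValidationError` class and an `isValidEmail` helper that aren't shown in the article. A minimal sketch of both, with a deliberately simple email check, might look like this:

```typescript
// Hypothetical helpers assumed by the input guardrail above.
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'ValidationError';
  }
}

// A deliberately loose email check: one "@", a dot in the domain,
// no whitespace. Swap in a stricter validator for production use.
function isValidEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}
```

Distinguishing `ValidationError` from generic errors lets callers return a 400 to the client instead of treating bad input as a server fault.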

The meat: Structured outputs from the LLM

Stop asking the AI for free text. Force it into a schema. Most modern APIs support this directly.

Here's an example using Anthropic's Claude API:

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5',
  max_tokens: 500,
  system: `You are a support ticket classifier.
You MUST respond with valid JSON matching this exact schema:
{
  "category": "bug" | "billing" | "feature" | "other",
  "confidence": number between 0 and 1,
  "priority": integer 1-5,
  "reasoning": "one sentence explanation"
}
Do not include any text outside the JSON object.`,
  messages: [
    {
      role: 'user',
      content: `Classify this ticket:
Subject: ${ticket.subject}
Body: ${ticket.body}`,
    },
  ],
});

const classification = JSON.parse(response.content[0].text);

The key difference: you're making the AI conform to your format instead of hoping it does the right thing.
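One caveat: the bare `JSON.parse` above throws if the model wraps its JSON in prose despite the instructions. A defensive parser can recover from the common case. This is a heuristic sketch, not a guarantee:

```typescript
// Defensive parsing of model output: try a direct JSON.parse first,
// then fall back to extracting the first {...} block from the text.
// The fallback is a heuristic; truly malformed output still throws,
// which is what you want — fail loudly, not silently.
function parseModelJson(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    const start = text.indexOf('{');
    const end = text.lastIndexOf('}');
    if (start === -1 || end <= start) {
      throw new Error(`No JSON object found in model output: ${text.slice(0, 80)}`);
    }
    return JSON.parse(text.slice(start, end + 1));
  }
}
```

Anything `parseModelJson` returns should still go through the output guardrails in the next section before you trust it.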

The bottom bun: Output guardrails

This is the most critical layer. LLMs will hallucinate. This layer catches those hallucinations before they break your database or confuse your users.

You got a structured response. Now validate it aggressively before you use it:

function validateClassification(raw: any): Classification {
  const required = ['category', 'confidence', 'priority', 'reasoning'];
  for (const field of required) {
    if (raw[field] === undefined || raw[field] === null) {
      throw new ValidationError(`Missing required field: ${field}`);
    }
  }

  if (!['bug', 'billing', 'feature', 'other'].includes(raw.category)) {
    throw new ValidationError(`Invalid category: ${raw.category}`);
  }

  if (
    typeof raw.confidence !== 'number' ||
    raw.confidence < 0 ||
    raw.confidence > 1
  ) {
    throw new ValidationError(`Invalid confidence: ${raw.confidence}`);
  }

  if (!Number.isInteger(raw.priority) || raw.priority < 1 || raw.priority > 5) {
    throw new ValidationError(`Invalid priority: ${raw.priority}`);
  }

  if (raw.category === 'billing' && raw.priority > 3) {
    logger.warn('Suspicious: billing classified as low priority', raw);
  }

  return raw as Classification;
}

The deterministic rule

Here's a rule to follow religiously:

If it can be solved with an if-statement, don't use AI.

Email format validation? Use regex. Date parsing? Use a date library. Checking if a string contains a keyword? Use a string method. Math? Use actual math.

AI is expensive and probabilistic. Traditional code is free, instant, and deterministic. Use AI for genuinely ambiguous tasks: extracting meaning from unstructured text, generating content, and reasoning about complex inputs. Let deterministic code handle everything else.
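To make the rule concrete, here's a sketch of three tasks that should never reach an LLM. The function names are illustrative, not from the article:

```typescript
// Deterministic checks that should never reach the LLM.
// Each is a task the rule above says to solve with plain code.

// Checking if a string contains a keyword? Use a string method.
function containsKeyword(text: string, keyword: string): boolean {
  return text.toLowerCase().includes(keyword.toLowerCase());
}

// Date logic? Use date arithmetic.
function isRecent(dateIso: string, maxAgeDays: number): boolean {
  const ageMs = Date.now() - new Date(dateIso).getTime();
  return ageMs <= maxAgeDays * 24 * 60 * 60 * 1000;
}

// Math? Use actual math (rounded to cents).
function refundAmount(total: number, refundRate: number): number {
  return Math.round(total * refundRate * 100) / 100;
}
```

Each of these runs in microseconds, costs nothing, and returns the same answer every time, which is exactly what the probabilistic layer cannot promise.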


Failure mode #2: Silent failures

The problem

Silent failures are common in AI workflows: accuracy degrades, training data goes stale, classifications drift, and models hallucinate. This is the scariest failure mode because you don't know it's happening.

Consider accuracy drift. You trained your model on 2024 data. It's now mid-2026. Your vendors changed their invoice formats. Your classification accuracy has drifted from 95% down to 71%. You won't know until you do a quarterly audit—and by then, thousands of records have been processed incorrectly.

The principle is simple: you cannot fix what you cannot see.

The solution: Observable pipelines

Every production AI system needs observability baked in from day one. Here's how this plays out in a production system:

Observable Pipeline

With monitoring, you can detect issues in your system and address them as soon as possible. Monitoring doesn't just catch problems; it gives you the data to diagnose and fix them in hours instead of months.

Metrics that matter

| Metric | Why it matters |
| --- | --- |
| Response time | API health, model issues |
| Confidence | Model degradation |
| Human override rate | Output quality problems |
| Error rate | System failures |
| Cost per request | Budget control |
| Token usage trend | Prompt efficiency |

The goal is not to remove humans from the loop; it's to involve humans only when the system is genuinely uncertain.
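That principle reduces to a confidence gate. A minimal sketch, where the 0.85 threshold is an illustrative assumption you'd tune against your human-override-rate metric:

```typescript
// Route a result based on model confidence. The 0.85 default is an
// illustrative assumption — tune it against your override-rate metric.
type Route = 'auto' | 'human_review';

function routeByConfidence(confidence: number, threshold = 0.85): Route {
  if (confidence < 0 || confidence > 1) {
    throw new Error(`Confidence out of range: ${confidence}`);
  }
  return confidence >= threshold ? 'auto' : 'human_review';
}
```

If the human-review queue starts filling up, that itself is a signal: either the threshold is too strict, or the model is genuinely degrading.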


Failure mode #3: Uncontrolled costs

The problem

You test your workflow with 10 tickets. It works great and costs 50 cents. You deploy to production. 1,000 requests hit your API. Your bill: $500 for the day.

Or you write a retry loop incorrectly. It creates infinite API calls. Your bill: $5,000 for the day.

Or you're using the most expensive model for everything, including simple tasks that a cheaper model could handle.

The reality: "works for 10 requests" ≠ "works for 10,000 requests." Scale changes everything.

The solution: Gated pipelines with circuit breakers

To move from a fragile prototype to a robust production system, you must abandon the naive approach of directly connecting user inputs to LLM APIs. Instead, implement a gated pipeline.

Think of this architecture as a series of blast doors. A request must successfully pass through each gate before it earns the right to cost you money. If any gate closes, the request is rejected cheaply and quickly, protecting your budget and your upstream dependencies.

Gated Pipeline Architecture

From the diagram above, these gates are:

  1. The rate limiter

  2. The cache check

  3. The request queue

  4. The circuit breaker

Let's examine each one.

Gate 1: Rate limiting

The first line of defence stops abuse before it enters your system. In standard web development, rate limiting is about protecting the server CPU. In AI development, it's about protecting your wallet.
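A minimal in-memory sketch of a fixed-window limiter keyed by user ID follows. The window size and request cap are illustrative assumptions; production systems usually back this state with Redis so it survives restarts and scales across instances:

```typescript
// Fixed-window rate limiter keyed by user ID. Constants are
// illustrative; production versions typically live in Redis.
const WINDOW_MS = 60_000;   // 1-minute window (assumption)
const MAX_REQUESTS = 10;    // per user per window (assumption)
const windows = new Map<string, { start: number; count: number }>();

function allowRequest(userId: string, now = Date.now()): boolean {
  const w = windows.get(userId);
  if (!w || now - w.start >= WINDOW_MS) {
    // First request in a fresh window
    windows.set(userId, { start: now, count: 1 });
    return true;
  }
  if (w.count >= MAX_REQUESTS) return false; // reject before spending money
  w.count += 1;
  return true;
}
```

The rejection costs a map lookup; the request it blocks could have cost real money.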

Gate 2: Cache check

The cheapest LLM API call is the one you never have to make. Many AI requests are repeated or highly similar. Cache aggressively.
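A naive in-memory exact-match cache keyed by prompt looks like this. Real systems usually hash the key, add a TTL, and sometimes go further with semantic (similarity-based) caching, none of which this sketch attempts:

```typescript
// Exact-match prompt cache. `callModel` is an injected dependency so
// the cache layer stays independent of any particular LLM provider.
const cache = new Map<string, string>();

async function cachedCompletion(
  prompt: string,
  callModel: (p: string) => Promise<string>,
): Promise<string> {
  const hit = cache.get(prompt);
  if (hit !== undefined) return hit; // the cheapest call is no call
  const result = await callModel(prompt);
  cache.set(prompt, result);
  return result;
}
```

Note that caching only makes sense for idempotent prompts (classification, extraction); don't cache anything where freshness or per-user context matters.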

Gate 3: Request queue

LLM APIs are not like standard REST APIs; requests often take 10–30 seconds to complete. If 500 users hit "submit" simultaneously, your server cannot open 500 simultaneous connections without crashing or hitting provider concurrency limits. A request queue solves this by batching requests and processing them at a controlled rate.
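The queue can be sketched as a concurrency limiter: at most `limit` jobs run at once, and the rest wait in FIFO order. Provider batch APIs are a heavier-weight alternative this sketch doesn't cover:

```typescript
// Concurrency limiter: at most `limit` jobs in flight, the rest queued.
function createLimiter(limit: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function run<T>(job: () => Promise<T>): Promise<T> {
    if (active >= limit) {
      // Park this job until a running one finishes
      await new Promise<void>((resolve) => waiting.push(resolve));
    }
    active += 1;
    try {
      return await job();
    } finally {
      active -= 1;
      waiting.shift()?.(); // wake the next queued job, if any
    }
  };
}
```

With `const run = createLimiter(5)`, those 500 simultaneous submissions become 5 in-flight requests and a 495-deep queue, which is exactly the pressure valve the provider's concurrency limits demand.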

Gate 4: Circuit breaker

Retry logic is necessary for transient network blips, but it is destructive during a real outage. If an LLM provider is experiencing downtime and returning 500 errors, a naive retry loop will frantically hammer their API, wasting your money on failed requests.
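A minimal breaker tracks consecutive failures: once the count crosses a threshold, calls fail instantly (no API spend) until a cooldown elapses, when one trial call is allowed through. Thresholds here are illustrative assumptions:

```typescript
// Minimal circuit breaker. After `maxFailures` consecutive failures the
// circuit opens and rejects calls instantly until `cooldownMs` elapses,
// at which point one trial call may pass through and close it.
function createBreaker(maxFailures: number, cooldownMs: number) {
  let failures = 0;
  let openedAt = 0;

  return async function call<T>(fn: () => Promise<T>): Promise<T> {
    if (failures >= maxFailures && Date.now() - openedAt < cooldownMs) {
      throw new Error('Circuit open: failing fast instead of calling the API');
    }
    try {
      const result = await fn();
      failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      failures += 1;
      openedAt = Date.now();
      throw err;
    }
  };
}
```

During a provider outage, the difference is stark: a naive retry loop keeps paying for 500 errors, while an open breaker turns every attempt into a free, instant local failure.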


How to build a complete production architecture

When you combine the solutions to all three failure modes (consistent outputs, observability, and cost control), you get a complete production architecture.

Full Architecture

When you solve for all three major failure modes (inconsistent outputs, silent failures, and uncontrolled costs), you graduate from a simple script to a true enterprise-grade system. This architecture doesn't just generate text; it actively protects itself, manages resources, and learns from its mistakes.


Conclusion: Engineering over prompting

The teams winning with AI right now aren't winning because they have better models. They're winning because they've built better systems around imperfect models.

Any company can call the OpenAI API. The ones that pull ahead are the ones who wrap that API call in validation, observability, cost controls, and thoughtful architecture — the ones who treat AI as a component in an assembly line, not a creative partner in a conversation.

The three things every production AI system needs:

  1. Structure: Validators, schemas, deterministic layers that enforce consistency and eliminate unpredictability at the edges.

  2. Visibility: Logging, monitoring, and alerting so you catch problems in hours, not months. Observable pipelines that let you see exactly what the system is doing and why.

  3. Control: Rate limits, caching, circuit breakers, and cost gates so scale doesn't turn your experiment into a budget emergency.

Reliable AI workflows aren't about better prompts. They're about better architecture around unreliable components.


If you found this helpful, you can connect with me on LinkedIn or subscribe to my newsletter. You can also visit my website.

© 2026 Jide Abdul-Qudus. All rights reserved.

Built with Gatsby & Coffee.