Blog | 13 Mar 2026
How To Build Reliable AI Systems
A programmer's guide to engineering AI systems that actually work at scale.
10 min read
We've all been there.
You open ChatGPT, drop a prompt. "Extract all emails from this sheet and categorize by sentiment." It gives you something close. You correct it, it apologizes, and gives you a new version. You ask for a different format, and suddenly, it's lost all context from earlier, and you're starting over.
Errors like that are tolerable for small, one-off tasks, but they're a disaster for production systems. The gap between "this worked in my ChatGPT conversation" and "this runs reliably in production" is massive. It's not closed by better prompts. It's closed by engineering.
This article is about that engineering. You'll learn the architecture patterns, failure modes, and implementation strategies that separate AI experiments from AI products.
What you'll learn
In this tutorial, you'll learn how to:
- Understand why AI systems fail differently from traditional software
- Identify and prevent the three critical failure modes in production AI
- Implement the validator sandwich pattern for consistent outputs
- Build observable pipelines with proper monitoring and alerting
- Control costs at scale with rate limiting and circuit breakers
- Design a complete production-ready AI architecture
Prerequisites
To get the most from this tutorial, you should have:
- Basic understanding of any programming language
- Familiarity with REST APIs and asynchronous programming
- Experience with at least one LLM API (OpenAI, Anthropic, or similar)
- Node.js installed locally (optional, for running code examples)
You don't need to be an expert in any of these—intermediate knowledge is sufficient.
What makes AI systems fundamentally different
Traditional software is deterministic. You write if (urgency > 8) { return 'high' } and it does exactly that, every single time. Same input, same output. Forever. You can write unit tests that cover every path. You can predict every failure mode.
AI systems, on the other hand, are probabilistic. You ask an LLM to classify urgency and sometimes it says "high," sometimes "urgent," sometimes it gives you a 1–10 score, sometimes it writes a paragraph explaining its reasoning. Same input, different outputs—depending on temperature settings, model version, context window, and factors you can't fully control.
Here's what that looks like in practice:
| Challenge | Traditional systems | AI systems |
|---|---|---|
| Consistency | 100% reproducible | Varies per request |
| Debugging | Stack traces, logs | "The model just changed its behaviour" |
| Testing | Unit tests cover all paths | Can't test all possible outputs |
| Deployment | Deploy once, works forever | Degrades over time (data drift) |
| Failure modes | Predictable, finite | Creative, infinite |
The engineering challenge is: how do you build reliability on top of inherent unpredictability?
The answer is not "use a better model." The model is maybe 20% of the solution. The remaining 80% is the system you build around it.
Failure mode #1: Inconsistent outputs
The problem
You ask the AI to extract a customer email from a support ticket. Sometimes you get the email back. Sometimes you get just the name. Sometimes you get a phone number. The format changes every time. Same prompt, different outputs.
Prompt: "Extract the customer email from this support ticket"Output on Monday: "john@example.com"Output on Tuesday: "Customer email: john@example.com (verified)"Output on Wednesday: "John Doe"Output on Thursday: {"customer_info": {"email": "john@example.com"}}
All four outputs contain correct information in some form, but you can't parse them programmatically. You can't route tickets, trigger downstream workflows, or integrate with other code, because the response format changes on every request.
The solution: The validator sandwich pattern
The validator sandwich pattern (also called the guardrails pattern) keeps bad data from entering or leaving the AI by sandwiching the probabilistic LLM call between two layers of deterministic code.

Essentially, we have three layers:
- The top bun - Input guardrails (deterministic)
- The meat - The LLM (probabilistic)
- The bottom bun - Output guardrails (deterministic)
Let's break down each layer.
The top bun: Input guardrails
Before anything touches the AI, validate it. Reject garbage immediately; fail fast and cheaply. Here's a basic example of deterministic code that checks incoming data:
```typescript
function validateTicketInput(raw: any): TicketInput {
  // Type checks
  if (!raw.email || typeof raw.email !== 'string') {
    throw new ValidationError('Missing or invalid email');
  }

  // Format checks
  if (!isValidEmail(raw.email)) {
    throw new ValidationError(`Invalid email format: ${raw.email}`);
  }

  // Range checks
  if (!raw.body || raw.body.length < 10) {
    throw new ValidationError('Ticket body too short to classify');
  }
  if (raw.body.length > 10000) {
    throw new ValidationError('Ticket body exceeds max length');
  }

  // Return typed, validated input
  return {
    email: raw.email.toLowerCase().trim(),
    subject: raw.subject?.trim() || 'No subject',
    body: raw.body.trim(),
    timestamp: new Date(raw.timestamp),
  };
}
```
This runs before the LLM is ever called. It's fast, cheap, and deterministic. It catches easy failures immediately.
The meat: Structured outputs from the LLM
Stop asking the AI for free text. Force it into a schema. Most modern APIs support this directly.
Here's an example using Anthropic's Claude API:
```typescript
const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5',
  max_tokens: 500,
  system: `You are a support ticket classifier.
You MUST respond with valid JSON matching this exact schema:
{
  "category": "bug" | "billing" | "feature" | "other",
  "confidence": number between 0 and 1,
  "priority": integer 1-5,
  "reasoning": "one sentence explanation"
}
Do not include any text outside the JSON object.`,
  messages: [
    {
      role: 'user',
      content: `Classify this ticket:
Subject: ${ticket.subject}
Body: ${ticket.body}`,
    },
  ],
});

const classification = JSON.parse(response.content[0].text);
```
The key difference: you're making the AI conform to your format instead of hoping it does the right thing.
The bottom bun: Output guardrails
This is the most critical layer. LLMs will hallucinate. This layer catches those hallucinations before they break your database or confuse your users.
You got a structured response. Now validate it aggressively before you use it:
```typescript
function validateClassification(raw: any): Classification {
  const required = ['category', 'confidence', 'priority', 'reasoning'];
  for (const field of required) {
    if (raw[field] === undefined || raw[field] === null) {
      throw new ValidationError(`Missing required field: ${field}`);
    }
  }

  if (!['bug', 'billing', 'feature', 'other'].includes(raw.category)) {
    throw new ValidationError(`Invalid category: ${raw.category}`);
  }

  if (
    typeof raw.confidence !== 'number' ||
    raw.confidence < 0 ||
    raw.confidence > 1
  ) {
    throw new ValidationError(`Invalid confidence: ${raw.confidence}`);
  }

  if (!Number.isInteger(raw.priority) || raw.priority < 1 || raw.priority > 5) {
    throw new ValidationError(`Invalid priority: ${raw.priority}`);
  }

  // Sanity-check business rules, not just types
  if (raw.category === 'billing' && raw.priority > 3) {
    logger.warn('Suspicious: billing classified as low priority', raw);
  }

  return raw as Classification;
}
```
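Putting the three layers together, the whole pipeline is deterministic code wrapped around a single probabilistic call. Here's a minimal sketch, assuming the two validators above plus a hypothetical `classifyWithLLM` wrapper around the API call from earlier:

```typescript
async function processTicket(raw: unknown): Promise<Classification> {
  // Top bun: reject garbage before spending money on the LLM
  const ticket = validateTicketInput(raw);

  // The meat: the only probabilistic step in the pipeline
  const rawClassification = await classifyWithLLM(ticket);

  // Bottom bun: never trust the model's output until it passes validation
  return validateClassification(rawClassification);
}
```

If either validator throws, the request fails loudly instead of writing garbage downstream.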
The deterministic rule
Here's a rule to follow religiously:
If it can be solved with an if-statement, don't use AI.
Email format validation? Use regex. Date parsing? Use a date library. Checking if a string contains a keyword? Use a string method. Math? Use actual math.
AI is expensive and probabilistic. Traditional code is free, instant, and deterministic. Use AI for genuinely ambiguous tasks, extracting meaning from unstructured text, generating content, and reasoning about complex inputs. Let deterministic code handle everything else.
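For instance, the `isValidEmail` helper used in the input guardrail earlier needs no model at all. A minimal sketch (real-world email validation has more edge cases than any one regex covers):

```typescript
// Deterministic, free, and instant: no LLM call required
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function isValidEmail(s: string): boolean {
  return EMAIL_RE.test(s.trim());
}
```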
Failure mode #2: Silent failures
The problem
Silent failures are common in AI workflows: accuracy quietly degrades, training data goes stale, and misclassifications pile up without a single error being thrown. This is the scariest failure mode because you don't know it's happening.
Consider accuracy drift. You trained your model on 2024 data. It's now mid-2026. Your vendors changed their invoice formats. Your classification accuracy has drifted from 95% down to 71%. You won't know until you do a quarterly audit—and by then, thousands of records have been processed incorrectly.
The principle is simple: you cannot fix what you cannot see.
The solution: Observable pipelines
Every production AI system needs observability baked in from day one. Here's how this plays out in a production system:

With monitoring in place, you detect issues as they happen. Monitoring doesn't just catch problems; it gives you the data to diagnose and fix them in hours instead of months.
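Here's one way that can look in code: a thin wrapper that records latency, confidence, and errors for every call. This is a minimal sketch, assuming the `processTicket` pipeline from earlier plus hypothetical `metrics` and `logger` clients (StatsD, Prometheus, or whatever you already run):

```typescript
async function observedClassify(raw: unknown): Promise<Classification> {
  const start = Date.now();
  try {
    const result = await processTicket(raw);

    // Emit the metrics that matter (see the table below)
    metrics.timing('llm.response_time_ms', Date.now() - start);
    metrics.gauge('llm.confidence', result.confidence);

    // Flag low-confidence results so drift shows up in days, not quarters
    // (0.7 is an assumed starting point; tune it against your own data)
    if (result.confidence < 0.7) {
      logger.warn('Low-confidence classification', { raw, result });
    }
    return result;
  } catch (err) {
    metrics.increment('llm.error_count');
    throw err;
  }
}
```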
Metrics that matter
| Metric | Why it matters |
|---|---|
| Response time | API health, model issues |
| Confidence | Model degradation |
| Human override rate | Output quality problems |
| Error rate | System failures |
| Cost per request | Budget control |
| Token usage trend | Prompt efficiency |
The goal is not to remove humans from the loop; it's to involve humans only when the system is genuinely uncertain.
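In code, that rule can be a single deterministic gate. A sketch, with an assumed threshold you'd tune against your own human-override data:

```typescript
function route(result: Classification): 'auto' | 'human_review' {
  // High confidence: let the system act on its own
  // (0.8 is an assumed starting point, not a universal constant)
  if (result.confidence >= 0.8) return 'auto';

  // Everything else goes to a person; uncertainty is the trigger
  return 'human_review';
}
```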
Failure mode #3: Uncontrolled costs
The problem
You test your workflow with 10 tickets. It works great and costs 50 cents. You deploy to production. 10,000 requests hit your API. Your bill: $500 for the day.
Or you write a retry loop incorrectly. It creates infinite API calls. Your bill: $5,000 for the day.
Or you're using the most expensive model for everything, including simple tasks that a cheaper model could handle.
The reality: "works for 10 requests" ≠ "works for 10,000 requests." Scale changes everything.
The solution: Gated pipelines with circuit breakers
To move from a fragile prototype to a robust production system, you must abandon the naive approach of directly connecting user inputs to LLM APIs. Instead, implement a gated pipeline.
Think of this architecture as a series of blast doors. A request must successfully pass through each gate before it earns the right to cost you money. If any gate closes, the request is rejected cheaply and quickly, protecting your budget and your upstream dependencies.

From the diagram above, these gates are:
- The rate limiter
- The cache check
- The request queue
- The circuit breaker
Let's examine each one.
Gate 1: Rate limiting
The first line of defence stops abuse before it enters your system. In standard web development, rate limiting is about protecting the server CPU. In AI development, it's about protecting your wallet.
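A minimal sketch of a fixed-window, per-user limiter. It's in-memory for brevity (production systems usually keep counters in Redis so they survive restarts and are shared across instances), and the window and budget numbers are illustrative assumptions:

```typescript
const WINDOW_MS = 60_000;  // 1-minute window (assumed)
const MAX_REQUESTS = 20;   // per user per window (assumed)

const windows = new Map<string, { start: number; count: number }>();

function allowRequest(userId: string): boolean {
  const now = Date.now();
  const w = windows.get(userId);

  // New user or expired window: start counting fresh
  if (!w || now - w.start >= WINDOW_MS) {
    windows.set(userId, { start: now, count: 1 });
    return true;
  }

  // Within the window: reject once the budget is spent
  if (w.count >= MAX_REQUESTS) return false;
  w.count++;
  return true;
}
```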
Gate 2: Cache check
The cheapest LLM API call is the one you never have to make. Many AI requests are repeated or highly similar. Cache aggressively.
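A minimal sketch of an exact-match cache keyed on a hash of the prompt. In-memory again for brevity; a shared store like Redis, and semantic (embedding-based) matching for "highly similar" requests, are the usual production upgrades:

```typescript
import { createHash } from 'node:crypto';

const cache = new Map<string, { value: string; expires: number }>();
const TTL_MS = 60 * 60 * 1000; // cache entries for 1 hour (assumed)

async function cachedComplete(
  prompt: string,
  callLLM: (p: string) => Promise<string>,
): Promise<string> {
  const key = createHash('sha256').update(prompt).digest('hex');

  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) return hit.value; // free: no API call

  const value = await callLLM(prompt);
  cache.set(key, { value, expires: Date.now() + TTL_MS });
  return value;
}
```

A real cache also needs an eviction policy; this sketch only expires entries on read.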
Gate 3: Request queue
LLM APIs are not like standard REST APIs; requests often take 10–30 seconds to complete. If 500 users hit "submit" simultaneously, your server cannot open 500 simultaneous connections without crashing or hitting provider concurrency limits. A request queue solves this by batching requests and processing them at a controlled rate.
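A minimal sketch of a concurrency-capped queue; the cap of 5 is an assumption to match whatever your provider allows, and a library like p-queue handles the edge cases more robustly:

```typescript
const MAX_CONCURRENT = 5; // assumed provider concurrency limit

let active = 0;
const waiting: Array<() => void> = [];

async function enqueue<T>(task: () => Promise<T>): Promise<T> {
  // Wait until a slot is genuinely free (re-check after every wake-up)
  while (active >= MAX_CONCURRENT) {
    await new Promise<void>((resolve) => waiting.push(resolve));
  }
  active++;
  try {
    return await task();
  } finally {
    active--;
    // Wake the next waiting request, if any
    waiting.shift()?.();
  }
}
```

Wrap every LLM call in `enqueue(() => callLLM(prompt))` and 500 simultaneous submissions drain at a steady five at a time.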
Gate 4: Circuit breaker
Retry logic is necessary for transient network blips, but it is destructive during a real outage. If an LLM provider is experiencing downtime and returning 500 errors, a naive retry loop will frantically hammer their API, wasting your money on failed requests.
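A minimal sketch of a circuit breaker that trips after consecutive failures and fails fast during a cooldown. The thresholds are assumptions, and real implementations usually model an explicit half-open state to probe recovery gradually:

```typescript
const FAILURE_THRESHOLD = 5; // consecutive failures before opening (assumed)
const COOLDOWN_MS = 30_000;  // how long to fail fast (assumed)

let failures = 0;
let openedAt = 0;

async function withBreaker<T>(call: () => Promise<T>): Promise<T> {
  // Open circuit: reject immediately instead of paying for doomed retries
  if (failures >= FAILURE_THRESHOLD && Date.now() - openedAt < COOLDOWN_MS) {
    throw new Error('Circuit open: provider appears to be down');
  }
  try {
    const result = await call();
    failures = 0; // success closes the circuit
    return result;
  } catch (err) {
    failures++;
    if (failures >= FAILURE_THRESHOLD) openedAt = Date.now(); // (re)open
    throw err;
  }
}
```

After the cooldown, one request is allowed through as a probe: success closes the circuit, another failure re-opens it.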
How to build a complete production architecture
When you combine all three failure-mode solutions (consistent outputs, observability, and cost control), you get a complete production architecture.

When you solve for all three major failure modes (inconsistent outputs, silent failures, and uncontrolled costs), you graduate from a simple script to a true enterprise-grade system. This architecture doesn't just generate text; it actively protects itself, manages resources, and learns from its mistakes.
Conclusion: Engineering over prompting
The teams winning with AI right now aren't winning because they have better models. They're winning because they've built better systems around imperfect models.
Any company can call the OpenAI API. The ones that pull ahead are the ones who wrap that API call in validation, observability, cost controls, and thoughtful architecture — the ones who treat AI as a component in an assembly line, not a creative partner in a conversation.
The three things every production AI system needs:
- Structure: Validators, schemas, deterministic layers that enforce consistency and eliminate unpredictability at the edges.
- Visibility: Logging, monitoring, and alerting so you catch problems in hours, not months. Observable pipelines that let you see exactly what the system is doing and why.
- Control: Rate limits, caching, circuit breakers, and cost gates so scale doesn't turn your experiment into a budget emergency.
Reliable AI workflows aren't about better prompts. They're about better architecture around unreliable components.
If you found this helpful, you can connect with me on LinkedIn or subscribe to my newsletter. You can also visit my website.