Ch 26. Production Architecture
What you'll learn
- PoC vs. production — what changes when real traffic hits
- Five layers: API · LLM · Retrieval · Session · Observability
- How one LLM call survives: cache → rate limit → call → retry → circuit breaker (five stages), with fallback as the last resort
- Minimum skeleton in FastAPI · async · Redis · tenacity
- Sync/async tradeoffs · idempotency · backpressure
- Seven pitfalls (all-sync · no cache · blind retry · single provider · session in memory · no timeout · health check calling the LLM)
Prerequisites
You've completed Part 5 (single agents, LangGraph, conversation memory). Now we rebuild everything on two assumptions: many users call at once, and failures will happen.
1. Concept: From PoC to production
Your PoC looks like this.
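@app.post("/chat")
def chat(msg: str):
    # max_tokens is required by the Anthropic Messages API; 1024 is an arbitrary choice.
    return client.messages.create(
        model="...", max_tokens=1024, messages=[{"role": "user", "content": msg}]
    )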
This code breaks the moment:
- Two users call at once — single synchronous handler queues requests
- Provider freezes for 30 seconds — our server hangs for 30 seconds too
- Rate limit hit — all users get 500 errors
- Same question asked 100 times — we pay token cost every time
- No trace — we have no idea what got slow
Production architecture splits these five problems into layers, each handling one.
| Layer | Responsibility | Failure impact |
|---|---|---|
| API Gateway | Auth · routing · rate limit · async | Total shutdown |
| LLM Layer | Provider abstraction · cache · retry · routing | Answer quality / cost |
| Retrieval Layer | Vector + lexical · embedding cache · reranking | Answer accuracy |
| Session Store | Thread state · preferences · idempotency | Conversation breaks |
| Observability | Trace · log · cost · latency | Invisible failures |
The key insight: each layer fails and recovers independently. Retrieval dies? The LLM answers from context alone. A provider dies? Route to another. Nothing depends on everything working.
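The "a provider dies, route to another" idea can be sketched with a small interface and an ordered fallback list. This is a minimal illustration, not a real SDK integration: `LLMProvider`, `FlakyProvider`, `EchoProvider`, and `call_with_fallback` are hypothetical names, and the two providers are stand-ins that simulate an outage and a healthy backend.

```python
import asyncio
from typing import Protocol


class LLMProvider(Protocol):
    """Hypothetical interface: the only surface the caller is allowed to see."""

    name: str

    async def call(self, messages: list[dict]) -> str: ...


class FlakyProvider:
    """Stand-in for a primary provider that is currently down."""

    name = "primary"

    async def call(self, messages: list[dict]) -> str:
        raise ConnectionError("primary provider unavailable")


class EchoProvider:
    """Stand-in for a healthy fallback provider."""

    name = "fallback"

    async def call(self, messages: list[dict]) -> str:
        return f"[{self.name}] {messages[-1]['content']}"


async def call_with_fallback(providers: list[LLMProvider], messages: list[dict]) -> str:
    """Try each provider in order and return the first successful answer."""
    last_error = None
    for provider in providers:
        try:
            return await provider.call(messages)
        except Exception as exc:  # sketch only: real code should catch provider-specific errors
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


answer = asyncio.run(
    call_with_fallback([FlakyProvider(), EchoProvider()], [{"role": "user", "content": "hi"}])
)
print(answer)
```

The caller passes a list of providers and never learns which one answered, which is what makes swapping or reordering providers a configuration change rather than a code change.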
2. Why it matters: why one big handler doesn't scale
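① Concurrency. If one LLM call takes 5 seconds and you have four sync workers, you handle 4 / 5 = 0.8 requests per second. With async I/O and a connection pool, one worker can overlap hundreds of in-flight calls: 250 concurrent 5-second calls already amount to 50 req/sec, provided the provider's rate limits allow it.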
② Separation of concerns. Cache, retry, tracing logic mixed into business code bloats your handler to 200 lines. One fix breaks three things.
③ Multi-provider switching. When Anthropic goes down, you need to flip to OpenAI instantly. That requires a provider abstraction layer that the caller knows nothing about.
④ Cost visibility. Trace and logs scattered across services means you can't answer "which endpoint is expensive?" You learn about cost spikes a week late.
⑤ Idempotency. Users refresh the page — same request hits twice. Operations like billing or email can't run twice. You need a deduplication layer.
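The concurrency arithmetic in point ① can be checked with a few lines of standard-library Python. This is a rough sketch under stated assumptions: `asyncio.sleep` stands in for waiting on a provider response, the latency is shortened to 0.1 s so the demo runs quickly, and `sync_throughput`, `fake_llm_call`, and `run_concurrently` are illustrative names.

```python
import asyncio
import time


def sync_throughput(workers: int, latency_s: float) -> float:
    """Requests per second when each worker handles one blocking call at a time."""
    return workers / latency_s


async def fake_llm_call(latency_s: float) -> str:
    # asyncio.sleep stands in for awaiting a provider's HTTP response.
    await asyncio.sleep(latency_s)
    return "ok"


async def run_concurrently(n: int, latency_s: float) -> float:
    """Run n fake calls on one event loop and return the wall-clock time taken."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_llm_call(latency_s) for _ in range(n)))
    assert len(results) == n
    return time.perf_counter() - start


print(sync_throughput(workers=4, latency_s=5.0))  # 0.8
elapsed = asyncio.run(run_concurrently(n=50, latency_s=0.1))
print(f"50 overlapping calls took {elapsed:.2f}s; run one by one they would need 5.00s")
```

Because the calls only wait on I/O, a single event loop overlaps them, so the total time stays close to the latency of one call instead of growing with the number of requests.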
3. Where it's used: three typical patterns
| Scenario | Core decision |
|---|---|
| Synchronous chat (reply within 5 seconds) | FastAPI async + streaming + Redis session |
| Background analysis (10+ seconds) | Queue (Celery/SQS) + webhook notification + idempotency key |
| Agent workflow | LangGraph + checkpointer + interrupt (Ch 23) |
Most products run both sync chat and background queue tracks. Chat is real-time; heavy work queues.
4. Minimal example: FastAPI + tenacity + Redis
- Cache key is the model name plus a hash of the full message list. A one-character difference produces a different key, so deliberately varying the input bypasses the cache.
- tenacity's `@retry` decorator auto-retries 5xx, 429, and timeout errors up to 4 times, using exponential backoff with jitter.
- TTL is 1 hour. Deterministic tasks can go longer; time-sensitive answers shorter.
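The notes above can be sketched as a `call_llm` helper. To keep the example runnable with the standard library only, it makes two substitutions that are assumptions, not the chapter's code: `InMemoryCache` mimics the two async Redis calls used here (`get`, `setex`), and a hand-written loop stands in for tenacity's `@retry` with `stop_after_attempt(4)` and a jittered exponential wait. `TransientError`, `FlakyProvider`, and `provider_call` are hypothetical names.

```python
import asyncio
import hashlib
import json
import random
import time


class InMemoryCache:
    """Stand-in exposing the two async Redis calls used below: get and setex."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[float, str]] = {}

    async def get(self, key: str):
        entry = self._data.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None  # missing or expired
        return entry[1]

    async def setex(self, key: str, ttl_s: int, value: str) -> None:
        self._data[key] = (time.monotonic() + ttl_s, value)


class TransientError(Exception):
    """Hypothetical marker for retryable failures (5xx, 429, timeout)."""


def cache_key(model: str, messages: list[dict]) -> str:
    # Hash the model plus the full message list; any difference changes the key.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()


async def call_llm(cache, provider_call, model, messages, attempts=4, ttl_s=3600):
    key = cache_key(model, messages)
    if (hit := await cache.get(key)) is not None:
        return hit
    for attempt in range(attempts):
        try:
            answer = await provider_call(model, messages)
            break
        except TransientError:
            if attempt == attempts - 1:
                raise  # retries exhausted: let the caller decide on a fallback
            # Exponential backoff with full jitter (delays scaled down for the demo).
            await asyncio.sleep(random.uniform(0, 0.01 * 2**attempt))
    await cache.setex(key, ttl_s, answer)
    return answer


class FlakyProvider:
    """Fails twice with a simulated 503, then answers."""

    def __init__(self) -> None:
        self.calls = 0

    async def __call__(self, model, messages):
        self.calls += 1
        if self.calls < 3:
            raise TransientError("simulated 503")
        return "hello"


async def demo():
    cache, provider = InMemoryCache(), FlakyProvider()
    msgs = [{"role": "user", "content": "hi"}]
    first = await call_llm(cache, provider, "some-model", msgs)
    second = await call_llm(cache, provider, "some-model", msgs)  # served from cache
    return first, second, provider.calls


first, second, provider_calls = asyncio.run(demo())
print(first, second, provider_calls)
```

The second request never reaches the provider, and non-transient exceptions are not caught by the loop, so they surface immediately instead of being retried.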
Now the FastAPI handler:
- Async handler: one worker handles dozens of concurrent requests, processing others while it waits for the LLM.
- Retry and cache live inside `call_llm`. The handler only knows business logic.
- If all retries fail, fall back (e.g., a smaller model, a static response, or "try again later").
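A framework-free sketch of the handler shape these notes describe. It assumes that `call_llm` raises once its retries are exhausted; `LLMUnavailable`, `FALLBACK_MESSAGE`, and the response fields are hypothetical, and the stubbed `call_llm` always fails so that the fallback path is the one exercised. In FastAPI, the `chat` coroutine would sit under `@app.post("/chat")`.

```python
import asyncio

FALLBACK_MESSAGE = "The assistant is busy right now. Please try again in a moment."


class LLMUnavailable(Exception):
    """Hypothetical error raised by call_llm once its retries are exhausted."""


async def call_llm(messages: list[dict]) -> str:
    # Stand-in for the real call_llm (cache + retry); it always fails here
    # so the demo exercises the fallback branch.
    raise LLMUnavailable("retries exhausted")


async def chat(msg: str) -> dict:
    """Business logic only: no cache or retry code appears in the handler."""
    try:
        answer = await call_llm([{"role": "user", "content": msg}])
        return {"answer": answer, "degraded": False}
    except LLMUnavailable:
        # Last resort: a static response. Routing to a smaller model is another option.
        return {"answer": FALLBACK_MESSAGE, "degraded": True}


response = asyncio.run(chat("hello"))
print(response["degraded"])
```

Returning a `degraded` flag is one possible design choice: it lets the client show a softer message and lets dashboards count how often the fallback fires.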
5. Hands-on: the five stages one call survives
A PoC one-liner becomes a five-stage gauntlet in production.
| Stage | What | When |
|---|---|---|
| ① Cache | Same input → return immediately | Before call |
| ② Rate Limit | Per-user/tenant token bucket | Before call |
| ③ LLM Call | Provider API · explicit timeout | During call |
| ④ Retry | 5xx · 429 · timeout only · backoff + jitter | On failure |
| ⑤ Circuit Breaker | N consecutive failures → open · block for a period | Above retry |
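Stage ② in the table, the per-user token bucket, can be sketched in a few lines. This is an in-process illustration only: with several workers the bucket state would have to live in a shared store such as Redis. `TokenBucket`, `allow_request`, and the capacity and refill values are assumptions chosen for the demo, and the injectable `clock` exists so the example is deterministic.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def allow_request(user_id: str, capacity: float = 3, rate: float = 1.0, clock=time.monotonic) -> bool:
    """One bucket per user_id, created on first use."""
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(capacity, rate, clock)
    return buckets[user_id].allow()


# Deterministic demo with a fake clock.
fake_now = [0.0]
clock = lambda: fake_now[0]

burst = [allow_request("alice", clock=clock) for _ in range(4)]
print(burst)  # [True, True, True, False]

fake_now[0] += 1.0  # one second later, one token has been refilled
after_refill = allow_request("alice", clock=clock)
other_user = allow_request("bob", clock=clock)  # a different user has a separate bucket
print(after_refill, other_user)  # True True
```

A rejected request would typically map to an HTTP 429 response returned before any LLM call is made.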
A circuit breaker prevents the scenario where one provider dies, all workers retry forever, and the server melts. After 5 consecutive failures, stop calling that provider for 30 seconds and fall back immediately instead.
- When open, don't call. Jump straight to fallback.
- Threshold hit → flip to open, auto-block for cooldown duration.
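A concept-level sketch of that behavior, using the 5-failure threshold and 30-second cooldown mentioned above. It is deliberately simplified: the code is synchronous, there is no separate half-open state (after the cooldown the breaker simply closes and the next call probes the provider), and `CircuitBreaker`, `guarded_call`, and the fake clock are illustrative names rather than any library's API.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown over: close again and let the next call probe the provider.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # threshold hit: flip to open


def guarded_call(breaker, call, fallback):
    if breaker.is_open():
        return fallback()  # when open, don't call the provider at all
    try:
        result = call()
    except Exception:  # sketch only: count only provider failures in real code
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result


# Deterministic demo: a provider that always times out, driven by a fake clock.
now = [0.0]
attempts = []


def failing_provider():
    attempts.append(now[0])
    raise TimeoutError("provider frozen")


breaker = CircuitBreaker(threshold=5, cooldown_s=30.0, clock=lambda: now[0])
outputs = [guarded_call(breaker, failing_provider, lambda: "fallback") for _ in range(7)]
attempts_while_open = len(attempts)  # 5: calls 6 and 7 never reached the provider

now[0] = 30.0  # cooldown elapsed, so the next call is allowed through
guarded_call(breaker, failing_provider, lambda: "fallback")
attempts_after_cooldown = len(attempts)  # 6
print(attempts_while_open, attempts_after_cooldown)
```

Every caller still receives the fallback value, but once the breaker is open the frozen provider is no longer hit, which is what stops the retry pile-up.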
Library choice: instead of rolling your own, use pybreaker, purgatory, or a service mesh such as Istio. The code above is concept only.
Idempotency: user refresh doesn't double-execute
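@app.post("/refund")
async def refund(req: RefundRequest, idempotency_key: str = Header(...)):
    # Replay the stored response if this key has already been processed.
    if cached := await cache.get(f"idem:{idempotency_key}"):
        return json.loads(cached)
    result = await process_refund(req)
    # Keep the result for 24 hours (86400 s) so a repeated request returns it unchanged.
    await cache.setex(f"idem:{idempotency_key}", 86400, json.dumps(result))
    return result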
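The `Idempotency-Key` header is a widely used convention, popularized by Stripe and described in an IETF Internet-Draft rather than a published RFC. Enforce it on every endpoint with side effects (payments · emails · tool calls).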
6. Common pitfalls
- Everything is synchronous. `def chat(...)` blocks one worker for 5 seconds per LLM call. Default to async on an ASGI server (uvicorn, or gunicorn running uvicorn workers). Swap `requests` for `httpx.AsyncClient`.
- LLM calls aren't cached. 90% of FAQ questions are identical, so a single cache lookup can cut cost roughly in half. But if `temperature > 0`, outputs are not deterministic: include the sampling parameters (and the seed, where the provider supports one) in the cache key.
- Blind retries. Retrying 4xx validation errors only burns attempts, because the request fails the same way every time. Limit retries to 5xx, 429, and timeouts, e.g. `retry_if_exception_type((RateLimitError, InternalServerError, APITimeoutError))` rather than the catch-all `APIStatusError`, which also matches other 4xx errors.
- Single provider. Call Anthropic only? When they're down, you're down. Abstraction layer + at least one fallback (OpenAI/Bedrock) is non-negotiable.
- Session in memory dict. Two workers? Consecutive requests from the same user land on different workers and lose conversation context. Redis or Postgres required. Solves memory leaks too.
- No timeout. SDK defaults are long or infinite. Explicitly set `timeout=30.0`. Threadless workers need the same timeout.
- Health check calls LLM. If `/health` makes an LLM request, external outages become your outages. `/health` should be a lightweight ping only.
7. Operations checklist
- All external calls have timeout · retry · circuit breaker
- Cache hit rate tracked (target 30–70%, higher for deterministic tasks)
- Rate limit is per user_id (per IP fails in NAT)
- Side-effect endpoints enforce `Idempotency-Key`
- Session in external store (Redis/PG) · horizontal scaling works
- LLM provider abstraction + at least one fallback provider
- Async I/O consistent (no sync libraries blocking workers)
- `/health` and `/ready` split: ready checks dependencies, health doesn't
- Logs/trace searchable by user ID + request ID (Ch 27)
- PII policy enforced — never in cache or logs (Ch 28)
8. Exercises & next chapter
- Take a PoC chat handler and refactor it for production using all five stages plus a fallback (cache · rate limit · call · retry · breaker). Make each stage its own function.
- Define an `LLMProvider` interface and write `AnthropicProvider` + `OpenAIProvider` implementations. Expose only `call(messages) → str`.
- Add idempotency to a payment endpoint. A second call with the same key returns the cached response immediately.
- Design three fallback policies for when circuit breaker opens. What do you show the user?
Next → Ch 27 — Observability and Operations
These five layers only work if you can see them. Next chapter: tracing, logging, cost tracking, and the dashboards that keep production alive.
Sources
- LangSmith · Langfuse: architecture reference docs
- Stripe: Designing robust and predictable APIs with idempotency
- Anthropic SDK: `timeout` · streaming · retry options documentation
- Release It! (Michael Nygard): circuit breaker · bulkhead patterns