RAG and LLM Platform at Scale: Ingestion, Retrieval, Generation, and Evaluation for 10M Queries/Day
Goal: Build an enterprise knowledge assistant that serves 10M queries per day across 500+ engineering teams. The platform ingests 2M+ internal documents (API docs, runbooks, design docs, RFCs, Confluence pages, Slack threads, code repositories), lets engineers ask natural language questions, and returns accurate, cited answers grounded in internal knowledge. Think of it as an internal Perplexity for engineering. ~1.5-2.5s P95 latency depending on query complexity, single-digit hallucination rate with strict grounding and abstention, and under $0.02 per query average cost. All metrics in this post are directional estimates. Costs change quarterly, latency depends on infrastructure, and retrieval quality varies by corpus. Treat numbers as calibrated starting points for your own sizing, not specifications.
Reading guide: This is a long, detailed deep dive. You don't need to read it linearly.
Sections 1-2: Problem framing and requirements
Sections 3-6: Architecture overview, design principles, technology selection, and capacity planning
Sections 7-9: Document ingestion, chunking strategies (the deepest section), and embedding pipeline
Sections 10-11: Storage architecture and retrieval pipeline
Section 12: Agentic RAG with MCP-based tool use
Section 13: Generation layer (model routing, prompt engineering, streaming)
Section 14: Hallucination mitigation and citation enforcement
Sections 15-16: Evaluation systems and observability
Sections 17-21: Production architecture, cost analysis, security, failure modes, and a reusable production readiness checklist
New to RAG? Start with Sections 1-2 for the problem context, then Section 3 for the architecture overview. Read Section 8 carefully for chunking strategies. Skip to Section 21 for the production readiness checklist.
Building something similar? Sections 8-12 have the implementation details you need. Section 18 covers cost reasoning.
Preparing for a system design interview? Sections 1-6 cover what interviewers expect. Section 12 (agentic RAG) and Section 17 (production architecture) are common follow-up topics.
TL;DR: A production RAG platform handling 10M queries/day across 2M+ documents for 500+ engineering teams. Multi-strategy chunking (recursive, semantic, late chunking) produces 10M chunks stored in Qdrant with HNSW indexing. Hybrid retrieval (BM25 + dense vectors) with cross-encoder re-ranking, P99 retrieval latency typically 200-350ms depending on filter complexity. Agentic RAG with MCP-based tool use handles complex multi-hop queries through iterative retrieval and query decomposition. In practice, 60-80% of queries can be handled by smaller models, cutting LLM costs by 50-70%. Baseline hallucination rate of 8-15% in typical enterprise RAG deployments, reduced to single digits with strict grounding, citation enforcement, and answer abstention. LLM-as-judge evaluation on 5-10% of production traffic feeds a continuous improvement loop. The hardest problems: chunking quality for heterogeneous documents, access control in vector search, embedding model upgrades without downtime, and keeping hallucination rates low as the corpus grows. All metrics in this post are based on aggregated industry benchmarks and production observations. Actual results vary significantly by corpus quality, query distribution, infrastructure choices, and current model pricing.
1. Problem Statement
A few clarifications before getting into the architecture.
Scale context: This design targets a large enterprise with 500+ engineering teams, 5,000+ engineers, and a corpus of 2M+ documents spread across a dozen systems. That is not hypothetical. Companies like Stripe, Uber, and Google have internal knowledge bases of this size, and the search problem only gets worse as the organization grows. Ask any engineer at a company this size how much time they spend just finding things. It is a lot. Every developer experience survey confirms it, even at companies with world-class search infrastructure.
The 10M queries/day figure assumes 5,000 engineers making an average of 10-15 queries per workday, plus automated queries from CI/CD pipelines, Slack bots, and IDE integrations; the automated traffic supplies the large majority of the volume. Peak traffic hits around 10am-2pm in each timezone, roughly 3-4x the average QPS.
Document landscape:
| Source | Document Type | Count | Update Frequency | Challenges |
|---|---|---|---|---|
| Confluence | Design docs, runbooks, ADRs | 800K pages | 50K updates/month | Deep nesting, stale pages, mixed formatting |
| GitHub | README files, code comments, PRs | 500K files | 200K updates/month | Code-text boundary, rapid churn |
| Slack | Thread discussions, incident channels | 400K threads | 100K new/month | Noisy, conversational, context-dependent |
| Google Docs | RFCs, specs, meeting notes | 200K docs | 80K updates/month | Access controls, revision history |
| Internal wikis / S3 | PDFs, diagrams, legacy docs | 100K files | 20K updates/month | Unstructured, poor metadata |
Why naive approaches fail:
The first instinct is to dump everything into a giant context window. Modern models support 128K-200K tokens. But 2M documents at an average of 2,000 tokens each is 4 billion tokens. That is roughly 20,000x larger than the biggest context window available. Even if you could fit it, the per-query cost would be prohibitive. At 10M queries per day, the daily bill would be astronomical. Pricing changes, but the math never works for full-context approaches at this scale.
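As a quick sanity check on that math:

```python
docs = 2_000_000
avg_tokens_per_doc = 2_000
corpus_tokens = docs * avg_tokens_per_doc   # 4 billion tokens
largest_context = 200_000                   # tokens

print(corpus_tokens)                    # 4000000000
print(corpus_tokens // largest_context) # 20000 -> ~20,000x larger than the window
```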
The second instinct is keyword search. Elasticsearch on internal docs, done. But keyword search fails on intent. "How do I handle auth for the payments service?" often fails to match effectively against a document titled "OAuth2 Integration Guide for Payment Gateway" because the keywords don't align well. Engineers give up after 2-3 failed keyword searches and go ask a colleague instead, which does not scale.
RAG solves this by combining semantic understanding (embeddings capture meaning, not just keywords) with grounded generation (the LLM answers from retrieved documents, not from its training data). But a production RAG system has about twenty things that can go wrong between "user types a question" and "user gets a useful, accurate, cited answer." This post covers all of them.
Assumptions:
- The platform serves internal engineers only (not customer-facing). This simplifies some safety requirements but raises the bar on accuracy because engineers will notice and lose trust quickly.
- Documents are primarily English text with code snippets. Multi-language support is out of scope.
- Access controls from source systems (Confluence spaces, GitHub repos, Google Drive sharing) must be respected. An engineer should never see answers sourced from documents they cannot access.
- The platform team operates the infrastructure. Individual teams contribute documents by connecting their tools.
Scope:
- In scope: Natural language Q&A with citations, document ingestion from 5+ sources, hybrid retrieval, agentic RAG for complex queries, model routing, evaluation pipeline, observability, multi-tenant isolation.
- Out of scope: Document authoring or editing, code generation (use AI coding assistants for that), real-time collaboration, customer-facing chatbot (different safety requirements), training custom foundation models.
2. Requirements
2.1 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Natural language Q&A with grounded, cited answers from internal documents | P0 |
| FR-02 | Source attribution: every claim links to the specific document and section it came from | P0 |
| FR-03 | Multi-format document ingestion (Confluence, GitHub, Slack, Google Docs, S3) | P0 |
| FR-04 | Access control: answers only reference documents the querying user can access | P0 |
| FR-05 | Document freshness: updates reflected in search within 15 minutes | P0 |
| FR-06 | Hybrid retrieval: semantic search combined with keyword matching | P0 |
| FR-07 | Streaming responses with time to first token meeting the Section 2.2 latency targets | P0 |
| FR-08 | Conversational follow-ups: multi-turn context within a session | P1 |
| FR-09 | User feedback collection: thumbs up/down, copy events, corrections | P1 |
| FR-10 | Multi-tenant isolation: teams see only their authorized content | P1 |
| FR-11 | Admin dashboard: query volume, quality metrics, cost breakdown, retrieval performance | P1 |
| FR-12 | Agentic RAG: complex multi-hop queries handled via iterative retrieval and query decomposition | P1 |
| FR-13 | Code-aware search: understand code snippets, function signatures, and API references | P2 |
| FR-14 | Multi-modal support: extract and search content from diagrams and screenshots | P2 |
2.2 Non-Functional Requirements
| Requirement | Target |
|---|---|
| End-to-end latency (time to first token) | P50 ~1.0-1.5s, P95 ~1.5-2.5s, P99 < 4s (varies by query complexity) |
| Retrieval latency (search + re-rank) | P50 < 150ms, P99 < 350ms |
| Query throughput | 500 QPS sustained, 1,500 QPS burst |
| Document ingestion throughput | 500K doc updates/month processed within SLA |
| Document freshness | Updates searchable within 15 minutes |
| Availability | 99.9% (8.7 hours downtime/year) |
| Answer accuracy (golden dataset) | > 85% correct on curated Q&A benchmark (definition of "correct" varies; measure per query type) |
| Hallucination rate | 8-15% baseline; can be reduced to single digits with strict citation enforcement + abstention (varies by corpus quality; measured via LLM-as-judge) |
| Cost per query (average) | Minimize through model routing and caching (see Section 18) |
| Embedding re-indexing | Full corpus re-embedded within 72 hours |
| Data isolation | Zero cross-tenant document leakage |
Architecture in One Minute
A RAG system is mostly a data, retrieval, and evaluation problem. The LLM does the last 20% of the work. The platform has six layers, each with a distinct job:
- Ingestion layer. Connectors pull documents from Confluence, GitHub, Slack, Google Docs, and S3. Parsers normalize everything to structured markdown with metadata. Change detection ensures incremental updates, not full re-ingestion.
- Chunking and embedding layer. A multi-strategy chunking pipeline splits documents based on type: recursive chunking for structured docs, semantic chunking for long-form prose, AST-aware chunking for code. An embedding pipeline converts chunks to vectors using a versioned embedding model.
- Storage layer. Qdrant stores vectors with HNSW indexing. Elasticsearch handles BM25 keyword search. PostgreSQL tracks document metadata, permissions, and chunk lineage. Redis provides semantic caching.
- Retrieval layer. Query understanding classifies, rewrites, and optionally expands the query before search. Hybrid search combines BM25 and dense vector retrieval in parallel. Reciprocal Rank Fusion merges results. A cross-encoder re-ranks the top candidates. Access control filtering ensures users only see documents they have permission to view.
- Generation layer. A model router classifies query complexity and picks the right LLM (small model for simple lookups, large model for complex reasoning). Prompts are assembled dynamically with retrieved context, conversation history, and citation instructions. Responses stream via SSE.
- Evaluation layer. User feedback (thumbs up/down), implicit signals (abandonment, follow-ups), and LLM-as-judge scoring on sampled queries feed into a continuous improvement loop that tightens retrieval quality and generation accuracy over time.
3. How RAG Works: One Query, Start to Finish
Before diving into each component, here is what happens when an engineer types: "How does the payments service handle retries?"
Step 1: The query becomes a vector
The system converts the question into a 1024-dimensional embedding — a list of numbers that captures the meaning of the query, not just the keywords.
"How does the payments service handle retries?"
↓ embedding model
[0.12, -0.98, 0.34, 0.67, -0.21, ... ] (1024 numbers)
Step 2: Two searches run in parallel
Vector search (Qdrant) finds chunks whose meaning is similar to the query:
[
{ "score": 0.92, "content": "The payments service uses exponential backoff with jitter for all downstream retries...", "source": "confluence", "doc": "Payment Error Handling" },
{ "score": 0.87, "content": "Retry budgets are set per-service: payments allows 3 retries with a 2-second base delay...", "source": "github", "doc": "payments-service/README.md" },
{ "score": 0.84, "content": "Circuit breakers open after 5 consecutive failures, preventing retry storms...", "source": "confluence", "doc": "Resilience Patterns" }
]
Keyword search (Elasticsearch) finds chunks containing the exact terms:
{
"query": {
"bool": {
"should": [
{ "match": { "title_context": { "query": "payments service handle retries", "boost": 3 } } },
{ "match": { "content": "payments service handle retries" } }
]
}
}
}
Notice: the vector database returns ranked chunks, not answers. It finds relevant text fragments. The LLM has not been involved yet.
Why both? Vector search understands meaning but misses exact identifiers. Keyword search finds precise terms but misses intent. Section 11 covers hybrid retrieval in detail.
Step 3: Merge and re-rank
Reciprocal Rank Fusion (RRF) combines both result lists, boosting documents that appear in both. A cross-encoder re-ranker then scores the top candidates by reading each chunk alongside the query. The top 3 go to the LLM. Section 11 covers the mechanics.
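RRF itself is only a few lines. A minimal sketch, assuming each retriever returns a ranked list of chunk IDs (k=60 is the commonly used constant; the chunk names are illustrative):

```python
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists; items ranked well in multiple lists rise to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk_a", "chunk_b", "chunk_c"]   # vector search results
sparse = ["chunk_b", "chunk_d", "chunk_a"]  # BM25 results
merged = rrf_merge([dense, sparse])
# chunk_b ranks near the top of both lists, so it comes out first
```

Each list contributes 1/(k+rank) per document, so a chunk found by both retrievers accumulates two scores and outranks chunks found by only one.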
Step 4: The LLM generates an answer
The system assembles a prompt with the retrieved chunks as context:
System: You are an internal engineering assistant. Answer ONLY from the provided context.
Cite sources using [Source: document_name].
Context:
[Chunk 1: Payment Error Handling - "The payments service uses exponential backoff..."]
[Chunk 2: payments-service README - "Retry budgets are set per-service..."]
[Chunk 3: Resilience Patterns - "Circuit breakers open after 5 consecutive..."]
Question: How does the payments service handle retries?
The LLM produces:
The payments service uses exponential backoff with jitter for all downstream retries [Source: Payment Error Handling]. Each service has a retry budget — payments allows 3 retries with a 2-second base delay [Source: payments-service/README.md]. After 5 consecutive failures, the circuit breaker opens to prevent retry storms from cascading [Source: Resilience Patterns].
Total time: ~1.1 seconds. The user sees a streaming response starting at ~400ms.
What about complex queries?
"Why is checkout slow and what changed after the Q3 migration?"
This cannot be answered with a single search. The system classifies it as complex and routes to an MCP agent:
Step 1: Classify → complex (multi-part, temporal reference)
Step 2: Decompose into sub-queries:
→ "checkout latency issues"
→ "Q3 migration changes checkout"
Step 3: Agent calls search tool twice, evaluates results
Step 4: Agent decides it needs more: "Q3 migration rollout timeline"
Step 5: Third search, combines all context
Step 6: LLM generates combined answer with citations from all sources
The mental model:
- Simple query → one search → answer
- Complex query → think (decompose) → multiple searches → combine → answer
RAG retrieves information. MCP decides how to retrieve it.
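That decide-then-retrieve loop can be sketched in a few lines. Everything here is illustrative: each callable stands in for an LLM or tool call, and the real agent (Section 12) adds tool schemas, checkpointing, and budgets.

```python
def answer_complex(query, classify, decompose, search, needs_more, generate,
                   max_steps: int = 5):
    """Sketch of the agentic loop; callables are stand-ins, not a real API."""
    if classify(query) == "simple":
        return generate(query, search(query))          # one search -> answer
    context, pending = [], list(decompose(query))      # think: sub-queries
    steps = 0
    while pending and steps < max_steps:
        sub = pending.pop(0)
        context.extend(search(sub))                    # multiple searches
        pending.extend(needs_more(query, context))     # agent may add follow-ups
        steps += 1
    return generate(query, context)                    # combine -> answer
```

The max_steps guard matters: without a budget, an agent that keeps deciding it "needs more" will loop until it exhausts your token budget instead of answering.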
4. Design Principles
1. Retrieval quality determines answer quality. The LLM can only work with what retrieval gives it. Invest 60% of your effort in chunking, embedding, and retrieval. A mediocre LLM with excellent retrieval will outperform a frontier LLM with poor retrieval every time. And evaluation quality determines whether you know if retrieval is actually working.
2. Chunk quality beats chunk quantity. Five well-formed, relevant chunks in the context window produce better answers than twenty loosely related fragments. Over-retrieval wastes tokens, increases cost, and actually degrades answer quality because of the "lost in the middle" effect where LLMs underweight information in the middle of long contexts.
3. Cost-aware model routing is not optional. Not every query needs a frontier model. "What is the endpoint for the user service?" does not need Claude Opus. A smaller, faster model handles it in 200ms at 1/20th the cost. Route based on query complexity, not uniformly.
4. Fail safe, not fail silent. When retrieval confidence is low, the system should say "I don't have enough information to answer this confidently" with the best partial answer and source links. Silent hallucination destroys trust permanently. An honest "I'm not sure" preserves it.
5. Evaluate continuously, not once. Offline benchmarks catch regressions before they ship. Online evaluation (user feedback + LLM-as-judge) catches issues that benchmarks miss. Both are required. Neither alone is sufficient.
6. Observe every pipeline stage. If you cannot measure retrieval latency, re-ranking quality, and generation cost independently, you cannot debug production issues. OpenTelemetry traces span the entire RAG pipeline.
7. Design for document freshness. Stale answers erode trust faster than wrong answers. An engineer who gets an answer based on a doc that was updated two weeks ago will stop using the platform. Incremental ingestion with change detection is not a nice-to-have.
8. Avoid the common anti-patterns. Embedding full documents as single vectors kills granularity. Retrieval recall drops hard. Using one chunking strategy for everything is just as bad. Fixed-size chunks destroy Slack thread context and split code mid-function. And never trust the LLM to catch its own hallucinations. Without grounding and verification, it will confidently cite APIs that do not exist. Finally, skip evaluation at your own risk. "It looks good in demos" tells you nothing. Without a golden dataset and feedback loops, you have no idea if quality is improving or rotting.
5. Technology Selection
5.1 Component Choices
| Component | Technology | Why This Choice |
|---|---|---|
| Document parsing | Unstructured.io + custom parsers | Handles 20+ file formats. Custom parsers for Slack JSON and code files where Unstructured falls short. |
| Chunking engine | Custom multi-strategy pipeline | No single library handles all document types well. LangChain's text splitters for basics, custom logic for semantic and code-aware chunking. |
| Embedding model | OpenAI text-embedding-3-large (1024d) | Best cost/performance ratio on MTEB benchmarks. 3072d available if needed. Matryoshka support for dimensionality reduction. |
| Vector database | Qdrant (clustered) | Rust-based, fast HNSW with quantization. Payload filtering for metadata queries. Horizontal sharding. Open source with managed option. |
| Keyword index | Elasticsearch | Battle-tested BM25. Field boosting, analyzers, and aggregations. Already deployed at most enterprises. |
| Re-ranker | Cohere Rerank 3.5 + BGE-reranker-v2-m3 (fallback) | Cross-encoder accuracy on top-k results. Cohere for quality, open-source BGE as self-hosted fallback. |
| LLM inference (API) | Claude Sonnet / GPT-4o | Best quality for complex reasoning. Streaming support. Structured output mode for citations. |
| LLM inference (small) | Claude Haiku / GPT-4o-mini | 10-20x cheaper than frontier models. Sufficient for simple factual lookups. Sub-300ms TTFT. |
| LLM inference (self-hosted) | vLLM + Llama 3.3 70B | Fallback for provider outages. PagedAttention for efficient KV cache. Runs on 4x A100s. See Section 13.5 for open-source model details. |
| Agent orchestration | LangGraph | Stateful agent workflows with cycles. Built-in checkpointing. Cleaner than raw LangChain for multi-step reasoning. |
| Tool interface | MCP (Model Context Protocol) | Standardized tool interface across model providers. Write a search tool once, use it with Claude, GPT, or self-hosted models. OAuth 2.1 for secure access. |
| Semantic cache | Redis + embedding similarity | Embed incoming query, check cosine similarity against cached queries. Threshold > 0.95 returns cached response. 5-25% hit rate in enterprise workloads, highly dependent on query repetition patterns and team size. |
| Metadata store | PostgreSQL | Document metadata, ACLs, chunk lineage, user feedback. ACID transactions for permission updates. |
| Message queue | Apache Kafka | Decouples ingestion from embedding. Replay capability for re-processing. Partitioned by source system. |
| Observability | OpenTelemetry + Grafana | OTel traces across full RAG pipeline. Grafana dashboards for latency, cost, and quality metrics. |
| Evaluation | RAGAS + custom LLM-as-judge | RAGAS for offline metrics (faithfulness, relevance, context recall). Custom judge for production sampling. |
Important: these model choices are independent. The chunking model, embedding model, and generation LLM do not need to come from the same provider or even the same architecture. You can chunk with all-MiniLM-L6-v2 (open source, 22M params), embed with OpenAI text-embedding-3-large (API), and generate with Claude Sonnet (different API). The only coupling: the embedding model at ingestion must match the embedding model at query time. Section 9.3 covers when and how to change your embedding model.
5.2 Why RAG (and When Not)
The three main approaches to giving LLMs access to private knowledge:
| Approach | Best For | Latency | Relative Cost | Knowledge Freshness | Accuracy on Enterprise Data |
|---|---|---|---|---|---|
| RAG | Large, frequently changing knowledge base | 1-4s (retrieval + generation) | Low | Minutes (incremental indexing) | High (grounded in source docs) |
| Fine-tuning | Consistent tone/style, domain terminology, structured output formats | 0.5-2s (no retrieval overhead) | Very low | Weeks-months (retrain cycle) | Medium (baked into weights, can hallucinate) |
| Long context | Small corpus (< 500 pages), real-time analysis | 2-10s (large input processing) | High (scales with input size) | Real-time (docs in context) | High for small corpus, degrades with size |
RAG wins for this use case because: (1) the corpus is too large for context windows, (2) documents change daily so fine-tuning staleness is unacceptable, (3) citation and attribution require knowing exactly which document supports each claim. Fine-tuning complements RAG for style and format consistency but does not replace it. Long context works as a last-mile technique within RAG: after retrieval narrows to 5-10 relevant chunks, a 128K context model processes them.
The evolution: Naive to Agentic RAG. Most tutorials teach naive RAG: embed query, search vectors, stuff results into prompt, generate. That works for demos. Production systems need advanced RAG (hybrid search, re-ranking, query rewriting) and increasingly agentic RAG (iterative retrieval, query decomposition, tool use) to handle the 30-40% of queries that are too complex for single-shot retrieval. Section 12 covers agentic RAG in detail.
6. Capacity Planning
Storage sizing
Documents: 2,000,000
Avg chunks per doc: ~5 (varies widely: Slack threads = 1-2, API docs = 3-5, long RFCs = 10-20)
Total chunks: 10,000,000
Embedding storage (Qdrant):
10M chunks × 1024 dimensions × 4 bytes/float = 40 GB (raw vectors)
With scalar quantization (int8): 40 GB × 0.25 = 10 GB
Payload metadata per chunk: ~500 bytes × 10M = 5 GB
HNSW graph overhead: ~30% of vector size = 3-12 GB
Total Qdrant storage: 18-57 GB (depending on quantization)
Keyword index (Elasticsearch):
10M chunks × avg 300 tokens × 6 bytes/token = 18 GB raw text
With inverted index overhead: ~54 GB
Total ES storage: ~54 GB
Metadata store (PostgreSQL):
10M chunk records × 1 KB avg = 10 GB
2M document records × 2 KB avg = 4 GB
ACL tables, indexes: ~2 GB
Total PG storage: ~16 GB
Query load
Daily queries: 10,000,000
Average QPS: 10M / 86,400 = ~115 QPS
Peak QPS (3-4x avg): ~350-460 QPS
Burst QPS (5x avg): ~575 QPS (Monday morning, incident response)
Per-query compute:
Embedding query: 5-10ms (API call)
Vector search: 10-50ms (Qdrant)
BM25 search: 5-20ms (Elasticsearch)
Re-ranking (top 20): 80-150ms P99 (Cohere API or GPU, includes network)
ACL filtering: 5-15ms
LLM generation: 400-1500ms (depends on model tier)
Total P50: ~1.0-1.5s
Total P99: ~3-5s
For costs, see Section 18. The short version: LLM inference is 60-85% of the bill, so model routing is not optional.
7. Document Ingestion Pipeline
The ingestion pipeline pulls raw documents from five source systems and converts them into chunked, embedded, searchable content. Connector and parsing layers are often overlooked, but in practice they are a major source of production issues.
7.1 Connector Architecture
Each source system gets a dedicated connector service:
| Source | Integration Method | Change Detection | Authentication |
|---|---|---|---|
| Confluence | REST API + webhooks | Webhook on page create/update/delete, daily full-sync reconciliation | OAuth 2.0 |
| GitHub | Webhooks + REST API | Push webhooks for commits, PR webhooks for merges | GitHub App |
| Slack | Events API + conversations.history | Real-time events for new messages, backfill via pagination | Bot token |
| Google Docs | Drive API + Push Notifications | Drive change notifications, polling fallback | Service account |
| S3 | S3 Event Notifications (SNS/SQS) | Object created/modified events | IAM role |
Every connector follows the same pattern: detect change, fetch document, extract text and metadata, publish to document.updates Kafka topic. The Kafka topic partitions by source system so that one slow connector does not block others.
Why webhooks plus polling? Webhooks are faster but unreliable. Confluence webhooks occasionally miss events. Slack's Events API has delivery guarantees but rate limits during high activity. A daily reconciliation job polls each source for any documents modified in the last 24 hours, compares hashes with what is in PostgreSQL, and re-queues anything that was missed. In our experience, this catches a small but meaningful percentage of updates that webhooks miss.
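The daily reconciliation job reduces to a hash comparison. A sketch, with stored_hashes standing in for the hash column in PostgreSQL and enqueue standing in for re-publishing to the Kafka topic (both names illustrative):

```python
import hashlib

def reconcile(recently_modified: list[tuple[str, str]],
              stored_hashes: dict[str, str],
              enqueue) -> int:
    """Re-queue any document whose current content hash differs from the
    stored one -- i.e., an update the webhooks missed."""
    missed = 0
    for doc_id, content in recently_modified:
        h = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != h:
            enqueue(doc_id)
            missed += 1
    return missed
```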
7.2 Document Parsing
Raw documents arrive in a dozen formats. The parser normalizes everything to a common structure:
@dataclass
class ParsedDocument:
doc_id: str # Deterministic hash of source + source_id
source: str # "confluence", "github", "slack", etc.
source_id: str # Native ID in source system
title: str
content: str # Normalized markdown
content_type: str # "prose", "code", "mixed", "conversational"
metadata: dict # Author, created_at, updated_at, tags, etc.
permissions: list[str] # ACL groups/users who can access
parent_doc_id: str | None # For threaded/nested content
url: str # Deep link back to source
    content_hash: str         # SHA-256 of content for change detection
Format-specific parsing:
- Confluence: Atlassian Storage Format (XML-based) to markdown via custom converter. Strips macros, preserves headings, tables, and code blocks. Expands includes and excerpts inline.
- GitHub: README and docs are already markdown. Code files get language-tagged code blocks with file path context. PR descriptions include diff summaries.
- Slack: Thread messages are concatenated chronologically with author attribution. Reactions and emoji responses are stripped. Thread replies are grouped with their parent message.
- Google Docs: Export as HTML, convert to markdown. Preserves headings, lists, tables. Strips formatting-only elements (fonts, colors).
- PDFs: Unstructured.io with the hi_res strategy for layout-aware extraction. Falls back to the fast strategy if processing time exceeds 30 seconds per page.
7.3 Deduplication
The same information often exists in multiple places. A design doc might live in Confluence AND be linked in a Slack thread AND referenced in a GitHub PR description. Without deduplication, the same content appears three times in search results, wasting context window tokens.
Strategy: Content-hash-based dedup at the document level. After parsing, compute SHA-256 of the normalized content. If the hash already exists in PostgreSQL, skip re-processing. For near-duplicates (same content with minor formatting differences), compute SimHash and flag documents with similarity > 0.9 for manual review.
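The exact-duplicate gate is a one-liner around SHA-256. A sketch, with a seen_hashes set standing in for the PostgreSQL lookup:

```python
import hashlib

def content_hash(normalized_content: str) -> str:
    return hashlib.sha256(normalized_content.encode("utf-8")).hexdigest()

def should_process(normalized_content: str, seen_hashes: set[str]) -> bool:
    """Skip chunking/embedding for content already ingested elsewhere."""
    h = content_hash(normalized_content)
    if h in seen_hashes:
        return False  # exact duplicate (e.g., doc mirrored in Slack + Confluence)
    seen_hashes.add(h)
    return True
```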
8. Chunking Pipeline
Chunking matters more than any other decision in a RAG system (see Section 4, Principle 1).
The tradeoff is simple. Too small and you lose context: a 100-token fragment that says "this approach has three advantages" is useless without knowing what "this approach" refers to. Too large and retrieval gets noisy: a 2,000-token chunk about the entire auth system dilutes the signal when the user only asked about token refresh.
Model independence. The chunking model, embedding model, and generation LLM are completely independent. You can change any one without touching the others. The only coupling: the embedding model at ingestion must match the embedding model at query time (see Section 9.3).
8.1 Classic Fixed-Size Chunking
Split text into chunks of N tokens with M tokens of overlap.
Typical config: chunk_size=512 tokens, overlap=50 tokens
Performs well when: Uniformly structured documents like API reference pages where each section is roughly the same length and self-contained.
Degrades when: Any document where meaning spans across chunk boundaries. A 512-token chunk that starts mid-paragraph and ends mid-sentence loses context in both directions. The overlap helps but only at the margins.
Retrieval impact: Baseline. Expect recall@10 of 60-70% on heterogeneous enterprise corpora.
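The mechanism itself is tiny. A sketch, with a whitespace split standing in for a real tokenizer:

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    tokens = text.split()          # stand-in; production would use a tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break                  # last window reached the end of the document
    return chunks
```

Each chunk repeats the last 50 tokens of the previous one, which is exactly why the strategy only helps "at the margins": the overlap patches boundary losses but cannot recover context that lives several paragraphs away.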
What each strategy actually produces. Consider this section from a payments service doc:
## Retry Strategy
The payments service retries failed downstream calls using exponential
backoff with jitter. Base delay is 2 seconds.
### Configuration
| Parameter | Default | Max |
|---------------|---------|-------|
| max_retries | 3 | 10 |
| base_delay_ms | 2000 | 5000 |
### Circuit Breaker
After 5 consecutive failures, the circuit breaker opens for 30 seconds.
During this window, all calls return a cached fallback response instead
of hitting the downstream service.
Fixed-size (512 tokens) produces one chunk containing the entire section — or worse, splits it mid-table if the preceding content pushes it over the limit. If the user asks "what is the default max_retries?", the chunk includes unrelated text that dilutes the embedding.
Recursive chunking splits at the ## and ### boundaries: Chunk 1 = "Retry Strategy" intro, Chunk 2 = "Configuration" table, Chunk 3 = "Circuit Breaker" paragraph. Each chunk is self-contained and maps to one concept. The "Configuration" chunk answers parameter questions precisely.
Semantic chunking analyzes embedding similarity between consecutive sentences. It might keep the "Retry Strategy" intro and "Configuration" together (they are semantically related) but split "Circuit Breaker" into its own chunk (different concept). The boundaries are driven by meaning, not structure.
8.2 Recursive / Hierarchical Chunking
Split by document structure first: headers, then paragraphs, then sentences. Only fall back to fixed-size splitting when structural elements exceed the chunk size limit.
# Splitting hierarchy for a Confluence page
split_order = [
"\n## ", # H2 headers (major sections)
"\n### ", # H3 headers (subsections)
"\n\n", # Paragraph breaks
"\n", # Line breaks
". ", # Sentence boundaries
]
# Each split respects max_chunk_size=800 tokens
# Parent-child relationships are preserved in metadata
Parent-child context expansion: When a chunk is retrieved, the system can optionally pull its parent chunk for additional context. If a user asks about "retry backoff strategy" and retrieval returns a 200-token chunk about exponential backoff, the parent chunk (the full "Error Handling" section) provides the surrounding context that makes the answer more complete.
Performs well when: Well-structured documents with clear heading hierarchies. Confluence pages, GitHub README files, technical specs.
Degrades when: Flat documents with no headings (some Google Docs, most Slack threads). Falls back to fixed-size splitting, losing the structural advantage.
Retrieval impact: Typically 5-15% improvement in recall@10 over fixed-size for structured documents, depending on corpus. Negligible improvement for unstructured content.
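The splitting hierarchy above can be applied greedily; here is a character-based sketch (a real implementation counts tokens, merges undersized adjacent pieces, and falls back to fixed-size splitting when separators run out):

```python
def recursive_split(text: str, seps: list[str], max_len: int = 800) -> list[str]:
    # Try the coarsest separator first; recurse into finer separators
    # only for pieces that still exceed the size limit.
    if len(text) <= max_len or not seps:
        return [text]  # seps exhausted: a real impl would fixed-size split here
    head, *rest = seps
    chunks: list[str] = []
    for i, part in enumerate(text.split(head)):
        piece = part if i == 0 else head + part  # keep the marker with its section
        if not piece.strip():
            continue
        if len(piece) > max_len:
            chunks.extend(recursive_split(piece, rest, max_len))
        else:
            chunks.append(piece)
    return chunks
```

Each returned piece starts at a structural boundary, which is what makes the chunks self-contained.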
8.3 Semantic Chunking
Instead of splitting on structural boundaries, split on meaning boundaries. Use embeddings to detect where the topic changes.
How it works:
- Split the document into sentences.
- Embed each sentence.
- Compute cosine similarity between consecutive sentence embeddings using a sliding window.
- When similarity drops below a threshold (or drops significantly relative to the local average), insert a chunk boundary.
- Merge adjacent sentences between boundaries into chunks.
```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def split_into_sentences(text: str) -> list[str]:
    # Naive regex splitter for illustration; use spaCy or NLTK in production
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.3) -> list[str]:
    sentences = split_into_sentences(text)
    model = SentenceTransformer("all-MiniLM-L6-v2")  # Fast, small model for boundary detection
    embeddings = model.encode(sentences)
    boundaries = [0]
    for i in range(1, len(embeddings)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < threshold:
            boundaries.append(i)
    boundaries.append(len(sentences))
    chunks = []
    for i in range(len(boundaries) - 1):
        chunks.append(" ".join(sentences[boundaries[i]:boundaries[i + 1]]))
    return chunks
```

Trade-off: 3-5x more compute during ingestion (embedding every sentence for boundary detection). But this is a one-time cost per document, not per query. At 500K document updates per month, the extra embedding compute is a modest monthly cost. Worth it if your corpus has lots of unstructured prose.
Performs well when: Long-form prose that covers multiple topics without clear structural boundaries. Design docs that transition between problem statement, architecture, and implementation. Meeting notes.
Degrades when: Very short documents (not enough content for meaningful topic shifts). Highly structured documents (recursive chunking already captures the boundaries well).
Retrieval impact: Can improve recall, often in the 5-15% range on heterogeneous, unstructured content. The improvement tends to be most significant for documents longer than 3,000 tokens with multiple topic transitions. Results vary significantly by corpus.
8.4 Late Chunking
A technique introduced by Jina AI in 2024 that flips the traditional embed-then-chunk approach.
Traditional approach: Chunk the document first, then embed each chunk independently. Problem: each chunk embedding has no awareness of the surrounding context. A chunk that says "This approach has three advantages" loses the referent of "this approach" because that was defined in a previous chunk.
Late chunking approach:
- Pass the entire document through a long-context embedding model (one that supports 8K+ tokens).
- The model produces token-level embeddings with full document context (every token's embedding is influenced by the entire document through attention).
- After the full forward pass, chunk the token embeddings into segments.
- Pool each segment's token embeddings to produce chunk-level embeddings.
So each chunk embedding carries context from the whole document, even though it only covers part of the text. The pronoun "this approach" now has a meaningful embedding because the model saw what it referred to.
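The pooling step (step 4) is simple once you have token-level embeddings from a single long-context forward pass; this is a minimal sketch, assuming your chunker has already produced token-offset spans for each chunk:

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list[tuple[int, int]]) -> list[np.ndarray]:
    # token_embeddings: (num_tokens, dim), produced by ONE forward pass over the
    # whole document, so every token vector already attends to full-document context.
    # spans: (start, end) token offsets for each chunk.
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]
```

Mean pooling is the common choice; the key property is that the token vectors being pooled were contextualized by the entire document, not just the chunk.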
Performs well when: Documents with heavy cross-referencing, pronouns, and context-dependent statements. Technical specs where section 3 frequently refers back to concepts from section 1.
Degrades when: Very long documents that exceed the embedding model's context window (typically 8K tokens). You need to split into overlapping windows first, which partially defeats the purpose. Also adds 2-3x latency to the embedding step because you are processing longer sequences.
Retrieval impact: Reported 5-15% improvement in recall for documents with heavy cross-references, though results depend on document structure and query patterns. The gains are smaller on well-structured documents where each section is already self-contained.
Cost: 2-3x higher embedding latency per document. For batch ingestion this is acceptable. For real-time updates where you need a document searchable within minutes, the extra latency may push against your freshness SLA.
8.5 LLM-Guided Chunking
Use an LLM to analyze the document and decide how to chunk it. (This is unrelated to Agentic RAG in Section 12. Here, "LLM-guided" means using a model at ingestion time to choose chunk boundaries. Agentic RAG is about using an agent at query time to iteratively search.) This is the most expensive approach but can produce the highest quality chunks for complex documents.
How it works:
- Send the document (or a representative section) to a small LLM.
- The LLM identifies logical boundaries, labels each section's topic, and suggests chunk boundaries.
- Optionally, the LLM generates a summary for each chunk that serves as an alternative embedding target (the summary is often a better retrieval target than the raw content).
```python
CHUNKING_PROMPT = """Analyze this document and identify logical sections.
For each section, provide:
1. Start and end markers (first and last sentence)
2. A topic label
3. A one-sentence summary suitable for search retrieval

Document:
{document_text}

Output as JSON array of sections."""
```

Performs well when: Highly complex, mixed-format documents. A design doc that interleaves architecture diagrams with code snippets, trade-off analysis, and meeting notes. Documents where the "right" chunk boundary depends on understanding what the content means, not just where the whitespace is.
Cost: the most expensive per-document strategy, even with a fast model like Haiku or GPT-4o-mini. At 500K documents per month, costs add up quickly. Reserve it for high-value documents (RFCs, design docs) where chunk quality has the biggest impact.
Degrades when: Applied to everything. The cost adds up fast, and simpler strategies work fine for well-structured content. Use it selectively.
8.6 Our Strategy: Multi-Strategy Pipeline
No single chunking strategy works for all document types. The platform classifies each document by content type and applies the appropriate strategy:
| Document Type | Source | Strategy | Chunk Size | Overlap | Rationale |
|---|---|---|---|---|---|
| API reference docs | Confluence, GitHub | Recursive (by heading) | 500-800 tokens | 50 tokens | Well-structured, self-contained sections |
| Design docs / RFCs | Confluence, Google Docs | Semantic + Late chunking | 600-1000 tokens | N/A (semantic boundaries) | Long-form, cross-referencing, multi-topic |
| Runbooks / How-tos | Confluence | Recursive (by step) | 400-600 tokens | 30 tokens | Step-by-step, each step is self-contained |
| Code files | GitHub | AST-aware (by function/class) | 200-800 tokens | 0 (logical boundaries) | Function/class boundaries, preserve complete units |
| Slack threads | Slack | Thread-level (full thread as chunk) | Up to 1000 tokens | 0 | Context builds across messages, splitting breaks meaning |
| Meeting notes | Google Docs | LLM-guided chunking | 500-800 tokens | N/A | Unstructured, LLM identifies topic boundaries |
| PDFs / legacy docs | S3 | Semantic chunking | 500-800 tokens | N/A | Poor structure, need semantic boundary detection |
Quick reference: all strategies compared
| Strategy | Best For | Ingestion Cost | Retrieval Impact |
|---|---|---|---|
| Fixed-size | Uniform structured docs | Lowest | Baseline |
| Recursive | Docs with clear heading hierarchy | Low | Typically +5-15% |
| Semantic | Long-form, multi-topic prose | Medium (3-5x compute) | Typically +5-15% |
| Late chunking | Cross-referencing docs | Medium-high (2-3x latency) | Typically +5-15% |
| LLM-guided | Complex mixed-format docs | Highest (~$0.01-0.05/doc) | Highest for target docs |
Content type classification uses a lightweight model (fine-tuned DistilBERT or rule-based heuristics on document source + metadata) to route documents to the right chunking strategy. Accuracy target: 95%+ on the routing decision. Misclassification is not catastrophic because all strategies produce usable chunks. The difference is quality, not correctness.
Every chunk gets enriched metadata:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Chunk:
    chunk_id: str                # UUID
    doc_id: str                  # Parent document
    content: str                 # Chunk text
    content_type: str            # "prose", "code", "mixed", "conversational"
    chunk_strategy: str          # "recursive", "semantic", "late", "agentic", "ast"
    position: int                # Order within document
    total_chunks: int            # Total chunks in parent doc
    parent_chunk_id: str | None  # For hierarchical expansion
    title_context: str           # Nearest heading above this chunk
    source: str                  # "confluence", "github", etc.
    url: str                     # Deep link to source
    permissions: list[str]       # Inherited from parent document
    embedding: list[float]       # 1024-dimensional vector
    created_at: datetime
    updated_at: datetime
```

9. Embedding Pipeline
The embedding pipeline turns text chunks into dense vectors. This section covers model selection, batch processing during ingestion, versioning strategy, and dimensionality reduction.
9.1 Embedding Model Selection
The embedding model is the foundation of retrieval quality. Pick a bad one and everything downstream pays for it.
| Model | Dimensions | MTEB Avg Score | Latency (per chunk) | Matryoshka Support | License |
|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 4096 (or 1024 reduced) | 70.58 | 10-20ms (self-hosted GPU) | Yes | Apache 2.0 |
| OpenAI text-embedding-3-large | 3072 (or 1024 reduced) | 64.6 | 5-10ms | Yes | Proprietary API |
| Cohere embed-v4 | 1024 | 65.1 | 8-15ms | Yes | Proprietary API |
| NV-Embed-v2 (NVIDIA) | - | ~69 | GPU | No | Llama license |
| jina-embeddings-v3 | 1024 (570M params) | 65.5 | GPU or CPU | - | Apache 2.0 |
| BGE-M3 (open source) | 1024 | 63.0 | 3-8ms (GPU) or CPU | No | MIT |
| EmbeddingGemma-300M | - | ~60 | CPU or edge device | - | Apache 2.0 |
Our choice: OpenAI text-embedding-3-large at 1024 dimensions for the API path. The Matryoshka property lets us store 1024d vectors (instead of full 3072d) with less than 1% retrieval quality loss, cutting storage by 66%. At 10M chunks, that cuts raw vector storage from ~120 GB to ~40 GB (float32). The API is reliable, well-documented, and integrates cleanly with the rest of the pipeline.
For self-hosted: Qwen3-Embedding-8B. It tops the MTEB leaderboard at 70.58, beating every commercial API on raw benchmark scores. Supports 100+ languages and custom instructions for domain-specific tuning. Requires a single A100-40GB GPU. At high volume (> 50M chunks), self-hosting cuts embedding cost by 3-5x compared to APIs.
The lightweight option: BGE-M3. MIT licensed, 568M parameters, runs on CPU for small deployments. Lower quality than Qwen3-Embedding (63.0 vs 70.58 on MTEB) but zero GPU cost and battle-tested in production RAG systems worldwide. Good enough for prototyping and small-scale deployments.
Serving self-hosted embeddings: Hugging Face's Text Embeddings Inference (TEI) is the easiest production setup:

```shell
# Serve BGE-M3 for production embedding
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3 \
  --max-batch-tokens 65536
```

For smaller models like BGE-M3 (568M params), CPU serving is viable at low throughput (< 100 QPS). At higher volumes, a single GPU handles 1,000+ embeddings per second.
9.2 Batch Embedding for Ingestion
New and updated chunks flow through Kafka to the embedding workers. Each worker:
- Batches chunks into groups of 100-500 (the API handles batch requests efficiently).
- Calls the embedding API with retry and exponential backoff.
- Writes vectors to Qdrant and text to Elasticsearch in parallel.
- Acknowledges the Kafka offset after both writes succeed.
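The batching and retry logic in those worker steps can be sketched as follows; embed_fn is a stand-in for your embedding API client call, and the Kafka consumption and dual-write plumbing are omitted:

```python
import itertools
import random
import time

def batched(chunks, size: int = 100):
    # Yield fixed-size batches from a stream of chunks
    it = iter(chunks)
    while batch := list(itertools.islice(it, size)):
        yield batch

def embed_with_retry(embed_fn, texts, max_attempts: int = 5):
    # Retry transient API failures with exponential backoff and jitter
    for attempt in range(max_attempts):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(2 ** attempt + random.random(), 30))  # capped backoff
```

The jitter spreads retries from many workers so they do not hammer the API in lockstep after a rate-limit event.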
Rate limiting: OpenAI's embedding API allows up to 3,000 RPM on higher tiers (varies by plan). With batch sizes of 100 at full quota utilization, that is roughly 300K chunks per minute. Monthly ingestion of 500K updated chunks can complete quickly. Full re-indexing of 10M chunks typically takes 1-3 hours depending on API rate limits, retries, and batching strategy.
9.3 Embedding Model Versioning and Migration
Embedding models improve over time. When you upgrade from text-embedding-3-small to text-embedding-3-large, every existing vector in the database is now incompatible with new query vectors. You cannot mix embeddings from different models.
There's only one hard rule: the embedding model must match at ingestion and query time. If you embed your 10M chunks with text-embedding-3-large, every user query must also go through text-embedding-3-large. Mix models and you get vectors in different spaces where cosine similarity is meaningless. The chunking model and the generation LLM are completely independent of this choice.
Blue-green re-indexing strategy:
- Create a new Qdrant collection (chunks_v2) alongside the existing one (chunks_v1).
- Run a background job that re-embeds all 10M chunks with the new model. At our scale and current API rate limits, this typically takes 1-3 hours.
- While re-indexing runs, queries continue hitting chunks_v1.
- Once chunks_v2 is fully populated, run the retrieval evaluation benchmark against the golden dataset.
- If quality meets or exceeds the bar, atomically switch the query router to chunks_v2.
- Keep chunks_v1 for 48 hours as a rollback target, then delete.
This is the same pattern as blue-green deployments for application code, applied to vector data.
When to change your embedding model:
Changing the embedding model is expensive (full re-index of the corpus) and risky (retrieval quality could regress). Do it when the payoff justifies the cost:
| Trigger | Example | Worth it? |
|---|---|---|
| Major quality improvement | New model scores 5%+ higher on MTEB for your domain | Yes. Run shadow eval first. |
| Significant cost reduction | Switch from API model to self-hosted with < 1% quality loss | Yes, if savings > $1K/month at your scale. |
| Vendor deprecation | Provider announces model sunset with 6-month deadline | Yes. Plan early, don't rush. |
| New capability needed | Need code-aware embeddings (CodeBERT) or multi-lingual support | Yes, if current model can't handle these. |
| Dimensionality optimization | Matryoshka model lets you cut dims from 3072 to 1024 with < 1% loss | Yes. Storage and latency savings compound. |
When NOT to change:
- Marginal improvement (< 2% on your retrieval eval set). The re-indexing cost and risk are not worth it.
- You have no blue-green re-indexing capability. Changing models in-place means downtime or serving stale results.
- Mid-incident or during a feature launch. Stabilize first.
- The new model has not been benchmarked on YOUR data. MTEB scores are averages across generic datasets. A model that scores 3% higher on MTEB might score 2% lower on your internal engineering corpus. Always benchmark on your golden dataset before switching.
How to evaluate before switching:
- Create a shadow Qdrant collection with the new model's embeddings (start with a 10% sample of your corpus).
- Run your golden dataset retrieval benchmark against the shadow collection.
- Compare recall@10, MRR, and NDCG@10 against the current production model.
- If the new model wins or ties on all metrics, proceed with full re-indexing.
- If it wins on some and loses on others, dig into the failures. If the losses are on query types you care about, do not switch.
9.4 Dimensionality Reduction
OpenAI's text-embedding-3-large supports Matryoshka representation learning: the first N dimensions of the embedding capture most of the information. Truncating from 3072 to 1024 dimensions reduces storage by 66% with typically less than 2% quality loss on most retrieval benchmarks (corpus-dependent).
For further compression, apply scalar quantization in Qdrant: convert float32 vectors to int8. This reduces memory by 4x with typically 1-5% recall loss depending on corpus and query distribution. Combined, you go from 120 GB (3072d, float32) to 10 GB (1024d, int8). That is a 12x reduction.
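Both steps are easy to sanity-check with arithmetic. The truncation helper below is illustrative (with text-embedding-3 models you can also request reduced dimensions server-side via the API's dimensions parameter); the renormalization matters because cosine search assumes unit-length vectors:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int = 1024) -> np.ndarray:
    # Keep the first `dims` dimensions, then L2-renormalize for cosine search
    v = vec[:dims].astype(np.float32)
    return v / np.linalg.norm(v)

# Raw vector storage for 10M chunks (before HNSW graph overhead):
full_gb = 10_000_000 * 3072 * 4 / 1e9     # float32 at 3072d  ≈ 122.9 GB
compact_gb = 10_000_000 * 1024 * 1 / 1e9  # int8 at 1024d     ≈ 10.2 GB
```

Dividing the two gives the 12x reduction quoted above.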
When to avoid quantization: If your retrieval recall is already borderline (< 75%), quantization will push it below acceptable thresholds. Fix chunking and retrieval first, then compress.
10. Storage Architecture
10.1 Vector Database Selection
| Feature | Qdrant | Pinecone | Weaviate | pgvector |
|---|---|---|---|---|
| Architecture | Distributed, shared-nothing | Fully managed serverless | Distributed, multi-model | PostgreSQL extension |
| Max vectors | Billions (sharded) | Billions (managed) | Billions (sharded) | ~10M practical |
| HNSW tuning | Full control (ef, M) | Abstracted | Full control | Limited |
| Quantization | Scalar, Product, Binary | Supported | Product | Half-precision |
| Metadata filtering | Payload indexes, fast | Namespace + filter | Inverted index | SQL WHERE clause |
| Hybrid search | Sparse vectors + dense | Sparse + dense | BM25 built-in | Requires extensions |
| Ops complexity | Medium (Helm charts) | Zero (managed) | Medium | Low (PG extension) |
Our choice: Qdrant. For 10M vectors at our scale, Qdrant gives us full control over HNSW parameters, quantization, and sharding without the managed service premium. Payload indexes handle the metadata filtering we need for access control. The Rust implementation keeps memory usage predictable. Pinecone is the right choice if you want zero operational overhead. pgvector works up to about 5-10M vectors but performance degrades at our scale, especially with complex filters.
For a deeper comparison of vector database internals (HNSW vs IVF-PQ, distance metrics, re-indexing strategies), see the vector databases deep dive.
10.2 Vector Store (Qdrant)
Cluster topology: 3 nodes with replication factor 2. Each node handles a shard of the vector space. At 10M vectors with 1024 dimensions (int8 quantized), each node stores roughly 7 GB of vector data (its shard plus replicas of others) plus HNSW graph overhead.
HNSW index tuning:
| Parameter | Value | Why |
|---|---|---|
| m | 16 | Connections per node in the HNSW graph. 16 balances recall and memory. Higher values (32-64) improve recall slightly but double memory usage. |
| ef_construct | 200 | Build-time search width. Higher values produce a better graph but slow down indexing. 200 is sufficient for 10M vectors. |
| ef (search) | 128 | Query-time search width. Higher = better recall but slower search. 128 often achieves high recall (frequently >90%) but exact results depend on data distribution and filtering complexity. |
Payload indexes: Create indexes on source, content_type, and permissions fields. These allow fast pre-filtering during vector search. Without payload indexes, Qdrant scans all vectors first and filters after, which is much slower for restrictive filters.
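Assuming Qdrant's REST API (verify field names against the current docs for your version), the collection setup and one of the payload indexes described above look roughly like:

```
PUT /collections/chunks
{
  "vectors": { "size": 1024, "distance": "Cosine" },
  "hnsw_config": { "m": 16, "ef_construct": 200 },
  "quantization_config": { "scalar": { "type": "int8", "always_ram": true } }
}

PUT /collections/chunks/index
{ "field_name": "permissions", "field_schema": "keyword" }
```

The query-time ef value is passed per search request rather than set on the collection.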
What exactly is stored per chunk?
Each chunk becomes a vector with a metadata payload:
```json
{
  "id": "chunk_a1b2c3d4",
  "vector": [0.12, -0.98, 0.34, 0.67, -0.21, "... 1024 dimensions total"],
  "payload": {
    "content": "The payments service uses exponential backoff with jitter for all downstream retries. Base delay is 2 seconds, multiplied by 2^attempt with random jitter up to 500ms...",
    "doc_id": "doc_payments_retry_2026",
    "title": "Payment Service Error Handling",
    "source": "confluence",
    "content_type": "prose",
    "permissions": ["team-payments", "team-platform"],
    "url": "https://wiki.internal/pages/payment-error-handling",
    "updated_at": "2026-03-15T10:30:00Z",
    "chunk_index": 3,
    "total_chunks": 8
  }
}
```

The vector captures meaning (1024 numbers). The payload carries the actual text, metadata, and access control filters. The vector database returns ranked chunks, not answers. The LLM has not been involved yet at this stage — it only enters the picture after retrieval selects the best chunks.
10.3 Keyword Index (Elasticsearch)
BM25 keyword search handles queries where exact term matching matters. "ErrorCode 4032" should match documents containing that exact string, regardless of semantic similarity.
Index design:
```json
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "standard" },
      "title_context": { "type": "text", "analyzer": "standard", "boost": 3.0 },
      "source": { "type": "keyword" },
      "content_type": { "type": "keyword" },
      "permissions": { "type": "keyword" },
      "doc_id": { "type": "keyword" },
      "chunk_id": { "type": "keyword" },
      "url": { "type": "keyword" },
      "updated_at": { "type": "date" }
    }
  }
}
```

Field boosting: title_context gets 3x boost because a match in the heading is a strong relevance signal. A chunk under the heading "Authentication Flow" that matches the query "how does auth work" should rank higher than a chunk that mentions authentication in passing.
10.4 Metadata Store (PostgreSQL)
PostgreSQL stores the system of record for document metadata, access control lists, and chunk lineage.
Key tables:
- documents: Source metadata, content hash, last sync timestamp, permissions
- chunks: Chunk-to-document mapping, chunk strategy, position, parent chunk
- permissions: ACL entries mapping documents to user groups and individual users
- feedback: User ratings, corrections, and implicit signals per query-answer pair
- evaluations: LLM-as-judge scores, golden dataset results, regression test runs
Why not put metadata in Qdrant payloads? Qdrant payloads work for search-time filtering but are not queryable with SQL. Admin dashboards, permission audits, and evaluation analysis all need SQL queries that join across tables. PostgreSQL is the right tool for that.
10.5 Semantic Cache (Redis)
Before running the full retrieval pipeline, check if a semantically similar query was recently answered.
How it works:
- Embed the incoming query.
- Check Redis for cached query embeddings with cosine similarity > 0.95.
- If found, return the cached answer immediately (< 10ms vs 1-2s for full pipeline).
- If not found, run the full pipeline and cache the result.
What each cache entry stores:
```json
{
  "query_embedding": [0.12, -0.98, 0.34, "..."],
  "query_text": "How does auth work?",
  "answer": "The auth service uses OAuth2 with...",
  "sources": [{"title": "Auth Guide", "url": "..."}],
  "created_at": 1711234567
}
```

The embedding is the lookup key, not the text. "How does auth work?" and "How does authentication work?" are different strings but nearly identical embeddings. Text matching would miss the cache hit. Embedding matching catches it.
How similarity matching works: Cosine similarity measures the angle between two vectors. Same direction means same meaning.
New query: "How does authentication work?" → [0.11, -0.97, 0.35, ...]
Cached query: "How does auth work?" → [0.12, -0.98, 0.34, ...]
cosine_similarity = dot_product(A, B) / (magnitude(A) × magnitude(B))
= 0.97
0.97 > 0.95 threshold → cache hit → return stored answer in <10ms
A query like "What auth does the payments service use?" produces a vector with similarity ~0.82, well below the threshold. Cache miss, full pipeline runs. The threshold controls how strict the match is: 0.95 is conservative (fewer hits, safer), 0.90 is relaxed (more hits, small risk of returning a slightly wrong cached answer). For enterprise RAG with sensitive runbooks and ACL-protected content, 0.95 is the right default.
At small scale (under 50K cached queries), comparing against all stored embeddings with brute-force cosine similarity takes under 1ms. If the cache grows past 100K entries, switch to Redis Vector Search (RediSearch module with HNSW indexing) or pgvector for indexed approximate search.
Cache invalidation: When a document is updated, invalidate all cached answers that cited that document. This uses the chunk lineage in PostgreSQL to find affected cache entries.
Hit rates: In practice, enterprise knowledge platforms see 5-25% cache hit rates. Engineers on the same team ask similar questions. Onboarding engineers ask the same questions that last month's new hires asked. Cache hit rate is highest on Monday mornings and during incident response when many people search for the same runbooks.
Cold start and cache warming: An empty cache means every query pays full pipeline cost. This happens after deployments, cache flushes, or Monday mornings when TTLs have expired over the weekend. Three strategies:
- Pre-warm with top queries. Log the top 1,000 queries from the previous week. After a cache flush or deployment, run these through the pipeline as a background job (P2 priority, so it does not compete with real users). This covers 15-30% of Monday's traffic before anyone arrives.
- Staggered TTLs. Instead of a uniform 24-hour TTL for all cache entries, randomize TTLs between 18-30 hours. This prevents mass expiration at the same time and smooths the cache rebuild.
- Gradual invalidation after document updates. When a document is updated, do not flush all related cache entries immediately. Mark them as "stale but serveable" for 5 minutes while the pipeline regenerates fresh answers in the background. Users get slightly stale answers for a few minutes instead of a latency spike.
10.6 Data Lifecycle Management
Vectors do not age gracefully. Without active maintenance, the index accumulates stale chunks, orphaned vectors, and fragmented segments.
Document TTL and stale chunk cleanup: Documents that have not been updated at the source within a configurable window (for example, 6-12 months depending on document type) get flagged as potentially stale. A weekly job checks whether flagged documents still exist at the source. If deleted or archived, their chunks are removed from Qdrant and Elasticsearch. If still present but unchanged, they stay indexed but get a stale_risk: high metadata flag that the generation layer can surface as a caveat: "Note: this source was last updated 14 months ago."
Vector compaction: Qdrant segments accumulate deletions over time (from document updates and stale cleanup). Deleted vectors are soft-deleted but still occupy space and slow search. Schedule monthly compaction during low-traffic windows (weekend nights) to reclaim space and rebuild HNSW graph segments.
Index fragmentation: Elasticsearch indexes grow fragmented as documents are added and deleted. Run _forcemerge quarterly to reduce segment count. For Qdrant, the optimizer handles this automatically but benefits from a periodic full re-optimization.
11. Retrieval Pipeline
Retrieval is where trust is won or lost. All the ingestion and chunking work is just preparation for this stage. Most RAG systems fail not because of the model, but because they retrieve the wrong context and never find out.
11.1 Query Understanding
Raw user queries are often ambiguous, incomplete, or poorly phrased. The query understanding layer transforms them before retrieval.
Query classification: A lightweight classifier (fine-tuned DistilBERT or rule-based) categorizes queries into:
| Query Type | Example | Routing |
|---|---|---|
| Simple factual | "What port does the user service run on?" | Single-shot retrieval, small model |
| How-to | "How do I set up local dev for the payments service?" | Single-shot retrieval, medium model |
| Analytical | "What are the trade-offs between our auth approaches?" | May need agentic RAG, large model |
| Multi-hop | "How does service X auth with Y, and what changed after Q3?" | Agentic RAG with query decomposition |
| Conversational | "What about the error handling?" (follow-up) | Resolve coreferences, then route |
Query rewriting: For ambiguous queries, use a fast LLM call to rewrite:
Original: "that auth thing from last quarter"
Rewritten: "authentication changes implemented in Q3 2025"
Cost: negligible per query with a small model. Applied to ~30% of queries (those classified as ambiguous).
HyDE (Hypothetical Document Embeddings): For abstract queries where the user's question does not overlap lexically with any document, generate a hypothetical answer first, then use that answer's embedding for retrieval.
Query: "Why is checkout slow?"
HyDE: "The checkout service experiences latency spikes due to synchronous calls
to the payment processor and inventory service. The p99 latency increases
from 200ms to 2s during peak traffic because..."
The HyDE embedding is semantically closer to actual documents about checkout performance than the original 4-word query. This technique can improve recall for abstract queries, with observed gains of 5-15% depending on usage patterns, but adds 200-400ms of latency (the LLM call to generate the hypothetical answer). Apply it selectively, not on every query.
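The HyDE flow itself is only a few lines; in this sketch, generate, embed, and search are stand-ins for your LLM client, embedding model, and vector index:

```python
def hyde_retrieve(query: str, generate, embed, search, top_k: int = 10):
    # Ask the LLM for a plausible answer, then retrieve with THAT text's
    # embedding instead of the raw query's embedding.
    hypothetical = generate(
        f"Write a short passage that plausibly answers: {query}"
    )
    return search(embed(hypothetical), top_k=top_k)
```

The hypothetical answer may be factually wrong; that is fine, because it is only used as a retrieval probe, never shown to the user.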
11.2 Hybrid Retrieval
Run BM25 keyword search and dense vector search in parallel, then merge results.
Why hybrid? BM25 captures exact lexical matches and rare tokens. Embeddings capture semantic similarity. Their failure modes are complementary, which is why combining them outperforms either alone. Each search modality has blind spots:
- Vector search misses: Exact identifiers, error codes, version numbers. "ErrorCode 4032" has no meaningful semantic embedding. BM25 finds it instantly.
- BM25 misses: Intent and meaning. "how to handle failures gracefully" does not match a document titled "Retry and Circuit Breaker Patterns" because the keywords do not overlap. Vector search captures the semantic connection.
Concrete example of complementary blind spots:
Try this query: "ErrorCode 4032"
- Vector search returns docs about error handling in general — semantically similar, but not the right error code.
- Keyword search finds the exact match instantly: "ErrorCode 4032: Idempotency key conflict on duplicate payment submission."
Now try: "how to handle failures gracefully"
- Keyword search returns nothing useful — no document contains this exact phrase.
- Vector search finds "Resilience Patterns: Circuit Breakers, Retries, and Graceful Degradation" — semantically a perfect match.
Vector search understands meaning. Keyword search finds exact terms. You need both.
Reciprocal Rank Fusion (RRF) merges the two result lists:
RRF_score(doc) = sum(1 / (k + rank_in_list)) for each list containing doc
where k = 60 (standard constant)
If a document appears at rank 3 in vector search and rank 7 in BM25:
RRF_score = 1/(60+3) + 1/(60+7) = 0.0159 + 0.0149 = 0.0308
Documents appearing in both lists get boosted. Documents appearing in only one list still contribute.
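The fusion is a few lines of code; a sketch taking one ranked doc-id list per retriever:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # rankings: one ranked doc-id list per retriever (e.g., vector, BM25)
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because each contribution is 1/(k + rank), RRF only uses rank positions, so it needs no score normalization across retrievers.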
RRF in action:
| Document | Vector Rank | BM25 Rank | RRF Score | Final Rank |
|---|---|---|---|---|
| payments-service README | 2 | 1 | 0.0325 | 1 |
| Payment Error Handling | 1 | 3 | 0.0323 | 2 |
| Retry Config Reference | 5 | 2 | 0.0315 | 3 |
| Resilience Patterns | 3 | 5 | 0.0313 | 4 |
| Error Code Glossary | 8 | 4 | 0.0303 | 5 |
Alpha tuning: Some teams use a weighted combination instead of RRF: score = alpha * vector_score + (1 - alpha) * bm25_score. The optimal alpha varies by corpus. For our enterprise knowledge base: alpha = 0.7 (favor semantic) works well for natural language queries. For code-heavy queries: alpha = 0.4 (favor keyword) performs better.
Dynamic alpha by query type: Rather than a single fixed alpha, adjust the weight based on the query classification from Section 11.1:
| Query Type | Alpha (vector weight) | Why |
|---|---|---|
| Conceptual ("how does auth work") | 0.8 | Meaning matters more than exact terms |
| Code/identifier ("ErrorCode 4032") | 0.3 | Exact match critical; BM25 excels |
| How-to ("set up local dev") | 0.6 | Mix of procedure keywords and intent |
| Conversational follow-up | 0.7 | Rewritten query benefits from semantic |
Over time, you can learn the optimal alpha per query type from feedback data. If users consistently downvote code query results, try shifting alpha lower for that bucket.
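A sketch of the per-type blending; the weights mirror the table above, and the formula assumes both scores have been normalized to comparable ranges (raw BM25 scores and cosine similarities are not directly comparable):

```python
# Hypothetical per-type alpha weights, keyed by the query classifier's labels
ALPHA = {"conceptual": 0.8, "identifier": 0.3, "how_to": 0.6, "follow_up": 0.7}

def hybrid_score(vector_score: float, bm25_score: float,
                 query_type: str, default_alpha: float = 0.7) -> float:
    # alpha is the vector weight; (1 - alpha) goes to BM25
    a = ALPHA.get(query_type, default_alpha)
    return a * vector_score + (1 - a) * bm25_score
```

Updating the ALPHA table from feedback data is the learning loop described above.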
Performance: Hybrid retrieval typically outperforms either modality alone by 5-15% on recall@10 across published benchmarks. The actual improvement depends on your query mix and corpus characteristics.
Every retrieval improvement (hybrid search, re-ranking, query expansion) adds latency and cost. Production systems continuously balance this trade-off rather than maximizing any single dimension. A system that gets perfect recall at 2-second retrieval latency is worse than one that gets 90% recall at 200ms.
11.3 Re-Ranking
The initial retrieval cast a wide net: top-50 to top-100 results from hybrid search. Re-ranking narrows this to the top-5 most relevant chunks using a cross-encoder model.
Why re-ranking matters: Bi-encoders encode the query and document independently into separate vectors. Fast, but the representations never "see" each other. Cross-encoders score the (query, document) pair jointly, attending to word-level interactions between them. Joint scoring is significantly more accurate but too slow to run on the full corpus.
The standard pattern: fast bi-encoder retrieval on the full corpus (millions of chunks) followed by accurate cross-encoder re-ranking on the shortlist (20-100 chunks).
Re-ranking options:
| Model | Latency (20 docs) | Quality (NDCG@10) | Cost |
|---|---|---|---|
| Cohere Rerank 3.5 | 80-150ms (with network) | Top-tier | $2/1M docs |
| Qwen3-Reranker-8B (open source) | 50-100ms (GPU) | Comparable to Cohere on benchmarks | GPU cost only |
| BGE-reranker-v2-m3 (self-hosted) | 40-80ms (GPU) | Very good | GPU cost only |
| ColBERTv2 | 20-40ms | Good (late-interaction, fastest) | GPU cost only |
| Jina Reranker v2 | 50-90ms | Very good | $1/1M docs |
Our choice: Cohere Rerank 3.5 as primary, with BGE-reranker-v2-m3 self-hosted as a fallback. The Cohere model handles 20 documents in ~80-150ms at P99 in typical deployments (including network overhead) which fits within our P99 retrieval budget of ~350ms. The self-hosted fallback activates if Cohere's API has availability issues.
Open-source re-rankers:
| Model | Params | Quality | GPU Requirement |
|---|---|---|---|
| Qwen3-Reranker-8B | 8B | Matches Cohere Rerank 3.5 | 1x A100-40GB |
| BGE-reranker-v2-m3 | 568M | Very good | 1x RTX 4090 |
| ColBERTv2 | 110M | Good (late-interaction, very fast) | 1x RTX 4090 |
11.4 Access Control Filtering
Non-negotiable. An engineer should never see answers sourced from documents they cannot access. One leaked internal doc through the knowledge assistant and the platform is dead.
Pre-filtering vs post-filtering:
- Pre-filtering (our approach): Include the user's permission groups in the vector search query. Qdrant's payload index on `permissions` filters before scoring, so only accessible documents are considered. This is faster and more secure. The downside: if a user has very restrictive permissions, the candidate pool shrinks and retrieval quality may drop.
- Post-filtering: Retrieve top-k from the full corpus, then filter out inaccessible documents. Simpler to implement, but you may end up with fewer than k results after filtering. It also briefly loads inaccessible document IDs into memory, which some compliance frameworks disallow.
Permission sync: Permissions are synced from source systems to PostgreSQL on document ingestion. When Confluence space permissions change, a webhook triggers a permission update for all documents in that space. Qdrant payload updates propagate within seconds.
Edge case: shared docs with restricted sections. Some Confluence pages are accessible to everyone but contain sections marked as restricted. The current design treats the entire document as accessible if the top-level permission allows it. A future improvement: section-level permissions mapped to chunk-level ACLs.
11.5 Context Assembly
After retrieval and re-ranking produce the top-5 chunks, assemble them into the LLM prompt.
Context window budget:
Total context window: 8,192 tokens (small model) or 200,000 tokens (large model)
Budget allocation (small model):
System prompt: 500 tokens (instructions, citation format, guardrails)
Retrieved context: 4,000-5,000 tokens (5 chunks × 800-1000 tokens)
Conversation history: 1,000-1,500 tokens (last 2-3 turns, summarized)
Generation output: 1,000-2,000 tokens (answer + citations)
Budget allocation (large model, complex queries):
System prompt: 500 tokens
Retrieved context: 8,000-15,000 tokens (10-15 chunks for agentic RAG)
Conversation history: 2,000-4,000 tokens (full recent history)
Generation output: 2,000-4,000 tokens
Chunk ordering: Research on the "lost in the middle" effect shows that LLMs pay more attention to the beginning and end of the context, and underweight information in the middle. To mitigate: place the most relevant chunk first, the second-most-relevant chunk last, and fill the middle with supporting context. In our testing, this can improve answer quality by 3-5%, though newer models show less susceptibility to this effect than earlier ones.
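The reordering itself is trivial. A minimal sketch, assuming chunks arrive sorted best-first from the re-ranker (the function name is illustrative):

```python
def order_for_context(chunks_by_relevance: list[str]) -> list[str]:
    """Mitigate the lost-in-the-middle effect: put the most relevant
    chunk first, the second-most-relevant last, and the rest in the middle.
    Input must be sorted best-first."""
    if len(chunks_by_relevance) < 3:
        return list(chunks_by_relevance)
    first, second = chunks_by_relevance[0], chunks_by_relevance[1]
    middle = chunks_by_relevance[2:]
    return [first] + middle + [second]
```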
Context compression: When retrieved chunks are long or overlap in content, compress before stuffing into the prompt. Two techniques:
- Redundancy removal. If two chunks say essentially the same thing (cosine similarity > 0.85 between their embeddings), keep only the higher-scored one. This happens more often than you would expect, especially when multiple documents describe the same service.
- Extractive compression. Use a fast LLM call to extract only the sentences relevant to the query from each chunk. An 800-token chunk might compress to 200 tokens without losing the information the user actually needs. The small per-query extraction cost typically pays for itself: compression can save 30-50% of context tokens in practice, which directly reduces generation cost.
Apply compression selectively. For simple factual queries with 3-4 short chunks, skip it. For complex queries with 10+ chunks from agentic RAG, compression pays for itself in reduced generation tokens.
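Redundancy removal can be sketched with plain cosine similarity over the chunks' existing embeddings. Because the input is sorted by re-rank score, the higher-scored duplicate always survives. Function names and the input shape are illustrative:

```python
import math

def cosine(u: tuple[float, ...], v: tuple[float, ...]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dedupe_chunks(scored_chunks: list[tuple[str, tuple[float, ...]]],
                  threshold: float = 0.85) -> list[str]:
    """Drop near-duplicate chunks. Input: (chunk_id, embedding) pairs
    sorted by re-rank score, best first."""
    kept: list[tuple[str, tuple[float, ...]]] = []
    for chunk_id, emb in scored_chunks:
        # Keep the chunk only if it is not too similar to anything kept so far.
        if all(cosine(emb, kept_emb) <= threshold for _, kept_emb in kept):
            kept.append((chunk_id, emb))
    return [chunk_id for chunk_id, _ in kept]
```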
Token counting: Use tiktoken (for OpenAI models) or the model's native tokenizer to count precisely. Overestimating wastes context. Underestimating causes truncation errors. Count tokens before assembly, not after.
12. Agentic RAG: Beyond Single-Shot Retrieval
Single-shot RAG (retrieve once, generate once) handles about 60-70% of enterprise queries well. The remaining 30-40% are too complex: they require information from multiple documents, involve reasoning across topics, or are phrased so abstractly that initial retrieval misses the target.
Agentic RAG uses an LLM as a reasoning agent that can iteratively search, evaluate results, and refine its approach. Instead of a fixed pipeline, the agent decides what to do next based on what it has found so far.
12.1 When to Use Agentic RAG
Not every query needs an agent. Agents typically add 1.5-3x latency and 2-5x cost compared to single-shot RAG (varies by iteration count and tools called). Route to the agentic path only when needed.
Routing criteria:
| Signal | Single-Shot | Agentic |
|---|---|---|
| Query complexity classifier | Simple, factual, how-to | Analytical, multi-hop, comparative |
| Estimated retrieval confidence | High (clear topic match) | Low (abstract, ambiguous) |
| Query length | < 15 words | > 15 words, multiple clauses |
| Contains comparison words | No | "compare", "difference between", "trade-offs" |
| Contains temporal references | No | "changed since", "before and after", "history of" |
The percentage varies by organization, but typically 10-30% of queries benefit from the agentic path. These are the queries that produce the most value because they are the ones engineers previously could not answer without asking a senior colleague.
Concrete examples of classification:
SIMPLE → single-shot RAG:
"What is the endpoint for the user service?" → direct lookup
"ErrorCode 4032" → exact match
"How to set up local dev for payments service?" → how-to, single topic
COMPLEX → agentic RAG:
"Compare auth v1 vs v2 and what changed after Q3" → multi-part + temporal
"Why is checkout slow?" → analytical, multiple causes
"How does payments talk to user service and what happens when user service is down?" → multi-hop, cross-service
The classifier is a lightweight LLM call (Haiku/GPT-4o-mini, ~10ms) that returns simple or complex with a confidence score. When confidence is below 0.7, default to single-shot first and fall back to agentic if retrieval confidence is low.
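The routing table's signals can also serve as a cheap heuristic pre-filter before (or alongside) the LLM classifier. The word lists and thresholds below are illustrative starting points, not tuned values:

```python
COMPARISON_WORDS = {"compare", "difference", "versus", "vs", "trade-offs", "tradeoffs"}
TEMPORAL_PHRASES = ("changed since", "before and after", "history of")

def route_query(query: str) -> str:
    """Heuristic routing mirroring the signal table above: comparison words,
    temporal references, and query length all push toward the agentic path."""
    q = query.lower()
    words = q.split()
    if any(w.strip("?,.") in COMPARISON_WORDS for w in words):
        return "agentic"
    if any(phrase in q for phrase in TEMPORAL_PHRASES):
        return "agentic"
    if len(words) > 15:
        return "agentic"
    return "single_shot"
```

In practice a heuristic like this can short-circuit the classifier call for obviously simple queries, saving a little latency on the majority path.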
12.2 Query Decomposition
Complex queries are broken into simpler sub-queries that can each be answered independently.
Original: "How does the payments service authenticate with the user service,
and what changed after the Q3 auth migration?"
Decomposed:
Sub-query 1: "Payments service authentication mechanism with user service"
Sub-query 2: "Q3 2025 authentication migration changes"
Sub-query 3: "Payments service auth changes after Q3 migration"
Each sub-query runs through the full retrieval pipeline independently. Results are merged and deduplicated before being passed to the generation step.
Implementation: A single LLM call decomposes the query. Prompt:
Given this complex question, break it into 2-5 simpler sub-questions
that can each be answered independently by searching an internal
knowledge base. Each sub-question should be self-contained.
Question: {original_query}
Estimated cost (as of early 2026): $0.002-0.005 per decomposition using a small model. Latency: 200-400ms.
12.3 Iterative Retrieval with Self-Reflection
After initial retrieval, the agent evaluates whether the retrieved context is sufficient to answer the query. If not, it formulates a follow-up search.
The loop:
- Retrieve context for the query (or sub-query).
- Ask the agent: "Given this context, can you answer the question? If not, what additional information do you need?"
- If the agent identifies gaps, it generates a refined query targeting the missing information.
- Retrieve again with the refined query.
- Repeat up to 3 iterations (configurable).
Why this matters: Initial retrieval often returns documents that are close but not quite right. The agent can recognize "I found the auth documentation but it's for the old system, I need the post-migration docs" and search more specifically.
Latency impact: Each iteration adds 500-800ms (retrieval + evaluation). A 3-iteration query takes 2.5-4.5s total. This is acceptable for complex queries where the alternative is the engineer spending 20 minutes searching manually.
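The loop above can be sketched with injected `retrieve`, `evaluate`, and `refine` callables so the LLM evaluation step stays pluggable. All function names are hypothetical:

```python
from typing import Callable

def iterative_retrieve(query: str,
                       retrieve: Callable[[str], list[str]],
                       evaluate: Callable[[str, list[str]], tuple[bool, str]],
                       refine: Callable[[str, str], str],
                       max_iterations: int = 3) -> list[str]:
    """Self-reflective retrieval: retrieve, ask whether context suffices,
    refine the query toward the identified gap, repeat up to a limit."""
    chunks: list[str] = []
    current_query = query
    for _ in range(max_iterations):
        chunks.extend(retrieve(current_query))
        sufficient, gap = evaluate(query, chunks)
        if sufficient:
            break
        current_query = refine(current_query, gap)
    return chunks
```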
12.4 Multi-Source Routing
The agent decides which knowledge sources to query based on the query type:
| Query Type | Primary Sources | Retrieval Strategy |
|---|---|---|
| API reference | GitHub README, Confluence API docs | Code-aware chunking, exact match boosted |
| Incident investigation | Slack threads, runbooks | Thread-level retrieval, recency-weighted |
| Architecture decisions | RFCs, design docs, ADRs | Semantic chunking, full-section retrieval |
| Setup instructions | Runbooks, READMEs | Recursive chunking, step-by-step retrieval |
Rather than searching the entire 10M-chunk corpus for every query, the agent narrows to relevant source types first. This reduces search space, improves precision, and cuts retrieval latency.
12.5 Tool Use via MCP
The agent interacts with the retrieval system and other data sources through MCP (Model Context Protocol) tools. Each tool is an MCP server, and the agent orchestrator is the MCP client.
Available tools:
| MCP Tool Server | Capabilities | When Used |
|---|---|---|
search-vector | Semantic search on Qdrant with filters | Every retrieval step |
search-keyword | BM25 search on Elasticsearch | Exact term queries, error codes |
search-code | AST-aware code search on GitHub index | API lookups, function signatures |
query-metadata | SQL queries on PostgreSQL metadata | "Who owns this service?", "When was this last updated?" |
calculate | Math operations for sizing/estimation queries | "How much storage does X need?" |
fetch-document | Retrieve full document by ID | When a chunk reference needs full context |
Why MCP over custom tool interfaces? Three reasons:
- Model portability. Write each tool server once. Use it with Claude, GPT, Llama, or any MCP-compatible model. No per-model tool format conversion.
- Security. MCP's OAuth 2.1 support means each tool server can enforce authentication and authorization independently. The code search tool can verify that the requesting user has access to the repo before returning results.
- Capability negotiation. At initialization, the agent discovers what tools are available and their schemas. If the code search tool is down for maintenance, the agent gracefully skips it instead of failing.
For a deeper dive on MCP architecture, transports, and security model, see the MCP server guide.
How does the LLM pick which tool to call? The agent receives tool schemas at initialization and selects based on query intent:
Query: "How does auth work?"
→ Agent selects: search-vector("authentication flow architecture")
Why: conceptual question, semantic search finds best results
Query: "ErrorCode 4032"
→ Agent selects: search-keyword("ErrorCode 4032")
Why: exact term lookup, BM25 finds the precise match
Query: "Who owns the checkout service?"
→ Agent selects: query-metadata("SELECT owner FROM services WHERE name = 'checkout'")
Why: structured data question, SQL is more precise than text search
Query: "How does the retry logic in payments-service/retry.go work?"
→ Agent selects: search-code("retry.go exponential backoff payments-service")
Why: code-specific question, AST-aware search finds function definitions
The LLM does not guess. It receives the tool descriptions and input schemas, then calls the most appropriate tool for each sub-query. Complex queries often call multiple tools in sequence.
LangGraph orchestration: The agent loop is implemented as a LangGraph state machine:
from typing import TypedDict
from langgraph.graph import StateGraph, END
class AgentState(TypedDict):
query: str
sub_queries: list[str]
retrieved_chunks: list[Chunk]
iteration: int
max_iterations: int
is_sufficient: bool
answer: str | None
def should_continue(state: AgentState) -> str:
if state["is_sufficient"] or state["iteration"] >= state["max_iterations"]:
return "generate"
return "retrieve"
graph = StateGraph(AgentState)
graph.add_node("decompose", decompose_query)
graph.add_node("retrieve", retrieve_with_tools)
graph.add_node("evaluate", evaluate_sufficiency)
graph.add_node("generate", generate_answer)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "retrieve")
graph.add_edge("retrieve", "evaluate")
graph.add_conditional_edges("evaluate", should_continue)
graph.add_edge("generate", END)
app = graph.compile()
Detailed example: a 2-iteration agent trace.
Query: "How does the payments service handle retries, and what changed after the Q3 migration?"
ITERATION 1:
Decompose → 2 sub-queries:
[1] "payments service retry strategy"
[2] "Q3 migration changes payments"
Tool calls:
search-vector("payments service retry strategy") → 4 chunks found
✓ "Exponential backoff with jitter, base delay 2s..."
✓ "Retry budgets: max_retries=3, circuit breaker at 5 failures..."
search-vector("Q3 migration changes payments") → 2 chunks found
~ "Q3 platform migration overview" (too broad)
~ "Service mesh migration timeline" (wrong service)
Evaluate: "I have strong retry docs but the Q3 migration results
are too generic. I need to search for payments-specific
migration changes."
→ is_sufficient: false
ITERATION 2:
Refined search:
search-keyword("Q3 migration payments retry") → 3 chunks found
✓ "Q3 migration: payments retry policy changed from 5→3 max retries"
✓ "Migration rollback: reverted retry timeout from 10s to 5s"
search-vector("payments Q3 breaking changes") → 1 chunk found
✓ "BREAKING: retry jitter algorithm changed from full to decorrelated"
Evaluate: "Now I have both retry strategy docs AND specific Q3 changes."
→ is_sufficient: true
GENERATE:
Context: 7 chunks (4 from iteration 1 + 3 from iteration 2)
Answer: "The payments service uses exponential backoff with jitter...
After Q3, three things changed: max_retries was reduced from
5 to 3, timeout was halved from 10s to 5s, and the jitter
algorithm switched from full to decorrelated [Source: Q3 Migration Notes]."
Total: 2 iterations, 4 tool calls, 3.2 seconds, $0.04
Notice how the agent reasons about what is missing and refines its search. Single-shot RAG would have returned the retry strategy but missed the Q3 changes entirely.
12.6 Guardrails on Agent Loops
Agentic RAG without guardrails is a cost bomb waiting to go off. An agent that keeps searching and re-searching without finding useful results can burn through tokens and latency.
Hard limits:
| Guardrail | Limit | What Happens at Limit |
|---|---|---|
| Max iterations | 3 | Fall back to best-effort answer with available context |
| Max tool calls | 15 | Stop searching, generate from what you have |
| Max tokens (input) | 50,000 | Budget exhausted, generate immediately |
| Max wall-clock time | 15 seconds | Timeout, return partial answer with apology |
| Cost cap per query | $0.10 | Circuit breaker, fall back to single-shot RAG |
Circuit breaker: If the agent detects it is looping (same retrieval results appearing twice, or evaluation scores not improving), it breaks out and falls back to single-shot RAG with the best results found so far. The user gets a slightly worse answer fast, rather than a perfect answer never.
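A sketch of the hard-limit check, with the table's limits as defaults. The names are illustrative; in a real agent loop this runs before every tool call:

```python
from dataclasses import dataclass

@dataclass
class AgentBudget:
    # Defaults taken from the guardrail table above.
    max_iterations: int = 3
    max_tool_calls: int = 15
    max_input_tokens: int = 50_000
    max_seconds: float = 15.0
    max_cost_usd: float = 0.10

def should_stop(budget: AgentBudget, iterations: int, tool_calls: int,
                input_tokens: int, elapsed_seconds: float, cost_usd: float) -> bool:
    """True when any hard limit is hit: stop searching and generate
    from whatever context has been gathered so far."""
    return (iterations >= budget.max_iterations
            or tool_calls >= budget.max_tool_calls
            or input_tokens >= budget.max_input_tokens
            or elapsed_seconds >= budget.max_seconds
            or cost_usd >= budget.max_cost_usd)
```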
13. Generation Layer
13.1 Model Routing
Not every query needs a frontier model. A tiered routing strategy can cut costs by 50-70% while maintaining answer quality, depending on your query distribution and how accurately you classify complexity.
| Tier | Model | Query Types | TTFT | Relative Cost | % of Traffic |
|---|---|---|---|---|---|
| Fast | Claude Haiku / GPT-4o-mini | Simple factual, definitions, single-doc answers | 150-300ms | Very low | ~70-80% |
| Standard | Claude Sonnet / GPT-4o | How-to, moderate reasoning, multi-chunk synthesis | 400-800ms | Moderate | ~15-25% |
| Complex | Claude Opus / GPT-4 | Multi-hop reasoning, comparative analysis, complex synthesis | 800-2000ms | High | ~5-10% |
Routing decision: The query complexity classifier (from Section 11.1) determines the tier. When in doubt, route up not down. An overqualified model wastes a few cents. An underqualified model produces a bad answer that erodes trust.
Fallback chain: If the primary provider (say, Anthropic) is down:
- Try the same tier on the secondary provider (OpenAI).
- If both API providers are down, fall back to self-hosted vLLM with Llama 4 70B (or DeepSeek V3.2).
- If self-hosted is also unavailable (hardware failure), return a degraded response: "I found these potentially relevant documents: [links]. Our answer generation service is temporarily unavailable."
The degraded response is still useful. It turns the platform from a Q&A system into a search engine, which is better than a 500 error.
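The fallback chain reduces to an ordered list of callables. This sketch assumes each provider tier is wrapped in a callable that raises on failure; all names are illustrative:

```python
from typing import Callable

def generate_with_fallback(prompt: str,
                           providers: list[tuple[str, Callable[[str], str]]],
                           degraded_links: list[str]) -> dict:
    """Try each provider in order (primary API, secondary API, self-hosted).
    If all fail, degrade to a search-style response with document links."""
    for name, call in providers:
        try:
            return {"provider": name, "answer": call(prompt)}
        except Exception:
            continue  # provider down or erroring: try the next tier
    return {"provider": "degraded",
            "answer": "I found these potentially relevant documents: "
                      + ", ".join(degraded_links)
                      + ". Our answer generation service is temporarily unavailable."}
```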
13.2 Prompt Engineering at Scale
The system prompt is the most critical piece of text in the entire platform. It determines citation behavior, hallucination boundaries, and response format.
You are an internal knowledge assistant for engineers. Answer questions
using ONLY the retrieved context provided below. Follow these rules:
1. GROUNDING: Every factual claim must be supported by the retrieved context.
If the context does not contain enough information, say so explicitly.
2. CITATIONS: Reference sources using [Source N] notation inline.
At the end, list all sources with their titles and URLs.
3. UNCERTAINTY: If you are not confident in an answer, say
"Based on the available documentation, [answer], but I'd recommend
verifying with [suggested source or team]."
4. SCOPE: Do not answer questions about topics not covered in the
retrieved context. Do not use your training data as a source.
5. FORMAT: Use markdown. Code blocks for code. Keep answers concise
but complete. Prefer bullet points for multi-step answers.
Retrieved Context:
{retrieved_chunks_with_source_labels}
Conversation History:
{recent_turns}
User Question: {query}
Prompt versioning: Prompts are stored in a version-controlled config file, not hardcoded. Each change gets a version tag. A/B testing compares prompt versions on 5-10% of traffic, measuring answer quality via LLM-as-judge scores and user feedback.
Dynamic prompt assembly: The prompt template varies based on query type:
- Factual queries get a shorter system prompt emphasizing conciseness.
- How-to queries get a prompt emphasizing step-by-step formatting.
- Analytical queries get a prompt emphasizing nuance and trade-off discussion.
13.3 Streaming Architecture
Users should not stare at a blank screen for 2 seconds. Streaming shows tokens as they are generated, reducing perceived latency.
Implementation: Server-Sent Events (SSE) from the backend to the frontend.
- The LLM provider streams tokens via its API.
- The backend forwards each token to the client over an SSE connection.
- The client renders markdown incrementally.
Citation handling in streaming: Citations are tricky to stream because the model might output "[Source 1]" across multiple token chunks. The backend buffers citation markers and only sends them to the client once the full marker is complete. The source URL resolution happens at the end of the stream.
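A minimal sketch of the marker-buffering logic: hold back any trailing text that could still grow into a complete `[Source N]` marker, and release it once the closing bracket arrives (or at stream end via `flush`). The class name is illustrative:

```python
class CitationBuffer:
    """Buffers partial citation markers like "[Source 1]" that may be
    split across streamed token chunks."""
    def __init__(self):
        self.pending = ""

    def feed(self, token: str) -> str:
        text = self.pending + token
        self.pending = ""
        start = text.rfind("[")
        # If the last "[" is unclosed and could still become "[Source N]",
        # hold that tail back and emit only the text before it.
        if start != -1 and "]" not in text[start:]:
            candidate = text[start:]
            if ("[Source"[:len(candidate)].startswith(candidate)
                    or candidate.startswith("[Source")):
                self.pending = candidate
                return text[:start]
        return text

    def flush(self) -> str:
        """Emit any held-back text at end of stream."""
        out, self.pending = self.pending, ""
        return out
```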
Latency breakdown:
Time to first token (TTFT):
Query understanding: 50-200ms
Cache check: 5-10ms
Retrieval + re-ranking: 100-200ms
Prompt assembly: 5-10ms
LLM TTFT: 200-800ms (depends on model tier)
Total TTFT: 400-1,200ms
Time to last token:
TTFT + generation time: 1.5-5s (depends on answer length)
With streaming, the user sees the first words within 400-1,200ms even though the full answer takes 2-5 seconds. This makes a huge difference in how responsive the tool feels.
13.4 Context Window Management
Context windows are a shared resource. Budget them carefully.
Conversation memory: For multi-turn conversations, include the last 2-3 turns in the prompt. If the conversation is longer, summarize older turns:
Turn 1: User asked about auth flow → Assistant explained OAuth2 integration
Turn 2: User asked about error handling → Assistant listed retry strategies
[Current turn with full context]
Summarization cost is tiny compared to generation. Even at 3 turns per conversation, it barely registers.
When to use large context models: For agentic RAG queries that accumulate 10-15 chunks across multiple retrieval iterations, a 200K context model handles the full context without truncation. But the cost is 5-10x higher per token. Use large context models only for the 5% of queries that actually need them (the "Complex" tier in model routing).
13.5 Self-Hosted LLM Serving
You do not have to use API providers. Every component in this pipeline (LLM generation, embeddings, re-ranking) has open-source alternatives that you can self-host. Open-source models have caught up. On many benchmarks they match or beat the commercial APIs as of early 2026. Here is what is available and how to run it.
Open-source LLMs for generation:
| Model | Params | Architecture | GPU Requirement | Best For |
|---|---|---|---|---|
| Llama 4 70B | 70B | Dense | 4x A100-80GB or 2x H100 | Best ecosystem support, most versatile, strong default |
| DeepSeek V3.2 | 685B MoE (37B active) | MoE | 8x A100-80GB | Reasoning-heavy queries, beats GPT-5 on benchmarks |
| Qwen3-72B | 72B | Dense | 4x A100-80GB | 119 languages, strong reasoning |
| Mistral Large 3 | 675B MoE | MoE | 8x A100-80GB | 92% of GPT-5.2 quality at ~15% the cost |
| Llama 4 8B | 8B | Dense | 1x A100-40GB or RTX 4090 | Fast tier in model routing, sub-200ms TTFT |
| DeepSeek R1 Distill 7B | 7B | Dense | 1x RTX 4090 (24GB) | Strong reasoning for its size, good for query decomposition |
Serving with vLLM: The standard way to serve open-source LLMs in production. PagedAttention reduces KV cache waste from 60-80% down to under 4%. For models too large for your GPUs, quantization (AWQ, GPTQ, or FP8) cuts memory by 2-4x. A 70B model quantized to INT4 fits on 2x RTX 4090s instead of 4x A100s. See the vLLM guide for serving commands, configuration, and quantization details.
API vs self-hosted: when does self-hosting pay off?
| Component | Use API When | Self-Host When |
|---|---|---|
| LLM generation | < 1M queries/month; want zero ops burden | > 5M queries/month; data sovereignty required; need custom fine-tuning |
| Embeddings | < 50M chunks total corpus; infrequent re-indexing | > 50M chunks; continuous re-indexing; data cannot leave your network |
| Re-ranking | Almost always (cheap at $2/1M docs) | Data cannot leave your network; need custom re-ranker fine-tuned on your domain |
14. Hallucination Mitigation
Hallucinations in an internal knowledge assistant are worse than no answer. An engineer who follows a hallucinated API endpoint will break production. Saying "I don't know" is always better than making something up.
14.1 Grounding via Retrieval
The first line of defense: instruct the model to answer only from retrieved context.
System prompt grounding: The prompt (Section 13.2) explicitly says "use ONLY the retrieved context." This reduces hallucinations dramatically compared to an ungrounded model, but it is not foolproof. Models sometimes synthesize information that "sounds right" given the context but is not actually stated.
Confidence scoring: After retrieval and re-ranking, compute an aggregate confidence score:
from statistics import mean

def compute_confidence(reranked_chunks: list[ScoredChunk]) -> float:
if not reranked_chunks:
return 0.0
top_score = reranked_chunks[0].score
score_gap = top_score - reranked_chunks[1].score if len(reranked_chunks) > 1 else 0
avg_top3 = mean([c.score for c in reranked_chunks[:3]])
# These weights are starting points. Tune them on your corpus.
# High confidence: top result is clearly relevant and well-separated
# Low confidence: top results are all mediocre or tightly clustered
confidence = (top_score * 0.5) + (score_gap * 0.2) + (avg_top3 * 0.3)
    return min(confidence, 1.0)
Three-tier response modes based on confidence:
| Confidence | Mode | Response Behavior |
|---|---|---|
| > 0.6 | Full answer | Generate grounded answer with citations. Standard path. |
| 0.3 - 0.6 | Hedged answer | Generate answer but prepend: "Based on what I found, [answer]. I'd recommend verifying with [team/owner] directly." Include source links prominently. |
| < 0.3 | Abstention | Skip generation entirely. Return: "I couldn't find reliable information about this. Here are the closest matches: [links]. Try asking in #[relevant-slack-channel]." |
This matters more than it sounds. Most RAG systems only have two modes: answer or error. The hedged middle tier is where you prevent the worst hallucinations while still being useful. About 15-20% of queries land in this tier, and users actually appreciate the honesty.
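The thresholds map to a trivial dispatch function. The cutoffs mirror the table above and should be re-tuned on your own confidence distribution:

```python
def response_mode(confidence: float) -> str:
    """Map retrieval confidence to one of three response modes:
    full answer, hedged answer, or abstention."""
    if confidence > 0.6:
        return "full_answer"
    if confidence >= 0.3:
        return "hedged_answer"
    return "abstain"
```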
Multi-model verification (optional, for high-stakes queries): For the Complex tier (5% of queries), run the generated answer through a second, cheaper model with the prompt: "Does this answer accurately reflect the provided context? Flag any claims not supported by the sources." Cost is low per query. This can catch a significant portion of hallucinations that slip past NLI checking, particularly subtle misrepresentations where NLI models see "neutral" but a reasoning LLM recognizes the claim is misleading.
14.2 Citation and Attribution
Every factual claim in the response must cite its source. This is enforced at three levels:
Level 1: Prompt instruction. The system prompt requires inline [Source N] citations.
Level 2: Structured output. For the highest-quality responses, use the LLM's structured output mode to enforce a JSON schema:
{
"answer": "The payments service uses OAuth2 for authentication [Source 1]. After the Q3 migration, it switched to mutual TLS for service-to-service calls [Source 2].",
"citations": [
{"id": 1, "chunk_id": "abc123", "title": "Payments Auth Guide", "url": "https://..."},
{"id": 2, "chunk_id": "def456", "title": "Q3 Auth Migration RFC", "url": "https://..."}
],
"confidence": 0.85,
"needs_verification": false
}
Level 3: Post-generation citation verification. After generation, verify that each cited source actually supports the claim made. A fast LLM call checks: "Does [chunk text] support the claim [extracted claim]?" This catches cases where the model cites a source but the cited text does not actually say what the answer claims.
Citation verification adds a small cost per query. Applied to all queries routed to the Standard and Complex tiers (20% of traffic).
14.3 Guardrails and Validation
NLI (Natural Language Inference) checking: Run the generated answer through an NLI model that classifies each claim as "entailed", "neutral", or "contradicted" by the retrieved context.
- Entailed: The context supports the claim. Good.
- Neutral: The context does not address the claim. Flag as potentially hallucinated.
- Contradicted: The context says the opposite. Block the response, re-generate with a stronger grounding instruction.
NLI models like deberta-v3-large can run in under 20ms with a warm model on GPU and add minimal latency to the pipeline. Use NLI on every query (cheap and fast). Use multi-model verification from Section 14.1 only on the Complex tier (expensive but catches subtler misrepresentations that NLI misses).
Entity grounding checks: Extract named entities from the response (service names, API endpoints, configuration keys) and verify they appear in the retrieved context or the broader document corpus. A response that references "the AuthService" when no document mentions that name is likely hallucinating.
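A rough sketch of the entity check, using CamelCase identifiers as a proxy for service and class names. A production system would use a proper NER model or identifier extractor; the regex and function name here are illustrative:

```python
import re

def ungrounded_entities(answer: str, context: str) -> list[str]:
    """Return CamelCase identifiers mentioned in the answer that never
    appear in the retrieved context: likely hallucinated names."""
    entities = set(re.findall(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b", answer))
    return sorted(e for e in entities if e not in context)
```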
Toxicity and safety filters: While less critical for internal tools, filter responses that contain PII, credentials, or secrets that may appear in source documents. A regex-based scanner checks for patterns like API keys, passwords, and internal IP addresses before returning the response.
14.4 Structured Output Enforcement
Force the LLM to output in a structured format (JSON mode or tool calling) to ensure consistent citation formatting. This prevents the model from "forgetting" to cite sources in some responses.
response_schema = {
"type": "object",
"properties": {
"answer_markdown": {"type": "string"},
"sources_used": {
"type": "array",
"items": {
"type": "object",
"properties": {
"chunk_id": {"type": "string"},
"relevance": {"type": "string", "enum": ["high", "medium", "low"]},
"quote": {"type": "string"} # Exact quote from source
}
}
},
"confidence_level": {"type": "string", "enum": ["high", "medium", "low"]},
"follow_up_suggestions": {"type": "array", "items": {"type": "string"}}
},
"required": ["answer_markdown", "sources_used", "confidence_level"]
}
The quote field is particularly valuable: it forces the model to ground each citation in a specific passage from the source, making hallucination harder.
15. Evaluation and Feedback System
If you are not measuring answer quality, you are guessing. "It looks good in demos" is not a metric. Evaluation is a production requirement, not a nice-to-have.
15.1 Offline Evaluation
Golden dataset: A curated set of 500-1,000 question-answer pairs with expected source documents. The dataset covers:
- Simple factual questions (40%)
- How-to questions (25%)
- Analytical questions (20%)
- Multi-hop questions (15%)
Each entry includes:
{
"question": "How does the payments service handle idempotency?",
"expected_answer_contains": ["idempotency key", "header", "X-Idempotency-Key"],
"expected_source_docs": ["payments-api-guide", "payments-rfc-042"],
"category": "factual",
"difficulty": "medium"
}
What does "correct" mean? It depends on query type. Factual queries need exact accuracy. How-to queries need step completeness. Analytical queries need reasoning coherence. Measure each separately or your aggregate score hides the failures that matter.
Metrics (RAGAS framework):
| Metric | What It Measures | Target |
|---|---|---|
| Faithfulness | Does the answer only contain information from the context? | > 0.90 |
| Answer relevance | Does the answer address the question? | > 0.85 |
| Context relevance | Are the retrieved chunks relevant to the question? | > 0.80 |
| Context recall | Do the retrieved chunks contain the information needed? | > 0.80 |
Retrieval-specific metrics:
| Metric | Definition | Target |
|---|---|---|
| Recall@5 | % of relevant docs in top 5 results | > 75% |
| Recall@10 | % of relevant docs in top 10 results | > 85% |
| MRR (Mean Reciprocal Rank) | Average 1/rank of first relevant result | > 0.70 |
| NDCG@10 | Normalized discounted cumulative gain | > 0.75 |
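These retrieval metrics are small enough to implement directly (binary relevance assumed). A sketch:

```python
import math

def recall_at_k(relevant: list[str], ranked: list[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(relevant) & set(ranked[:k])) / len(relevant)

def mrr(relevant: list[str], ranked: list[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(relevant: list[str], ranked: list[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

In a full evaluation harness, `mrr` and the others would be averaged over the whole golden dataset rather than computed per query.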
Regression testing: The golden dataset runs automatically on every change to: chunking logic, embedding model, retrieval parameters, re-ranking model, or system prompt. If any metric drops by more than 2 percentage points, the pipeline blocks deployment and alerts the team. This is the RAG equivalent of a CI test suite.
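The blocking logic itself can be a small pure function that the CI pipeline calls. A sketch of the 2-percentage-point gate described above (metric names and structure are illustrative):

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    max_drop_pp: float = 2.0) -> list[str]:
    """Return the metrics that regressed by more than max_drop_pp percentage
    points versus baseline. A non-empty result blocks deployment."""
    failures = []
    for metric, base in baseline.items():
        drop_pp = (base - candidate.get(metric, 0.0)) * 100
        if drop_pp > max_drop_pp:
            failures.append(metric)
    return failures
```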
15.2 Online Evaluation
Production queries are messier, more varied, and more ambiguous than any golden dataset. Online evaluation catches issues that offline benchmarks miss.
User feedback:
- Explicit: Thumbs up/down on every response. Optional text correction ("This is wrong because..."). Copy-to-clipboard events as positive signal.
- Implicit: Time to next query (< 10s suggests the answer was insufficient). Session abandonment (user leaves without interacting). Follow-up questions that rephrase the original (suggests the first answer missed the mark).
LLM-as-judge: Sample 5-10% of production queries and run them through an evaluation prompt:
You are evaluating the quality of a RAG system's response.
Question: {question}
Retrieved Context: {chunks}
System Response: {answer}
Rate on these dimensions (1-5):
1. Correctness: Is the answer factually correct given the context?
2. Completeness: Does the answer fully address the question?
3. Citation quality: Are sources properly cited and relevant?
4. Clarity: Is the answer clear and well-structured?
Also flag:
- Any hallucinated claims (not supported by context)
- Missing information that was in the context but not in the answer
LLM-as-judge cost scales with your sampling rate and chosen model. At 5% sampling on 10M queries/day, that is 500K evaluations per day. Recalculate with current model pricing.
A/B testing: When testing a new prompt version, embedding model, or retrieval parameter, route 5-10% of traffic to the variant. Compare LLM-as-judge scores and user feedback rates between control and treatment. Require statistical significance (p < 0.05) before rolling out changes.
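For binary signals like thumbs-up rate, the significance check is a standard two-proportion z-test. A stdlib-only sketch (in practice you would likely use scipy or your experimentation platform; this shows the math):

```python
import math

def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success rates (e.g. thumbs-up %)
    between control (a) and treatment (b), using a pooled z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value via the normal CDF (expressed with erf)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Roll out the variant only when the p-value clears your threshold (p < 0.05 here) and the effect direction is positive.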
15.3 Continuous Improvement Loop
Evaluation data feeds back into the system:
- Low-rated answers are reviewed weekly by the platform team. Common failure patterns (bad chunking on a specific document type, missing source, etc.) become tasks.
- User corrections are added to the golden dataset, expanding coverage over time.
- Retrieval failures (queries where none of the top-10 chunks were relevant) trigger chunking quality audits on the source documents.
- Model upgrade evaluation: Before upgrading the LLM (e.g., Sonnet 3.5 to Sonnet 4), run the full golden dataset benchmark and compare. Only upgrade if quality improves or holds steady.
- Feedback into retrieval ranking. This is the loop most teams skip. When a user downvotes an answer, log which chunks were retrieved. Over time, chunks that consistently appear in downvoted answers get a negative signal. This can feed into: (a) chunk quality scoring (deprioritize low-quality chunks during re-ranking), (b) dynamic alpha adjustment (if code queries consistently get bad feedback, shift the BM25 weight higher for that query type), or (c) re-chunking triggers (if a specific document's chunks keep failing, flag it for re-chunking with a different strategy).
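The chunk-level negative signal in (a) can be as simple as an exponentially decayed downvote ratio per chunk, subtracted from the re-ranker score. A hypothetical sketch (class and method names are ours, not from any library):

```python
from collections import defaultdict

class ChunkFeedbackTracker:
    """Tracks how often each chunk appears in downvoted answers.
    Higher score = worse; re-ranking can subtract it as a penalty."""
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.score = defaultdict(float)

    def record(self, chunk_ids: list[str], downvoted: bool) -> None:
        # Exponential moving average of the downvote signal per chunk
        signal = 1.0 if downvoted else 0.0
        for cid in chunk_ids:
            self.score[cid] = self.decay * self.score[cid] + (1 - self.decay) * signal

    def penalty(self, chunk_id: str) -> float:
        return self.score[chunk_id]
```

Chunks whose penalty stays high over many sessions are the candidates for the re-chunking trigger in (c).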
Measure retrieval and generation separately. A bad answer has two possible causes: bad retrieval (the right chunks were not found) or bad generation (the right chunks were found but the LLM misused them). If you only measure end-to-end quality, you cannot tell which one failed. Track retrieval metrics (recall@10, MRR) and generation metrics (faithfulness, citation accuracy) independently. When quality drops, check retrieval first. If retrieval recall is fine, the problem is in generation (prompt, model, or guardrails). If retrieval recall dropped, the problem is upstream (chunking, embedding, or index health).
16. Observability
You cannot fix what you cannot see. When a user reports a bad answer, you need to know exactly where the pipeline went wrong: was it retrieval, re-ranking, context assembly, or generation?
16.1 Pipeline Tracing (OpenTelemetry)
Each query generates an OpenTelemetry trace that spans the entire pipeline:
Trace: query-12345
├── Span: query_understanding (50ms)
│ ├── Attribute: query_type = "how-to"
│ ├── Attribute: rewritten = true
│ └── Attribute: cache_hit = false
├── Span: retrieval (130ms)
│ ├── Span: vector_search (40ms)
│ │ └── Attribute: results_count = 50
│ ├── Span: keyword_search (15ms)
│ │ └── Attribute: results_count = 30
│ ├── Span: rrf_merge (2ms)
│ │ └── Attribute: merged_count = 65
│ ├── Span: acl_filter (5ms)
│ │ └── Attribute: filtered_out = 8
│ └── Span: rerank (68ms)
│ └── Attribute: top5_avg_score = 0.82
├── Span: generation (920ms)
│ ├── Attribute: model = "claude-sonnet-4-6"
│ ├── Attribute: input_tokens = 4200
│ ├── Attribute: output_tokens = 380
│ ├── Attribute: cost = $0.012
│ └── Attribute: ttft = 450ms
└── Span: guardrails (25ms)
├── Attribute: nli_check = "passed"
├── Attribute: citations_verified = true
└── Attribute: confidence = 0.85
This trace shows where time is spent and where quality signals come from. When a user reports a bad answer, pull the trace by query ID and see exactly what happened: what was retrieved, what the scores looked like, which model generated the answer, and what the guardrails caught (or missed).
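In a real deployment this is the OpenTelemetry SDK. To keep the example dependency-free, here is a simplified stand-in that captures the same span shape (name, duration, attributes) so the structure above is concrete:

```python
import time
from contextlib import contextmanager

class MiniTracer:
    """Simplified stand-in for OpenTelemetry spans: records each span's
    name, attributes, and duration in milliseconds."""
    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name: str, **attributes):
        start = time.perf_counter()
        record = {"name": name, "attributes": dict(attributes)}
        try:
            yield record  # caller can add attributes as the stage runs
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)

# Usage mirroring the trace above:
tracer = MiniTracer()
with tracer.span("retrieval") as s:
    s["attributes"]["results_count"] = 50
```

With the real SDK, `tracer.start_as_current_span(...)` and `span.set_attribute(...)` play these roles, and spans nest automatically into the tree shown above.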
16.2 Key Metrics Dashboard
| Metric | Granularity | Alert Threshold |
|---|---|---|
| End-to-end P50/P95/P99 latency | By model tier, by query type | P95 > 3s |
| TTFT (time to first token) | By model tier | P95 > 2s |
| Retrieval latency | By search type (vector, keyword, hybrid) | P99 > 300ms |
| Re-ranking latency | By re-ranker model | P99 > 200ms |
| Cache hit rate | Overall and by query pattern | Drop below 10% |
| Token usage (input + output) | By model, by team, by query type | Daily total > 2x average |
| Cost per query | By model tier | Exceeds baseline by 2x |
| Retrieval relevance (top-5 avg score) | Overall | Average drops below 0.6 |
| User feedback ratio (thumbs up %) | Overall, rolling 7-day | Drops below 70% |
| Hallucination rate (LLM-as-judge) | Rolling 7-day | Rises above 8% |
| Error rate (5xx) | By pipeline stage | > 0.1% |
| Embedding pipeline lag | Time since last processed document | > 30 minutes |
16.3 Cost Monitoring
LLM costs can spike unexpectedly. A single runaway agent loop, a prompt regression that increases output length, or a cache invalidation event can double costs overnight.
Per-query cost tracking: Every query logs its total cost (embedding + retrieval + re-ranking + generation). This is computed from token counts and model pricing, not estimated.
Cost anomaly alerting: If the rolling 1-hour cost exceeds 2x the expected baseline, trigger an alert. If it exceeds 5x, trigger a circuit breaker that routes all queries to the cheapest model tier until the issue is investigated.
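The threshold logic is deliberately simple. A sketch of the 2x/5x policy above (function name and return values are illustrative):

```python
def cost_action(rolling_1h_cost: float, baseline_1h_cost: float) -> str:
    """Map rolling-hour spend against baseline to an action:
    >5x opens the circuit breaker (cheapest tier only), >2x alerts."""
    if rolling_1h_cost > 5 * baseline_1h_cost:
        return "circuit_break"
    if rolling_1h_cost > 2 * baseline_1h_cost:
        return "alert"
    return "ok"
```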
Per-tenant budgets: Each team gets a monthly token budget. When 80% is consumed, the admin gets a warning. At 100%, queries are throttled (increased latency, not denied) unless the budget is increased.
16.4 Embedding Drift Detection
Embedding model behavior can change without warning. Provider-side updates, model deprecation, or subtle API changes can shift the vector space. If the embedding for "authentication" silently shifts by 10%, retrieval quality degrades without any obvious error.
Detection: Weekly, compute embeddings for a fixed set of 100 canonical queries. Compare cosine similarity to the reference embeddings computed when the system was last validated. If any query's embedding shifts by more than a threshold (typically cosine distance > 0.05), alert the team.
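A stdlib-only sketch of that weekly check (function names and the 0.05 threshold are as described above; everything else is illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def drifted_queries(reference: dict[str, list[float]],
                    current: dict[str, list[float]],
                    threshold: float = 0.05) -> list[str]:
    """Canonical queries whose fresh embedding moved more than `threshold`
    cosine distance from the validated reference embedding."""
    return [q for q, ref in reference.items()
            if cosine_distance(ref, current[q]) > threshold]
```

A non-empty result triggers the alert; a widespread shift usually means the provider changed the model and a re-validation (or re-index) is due.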
17. Production Architecture and Bottlenecks
17.1 Scaling Strategy
| Component | Scaling Dimension | Trigger | Mechanism |
|---|---|---|---|
| Qdrant | Shard count | Vector count > 5M per shard | Add shard, rebalance |
| Elasticsearch | Node count | Index size > 50 GB per node | Add node, rebalance |
| Embedding workers | Worker count | Kafka lag > 10,000 | HPA on Kubernetes |
| LLM inference pool | Concurrent requests | Queue depth > 50 | Scale vLLM replicas |
| API servers | Pod count | QPS > 100 per pod | HPA on CPU/QPS |
Load shedding: Under extreme load (> 1,500 QPS sustained), the platform progressively degrades:
- Disable agentic RAG (route all queries to single-shot).
- Reduce re-ranking from top-20 to top-5.
- Force all queries to the cheapest model tier.
- Disable LLM-as-judge sampling.
- As a last resort, serve cached-only responses and return "system under heavy load" for cache misses.
Each step is triggered by a progressively higher load threshold. The user always gets a response. The response quality degrades gracefully.
17.2 Failure Handling
| Failure | Detection | Mitigation | Recovery |
|---|---|---|---|
| LLM provider outage | API errors > 5% in 1 min | Failover to secondary provider | Auto-retry primary every 30s |
| Qdrant node failure | Health check timeout | Read from replica shard | Node auto-restarts, shard rebalances |
| Elasticsearch down | Health check timeout | Fall back to vector-only search | Cluster self-heals |
| Embedding API outage | API errors | Queue documents in Kafka, process later | Backfill when API recovers |
| Redis cache failure | Connection timeout | Skip cache, full pipeline for every query | Reconnect, warm cache gradually |
| PostgreSQL failure | Connection pool errors | Read from replica for permissions | Primary failover (managed DB) |
Key principle: The system should always return something useful. Never show a blank error page.
Degradation ladder: When components fail, the system steps down through progressively simpler modes. Each level still returns a useful response:
Level 0: Full pipeline (normal operation)
↓ re-ranker timeout or error rate > 10%
Level 1: Skip re-ranking (use RRF scores directly, ~5-10% quality drop)
↓ Qdrant unavailable
Level 2: BM25-only retrieval (keyword search still works, semantic matching lost)
↓ Elasticsearch also down
Level 3: Serve cached responses only (Redis still up, covers 5-25% of queries)
↓ Redis also down
Level 4: Static fallback (return links to top-50 most-accessed docs with search bar)
Circuit breakers per dependency: Each external dependency (LLM API, embedding API, Qdrant, Elasticsearch, Cohere re-ranker) has its own circuit breaker. When error rate exceeds 50% over a 30-second window, the breaker opens and the system skips that component for 60 seconds before retrying. This prevents cascading failures where one slow dependency backs up the entire pipeline.
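A minimal sliding-window breaker implementing the 50%-over-30s open / 60s cooldown policy (with an added minimum-sample guard, our assumption, so a single early error does not trip it; all names are ours):

```python
import time

class CircuitBreaker:
    """Opens when the error rate over a sliding window exceeds a threshold;
    the dependency is skipped for cooldown_s seconds, then retried."""
    def __init__(self, error_threshold=0.5, window_s=30, cooldown_s=60,
                 min_samples=5, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples
        self.clock = clock          # injectable for testing
        self.events = []            # (timestamp, ok) pairs
        self.opened_at = None

    def record(self, ok: bool) -> None:
        now = self.clock()
        self.events.append((now, ok))
        # Keep only events inside the sliding window
        self.events = [(t, o) for t, o in self.events if now - t <= self.window_s]
        errors = sum(1 for _, o in self.events if not o)
        if (len(self.events) >= self.min_samples
                and errors / len(self.events) > self.error_threshold):
            self.opened_at = now

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Half-open: let one retry through and reset the window
            self.opened_at = None
            self.events = []
            return True
        return False
```

One breaker instance per dependency (LLM API, embedding API, Qdrant, Elasticsearch, re-ranker) keeps a slow component from backing up the whole pipeline.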
Multi-region note: This design assumes a single region. For 99.99% availability or regulatory requirements, you need active-passive replication of Qdrant and Elasticsearch across regions, with Kafka MirrorMaker for cross-region event streaming. That adds significant complexity and cost. For most enterprise deployments, 99.9% in a single region is sufficient.
17.3 Multi-Tenant Isolation
Data isolation: Each tenant (engineering team) gets its own namespace in Qdrant and permission-scoped queries in Elasticsearch. Cross-tenant data leakage is prevented by mandatory tenant_id filtering on every query.
Cost isolation: Token usage and costs are tracked per tenant. Monthly budgets are enforced at the API gateway level. One team's spike in usage does not affect other teams' latency because rate limiting is per-tenant.
Performance isolation: Noisy neighbor protection via per-tenant rate limiting (token bucket algorithm). If the platform engineering team decides to index 500K new documents in one day, the ingestion pipeline's Kafka partitioning ensures this does not slow down query serving for other teams.
17.4 Rate Limiting and Backpressure
Query rate limiting: Token bucket per tenant. Default: 10 QPS per team, burstable to 50 QPS for 30 seconds. Adjustable per team based on size and usage patterns.
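A plain token bucket approximates this policy: tokens refill at the sustained rate and the bucket capacity bounds the burst. A sketch (the clock injection is for testability; parameters map to the per-tenant defaults above):

```python
import time

class TokenBucket:
    """Per-tenant rate limiter: refill at `rate` tokens/sec up to `burst`
    capacity. A request is allowed only if a token is available."""
    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = burst
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```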
Ingestion backpressure: When Kafka consumer lag exceeds a threshold, the embedding pipeline signals connectors to slow down. This prevents an unbounded queue from building up during bulk ingestion events.
Priority queues: Three priority levels:
| Priority | Traffic Type | Under Load Behavior |
|---|---|---|
| P0 (critical) | Interactive user queries | Always served; may route to smaller model |
| P1 (normal) | Slack bot queries, IDE integration | Queued; dropped after 10s timeout |
| P2 (background) | Batch eval, re-indexing, cache warming | Paused entirely under load; resumed when QPS drops below threshold |
When the LLM queue grows: If queue depth exceeds 50 pending requests: (1) force all new queries to the cheapest model tier, (2) drop P1 and P2 traffic, (3) if queue still grows past 200, return cached or degraded responses for P0. This prevents a 30-second wait that makes users think the system is broken.
17.5 Bottlenecks and Mitigation
| Bottleneck | Symptom | Relief |
|---|---|---|
| LLM inference at peak | P95 latency > 3s, queue depth grows | Model routing (80% to small model), semantic caching, pre-computed answers for top-100 queries |
| Vector search with complex ACL filters | Retrieval P99 > 200ms | Payload indexes on permission fields, consider per-tenant collections for large tenants |
| Embedding pipeline backlog | Document freshness > 15 min SLA | Scale embedding workers horizontally, batch size optimization, parallel processing |
| Context window limits | Answers miss relevant info that was retrieved | Chunk summarization before context assembly, priority-based chunk selection, larger context models for complex queries |
| Re-ranking latency on large result sets | Re-rank step > 150ms | Reduce initial retrieval from top-100 to top-30, self-hosted GPU re-ranker with batching |
| Cache cold starts (Monday morning, after deployments) | Cache hit rate drops to 0%, latency spikes | Pre-warm cache with top-1000 queries from previous week, gradual rollout after cache invalidation |
| Cross-encoder model loading | First query after cold start takes 5-10s | Keep re-ranker model warm with periodic health check queries, pre-load on pod startup |
18. Cost Analysis
Model pricing changes quarterly. Specific dollar amounts go stale within months. This section covers the cost patterns that hold true even as prices change.
Cost Structure
LLM inference typically represents 60-85% of total platform cost. Everything else — vector databases, Elasticsearch, Kafka, Redis, embedding generation, re-ranking — is a rounding error in comparison. This ratio has held steady even as absolute prices have dropped.
Approximate cost breakdown by category:
| Category | % of Total Cost | What Drives It |
|---|---|---|
| LLM inference | 60-85% | Query volume, model tier mix, average prompt size |
| Infrastructure (vector DB, ES, Kafka, Redis, PG) | 5-15% | Corpus size, query throughput, replication factor |
| Evaluation (LLM-as-judge) | 3-10% | Sampling rate, evaluator model choice |
| Embedding generation | < 1% | Only spikes during full re-indexing |
| Re-ranking | < 1% | Fixed per-query cost, scales linearly |
Model Routing: The Biggest Cost Lever
Not every query needs a frontier model. A tiered routing strategy cuts LLM costs by 50-70% depending on query distribution and classification accuracy.
The key insight: 60-80% of enterprise queries are simple factual lookups that a small, fast model handles perfectly. Only 5-10% require frontier-model reasoning. Route based on query complexity, not uniformly.
| Tier | Query Types | Relative Cost | % of Traffic |
|---|---|---|---|
| Fast | Simple factual, definitions, single-doc answers | 1x (baseline) | ~70-80% |
| Standard | How-to, moderate reasoning, multi-chunk synthesis | 5-10x | ~15-25% |
| Complex | Multi-hop reasoning, comparative analysis | 15-30x | ~5-10% |
Caching: The Second Cost Lever
Semantic caching avoids redundant LLM calls entirely. Enterprise knowledge platforms typically see 5-25% cache hit rates, with higher rates at larger organizations where query patterns repeat. Each cache hit saves the full LLM inference cost for that query.
Self-Hosted vs API: Decision Logic
| Component | Use API When | Self-Host When |
|---|---|---|
| LLM generation | Low-to-moderate query volume; want zero ops burden | High query volume; data sovereignty required; need custom fine-tuning |
| Embeddings | Moderate corpus size; infrequent re-indexing | Large corpus; continuous re-indexing; data cannot leave your network |
| Re-ranking | Almost always — low cost, high value | Data cannot leave your network; need domain-specific fine-tuning |
The crossover point where self-hosting LLM inference becomes cheaper than APIs depends on your query volume, GPU pricing, and engineering ops cost. As a rule of thumb: below ~2M queries/month, APIs are almost always cheaper because you are not paying for idle GPUs. Above ~5M queries/month, self-hosting typically saves 30-60% on LLM costs but requires dedicated MLOps capacity.
Check current pricing from your providers and run the math for your specific scale before committing.
19. Security and Governance
For document-level access control implementation (pre-filtering vs post-filtering, permission sync), see Section 11.4.
Prompt Injection Defense
Internal users are less likely to attempt prompt injection than external users, but it still happens accidentally. An engineer pastes a document containing "ignore all previous instructions" into a query, and the model complies.
Defense layers:
- Input sanitization: Strip known injection patterns from queries. Regex-based, not foolproof, but catches obvious cases.
- Instruction hierarchy: The system prompt uses a clear hierarchy: system instructions > retrieved context > user query. Modern models respect this hierarchy when explicitly told to.
- Output validation: The guardrails layer (Section 14.3) catches responses that deviate from expected format or contain unexpected instructions.
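The regex layer of the defenses above can be sketched in a few lines. The patterns below are hypothetical examples, and as noted, regex alone catches only the obvious cases:

```python
import re

# Illustrative patterns only; real deployments maintain a larger, curated list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now\b", re.IGNORECASE),
]

def flag_injection(query: str) -> bool:
    """Flag queries containing known injection phrasings for sanitization
    or closer scrutiny before they reach the prompt."""
    return any(p.search(query) for p in INJECTION_PATTERNS)
```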
PII and Secrets Handling
Internal documents frequently contain PII, credentials, and secrets. The platform must not amplify their exposure.
Ingestion-time scanning: Before indexing, scan documents for patterns matching API keys, passwords, tokens, and PII (emails, phone numbers). Flag but do not block. Store a contains_sensitive flag on the chunk. During retrieval, warn the user if the answer sources contain sensitive content.
Response-time redaction: Before returning a response to the user, scan for secret patterns (AWS keys, GitHub tokens, database passwords). Redact with [REDACTED - credential detected]. This prevents the LLM from inadvertently surfacing credentials in its answers.
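A sketch of that response-time redaction pass. The patterns are illustrative examples; production scanners use far larger rule sets (detect-secrets-style detectors, entropy checks):

```python
import re

# Hypothetical pattern subset: AWS access key IDs, GitHub PATs,
# and connection strings with embedded passwords.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"ghp_[A-Za-z0-9]{36}"),
    re.compile(r"postgres://\S+:\S+@\S+"),
]

def redact_secrets(text: str) -> str:
    """Replace any matched credential with the redaction marker before
    the response leaves the platform."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED - credential detected]", text)
    return text
```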
Audit Logging
Every query, retrieval result, generation, and feedback event is logged to an append-only audit store. Logs include:
- Who asked the question (authenticated user ID)
- What documents were retrieved and shown
- What answer was generated
- What feedback was given
- What model and prompt version were used
These logs serve compliance requirements and enable forensic analysis if a data access incident occurs. Retention: 1 year minimum, per organizational policy.
For a deeper dive on LLM security patterns, prompt injection defense, and data privacy architectures, see the LLM data privacy and security guide.
20. Where This System Fails in Production
No architecture post is complete without an honest look at how it breaks. These are the failure chains we have seen or heard about from teams running RAG systems at scale:
| Failure Chain | What Happens | How You Detect It |
|---|---|---|
| Bad chunking -> wrong retrieval -> confident wrong answer | A design doc gets split mid-paragraph. Retrieval returns a chunk that mentions "auth" but lacks the actual flow. The LLM generates a plausible-sounding but incorrect answer, confidently cited. | Low faithfulness scores in LLM-as-judge; user downvotes on specific document types |
| ACL bug -> data leak | A permission sync fails silently. An engineer on team A sees answers sourced from team B's confidential docs. One incident like this kills platform trust permanently. | Cross-tenant query audits; periodic permission reconciliation checks |
| Embedding drift after model upgrade | Provider silently updates the embedding model. Query vectors shift slightly. Retrieval recall degrades by 5-10% with no error, no alert. Answers just get subtly worse. | Weekly canonical query drift detection (Section 16.4) |
| Cache serving stale answers | A runbook gets updated but the cache still serves the old answer for 24 hours. An engineer follows outdated instructions during an incident. | Cache-to-source freshness monitoring; stale-but-serveable flagging |
| Query rewriting makes things worse | The rewriter expands "k8s OOM" to "Kubernetes out of memory error handling best practices." The expanded query retrieves generic content instead of the specific internal runbook. | A/B test rewriter on vs off; monitor retrieval scores with and without rewriting |
| Agent loop burns tokens without converging | An agentic query keeps searching, finding slightly different but never sufficient context. Three iterations later, answer is no better than single-shot. | Per-query cost tracking; agent iteration count dashboards; circuit breaker alerts |
The checklist in Section 21 catches most of these before they hit production. But some will slip through. Your RAG system will fail. The question is whether you find out from your dashboards or from an angry Slack message.
21. Production Readiness Review Checklist
Every major engineering organization runs production readiness reviews before shipping systems. Amazon calls theirs an Operational Readiness Review (ORR). Google's SRE team has a production readiness checklist. Stripe, Uber, and similar companies use internal design review templates. They all evaluate the same dimensions: reliability, scalability, latency, data correctness, security, observability, and cost.
No universal checklist exists for RAG/LLM systems specifically. The ten areas below are adapted from these industry practices, extended with AI-specific concerns (retrieval quality, hallucination control, embedding lifecycle, agent guardrails) that traditional readiness reviews do not cover.
A note on weighting. The point values below are not industry-standard weights. No such standard exists. The importance of each area depends on your system. For RAG/LLM platforms specifically:
| Area | Weight for RAG Systems | Why |
|---|---|---|
| Retrieval quality | Very high | Bad retrieval = bad answers. No LLM fixes this. |
| Hallucination control | Very high | Wrong answers are worse than no answers. |
| Cost and observability | High | LLM inference costs dominate and can spike without warning. |
| Security and governance | High | Internal docs contain sensitive data; access control is non-negotiable. |
| Resilience | Medium-high | Degraded answers are acceptable; total outages are not. |
| Operational readiness | Medium | Important but less unique to RAG than to any production system. |
Treat this checklist as a guiding framework. The scores tell you where your gaps are. The specific thresholds are directional, not pass/fail gates.
How to Score
Each check: 0 (missing), 1 (partial), 2 (complete).
| Score | Readiness |
|---|---|
| 120+ | Production-ready. Ship with confidence. |
| 90-119 | Ship with known gaps documented. Address the gaps within 30 days. |
| 60-89 | Not ready. Critical gaps likely in retrieval quality, safety, or observability. |
| < 60 | Prototype stage. Needs significant work before production traffic. |
21.1 Ingestion and Chunking (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 1 | Multiple chunking strategies per document type | Not one-size-fits-all; at least 3 strategies active | 8.6 |
| 2 | Chunk quality validation | Manual review of 100+ random chunks; retrieval recall measured | 8 |
| 3 | Incremental ingestion with change detection | Not full re-index on every update; webhook + polling reconciliation | 7.1 |
| 4 | Document deduplication | Content-hash or SimHash dedup pipeline active | 7.3 |
| 5 | Metadata extraction | Author, date, permissions, source extracted and stored | 7.2 |
| 6 | Embedding model versioned | Version tracked; blue-green re-indexing strategy documented | 9.3 |
| 7 | Ingestion latency SLA defined and monitored | Freshness SLA (e.g., 15 min) measured in dashboards | 16.2 |
| 8 | Failure handling | Poison documents don't block the pipeline; dead letter queue active | 7 |
| 9 | Source connector health monitoring | Each connector reports status; alerts on sync failures or quota exhaustion | 7.1 |
| 10 | Backfill and replay capability | Can re-process any source from a specific date; Kafka replay or equivalent | 7.1 |
21.2 Retrieval Quality (18 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 11 | Hybrid retrieval | BM25 + vector search with score fusion (RRF or weighted) | 11.2 |
| 12 | Re-ranking with cross-encoder | Top-k results re-ranked before context assembly | 11.3 |
| 13 | Retrieval recall measured | Recall@10 > 80% on golden dataset | 15.1 |
| 14 | Access control filtering | Pre-filtering on permissions; no leaked documents in results | 11.4 |
| 15 | Query rewriting/expansion | Ambiguous queries rewritten before retrieval | 11.1 |
| 16 | "Lost in the middle" mitigation | Chunk ordering optimized in context window | 11.5 |
| 17 | Retrieval latency P99 < 350ms | Measured in production dashboards | 16.2 |
| 18 | Recency weighting for time-sensitive queries | Recent docs boosted for queries with temporal signals | 11.1 |
| 19 | Multi-source routing | Agent selects relevant sources instead of searching everything | 12.4 |
21.3 Generation and Hallucination Control (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 20 | Model routing active | Not every query hits the most expensive model; at least 2 tiers | 13.1 |
| 21 | Grounding instruction in system prompt | "Answer only from retrieved context" explicitly stated | 14.1 |
| 22 | Inline citations with source links | Every response includes [Source N] with clickable URLs | 14.2 |
| 23 | "I don't know" path | Low confidence triggers explicit uncertainty message | 14.1 |
| 24 | Post-generation factual consistency check | NLI or citation verification on Standard/Complex tier | 14.3 |
| 25 | Streaming responses | SSE streaming with sub-2s TTFT | 13.3 |
| 26 | Context window budget documented | Token allocation for system + context + history + output | 13.4 |
| 27 | Structured output enforcement | JSON mode or schema validation on citations | 14.4 |
| 28 | Prompt versioning with rollback | Prompts stored in version control; can revert within minutes | 13.2 |
| 29 | Conversation memory management | Multi-turn context handled via sliding window or summarization | 13.4 |
21.4 Agentic RAG and Tool Use (12 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 30 | Complex queries routed to agentic path | Query complexity classifier active; not all queries single-shot | 12.1 |
| 31 | Query decomposition for multi-hop questions | LLM decomposes complex queries into sub-queries | 12.2 |
| 32 | Max iteration and token budget caps | Hard limits on iterations (3-5), tool calls (10-15), tokens (50K) | 12.6 |
| 33 | Circuit breaker on agent failure | Fallback to single-shot RAG on loop detection or timeout | 12.6 |
| 34 | Tool interface standardized | MCP or equivalent; tools discoverable and schema-validated | 12.5 |
| 35 | Agent audit trail | Every tool call and reasoning step logged with trace ID | 12.5 |
21.5 Evaluation and Feedback (18 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 36 | Golden dataset maintained | 500+ curated Q&A pairs; updated quarterly | 15.1 |
| 37 | Offline regression tests | Run on every pipeline change; blocks deploy on regression | 15.1 |
| 38 | User feedback collected | Thumbs up/down at minimum; corrections optional | 15.2 |
| 39 | LLM-as-judge on production traffic | 5-10% of queries evaluated automatically | 15.2 |
| 40 | A/B testing framework | Prompt/model/retrieval changes tested on subset of traffic | 15.2 |
| 41 | Weekly failure analysis | Low-rated answers reviewed; action items tracked | 15.3 |
| 42 | Retrieval and generation metrics tracked over time | Not just point-in-time; trend dashboards active | 16.2 |
| 43 | Eval dataset covers all query types proportionally | Factual, how-to, analytical, multi-hop all represented | 15.1 |
| 44 | Domain-specific eval metrics defined | Metrics tailored to your use case beyond generic RAGAS | 15.1 |
21.6 Observability and Cost (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 45 | OpenTelemetry traces across full pipeline | Trace spans for every stage: query, retrieval, re-rank, generation | 16.1 |
| 46 | Latency breakdown by stage | Per-stage P50/P95/P99 in dashboards | 16.2 |
| 47 | Token usage and cost per query | Tracked by model, by team, by query type | 16.3 |
| 48 | Cost anomaly alerting | Circuit breaker on cost spike | 16.3 |
| 49 | Cache hit rate monitored | Target 5-25%; alert on sustained drop below 3% | 10.5 |
| 50 | Embedding drift detection | Weekly canonical query check; alert on shift > 0.05 cosine distance | 16.4 |
| 51 | Per-tenant rate limiting | Token bucket enforced at API gateway | 17.4 |
| 52 | Daily cost reports by team/tenant | Automated reporting; budget enforcement active | 16.3 |
| 53 | SLA dashboard for stakeholders | Uptime, latency, quality metrics visible to consumers and leadership | 16.2 |
| 54 | Alerting runbooks per alert type | Every alert has a documented response procedure | 16.2 |
21.7 Resilience and Production Hardening (12 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 55 | Multi-provider LLM failover | Primary + secondary + self-hosted fallback chain | 13.1 |
| 56 | Retrieval fallback chain | Hybrid -> keyword-only -> cached responses | 17.2 |
| 57 | Embedding pipeline failure isolation | Stale index served; freshness SLA breach alerted | 17.2 |
| 58 | Load shedding and graceful degradation | Progressive degradation under extreme load | 17.1 |
| 59 | Multi-tenant data isolation validated | Cross-tenant query returns zero foreign results | 17.3 |
| 60 | Prompt injection defense active | Input sanitization + instruction hierarchy + output validation | 19 |
21.8 Security and Governance (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 61 | PII detection on ingested documents | Scanner flags PII before indexing | 19 |
| 62 | Secrets and credential redaction | API keys, tokens, passwords detected and redacted in responses | 19 |
| 63 | Audit logging for all queries | Who asked, what was retrieved, what was generated, all logged | 19 |
| 64 | RBAC on admin operations | Index management, prompt editing, eval dataset changes require authorized roles | 19 |
| 65 | Data retention policy enforced | Query logs, feedback, and cached responses expire per policy | 19 |
| 66 | Source permission sync validated | Permissions propagate to vector store within SLA | 11.4 |
| 67 | LLM provider data processing agreements | Query/response data not used for training; DPAs signed | 19 |
| 68 | Input sanitization beyond regex | Layered injection defense | 19 |
| 69 | Sensitive document flagging | Documents marked contains_sensitive at ingestion | 19 |
| 70 | Compliance review completed | Legal/security team has reviewed data flows and access patterns | 19 |
21.9 Data Quality and Freshness (16 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 71 | Stale document detection | Documents not updated in 6+ months flagged | 7.1 |
| 72 | Source connector health dashboard | Each source shows last sync time, error rate, document count | 7.1 |
| 73 | Broken link and dead reference cleanup | Periodic scan detects deleted/moved content | 7.3 |
| 74 | Document quality scoring | Low-quality chunks detected and quarantined | 8 |
| 75 | Per-source freshness SLA | Each source has a defined freshness target; monitored | 7.1 |
| 76 | Corpus coverage tracking | Dashboard shows document count by source, team, and age | 7 |
| 77 | Re-indexing on schema/format changes | Ingestion pipeline adapts without data loss | 7.2 |
| 78 | Chunk-to-source lineage queryable | Given any chunk, can trace back to exact source doc and version | 8.6 |
21.10 Operational Readiness (14 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 79 | Runbooks for top 5 failure scenarios | LLM outage, vector DB degradation, embedding stall, cost spike, permission leak | 17.2 |
| 80 | On-call rotation defined | Named owners for platform incidents; escalation path documented | 17 |
| 81 | Disaster recovery tested | Full restore from backup validated within target RTO | 17.2 |
| 82 | Capacity planning documented | Growth projections for queries, documents, and cost | 6 |
| 83 | Deployment pipeline with rollback | Canary or blue-green deploy; rollback under 5 min | 17.1 |
| 84 | Chaos/failure injection tested | At least one failure scenario tested in staging | 17.2 |
| 85 | Onboarding documentation for new teams | Self-service guide for connecting new document sources | 7 |
Total: 85 checks, 170 points maximum (2 points per check).
This checklist covers ten dimensions of production readiness, adapted from the areas that Amazon's Operational Readiness Review (ORR), Google's SRE production readiness reviews, and design reviews at companies like Stripe and Uber evaluate, extended with RAG/LLM-specific concerns. The score is not a certification; it is a gap analysis. The checks you score 0 on tell you exactly where to invest next. Run the review quarterly as your system evolves, your corpus grows, and new model capabilities become available.
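Scoring the review is mechanical once each check has a value. A small sketch, assuming the 0/1/2 scoring implied by the 2-point-per-check maximum (0 = missing, 1 = partial, 2 = fully met — a labeling convention assumed here, not mandated above):

```python
def gap_analysis(scores: dict[int, int]) -> dict:
    """Summarize a readiness review: total score, maximum possible, and
    the check numbers scored 0 -- the gaps that mark where to invest next."""
    assert all(0 <= s <= 2 for s in scores.values()), "each check scores 0-2"
    total = sum(scores.values())
    gaps = sorted(n for n, s in scores.items() if s == 0)
    return {"score": total, "max": 2 * len(scores), "gaps": gaps}
```

Tracking the output of this quarterly (score trend plus which gaps closed) is more useful than any single absolute number.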
Explore the Technologies
Many of the technologies referenced in this post have dedicated deep-dive pages on this site:
- RAG for retrieval-augmented generation fundamentals, chunking strategies, and evaluation frameworks
- Vector Databases for HNSW internals, distance metrics, and vendor comparisons
- vLLM for PagedAttention, continuous batching, and self-hosted inference optimization
- LangChain for LCEL, LangGraph agent workflows, and LangSmith observability
- MCP Server for Model Context Protocol architecture, transports, and security model