RAG and LLM Platform at Scale: Ingestion, Retrieval, Generation, and Evaluation for 10M Queries/Day
Goal: Build an enterprise knowledge assistant that serves 10M queries per day across 500+ engineering teams. The platform ingests 2M+ internal documents (API docs, runbooks, design docs, RFCs, Confluence pages, Slack threads, code repositories), lets engineers ask natural language questions, and returns accurate, cited answers grounded in internal knowledge. Think of it as an internal Perplexity for engineering. ~1.5-2.5s P95 latency depending on query complexity, single-digit hallucination rate with strict grounding and abstention, and under $0.02 per query average cost. All metrics in this post are directional estimates. Costs change quarterly, latency depends on infrastructure, and retrieval quality varies by corpus. Treat numbers as calibrated starting points for your own sizing, not specifications.
Reading guide: This is a long, detailed deep dive. You don't need to read it linearly.
Sections 1-2: Problem framing and requirements
Sections 3-6: Architecture overview, design principles, technology selection, and capacity planning
Sections 7-9: Document ingestion, chunking strategies (the deepest section), and embedding pipeline
Sections 10-11: Storage architecture and retrieval pipeline
Section 12: Agentic RAG with MCP-based tool use
Section 13: Generation layer (model routing, prompt engineering, streaming)
Section 14: Hallucination mitigation and citation enforcement
Sections 15-16: Evaluation systems and observability
Sections 17-21: Production architecture, cost analysis, security, failure modes, and a reusable production readiness checklist
New to RAG? Start with Sections 1-2 for the problem context, then Section 3 for the architecture overview. Read Section 8 carefully for chunking strategies. Skip to Section 21 for the production readiness checklist.
Building something similar? Sections 8-12 have the implementation details you need. Section 18 covers cost reasoning.
Preparing for a system design interview? Sections 1-6 cover what interviewers expect. Section 12 (agentic RAG) and Section 17 (production architecture) are common follow-up topics.
TL;DR: A production RAG platform handling 10M queries/day across 2M+ documents for 500+ engineering teams. Multi-strategy chunking (recursive, semantic, late chunking) produces 10M chunks stored in Qdrant with HNSW indexing. Hybrid retrieval (BM25 + dense vectors) with cross-encoder re-ranking, P99 retrieval latency typically 200-350ms depending on filter complexity. Agentic RAG with MCP-based tool use handles complex multi-hop queries through iterative retrieval and query decomposition. In practice, 60-80% of queries can be handled by smaller models, cutting LLM costs by 50-70%. Baseline hallucination rate of 8-15% in typical enterprise RAG deployments, reduced to single digits with strict grounding, citation enforcement, and answer abstention. LLM-as-judge evaluation on 5-10% of production traffic feeds a continuous improvement loop. The hardest problems: chunking quality for heterogeneous documents, access control in vector search, embedding model upgrades without downtime, and keeping hallucination rates low as the corpus grows. All metrics in this post are based on aggregated industry benchmarks and production observations. Actual results vary significantly by corpus quality, query distribution, infrastructure choices, and current model pricing.
1. Problem Statement
A few clarifications before getting into the architecture.
Scale context: This design targets a large enterprise with 500+ engineering teams, 5,000+ engineers, and a corpus of 2M+ documents spread across a dozen systems. That is not hypothetical. Companies like Stripe, Uber, and Google have internal knowledge bases of this size, and the search problem only gets worse as the organization grows. Ask any engineer at a company this size how much time they spend just finding things. It is a lot. Every developer experience survey confirms it, even at companies with world-class search infrastructure.
The 10M queries/day figure assumes 5,000 engineers making an average of 10-15 queries per workday, plus automated queries from CI/CD pipelines, Slack bots, and IDE integrations; the automated traffic supplies the large majority of the volume. Peak traffic hits around 10am-2pm in each timezone, roughly 3-4x the average QPS.
Document landscape:
| Source | Document Type | Count | Update Frequency | Challenges |
|---|---|---|---|---|
| Confluence | Design docs, runbooks, ADRs | 800K pages | 50K updates/month | Deep nesting, stale pages, mixed formatting |
| GitHub | README files, code comments, PRs | 500K files | 200K updates/month | Code-text boundary, rapid churn |
| Slack | Thread discussions, incident channels | 400K threads | 100K new/month | Noisy, conversational, context-dependent |
| Google Docs | RFCs, specs, meeting notes | 200K docs | 80K updates/month | Access controls, revision history |
| Internal wikis / S3 | PDFs, diagrams, legacy docs | 100K files | 20K updates/month | Unstructured, poor metadata |
Why naive approaches fail:
The first instinct is to dump everything into a giant context window. Modern models support 128K-200K tokens. But 2M documents at an average of 2,000 tokens each is 4 billion tokens. That is roughly 20,000x larger than the biggest context window available. Even if you could fit it, the per-query cost would be prohibitive. At 10M queries per day, the daily bill would be astronomical. Pricing changes, but the math never works for full-context approaches at this scale.
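As a quick sanity check on that math:

```python
docs = 2_000_000
avg_tokens_per_doc = 2_000
corpus_tokens = docs * avg_tokens_per_doc   # 4 billion tokens
largest_context = 200_000                   # tokens

print(corpus_tokens)                    # 4000000000
print(corpus_tokens // largest_context) # 20000 -> ~20,000x larger than the window
```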
The second instinct is keyword search. Elasticsearch on internal docs, done. But keyword search fails on intent. "How do I handle auth for the payments service?" often fails to match effectively against a document titled "OAuth2 Integration Guide for Payment Gateway" because the keywords don't align well. Engineers give up after 2-3 failed keyword searches and go ask a colleague instead, which does not scale.
RAG solves this by combining semantic understanding (embeddings capture meaning, not just keywords) with grounded generation (the LLM answers from retrieved documents, not from its training data). But a production RAG system has about twenty things that can go wrong between "user types a question" and "user gets a useful, accurate, cited answer." This post covers all of them.
Assumptions:
- The platform serves internal engineers only (not customer-facing). This simplifies some safety requirements but raises the bar on accuracy because engineers will notice and lose trust quickly.
- Documents are primarily English text with code snippets. Multi-language support is out of scope.
- Access controls from source systems (Confluence spaces, GitHub repos, Google Drive sharing) must be respected. An engineer should never see answers sourced from documents they cannot access.
- The platform team operates the infrastructure. Individual teams contribute documents by connecting their tools.
Scope:
- In scope: Natural language Q&A with citations, document ingestion from 5+ sources, hybrid retrieval, agentic RAG for complex queries, model routing, evaluation pipeline, observability, multi-tenant isolation.
- Out of scope: Document authoring or editing, code generation (use AI coding assistants for that), real-time collaboration, customer-facing chatbot (different safety requirements), training custom foundation models.
2. Requirements
2.1 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Natural language Q&A with grounded, cited answers from internal documents | P0 |
| FR-02 | Source attribution: every claim links to the specific document and section it came from | P0 |
| FR-03 | Multi-format document ingestion (Confluence, GitHub, Slack, Google Docs, S3) | P0 |
| FR-04 | Access control: answers only reference documents the querying user can access | P0 |
| FR-05 | Document freshness: updates reflected in search within 15 minutes | P0 |
| FR-06 | Hybrid retrieval: semantic search combined with keyword matching | P0 |
| FR-07 | Streaming responses with time to first token meeting the Section 2.2 latency targets | P0 |
| FR-08 | Conversational follow-ups: multi-turn context within a session | P1 |
| FR-09 | User feedback collection: thumbs up/down, copy events, corrections | P1 |
| FR-10 | Multi-tenant isolation: teams see only their authorized content | P1 |
| FR-11 | Admin dashboard: query volume, quality metrics, cost breakdown, retrieval performance | P1 |
| FR-12 | Agentic RAG: complex multi-hop queries handled via iterative retrieval and query decomposition | P1 |
| FR-13 | Code-aware search: understand code snippets, function signatures, and API references | P2 |
| FR-14 | Multi-modal support: extract and search content from diagrams and screenshots | P2 |
2.2 Non-Functional Requirements
| Requirement | Target |
|---|---|
| End-to-end latency (time to first token) | P50 ~1.0-1.5s, P95 ~1.5-2.5s, P99 < 4s (varies by query complexity) |
| Retrieval latency (search + re-rank) | P50 < 150ms, P99 < 350ms |
| Query throughput | 500 QPS sustained, 1,500 QPS burst |
| Document ingestion throughput | 500K doc updates/month processed within SLA |
| Document freshness | Updates searchable within 15 minutes |
| Availability | 99.9% (8.7 hours downtime/year) |
| Answer accuracy (golden dataset) | > 85% correct on curated Q&A benchmark (definition of "correct" varies; measure per query type) |
| Hallucination rate | 8-15% baseline; can be reduced to single digits with strict citation enforcement + abstention (varies by corpus quality; measured via LLM-as-judge) |
| Cost per query (average) | Minimize through model routing and caching (see Section 18) |
| Embedding re-indexing | Full corpus re-embedded within 72 hours |
| Data isolation | Zero cross-tenant document leakage |
Architecture in One Minute
A RAG system is mostly a data, retrieval, and evaluation problem. The LLM does the last 20% of the work. The platform has six layers, each with a distinct job:
- Ingestion layer. Connectors pull documents from Confluence, GitHub, Slack, Google Docs, and S3. Parsers normalize everything to structured markdown with metadata. Change detection ensures incremental updates, not full re-ingestion.
- Chunking and embedding layer. A multi-strategy chunking pipeline splits documents based on type: recursive chunking for structured docs, semantic chunking for long-form prose, AST-aware chunking for code. An embedding pipeline converts chunks to vectors using a versioned embedding model.
- Storage layer. Qdrant stores vectors with HNSW indexing. Elasticsearch handles BM25 keyword search. PostgreSQL tracks document metadata, permissions, and chunk lineage. Redis provides semantic caching.
- Retrieval layer. Query understanding classifies, rewrites, and optionally expands the query before search. Hybrid search combines BM25 and dense vector retrieval in parallel. Reciprocal Rank Fusion merges results. A cross-encoder re-ranks the top candidates. Access control filtering ensures users only see documents they have permission to view.
- Generation layer. A model router classifies query complexity and picks the right LLM (small model for simple lookups, large model for complex reasoning). Prompts are assembled dynamically with retrieved context, conversation history, and citation instructions. Responses stream via SSE.
- Evaluation layer. User feedback (thumbs up/down), implicit signals (abandonment, follow-ups), and LLM-as-judge scoring on sampled queries feed into a continuous improvement loop that tightens retrieval quality and generation accuracy over time.
3. How RAG Works: One Query, Start to Finish
Before diving into each component, here is what happens when an engineer types: "How does the payments service handle retries?"
Step 1: The query becomes a vector
The system converts the question into a 1024-dimensional embedding — a list of numbers that captures the meaning of the query, not just the keywords.
"How does the payments service handle retries?"
↓ embedding model
[0.12, -0.98, 0.34, 0.67, -0.21, ... ] (1024 numbers)
Step 2: Two searches run in parallel
Vector search (Qdrant) finds chunks whose meaning is similar to the query:
[
{ "score": 0.92, "content": "The payments service uses exponential backoff with jitter for all downstream retries...", "source": "confluence", "doc": "Payment Error Handling" },
{ "score": 0.87, "content": "Retry budgets are set per-service: payments allows 3 retries with a 2-second base delay...", "source": "github", "doc": "payments-service/README.md" },
{ "score": 0.84, "content": "Circuit breakers open after 5 consecutive failures, preventing retry storms...", "source": "confluence", "doc": "Resilience Patterns" }
]
Keyword search (Elasticsearch) finds chunks containing the exact terms:
{
"query": {
"bool": {
"should": [
{ "match": { "title_context": { "query": "payments service handle retries", "boost": 3 } } },
{ "match": { "content": "payments service handle retries" } }
]
}
}
}
Notice: the vector database returns ranked chunks, not answers. It finds relevant text fragments. The LLM has not been involved yet.
Why both? Vector search understands meaning but misses exact identifiers. Keyword search finds precise terms but misses intent. Section 11 covers hybrid retrieval in detail.
Step 3: Merge and re-rank
Reciprocal Rank Fusion (RRF) combines both result lists, boosting documents that appear in both. A cross-encoder re-ranker then scores the top candidates by reading each chunk alongside the query. The top 3 go to the LLM. Section 11 covers the mechanics.
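RRF itself is only a few lines. A minimal sketch, assuming each retriever returns a ranked list of chunk IDs (k=60 is the commonly used constant; the chunk names are illustrative):

```python
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists; items ranked well in multiple lists rise to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk_a", "chunk_b", "chunk_c"]   # vector search results
sparse = ["chunk_b", "chunk_d", "chunk_a"]  # BM25 results
merged = rrf_merge([dense, sparse])
# chunk_b ranks near the top of both lists, so it comes out first
```

Each list contributes 1/(k+rank) per document, so a chunk found by both retrievers accumulates two scores and outranks chunks found by only one.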
Step 4: The LLM generates an answer
The system assembles a prompt with the retrieved chunks as context:
System: You are an internal engineering assistant. Answer ONLY from the provided context.
Cite sources using [Source: document_name].
Context:
[Chunk 1: Payment Error Handling - "The payments service uses exponential backoff..."]
[Chunk 2: payments-service README - "Retry budgets are set per-service..."]
[Chunk 3: Resilience Patterns - "Circuit breakers open after 5 consecutive..."]
Question: How does the payments service handle retries?
The LLM produces:
The payments service uses exponential backoff with jitter for all downstream retries [Source: Payment Error Handling]. Each service has a retry budget — payments allows 3 retries with a 2-second base delay [Source: payments-service/README.md]. After 5 consecutive failures, the circuit breaker opens to prevent retry storms from cascading [Source: Resilience Patterns].
Total time: ~1.1 seconds. The user sees a streaming response starting at ~400ms.
What about complex queries?
"Why is checkout slow and what changed after the Q3 migration?"
This cannot be answered with a single search. The system classifies it as complex and routes to an MCP agent:
Step 1: Classify → complex (multi-part, temporal reference)
Step 2: Decompose into sub-queries:
→ "checkout latency issues"
→ "Q3 migration changes checkout"
Step 3: Agent calls search tool twice, evaluates results
Step 4: Agent decides it needs more: "Q3 migration rollout timeline"
Step 5: Third search, combines all context
Step 6: LLM generates combined answer with citations from all sources
The mental model:
- Simple query → one search → answer
- Complex query → think (decompose) → multiple searches → combine → answer
RAG retrieves information. MCP decides how to retrieve it.
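That decide-then-retrieve loop can be sketched in a few lines. Everything here is illustrative: each callable stands in for an LLM or tool call, and the real agent (Section 12) adds tool schemas, checkpointing, and budgets.

```python
def answer_complex(query, classify, decompose, search, needs_more, generate,
                   max_steps: int = 5):
    """Sketch of the agentic loop; callables are stand-ins, not a real API."""
    if classify(query) == "simple":
        return generate(query, search(query))          # one search -> answer
    context, pending = [], list(decompose(query))      # think: sub-queries
    steps = 0
    while pending and steps < max_steps:
        sub = pending.pop(0)
        context.extend(search(sub))                    # multiple searches
        pending.extend(needs_more(query, context))     # agent may add follow-ups
        steps += 1
    return generate(query, context)                    # combine -> answer
```

The max_steps guard matters: without a budget, an agent that keeps deciding it "needs more" will loop until it exhausts your token budget instead of answering.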
4. Design Principles
1. Retrieval quality determines answer quality. The LLM can only work with what retrieval gives it. Invest 60% of your effort in chunking, embedding, and retrieval. A mediocre LLM with excellent retrieval will outperform a frontier LLM with poor retrieval every time. And evaluation quality determines whether you know if retrieval is actually working.
2. Chunk quality beats chunk quantity. Five well-formed, relevant chunks in the context window produce better answers than twenty loosely related fragments. Over-retrieval wastes tokens, increases cost, and actually degrades answer quality because of the "lost in the middle" effect where LLMs underweight information in the middle of long contexts.
3. Cost-aware model routing is not optional. Not every query needs a frontier model. "What is the endpoint for the user service?" does not need Claude Opus. A smaller, faster model handles it in 200ms at 1/20th the cost. Route based on query complexity, not uniformly.
4. Fail safe, not fail silent. When retrieval confidence is low, the system should say "I don't have enough information to answer this confidently" with the best partial answer and source links. Silent hallucination destroys trust permanently. An honest "I'm not sure" preserves it.
5. Evaluate continuously, not once. Offline benchmarks catch regressions before they ship. Online evaluation (user feedback + LLM-as-judge) catches issues that benchmarks miss. Both are required. Neither alone is sufficient.
6. Observe every pipeline stage. If you cannot measure retrieval latency, re-ranking quality, and generation cost independently, you cannot debug production issues. OpenTelemetry traces span the entire RAG pipeline.
7. Design for document freshness. Stale answers erode trust faster than wrong answers. An engineer who gets an answer based on a doc that was updated two weeks ago will stop using the platform. Incremental ingestion with change detection is not a nice-to-have.
8. Avoid the common anti-patterns. Embedding full documents as single vectors kills granularity. Retrieval recall drops hard. Using one chunking strategy for everything is just as bad. Fixed-size chunks destroy Slack thread context and split code mid-function. And never trust the LLM to catch its own hallucinations. Without grounding and verification, it will confidently cite APIs that do not exist. Finally, skip evaluation at your own risk. "It looks good in demos" tells you nothing. Without a golden dataset and feedback loops, you have no idea if quality is improving or rotting.
5. Technology Selection
5.1 Component Choices
| Component | Technology | Why This Choice |
|---|---|---|
| Document parsing | Unstructured.io + custom parsers | Handles 20+ file formats. Custom parsers for Slack JSON and code files where Unstructured falls short. |
| Chunking engine | Custom multi-strategy pipeline | No single library handles all document types well. LangChain's text splitters for basics, custom logic for semantic and code-aware chunking. |
| Embedding model | OpenAI text-embedding-3-large (1024d) | Best cost/performance ratio on MTEB benchmarks. 3072d available if needed. Matryoshka support for dimensionality reduction. |
| Vector database | Qdrant (clustered) | Rust-based, fast HNSW with quantization. Payload filtering for metadata queries. Horizontal sharding. Open source with managed option. |
| Keyword index | Elasticsearch | Battle-tested BM25. Field boosting, analyzers, and aggregations. Already deployed at most enterprises. |
| Re-ranker | Cohere Rerank 3.5 + BGE-reranker-v2-m3 (fallback) | Cross-encoder accuracy on top-k results. Cohere for quality, open-source BGE as self-hosted fallback. |
| LLM inference (API) | Claude Sonnet / GPT-4o | Best quality for complex reasoning. Streaming support. Structured output mode for citations. |
| LLM inference (small) | Claude Haiku / GPT-4o-mini | 10-20x cheaper than frontier models. Sufficient for simple factual lookups. Sub-300ms TTFT. |
| LLM inference (self-hosted) | vLLM + Llama 3.3 70B | Fallback for provider outages. PagedAttention for efficient KV cache. Runs on 4x A100s. See Section 13.5 for open-source model details. |
| Agent orchestration | LangGraph | Stateful agent workflows with cycles. Built-in checkpointing. Cleaner than raw LangChain for multi-step reasoning. |
| Tool interface | MCP (Model Context Protocol) | Standardized tool interface across model providers. Write a search tool once, use it with Claude, GPT, or self-hosted models. OAuth 2.1 for secure access. |
| Semantic cache | Redis + embedding similarity | Embed incoming query, check cosine similarity against cached queries. Threshold > 0.95 returns cached response. 5-25% hit rate in enterprise workloads, highly dependent on query repetition patterns and team size. |
| Metadata store | PostgreSQL | Document metadata, ACLs, chunk lineage, user feedback. ACID transactions for permission updates. |
| Message queue | Apache Kafka | Decouples ingestion from embedding. Replay capability for re-processing. Partitioned by source system. |
| Observability | OpenTelemetry + Grafana | OTel traces across full RAG pipeline. Grafana dashboards for latency, cost, and quality metrics. |
| Evaluation | RAGAS + custom LLM-as-judge | RAGAS for offline metrics (faithfulness, relevance, context recall). Custom judge for production sampling. |
Important: these model choices are independent. The chunking model, embedding model, and generation LLM do not need to come from the same provider or even the same architecture. You can chunk with all-MiniLM-L6-v2 (open source, 22M params), embed with OpenAI text-embedding-3-large (API), and generate with Claude Sonnet (different API). The only coupling: the embedding model at ingestion must match the embedding model at query time. Section 9.3 covers when and how to change your embedding model.
5.2 Why RAG (and When Not)
The three main approaches to giving LLMs access to private knowledge:
| Approach | Best For | Latency | Relative Cost | Knowledge Freshness | Accuracy on Enterprise Data |
|---|---|---|---|---|---|
| RAG | Large, frequently changing knowledge base | 1-4s (retrieval + generation) | Low | Minutes (incremental indexing) | High (grounded in source docs) |
| Fine-tuning | Consistent tone/style, domain terminology, structured output formats | 0.5-2s (no retrieval overhead) | Very low | Weeks-months (retrain cycle) | Medium (baked into weights, can hallucinate) |
| Long context | Small corpus (< 500 pages), real-time analysis | 2-10s (large input processing) | High (scales with input size) | Real-time (docs in context) | High for small corpus, degrades with size |
RAG wins for this use case because: (1) the corpus is too large for context windows, (2) documents change daily so fine-tuning staleness is unacceptable, (3) citation and attribution require knowing exactly which document supports each claim. Fine-tuning complements RAG for style and format consistency but does not replace it. Long context works as a last-mile technique within RAG: after retrieval narrows to 5-10 relevant chunks, a 128K context model processes them.
The evolution: Naive to Agentic RAG. Most tutorials teach naive RAG: embed query, search vectors, stuff results into prompt, generate. That works for demos. Production systems need advanced RAG (hybrid search, re-ranking, query rewriting) and increasingly agentic RAG (iterative retrieval, query decomposition, tool use) to handle the 30-40% of queries that are too complex for single-shot retrieval. Section 12 covers agentic RAG in detail.
6. Capacity Planning
Storage sizing
Documents: 2,000,000
Avg chunks per doc: ~5 (varies widely: Slack threads = 1-2, API docs = 3-5, long RFCs = 10-20)
Total chunks: 10,000,000
Embedding storage (Qdrant):
10M chunks × 1024 dimensions × 4 bytes/float = 40 GB (raw vectors)
With scalar quantization (int8): 40 GB × 0.25 = 10 GB
Payload metadata per chunk: ~500 bytes × 10M = 5 GB
HNSW graph overhead: ~30% of vector size = 3-12 GB
Total Qdrant storage: 18-57 GB (depending on quantization)
Keyword index (Elasticsearch):
10M chunks × avg 300 tokens × 6 bytes/token = 18 GB raw text
With inverted index overhead: ~54 GB
Total ES storage: ~54 GB
Metadata store (PostgreSQL):
10M chunk records × 1 KB avg = 10 GB
2M document records × 2 KB avg = 4 GB
ACL tables, indexes: ~2 GB
Total PG storage: ~16 GB
Query load
Daily queries: 10,000,000
Average QPS: 10M / 86,400 = ~115 QPS
Peak QPS (3-4x avg): ~350-460 QPS
Burst QPS (5x avg): ~575 QPS (Monday morning, incident response)
Per-query compute:
Embedding query: 5-10ms (API call)
Vector search: 10-50ms (Qdrant)
BM25 search: 5-20ms (Elasticsearch)
Re-ranking (top 20): 80-150ms P99 (Cohere API or GPU, includes network)
ACL filtering: 5-15ms
LLM generation: 400-1500ms (depends on model tier)
Total P50: ~1.0-1.5s
Total P99: ~3-5s
For costs, see Section 18. The short version: LLM inference is 60-85% of the bill, so model routing is not optional.
7. Document Ingestion Pipeline
The ingestion pipeline pulls raw documents from five source systems and converts them into chunked, embedded, searchable content. Connector and parsing layers are often overlooked, but in practice they are a major source of production issues.
7.1 Connector Architecture
Each source system gets a dedicated connector service:
| Source | Integration Method | Change Detection | Authentication |
|---|---|---|---|
| Confluence | REST API + webhooks | Webhook on page create/update/delete, daily full-sync reconciliation | OAuth 2.0 |
| GitHub | Webhooks + REST API | Push webhooks for commits, PR webhooks for merges | GitHub App |
| Slack | Events API + conversations.history | Real-time events for new messages, backfill via pagination | Bot token |
| Google Docs | Drive API + Push Notifications | Drive change notifications, polling fallback | Service account |
| S3 | S3 Event Notifications (SNS/SQS) | Object created/modified events | IAM role |
Every connector follows the same pattern: detect change, fetch document, extract text and metadata, publish to document.updates Kafka topic. The Kafka topic partitions by source system so that one slow connector does not block others.
Why webhooks plus polling? Webhooks are faster but unreliable. Confluence webhooks occasionally miss events. Slack's Events API has delivery guarantees but rate limits during high activity. A daily reconciliation job polls each source for any documents modified in the last 24 hours, compares hashes with what is in PostgreSQL, and re-queues anything that was missed. In our experience, this catches a small but meaningful percentage of updates that webhooks miss.
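The daily reconciliation job reduces to a hash comparison. A sketch, with stored_hashes standing in for the hash column in PostgreSQL and enqueue standing in for re-publishing to the Kafka topic (both names illustrative):

```python
import hashlib

def reconcile(recently_modified: list[tuple[str, str]],
              stored_hashes: dict[str, str],
              enqueue) -> int:
    """Re-queue any document whose current content hash differs from the
    stored one -- i.e., an update the webhooks missed."""
    missed = 0
    for doc_id, content in recently_modified:
        h = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != h:
            enqueue(doc_id)
            missed += 1
    return missed
```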
7.2 Document Parsing
Raw documents arrive in a dozen formats. The parser normalizes everything to a common structure:
@dataclass
class ParsedDocument:
doc_id: str # Deterministic hash of source + source_id
source: str # "confluence", "github", "slack", etc.
source_id: str # Native ID in source system
title: str
content: str # Normalized markdown
content_type: str # "prose", "code", "mixed", "conversational"
metadata: dict # Author, created_at, updated_at, tags, etc.
permissions: list[str] # ACL groups/users who can access
parent_doc_id: str | None # For threaded/nested content
url: str # Deep link back to source
    content_hash: str         # SHA-256 of content for change detection
Format-specific parsing:
- Confluence: Atlassian Storage Format (XML-based) to markdown via custom converter. Strips macros, preserves headings, tables, and code blocks. Expands includes and excerpts inline.
- GitHub: README and docs are already markdown. Code files get language-tagged code blocks with file path context. PR descriptions include diff summaries.
- Slack: Thread messages are concatenated chronologically with author attribution. Reactions and emoji responses are stripped. Thread replies are grouped with their parent message.
- Google Docs: Export as HTML, convert to markdown. Preserves headings, lists, tables. Strips formatting-only elements (fonts, colors).
- PDFs: Unstructured.io with the hi_res strategy for layout-aware extraction. Falls back to the fast strategy if processing time exceeds 30 seconds per page.
7.3 Deduplication
The same information often exists in multiple places. A design doc might live in Confluence AND be linked in a Slack thread AND referenced in a GitHub PR description. Without deduplication, the same content appears three times in search results, wasting context window tokens.
Strategy: Content-hash-based dedup at the document level. After parsing, compute SHA-256 of the normalized content. If the hash already exists in PostgreSQL, skip re-processing. For near-duplicates (same content with minor formatting differences), compute SimHash and flag documents with similarity > 0.9 for manual review.
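The exact-duplicate gate is a one-liner around SHA-256. A sketch, with a seen_hashes set standing in for the PostgreSQL lookup:

```python
import hashlib

def content_hash(normalized_content: str) -> str:
    return hashlib.sha256(normalized_content.encode("utf-8")).hexdigest()

def should_process(normalized_content: str, seen_hashes: set[str]) -> bool:
    """Skip chunking/embedding for content already ingested elsewhere."""
    h = content_hash(normalized_content)
    if h in seen_hashes:
        return False  # exact duplicate (e.g., doc mirrored in Slack + Confluence)
    seen_hashes.add(h)
    return True
```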
8. Chunking Pipeline
Chunking matters more than any other decision in a RAG system (see Section 4, Principle 1).
The tradeoff is simple. Too small and you lose context: a 100-token fragment that says "this approach has three advantages" is useless without knowing what "this approach" refers to. Too large and retrieval gets noisy: a 2,000-token chunk about the entire auth system dilutes the signal when the user only asked about token refresh.
Model independence. The chunking model, embedding model, and generation LLM are completely independent. You can change any one without touching the others. The only coupling: the embedding model at ingestion must match the embedding model at query time (see Section 9.3).
8.1 Classic Fixed-Size Chunking
Split text into chunks of N tokens with M tokens of overlap.
Typical config: chunk_size=512 tokens, overlap=50 tokens
Performs well when: Uniformly structured documents like API reference pages where each section is roughly the same length and self-contained.
Degrades when: Any document where meaning spans across chunk boundaries. A 512-token chunk that starts mid-paragraph and ends mid-sentence loses context in both directions. The overlap helps but only at the margins.
Retrieval impact: Baseline. Expect recall@10 of 60-70% on heterogeneous enterprise corpora.
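The mechanism itself is tiny. A sketch, with a whitespace split standing in for a real tokenizer:

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    tokens = text.split()          # stand-in; production would use a tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break                  # last window reached the end of the document
    return chunks
```

Each chunk repeats the last 50 tokens of the previous one, which is exactly why the strategy only helps "at the margins": the overlap patches boundary losses but cannot recover context that lives several paragraphs away.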
What each strategy actually produces. Consider this section from a payments service doc:
## Retry Strategy
The payments service retries failed downstream calls using exponential
backoff with jitter. Base delay is 2 seconds.
### Configuration
| Parameter | Default | Max |
|---------------|---------|-------|
| max_retries | 3 | 10 |
| base_delay_ms | 2000 | 5000 |
### Circuit Breaker
After 5 consecutive failures, the circuit breaker opens for 30 seconds.
During this window, all calls return a cached fallback response instead
of hitting the downstream service.
Fixed-size (512 tokens) produces one chunk containing the entire section — or worse, splits it mid-table if the preceding content pushes it over the limit. If the user asks "what is the default max_retries?", the chunk includes unrelated text that dilutes the embedding.
Recursive chunking splits at the ## and ### boundaries: Chunk 1 = "Retry Strategy" intro, Chunk 2 = "Configuration" table, Chunk 3 = "Circuit Breaker" paragraph. Each chunk is self-contained and maps to one concept. The "Configuration" chunk answers parameter questions precisely.
Semantic chunking analyzes embedding similarity between consecutive sentences. It might keep the "Retry Strategy" intro and "Configuration" together (they are semantically related) but split "Circuit Breaker" into its own chunk (different concept). The boundaries are driven by meaning, not structure.
8.2 Recursive / Hierarchical Chunking
Split by document structure first: headers, then paragraphs, then sentences. Only fall back to fixed-size splitting when structural elements exceed the chunk size limit.
# Splitting hierarchy for a Confluence page
split_order = [
"\n## ", # H2 headers (major sections)
"\n### ", # H3 headers (subsections)
"\n\n", # Paragraph breaks
"\n", # Line breaks
". ", # Sentence boundaries
]
# Each split respects max_chunk_size=800 tokens
# Parent-child relationships are preserved in metadata
Parent-child context expansion: When a chunk is retrieved, the system can optionally pull its parent chunk for additional context. If a user asks about "retry backoff strategy" and retrieval returns a 200-token chunk about exponential backoff, the parent chunk (the full "Error Handling" section) provides the surrounding context that makes the answer more complete.
Performs well when: Well-structured documents with clear heading hierarchies. Confluence pages, GitHub README files, technical specs.
Degrades when: Flat documents with no headings (some Google Docs, most Slack threads). Falls back to fixed-size splitting, losing the structural advantage.
Retrieval impact: Typically 5-15% improvement in recall@10 over fixed-size for structured documents, depending on corpus. Negligible improvement for unstructured content.
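The splitting hierarchy above can be applied greedily; here is a character-based sketch (a real implementation counts tokens, merges undersized adjacent pieces, and falls back to fixed-size splitting when separators run out):

```python
def recursive_split(text: str, seps: list[str], max_len: int = 800) -> list[str]:
    # Try the coarsest separator first; recurse into finer separators
    # only for pieces that still exceed the size limit.
    if len(text) <= max_len or not seps:
        return [text]  # seps exhausted: a real impl would fixed-size split here
    head, *rest = seps
    chunks: list[str] = []
    for i, part in enumerate(text.split(head)):
        piece = part if i == 0 else head + part  # keep the marker with its section
        if not piece.strip():
            continue
        if len(piece) > max_len:
            chunks.extend(recursive_split(piece, rest, max_len))
        else:
            chunks.append(piece)
    return chunks
```

Each returned piece starts at a structural boundary, which is what makes the chunks self-contained.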
8.3 Semantic Chunking
Instead of splitting on structural boundaries, split on meaning boundaries. Use embeddings to detect where the topic changes.
How it works:
- Split the document into sentences.
- Embed each sentence.
- Compute cosine similarity between consecutive sentence embeddings using a sliding window.
- When similarity drops below a threshold (or drops significantly relative to the local average), insert a chunk boundary.
- Merge adjacent sentences between boundaries into chunks.
```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def split_into_sentences(text: str) -> list[str]:
    # Naive regex splitter for illustration; use spaCy or NLTK in production
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.3) -> list[str]:
    sentences = split_into_sentences(text)
    model = SentenceTransformer("all-MiniLM-L6-v2")  # Fast, small model for boundary detection
    embeddings = model.encode(sentences)
    boundaries = [0]
    for i in range(1, len(embeddings)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < threshold:
            boundaries.append(i)
    boundaries.append(len(sentences))
    chunks = []
    for i in range(len(boundaries) - 1):
        chunks.append(" ".join(sentences[boundaries[i]:boundaries[i + 1]]))
    return chunks
```

Trade-off: 3-5x more compute during ingestion (embedding every sentence for boundary detection). But this is a one-time cost per document, not per query. At 500K document updates per month, the extra embedding compute is a modest monthly cost. Worth it if your corpus has lots of unstructured prose.
Performs well when: Long-form prose that covers multiple topics without clear structural boundaries. Design docs that transition between problem statement, architecture, and implementation. Meeting notes.
Degrades when: Very short documents (not enough content for meaningful topic shifts). Highly structured documents (recursive chunking already captures the boundaries well).
Retrieval impact: Can improve recall, often in the 5-15% range on heterogeneous, unstructured content. The improvement tends to be most significant for documents longer than 3,000 tokens with multiple topic transitions. Results vary significantly by corpus.
8.4 Late Chunking
A technique introduced by Jina AI in 2024 that flips the traditional embed-then-chunk approach.
Traditional approach: Chunk the document first, then embed each chunk independently. Problem: each chunk embedding has no awareness of the surrounding context. A chunk that says "This approach has three advantages" loses the referent of "this approach" because that was defined in a previous chunk.
Late chunking approach:
- Pass the entire document through a long-context embedding model (one that supports 8K+ tokens).
- The model produces token-level embeddings with full document context (every token's embedding is influenced by the entire document through attention).
- After the full forward pass, chunk the token embeddings into segments.
- Pool each segment's token embeddings to produce chunk-level embeddings.
So each chunk embedding carries context from the whole document, even though it only covers part of the text. The pronoun "this approach" now has a meaningful embedding because the model saw what it referred to.
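The pooling step (step 4) is simple once you have token-level embeddings from a single long-context forward pass; this is a minimal sketch, assuming your chunker has already produced token-offset spans for each chunk:

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list[tuple[int, int]]) -> list[np.ndarray]:
    # token_embeddings: (num_tokens, dim), produced by ONE forward pass over the
    # whole document, so every token vector already attends to full-document context.
    # spans: (start, end) token offsets for each chunk.
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]
```

Mean pooling is the common choice; the key property is that the token vectors being pooled were contextualized by the entire document, not just the chunk.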
Performs well when: Documents with heavy cross-referencing, pronouns, and context-dependent statements. Technical specs where section 3 frequently refers back to concepts from section 1.
Degrades when: Very long documents that exceed the embedding model's context window (typically 8K tokens). You need to split into overlapping windows first, which partially defeats the purpose. Also adds 2-3x latency to the embedding step because you are processing longer sequences.
Retrieval impact: Reported 5-15% improvement in recall for documents with heavy cross-references, though results depend on document structure and query patterns. The gains are smaller on well-structured documents where each section is already self-contained.
Cost: 2-3x higher embedding latency per document. For batch ingestion this is acceptable. For real-time updates where you need a document searchable within minutes, the extra latency may push against your freshness SLA.
8.5 LLM-Guided Chunking
Use an LLM to analyze the document and decide how to chunk it. (This is unrelated to Agentic RAG in Section 12. Here, "LLM-guided" means using a model at ingestion time to choose chunk boundaries. Agentic RAG is about using an agent at query time to iteratively search.) This is the most expensive approach but can produce the highest quality chunks for complex documents.
How it works:
- Send the document (or a representative section) to a small LLM.
- The LLM identifies logical boundaries, labels each section's topic, and suggests chunk boundaries.
- Optionally, the LLM generates a summary for each chunk that serves as an alternative embedding target (the summary is often a better retrieval target than the raw content).
```python
CHUNKING_PROMPT = """Analyze this document and identify logical sections.
For each section, provide:
1. Start and end markers (first and last sentence)
2. A topic label
3. A one-sentence summary suitable for search retrieval

Document:
{document_text}

Output as JSON array of sections."""
```

Performs well when: Highly complex, mixed-format documents. A design doc that interleaves architecture diagrams with code snippets, trade-off analysis, and meeting notes. Documents where the "right" chunk boundary depends on understanding what the content means, not just where the whitespace is.
Cost: the most expensive per-document strategy, even with a fast model like Haiku or GPT-4o-mini. At 500K documents per month, costs add up quickly. Reserve it for high-value documents (RFCs, design docs) where chunk quality has the biggest impact.
Degrades when: Applied to everything. The cost adds up fast, and simpler strategies work fine for well-structured content. Use it selectively.
8.6 Our Strategy: Multi-Strategy Pipeline
No single chunking strategy works for all document types. The platform classifies each document by content type and applies the appropriate strategy:
| Document Type | Source | Strategy | Chunk Size | Overlap | Rationale |
|---|---|---|---|---|---|
| API reference docs | Confluence, GitHub | Recursive (by heading) | 500-800 tokens | 50 tokens | Well-structured, self-contained sections |
| Design docs / RFCs | Confluence, Google Docs | Semantic + Late chunking | 600-1000 tokens | N/A (semantic boundaries) | Long-form, cross-referencing, multi-topic |
| Runbooks / How-tos | Confluence | Recursive (by step) | 400-600 tokens | 30 tokens | Step-by-step, each step is self-contained |
| Code files | GitHub | AST-aware (by function/class) | 200-800 tokens | 0 (logical boundaries) | Function/class boundaries, preserve complete units |
| Slack threads | Slack | Thread-level (full thread as chunk) | Up to 1000 tokens | 0 | Context builds across messages, splitting breaks meaning |
| Meeting notes | Google Docs | LLM-guided chunking | 500-800 tokens | N/A | Unstructured, LLM identifies topic boundaries |
| PDFs / legacy docs | S3 | Semantic chunking | 500-800 tokens | N/A | Poor structure, need semantic boundary detection |
Quick reference: all strategies compared
| Strategy | Best For | Ingestion Cost | Retrieval Impact |
|---|---|---|---|
| Fixed-size | Uniform structured docs | Lowest | Baseline |
| Recursive | Docs with clear heading hierarchy | Low | Typically +5-15% |
| Semantic | Long-form, multi-topic prose | Medium (3-5x compute) | Typically +5-15% |
| Late chunking | Cross-referencing docs | Medium-high (2-3x latency) | Typically +5-15% |
| LLM-guided | Complex mixed-format docs | Highest (~$0.01-0.05/doc) | Highest for target docs |
Content type classification uses a lightweight model (fine-tuned DistilBERT or rule-based heuristics on document source + metadata) to route documents to the right chunking strategy. Accuracy target: 95%+ on the routing decision. Misclassification is not catastrophic because all strategies produce usable chunks. The difference is quality, not correctness.
Every chunk gets enriched metadata:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Chunk:
    chunk_id: str                # UUID
    doc_id: str                  # Parent document
    content: str                 # Chunk text
    content_type: str            # "prose", "code", "mixed", "conversational"
    chunk_strategy: str          # "recursive", "semantic", "late", "agentic", "ast"
    position: int                # Order within document
    total_chunks: int            # Total chunks in parent doc
    parent_chunk_id: str | None  # For hierarchical expansion
    title_context: str           # Nearest heading above this chunk
    source: str                  # "confluence", "github", etc.
    url: str                     # Deep link to source
    permissions: list[str]       # Inherited from parent document
    embedding: list[float]       # 1024-dimensional vector
    created_at: datetime
    updated_at: datetime
```

9. Embedding Pipeline
The embedding pipeline turns text chunks into dense vectors. This section covers model selection, batch processing during ingestion, versioning strategy, and dimensionality reduction.
9.1 Embedding Model Selection
The embedding model is the foundation of retrieval quality. Pick a bad one and everything downstream pays for it.
| Model | Dimensions | MTEB Avg Score | Latency (per chunk) | Matryoshka Support | License |
|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 4096 (or 1024 reduced) | 70.58 | 10-20ms (self-hosted GPU) | Yes | Apache 2.0 |
| OpenAI text-embedding-3-large | 3072 (or 1024 reduced) | 64.6 | 5-10ms | Yes | Proprietary API |
| Cohere embed-v4 | 1024 | 65.1 | 8-15ms | Yes | Proprietary API |
| NV-Embed-v2 (NVIDIA) | - | ~69 | GPU | No | Llama license |
| jina-embeddings-v3 | 1024 (570M params) | 65.5 | GPU or CPU | - | Apache 2.0 |
| BGE-M3 (open source) | 1024 | 63.0 | 3-8ms (GPU) or CPU | No | MIT |
| EmbeddingGemma-300M | - | ~60 | CPU or edge device | - | Apache 2.0 |
Our choice: OpenAI text-embedding-3-large at 1024 dimensions for the API path. The Matryoshka property lets us store 1024d vectors (instead of full 3072d) with less than 1% retrieval quality loss, cutting storage by 66%. At 10M chunks, that cuts raw vector storage from ~120 GB to ~40 GB (float32). The API is reliable, well-documented, and integrates cleanly with the rest of the pipeline.
For self-hosted: Qwen3-Embedding-8B. It tops the MTEB leaderboard at 70.58, beating every commercial API on raw benchmark scores. Supports 100+ languages and custom instructions for domain-specific tuning. Requires a single A100-40GB GPU. At high volume (> 50M chunks), self-hosting cuts embedding cost by 3-5x compared to APIs.
The lightweight option: BGE-M3. MIT licensed, 568M parameters, runs on CPU for small deployments. Lower quality than Qwen3-Embedding (63.0 vs 70.58 on MTEB) but zero GPU cost and battle-tested in production RAG systems worldwide. Good enough for prototyping and small-scale deployments.
Serving self-hosted embeddings: Hugging Face's Text Embeddings Inference (TEI) is the easiest production setup:

```shell
# Serve BGE-M3 for production embedding
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3 \
  --max-batch-tokens 65536
```

For smaller models like BGE-M3 (568M params), CPU serving is viable at low throughput (< 100 QPS). At higher volumes, a single GPU handles 1,000+ embeddings per second.
9.2 Batch Embedding for Ingestion
New and updated chunks flow through Kafka to the embedding workers. Each worker:
- Batches chunks into groups of 100-500 (the API handles batch requests efficiently).
- Calls the embedding API with retry and exponential backoff.
- Writes vectors to Qdrant and text to Elasticsearch in parallel.
- Acknowledges the Kafka offset after both writes succeed.
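The batching and retry logic in those worker steps can be sketched as follows; embed_fn is a stand-in for your embedding API client call, and the Kafka consumption and dual-write plumbing are omitted:

```python
import itertools
import random
import time

def batched(chunks, size: int = 100):
    # Yield fixed-size batches from a stream of chunks
    it = iter(chunks)
    while batch := list(itertools.islice(it, size)):
        yield batch

def embed_with_retry(embed_fn, texts, max_attempts: int = 5):
    # Retry transient API failures with exponential backoff and jitter
    for attempt in range(max_attempts):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(2 ** attempt + random.random(), 30))  # capped backoff
```

The jitter spreads retries from many workers so they do not hammer the API in lockstep after a rate-limit event.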
Rate limiting: OpenAI's embedding API allows up to 3,000 RPM on higher tiers (varies by plan). With batch sizes of 100 at full quota utilization, that is roughly 300K chunks per minute. Monthly ingestion of 500K updated chunks can complete quickly. Full re-indexing of 10M chunks typically takes 1-3 hours depending on API rate limits, retries, and batching strategy.
9.3 Embedding Model Versioning and Migration
Embedding models improve over time. When you upgrade from text-embedding-3-small to text-embedding-3-large, every existing vector in the database is now incompatible with new query vectors. You cannot mix embeddings from different models.
There's only one hard rule: the embedding model must match at ingestion and query time. If you embed your 10M chunks with text-embedding-3-large, every user query must also go through text-embedding-3-large. Mix models and you get vectors in different spaces where cosine similarity is meaningless. The chunking model and the generation LLM are completely independent of this choice.
Blue-green re-indexing strategy:
- Create a new Qdrant collection (chunks_v2) alongside the existing one (chunks_v1).
- Run a background job that re-embeds all 10M chunks with the new model. At our scale and current API rate limits, this typically takes 1-3 hours.
- While re-indexing runs, queries continue hitting chunks_v1.
- Once chunks_v2 is fully populated, run the retrieval evaluation benchmark against the golden dataset.
- If quality meets or exceeds the bar, atomically switch the query router to chunks_v2.
- Keep chunks_v1 for 48 hours as a rollback target, then delete.
This is the same pattern as blue-green deployments for application code, applied to vector data.
When to change your embedding model:
Changing the embedding model is expensive (full re-index of the corpus) and risky (retrieval quality could regress). Do it when the payoff justifies the cost:
| Trigger | Example | Worth it? |
|---|---|---|
| Major quality improvement | New model scores 5%+ higher on MTEB for your domain | Yes. Run shadow eval first. |
| Significant cost reduction | Switch from API model to self-hosted with < 1% quality loss | Yes, if savings > $1K/month at your scale. |
| Vendor deprecation | Provider announces model sunset with 6-month deadline | Yes. Plan early, don't rush. |
| New capability needed | Need code-aware embeddings (CodeBERT) or multi-lingual support | Yes, if current model can't handle these. |
| Dimensionality optimization | Matryoshka model lets you cut dims from 3072 to 1024 with < 1% loss | Yes. Storage and latency savings compound. |
When NOT to change:
- Marginal improvement (< 2% on your retrieval eval set). The re-indexing cost and risk are not worth it.
- You have no blue-green re-indexing capability. Changing models in-place means downtime or serving stale results.
- Mid-incident or during a feature launch. Stabilize first.
- The new model has not been benchmarked on YOUR data. MTEB scores are averages across generic datasets. A model that scores 3% higher on MTEB might score 2% lower on your internal engineering corpus. Always benchmark on your golden dataset before switching.
How to evaluate before switching:
- Create a shadow Qdrant collection with the new model's embeddings (start with a 10% sample of your corpus).
- Run your golden dataset retrieval benchmark against the shadow collection.
- Compare recall@10, MRR, and NDCG@10 against the current production model.
- If the new model wins or ties on all metrics, proceed with full re-indexing.
- If it wins on some and loses on others, dig into the failures. If the losses are on query types you care about, do not switch.
9.4 Dimensionality Reduction
OpenAI's text-embedding-3-large supports Matryoshka representation learning: the first N dimensions of the embedding capture most of the information. Truncating from 3072 to 1024 dimensions reduces storage by 66% with typically less than 2% quality loss on most retrieval benchmarks (corpus-dependent).
For further compression, apply scalar quantization in Qdrant: convert float32 vectors to int8. This reduces memory by 4x with typically 1-5% recall loss depending on corpus and query distribution. Combined, you go from 120 GB (3072d, float32) to 10 GB (1024d, int8). That is a 12x reduction.
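Both steps are easy to sanity-check with arithmetic. The truncation helper below is illustrative (with text-embedding-3 models you can also request reduced dimensions server-side via the API's dimensions parameter); the renormalization matters because cosine search assumes unit-length vectors:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int = 1024) -> np.ndarray:
    # Keep the first `dims` dimensions, then L2-renormalize for cosine search
    v = vec[:dims].astype(np.float32)
    return v / np.linalg.norm(v)

# Raw vector storage for 10M chunks (before HNSW graph overhead):
full_gb = 10_000_000 * 3072 * 4 / 1e9     # float32 at 3072d  ≈ 122.9 GB
compact_gb = 10_000_000 * 1024 * 1 / 1e9  # int8 at 1024d     ≈ 10.2 GB
```

Dividing the two gives the 12x reduction quoted above.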
When to avoid quantization: If your retrieval recall is already borderline (< 75%), quantization will push it below acceptable thresholds. Fix chunking and retrieval first, then compress.
10. Storage Architecture
10.1 Vector Database Selection
| Feature | Qdrant | Pinecone | Weaviate | pgvector |
|---|---|---|---|---|
| Architecture | Distributed, shared-nothing | Fully managed serverless | Distributed, multi-model | PostgreSQL extension |
| Max vectors | Billions (sharded) | Billions (managed) | Billions (sharded) | ~10M practical |
| HNSW tuning | Full control (ef, M) | Abstracted | Full control | Limited |
| Quantization | Scalar, Product, Binary | Supported | Product | Half-precision |
| Metadata filtering | Payload indexes, fast | Namespace + filter | Inverted index | SQL WHERE clause |
| Hybrid search | Sparse vectors + dense | Sparse + dense | BM25 built-in | Requires extensions |
| Ops complexity | Medium (Helm charts) | Zero (managed) | Medium | Low (PG extension) |
Our choice: Qdrant. For 10M vectors at our scale, Qdrant gives us full control over HNSW parameters, quantization, and sharding without the managed service premium. Payload indexes handle the metadata filtering we need for access control. The Rust implementation keeps memory usage predictable. Pinecone is the right choice if you want zero operational overhead. pgvector works up to about 5-10M vectors but performance degrades at our scale, especially with complex filters.
For a deeper comparison of vector database internals (HNSW vs IVF-PQ, distance metrics, re-indexing strategies), see the vector databases deep dive.
10.2 Vector Store (Qdrant)
Cluster topology: 3 nodes with replication factor 2. Each node handles a shard of the vector space. At 10M vectors with 1024 dimensions (int8 quantized), each node stores roughly 7 GB of vector data (its shard plus replicas of others) plus HNSW graph overhead.
HNSW index tuning:
| Parameter | Value | Why |
|---|---|---|
| m | 16 | Connections per node in the HNSW graph. 16 balances recall and memory. Higher values (32-64) improve recall slightly but double memory usage. |
| ef_construct | 200 | Build-time search width. Higher values produce a better graph but slow down indexing. 200 is sufficient for 10M vectors. |
| ef (search) | 128 | Query-time search width. Higher = better recall but slower search. 128 often achieves high recall (frequently >90%) but exact results depend on data distribution and filtering complexity. |
Payload indexes: Create indexes on source, content_type, and permissions fields. These allow fast pre-filtering during vector search. Without payload indexes, Qdrant scans all vectors first and filters after, which is much slower for restrictive filters.
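Assuming Qdrant's REST API (verify field names against the current docs for your version), the collection setup and one of the payload indexes described above look roughly like:

```
PUT /collections/chunks
{
  "vectors": { "size": 1024, "distance": "Cosine" },
  "hnsw_config": { "m": 16, "ef_construct": 200 },
  "quantization_config": { "scalar": { "type": "int8", "always_ram": true } }
}

PUT /collections/chunks/index
{ "field_name": "permissions", "field_schema": "keyword" }
```

The query-time ef value is passed per search request rather than set on the collection.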
What exactly is stored per chunk?
Each chunk becomes a vector with a metadata payload:
```json
{
  "id": "chunk_a1b2c3d4",
  "vector": [0.12, -0.98, 0.34, 0.67, -0.21, "... 1024 dimensions total"],
  "payload": {
    "content": "The payments service uses exponential backoff with jitter for all downstream retries. Base delay is 2 seconds, multiplied by 2^attempt with random jitter up to 500ms...",
    "doc_id": "doc_payments_retry_2026",
    "title": "Payment Service Error Handling",
    "source": "confluence",
    "content_type": "prose",
    "permissions": ["team-payments", "team-platform"],
    "url": "https://wiki.internal/pages/payment-error-handling",
    "updated_at": "2026-03-15T10:30:00Z",
    "chunk_index": 3,
    "total_chunks": 8
  }
}
```

The vector captures meaning (1024 numbers). The payload carries the actual text, metadata, and access control filters. The vector database returns ranked chunks, not answers. The LLM has not been involved yet at this stage — it only enters the picture after retrieval selects the best chunks.
10.3 Keyword Index (Elasticsearch)
BM25 keyword search handles queries where exact term matching matters. "ErrorCode 4032" should match documents containing that exact string, regardless of semantic similarity.
Index design:
```json
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "standard" },
      "title_context": { "type": "text", "analyzer": "standard", "boost": 3.0 },
      "source": { "type": "keyword" },
      "content_type": { "type": "keyword" },
      "permissions": { "type": "keyword" },
      "doc_id": { "type": "keyword" },
      "chunk_id": { "type": "keyword" },
      "url": { "type": "keyword" },
      "updated_at": { "type": "date" }
    }
  }
}
```

Field boosting: title_context gets 3x boost because a match in the heading is a strong relevance signal. A chunk under the heading "Authentication Flow" that matches the query "how does auth work" should rank higher than a chunk that mentions authentication in passing.
10.4 Metadata Store (PostgreSQL)
PostgreSQL stores the system of record for document metadata, access control lists, and chunk lineage.
Key tables:
- documents: Source metadata, content hash, last sync timestamp, permissions
- chunks: Chunk-to-document mapping, chunk strategy, position, parent chunk
- permissions: ACL entries mapping documents to user groups and individual users
- feedback: User ratings, corrections, and implicit signals per query-answer pair
- evaluations: LLM-as-judge scores, golden dataset results, regression test runs
Why not put metadata in Qdrant payloads? Qdrant payloads work for search-time filtering but are not queryable with SQL. Admin dashboards, permission audits, and evaluation analysis all need SQL queries that join across tables. PostgreSQL is the right tool for that.
10.5 Semantic Cache (Redis)
Before running the full retrieval pipeline, check if a semantically similar query was recently answered.
How it works:
- Embed the incoming query.
- Check Redis for cached query embeddings with cosine similarity > 0.95.
- If found, return the cached answer immediately (< 10ms vs 1-2s for full pipeline).
- If not found, run the full pipeline and cache the result.
What each cache entry stores:
```json
{
  "query_embedding": [0.12, -0.98, 0.34, "..."],
  "query_text": "How does auth work?",
  "answer": "The auth service uses OAuth2 with...",
  "sources": [{"title": "Auth Guide", "url": "..."}],
  "created_at": 1711234567
}
```

The embedding is the lookup key, not the text. "How does auth work?" and "How does authentication work?" are different strings but nearly identical embeddings. Text matching would miss the cache hit. Embedding matching catches it.
How similarity matching works: Cosine similarity measures the angle between two vectors. Same direction means same meaning.
New query: "How does authentication work?" → [0.11, -0.97, 0.35, ...]
Cached query: "How does auth work?" → [0.12, -0.98, 0.34, ...]
cosine_similarity = dot_product(A, B) / (magnitude(A) × magnitude(B))
= 0.97
0.97 > 0.95 threshold → cache hit → return stored answer in <10ms
A query like "What auth does the payments service use?" produces a vector with similarity ~0.82, well below the threshold. Cache miss, full pipeline runs. The threshold controls how strict the match is: 0.95 is conservative (fewer hits, safer), 0.90 is relaxed (more hits, small risk of returning a slightly wrong cached answer). For enterprise RAG with sensitive runbooks and ACL-protected content, 0.95 is the right default.
At small scale (under 50K cached queries), comparing against all stored embeddings with brute-force cosine similarity takes under 1ms. If the cache grows past 100K entries, switch to Redis Vector Search (RediSearch module with HNSW indexing) or pgvector for indexed approximate search.
Cache invalidation: When a document is updated, invalidate all cached answers that cited that document. This uses the chunk lineage in PostgreSQL to find affected cache entries.
Hit rates: In practice, enterprise knowledge platforms see 5-25% cache hit rates. Engineers on the same team ask similar questions. Onboarding engineers ask the same questions that last month's new hires asked. Cache hit rate is highest on Monday mornings and during incident response when many people search for the same runbooks.
Cold start and cache warming: An empty cache means every query pays full pipeline cost. This happens after deployments, cache flushes, or Monday mornings when TTLs have expired over the weekend. Three strategies:
- Pre-warm with top queries. Log the top 1,000 queries from the previous week. After a cache flush or deployment, run these through the pipeline as a background job (P2 priority, so it does not compete with real users). This covers 15-30% of Monday's traffic before anyone arrives.
- Staggered TTLs. Instead of a uniform 24-hour TTL for all cache entries, randomize TTLs between 18-30 hours. This prevents mass expiration at the same time and smooths the cache rebuild.
- Gradual invalidation after document updates. When a document is updated, do not flush all related cache entries immediately. Mark them as "stale but serveable" for 5 minutes while the pipeline regenerates fresh answers in the background. Users get slightly stale answers for a few minutes instead of a latency spike.
10.6 Data Lifecycle Management
Vectors do not age gracefully. Without active maintenance, the index accumulates stale chunks, orphaned vectors, and fragmented segments.
Document TTL and stale chunk cleanup: Documents that have not been updated at the source within a configurable window (for example, 6-12 months depending on document type) get flagged as potentially stale. A weekly job checks whether flagged documents still exist at the source. If deleted or archived, their chunks are removed from Qdrant and Elasticsearch. If still present but unchanged, they stay indexed but get a stale_risk: high metadata flag that the generation layer can surface as a caveat: "Note: this source was last updated 14 months ago."
Vector compaction: Qdrant segments accumulate deletions over time (from document updates and stale cleanup). Deleted vectors are soft-deleted but still occupy space and slow search. Schedule monthly compaction during low-traffic windows (weekend nights) to reclaim space and rebuild HNSW graph segments.
Index fragmentation: Elasticsearch indexes grow fragmented as documents are added and deleted. Run _forcemerge quarterly to reduce segment count. For Qdrant, the optimizer handles this automatically but benefits from a periodic full re-optimization.
11. Retrieval Pipeline
Retrieval is where trust is won or lost. All the ingestion and chunking work is just preparation for this stage. Most RAG systems fail not because of the model, but because they retrieve the wrong context and never find out.
11.1 Query Understanding
Raw user queries are often ambiguous, incomplete, or poorly phrased. The query understanding layer transforms them before retrieval.
Query classification: A lightweight classifier (fine-tuned DistilBERT or rule-based) categorizes queries into:
| Query Type | Example | Routing |
|---|---|---|
| Simple factual | "What port does the user service run on?" | Single-shot retrieval, small model |
| How-to | "How do I set up local dev for the payments service?" | Single-shot retrieval, medium model |
| Analytical | "What are the trade-offs between our auth approaches?" | May need agentic RAG, large model |
| Multi-hop | "How does service X auth with Y, and what changed after Q3?" | Agentic RAG with query decomposition |
| Conversational | "What about the error handling?" (follow-up) | Resolve coreferences, then route |
Query rewriting: For ambiguous queries, use a fast LLM call to rewrite:
Original: "that auth thing from last quarter"
Rewritten: "authentication changes implemented in Q3 2025"
Cost: negligible per query with a small model. Applied to ~30% of queries (those classified as ambiguous).
HyDE (Hypothetical Document Embeddings): For abstract queries where the user's question does not overlap lexically with any document, generate a hypothetical answer first, then use that answer's embedding for retrieval.
Query: "Why is checkout slow?"
HyDE: "The checkout service experiences latency spikes due to synchronous calls
to the payment processor and inventory service. The p99 latency increases
from 200ms to 2s during peak traffic because..."
The HyDE embedding is semantically closer to actual documents about checkout performance than the original 4-word query. This technique can improve recall for abstract queries, with observed gains of 5-15% depending on usage patterns, but adds 200-400ms of latency (the LLM call to generate the hypothetical answer). Apply it selectively, not on every query.
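The HyDE flow itself is only a few lines; in this sketch, generate, embed, and search are stand-ins for your LLM client, embedding model, and vector index:

```python
def hyde_retrieve(query: str, generate, embed, search, top_k: int = 10):
    # Ask the LLM for a plausible answer, then retrieve with THAT text's
    # embedding instead of the raw query's embedding.
    hypothetical = generate(
        f"Write a short passage that plausibly answers: {query}"
    )
    return search(embed(hypothetical), top_k=top_k)
```

The hypothetical answer may be factually wrong; that is fine, because it is only used as a retrieval probe, never shown to the user.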
11.2 Hybrid Retrieval
Run BM25 keyword search and dense vector search in parallel, then merge results.
Why hybrid? BM25 captures exact lexical matches and rare tokens. Embeddings capture semantic similarity. Their failure modes are complementary, which is why combining them outperforms either alone. Each search modality has blind spots:
- Vector search misses: Exact identifiers, error codes, version numbers. "ErrorCode 4032" has no meaningful semantic embedding. BM25 finds it instantly.
- BM25 misses: Intent and meaning. "how to handle failures gracefully" does not match a document titled "Retry and Circuit Breaker Patterns" because the keywords do not overlap. Vector search captures the semantic connection.
Concrete example of complementary blind spots:
Try this query: "ErrorCode 4032"
- Vector search returns docs about error handling in general — semantically similar, but not the right error code.
- Keyword search finds the exact match instantly: "ErrorCode 4032: Idempotency key conflict on duplicate payment submission."
Now try: "how to handle failures gracefully"
- Keyword search returns nothing useful — no document contains this exact phrase.
- Vector search finds "Resilience Patterns: Circuit Breakers, Retries, and Graceful Degradation" — semantically a perfect match.
Vector search understands meaning. Keyword search finds exact terms. You need both.
Reciprocal Rank Fusion (RRF) merges the two result lists:
RRF_score(doc) = sum(1 / (k + rank_in_list)) for each list containing doc
where k = 60 (standard constant)
If a document appears at rank 3 in vector search and rank 7 in BM25:
RRF_score = 1/(60+3) + 1/(60+7) = 0.0159 + 0.0149 = 0.0308
Documents appearing in both lists get boosted. Documents appearing in only one list still contribute.
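The fusion is a few lines of code; a sketch taking one ranked doc-id list per retriever:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # rankings: one ranked doc-id list per retriever (e.g., vector, BM25)
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because each contribution is 1/(k + rank), RRF only uses rank positions, so it needs no score normalization across retrievers.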
RRF in action:
| Document | Vector Rank | BM25 Rank | RRF Score | Final Rank |
|---|---|---|---|---|
| payments-service README | 2 | 1 | 0.0325 | 1 |
| Payment Error Handling | 1 | 3 | 0.0323 | 2 |
| Retry Config Reference | 5 | 2 | 0.0315 | 3 |
| Resilience Patterns | 3 | 5 | 0.0313 | 4 |
| Error Code Glossary | 8 | 4 | 0.0303 | 5 |
Alpha tuning: Some teams use a weighted combination instead of RRF: score = alpha * vector_score + (1 - alpha) * bm25_score. The optimal alpha varies by corpus. For our enterprise knowledge base: alpha = 0.7 (favor semantic) works well for natural language queries. For code-heavy queries: alpha = 0.4 (favor keyword) performs better.
Dynamic alpha by query type: Rather than a single fixed alpha, adjust the weight based on the query classification from Section 11.1:
| Query Type | Alpha (vector weight) | Why |
|---|---|---|
| Conceptual ("how does auth work") | 0.8 | Meaning matters more than exact terms |
| Code/identifier ("ErrorCode 4032") | 0.3 | Exact match critical; BM25 excels |
| How-to ("set up local dev") | 0.6 | Mix of procedure keywords and intent |
| Conversational follow-up | 0.7 | Rewritten query benefits from semantic |
Over time, you can learn the optimal alpha per query type from feedback data. If users consistently downvote code query results, try shifting alpha lower for that bucket.
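A sketch of the per-type blending; the weights mirror the table above, and the formula assumes both scores have been normalized to comparable ranges (raw BM25 scores and cosine similarities are not directly comparable):

```python
# Hypothetical per-type alpha weights, keyed by the query classifier's labels
ALPHA = {"conceptual": 0.8, "identifier": 0.3, "how_to": 0.6, "follow_up": 0.7}

def hybrid_score(vector_score: float, bm25_score: float,
                 query_type: str, default_alpha: float = 0.7) -> float:
    # alpha is the vector weight; (1 - alpha) goes to BM25
    a = ALPHA.get(query_type, default_alpha)
    return a * vector_score + (1 - a) * bm25_score
```

Updating the ALPHA table from feedback data is the learning loop described above.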
Performance: Hybrid retrieval typically outperforms either modality alone by 5-15% on recall@10 across published benchmarks. The actual improvement depends on your query mix and corpus characteristics.
Every retrieval improvement (hybrid search, re-ranking, query expansion) adds latency and cost. Production systems continuously balance this trade-off rather than maximizing any single dimension. A system that gets perfect recall at 2-second retrieval latency is worse than one that gets 90% recall at 200ms.
11.3 Re-Ranking
The initial retrieval cast a wide net: top-50 to top-100 results from hybrid search. Re-ranking narrows this to the top-5 most relevant chunks using a cross-encoder model.
Why re-ranking matters: Bi-encoders encode the query and document independently into separate vectors. Fast, but the representations never "see" each other. Cross-encoders score the (query, document) pair jointly, attending to word-level interactions between them. Joint scoring is significantly more accurate but too slow to run on the full corpus.
The standard pattern: fast bi-encoder retrieval on the full corpus (millions of chunks) followed by accurate cross-encoder re-ranking on the shortlist (20-100 chunks).
Re-ranking options:
| Model | Latency (20 docs) | Quality (NDCG@10) | Cost |
|---|---|---|---|
| Cohere Rerank 3.5 | 80-150ms (with network) | Top-tier | $2/1M docs |
| Qwen3-Reranker-8B (open source) | 50-100ms (GPU) | Comparable to Cohere on benchmarks | GPU cost only |
| BGE-reranker-v2-m3 (self-hosted) | 40-80ms (GPU) | Very good | GPU cost only |
| ColBERTv2 | 20-40ms | Good (late-interaction, fastest) | GPU cost only |
| Jina Reranker v2 | 50-90ms | Very good | $1/1M docs |
Our choice: Cohere Rerank 3.5 as primary, with BGE-reranker-v2-m3 self-hosted as a fallback. The Cohere model handles 20 documents in ~80-150ms at P99 in typical deployments (including network overhead) which fits within our P99 retrieval budget of ~350ms. The self-hosted fallback activates if Cohere's API has availability issues.
Open-source re-rankers:
| Model | Params | Quality | GPU Requirement |
|---|---|---|---|
| Qwen3-Reranker-8B | 8B | Matches Cohere Rerank 3.5 | 1x A100-40GB |
| BGE-reranker-v2-m3 | 568M | Very good | 1x RTX 4090 |
| ColBERTv2 | 110M | Good (late-interaction, very fast) | 1x RTX 4090 |
11.4 Access Control Filtering
Non-negotiable. An engineer should never see answers sourced from documents they cannot access. One leaked internal doc through the knowledge assistant and the platform is dead.
Pre-filtering vs post-filtering:
- Pre-filtering (our approach): Include the user's permission groups in the vector search query. Qdrant's payload index on `permissions` filters before scoring, so only accessible documents are considered. This is faster and more secure. The downside: if a user has very restrictive permissions, the candidate pool shrinks and retrieval quality may drop.
- Post-filtering: Retrieve top-k from the full corpus, then filter out inaccessible documents. Simpler to implement, but you may end up with fewer than k results after filtering. It also briefly loads inaccessible document IDs into memory, which some compliance frameworks disallow.
Permission sync: Permissions are synced from source systems to PostgreSQL on document ingestion. When Confluence space permissions change, a webhook triggers a permission update for all documents in that space. Qdrant payload updates propagate within seconds.
Edge case: shared docs with restricted sections. Some Confluence pages are accessible to everyone but contain sections marked as restricted. The current design treats the entire document as accessible if the top-level permission allows it. A future improvement: section-level permissions mapped to chunk-level ACLs.
11.5 Context Assembly
After retrieval and re-ranking produce the top-5 chunks, assemble them into the LLM prompt.
Context window budget:
Total context window: 8,192 tokens (small model) or 200,000 tokens (large model)
Budget allocation (small model):
System prompt: 500 tokens (instructions, citation format, guardrails)
Retrieved context: 4,000-5,000 tokens (5 chunks × 800-1000 tokens)
Conversation history: 1,000-1,500 tokens (last 2-3 turns, summarized)
Generation output: 1,000-2,000 tokens (answer + citations)
Budget allocation (large model, complex queries):
System prompt: 500 tokens
Retrieved context: 8,000-15,000 tokens (10-15 chunks for agentic RAG)
Conversation history: 2,000-4,000 tokens (full recent history)
Generation output: 2,000-4,000 tokens
Chunk ordering: Research on the "lost in the middle" effect shows that LLMs pay more attention to the beginning and end of the context, and underweight information in the middle. To mitigate: place the most relevant chunk first, the second-most-relevant chunk last, and fill the middle with supporting context. In our testing, this can improve answer quality by 3-5%, though newer models show less susceptibility to this effect than earlier ones.
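The reordering itself is trivial. A minimal sketch, assuming chunks arrive sorted best-first from the re-ranker (the function name is illustrative):

```python
def order_for_context(chunks_by_relevance: list[str]) -> list[str]:
    """Mitigate the lost-in-the-middle effect: put the most relevant
    chunk first, the second-most-relevant last, and the rest in the middle.
    Input must be sorted best-first."""
    if len(chunks_by_relevance) < 3:
        return list(chunks_by_relevance)
    first, second = chunks_by_relevance[0], chunks_by_relevance[1]
    middle = chunks_by_relevance[2:]
    return [first] + middle + [second]
```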
Context compression: When retrieved chunks are long or overlap in content, compress before stuffing into the prompt. Two techniques:
- Redundancy removal. If two chunks say essentially the same thing (cosine similarity > 0.85 between their embeddings), keep only the higher-scored one. This happens more often than you would expect, especially when multiple documents describe the same service.
- Extractive compression. Use a fast LLM call to extract only the sentences relevant to the query from each chunk. An 800-token chunk might compress to 200 tokens without losing the information the user actually needs. The small per-query extraction cost typically pays for itself: compression can save 30-50% of context tokens in practice, which directly reduces generation cost.
Apply compression selectively. For simple factual queries with 3-4 short chunks, skip it. For complex queries with 10+ chunks from agentic RAG, compression pays for itself in reduced generation tokens.
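Redundancy removal can be sketched with plain cosine similarity over the chunks' existing embeddings. Because the input is sorted by re-rank score, the higher-scored duplicate always survives. Function names and the input shape are illustrative:

```python
import math

def cosine(u: tuple[float, ...], v: tuple[float, ...]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dedupe_chunks(scored_chunks: list[tuple[str, tuple[float, ...]]],
                  threshold: float = 0.85) -> list[str]:
    """Drop near-duplicate chunks. Input: (chunk_id, embedding) pairs
    sorted by re-rank score, best first."""
    kept: list[tuple[str, tuple[float, ...]]] = []
    for chunk_id, emb in scored_chunks:
        # Keep the chunk only if it is not too similar to anything kept so far.
        if all(cosine(emb, kept_emb) <= threshold for _, kept_emb in kept):
            kept.append((chunk_id, emb))
    return [chunk_id for chunk_id, _ in kept]
```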
Token counting: Use tiktoken (for OpenAI models) or the model's native tokenizer to count precisely. Overestimating wastes context. Underestimating causes truncation errors. Count tokens before assembly, not after.
12. Agentic RAG: Beyond Single-Shot Retrieval
Single-shot RAG (retrieve once, generate once) handles about 60-70% of enterprise queries well. The remaining 30-40% are too complex: they require information from multiple documents, involve reasoning across topics, or are phrased so abstractly that initial retrieval misses the target.
Agentic RAG uses an LLM as a reasoning agent that can iteratively search, evaluate results, and refine its approach. Instead of a fixed pipeline, the agent decides what to do next based on what it has found so far.
12.1 When to Use Agentic RAG
Not every query needs an agent. Agents typically add 1.5-3x latency and 2-5x cost compared to single-shot RAG (varies by iteration count and tools called). Route to the agentic path only when needed.
Routing criteria:
| Signal | Single-Shot | Agentic |
|---|---|---|
| Query complexity classifier | Simple, factual, how-to | Analytical, multi-hop, comparative |
| Estimated retrieval confidence | High (clear topic match) | Low (abstract, ambiguous) |
| Query length | < 15 words | > 15 words, multiple clauses |
| Contains comparison words | No | "compare", "difference between", "trade-offs" |
| Contains temporal references | No | "changed since", "before and after", "history of" |
The percentage varies by organization, but typically 10-30% of queries benefit from the agentic path. These are the queries that produce the most value because they are the ones engineers previously could not answer without asking a senior colleague.
Concrete examples of classification:
SIMPLE → single-shot RAG:
"What is the endpoint for the user service?" → direct lookup
"ErrorCode 4032" → exact match
"How to set up local dev for payments service?" → how-to, single topic
COMPLEX → agentic RAG:
"Compare auth v1 vs v2 and what changed after Q3" → multi-part + temporal
"Why is checkout slow?" → analytical, multiple causes
"How does payments talk to user service and what happens when user service is down?" → multi-hop, cross-service
The classifier is a lightweight LLM call (Haiku/GPT-4o-mini, ~10ms) that returns simple or complex with a confidence score. When confidence is below 0.7, default to single-shot first and fall back to agentic if retrieval confidence is low.
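The routing table's signals can also serve as a cheap heuristic pre-filter before (or alongside) the LLM classifier. The word lists and thresholds below are illustrative starting points, not tuned values:

```python
COMPARISON_WORDS = {"compare", "difference", "versus", "vs", "trade-offs", "tradeoffs"}
TEMPORAL_PHRASES = ("changed since", "before and after", "history of")

def route_query(query: str) -> str:
    """Heuristic routing mirroring the signal table above: comparison words,
    temporal references, and query length all push toward the agentic path."""
    q = query.lower()
    words = q.split()
    if any(w.strip("?,.") in COMPARISON_WORDS for w in words):
        return "agentic"
    if any(phrase in q for phrase in TEMPORAL_PHRASES):
        return "agentic"
    if len(words) > 15:
        return "agentic"
    return "single_shot"
```

In practice a heuristic like this can short-circuit the classifier call for obviously simple queries, saving a little latency on the majority path.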
12.2 Query Decomposition
Complex queries are broken into simpler sub-queries that can each be answered independently.
Original: "How does the payments service authenticate with the user service,
and what changed after the Q3 auth migration?"
Decomposed:
Sub-query 1: "Payments service authentication mechanism with user service"
Sub-query 2: "Q3 2025 authentication migration changes"
Sub-query 3: "Payments service auth changes after Q3 migration"
Each sub-query runs through the full retrieval pipeline independently. Results are merged and deduplicated before being passed to the generation step.
Implementation: A single LLM call decomposes the query. Prompt:
Given this complex question, break it into 2-5 simpler sub-questions
that can each be answered independently by searching an internal
knowledge base. Each sub-question should be self-contained.
Question: {original_query}
Estimated cost (as of early 2026): $0.002-0.005 per decomposition using a small model. Latency: 200-400ms.
12.3 Iterative Retrieval with Self-Reflection
After initial retrieval, the agent evaluates whether the retrieved context is sufficient to answer the query. If not, it formulates a follow-up search.
The loop:
- Retrieve context for the query (or sub-query).
- Ask the agent: "Given this context, can you answer the question? If not, what additional information do you need?"
- If the agent identifies gaps, it generates a refined query targeting the missing information.
- Retrieve again with the refined query.
- Repeat up to 3 iterations (configurable).
Why this matters: Initial retrieval often returns documents that are close but not quite right. The agent can recognize "I found the auth documentation but it's for the old system, I need the post-migration docs" and search more specifically.
Latency impact: Each iteration adds 500-800ms (retrieval + evaluation). A 3-iteration query takes 2.5-4.5s total. This is acceptable for complex queries where the alternative is the engineer spending 20 minutes searching manually.
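The loop above can be sketched with injected `retrieve`, `evaluate`, and `refine` callables so the LLM evaluation step stays pluggable. All function names are hypothetical:

```python
from typing import Callable

def iterative_retrieve(query: str,
                       retrieve: Callable[[str], list[str]],
                       evaluate: Callable[[str, list[str]], tuple[bool, str]],
                       refine: Callable[[str, str], str],
                       max_iterations: int = 3) -> list[str]:
    """Self-reflective retrieval: retrieve, ask whether context suffices,
    refine the query toward the identified gap, repeat up to a limit."""
    chunks: list[str] = []
    current_query = query
    for _ in range(max_iterations):
        chunks.extend(retrieve(current_query))
        sufficient, gap = evaluate(query, chunks)
        if sufficient:
            break
        current_query = refine(current_query, gap)
    return chunks
```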
12.4 Multi-Source Routing
The agent decides which knowledge sources to query based on the query type:
| Query Type | Primary Sources | Retrieval Strategy |
|---|---|---|
| API reference | GitHub README, Confluence API docs | Code-aware chunking, exact match boosted |
| Incident investigation | Slack threads, runbooks | Thread-level retrieval, recency-weighted |
| Architecture decisions | RFCs, design docs, ADRs | Semantic chunking, full-section retrieval |
| Setup instructions | Runbooks, READMEs | Recursive chunking, step-by-step retrieval |
Rather than searching the entire 10M-chunk corpus for every query, the agent narrows to relevant source types first. This reduces search space, improves precision, and cuts retrieval latency.
12.5 Tool Use via MCP
The agent interacts with the retrieval system and other data sources through MCP (Model Context Protocol) tools. Each tool is an MCP server, and the agent orchestrator is the MCP client.
Available tools:
| MCP Tool Server | Capabilities | When Used |
|---|---|---|
search-vector | Semantic search on Qdrant with filters | Every retrieval step |
search-keyword | BM25 search on Elasticsearch | Exact term queries, error codes |
search-code | AST-aware code search on GitHub index | API lookups, function signatures |
query-metadata | SQL queries on PostgreSQL metadata | "Who owns this service?", "When was this last updated?" |
calculate | Math operations for sizing/estimation queries | "How much storage does X need?" |
fetch-document | Retrieve full document by ID | When a chunk reference needs full context |
Why MCP over custom tool interfaces? Three reasons:
- Model portability. Write each tool server once. Use it with Claude, GPT, Llama, or any MCP-compatible model. No per-model tool format conversion.
- Security. MCP's OAuth 2.1 support means each tool server can enforce authentication and authorization independently. The code search tool can verify that the requesting user has access to the repo before returning results.
- Capability negotiation. At initialization, the agent discovers what tools are available and their schemas. If the code search tool is down for maintenance, the agent gracefully skips it instead of failing.
For a deeper dive on MCP architecture, transports, and security model, see the MCP server guide.
How does the LLM pick which tool to call? The agent receives tool schemas at initialization and selects based on query intent:
Query: "How does auth work?"
→ Agent selects: search-vector("authentication flow architecture")
Why: conceptual question, semantic search finds best results
Query: "ErrorCode 4032"
→ Agent selects: search-keyword("ErrorCode 4032")
Why: exact term lookup, BM25 finds the precise match
Query: "Who owns the checkout service?"
→ Agent selects: query-metadata("SELECT owner FROM services WHERE name = 'checkout'")
Why: structured data question, SQL is more precise than text search
Query: "How does the retry logic in payments-service/retry.go work?"
→ Agent selects: search-code("retry.go exponential backoff payments-service")
Why: code-specific question, AST-aware search finds function definitions
The LLM does not guess. It receives the tool descriptions and input schemas, then calls the most appropriate tool for each sub-query. Complex queries often call multiple tools in sequence.
LangGraph orchestration: The agent loop is implemented as a LangGraph state machine:
from typing import TypedDict
from langgraph.graph import StateGraph, END
class AgentState(TypedDict):
query: str
sub_queries: list[str]
retrieved_chunks: list[Chunk]
iteration: int
max_iterations: int
is_sufficient: bool
answer: str | None
def should_continue(state: AgentState) -> str:
if state["is_sufficient"] or state["iteration"] >= state["max_iterations"]:
return "generate"
return "retrieve"
graph = StateGraph(AgentState)
graph.add_node("decompose", decompose_query)
graph.add_node("retrieve", retrieve_with_tools)
graph.add_node("evaluate", evaluate_sufficiency)
graph.add_node("generate", generate_answer)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "retrieve")
graph.add_edge("retrieve", "evaluate")
graph.add_conditional_edges("evaluate", should_continue)
graph.add_edge("generate", END)
app = graph.compile()
Detailed example: a 2-iteration agent trace.
Query: "How does the payments service handle retries, and what changed after the Q3 migration?"
ITERATION 1:
Decompose → 2 sub-queries:
[1] "payments service retry strategy"
[2] "Q3 migration changes payments"
Tool calls:
search-vector("payments service retry strategy") → 4 chunks found
✓ "Exponential backoff with jitter, base delay 2s..."
✓ "Retry budgets: max_retries=3, circuit breaker at 5 failures..."
search-vector("Q3 migration changes payments") → 2 chunks found
~ "Q3 platform migration overview" (too broad)
~ "Service mesh migration timeline" (wrong service)
Evaluate: "I have strong retry docs but the Q3 migration results
are too generic. I need to search for payments-specific
migration changes."
→ is_sufficient: false
ITERATION 2:
Refined search:
search-keyword("Q3 migration payments retry") → 3 chunks found
✓ "Q3 migration: payments retry policy changed from 5→3 max retries"
✓ "Migration rollback: reverted retry timeout from 10s to 5s"
search-vector("payments Q3 breaking changes") → 1 chunk found
✓ "BREAKING: retry jitter algorithm changed from full to decorrelated"
Evaluate: "Now I have both retry strategy docs AND specific Q3 changes."
→ is_sufficient: true
GENERATE:
Context: 7 chunks (4 from iteration 1 + 3 from iteration 2)
Answer: "The payments service uses exponential backoff with jitter...
After Q3, three things changed: max_retries was reduced from
5 to 3, timeout was halved from 10s to 5s, and the jitter
algorithm switched from full to decorrelated [Source: Q3 Migration Notes]."
Total: 2 iterations, 4 tool calls, 3.2 seconds, $0.04
Notice how the agent reasons about what is missing and refines its search. Single-shot RAG would have returned the retry strategy but missed the Q3 changes entirely.
12.6 Guardrails on Agent Loops
Agentic RAG without guardrails is a cost bomb waiting to go off. An agent that keeps searching and re-searching without finding useful results can burn through tokens and latency.
Hard limits:
| Guardrail | Limit | What Happens at Limit |
|---|---|---|
| Max iterations | 3 | Fall back to best-effort answer with available context |
| Max tool calls | 15 | Stop searching, generate from what you have |
| Max tokens (input) | 50,000 | Budget exhausted, generate immediately |
| Max wall-clock time | 15 seconds | Timeout, return partial answer with apology |
| Cost cap per query | $0.10 | Circuit breaker, fall back to single-shot RAG |
Circuit breaker: If the agent detects it is looping (same retrieval results appearing twice, or evaluation scores not improving), it breaks out and falls back to single-shot RAG with the best results found so far. The user gets a slightly worse answer fast, rather than a perfect answer never.
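A sketch of the hard-limit check, with the table's limits as defaults. The names are illustrative; in a real agent loop this runs before every tool call:

```python
from dataclasses import dataclass

@dataclass
class AgentBudget:
    # Defaults taken from the guardrail table above.
    max_iterations: int = 3
    max_tool_calls: int = 15
    max_input_tokens: int = 50_000
    max_seconds: float = 15.0
    max_cost_usd: float = 0.10

def should_stop(budget: AgentBudget, iterations: int, tool_calls: int,
                input_tokens: int, elapsed_seconds: float, cost_usd: float) -> bool:
    """True when any hard limit is hit: stop searching and generate
    from whatever context has been gathered so far."""
    return (iterations >= budget.max_iterations
            or tool_calls >= budget.max_tool_calls
            or input_tokens >= budget.max_input_tokens
            or elapsed_seconds >= budget.max_seconds
            or cost_usd >= budget.max_cost_usd)
```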
13. Generation Layer
13.1 Model Routing
Not every query needs a frontier model. A tiered routing strategy can cut costs by 50-70% while maintaining answer quality, depending on your query distribution and how accurately you classify complexity.
| Tier | Model | Query Types | TTFT | Relative Cost | % of Traffic |
|---|---|---|---|---|---|
| Fast | Claude Haiku / GPT-4o-mini | Simple factual, definitions, single-doc answers | 150-300ms | Very low | ~70-80% |
| Standard | Claude Sonnet / GPT-4o | How-to, moderate reasoning, multi-chunk synthesis | 400-800ms | Moderate | ~15-25% |
| Complex | Claude Opus / GPT-4 | Multi-hop reasoning, comparative analysis, complex synthesis | 800-2000ms | High | ~5-10% |
Routing decision: The query complexity classifier (from Section 11.1) determines the tier. When in doubt, route up not down. An overqualified model wastes a few cents. An underqualified model produces a bad answer that erodes trust.
Fallback chain: If the primary provider (say, Anthropic) is down:
- Try the same tier on the secondary provider (OpenAI).
- If both API providers are down, fall back to self-hosted vLLM with Llama 4 70B (or DeepSeek V3.2).
- If self-hosted is also unavailable (hardware failure), return a degraded response: "I found these potentially relevant documents: [links]. Our answer generation service is temporarily unavailable."
The degraded response is still useful. It turns the platform from a Q&A system into a search engine, which is better than a 500 error.
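The fallback chain reduces to an ordered list of callables. This sketch assumes each provider tier is wrapped in a callable that raises on failure; all names are illustrative:

```python
from typing import Callable

def generate_with_fallback(prompt: str,
                           providers: list[tuple[str, Callable[[str], str]]],
                           degraded_links: list[str]) -> dict:
    """Try each provider in order (primary API, secondary API, self-hosted).
    If all fail, degrade to a search-style response with document links."""
    for name, call in providers:
        try:
            return {"provider": name, "answer": call(prompt)}
        except Exception:
            continue  # provider down or erroring: try the next tier
    return {"provider": "degraded",
            "answer": "I found these potentially relevant documents: "
                      + ", ".join(degraded_links)
                      + ". Our answer generation service is temporarily unavailable."}
```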
13.2 Prompt Engineering at Scale
The system prompt is the most critical piece of text in the entire platform. It determines citation behavior, hallucination boundaries, and response format.
You are an internal knowledge assistant for engineers. Answer questions
using ONLY the retrieved context provided below. Follow these rules:
1. GROUNDING: Every factual claim must be supported by the retrieved context.
If the context does not contain enough information, say so explicitly.
2. CITATIONS: Reference sources using [Source N] notation inline.
At the end, list all sources with their titles and URLs.
3. UNCERTAINTY: If you are not confident in an answer, say
"Based on the available documentation, [answer], but I'd recommend
verifying with [suggested source or team]."
4. SCOPE: Do not answer questions about topics not covered in the
retrieved context. Do not use your training data as a source.
5. FORMAT: Use markdown. Code blocks for code. Keep answers concise
but complete. Prefer bullet points for multi-step answers.
Retrieved Context:
{retrieved_chunks_with_source_labels}
Conversation History:
{recent_turns}
User Question: {query}
Prompt versioning: Prompts are stored in a version-controlled config file, not hardcoded. Each change gets a version tag. A/B testing compares prompt versions on 5-10% of traffic, measuring answer quality via LLM-as-judge scores and user feedback.
Dynamic prompt assembly: The prompt template varies based on query type:
- Factual queries get a shorter system prompt emphasizing conciseness.
- How-to queries get a prompt emphasizing step-by-step formatting.
- Analytical queries get a prompt emphasizing nuance and trade-off discussion.
13.3 Streaming Architecture
Users should not stare at a blank screen for 2 seconds. Streaming shows tokens as they are generated, reducing perceived latency.
Implementation: Server-Sent Events (SSE) from the backend to the frontend.
- The LLM provider streams tokens via its API.
- The backend forwards each token to the client over an SSE connection.
- The client renders markdown incrementally.
Citation handling in streaming: Citations are tricky to stream because the model might output "[Source 1]" across multiple token chunks. The backend buffers citation markers and only sends them to the client once the full marker is complete. The source URL resolution happens at the end of the stream.
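A minimal sketch of the marker-buffering logic: hold back any trailing text that could still grow into a complete `[Source N]` marker, and release it once the closing bracket arrives (or at stream end via `flush`). The class name is illustrative:

```python
class CitationBuffer:
    """Buffers partial citation markers like "[Source 1]" that may be
    split across streamed token chunks."""
    def __init__(self):
        self.pending = ""

    def feed(self, token: str) -> str:
        text = self.pending + token
        self.pending = ""
        start = text.rfind("[")
        # If the last "[" is unclosed and could still become "[Source N]",
        # hold that tail back and emit only the text before it.
        if start != -1 and "]" not in text[start:]:
            candidate = text[start:]
            if ("[Source"[:len(candidate)].startswith(candidate)
                    or candidate.startswith("[Source")):
                self.pending = candidate
                return text[:start]
        return text

    def flush(self) -> str:
        """Emit any held-back text at end of stream."""
        out, self.pending = self.pending, ""
        return out
```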
Latency breakdown:
Time to first token (TTFT):
Query understanding: 50-200ms
Cache check: 5-10ms
Retrieval + re-ranking: 100-200ms
Prompt assembly: 5-10ms
LLM TTFT: 200-800ms (depends on model tier)
Total TTFT: 400-1,200ms
Time to last token:
TTFT + generation time: 1.5-5s (depends on answer length)
With streaming, the user sees the first words within 400-1,200ms even though the full answer takes 2-5 seconds. This makes a huge difference in how responsive the tool feels.
13.4 Context Window Management
Context windows are a shared resource. Budget them carefully.
Conversation memory: For multi-turn conversations, include the last 2-3 turns in the prompt. If the conversation is longer, summarize older turns:
Turn 1: User asked about auth flow → Assistant explained OAuth2 integration
Turn 2: User asked about error handling → Assistant listed retry strategies
[Current turn with full context]
Summarization cost is tiny compared to generation. Even at 3 turns per conversation, it barely registers.
When to use large context models: For agentic RAG queries that accumulate 10-15 chunks across multiple retrieval iterations, a 200K context model handles the full context without truncation. But the cost is 5-10x higher per token. Use large context models only for the 5% of queries that actually need them (the "Complex" tier in model routing).
13.5 Self-Hosted LLM Serving
You do not have to use API providers. Every component in this pipeline (LLM generation, embeddings, re-ranking) has open-source alternatives that you can self-host. Open-source models have caught up. On many benchmarks they match or beat the commercial APIs as of early 2026. Here is what is available and how to run it.
Open-source LLMs for generation:
| Model | Params | Architecture | GPU Requirement | Best For |
|---|---|---|---|---|
| Llama 4 70B | 70B | Dense | 4x A100-80GB or 2x H100 | Best ecosystem support, most versatile, strong default |
| DeepSeek V3.2 | 685B MoE (37B active) | MoE | 8x A100-80GB | Reasoning-heavy queries, beats GPT-5 on benchmarks |
| Qwen3-72B | 72B | Dense | 4x A100-80GB | 119 languages, strong reasoning |
| Mistral Large 3 | 675B MoE | MoE | 8x A100-80GB | 92% of GPT-5.2 quality at ~15% the cost |
| Llama 4 8B | 8B | Dense | 1x A100-40GB or RTX 4090 | Fast tier in model routing, sub-200ms TTFT |
| DeepSeek R1 Distill 7B | 7B | Dense | 1x RTX 4090 (24GB) | Strong reasoning for its size, good for query decomposition |
Serving with vLLM: The standard way to serve open-source LLMs in production. PagedAttention reduces KV cache waste from 60-80% down to under 4%. For models too large for your GPUs, quantization (AWQ, GPTQ, or FP8) cuts memory by 2-4x. A 70B model quantized to INT4 fits on 2x RTX 4090s instead of 4x A100s. See the vLLM guide for serving commands, configuration, and quantization details.
API vs self-hosted: when does self-hosting pay off?
| Component | Use API When | Self-Host When |
|---|---|---|
| LLM generation | < 1M queries/month; want zero ops burden | > 5M queries/month; data sovereignty required; need custom fine-tuning |
| Embeddings | < 50M chunks total corpus; infrequent re-indexing | > 50M chunks; continuous re-indexing; data cannot leave your network |
| Re-ranking | Almost always (cheap at $2/1M docs) | Data cannot leave your network; need custom re-ranker fine-tuned on your domain |
14. Hallucination Mitigation
Hallucinations in an internal knowledge assistant are worse than no answer. An engineer who follows a hallucinated API endpoint will break production. Saying "I don't know" is always better than making something up.
14.1 Grounding via Retrieval
The first line of defense: instruct the model to answer only from retrieved context.
System prompt grounding: The prompt (Section 13.2) explicitly says "use ONLY the retrieved context." This reduces hallucinations dramatically compared to an ungrounded model, but it is not foolproof. Models sometimes synthesize information that "sounds right" given the context but is not actually stated.
Confidence scoring: After retrieval and re-ranking, compute an aggregate confidence score:
from statistics import mean

def compute_confidence(reranked_chunks: list[ScoredChunk]) -> float:
if not reranked_chunks:
return 0.0
top_score = reranked_chunks[0].score
score_gap = top_score - reranked_chunks[1].score if len(reranked_chunks) > 1 else 0
avg_top3 = mean([c.score for c in reranked_chunks[:3]])
# These weights are starting points. Tune them on your corpus.
# High confidence: top result is clearly relevant and well-separated
# Low confidence: top results are all mediocre or tightly clustered
confidence = (top_score * 0.5) + (score_gap * 0.2) + (avg_top3 * 0.3)
    return min(confidence, 1.0)
Three-tier response modes based on confidence:
| Confidence | Mode | Response Behavior |
|---|---|---|
| > 0.6 | Full answer | Generate grounded answer with citations. Standard path. |
| 0.3 - 0.6 | Hedged answer | Generate answer but prepend: "Based on what I found, [answer]. I'd recommend verifying with [team/owner] directly." Include source links prominently. |
| < 0.3 | Abstention | Skip generation entirely. Return: "I couldn't find reliable information about this. Here are the closest matches: [links]. Try asking in #[relevant-slack-channel]." |
This matters more than it sounds. Most RAG systems only have two modes: answer or error. The hedged middle tier is where you prevent the worst hallucinations while still being useful. About 15-20% of queries land in this tier, and users actually appreciate the honesty.
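The thresholds map to a trivial dispatch function. The cutoffs mirror the table above and should be re-tuned on your own confidence distribution:

```python
def response_mode(confidence: float) -> str:
    """Map retrieval confidence to one of three response modes:
    full answer, hedged answer, or abstention."""
    if confidence > 0.6:
        return "full_answer"
    if confidence >= 0.3:
        return "hedged_answer"
    return "abstain"
```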
Multi-model verification (optional, for high-stakes queries): For the Complex tier (5% of queries), run the generated answer through a second, cheaper model with the prompt: "Does this answer accurately reflect the provided context? Flag any claims not supported by the sources." Cost is low per query. This can catch a significant portion of hallucinations that slip past NLI checking, particularly subtle misrepresentations where NLI models see "neutral" but a reasoning LLM recognizes the claim is misleading.
14.2 Citation and Attribution
Every factual claim in the response must cite its source. This is enforced at three levels:
Level 1: Prompt instruction. The system prompt requires inline [Source N] citations.
Level 2: Structured output. For the highest-quality responses, use the LLM's structured output mode to enforce a JSON schema:
{
"answer": "The payments service uses OAuth2 for authentication [Source 1]. After the Q3 migration, it switched to mutual TLS for service-to-service calls [Source 2].",
"citations": [
{"id": 1, "chunk_id": "abc123", "title": "Payments Auth Guide", "url": "https://..."},
{"id": 2, "chunk_id": "def456", "title": "Q3 Auth Migration RFC", "url": "https://..."}
],
"confidence": 0.85,
"needs_verification": false
}
Level 3: Post-generation citation verification. After generation, verify that each cited source actually supports the claim made. A fast LLM call checks: "Does [chunk text] support the claim [extracted claim]?" This catches cases where the model cites a source but the cited text does not actually say what the answer claims.
Citation verification adds a small cost per query. Applied to all queries routed to the Standard and Complex tiers (20% of traffic).
14.3 Guardrails and Validation
NLI (Natural Language Inference) checking: Run the generated answer through an NLI model that classifies each claim as "entailed", "neutral", or "contradicted" by the retrieved context.
- Entailed: The context supports the claim. Good.
- Neutral: The context does not address the claim. Flag as potentially hallucinated.
- Contradicted: The context says the opposite. Block the response, re-generate with a stronger grounding instruction.
NLI models like deberta-v3-large can run in under 20ms with a warm model on GPU and add minimal latency to the pipeline. Use NLI on every query (cheap and fast). Use multi-model verification from Section 14.1 only on the Complex tier (expensive but catches subtler misrepresentations that NLI misses).
Entity grounding checks: Extract named entities from the response (service names, API endpoints, configuration keys) and verify they appear in the retrieved context or the broader document corpus. A response that references "the AuthService" when no document mentions that name is likely hallucinating.
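A rough sketch of the entity check, using CamelCase identifiers as a proxy for service and class names. A production system would use a proper NER model or identifier extractor; the regex and function name here are illustrative:

```python
import re

def ungrounded_entities(answer: str, context: str) -> list[str]:
    """Return CamelCase identifiers mentioned in the answer that never
    appear in the retrieved context: likely hallucinated names."""
    entities = set(re.findall(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b", answer))
    return sorted(e for e in entities if e not in context)
```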
Toxicity and safety filters: While less critical for internal tools, filter responses that contain PII, credentials, or secrets that may appear in source documents. A regex-based scanner checks for patterns like API keys, passwords, and internal IP addresses before returning the response.
14.4 Structured Output Enforcement
Force the LLM to output in a structured format (JSON mode or tool calling) to ensure consistent citation formatting. This prevents the model from "forgetting" to cite sources in some responses.
response_schema = {
"type": "object",
"properties": {
"answer_markdown": {"type": "string"},
"sources_used": {
"type": "array",
"items": {
"type": "object",
"properties": {
"chunk_id": {"type": "string"},
"relevance": {"type": "string", "enum": ["high", "medium", "low"]},
"quote": {"type": "string"} # Exact quote from source
}
}
},
"confidence_level": {"type": "string", "enum": ["high", "medium", "low"]},
"follow_up_suggestions": {"type": "array", "items": {"type": "string"}}
},
"required": ["answer_markdown", "sources_used", "confidence_level"]
}
The quote field is particularly valuable: it forces the model to ground each citation in a specific passage from the source, making hallucination harder.
15. Evaluation and Feedback System
If you are not measuring answer quality, you are guessing. "It looks good in demos" is not a metric. Evaluation is a production requirement, not a nice-to-have.
15.1 Offline Evaluation
Golden dataset: A curated set of 500-1,000 question-answer pairs with expected source documents. The dataset covers:
- Simple factual questions (40%)
- How-to questions (25%)
- Analytical questions (20%)
- Multi-hop questions (15%)
Each entry includes:
{
"question": "How does the payments service handle idempotency?",
"expected_answer_contains": ["idempotency key", "header", "X-Idempotency-Key"],
"expected_source_docs": ["payments-api-guide", "payments-rfc-042"],
"category": "factual",
"difficulty": "medium"
}
What does "correct" mean? It depends on query type. Factual queries need exact accuracy. How-to queries need step completeness. Analytical queries need reasoning coherence. Measure each separately or your aggregate score hides the failures that matter.
Metrics (RAGAS framework):
| Metric | What It Measures | Target |
|---|---|---|
| Faithfulness | Does the answer only contain information from the context? | > 0.90 |
| Answer relevance | Does the answer address the question? | > 0.85 |
| Context relevance | Are the retrieved chunks relevant to the question? | > 0.80 |
| Context recall | Do the retrieved chunks contain the information needed? | > 0.80 |
Retrieval-specific metrics:
| Metric | Definition | Target |
|---|---|---|
| Recall@5 | % of relevant docs in top 5 results | > 75% |
| Recall@10 | % of relevant docs in top 10 results | > 85% |
| MRR (Mean Reciprocal Rank) | Average 1/rank of first relevant result | > 0.70 |
| NDCG@10 | Normalized discounted cumulative gain | > 0.75 |
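These retrieval metrics are small enough to implement directly (binary relevance assumed). A sketch:

```python
import math

def recall_at_k(relevant: list[str], ranked: list[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(relevant) & set(ranked[:k])) / len(relevant)

def mrr(relevant: list[str], ranked: list[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(relevant: list[str], ranked: list[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

In a full evaluation harness, `mrr` and the others would be averaged over the whole golden dataset rather than computed per query.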
Regression testing: The golden dataset runs automatically on every change to: chunking logic, embedding model, retrieval parameters, re-ranking model, or system prompt. If any metric drops by more than 2 percentage points, the pipeline blocks deployment and alerts the team. This is the RAG equivalent of a CI test suite.
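The blocking logic itself can be a small pure function that the CI pipeline calls. A sketch of the 2-percentage-point gate described above (metric names and structure are illustrative):

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    max_drop_pp: float = 2.0) -> list[str]:
    """Return the metrics that regressed by more than max_drop_pp percentage
    points versus baseline. A non-empty result blocks deployment."""
    failures = []
    for metric, base in baseline.items():
        drop_pp = (base - candidate.get(metric, 0.0)) * 100
        if drop_pp > max_drop_pp:
            failures.append(metric)
    return failures
```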
15.2 Online Evaluation
Production queries are messier, more varied, and more ambiguous than any golden dataset. Online evaluation catches issues that offline benchmarks miss.
User feedback:
- Explicit: Thumbs up/down on every response. Optional text correction ("This is wrong because..."). Copy-to-clipboard events as positive signal.
- Implicit: Time to next query (< 10s suggests the answer was insufficient). Session abandonment (user leaves without interacting). Follow-up questions that rephrase the original (suggests the first answer missed the mark).
LLM-as-judge: Sample 5-10% of production queries and run them through an evaluation prompt:
You are evaluating the quality of a RAG system's response.
Question: {question}
Retrieved Context: {chunks}
System Response: {answer}
Rate on these dimensions (1-5):
1. Correctness: Is the answer factually correct given the context?
2. Completeness: Does the answer fully address the question?
3. Citation quality: Are sources properly cited and relevant?
4. Clarity: Is the answer clear and well-structured?
Also flag:
- Any hallucinated claims (not supported by context)
- Missing information that was in the context but not in the answer
LLM-as-judge cost scales with your sampling rate and chosen model. At 5% sampling on 10M queries/day, that is 500K evaluations per day. Recalculate with current model pricing.
A/B testing: When testing a new prompt version, embedding model, or retrieval parameter, route 5-10% of traffic to the variant. Compare LLM-as-judge scores and user feedback rates between control and treatment. Require statistical significance (p < 0.05) before rolling out changes.
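For binary signals like thumbs-up rate, the significance check is a standard two-proportion z-test. A stdlib-only sketch (in practice you would likely use scipy or your experimentation platform; this shows the math):

```python
import math

def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success rates (e.g. thumbs-up %)
    between control (a) and treatment (b), using a pooled z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value via the normal CDF (expressed with erf)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Roll out the variant only when the p-value clears your threshold (p < 0.05 here) and the effect direction is positive.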
15.3 Continuous Improvement Loop
Evaluation data feeds back into the system:
- Low-rated answers are reviewed weekly by the platform team. Common failure patterns (bad chunking on a specific document type, missing source, etc.) become tasks.
- User corrections are added to the golden dataset, expanding coverage over time.
- Retrieval failures (queries where none of the top-10 chunks were relevant) trigger chunking quality audits on the source documents.
- Model upgrade evaluation: Before upgrading the LLM (e.g., Sonnet 3.5 to Sonnet 4), run the full golden dataset benchmark and compare. Only upgrade if quality improves or holds steady.
- Feedback into retrieval ranking. This is the loop most teams skip. When a user downvotes an answer, log which chunks were retrieved. Over time, chunks that consistently appear in downvoted answers get a negative signal. This can feed into: (a) chunk quality scoring (deprioritize low-quality chunks during re-ranking), (b) dynamic alpha adjustment (if code queries consistently get bad feedback, shift the BM25 weight higher for that query type), or (c) re-chunking triggers (if a specific document's chunks keep failing, flag it for re-chunking with a different strategy).
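The chunk-level negative signal in (a) can be as simple as an exponentially decayed downvote ratio per chunk, subtracted from the re-ranker score. A hypothetical sketch (class and method names are ours, not from any library):

```python
from collections import defaultdict

class ChunkFeedbackTracker:
    """Tracks how often each chunk appears in downvoted answers.
    Higher score = worse; re-ranking can subtract it as a penalty."""
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.score = defaultdict(float)

    def record(self, chunk_ids: list[str], downvoted: bool) -> None:
        # Exponential moving average of the downvote signal per chunk
        signal = 1.0 if downvoted else 0.0
        for cid in chunk_ids:
            self.score[cid] = self.decay * self.score[cid] + (1 - self.decay) * signal

    def penalty(self, chunk_id: str) -> float:
        return self.score[chunk_id]
```

Chunks whose penalty stays high over many sessions are the candidates for the re-chunking trigger in (c).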
Measure retrieval and generation separately. A bad answer has two possible causes: bad retrieval (the right chunks were not found) or bad generation (the right chunks were found but the LLM misused them). If you only measure end-to-end quality, you cannot tell which one failed. Track retrieval metrics (recall@10, MRR) and generation metrics (faithfulness, citation accuracy) independently. When quality drops, check retrieval first. If retrieval recall is fine, the problem is in generation (prompt, model, or guardrails). If retrieval recall dropped, the problem is upstream (chunking, embedding, or index health).
16. Observability
You cannot fix what you cannot see. When a user reports a bad answer, you need to know exactly where the pipeline went wrong: was it retrieval, re-ranking, context assembly, or generation?
16.1 Pipeline Tracing (OpenTelemetry)
Each query generates an OpenTelemetry trace that spans the entire pipeline:
Trace: query-12345
├── Span: query_understanding (50ms)
│ ├── Attribute: query_type = "how-to"
│ ├── Attribute: rewritten = true
│ └── Attribute: cache_hit = false
├── Span: retrieval (130ms)
│ ├── Span: vector_search (40ms)
│ │ └── Attribute: results_count = 50
│ ├── Span: keyword_search (15ms)
│ │ └── Attribute: results_count = 30
│ ├── Span: rrf_merge (2ms)
│ │ └── Attribute: merged_count = 65
│ ├── Span: acl_filter (5ms)
│ │ └── Attribute: filtered_out = 8
│ └── Span: rerank (68ms)
│ └── Attribute: top5_avg_score = 0.82
├── Span: generation (920ms)
│ ├── Attribute: model = "claude-sonnet-4-6"
│ ├── Attribute: input_tokens = 4200
│ ├── Attribute: output_tokens = 380
│ ├── Attribute: cost = $0.012
│ └── Attribute: ttft = 450ms
└── Span: guardrails (25ms)
├── Attribute: nli_check = "passed"
├── Attribute: citations_verified = true
└── Attribute: confidence = 0.85
This trace shows where time is spent and where quality signals come from. When a user reports a bad answer, pull the trace by query ID and see exactly what happened: what was retrieved, what the scores looked like, which model generated the answer, and what the guardrails caught (or missed).
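In a real deployment this is the OpenTelemetry SDK. To keep the example dependency-free, here is a simplified stand-in that captures the same span shape (name, duration, attributes) so the structure above is concrete:

```python
import time
from contextlib import contextmanager

class MiniTracer:
    """Simplified stand-in for OpenTelemetry spans: records each span's
    name, attributes, and duration in milliseconds."""
    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name: str, **attributes):
        start = time.perf_counter()
        record = {"name": name, "attributes": dict(attributes)}
        try:
            yield record  # caller can add attributes as the stage runs
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)

# Usage mirroring the trace above:
tracer = MiniTracer()
with tracer.span("retrieval") as s:
    s["attributes"]["results_count"] = 50
```

With the real SDK, `tracer.start_as_current_span(...)` and `span.set_attribute(...)` play these roles, and spans nest automatically into the tree shown above.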
16.2 Key Metrics Dashboard
| Metric | Granularity | Alert Threshold |
|---|---|---|
| End-to-end P50/P95/P99 latency | By model tier, by query type | P95 > 3s |
| TTFT (time to first token) | By model tier | P95 > 2s |
| Retrieval latency | By search type (vector, keyword, hybrid) | P99 > 300ms |
| Re-ranking latency | By re-ranker model | P99 > 200ms |
| Cache hit rate | Overall and by query pattern | Drop below 10% |
| Token usage (input + output) | By model, by team, by query type | Daily total > 2x average |
| Cost per query | By model tier | Exceeds baseline by 2x |
| Retrieval relevance (top-5 avg score) | Overall | Average drops below 0.6 |
| User feedback ratio (thumbs up %) | Overall, rolling 7-day | Drops below 70% |
| Hallucination rate (LLM-as-judge) | Rolling 7-day | Rises above 8% |
| Error rate (5xx) | By pipeline stage | > 0.1% |
| Embedding pipeline lag | Time since last processed document | > 30 minutes |
16.3 Cost Monitoring
LLM costs can spike unexpectedly. A single runaway agent loop, a prompt regression that increases output length, or a cache invalidation event can double costs overnight.
Per-query cost tracking: Every query logs its total cost (embedding + retrieval + re-ranking + generation). This is computed from token counts and model pricing, not estimated.
Cost anomaly alerting: If the rolling 1-hour cost exceeds 2x the expected baseline, trigger an alert. If it exceeds 5x, trigger a circuit breaker that routes all queries to the cheapest model tier until the issue is investigated.
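The threshold logic is deliberately simple. A sketch of the 2x/5x policy above (function name and return values are illustrative):

```python
def cost_action(rolling_1h_cost: float, baseline_1h_cost: float) -> str:
    """Map rolling-hour spend against baseline to an action:
    >5x opens the circuit breaker (cheapest tier only), >2x alerts."""
    if rolling_1h_cost > 5 * baseline_1h_cost:
        return "circuit_break"
    if rolling_1h_cost > 2 * baseline_1h_cost:
        return "alert"
    return "ok"
```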
Per-tenant budgets: Each team gets a monthly token budget. When 80% is consumed, the admin gets a warning. At 100%, queries are throttled (increased latency, not denied) unless the budget is increased.
16.4 Embedding Drift Detection
Embedding model behavior can change without warning. Provider-side updates, model deprecation, or subtle API changes can shift the vector space. If the embedding for "authentication" silently shifts by 10%, retrieval quality degrades without any obvious error.
Detection: Weekly, compute embeddings for a fixed set of 100 canonical queries. Compare cosine similarity to the reference embeddings computed when the system was last validated. If any query's embedding shifts by more than a threshold (typically cosine distance > 0.05), alert the team.
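A stdlib-only sketch of that weekly check (function names and the 0.05 threshold are as described above; everything else is illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def drifted_queries(reference: dict[str, list[float]],
                    current: dict[str, list[float]],
                    threshold: float = 0.05) -> list[str]:
    """Canonical queries whose fresh embedding moved more than `threshold`
    cosine distance from the validated reference embedding."""
    return [q for q, ref in reference.items()
            if cosine_distance(ref, current[q]) > threshold]
```

A non-empty result triggers the alert; a widespread shift usually means the provider changed the model and a re-validation (or re-index) is due.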
17. Production Architecture and Bottlenecks
17.1 Scaling Strategy
| Component | Scaling Dimension | Trigger | Mechanism |
|---|---|---|---|
| Qdrant | Shard count | Vector count > 5M per shard | Add shard, rebalance |
| Elasticsearch | Node count | Index size > 50 GB per node | Add node, rebalance |
| Embedding workers | Worker count | Kafka lag > 10,000 | HPA on Kubernetes |
| LLM inference pool | Concurrent requests | Queue depth > 50 | Scale vLLM replicas |
| API servers | Pod count | QPS > 100 per pod | HPA on CPU/QPS |
Load shedding: Under extreme load (> 1,500 QPS sustained), the platform progressively degrades:
- Disable agentic RAG (route all queries to single-shot).
- Reduce re-ranking from top-20 to top-5.
- Force all queries to the cheapest model tier.
- Disable LLM-as-judge sampling.
- As a last resort, serve cached-only responses and return "system under heavy load" for cache misses.
Each step is triggered by a progressively higher load threshold. The user always gets a response. The response quality degrades gracefully.
17.2 Failure Handling
| Failure | Detection | Mitigation | Recovery |
|---|---|---|---|
| LLM provider outage | API errors > 5% in 1 min | Failover to secondary provider | Auto-retry primary every 30s |
| Qdrant node failure | Health check timeout | Read from replica shard | Node auto-restarts, shard rebalances |
| Elasticsearch down | Health check timeout | Fall back to vector-only search | Cluster self-heals |
| Embedding API outage | API errors | Queue documents in Kafka, process later | Backfill when API recovers |
| Redis cache failure | Connection timeout | Skip cache, full pipeline for every query | Reconnect, warm cache gradually |
| PostgreSQL failure | Connection pool errors | Read from replica for permissions | Primary failover (managed DB) |
Key principle: The system should always return something useful. Never show a blank error page.
Degradation ladder: When components fail, the system steps down through progressively simpler modes. Each level still returns a useful response:
Level 0: Full pipeline (normal operation)
↓ re-ranker timeout or error rate > 10%
Level 1: Skip re-ranking (use RRF scores directly, ~5-10% quality drop)
↓ Qdrant unavailable
Level 2: BM25-only retrieval (keyword search still works, semantic matching lost)
↓ Elasticsearch also down
Level 3: Serve cached responses only (Redis still up, covers 5-25% of queries)
↓ Redis also down
Level 4: Static fallback (return links to top-50 most-accessed docs with search bar)
Circuit breakers per dependency: Each external dependency (LLM API, embedding API, Qdrant, Elasticsearch, Cohere re-ranker) has its own circuit breaker. When error rate exceeds 50% over a 30-second window, the breaker opens and the system skips that component for 60 seconds before retrying. This prevents cascading failures where one slow dependency backs up the entire pipeline.
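A minimal sliding-window breaker implementing the 50%-over-30s open / 60s cooldown policy (with an added minimum-sample guard, our assumption, so a single early error does not trip it; all names are ours):

```python
import time

class CircuitBreaker:
    """Opens when the error rate over a sliding window exceeds a threshold;
    the dependency is skipped for cooldown_s seconds, then retried."""
    def __init__(self, error_threshold=0.5, window_s=30, cooldown_s=60,
                 min_samples=5, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples
        self.clock = clock          # injectable for testing
        self.events = []            # (timestamp, ok) pairs
        self.opened_at = None

    def record(self, ok: bool) -> None:
        now = self.clock()
        self.events.append((now, ok))
        # Keep only events inside the sliding window
        self.events = [(t, o) for t, o in self.events if now - t <= self.window_s]
        errors = sum(1 for _, o in self.events if not o)
        if (len(self.events) >= self.min_samples
                and errors / len(self.events) > self.error_threshold):
            self.opened_at = now

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Half-open: let one retry through and reset the window
            self.opened_at = None
            self.events = []
            return True
        return False
```

One breaker instance per dependency (LLM API, embedding API, Qdrant, Elasticsearch, re-ranker) keeps a slow component from backing up the whole pipeline.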
Multi-region note: This design assumes a single region. For 99.99% availability or regulatory requirements, you need active-passive replication of Qdrant and Elasticsearch across regions, with Kafka MirrorMaker for cross-region event streaming. That adds significant complexity and cost. For most enterprise deployments, 99.9% in a single region is sufficient.
17.3 Multi-Tenant Isolation
Data isolation: Each tenant (engineering team) gets its own namespace in Qdrant and permission-scoped queries in Elasticsearch. Cross-tenant data leakage is prevented by mandatory tenant_id filtering on every query.
Cost isolation: Token usage and costs are tracked per tenant. Monthly budgets are enforced at the API gateway level. One team's spike in usage does not affect other teams' latency because rate limiting is per-tenant.
Performance isolation: Noisy neighbor protection via per-tenant rate limiting (token bucket algorithm). If the platform engineering team decides to index 500K new documents in one day, the ingestion pipeline's Kafka partitioning ensures this does not slow down query serving for other teams.
17.4 Rate Limiting and Backpressure
Query rate limiting: Token bucket per tenant. Default: 10 QPS per team, burstable to 50 QPS for 30 seconds. Adjustable per team based on size and usage patterns.
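A plain token bucket approximates this policy: tokens refill at the sustained rate and the bucket capacity bounds the burst. A sketch (the clock injection is for testability; parameters map to the per-tenant defaults above):

```python
import time

class TokenBucket:
    """Per-tenant rate limiter: refill at `rate` tokens/sec up to `burst`
    capacity. A request is allowed only if a token is available."""
    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = burst
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```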
Ingestion backpressure: When Kafka consumer lag exceeds a threshold, the embedding pipeline signals connectors to slow down. This prevents an unbounded queue from building up during bulk ingestion events.
Priority queues: Three priority levels:
| Priority | Traffic Type | Under Load Behavior |
|---|---|---|
| P0 (critical) | Interactive user queries | Always served; may route to smaller model |
| P1 (normal) | Slack bot queries, IDE integration | Queued; dropped after 10s timeout |
| P2 (background) | Batch eval, re-indexing, cache warming | Paused entirely under load; resumed when QPS drops below threshold |
When the LLM queue grows: If queue depth exceeds 50 pending requests: (1) force all new queries to the cheapest model tier, (2) drop P1 and P2 traffic, (3) if queue still grows past 200, return cached or degraded responses for P0. This prevents a 30-second wait that makes users think the system is broken.
17.5 Bottlenecks and Mitigation
| Bottleneck | Symptom | Relief |
|---|---|---|
| LLM inference at peak | P95 latency > 3s, queue depth grows | Model routing (80% to small model), semantic caching, pre-computed answers for top-100 queries |
| Vector search with complex ACL filters | Retrieval P99 > 200ms | Payload indexes on permission fields, consider per-tenant collections for large tenants |
| Embedding pipeline backlog | Document freshness > 15 min SLA | Scale embedding workers horizontally, batch size optimization, parallel processing |
| Context window limits | Answers miss relevant info that was retrieved | Chunk summarization before context assembly, priority-based chunk selection, larger context models for complex queries |
| Re-ranking latency on large result sets | Re-rank step > 150ms | Reduce initial retrieval from top-100 to top-30, self-hosted GPU re-ranker with batching |
| Cache cold starts (Monday morning, after deployments) | Cache hit rate drops to 0%, latency spikes | Pre-warm cache with top-1000 queries from previous week, gradual rollout after cache invalidation |
| Cross-encoder model loading | First query after cold start takes 5-10s | Keep re-ranker model warm with periodic health check queries, pre-load on pod startup |
18. Cost Analysis
Model pricing changes quarterly. Specific dollar amounts go stale within months. This section covers the cost patterns that hold true even as prices change.
Cost Structure
LLM inference typically represents 60-85% of total platform cost. Everything else — vector databases, Elasticsearch, Kafka, Redis, embedding generation, re-ranking — is a rounding error in comparison. This ratio has held steady even as absolute prices have dropped.
Approximate cost breakdown by category:
| Category | % of Total Cost | What Drives It |
|---|---|---|
| LLM inference | 60-85% | Query volume, model tier mix, average prompt size |
| Infrastructure (vector DB, ES, Kafka, Redis, PG) | 5-15% | Corpus size, query throughput, replication factor |
| Evaluation (LLM-as-judge) | 3-10% | Sampling rate, evaluator model choice |
| Embedding generation | < 1% | Only spikes during full re-indexing |
| Re-ranking | < 1% | Fixed per-query cost, scales linearly |
Model Routing: The Biggest Cost Lever
Not every query needs a frontier model. A tiered routing strategy cuts LLM costs by 50-70% depending on query distribution and classification accuracy.
The key insight: 60-80% of enterprise queries are simple factual lookups that a small, fast model handles perfectly. Only 5-10% require frontier-model reasoning. Route based on query complexity, not uniformly.
| Tier | Query Types | Relative Cost | % of Traffic |
|---|---|---|---|
| Fast | Simple factual, definitions, single-doc answers | 1x (baseline) | ~70-80% |
| Standard | How-to, moderate reasoning, multi-chunk synthesis | 5-10x | ~15-25% |
| Complex | Multi-hop reasoning, comparative analysis | 15-30x | ~5-10% |
Caching: The Second Cost Lever
Semantic caching avoids redundant LLM calls entirely. Enterprise knowledge platforms typically see 5-25% cache hit rates, with higher rates at larger organizations where query patterns repeat. Each cache hit saves the full LLM inference cost for that query.
Self-Hosted vs API: Decision Logic
| Component | Use API When | Self-Host When |
|---|---|---|
| LLM generation | Low-to-moderate query volume; want zero ops burden | High query volume; data sovereignty required; need custom fine-tuning |
| Embeddings | Moderate corpus size; infrequent re-indexing | Large corpus; continuous re-indexing; data cannot leave your network |
| Re-ranking | Almost always — low cost, high value | Data cannot leave your network; need domain-specific fine-tuning |
The crossover point where self-hosting LLM inference becomes cheaper than APIs depends on your query volume, GPU pricing, and engineering ops cost. As a rule of thumb: below ~2M queries/month, APIs are almost always cheaper because you are not paying for idle GPUs. Above ~5M queries/month, self-hosting typically saves 30-60% on LLM costs but requires dedicated MLOps capacity.
Check current pricing from your providers and run the math for your specific scale before committing.
19. Security and Governance
For document-level access control implementation (pre-filtering vs post-filtering, permission sync), see Section 11.4.
Prompt Injection Defense
Internal users are less likely to attempt prompt injection than external users, but it still happens accidentally. An engineer pastes a document containing "ignore all previous instructions" into a query, and the model complies.
Defense layers:
- Input sanitization: Strip known injection patterns from queries. Regex-based, not foolproof, but catches obvious cases.
- Instruction hierarchy: The system prompt uses a clear hierarchy: system instructions > retrieved context > user query. Modern models respect this hierarchy when explicitly told to.
- Output validation: The guardrails layer (Section 14.3) catches responses that deviate from expected format or contain unexpected instructions.
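The regex layer of the defenses above can be sketched in a few lines. The patterns below are hypothetical examples, and as noted, regex alone catches only the obvious cases:

```python
import re

# Illustrative patterns only; real deployments maintain a larger, curated list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now\b", re.IGNORECASE),
]

def flag_injection(query: str) -> bool:
    """Flag queries containing known injection phrasings for sanitization
    or closer scrutiny before they reach the prompt."""
    return any(p.search(query) for p in INJECTION_PATTERNS)
```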
PII and Secrets Handling
Internal documents frequently contain PII, credentials, and secrets. The platform must not amplify their exposure.
Ingestion-time scanning: Before indexing, scan documents for patterns matching API keys, passwords, tokens, and PII (emails, phone numbers). Flag but do not block. Store a contains_sensitive flag on the chunk. During retrieval, warn the user if the answer sources contain sensitive content.
Response-time redaction: Before returning a response to the user, scan for secret patterns (AWS keys, GitHub tokens, database passwords). Redact with [REDACTED - credential detected]. This prevents the LLM from inadvertently surfacing credentials in its answers.
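A sketch of that response-time redaction pass. The patterns are illustrative examples; production scanners use far larger rule sets (detect-secrets-style detectors, entropy checks):

```python
import re

# Hypothetical pattern subset: AWS access key IDs, GitHub PATs,
# and connection strings with embedded passwords.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"ghp_[A-Za-z0-9]{36}"),
    re.compile(r"postgres://\S+:\S+@\S+"),
]

def redact_secrets(text: str) -> str:
    """Replace any matched credential with the redaction marker before
    the response leaves the platform."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED - credential detected]", text)
    return text
```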
Audit Logging
Every query, retrieval result, generation, and feedback event is logged to an append-only audit store. Logs include:
- Who asked the question (authenticated user ID)
- What documents were retrieved and shown
- What answer was generated
- What feedback was given
- What model and prompt version were used
These logs serve compliance requirements and enable forensic analysis if a data access incident occurs. Retention: 1 year minimum, per organizational policy.
For a deeper dive on LLM security patterns, prompt injection defense, and data privacy architectures, see the LLM data privacy and security guide.
20. Where This System Fails in Production
No architecture post is complete without an honest look at how it breaks. These are the failure chains we have seen or heard about from teams running RAG systems at scale:
| Failure Chain | What Happens | How You Detect It |
|---|---|---|
| Bad chunking -> wrong retrieval -> confident wrong answer | A design doc gets split mid-paragraph. Retrieval returns a chunk that mentions "auth" but lacks the actual flow. The LLM generates a plausible-sounding but incorrect answer, confidently cited. | Low faithfulness scores in LLM-as-judge; user downvotes on specific document types |
| ACL bug -> data leak | A permission sync fails silently. An engineer on team A sees answers sourced from team B's confidential docs. One incident like this kills platform trust permanently. | Cross-tenant query audits; periodic permission reconciliation checks |
| Embedding drift after model upgrade | Provider silently updates the embedding model. Query vectors shift slightly. Retrieval recall degrades by 5-10% with no error, no alert. Answers just get subtly worse. | Weekly canonical query drift detection (Section 16.4) |
| Cache serving stale answers | A runbook gets updated but the cache still serves the old answer for 24 hours. An engineer follows outdated instructions during an incident. | Cache-to-source freshness monitoring; stale-but-serveable flagging |
| Query rewriting makes things worse | The rewriter expands "k8s OOM" to "Kubernetes out of memory error handling best practices." The expanded query retrieves generic content instead of the specific internal runbook. | A/B test rewriter on vs off; monitor retrieval scores with and without rewriting |
| Agent loop burns tokens without converging | An agentic query keeps searching, finding slightly different but never sufficient context. Three iterations later, answer is no better than single-shot. | Per-query cost tracking; agent iteration count dashboards; circuit breaker alerts |
The checklist in Section 21 catches most of these before they hit production. But some will slip through. Your RAG system will fail. The question is whether you find out from your dashboards or from an angry Slack message.
21. Production Readiness Review Checklist
Every major engineering organization runs production readiness reviews before shipping systems. Amazon calls theirs an Operational Readiness Review (ORR). Google's SRE team has a production readiness checklist. Stripe, Uber, and similar companies use internal design review templates. They all evaluate the same dimensions: reliability, scalability, latency, data correctness, security, observability, and cost.
No universal checklist exists for RAG/LLM systems specifically. The ten areas below are adapted from these industry practices, extended with AI-specific concerns (retrieval quality, hallucination control, embedding lifecycle, agent guardrails) that traditional readiness reviews do not cover.
A note on weighting. The point values below are not industry-standard weights. No such standard exists. The importance of each area depends on your system. For RAG/LLM platforms specifically:
| Area | Weight for RAG Systems | Why |
|---|---|---|
| Retrieval quality | Very high | Bad retrieval = bad answers. No LLM fixes this. |
| Hallucination control | Very high | Wrong answers are worse than no answers. |
| Cost and observability | High | LLM inference costs dominate and can spike without warning. |
| Security and governance | High | Internal docs contain sensitive data; access control is non-negotiable. |
| Resilience | Medium-high | Degraded answers are acceptable; total outages are not. |
| Operational readiness | Medium | Important but less unique to RAG than to any production system. |
Treat this checklist as a guiding framework. The scores tell you where your gaps are. The specific thresholds are directional, not pass/fail gates.
How to Score
Each check: 0 (missing), 1 (partial), 2 (complete).
| Score | Readiness |
|---|---|
| 120+ | Production-ready. Ship with confidence. |
| 90-119 | Ship with known gaps documented. Address the gaps within 30 days. |
| 60-89 | Not ready. Critical gaps likely in retrieval quality, safety, or observability. |
| < 60 | Prototype stage. Needs significant work before production traffic. |
21.1 Ingestion and Chunking (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 1 | Multiple chunking strategies per document type | Not one-size-fits-all; at least 3 strategies active | 8.6 |
| 2 | Chunk quality validation | Manual review of 100+ random chunks; retrieval recall measured | 8 |
| 3 | Incremental ingestion with change detection | Not full re-index on every update; webhook + polling reconciliation | 7.1 |
| 4 | Document deduplication | Content-hash or SimHash dedup pipeline active | 7.3 |
| 5 | Metadata extraction | Author, date, permissions, source extracted and stored | 7.2 |
| 6 | Embedding model versioned | Version tracked; blue-green re-indexing strategy documented | 9.3 |
| 7 | Ingestion latency SLA defined and monitored | Freshness SLA (e.g., 15 min) measured in dashboards | 16.2 |
| 8 | Failure handling | Poison documents don't block the pipeline; dead letter queue active | 7 |
| 9 | Source connector health monitoring | Each connector reports status; alerts on sync failures or quota exhaustion | 7.1 |
| 10 | Backfill and replay capability | Can re-process any source from a specific date; Kafka replay or equivalent | 7.1 |
21.2 Retrieval Quality (18 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 11 | Hybrid retrieval | BM25 + vector search with score fusion (RRF or weighted) | 11.2 |
| 12 | Re-ranking with cross-encoder | Top-k results re-ranked before context assembly | 11.3 |
| 13 | Retrieval recall measured | Recall@10 > 80% on golden dataset | 15.1 |
| 14 | Access control filtering | Pre-filtering on permissions; no leaked documents in results | 11.4 |
| 15 | Query rewriting/expansion | Ambiguous queries rewritten before retrieval | 11.1 |
| 16 | "Lost in the middle" mitigation | Chunk ordering optimized in context window | 11.5 |
| 17 | Retrieval latency P99 < 350ms | Measured in production dashboards | 16.2 |
| 18 | Recency weighting for time-sensitive queries | Recent docs boosted for queries with temporal signals | 11.1 |
| 19 | Multi-source routing | Agent selects relevant sources instead of searching everything | 12.4 |
21.3 Generation and Hallucination Control (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 20 | Model routing active | Not every query hits the most expensive model; at least 2 tiers | 13.1 |
| 21 | Grounding instruction in system prompt | "Answer only from retrieved context" explicitly stated | 14.1 |
| 22 | Inline citations with source links | Every response includes [Source N] with clickable URLs | 14.2 |
| 23 | "I don't know" path | Low confidence triggers explicit uncertainty message | 14.1 |
| 24 | Post-generation factual consistency check | NLI or citation verification on Standard/Complex tier | 14.3 |
| 25 | Streaming responses | SSE streaming with sub-2s TTFT | 13.3 |
| 26 | Context window budget documented | Token allocation for system + context + history + output | 13.4 |
| 27 | Structured output enforcement | JSON mode or schema validation on citations | 14.4 |
| 28 | Prompt versioning with rollback | Prompts stored in version control; can revert within minutes | 13.2 |
| 29 | Conversation memory management | Multi-turn context handled via sliding window or summarization | 13.4 |
21.4 Agentic RAG and Tool Use (12 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 30 | Complex queries routed to agentic path | Query complexity classifier active; not all queries single-shot | 12.1 |
| 31 | Query decomposition for multi-hop questions | LLM decomposes complex queries into sub-queries | 12.2 |
| 32 | Max iteration and token budget caps | Hard limits on iterations (3-5), tool calls (10-15), tokens (50K) | 12.6 |
| 33 | Circuit breaker on agent failure | Fallback to single-shot RAG on loop detection or timeout | 12.6 |
| 34 | Tool interface standardized | MCP or equivalent; tools discoverable and schema-validated | 12.5 |
| 35 | Agent audit trail | Every tool call and reasoning step logged with trace ID | 12.5 |
21.5 Evaluation and Feedback (18 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 36 | Golden dataset maintained | 500+ curated Q&A pairs; updated quarterly | 15.1 |
| 37 | Offline regression tests | Run on every pipeline change; blocks deploy on regression | 15.1 |
| 38 | User feedback collected | Thumbs up/down at minimum; corrections optional | 15.2 |
| 39 | LLM-as-judge on production traffic | 5-10% of queries evaluated automatically | 15.2 |
| 40 | A/B testing framework | Prompt/model/retrieval changes tested on subset of traffic | 15.2 |
| 41 | Weekly failure analysis | Low-rated answers reviewed; action items tracked | 15.3 |
| 42 | Retrieval and generation metrics tracked over time | Not just point-in-time; trend dashboards active | 16.2 |
| 43 | Eval dataset covers all query types proportionally | Factual, how-to, analytical, multi-hop all represented | 15.1 |
| 44 | Domain-specific eval metrics defined | Metrics tailored to your use case beyond generic RAGAS | 15.1 |
21.6 Observability and Cost (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 45 | OpenTelemetry traces across full pipeline | Trace spans for every stage: query, retrieval, re-rank, generation | 16.1 |
| 46 | Latency breakdown by stage | Per-stage P50/P95/P99 in dashboards | 16.2 |
| 47 | Token usage and cost per query | Tracked by model, by team, by query type | 16.3 |
| 48 | Cost anomaly alerting | Circuit breaker on cost spike | 16.3 |
| 49 | Cache hit rate monitored | Target 5-25%; alert on sustained drop below 3% | 10.5 |
| 50 | Embedding drift detection | Weekly canonical query check; alert on shift > 0.05 cosine distance | 16.4 |
| 51 | Per-tenant rate limiting | Token bucket enforced at API gateway | 17.4 |
| 52 | Daily cost reports by team/tenant | Automated reporting; budget enforcement active | 16.3 |
| 53 | SLA dashboard for stakeholders | Uptime, latency, quality metrics visible to consumers and leadership | 16.2 |
| 54 | Alerting runbooks per alert type | Every alert has a documented response procedure | 16.2 |
21.7 Resilience and Production Hardening (12 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 55 | Multi-provider LLM failover | Primary + secondary + self-hosted fallback chain | 13.1 |
| 56 | Retrieval fallback chain | Hybrid -> keyword-only -> cached responses | 17.2 |
| 57 | Embedding pipeline failure isolation | Stale index served; freshness SLA breach alerted | 17.2 |
| 58 | Load shedding and graceful degradation | Progressive degradation under extreme load | 17.1 |
| 59 | Multi-tenant data isolation validated | Cross-tenant query returns zero foreign results | 17.3 |
| 60 | Prompt injection defense active | Input sanitization + instruction hierarchy + output validation | 19 |
21.8 Security and Governance (20 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 61 | PII detection on ingested documents | Scanner flags PII before indexing | 19 |
| 62 | Secrets and credential redaction | API keys, tokens, passwords detected and redacted in responses | 19 |
| 63 | Audit logging for all queries | Who asked, what was retrieved, what was generated, all logged | 19 |
| 64 | RBAC on admin operations | Index management, prompt editing, eval dataset changes require authorized roles | 19 |
| 65 | Data retention policy enforced | Query logs, feedback, and cached responses expire per policy | 19 |
| 66 | Source permission sync validated | Permissions propagate to vector store within SLA | 11.4 |
| 67 | LLM provider data processing agreements | Query/response data not used for training; DPAs signed | 19 |
| 68 | Input sanitization beyond regex | Layered injection defense | 19 |
| 69 | Sensitive document flagging | Documents marked contains_sensitive at ingestion | 19 |
| 70 | Compliance review completed | Legal/security team has reviewed data flows and access patterns | 19 |
21.9 Data Quality and Freshness (16 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 71 | Stale document detection | Documents not updated in 6+ months flagged | 7.1 |
| 72 | Source connector health dashboard | Each source shows last sync time, error rate, document count | 7.1 |
| 73 | Broken link and dead reference cleanup | Periodic scan detects deleted/moved content | 7.3 |
| 74 | Document quality scoring | Low-quality chunks detected and quarantined | 8 |
| 75 | Per-source freshness SLA | Each source has a defined freshness target; monitored | 7.1 |
| 76 | Corpus coverage tracking | Dashboard shows document count by source, team, and age | 7 |
| 77 | Re-indexing on schema/format changes | Ingestion pipeline adapts without data loss | 7.2 |
| 78 | Chunk-to-source lineage queryable | Given any chunk, can trace back to exact source doc and version | 8.6 |
21.10 Operational Readiness (14 points max)
| # | Check | Pass Criteria | Section |
|---|---|---|---|
| 79 | Runbooks for top 5 failure scenarios | LLM outage, vector DB degradation, embedding stall, cost spike, permission leak | 17.2 |
| 80 | On-call rotation defined | Named owners for platform incidents; escalation path documented | 17 |
| 81 | Disaster recovery tested | Full restore from backup validated within target RTO | 17.2 |
| 82 | Capacity planning documented | Growth projections for queries, documents, and cost | 6 |
| 83 | Deployment pipeline with rollback | Canary or blue-green deploy; rollback under 5 min | 17.1 |
| 84 | Chaos/failure injection tested | At least one failure scenario tested in staging | 17.2 |
| 85 | Onboarding documentation for new teams | Self-service guide for connecting new document sources | 7 |
Total: 85 checks, 170 points maximum (2 points per check).
This checklist covers ten dimensions of production readiness, adapted from the areas that Amazon's Operational Readiness Review (ORR), Google's SRE production readiness reviews, and design reviews at companies like Stripe and Uber evaluate, extended with RAG/LLM-specific concerns. The score is not a certification; it is a gap analysis. The checks you score 0 on tell you exactly where to invest next. Run the review quarterly as your system evolves, your corpus grows, and new model capabilities become available.
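Scoring the review is mechanical once each check has a value. A small sketch, assuming the 0/1/2 scoring implied by the 2-point-per-check maximum (0 = missing, 1 = partial, 2 = fully met — a labeling convention assumed here, not mandated above):

```python
def gap_analysis(scores: dict[int, int]) -> dict:
    """Summarize a readiness review: total score, maximum possible, and
    the check numbers scored 0 -- the gaps that mark where to invest next."""
    assert all(0 <= s <= 2 for s in scores.values()), "each check scores 0-2"
    total = sum(scores.values())
    gaps = sorted(n for n, s in scores.items() if s == 0)
    return {"score": total, "max": 2 * len(scores), "gaps": gaps}
```

Tracking the output of this quarterly (score trend plus which gaps closed) is more useful than any single absolute number.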
Explore the Technologies
Many of the technologies referenced in this post have dedicated deep-dive pages on this site:
- RAG for retrieval-augmented generation fundamentals, chunking strategies, and evaluation frameworks
- Vector Databases for HNSW internals, distance metrics, and vendor comparisons
- vLLM for PagedAttention, continuous batching, and self-hosted inference optimization
- LangChain for LCEL, LangGraph agent workflows, and LangSmith observability
- MCP Server for Model Context Protocol architecture, transports, and security model