RAG
Retrieval-Augmented Generation: give your LLM actual facts instead of letting it guess
Why It Exists
LLMs are trained on static snapshots of the internet. They have a knowledge cutoff. They know nothing about a company's internal docs, last week's product changelog, or that one policy update from legal. Ask them about things outside their training data and they will either refuse or, worse, hallucinate. They will state something totally wrong with complete confidence.
Fine-tuning can patch some of these gaps, but it is expensive, slow, and creates yet another static snapshot that starts going stale the moment training finishes.
RAG takes a different approach: decouple knowledge from the model entirely. Instead of baking facts into model weights, the system retrieves the relevant information at query time and hands it to the LLM as context. The model stops being a memorization machine and starts being a reasoning engine over supplied evidence. Facebook AI Research (now Meta AI) formalized this idea in 2020, and it has since become the dominant pattern for getting LLMs into production.
How It Works
The pipeline splits into two phases: indexing (offline, batch) and inference (online, per-query).
Indexing Phase. Load the documents, split them into chunks, and run each chunk through an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or open-source options like bge-large-en-v1.5). Each chunk becomes a dense vector. These vectors go into a vector database alongside the raw text and metadata. The embedding model maps text into a high-dimensional space where semantically similar content lands near each other, measured by cosine similarity or dot product.
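A minimal indexing sketch under those assumptions: sentence-transformers with the bge-large-en-v1.5 model named above, naive fixed-size chunking, and a plain NumPy array standing in for the vector database. The `chunk` and `build_index` names are illustrative, not any particular library's API.

```python
# Indexing sketch: load -> chunk -> embed -> store (in-memory stand-in for a vector DB).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Naive fixed-size chunking on whitespace "tokens"; see Chunking Strategies below.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def build_index(docs: dict[str, str]) -> tuple[np.ndarray, list[dict]]:
    # Keep vectors, raw text, and metadata side by side, as a vector DB would.
    records = [{"doc_id": doc_id, "text": c} for doc_id, text in docs.items() for c in chunk(text)]
    vectors = model.encode([r["text"] for r in records], normalize_embeddings=True)
    return np.asarray(vectors), records  # normalized vectors: dot product == cosine similarity
```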
Inference Phase. A user submits a query. The system embeds it with the same model to get a query vector. The vector database runs approximate nearest neighbor (ANN) search and returns the top-K most similar chunks. Those chunks get assembled into a prompt template along with the original question, and the LLM generates an answer grounded in the retrieved context.
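And the query side, continuing the sketch above (it reuses `model` and the `build_index` output). Brute-force dot products stand in for ANN search, and `llm` is a placeholder for whatever completion call is in use.

```python
# Inference sketch: embed the query, take the top-K chunks, assemble a grounded prompt.
import numpy as np

def retrieve(query: str, vectors: np.ndarray, records: list[dict], k: int = 5) -> list[dict]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q                    # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]      # brute force here; a real vector DB runs ANN search
    return [records[i] for i in top]

def answer(query: str, vectors: np.ndarray, records: list[dict], llm) -> str:
    context = "\n\n".join(r["text"] for r in retrieve(query, vectors, records))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)                      # placeholder for your completion API call
```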
The devil is in every single detail.
Architecture Deep Dive
Naive RAG is the textbook version: embed, retrieve, generate. It works fine for straightforward factoid questions. It falls apart on anything requiring multi-hop reasoning, nuanced interpretation, or synthesis across multiple documents.
Advanced RAG bolts on pre-retrieval and post-retrieval optimizations. On the pre-retrieval side: query rewriting (use an LLM to reformulate vague queries into precise ones), HyDE (generate a hypothetical answer first, then embed that hypothetical answer for retrieval, which often surfaces better matches), and query decomposition (break complex questions into sub-queries and retrieve for each). On the post-retrieval side: re-ranking with cross-encoder models, context compression (strip irrelevant sentences from retrieved chunks), and chain-of-thought prompting to help the LLM reason over what it has been given.
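To make one pre-retrieval technique concrete, here is a hedged sketch of HyDE built on the `retrieve` function above; `llm` is again a placeholder completion call, and the prompt wording is illustrative.

```python
# HyDE sketch: retrieve with an embedding of a hypothetical answer instead of the raw query.
def hyde_retrieve(query: str, vectors, records, llm, k: int = 5) -> list[dict]:
    hypothetical = llm(
        "Write a short, plausible passage that answers this question, "
        "even if you are unsure it is correct:\n" + query
    )
    # The hypothetical answer usually sits closer to real answer passages
    # in embedding space than the question itself does.
    return retrieve(hypothetical, vectors, records, k=k)
```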
Modular RAG treats each component as a swappable module. Different retrievers (dense, sparse, hybrid), different re-rankers, different prompt strategies. The pipeline is optimized per use case instead of treating it as a monolith. Agentic RAG goes a step further. An LLM agent decides when to retrieve, which tool to use, and whether to refine the query. The LLM gets control over its own retrieval strategy. Debugging gets painful at this stage.
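A stripped-down illustration of the agentic idea, reusing `build_index` and `answer` from the sketches above. The JSON routing convention and index names are assumptions for illustration; production agents use proper tool-calling APIs and validate the model's output.

```python
# Agentic routing sketch: the model decides whether to retrieve and which index to query.
import json

def agentic_answer(query: str, indices: dict, llm) -> str:
    # indices maps a name like "api_docs" to a (vectors, records) pair from build_index().
    choices = ", ".join(f'"{name}"' for name in indices) + ', "none"'
    route = json.loads(llm(                 # a real agent would validate this output
        f"Pick one of {choices} for this question, and rewrite the question for retrieval. "
        'Reply with JSON: {"tool": ..., "query": ...}\n'
        f"Question: {query}"
    ))
    if route["tool"] == "none":
        return llm(query)                   # pure reasoning, no retrieval needed
    vectors, records = indices[route["tool"]]
    return answer(route["query"], vectors, records, llm)
```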
Graph RAG, introduced by Microsoft in 2024, builds a knowledge graph from the corpus during indexing. At retrieval time it combines graph traversal with vector search. This shines on questions about relationships between entities. "How does company X's pricing strategy connect to their supply chain decisions?" Isolated chunks miss the bigger picture. A graph captures it.
Stripe's support system uses multi-index RAG to answer questions that span API docs, changelog entries, and support tickets at the same time. That kind of multi-source retrieval is where RAG really earns its keep.
Chunking Strategies
Chunking is the most underrated part of the whole pipeline. Get it wrong and nothing downstream can compensate.
Fixed-size chunking (512 tokens, 50-token overlap) is simple, but it splits sentences and ideas mid-thought. It is the "Hello World" of chunking. Fine for prototyping, rarely ideal for production.
Recursive character splitting tries paragraph boundaries first, then sentence boundaries, then word boundaries. Better than fixed-size, still mechanical.
Semantic chunking uses embedding similarity between consecutive sentences. When the similarity drops below a threshold, a new chunk starts. This creates more natural groupings that actually respect topic shifts.
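A rough sketch of that thresholding idea, again with sentence-transformers. The naive sentence splitter and the 0.6 cutoff are placeholder choices that real pipelines would tune.

```python
# Semantic chunking sketch: start a new chunk when adjacent sentences stop looking alike.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]  # crude splitter, fine for a sketch
    if not sentences:
        return []
    vecs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vecs, vecs[1:], sentences[1:]):
        if float(np.dot(prev, cur)) < threshold:  # similarity dropped: likely a topic shift
            chunks.append(". ".join(current))
            current = []
        current.append(sentence)
    chunks.append(". ".join(current))
    return chunks
```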
Document-structure-aware chunking uses headings, sections, and formatting to create boundaries. If the documents have decent structure, this is usually the winner.
For code, use AST-based chunking that respects function and class boundaries. For tables, keep entire tables as single chunks with header context included. For Q&A pairs, keep question and answer together (splitting them is a common mistake that tanks retrieval quality). The right strategy depends entirely on the document type. Anyone claiming there is a universal chunker is selling something.
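For the code case specifically, Python's standard ast module is enough to sketch the idea (ast.get_source_segment needs Python 3.8+); this version only keeps top-level functions and classes.

```python
# AST chunking sketch: one chunk per top-level function or class, so no definition gets split.
import ast

def code_chunks(source: str) -> list[str]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
        # Module-level statements could be grouped into their own chunk if needed.
    return chunks
```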
Evaluation
This is where most teams cut corners, and it always comes back to bite them.
RAG evaluation requires measuring three dimensions independently: retrieval quality (were the right chunks found?), context relevance (are those chunks actually useful for the question?), and answer faithfulness (does the generated answer stick to what the retrieved context says, or did the LLM wander off?).
Frameworks like RAGAS, DeepEval, and TruLens automate these metrics, and they are a good starting point. But do not fool yourself into thinking automated metrics are enough. Human evaluation still matters for subjective quality. The numbers correlate with human judgment, but they do not replace it. Budget time for both.
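The retrieval-side numbers are simple enough to hand-roll while you evaluate frameworks; a sketch, assuming you have labeled which chunk IDs are actually relevant for each test question.

```python
# Retrieval metrics sketch: recall@k and precision@k against hand-labeled relevant chunk IDs.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / k

# Context relevance and answer faithfulness still need an LLM judge or human review on top.
```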
Pros
- Grounds LLM responses in factual, up-to-date sources
- Cuts hallucinations drastically compared to raw LLM generation
- No fine-tuning needed to add domain-specific knowledge
- Sources are citable, so users can actually verify answers
- You can update the knowledge base without retraining the model
Cons
- Retrieval quality caps generation quality. Garbage in, garbage out.
- Chunking strategy has a huge impact on relevance, and there is no universal answer
- Latency overhead from embedding, retrieval, and re-ranking adds up
- Evaluation is tricky. You need to measure retrieval precision, context relevance, and answer faithfulness separately.
- Cost scales with corpus size (vector storage, embedding API calls, re-ranking)
When to use
- Your LLM needs access to private, proprietary, or frequently changing data
- Factual accuracy and source attribution matter
- Fine-tuning is too expensive or the knowledge base changes too often
- Domain-specific Q&A where the LLM's training data falls short
When NOT to use
- Tasks that need pure reasoning or creativity, not factual grounding
- Tiny knowledge bases where the whole corpus fits in the context window
- Real-time applications that cannot tolerate retrieval latency
- Highly structured data that is better served by SQL queries or APIs
Key Points
- Naive RAG (embed-retrieve-generate) hits roughly 60-70% accuracy. Advanced RAG with query rewriting, HyDE, and re-ranking pushes that to 85-95% on domain benchmarks.
- Chunk size is the single most impactful hyperparameter. 256-512 tokens with 10-20% overlap is a solid default, but semantic chunking (splitting at paragraph or section boundaries) beats fixed-size on structured documents.
- Re-ranking with cross-encoder models (Cohere Rerank, bge-reranker, etc.) after initial vector retrieval improves precision by 15-30%, at the cost of 50-100ms latency.
- Hybrid search combining dense vectors (semantic) and sparse vectors (BM25 keyword matching) consistently outperforms either approach alone by 10-20% (see the fusion sketch after this list).
- Agentic RAG uses tool-calling LLMs to dynamically decide whether to retrieve, which index to query, and how to reformulate queries. It significantly outperforms static pipelines on multi-hop questions.
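As a sketch of the hybrid-search point above: reciprocal rank fusion (RRF) is a common way to merge dense and BM25 result lists without worrying about incompatible score scales. The k=60 constant is the conventional default; the two input rankings are assumed to come from your dense retriever and a BM25 engine.

```python
# Reciprocal rank fusion sketch: merge dense and BM25 rankings by rank, not by raw score.
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60, top_n: int = 10) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```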
Common Mistakes
- ✗ Using one monolithic index for everything. Code, prose, tables, and Q&A pairs need different chunking strategies and often separate indices.
- ✗ Not evaluating retrieval and generation separately. A correct answer from wrong sources is a false positive. Track retrieval recall, context precision, and faithfulness independently.
- ✗ Ignoring metadata filtering. Vector similarity alone will retrieve semantically similar but contextually wrong chunks. Combine it with date, source, and category filters.
- ✗ Stuffing too many chunks into context. More is not always better. Beyond 5-10 relevant chunks, noise increases and answer quality drops (the lost-in-the-middle effect).
- ✗ Skipping the re-ranking step. Bi-encoder retrieval is fast but imprecise. A cross-encoder re-ranker on the top-20 results dramatically improves the final top-5 relevance (see the sketch below).
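To make that last point concrete, a re-ranking sketch using the CrossEncoder class from sentence-transformers with a bge-reranker checkpoint; the model name and the 20-to-5 narrowing are illustrative defaults, not a recommendation.

```python
# Re-ranking sketch: score (query, passage) pairs with a cross-encoder, keep the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, passages: list[str], keep: int = 5) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])  # joint forward pass per pair
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:keep]]
```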