RAG
Retrieval-Augmented Generation: give your LLM actual facts instead of letting it guess
Why It Exists
LLMs are trained on static snapshots of the internet. They have a knowledge cutoff. They know nothing about a company's internal docs, last week's product changelog, or that one policy update from legal. Ask them about things outside their training data and they will either refuse or, worse, hallucinate. They will state something totally wrong with complete confidence.
Fine-tuning can patch some of these gaps, but it is expensive, slow, and creates yet another static snapshot that starts going stale the moment training finishes.
RAG takes a different approach: decouple knowledge from the model entirely. Instead of baking facts into model weights, the system retrieves the relevant information at query time and hands it to the LLM as context. The model stops being a memorization machine and starts being a reasoning engine over supplied evidence. Facebook AI Research (now Meta AI) formalized this idea in 2020, and it has since become the dominant pattern for getting LLMs into production.
How It Works
The pipeline splits into two phases: indexing (offline, batch) and inference (online, per-query).
Indexing Phase. Load the documents, split them into chunks, and run each chunk through an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or open-source options like bge-large-en-v1.5). Each chunk becomes a dense vector. These vectors go into a vector database alongside the raw text and metadata. The embedding model maps text into a high-dimensional space where semantically similar content lands near each other, measured by cosine similarity or dot product.
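A minimal indexing sketch under those assumptions: sentence-transformers with the bge-large-en-v1.5 model named above, naive fixed-size chunking, and a plain NumPy array standing in for the vector database. The `chunk` and `build_index` names are illustrative, not any particular library's API.

```python
# Indexing sketch: load -> chunk -> embed -> store (in-memory stand-in for a vector DB).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Naive fixed-size chunking on whitespace "tokens"; see Chunking Strategies below.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def build_index(docs: dict[str, str]) -> tuple[np.ndarray, list[dict]]:
    # Keep vectors, raw text, and metadata side by side, as a vector DB would.
    records = [{"doc_id": doc_id, "text": c} for doc_id, text in docs.items() for c in chunk(text)]
    vectors = model.encode([r["text"] for r in records], normalize_embeddings=True)
    return np.asarray(vectors), records  # normalized vectors: dot product == cosine similarity
```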
Inference Phase. A user submits a query. The system embeds it with the same model to get a query vector. The vector database runs approximate nearest neighbor (ANN) search and returns the top-K most similar chunks. Those chunks get assembled into a prompt template along with the original question, and the LLM generates an answer grounded in the retrieved context.
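And the query side, continuing the sketch above (it reuses `model` and the `build_index` output). Brute-force dot products stand in for ANN search, and `llm` is a placeholder for whatever completion call is in use.

```python
# Inference sketch: embed the query, take the top-K chunks, assemble a grounded prompt.
import numpy as np

def retrieve(query: str, vectors: np.ndarray, records: list[dict], k: int = 5) -> list[dict]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q                    # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]      # brute force here; a real vector DB runs ANN search
    return [records[i] for i in top]

def answer(query: str, vectors: np.ndarray, records: list[dict], llm) -> str:
    context = "\n\n".join(r["text"] for r in retrieve(query, vectors, records))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)                      # placeholder for your completion API call
```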
The devil is in every single detail.
Architecture Deep Dive
Naive RAG is the textbook version: embed, retrieve, generate. It works fine for straightforward factoid questions. It falls apart on anything requiring multi-hop reasoning, nuanced interpretation, or synthesis across multiple documents.
Advanced RAG bolts on pre-retrieval and post-retrieval optimizations. On the pre-retrieval side: query rewriting (use an LLM to reformulate vague queries into precise ones), HyDE (generate a hypothetical answer first, then embed that hypothetical answer for retrieval, which often surfaces better matches), and query decomposition (break complex questions into sub-queries and retrieve for each). On the post-retrieval side: re-ranking with cross-encoder models, context compression (strip irrelevant sentences from retrieved chunks), and chain-of-thought prompting to help the LLM reason over what it has been given.
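To make one pre-retrieval technique concrete, here is a hedged sketch of HyDE built on the `retrieve` function above; `llm` is again a placeholder completion call, and the prompt wording is illustrative.

```python
# HyDE sketch: retrieve with an embedding of a hypothetical answer instead of the raw query.
def hyde_retrieve(query: str, vectors, records, llm, k: int = 5) -> list[dict]:
    hypothetical = llm(
        "Write a short, plausible passage that answers this question, "
        "even if you are unsure it is correct:\n" + query
    )
    # The hypothetical answer usually sits closer to real answer passages
    # in embedding space than the question itself does.
    return retrieve(hypothetical, vectors, records, k=k)
```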
Modular RAG treats each component as a swappable module. Different retrievers (dense, sparse, hybrid), different re-rankers, different prompt strategies. The pipeline is optimized per use case instead of treating it as a monolith. Agentic RAG goes a step further. An LLM agent decides when to retrieve, which tool to use, and whether to refine the query. The LLM gets control over its own retrieval strategy. Debugging gets painful at this stage.
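A stripped-down illustration of the agentic idea, reusing `build_index` and `answer` from the sketches above. The JSON routing convention and index names are assumptions for illustration; production agents use proper tool-calling APIs and validate the model's output.

```python
# Agentic routing sketch: the model decides whether to retrieve and which index to query.
import json

def agentic_answer(query: str, indices: dict, llm) -> str:
    # indices maps a name like "api_docs" to a (vectors, records) pair from build_index().
    choices = ", ".join(f'"{name}"' for name in indices) + ', "none"'
    route = json.loads(llm(                 # a real agent would validate this output
        f"Pick one of {choices} for this question, and rewrite the question for retrieval. "
        'Reply with JSON: {"tool": ..., "query": ...}\n'
        f"Question: {query}"
    ))
    if route["tool"] == "none":
        return llm(query)                   # pure reasoning, no retrieval needed
    vectors, records = indices[route["tool"]]
    return answer(route["query"], vectors, records, llm)
```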
Graph RAG, introduced by Microsoft in 2024, builds a knowledge graph from the corpus during indexing. At retrieval time it combines graph traversal with vector search. This shines on questions about relationships between entities. "How does company X's pricing strategy connect to their supply chain decisions?" Isolated chunks miss the bigger picture. A graph captures it.
Stripe's support system uses multi-index RAG to answer questions that span API docs, changelog entries, and support tickets at the same time. That kind of multi-source retrieval is where RAG really earns its keep.
Chunking Strategies
Chunking is the most underrated part of the whole pipeline. Get it wrong and nothing downstream can compensate.
Fixed-size chunking (512 tokens, 50-token overlap) is simple, but it splits sentences and ideas mid-thought. It is the "Hello World" of chunking. Fine for prototyping, rarely ideal for production.
Recursive character splitting tries paragraph boundaries first, then sentence boundaries, then word boundaries. Better than fixed-size, still mechanical.
Semantic chunking uses embedding similarity between consecutive sentences. When the similarity drops below a threshold, a new chunk starts. This creates more natural groupings that actually respect topic shifts.
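A rough sketch of that thresholding idea, again with sentence-transformers. The naive sentence splitter and the 0.6 cutoff are placeholder choices that real pipelines would tune.

```python
# Semantic chunking sketch: start a new chunk when adjacent sentences stop looking alike.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]  # crude splitter, fine for a sketch
    if not sentences:
        return []
    vecs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vecs, vecs[1:], sentences[1:]):
        if float(np.dot(prev, cur)) < threshold:  # similarity dropped: likely a topic shift
            chunks.append(". ".join(current))
            current = []
        current.append(sentence)
    chunks.append(". ".join(current))
    return chunks
```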
Document-structure-aware chunking uses headings, sections, and formatting to create boundaries. If the documents have decent structure, this is usually the winner.
For code, use AST-based chunking that respects function and class boundaries. For tables, keep entire tables as single chunks with header context included. For Q&A pairs, keep question and answer together (splitting them is a common mistake that tanks retrieval quality). The right strategy depends entirely on the document type. Anyone claiming there is a universal chunker is selling something.
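For the code case specifically, Python's standard ast module is enough to sketch the idea (ast.get_source_segment needs Python 3.8+); this version only keeps top-level functions and classes.

```python
# AST chunking sketch: one chunk per top-level function or class, so no definition gets split.
import ast

def code_chunks(source: str) -> list[str]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
        # Module-level statements could be grouped into their own chunk if needed.
    return chunks
```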
Evaluation
This is where most teams cut corners, and it always comes back to bite them.
RAG evaluation requires measuring three dimensions independently: retrieval quality (were the right chunks found?), context relevance (are those chunks actually useful for the question?), and answer faithfulness (does the generated answer stick to what the retrieved context says, or did the LLM wander off?).
Frameworks like RAGAS, DeepEval, and TruLens automate these metrics, and they are a good starting point. But do not fool yourself into thinking automated metrics are enough. Human evaluation still matters for subjective quality. The numbers correlate with human judgment, but they do not replace it. Budget time for both.
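The retrieval-side numbers are simple enough to hand-roll while you evaluate frameworks; a sketch, assuming you have labeled which chunk IDs are actually relevant for each test question.

```python
# Retrieval metrics sketch: recall@k and precision@k against hand-labeled relevant chunk IDs.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / k

# Context relevance and answer faithfulness still need an LLM judge or human review on top.
```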
Pros
- Grounds LLM responses in factual, up-to-date sources
- Cuts hallucinations drastically compared to raw LLM generation
- No fine-tuning needed to add domain-specific knowledge
- Sources are citable, so users can actually verify answers
- You can update the knowledge base without retraining the model
Cons
- Retrieval quality caps generation quality. Garbage in, garbage out.
- Chunking strategy has a huge impact on relevance, and there is no universal answer
- Latency overhead from embedding, retrieval, and re-ranking adds up
- Evaluation is tricky. You need to measure retrieval precision, context relevance, and answer faithfulness separately.
- Cost scales with corpus size (vector storage, embedding API calls, re-ranking)
When to use
- Your LLM needs access to private, proprietary, or frequently changing data
- Factual accuracy and source attribution matter
- Fine-tuning is too expensive or the knowledge base changes too often
- Domain-specific Q&A where the LLM's training data falls short
When NOT to use
- Tasks that need pure reasoning or creativity, not factual grounding
- Tiny knowledge bases where the whole corpus fits in the context window
- Real-time applications that cannot tolerate retrieval latency
- Highly structured data that is better served by SQL queries or APIs
Key Points
- Naive RAG (embed-retrieve-generate) hits roughly 60-70% accuracy. Advanced RAG with query rewriting, HyDE, and re-ranking pushes that to 85-95% on domain benchmarks.
- Chunk size is the single most impactful hyperparameter. 256-512 tokens with 10-20% overlap is a solid default, but semantic chunking (splitting at paragraph or section boundaries) beats fixed-size on structured documents.
- Re-ranking with cross-encoder models (Cohere Rerank, bge-reranker, etc.) after initial vector retrieval improves precision by 15-30%, at the cost of 50-100ms latency.
- Hybrid search combining dense vectors (semantic) and sparse vectors (BM25 keyword matching) consistently outperforms either approach alone by 10-20% (see the fusion sketch after this list).
- Agentic RAG uses tool-calling LLMs to dynamically decide whether to retrieve, which index to query, and how to reformulate queries. It significantly outperforms static pipelines on multi-hop questions.
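As a sketch of the hybrid-search point above: reciprocal rank fusion (RRF) is a common way to merge dense and BM25 result lists without worrying about incompatible score scales. The k=60 constant is the conventional default; the two input rankings are assumed to come from your dense retriever and a BM25 engine.

```python
# Reciprocal rank fusion sketch: merge dense and BM25 rankings by rank, not by raw score.
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60, top_n: int = 10) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```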
Common Mistakes
- ✗ Using one monolithic index for everything. Code, prose, tables, and Q&A pairs need different chunking strategies and often separate indices.
- ✗ Not evaluating retrieval and generation separately. A correct answer from wrong sources is a false positive. Track retrieval recall, context precision, and faithfulness independently.
- ✗ Ignoring metadata filtering. Vector similarity alone will retrieve semantically similar but contextually wrong chunks. Combine it with date, source, and category filters.
- ✗ Stuffing too many chunks into context. More is not always better. Beyond 5-10 relevant chunks, noise increases and answer quality drops (the lost-in-the-middle effect).
- ✗ Skipping the re-ranking step. Bi-encoder retrieval is fast but imprecise. A cross-encoder re-ranker on the top-20 results dramatically improves the final top-5 relevance (see the sketch below).
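To make that last point concrete, a re-ranking sketch using the CrossEncoder class from sentence-transformers with a bge-reranker checkpoint; the model name and the 20-to-5 narrowing are illustrative defaults, not a recommendation.

```python
# Re-ranking sketch: score (query, passage) pairs with a cross-encoder, keep the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, passages: list[str], keep: int = 5) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])  # joint forward pass per pair
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:keep]]
```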