RAG Pipeline Patterns
RAG (Retrieval-Augmented Generation)
A pattern where the system fetches relevant information at query time and feeds it into the LLM prompt, instead of relying on what the model memorized during training. Like an open-book exam instead of a closed-book one. This grounds the model in real, current data and dramatically reduces hallucination.
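A minimal sketch of the flow, assuming hypothetical embed(), vector_search(), and llm() helpers standing in for the embedding model, vector database, and LLM client:

```python
# Minimal RAG sketch. embed(), vector_search(), and llm() are hypothetical
# stand-ins for your embedding model, vector database, and LLM client.

def answer(question: str) -> str:
    query_vector = embed(question)                 # 1. embed the user question
    chunks = vector_search(query_vector, top_k=5)  # 2. retrieve relevant chunks
    context = "\n\n".join(c.text for c in chunks)  # 3. build a grounded prompt
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                             # 4. generate the answer
```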
Chunking
Breaking documents or code into smaller pieces (chunks) that can be stored and retrieved individually. The chunk size matters a lot: too small and context is lost, too large and retrieval gets noisy. For code, the ideal chunk is one complete function or class. For documents, 200-500 tokens per chunk is typical.
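A sketch of a simple size-based chunker; the 4-characters-per-token heuristic is an assumption, and a production system would count tokens with a real tokenizer:

```python
def chunk_text(text: str, max_tokens: int = 400) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_tokens tokens."""
    def approx_tokens(s: str) -> int:
        return len(s) // 4           # crude assumption: ~4 characters per token

    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and approx_tokens(current + para) > max_tokens:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```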
AST-Aware Chunking
A code-specific strategy for splitting source files into searchable pieces. Instead of cutting at arbitrary token boundaries (which might split a function in half), the system uses the code's syntax tree to chunk at natural boundaries: one function per chunk, one class per chunk. A complete 30-line function is one chunk. Splitting it at token 512 produces two halves that are useless on their own.
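For Python source, the standard-library ast module is enough to sketch the idea (other languages typically need a parser such as tree-sitter):

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function or class, using the syntax tree."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```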
Chunk Overlap
Including a few sentences or lines from the end of one chunk at the beginning of the next, so context isn't lost at chunk boundaries. Typically 10-20% overlap. Without it, a concept that spans two chunks might not be retrieved because neither chunk alone captures enough of it. Trade-off: more overlap means more storage and slightly more embedding cost.
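One way to sketch it is to prepend the tail of the previous chunk to each chunk; the 15% default is just an illustrative value in the 10-20% range:

```python
def add_overlap(chunks: list[str], overlap_fraction: float = 0.15) -> list[str]:
    """Prepend the tail of the previous chunk so boundary context is not lost."""
    overlapped = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            overlapped.append(chunk)
            continue
        prev = chunks[i - 1]
        tail = prev[-int(len(prev) * overlap_fraction):]  # last ~15% of previous chunk
        overlapped.append(tail + "\n" + chunk)
    return overlapped
```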
Embedding Model
A specialized AI model (separate from the main LLM) that converts text or code into a list of numbers called a vector (e.g., 1024 numbers). Similar content produces similar vectors, which enables semantic search. Like converting words into GPS coordinates so that related concepts are geographically close to each other.
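A sketch using the sentence-transformers library; the model name is just one small open model, and any embedding model is used the same way:

```python
from sentence_transformers import SentenceTransformer

# Example model; swap in whichever embedding model you actually use.
model = SentenceTransformer("all-MiniLM-L6-v2")

vectors = model.encode([
    "How do we retry a failed payment?",
    "def retry_payment(order_id): ...",
])
print(vectors.shape)  # (2, 384): one 384-dimensional vector per input
```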
Vector Database
A database designed to store vectors (lists of numbers) and quickly find the most similar ones to a query vector. Think of it as a search engine for meaning, not just keywords. Examples: Qdrant, Pinecone, pgvector. Most use an HNSW index for fast approximate nearest-neighbor search across millions of vectors.
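A brute-force, in-memory sketch of what a vector database does conceptually; real systems (Qdrant, Pinecone, pgvector) add persistence, filtering, and an HNSW index instead of scanning every vector:

```python
import numpy as np

class TinyVectorStore:
    """Brute-force cosine search over stored vectors (illustration only)."""
    def __init__(self):
        self.vectors, self.payloads = [], []

    def add(self, vector, payload):
        v = np.asarray(vector, dtype=np.float32)
        self.vectors.append(v / np.linalg.norm(v))  # normalize once at insert time
        self.payloads.append(payload)

    def search(self, query, top_k=5):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q         # cosine similarity per stored vector
        best = np.argsort(scores)[::-1][:top_k]
        return [(self.payloads[i], float(scores[i])) for i in best]
```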
HNSW Index
Hierarchical Navigable Small World graph, the standard algorithm for finding similar vectors fast. Instead of comparing a query against every stored vector (too slow), HNSW builds a multi-layer graph that narrows down candidates quickly, like a skip list for vectors. Key parameters: m (connections per node, affects recall vs memory), ef (search width, affects speed vs accuracy).
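A sketch using the hnswlib package, assuming it is installed; M and ef here are the same parameters described above:

```python
import numpy as np
import hnswlib

dim, n = 1024, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # M: connections per node
index.add_items(data, np.arange(n))

index.set_ef(64)  # ef: search width; higher = more accurate, slower queries
labels, distances = index.knn_query(data[:1], k=5)
```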
Cosine Similarity
A way to measure how similar two vectors are, based on the angle between them. A score of 1.0 means the vectors point in the same direction (very similar content); 0.0 means they are orthogonal (unrelated). Like comparing which direction two arrows point. Typical retrieval threshold: 0.75+ (include in results). Semantic cache hit threshold: 0.95+ (confident enough to reuse a cached answer).
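The measure itself is one line of numpy; the vectors below are toy values, and the thresholds are the ones quoted above:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.9, 0.1])    # toy 3-dimensional embeddings
chunk_vec = np.array([0.25, 0.85, 0.05])
score = cosine_similarity(query_vec, chunk_vec)

if score >= 0.95:
    print("semantic cache hit")
elif score >= 0.75:
    print("include in retrieval results")
```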
Hybrid Retrieval
Using two search methods together for better results. Vector search finds semantically similar content ('handle payment failures' finds retryPayment()). Keyword search finds exact matches ('processPayment' finds the function by name). Combining both via Reciprocal Rank Fusion gives 5-15% better results than either method alone.
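Reciprocal Rank Fusion itself is only a few lines; k=60 is the constant commonly used in practice:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g., vector and keyword) by summed RRF score."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["retryPayment", "PaymentService", "chargeCard"],    # vector search ranking
    ["processPayment", "retryPayment", "PaymentError"],  # keyword search ranking
])
```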
Re-Ranking
A second, more expensive pass that improves result quality. After the initial retrieval returns top-20 candidates (fast, approximate), a cross-encoder model (e.g., Cohere Rerank) reads each candidate alongside the query and assigns a more accurate relevance score. Like a quick scan to find 20 candidates, then a careful read to pick the best 5.
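A sketch of the second pass with an open cross-encoder from sentence-transformers; the model name and candidate texts are just examples, and hosted rerankers such as Cohere Rerank are used the same way:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do we handle payment failures?"
candidates = [
    "def retry_payment(order_id): retries a charge up to 3 times",
    "README: how to install the CLI",
    "PaymentError is raised when the card is declined",
]

# Score each (query, candidate) pair, then keep the highest-scoring documents.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```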
Semantic Cache
Caching LLM responses keyed by the semantic meaning of the query rather than by exact text match. If someone asks 'how do payments work?' and the cache has an answer for 'explain the payment flow', the system checks cosine similarity between the query embeddings. Above 0.95 similarity, the cached answer is returned without calling the LLM. Saves cost and latency for repetitive question patterns.
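A sketch of the lookup side, assuming the caller embeds queries and calls the LLM on a miss:

```python
import numpy as np

class SemanticCache:
    """Cache answers keyed by query embedding; reuse when similarity >= threshold."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, cached answer)

    def get(self, query_vec: np.ndarray) -> str | None:
        q = query_vec / np.linalg.norm(query_vec)
        for vec, answer in self.entries:
            if float(vec @ q) >= self.threshold:
                return answer            # close enough: skip the LLM call
        return None

    def put(self, query_vec: np.ndarray, answer: str) -> None:
        self.entries.append((query_vec / np.linalg.norm(query_vec), answer))
```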
Query Expansion (HyDE)
Hypothetical Document Embeddings, a technique where the LLM generates a hypothetical answer to the query before searching. The system then embeds this hypothetical answer and searches for real documents similar to it. This bridges the vocabulary gap between how people ask questions and how documents are written. Adds 200-400ms latency (one LLM call) but can improve recall by 5-15%.
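A sketch of the flow, again assuming hypothetical llm(), embed(), and vector_search() helpers:

```python
def hyde_search(question: str, top_k: int = 5):
    """HyDE sketch: llm(), embed(), and vector_search() are hypothetical helpers."""
    # 1. Ask the LLM to write a plausible (possibly wrong) answer.
    hypothetical = llm(f"Write a short passage that answers: {question}")
    # 2. Embed the hypothetical answer, not the question.
    vec = embed(hypothetical)
    # 3. Retrieve real documents that look like that answer.
    return vector_search(vec, top_k=top_k)
```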
Matryoshka Embeddings
Named after Russian nesting dolls, these are embeddings designed to work at multiple dimension sizes. The full embedding might be 3072 numbers, but truncating to 1024 numbers still captures most of the meaning (less than 2% quality loss). This saves 66% on storage. text-embedding-3-large supports this natively.
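On the client side this is just truncation plus re-normalization (a sketch; the embedding model itself must support Matryoshka truncation, as text-embedding-3-large does):

```python
import numpy as np

def truncate_embedding(vec, dims: int = 1024) -> np.ndarray:
    """Keep the first `dims` values of a Matryoshka embedding, then re-normalize."""
    small = np.asarray(vec, dtype=np.float32)[:dims]
    return small / np.linalg.norm(small)

full = np.random.rand(3072).astype(np.float32)   # stand-in for a 3072-dim embedding
short = truncate_embedding(full, 1024)           # ~1/3 the storage per vector
```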
Evaluation Metrics (RAGAS)
A framework for measuring RAG quality with four key metrics. Faithfulness: is the answer grounded in the retrieved chunks? Answer relevance: does it address the question? Context relevance: are the retrieved chunks actually relevant? Context recall: did retrieval find all the needed information? Target above 0.80 for each. Without these metrics, RAG quality degrades silently over time.
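A sketch of the classic ragas evaluate() interface; column and metric names have shifted between ragas versions, and an LLM must be configured behind the scenes for the metrics to run, so treat the details as assumptions:

```python
# Sketch assuming the ragas package and a Hugging Face Dataset with the
# columns ragas expects (question, answer, contexts, ground_truth).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

data = Dataset.from_dict({
    "question": ["How are failed payments retried?"],
    "answer": ["Failed payments are retried three times with backoff."],
    "contexts": [["retry_payment() retries up to 3 times with exponential backoff."]],
    "ground_truth": ["Payments are retried up to three times with exponential backoff."],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(scores)  # aim for > 0.80 on each metric
```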
Vector Storage Math
How to calculate how much disk space a vector database needs. Formula: chunks x dimensions x bytes_per_value + 30% HNSW overhead. Example: 10M chunks x 1024 dimensions x 1 byte (int8 quantization) = 10 GB for vectors, plus 3 GB for the HNSW graph, plus metadata. The math scales linearly with chunk count.
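The worked example as a few lines of arithmetic:

```python
# Worked example of the formula above.
chunks          = 10_000_000   # number of stored chunks
dims            = 1024         # embedding dimensions kept per chunk
bytes_per_value = 1            # int8 quantization (4 for float32)

vector_bytes = chunks * dims * bytes_per_value   # 10.24 GB of raw vectors
hnsw_bytes   = vector_bytes * 0.30               # ~30% overhead for the HNSW graph
total_gb     = (vector_bytes + hnsw_bytes) / 1e9

print(f"{total_gb:.1f} GB plus metadata")        # ~13.3 GB
```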