Vector Databases
Specialized databases for storing, indexing, and querying high-dimensional vector embeddings at scale
Why It Exists
Here is the core problem: traditional databases are built for exact matches. Find rows where status = 'active' or price < 100. AI applications work differently. They deal in embeddings, high-dimensional vectors (768 to 3072 dimensions) where meaning lives in proximity. The question is not "find this exact value" but "find the 10 most similar items."
Brute-force is the naive approach. Compute cosine similarity against every vector, O(N×D) time. At 10 million vectors with 1536 dimensions, that is 15 billion floating-point operations per query. Completely impractical for anything real-time.
Vector databases fix this with Approximate Nearest Neighbor (ANN) algorithms. They drop search complexity from O(N) to O(log N) by building specialized index structures. The tradeoff is a small accuracy hit (typically 95-99% recall) in exchange for going from seconds to single-digit milliseconds. In practice, that recall gap rarely matters.
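For a sense of what that brute-force scan looks like, here is a minimal NumPy sketch (sizes shrunk so it actually runs; the real pain starts at millions of vectors):

```python
import numpy as np

N, D, K = 100_000, 1536, 10                  # corpus size, dimensions, results wanted
corpus = np.random.randn(N, D).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # L2-normalize once

def brute_force_top_k(query: np.ndarray, k: int = K) -> np.ndarray:
    query = query / np.linalg.norm(query)
    scores = corpus @ query                  # N x D multiply-adds on every single query
    return np.argpartition(-scores, k)[:k]   # indices of the k most similar vectors
```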
How It Works
Embedding and Ingestion: Data (text, images, audio) gets converted into dense vector embeddings using a model. OpenAI's text-embedding-3-large, for example, produces 3072-dimensional vectors. These vectors get inserted along with the source content and metadata into the database. The database builds or updates its ANN index during ingestion.
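A rough sketch of that ingestion step, assuming the OpenAI Python SDK for the embedding call; the `db.upsert` line is a hypothetical stand-in, since every database names this operation differently:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

docs = ["What is a vector database?", "HNSW builds a layered proximity graph."]
resp = client.embeddings.create(model="text-embedding-3-large", input=docs)
vectors = [item.embedding for item in resp.data]   # 3072-dim lists of floats

# Hypothetical upsert: store the vector together with the source text and metadata.
for i, (text, vec) in enumerate(zip(docs, vectors)):
    db.upsert(id=str(i), vector=vec, payload={"text": text, "source": "docs"})
```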
Index Construction: Two ANN algorithms dominate: HNSW and IVF-PQ.
HNSW builds a multi-layer proximity graph. The bottom layer holds every vector. Each higher layer holds a shrinking subset. Search starts at the top (few nodes, big jumps) and works down through layers (more nodes, shorter jumps). Think of it like a skip list, but in high-dimensional space. The M parameter controls how many connections each node gets (higher means better accuracy and more memory), and ef_construction controls how broadly the algorithm searches during build time.
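A minimal sketch of those knobs using the hnswlib library (values are illustrative defaults, not tuned recommendations):

```python
import hnswlib
import numpy as np

dim, n = 768, 100_000
data = np.random.randn(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # the two build-time knobs
index.add_items(data, np.arange(n))

index.set_ef(64)                             # query-time breadth: recall vs latency
labels, distances = index.knn_query(data[:5], k=10)
```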
IVF-PQ takes a different approach. First it clusters vectors into partitions using k-means (the IVF part), then compresses each vector with product quantization (the PQ part). At query time, the search only covers the nearest clusters and computes distances on the quantized representations. This uses far less memory than HNSW, but recall takes a hit.
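Here is what IVF-PQ looks like in faiss (parameters are illustrative; m must divide the dimension count):

```python
import faiss
import numpy as np

d, n = 768, 100_000
data = np.random.randn(n, d).astype(np.float32)

nlist = 1024                       # k-means clusters (the IVF part)
m, nbits = 96, 8                   # 96 subvectors at 8 bits each (the PQ part)
quantizer = faiss.IndexFlatL2(d)   # coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(data)                  # k-means centroids and PQ codebooks need a training pass
index.add(data)
index.nprobe = 16                  # clusters to visit per query: the recall/speed knob
distances, ids = index.search(data[:5], 10)
```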
Query Execution: The client sends a query vector along with optional metadata filters. The database searches the ANN index for candidate vectors, applies the filters, computes exact distances on the survivors, and returns the top-K results with similarity scores.
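In outline, with `index.ann_search` and `filters.matches` as hypothetical stand-ins for whatever the database does internally:

```python
import numpy as np

def top_k_query(index, corpus, metadata, query_vec, filters, k=10, overfetch=4):
    q = query_vec / np.linalg.norm(query_vec)
    # 1. ANN search, overfetching to compensate for candidates the filter removes.
    candidates = index.ann_search(q, k * overfetch)          # hypothetical call
    # 2. Apply metadata filters to the candidate set.
    survivors = [i for i in candidates if filters.matches(metadata[i])]
    # 3. Exact similarity on the survivors only, then top-K.
    scored = sorted(survivors, key=lambda i: -float(corpus[i] @ q))
    return scored[:k]
```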
Architecture Deep Dive
Pinecone is fully managed and serverless. Create an index, upsert vectors, query. That is it. No clusters to manage, no sharding to think about. It runs a proprietary distributed architecture that handles replication and scaling automatically. Pricing is based on storage and query volume. If the team does not want to run infrastructure, Pinecone is the obvious choice. The downside is cost at scale and less control over index tuning.
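The whole lifecycle fits in a few lines; this sketch assumes the current pinecone Python SDK, so treat the exact surface as approximate:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")       # no clusters, no sharding: the index is the only handle
index = pc.Index("docs")

index.upsert(vectors=[
    {"id": "doc-1", "values": [0.1] * 1536, "metadata": {"lang": "en"}},
])
hits = index.query(vector=[0.1] * 1536, top_k=5,
                   filter={"lang": {"$eq": "en"}}, include_metadata=True)
```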
Qdrant is open-source and written in Rust. Where it really stands out is filtered vector search. Its HNSW implementation supports payload-based filtering during graph traversal, not as a post-processing step. This matters enormously when metadata filters are highly selective. It also supports multi-vector points (multiple vectors per entity), binary quantization for memory savings, and snapshot-based backups.
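A filtered search in the qdrant-client Python SDK looks roughly like this; note that the filter is part of the search call itself, not a post-processing step:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="docs",
    query_vector=[0.1] * 768,
    query_filter=Filter(must=[               # evaluated during HNSW traversal
        FieldCondition(key="lang", match=MatchValue(value="en")),
    ]),
    limit=5,
)
```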
Milvus is open-source (Go/C++) and built for billion-scale deployments. The architecture separates storage, indexing, and query processing into independent microservices, so each one scales independently. It supports GPU-accelerated indexing and search. For handling 10 billion+ vectors, Milvus is probably the only realistic open-source option.
Weaviate is open-source (Go) and differentiates with built-in vectorization modules. Raw text or images can be sent directly, and Weaviate calls the embedding model automatically. It also supports generative search (basically RAG built into the database layer) and multimodal search. Convenient for a more batteries-included experience, though the extra abstraction can get in the way when fine-grained control is needed.
pgvector extends PostgreSQL with vector similarity search using ivfflat or HNSW indexes. For teams already running PostgreSQL with a dataset under 5 million vectors, pgvector saves the trouble of adding another database to the stack. That is a real operational win. But it does not keep up with purpose-built vector databases at scale. Know its ceiling before committing.
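A minimal pgvector setup through psycopg, assuming a PostgreSQL instance with the extension available:

```python
import psycopg  # psycopg 3

conn = psycopg.connect("dbname=app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""CREATE TABLE IF NOT EXISTS items (
    id bigserial PRIMARY KEY, content text, embedding vector(1536))""")
conn.execute("CREATE INDEX IF NOT EXISTS items_hnsw ON items "
             "USING hnsw (embedding vector_cosine_ops)")

# <=> is pgvector's cosine-distance operator; order by it and limit for top-K.
rows = conn.execute(
    "SELECT id, content FROM items ORDER BY embedding <=> %s::vector LIMIT 10",
    (str([0.1] * 1536),),
).fetchall()
```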
Production Considerations
Dimension count directly affects memory, storage, and search performance. Text embeddings range from 384 (all-MiniLM-L6-v2) to 3072 (text-embedding-3-large). More dimensions capture more nuance, but they cost more across the board. Matryoshka embeddings allow truncating dimensions (use 512 out of 3072, for instance) with a graceful accuracy dropoff. This is worth exploring before defaulting to max dimensions.
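Truncation itself is trivial; the catch, flagged in the comment below, is that it only works for models trained the Matryoshka way:

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int = 512) -> np.ndarray:
    # Keep the leading dimensions, then re-normalize so cosine similarity still works.
    # Only valid for Matryoshka-trained models (e.g. text-embedding-3-large);
    # truncating ordinary embeddings loses far more accuracy.
    head = emb[..., :dims]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

full = np.random.randn(4, 3072).astype(np.float32)   # placeholder embeddings
small = truncate_matryoshka(full, 512)               # 6x less memory per vector
```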
Plan for re-indexing from day one. Embedding models get better regularly. When switching models, every single vector needs to be recomputed. Build the ingestion pipeline so it can run a full re-index while still serving live queries. At 10 million documents, re-indexing through an API-based embedding model takes 8-12 hours and costs $50-200 in API calls. Without planning for this upfront, the pain is real.
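One way to structure that is a blue/green re-index; everything named here (store, embed_batch, chunked, alias) is a hypothetical helper, not any particular database's API:

```python
def reindex(documents, new_model: str, batch_size: int = 256) -> None:
    # Build the new index alongside the live one; queries keep hitting the old index.
    new_index = store.create_index(f"docs-{new_model}")
    for batch in chunked(documents, batch_size):
        vectors = embed_batch(new_model, [d.text for d in batch])
        new_index.upsert((d.id, v) for d, v in zip(batch, vectors))
    # Atomic cutover once the new index is fully populated and spot-checked.
    alias.switch("docs", to=new_index)
```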
Monitor recall quality over time. As the data distribution shifts, index parameters tuned for the original dataset can degrade. Run periodic recall benchmarks against a ground-truth evaluation set. This is easy to skip and painful to debug when search quality silently drops.
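A recall check needs only exact brute-force ground truth on a held-out query set, so it is cheap to automate:

```python
import numpy as np

def exact_top_k(corpus: np.ndarray, queries: np.ndarray, k: int = 10) -> np.ndarray:
    # Brute-force ground truth; assumes both sides are L2-normalized.
    return np.argsort(-(queries @ corpus.T), axis=1)[:, :k]

def recall_at_k(ann_ids, true_ids, k: int = 10) -> float:
    # Fraction of ground-truth neighbors the ANN index actually returned.
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(ann_ids, true_ids))
    return hits / (len(true_ids) * k)
```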
Pros
- • Single-digit-millisecond similarity search across billions of vectors
- • ANN indexes dramatically outperform brute-force scanning
- • Native metadata filtering combined with vector search
- • Managed cloud options reduce operational overhead
- • Support for hybrid search (dense + sparse vectors)
Cons
- • Results are approximate. ANN algorithms trade recall for speed.
- • Index build time grows steeply for large datasets (hours at the billion-vector scale)
- • Memory-hungry. HNSW indexes hold graph structures in RAM.
- • No standard query language. Every database ships its own API.
- • Changing your embedding model means re-indexing your entire corpus
When to use
- • Semantic similarity search where keyword matching falls short
- • RAG applications that need fast retrieval over large document sets
- • Recommendation systems built on learned embeddings
- • Any application where items are represented as dense vectors
When NOT to use
- • Exact match or structured queries (use a relational database)
- • Small datasets under 10K vectors (brute-force cosine similarity works fine)
- • Frequently swapping embedding models (re-indexing cost adds up fast)
- • Workloads that require ACID transactions on vector data
Key Points
- • HNSW (Hierarchical Navigable Small World) is the dominant ANN algorithm. It builds a multi-layer graph where each layer forms a navigable small-world network, hitting O(log N) search complexity with 95%+ recall.
- • IVF-PQ (Inverted File with Product Quantization) compresses vectors down to 1/32nd their original size. It splits each vector into subvectors and quantizes each one independently, making billion-scale search possible on commodity hardware.
- • Metadata filtering order matters a lot. Pre-filtering narrows the search space (faster, but risks missing relevant vectors). Post-filtering searches broadly first, then removes mismatches (more accurate, but slower).
- • Pick the distance metric carefully: cosine similarity for normalized text embeddings, L2/Euclidean for spatial data, and inner product for maximum inner-product search (MIPS) in recommendation systems.
- • Pinecone, Qdrant, Milvus, Weaviate, and Chroma lead the space. Pinecone is fully managed, Qdrant is best at filtered search, Milvus handles the largest scale, Weaviate does multimodal well, and Chroma optimizes for simplicity.
Common Mistakes
- ✗ Skipping benchmarks on actual data. ANN performance shifts dramatically with dimensionality, cluster structure, and query patterns. Synthetic benchmarks (ANN-Benchmarks) will not predict how a production workload behaves.
- ✗ Using cosine similarity on unnormalized vectors. If the embeddings are not L2-normalized, cosine similarity and dot product produce different rankings. Always normalize first or match the metric to the embedding model (see the sketch after this list).
- ✗ Ignoring the recall-latency tradeoff in HNSW. Cranking up ef_search improves recall but increases latency roughly linearly. Profile this tradeoff against the actual accuracy needs.
- ✗ Storing vectors without the source text. Always co-locate the original text and metadata alongside the vector. Round-tripping to a second database for content adds latency and unnecessary complexity.
- ✗ Not planning for embedding model upgrades. When moving from text-embedding-ada-002 to text-embedding-3-large, every vector in the database needs to be recomputed. Design the pipeline for periodic full re-indexing from the start.
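To make the normalization point concrete, a quick NumPy check with two vectors pointing in the same direction at different magnitudes:

```python
import numpy as np

v, w = np.array([3.0, 4.0]), np.array([6.0, 8.0])   # same direction, different length
print(v @ w)                                         # 50.0: dot product rewards magnitude
v_n, w_n = v / np.linalg.norm(v), w / np.linalg.norm(w)
print(v_n @ w_n)                                     # 1.0: cosine similarity after normalizing
```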