Vector Databases
Specialized databases for storing, indexing, and querying high-dimensional vector embeddings at scale
Why It Exists
Here is the core problem: traditional databases are built for exact matches. Find rows where status = 'active' or price < 100. AI applications work differently. They deal in embeddings, high-dimensional vectors (768 to 3072 dimensions) where meaning lives in proximity. The question is not "find this exact value" but "find the 10 most similar items."
Brute-force is the naive approach. Compute cosine similarity against every vector, O(N×D) time. At 10 million vectors with 1536 dimensions, that is 15 billion floating-point operations per query. Completely impractical for anything real-time.
Vector databases fix this with Approximate Nearest Neighbor (ANN) algorithms. They drop search complexity from O(N) to O(log N) by building specialized index structures. The tradeoff is a small accuracy hit (typically 95-99% recall) in exchange for going from seconds to single-digit milliseconds. In practice, that recall gap rarely matters.
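For a sense of what that brute-force scan looks like, here is a minimal NumPy sketch (sizes shrunk so it actually runs; the real pain starts at millions of vectors):

```python
import numpy as np

N, D, K = 100_000, 1536, 10                  # corpus size, dimensions, results wanted
corpus = np.random.randn(N, D).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # L2-normalize once

def brute_force_top_k(query: np.ndarray, k: int = K) -> np.ndarray:
    query = query / np.linalg.norm(query)
    scores = corpus @ query                  # N x D multiply-adds on every single query
    return np.argpartition(-scores, k)[:k]   # indices of the k most similar vectors
```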
How It Works
Embedding and Ingestion: Data (text, images, audio) gets converted into dense vector embeddings using a model. OpenAI's text-embedding-3-large, for example, produces 3072-dimensional vectors. These vectors get inserted along with the source content and metadata into the database. The database builds or updates its ANN index during ingestion.
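A rough sketch of that ingestion step, assuming the OpenAI Python SDK for the embedding call; the `db.upsert` line is a hypothetical stand-in, since every database names this operation differently:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

docs = ["What is a vector database?", "HNSW builds a layered proximity graph."]
resp = client.embeddings.create(model="text-embedding-3-large", input=docs)
vectors = [item.embedding for item in resp.data]   # 3072-dim lists of floats

# Hypothetical upsert: store the vector together with the source text and metadata.
for i, (text, vec) in enumerate(zip(docs, vectors)):
    db.upsert(id=str(i), vector=vec, payload={"text": text, "source": "docs"})
```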
Index Construction: Two ANN algorithms dominate: HNSW and IVF-PQ.
HNSW builds a multi-layer proximity graph. The bottom layer holds every vector. Each higher layer holds a shrinking subset. Search starts at the top (few nodes, big jumps) and works down through layers (more nodes, shorter jumps). Think of it like a skip list, but in high-dimensional space. The M parameter controls how many connections each node gets (higher means better accuracy and more memory), and ef_construction controls how broadly the algorithm searches during build time.
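A minimal sketch of those knobs using the hnswlib library (values are illustrative defaults, not tuned recommendations):

```python
import hnswlib
import numpy as np

dim, n = 768, 100_000
data = np.random.randn(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # the two build-time knobs
index.add_items(data, np.arange(n))

index.set_ef(64)                             # query-time breadth: recall vs latency
labels, distances = index.knn_query(data[:5], k=10)
```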
IVF-PQ takes a different approach. First it clusters vectors into partitions using k-means (the IVF part), then compresses each vector with product quantization (the PQ part). At query time, the search only covers the nearest clusters and computes distances on the quantized representations. This uses far less memory than HNSW, but recall takes a hit.
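Here is what IVF-PQ looks like in faiss (parameters are illustrative; m must divide the dimension count):

```python
import faiss
import numpy as np

d, n = 768, 100_000
data = np.random.randn(n, d).astype(np.float32)

nlist = 1024                       # k-means clusters (the IVF part)
m, nbits = 96, 8                   # 96 subvectors at 8 bits each (the PQ part)
quantizer = faiss.IndexFlatL2(d)   # coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(data)                  # k-means centroids and PQ codebooks need a training pass
index.add(data)
index.nprobe = 16                  # clusters to visit per query: the recall/speed knob
distances, ids = index.search(data[:5], 10)
```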
Query Execution: The client sends a query vector along with optional metadata filters. The database searches the ANN index for candidate vectors, applies the filters, computes exact distances on the survivors, and returns the top-K results with similarity scores.
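In outline, with `index.ann_search` and `filters.matches` as hypothetical stand-ins for whatever the database does internally:

```python
import numpy as np

def top_k_query(index, corpus, metadata, query_vec, filters, k=10, overfetch=4):
    q = query_vec / np.linalg.norm(query_vec)
    # 1. ANN search, overfetching to compensate for candidates the filter removes.
    candidates = index.ann_search(q, k * overfetch)          # hypothetical call
    # 2. Apply metadata filters to the candidate set.
    survivors = [i for i in candidates if filters.matches(metadata[i])]
    # 3. Exact similarity on the survivors only, then top-K.
    scored = sorted(survivors, key=lambda i: -float(corpus[i] @ q))
    return scored[:k]
```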
Architecture Deep Dive
Pinecone is fully managed and serverless. Create an index, upsert vectors, query. That is it. No clusters to manage, no sharding to think about. It runs a proprietary distributed architecture that handles replication and scaling automatically. Pricing is based on storage and query volume. If the team does not want to run infrastructure, Pinecone is the obvious choice. The downside is cost at scale and less control over index tuning.
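The whole lifecycle fits in a few lines; this sketch assumes the current pinecone Python SDK, so treat the exact surface as approximate:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")       # no clusters, no sharding: the index is the only handle
index = pc.Index("docs")

index.upsert(vectors=[
    {"id": "doc-1", "values": [0.1] * 1536, "metadata": {"lang": "en"}},
])
hits = index.query(vector=[0.1] * 1536, top_k=5,
                   filter={"lang": {"$eq": "en"}}, include_metadata=True)
```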
Qdrant is open-source and written in Rust. Where it really stands out is filtered vector search. Its HNSW implementation supports payload-based filtering during graph traversal, not as a post-processing step. This matters enormously when metadata filters are highly selective. It also supports multi-vector points (multiple vectors per entity), binary quantization for memory savings, and snapshot-based backups.
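A filtered search in the qdrant-client Python SDK looks roughly like this; note that the filter is part of the search call itself, not a post-processing step:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="docs",
    query_vector=[0.1] * 768,
    query_filter=Filter(must=[               # evaluated during HNSW traversal
        FieldCondition(key="lang", match=MatchValue(value="en")),
    ]),
    limit=5,
)
```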
Milvus is open-source (Go/C++) and built for billion-scale deployments. The architecture separates storage, indexing, and query processing into independent microservices, so each one scales independently. It supports GPU-accelerated indexing and search. For handling 10 billion+ vectors, Milvus is probably the only realistic open-source option.
Weaviate is open-source (Go) and differentiates with built-in vectorization modules. Raw text or images can be sent directly, and Weaviate calls the embedding model automatically. It also supports generative search (basically RAG built into the database layer) and multimodal search. Convenient for a more batteries-included experience, though the extra abstraction can get in the way when fine-grained control is needed.
pgvector extends PostgreSQL with vector similarity search using ivfflat or HNSW indexes. For teams already running PostgreSQL with a dataset under 5 million vectors, pgvector saves the trouble of adding another database to the stack. That is a real operational win. But it does not keep up with purpose-built vector databases at scale. Know its ceiling before committing.
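A minimal pgvector setup through psycopg, assuming a PostgreSQL instance with the extension available:

```python
import psycopg  # psycopg 3

conn = psycopg.connect("dbname=app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""CREATE TABLE IF NOT EXISTS items (
    id bigserial PRIMARY KEY, content text, embedding vector(1536))""")
conn.execute("CREATE INDEX IF NOT EXISTS items_hnsw ON items "
             "USING hnsw (embedding vector_cosine_ops)")

# <=> is pgvector's cosine-distance operator; order by it and limit for top-K.
rows = conn.execute(
    "SELECT id, content FROM items ORDER BY embedding <=> %s::vector LIMIT 10",
    (str([0.1] * 1536),),
).fetchall()
```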
Production Considerations
Dimension count directly affects memory, storage, and search performance. Text embeddings range from 384 (all-MiniLM-L6-v2) to 3072 (text-embedding-3-large). More dimensions capture more nuance, but they cost more across the board. Matryoshka embeddings allow truncating dimensions (use 512 out of 3072, for instance) with a graceful accuracy dropoff. This is worth exploring before defaulting to max dimensions.
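Truncation itself is trivial; the catch, flagged in the comment below, is that it only works for models trained the Matryoshka way:

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int = 512) -> np.ndarray:
    # Keep the leading dimensions, then re-normalize so cosine similarity still works.
    # Only valid for Matryoshka-trained models (e.g. text-embedding-3-large);
    # truncating ordinary embeddings loses far more accuracy.
    head = emb[..., :dims]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

full = np.random.randn(4, 3072).astype(np.float32)   # placeholder embeddings
small = truncate_matryoshka(full, 512)               # 6x less memory per vector
```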
Plan for re-indexing from day one. Embedding models get better regularly. When switching models, every single vector needs to be recomputed. Build the ingestion pipeline so it can run a full re-index while still serving live queries. At 10 million documents, re-indexing through an API-based embedding model takes 8-12 hours and costs $50-200 in API calls. Without planning for this upfront, the pain is real.
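One way to structure that is a blue/green re-index; everything named here (store, embed_batch, chunked, alias) is a hypothetical helper, not any particular database's API:

```python
def reindex(documents, new_model: str, batch_size: int = 256) -> None:
    # Build the new index alongside the live one; queries keep hitting the old index.
    new_index = store.create_index(f"docs-{new_model}")
    for batch in chunked(documents, batch_size):
        vectors = embed_batch(new_model, [d.text for d in batch])
        new_index.upsert((d.id, v) for d, v in zip(batch, vectors))
    # Atomic cutover once the new index is fully populated and spot-checked.
    alias.switch("docs", to=new_index)
```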
Monitor recall quality over time. As the data distribution shifts, index parameters tuned for the original dataset can degrade. Run periodic recall benchmarks against a ground-truth evaluation set. This is easy to skip and painful to debug when search quality silently drops.
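A recall check needs only exact brute-force ground truth on a held-out query set, so it is cheap to automate:

```python
import numpy as np

def exact_top_k(corpus: np.ndarray, queries: np.ndarray, k: int = 10) -> np.ndarray:
    # Brute-force ground truth; assumes both sides are L2-normalized.
    return np.argsort(-(queries @ corpus.T), axis=1)[:, :k]

def recall_at_k(ann_ids, true_ids, k: int = 10) -> float:
    # Fraction of ground-truth neighbors the ANN index actually returned.
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(ann_ids, true_ids))
    return hits / (len(true_ids) * k)
```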
Pros
- • Single-digit-millisecond similarity search across billions of vectors
- • ANN indexes dramatically outperform brute-force scanning
- • Native metadata filtering combined with vector search
- • Managed cloud options reduce operational overhead
- • Support for hybrid search (dense + sparse vectors)
Cons
- • Results are approximate. ANN algorithms trade recall for speed.
- • Index build time grows steeply for large datasets (hours at the billion-vector scale)
- • Memory-hungry. HNSW indexes hold graph structures in RAM.
- • No standard query language. Every database ships its own API.
- • Changing your embedding model means re-indexing your entire corpus
When to use
- • Semantic similarity search where keyword matching falls short
- • RAG applications that need fast retrieval over large document sets
- • Recommendation systems built on learned embeddings
- • Any application where items are represented as dense vectors
When NOT to use
- • Exact match or structured queries (use a relational database)
- • Small datasets under 10K vectors (brute-force cosine similarity works fine)
- • Frequently swapping embedding models (re-indexing cost adds up fast)
- • Workloads that require ACID transactions on vector data
Key Points
- • HNSW (Hierarchical Navigable Small World) is the dominant ANN algorithm. It builds a multi-layer graph where each layer forms a navigable small-world network, hitting O(log N) search complexity with 95%+ recall.
- • IVF-PQ (Inverted File with Product Quantization) compresses vectors down to 1/32nd their original size. It splits each vector into subvectors and quantizes each one independently, making billion-scale search possible on commodity hardware.
- • Metadata filtering order matters a lot. Pre-filtering narrows the search space (faster, but risks missing relevant vectors). Post-filtering searches broadly first, then removes mismatches (more accurate, but slower).
- • Pick the distance metric carefully: cosine similarity for normalized text embeddings, L2/Euclidean for spatial data, and inner product for maximum inner-product search (MIPS) in recommendation systems.
- • Pinecone, Qdrant, Milvus, Weaviate, and Chroma lead the space. Pinecone is fully managed, Qdrant is best at filtered search, Milvus handles the largest scale, Weaviate does multimodal well, and Chroma optimizes for simplicity.
Common Mistakes
- ✗ Skipping benchmarks on actual data. ANN performance shifts dramatically with dimensionality, cluster structure, and query patterns. Synthetic benchmarks (ANN-Benchmarks) will not predict how a production workload behaves.
- ✗ Using cosine similarity on unnormalized vectors. If the embeddings are not L2-normalized, cosine similarity and dot product produce different rankings. Always normalize first or match the metric to the embedding model (see the sketch after this list).
- ✗ Ignoring the recall-latency tradeoff in HNSW. Cranking up ef_search improves recall but increases latency roughly linearly. Profile this tradeoff against the actual accuracy needs.
- ✗ Storing vectors without the source text. Always co-locate the original text and metadata alongside the vector. Round-tripping to a second database for content adds latency and unnecessary complexity.
- ✗ Not planning for embedding model upgrades. When moving from text-embedding-ada-002 to text-embedding-3-large, every vector in the database needs to be recomputed. Design the pipeline for periodic full re-indexing from the start.
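To make the normalization point concrete, a quick NumPy check with two vectors pointing in the same direction at different magnitudes:

```python
import numpy as np

v, w = np.array([3.0, 4.0]), np.array([6.0, 8.0])   # same direction, different length
print(v @ w)                                         # 50.0: dot product rewards magnitude
v_n, w_n = v / np.linalg.norm(v), w / np.linalg.norm(w)
print(v_n @ w_n)                                     # 1.0: cosine similarity after normalizing
```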