About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
I built my first vector search system with a flat numpy array and brute-force cosine similarity. Three hundred fifty chunks, 1024 dimensions, under 2MB. Search completed in microseconds. That works fine for a few hundred documents. It stops working when you hit millions of vectors, need sub-10ms latency at thousands of queries per second, and your index no longer fits in memory on a single node. That is where vector databases earn their place: they solve the hard problem of approximate nearest neighbor search at scale, and they form the retrieval backbone of every serious RAG (Retrieval-Augmented Generation) system in production today.
This article is not a getting-started tutorial. It is an architecture reference covering how vector search actually works under the hood: the indexing algorithms, the distance metrics, the embedding models, the database landscape, and the RAG pipeline patterns that tie everything together. If you are evaluating vector databases for a production system or debugging why your RAG pipeline returns irrelevant results, this is the reference you want.

What Vector Databases Actually Do
A vector database stores, indexes, and searches high-dimensional numerical representations of data. Every piece of content (a document, an image, a code snippet, an audio clip) gets converted into a dense vector through an embedding model. That vector captures the semantic meaning of the content in a fixed-length array of floating-point numbers, typically 256 to 3072 dimensions.
The core operation is similarity search: given a query vector, find the k vectors in the database that are closest to it according to some distance metric. This is fundamentally different from what traditional databases do.
The Embedding Representation Problem
Traditional databases search for exact matches or pattern matches. You query for WHERE status = 'active' or WHERE title LIKE '%architecture%'. These operations work on discrete, structured data. They answer the question: "does this record match my criteria?"
Vector search answers a different question: "what is most similar to this?" A user searching for "how do I reduce my AWS bill" should find documents about cost optimization, reserved instances, savings plans, and right-sizing, even if none of those documents contain the exact phrase "reduce my AWS bill." Embedding models encode that semantic relationship into vector space. Documents about similar topics cluster together; unrelated documents land far apart.
Why B-Trees and Hash Indexes Fail
Traditional database indexes exploit the structure of their data types. B-trees work because numbers and strings have a natural ordering. Hash indexes work because you can compute a deterministic hash for exact lookups. Neither property holds in high-dimensional vector space.
| Property | B-Tree / Hash Index | Vector Index |
|---|---|---|
| Data type | Scalars, strings | Dense float arrays (256-3072 dims) |
| Query type | Exact match, range scan | Nearest neighbor similarity |
| Ordering | Natural total order | No meaningful total order |
| Dimensionality | 1-dimensional keys | Hundreds to thousands of dimensions |
| Curse of dimensionality | Not applicable | Dominates performance above ~20 dims |
| Result guarantee | Exact | Approximate (with recall guarantees) |
The curse of dimensionality is the central challenge. In low dimensions, you can partition space efficiently with tree structures (kd-trees work well up to about 20 dimensions). Above that threshold, the volume of space grows so fast that spatial partitioning loses its advantage. Every partition contains too many points, or you need to search too many partitions to maintain recall. The entire field of approximate nearest neighbor (ANN) search exists to solve this problem.
Distance Metrics: Choosing the Right Similarity Measure
Three distance metrics dominate vector search. Each measures "closeness" differently, and the choice directly affects retrieval quality.
| Metric | Formula | Range | Best For | Key Property |
|---|---|---|---|---|
| Cosine Similarity | dot(A,B) / (norm(A) * norm(B)) | [-1, 1] | Text similarity, semantic search | Ignores vector magnitude |
| Dot Product | sum(A[i] * B[i]) | (-inf, inf) | Recommendation systems, MaxSim | Considers both direction and magnitude |
| Euclidean (L2) | sqrt(sum((A[i]-B[i])^2)) | [0, inf) | Clustering, anomaly detection | Measures absolute spatial distance |
Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their lengths entirely. Two documents about AWS Lambda will have high cosine similarity regardless of whether one is 500 words and the other is 5,000 words, because the embedding model encodes topic similarity into direction, not magnitude.
This is the default choice for text search and the metric most embedding models are trained with. If you are building a RAG system over documents, start here.
Dot Product
Dot product combines directional similarity with magnitude. If your embedding model produces vectors where magnitude carries meaning (user engagement scores, document importance weights), dot product captures both signals. When vectors are L2-normalized (which most text embedding models do by default), dot product and cosine similarity produce identical rankings.
Euclidean Distance
Euclidean distance measures straight-line distance in vector space. It works well for spatial data and clustering applications where absolute position matters. For text similarity, it performs worse than cosine in most benchmarks because it penalizes magnitude differences that carry no semantic meaning.
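The three metrics, and the normalization identity noted above, fit in a few lines of numpy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle only; vector magnitude is divided out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    # Direction and magnitude together.
    return float(np.dot(a, b))

def euclidean_distance(a, b):
    # Straight-line distance; lower means closer.
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(42)
a = rng.standard_normal(1024)
b = rng.standard_normal(1024)

# After L2 normalization, dot product equals cosine similarity,
# which is why normalized embedding models can use either interchangeably.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
assert np.isclose(dot_product(a_n, b_n), cosine_similarity(a, b))
```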
Matching Your Metric to Your Embedding Model
The single most important rule: use the distance metric your embedding model was trained with. OpenAI's text-embedding-3 models were trained with cosine similarity. Voyage AI's models support cosine by default. Using Euclidean distance with a model trained on cosine loss will produce subtly wrong rankings that are painful to debug because the results look plausible but are not optimal.
Every major embedding provider documents the intended metric. Check it. Match it. Do not guess.
Indexing Algorithms: How Vector Search Gets Fast
Brute-force search (compare the query against every vector in the database) gives perfect results but scales linearly. At 1 million vectors with 1024 dimensions, a single query requires 1 million dot products of 1024-element arrays. A modern CPU handles this in about 100ms. At 100 million vectors, that becomes 10 seconds per query. Indexing algorithms trade a small amount of recall for dramatic speedups.
| Algorithm | Search Complexity | Build Time | Memory Overhead | Supports Updates | Best Scale |
|---|---|---|---|---|---|
| Flat (brute-force) | O(n * d) | O(n) | None | Yes | < 100K vectors |
| IVF | O(nprobe * n/nlist * d) | O(n * d * iters) | Centroids only | Rebuild required | 1M-100M |
| HNSW | O(log n * d) | O(n * log n * d) | Graph edges (significant) | Yes | 1M-100M |
| Product Quantization | O(n * m) | O(n * d * iters) | Compressed vectors | Rebuild required | 100M+ |
| IVF-PQ | O(nprobe * n/nlist * m) | O(n * d * iters) | Centroids + codebooks | Rebuild required | 100M+ |
Brute-Force (Flat) Search
Flat search compares the query against every stored vector. Perfect recall. Linear time. No training step. This is the right choice when your dataset is small enough (under 100,000 vectors) that linear scan completes within your latency budget. I used it for my CMS search index with 350 chunks and it works perfectly. Do not let "approximate nearest neighbor" hype push you into complexity you do not need.
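The entire algorithm is a matrix-vector product and a sort. A minimal sketch over row-normalized vectors, where cosine similarity reduces to a dot product:

```python
import numpy as np

def flat_search(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by cosine similarity over an (n, d) matrix of
    L2-normalized vectors. O(n * d) per query, perfect recall."""
    scores = index @ query              # one dot product per stored vector
    return np.argsort(-scores)[:k]      # indices of the k highest scores

rng = np.random.default_rng(0)
vecs = rng.standard_normal((350, 1024)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)     # normalize rows

# A query that is a slightly perturbed copy of vector 42
query = vecs[42] + 0.01 * rng.standard_normal(1024).astype(np.float32)
query /= np.linalg.norm(query)

top5 = flat_search(vecs, query, k=5)    # vector 42 ranks first
```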
IVF (Inverted File Index)
IVF partitions the vector space into clusters using k-means. During index construction, it groups all vectors by their nearest cluster centroid. During search, it identifies the closest centroids to the query vector, then searches only within those clusters.
The key parameter is n_probe: how many clusters to search. Low n_probe (1-5) gives fast but potentially inaccurate results. High n_probe (50-100) approaches brute-force recall at reduced speed. The number of clusters (n_list) controls the granularity; a common heuristic is sqrt(n) clusters for n vectors.
IVF's weakness: it requires a training phase on representative data, and adding new vectors may require rebuilding the index if the data distribution shifts significantly. For static or slowly-changing corpora, this trade-off is acceptable.
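The mechanics can be sketched in numpy with plain Lloyd k-means. This is an illustration only; it skips everything a production IVF implementation handles (training sample sizing, residual encoding, empty-cluster repair):

```python
import numpy as np

def build_ivf(vectors, n_list, iters=10, seed=0):
    """Train k-means centroids, then bucket every vector by nearest centroid."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_list, replace=False)]
    for _ in range(iters):                      # plain Lloyd iterations
        dists = np.linalg.norm(vectors[:, None] - centroids[None], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_list):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    lists = {c: np.flatnonzero(assign == c) for c in range(n_list)}
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, k, n_probe):
    """Scan only the n_probe clusters whose centroids are closest."""
    order = np.linalg.norm(centroids - query, axis=1).argsort()[:n_probe]
    cand = np.concatenate([lists[c] for c in order])
    d = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[d.argsort()[:k]]

rng = np.random.default_rng(1)
vecs = rng.standard_normal((2000, 32))
centroids, lists = build_ivf(vecs, n_list=45, iters=5)   # ~sqrt(n) clusters
hits = ivf_search(vecs, centroids, lists, vecs[7], k=3, n_probe=45)
# With n_probe == n_list this degrades to exact search, so vector 7 is first.
```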
HNSW (Hierarchical Navigable Small World)
HNSW is the dominant algorithm in production vector databases today. Pinecone, Qdrant, Weaviate, Milvus, pgvector (as of version 0.5.0), and OpenSearch all support HNSW as their primary or default index type. The next section covers it in detail.
Product Quantization
Product quantization (PQ) compresses vectors to dramatically reduce memory consumption. It splits each vector into m sub-vectors, then quantizes each sub-vector to its nearest centroid in a learned codebook. A 1024-dimensional float32 vector (4KB) can compress to 128 bytes with 8 sub-vectors and 256 centroids per codebook.
The compression is lossy. Retrieval accuracy decreases as compression increases. PQ works best as a second stage: use IVF or HNSW to narrow candidates, then use PQ for the final distance computations. This is the IVF-PQ and HNSW-PQ combination that large-scale systems rely on.
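A toy PQ encoder makes the mechanics concrete. This sketch trains one small k-means codebook per sub-vector; real systems (FAISS's IVF-PQ, for instance) add distance tables, residual encoding, and far better training:

```python
import numpy as np

def pq_train(vectors, m, n_centroids=256, iters=8, seed=0):
    """One k-means codebook per sub-vector; returns shape (m, n_centroids, d//m)."""
    rng = np.random.default_rng(seed)
    n, d = vectors.shape
    sub = vectors.reshape(n, m, d // m)
    books = []
    for j in range(m):
        x = sub[:, j]
        cb = x[rng.choice(n, n_centroids, replace=False)]
        for _ in range(iters):                  # plain Lloyd iterations
            assign = np.linalg.norm(x[:, None] - cb[None], axis=2).argmin(1)
            for c in range(n_centroids):
                if (assign == c).any():
                    cb[c] = x[assign == c].mean(0)
        books.append(cb)
    return np.stack(books)

def pq_encode(vectors, books):
    """Store each sub-vector as the id of its nearest centroid.
    uint8 works because there are at most 256 centroids per codebook."""
    n, m = len(vectors), len(books)
    sub = vectors.reshape(n, m, -1)
    codes = np.empty((n, m), dtype=np.uint8)
    for j in range(m):
        codes[:, j] = np.linalg.norm(
            sub[:, j][:, None] - books[j][None], axis=2).argmin(1)
    return codes

vecs = np.random.default_rng(0).standard_normal((512, 64)).astype(np.float32)
books = pq_train(vecs, m=8)          # 8 sub-vectors of 8 dims each
codes = pq_encode(vecs, books)       # (512, 8) uint8: 8 bytes per vector
# 64 floats (256 bytes) compress to 8 bytes per vector: 32x.
```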
Combining Approaches: IVF-PQ and HNSW-PQ
At 100 million vectors and above, memory becomes the binding constraint. A 1024-dimensional float32 index at 100M vectors requires roughly 400GB of RAM for flat storage. IVF-PQ reduces that to under 20GB by compressing vectors within each cluster. HNSW-PQ applies the same compression to the HNSW graph's stored vectors while maintaining the graph structure for navigation.
These combinations allow billion-vector indexes on commodity hardware. Milvus and FAISS both support IVF-PQ natively, and it is the standard approach for indexes that exceed available RAM.
HNSW Deep Dive: The Algorithm That Powers Most Production Systems
HNSW combines two ideas: navigable small-world graphs (where any node can reach any other node in a small number of hops) and skip lists (where hierarchical layers allow logarithmic search). The result is an algorithm with O(log n) search complexity, no training phase, and support for incremental updates.
Layer Construction and the Probability Function
HNSW builds a multi-layer graph. Each layer is a subset of the layer below it, with the bottom layer (layer 0) containing all vectors. Higher layers contain exponentially fewer vectors with longer-range connections that enable fast coarse navigation.
When inserting a vector, the algorithm assigns it a maximum layer using an exponential probability function:
layer = floor(-ln(uniform_random(0,1)) * (1/ln(M)))
Where M is the number of connections per node. This produces an exponential distribution: with M=32, approximately 96.9% of vectors exist only in layer 0, about 3% reach layer 1, roughly 0.1% reach layer 2, and so on. A 1-million-vector index typically has 4-5 layers, with a single entry point at the top.
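You can verify the distribution by sampling the assignment function directly. A quick simulation (M=32, one million inserts):

```python
import math
import random

def assign_layer(M: int, rng: random.Random) -> int:
    # mL = 1/ln(M), as in the HNSW paper; 1 - random() keeps u in (0, 1].
    u = 1.0 - rng.random()
    return math.floor(-math.log(u) / math.log(M))

rng = random.Random(1)
M = 32
layers = [assign_layer(M, rng) for _ in range(1_000_000)]

frac_layer0 = layers.count(0) / len(layers)  # expected 1 - 1/M, about 96.9%
num_layers = max(layers) + 1                 # a handful of layers for 1M inserts
```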
Search: Greedy Routing Through Layers
Search starts at the entry point in the top layer. The algorithm greedily moves to the neighbor closest to the query vector, repeating until no closer neighbor exists (a local minimum). It then drops to the next layer, using the same node as the starting point, and repeats. In the bottom layer, it expands the search to maintain a dynamic candidate list of size efSearch, returning the top-k results.
This hierarchical approach is analogous to zooming in on a map. The top layers provide coarse geographic navigation (continent to country), middle layers narrow the region (country to city), and the bottom layer finds the exact neighborhood.
Tuning Parameters: M, efConstruction, efSearch
Three parameters control the recall/speed/memory trade-off:
| Parameter | What It Controls | Low Value | High Value | Typical Range |
|---|---|---|---|---|
| M | Connections per node | Faster build, less memory, lower recall | Slower build, more memory, higher recall | 12-48 |
| efConstruction | Candidate pool during build | Faster build, lower recall | Slower build, higher recall | 100-500 |
| efSearch | Candidate pool during query | Faster search, lower recall | Slower search, higher recall | 50-300 |
M directly controls memory usage. At M=16, each vector stores up to 16 bidirectional edges in the upper layers and up to 32 in layer 0 (the M_max0 = 2*M convention). Doubling M roughly doubles the graph's memory footprint. For text search with 1024-dimensional embeddings, M=16 provides a good balance. For high-recall requirements, increase to 32 or 48.
efConstruction affects build-time quality. Higher values produce better-connected graphs but take longer to build. For a 1M-vector index, efConstruction=200 typically builds in 2-5 minutes on an 8-core machine; efConstruction=500 may take 10-15 minutes. Set it once and forget it.
efSearch is the only parameter you tune at query time. It directly controls the recall/latency trade-off. Start at efSearch=100 and increase until your recall target is met. For most RAG applications, 95%+ recall at efSearch=128-256 is achievable with sub-5ms latency on million-scale indexes.
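Tuning efSearch against a recall target requires ground truth: run exact brute-force search on a sample of queries, then measure how much of the true top-k the ANN index returns. The metric itself is one line:

```python
def recall_at_k(exact_ids, ann_ids) -> float:
    """Fraction of the true top-k neighbors the ANN search also returned."""
    exact = set(map(int, exact_ids))
    return len(exact & set(map(int, ann_ids))) / len(exact)

# The exact top-5 for a query vs. what an ANN index returned:
exact = [4, 9, 1, 7, 3]
approx = [4, 1, 7, 8, 3]        # missed doc 9, surfaced doc 8 instead
recall_at_k(exact, approx)      # 0.8
```

Average this over a few hundred sampled queries at each efSearch setting and stop raising efSearch once the curve flattens at your target.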
Memory and Performance Trade-offs
HNSW's primary cost is memory. The graph edges alone, independent of the stored vectors, consume significant RAM:
For 1M vectors with M=16: graph overhead is approximately 1M * 16 * 2 * 4 bytes (int32 node IDs, 2*M edges at layer 0) = 128MB. Add the vectors themselves (1M * 1024 * 4 bytes for float32 = 4GB), and you need about 4.1GB total. At 100M vectors, that scales to 410GB, which is why PQ compression matters at scale.
HNSW's advantage over IVF: it supports incremental insertion and deletion without rebuilding the entire index. Insert a new document, embed it, add the vector to the graph. Delete a document, mark the node as deleted. The graph stays functional. IVF requires periodic retraining of centroids when the data distribution changes, which means scheduled downtime or a shadow index swap.
The Vector Database Landscape
The market has stratified into three categories, each serving different operational needs.
Purpose-Built Vector Databases
| Database | Language | Hosting | Index Types | Max Scale | Hybrid Search | Notable Strength |
|---|---|---|---|---|---|---|
| Pinecone | Rust/Python | Managed only | Proprietary | Billions | Yes (sparse-dense) | Zero-ops managed service |
| Qdrant | Rust | Self-hosted + Cloud | HNSW | Billions | Yes (payload filtering) | Filtering performance |
| Weaviate | Go | Self-hosted + Cloud | HNSW | Billions | Yes (BM25 + vector) | Built-in hybrid search |
| Milvus | Go/C++ | Self-hosted + Zilliz Cloud | IVF, HNSW, DiskANN | Billions | Yes | Most index type options |
| Chroma | Python | Self-hosted + Cloud | HNSW | Millions | No | Developer experience |
Pinecone removes all operational burden. No index tuning, no infrastructure management, no capacity planning. You send vectors; it handles everything. The cost is vendor lock-in and a pricing model that can surprise you at scale: $0.33/GB/month storage plus per-operation charges that add up with high query volumes.
Qdrant (written in Rust) excels at filtered vector search. When your query combines similarity search with metadata predicates ("find similar articles tagged with 'AWS' published after 2025"), Qdrant's architecture handles both in a single pass rather than filtering after retrieval. This matters operationally because post-retrieval filtering can discard most of your top-k results, leaving you with poor candidates.
Milvus offers the broadest set of indexing algorithms (IVF, HNSW, DiskANN, GPU-accelerated indexes) and scales to billions of vectors on Kubernetes. It is the heaviest operationally but the most flexible for workloads that need fine-grained index tuning.
Weaviate bakes hybrid search (BM25 keyword + dense vector) directly into the query API. No external search engine required. For RAG applications that benefit from combining keyword precision with semantic recall, this reduces architectural complexity.
Database Extensions
pgvector adds vector search to PostgreSQL. If your application already runs on Postgres and your vector count stays under 10-50 million, pgvector avoids introducing a new database into your stack. It supports both IVF and HNSW indexes. The trade-off: it shares resources with your transactional workload, and at scale (50M+ vectors), purpose-built systems outperform it by 5-10x on throughput.
OpenSearch and Elasticsearch both support vector search through k-NN plugins (see my AWS OpenSearch Service: An Architecture Deep-Dive for OpenSearch internals). These make sense when you already run a search cluster and want to add semantic search alongside full-text search. They are not competitive on pure vector search performance at high scale, but the operational simplicity of one fewer system to manage has real value.
In-Memory Libraries
FAISS (Facebook AI Similarity Search) is a library, not a database. No persistence layer, no query API, no distributed architecture. You load vectors into memory, build an index, and search. It powers the vector search inside several databases (Milvus uses FAISS indexes internally). Use FAISS directly when you need maximum control, your index fits in memory, and you are willing to build your own persistence and serving layer.
ScaNN (Google) and Annoy (Spotify) fill similar niches. ScaNN excels on quantized searches; Annoy provides memory-mapped indexes that load fast from disk.
When to Use What
- Under ~100K vectors: flat brute-force search. No index, no tuning, exact results.
- Already on Postgres, under ~10-50M vectors: pgvector. One fewer system to run.
- Already running OpenSearch or Elasticsearch: the k-NN plugin, if pure vector performance is not critical.
- Heavy metadata filtering alongside similarity: Qdrant.
- Hybrid keyword + vector search out of the box: Weaviate.
- Zero operational burden and budget for it: Pinecone.
- Billions of vectors with Kubernetes expertise in house: Milvus.
- Full control over an in-memory index with your own persistence and serving layer: FAISS.
Embedding Models: The Foundation of Your Vector Pipeline
The embedding model determines the quality ceiling of your entire search system. A perfect index with a mediocre embedding model will return mediocre results. Choose the model first, then choose the database.
| Model | Provider | Dimensions | Context Window | MTEB Score | Pricing (per 1M tokens) | Key Advantage |
|---|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 (or 256-3072) | 8,191 tokens | 64.6 | $0.13 | Flexible dimensionality |
| voyage-3-large | Voyage AI | 1024 (default) | 32,000 tokens | Top across 8 domains | $0.18 | 32K context, Matryoshka support |
| embed-v4 | Cohere | 1024 | 512 tokens | 65.2 | $0.10 | Multilingual strength |
| voyage-3-lite | Voyage AI | 512 | 32,000 tokens | Competitive | $0.02 | Cost-efficient for large corpora |
| mxbai-embed-large | Mixedbread | 1024 | 512 tokens | 64.7 | Free (open-source) | No API dependency |
Dimensionality and Storage Trade-offs
Higher dimensions capture more semantic nuance but cost more to store and search. The relationship is not linear in value: going from 256 to 1024 dimensions produces a measurable quality improvement; going from 1024 to 3072 dimensions produces a smaller improvement at 3x the storage cost.
For most RAG applications, 1024 dimensions hits the sweet spot. (I used voyage-3-lite at 1024 dimensions for the semantic search system I built into this site's CMS; see AWS Lambda Container Images: An Architecture Deep-Dive for the Lambda container architecture behind it.) At 1M vectors with float32 storage: 1024 dims = 4GB, 3072 dims = 12GB. The quality difference rarely justifies the 3x storage and memory cost.
Matryoshka Embeddings and Quantization
Voyage AI's voyage-3-large supports Matryoshka representation learning, which means you can truncate embeddings to lower dimensions (512, 256, even 128) after generation with minimal quality loss. The first N dimensions of a Matryoshka embedding carry the most information; you are not randomly discarding signal.
Combined with quantization (float32 to int8 or binary), the storage savings compound. A 512-dimensional binary embedding from voyage-3-large outperforms OpenAI's full 3072-dimensional float32 embedding while requiring 200x less storage. This is not a typo. The combination of better model architecture, Matryoshka learning, and quantization-aware training produces genuinely dramatic efficiency gains.
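The storage arithmetic is easy to verify. This sketch shows the mechanics (truncate, re-normalize, take signs, pack bits); note that the quality claims depend entirely on the model being trained for Matryoshka truncation and binary quantization, which the random vectors here are not:

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize. Only meaningful
    for models trained with Matryoshka representation learning."""
    cut = emb[..., :dims]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

def binary_quantize(emb: np.ndarray) -> np.ndarray:
    """One bit per dimension (the sign), packed eight per byte."""
    return np.packbits(emb > 0, axis=-1)

full = np.random.default_rng(0).standard_normal((1000, 3072)).astype(np.float32)
small = truncate_matryoshka(full, 512)    # 512 float32 = 2,048 bytes/vector
packed = binary_quantize(small)           # 512 bits    = 64 bytes/vector
# 12,288 bytes down to 64 bytes per vector: the ~200x figure above.
```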
RAG Architecture: From Query to Answer
RAG (Retrieval-Augmented Generation) solves a fundamental LLM limitation: language models know only what was in their training data. RAG gives them access to your private data at query time by retrieving relevant context and injecting it into the prompt.
The Core Pipeline
The RAG pipeline has two phases: offline indexing and online retrieval.
Offline indexing happens at build time or on a schedule. Documents are chunked into segments, each segment is embedded into a vector, and the vectors are stored in the database with metadata (source document, section heading, page number, tags). This runs once per document update.
Online retrieval happens at query time. The user's question is embedded with the same model (using the query input type if the model supports it), the vector database returns the top-k most similar chunks, and those chunks are assembled into an LLM prompt along with the original question. The LLM synthesizes an answer grounded in the retrieved context.
The critical constraint: the embedding model used for queries must be the same model used for indexing. Mixing models (embedding documents with OpenAI but queries with Voyage) produces vectors in incompatible spaces. The similarity scores become meaningless.
Chunking Strategies
How you split documents into chunks determines retrieval precision. Chunk too large and you dilute relevant content with irrelevant context. Chunk too small and you lose the coherence needed for the LLM to synthesize a useful answer.
| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-size | Split every N tokens | Simple, predictable | Breaks mid-sentence, ignores structure | Uniform documents |
| Recursive | Split by section, then paragraph, then sentence | Preserves document structure | Uneven chunk sizes | Structured documents (articles, docs) |
| Semantic | Embed sentences, split at semantic boundaries | Meaning-aware boundaries | Requires embedding every sentence | Varied content types |
| Heading-based | Split at H2/H3 boundaries | Preserves topical coherence | Depends on document having headings | Technical documentation |
| Sliding window | Fixed size with N% overlap | Reduces boundary artifacts | Duplicate content increases index size | Dense technical content |
For most RAG systems, recursive chunking at heading boundaries with a 400-800 token target and 10-15% overlap provides the best baseline. Research from Chroma shows recursive splitting delivers 85-90% recall at 400 tokens, while semantic chunking reaches 91-92%. That 2-3% improvement costs embedding every individual sentence during indexing, which is significant at scale.
My recommendation: start with heading-based recursive chunking. Prepend the document title and section heading to each chunk for context. Only move to semantic chunking if your recall metrics show clear deficiencies with the simpler approach.
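A minimal version of that recommendation: split at H2/H3 headings, window long sections with overlap, and prepend the title and heading to every chunk. This sketch counts words as a stand-in for tokens; a real pipeline uses the embedding model's tokenizer:

```python
import re

def chunk_markdown(doc: str, title: str, max_words: int = 400,
                   overlap: float = 0.125) -> list[str]:
    """Heading-based chunking with a sliding window inside long sections."""
    chunks = []
    for sec in re.split(r"\n(?=#{2,3} )", doc):
        lines = sec.splitlines()
        if not lines:
            continue
        if lines[0].startswith("#"):
            heading = lines[0].lstrip("# ").strip()
            words = " ".join(lines[1:]).split()
        else:
            heading, words = "", sec.split()
        step = max(1, int(max_words * (1 - overlap)))   # 12.5% overlap
        for start in range(0, max(len(words), 1), step):
            body = " ".join(words[start:start + max_words])
            if body:
                # Prepend title and heading so each chunk stands alone.
                chunks.append(f"{title} | {heading}\n{body}")
    return chunks

doc = "## Intro\n" + "alpha " * 100 + "\n## Setup\n" + "beta " * 500
chunks = chunk_markdown(doc, title="Vector DB Guide")
# One chunk for the short section, two overlapping chunks for the long one.
```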
Hybrid Search: Combining Dense and Sparse Retrieval
Dense vector search captures semantic similarity but can miss exact keyword matches. A search for "ECS task definition" might return results about "container configuration" (semantically similar) while missing a document that uses the exact phrase "ECS task definition" in a table.
Hybrid search combines BM25 keyword search (sparse retrieval) with vector similarity search (dense retrieval), then merges the results. This catches both semantic matches and exact keyword matches. In practice, hybrid search improves recall by 5-15% over pure vector search for technical documentation where specific terminology matters.
Weaviate and OpenSearch support hybrid search natively. For other databases, you can implement it by running both searches in parallel and applying reciprocal rank fusion (RRF) to merge results.
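RRF itself is small enough to implement inline. Each ranked list contributes 1/(k + rank) per document, with k=60 as the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; documents high in multiple lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["d3", "d1", "d7", "d2"]   # sparse (keyword) ranking
vector_hits = ["d1", "d5", "d3", "d9"]   # dense (embedding) ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# d1 and d3 rank high in both lists, so they fuse to the top.
```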
Re-ranking for Precision
Vector search retrieves candidates; re-ranking sorts them by actual relevance. A re-ranker (like Cohere Rerank or a cross-encoder model) takes the query and each candidate chunk, processes them together (not independently like bi-encoder embeddings), and produces a fine-grained relevance score.
The pattern: retrieve 20-50 candidates with vector search (fast, approximate), then re-rank to find the top 5 (slow, precise). This two-stage approach gets you cross-encoder accuracy at vector-search speed.
Re-ranking adds 50-200ms of latency and an additional API call per query. For interactive applications where answer quality matters more than raw speed, the trade-off is worth it. For high-throughput batch processing, skip it.
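The two-stage shape looks like this. The token-overlap scorer below is purely a placeholder so the sketch runs standalone; in production that function is a call to a cross-encoder such as Cohere Rerank, which reads the query and chunk together:

```python
def score_pair(query: str, doc: str) -> float:
    # Placeholder relevance score (token overlap). A real re-ranker is a
    # cross-encoder model call, not string matching.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Stage two: re-sort the 20-50 vector-search candidates by pair score."""
    return sorted(candidates, key=lambda c: score_pair(query, c),
                  reverse=True)[:top_n]

candidates = [                      # pretend these came from vector search
    "kubernetes networking deep dive",
    "reduce your aws bill with cost optimization",
    "a history of databases",
]
rerank("reduce aws bill", candidates, top_n=1)
# ["reduce your aws bill with cost optimization"]
```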
Production Failure Modes and Operational Lessons
Vector databases fail differently from traditional databases. The failures are subtle: instead of throwing errors, the system returns results that look plausible but are wrong. Debugging requires different tools and different intuitions.
Index Drift and Stale Embeddings
When you update your embedding model (switching from text-embedding-ada-002 to text-embedding-3-large, or upgrading from voyage-2 to voyage-3), every vector in your index becomes stale. The new model produces vectors in a different space. Old embeddings and new query embeddings are incompatible, even though they have the same dimensionality.
The fix is a full re-index. There is no shortcut. Partial re-indexing with mixed model versions produces an index where some vectors are close for the wrong reasons. I have seen teams spend weeks debugging "degraded relevance" before discovering they had a mix of two embedding model versions in the same index.
Plan for model migrations from the start. Store the embedding model name and version as metadata on every vector. Build your indexing pipeline to support full re-indexing on demand.
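The payload shape is simple; the field names below are illustrative (adapt them to whatever metadata API your database exposes), but the principle is that a mixed-version index should be detectable at read time:

```python
from dataclasses import dataclass, field

@dataclass
class VectorRecord:
    """Every vector carries the model that produced it."""
    id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)

def make_record(doc_id, vector, source,
                model="voyage-3-lite", model_version="1"):
    return VectorRecord(
        id=doc_id,
        vector=vector,
        metadata={
            "source": source,
            "embedding_model": model,            # enables safe migrations
            "embedding_model_version": model_version,
        },
    )

def check_index_consistency(records) -> bool:
    """Fail fast if two embedding model versions are mixed in one index."""
    models = {(r.metadata["embedding_model"],
               r.metadata["embedding_model_version"]) for r in records}
    return len(models) <= 1
```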
The Silent Relevance Degradation Problem
Traditional database failures are loud: queries fail, connections time out, transactions abort. Vector search degrades silently. As your corpus grows, as data distributions shift, as user query patterns change, retrieval quality erodes without any metric alerting you.
Monitor these signals:
- Average similarity score of top-k results. A downward trend indicates growing irrelevance.
- Click-through rate on retrieved documents (if applicable). Users voting with their clicks is your most reliable relevance signal.
- LLM hallucination rate in RAG answers. When the retrieved context is poor, the LLM hallucinates more. Track the ratio of answers grounded in retrieved context vs. answers the LLM generates from its own knowledge.
- Query-to-result embedding distance distribution. Plot the distribution monthly. A rightward shift (increasing distances) means your index is drifting from your query distribution.
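The last signal is straightforward to compute from logged query embeddings. A minimal sketch (both matrices assumed row-normalized, so dot product equals cosine similarity); here on-distribution queries are simulated by sampling from the corpus itself:

```python
import numpy as np

def topk_similarity_stats(queries: np.ndarray, index: np.ndarray,
                          k: int = 5) -> float:
    """Mean top-k cosine similarity for a query batch. Log this on a
    schedule: a downward trend is the silent-degradation signal."""
    sims = queries @ index.T                 # (n_queries, n_index)
    topk = np.sort(sims, axis=1)[:, -k:]     # best k scores per query
    return float(topk.mean())

rng = np.random.default_rng(3)
index = rng.standard_normal((1000, 256)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)

on_distribution = index[:50]                 # queries that match the corpus
off_distribution = rng.standard_normal((50, 256)).astype(np.float32)
off_distribution /= np.linalg.norm(off_distribution, axis=1, keepdims=True)

healthy = topk_similarity_stats(on_distribution, index)
drifted = topk_similarity_stats(off_distribution, index)
# `healthy` is noticeably higher; a sustained shift toward `drifted`
# means your queries no longer match your index.
```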
Scaling Inflection Points
Vector databases hit performance cliffs at specific scale thresholds:
- 10M vectors: pgvector starts showing latency spikes under concurrent load. Purpose-built databases handle this comfortably.
- 50M vectors: Single-node deployments hit memory limits with float32 HNSW indexes. You need either PQ compression or a distributed architecture.
- 500M vectors: Distributed architectures are mandatory. Shard management, cross-shard queries, and rebalancing become operational concerns.
- 1B+ vectors: DiskANN or IVF-PQ on specialized hardware. Only Milvus/Zilliz, Pinecone, and Qdrant credibly operate at this scale.
The common mistake: choosing infrastructure for your current scale without considering the next 10x growth. A pgvector instance works great at 5M vectors. At 50M, you face a migration to a different database under production pressure.
Cost Curve Management
Vector database costs grow with three axes: storage (GB of vectors), compute (QPS and indexing throughput), and memory (working set that must fit in RAM).
At small scale (under 10M vectors), managed services cost $25-100/month and are clearly worth it. At medium scale (10-100M vectors), costs range from $200-2,000/month depending on provider and query volume. At large scale (100M+), the managed vs. self-hosted decision becomes significant:
- Pinecone at 1B vectors, 100 QPS: approximately $3,500/month
- Weaviate Cloud at 1B vectors, 100 QPS: approximately $2,200/month
- Self-hosted Qdrant or Milvus at 1B vectors, 100 QPS: approximately $800/month plus operational costs
The break-even point for self-hosting typically lands around 60-80 million queries per month. Below that, the operational cost of running your own cluster exceeds the managed service premium.
Key Patterns and Recommendations
After building several vector search systems at different scales, these are the patterns that consistently matter:
- Start with brute-force if you can. Under 100K vectors, flat search is fast, exact, and has zero tuning parameters. Do not over-engineer.
- Match your embedding model to your use case. For English text RAG, voyage-3-large or text-embedding-3-large. For cost-sensitive large corpora, voyage-3-lite at 512 dimensions. For multilingual, Cohere embed-v4.
- Chunk at semantic boundaries. Heading-based recursive chunking with document title prepended to each chunk. 400-800 tokens per chunk. 10-15% overlap.
- Use hybrid search for technical content. Dense vector search alone misses exact terminology. BM25 + vector with reciprocal rank fusion catches both.
- Add re-ranking when answer quality matters. Retrieve 20-50 candidates, re-rank to top 5. The latency cost (100-200ms) is worth the precision gain for interactive applications.
- Monitor similarity scores, not just uptime. Silent relevance degradation is the primary failure mode. Set alerts on average top-k similarity score trends.
- Plan for embedding model migrations. Store model version metadata on every vector. Build re-indexing capability from day one.
- Choose your database for your next 10x, not your current scale. Migrating vector databases under production pressure is significantly harder than migrating relational databases, because you are also migrating your indexing pipeline, your embedding model configuration, and your metadata schema.
- Use cosine similarity unless you have a specific reason not to. It is the default metric for text embedding models, it is invariant to document length, and it is what most models are trained with.
- Keep your chunks self-contained. A chunk that requires context from adjacent chunks to be understood will produce poor retrieval results. Prepend section headings and document titles. Include enough context in each chunk to stand alone.
Additional Resources
- HNSW Algorithm Deep Dive (Pinecone)
- Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (Original Paper)
- Vector Database Benchmarks (Qdrant)
- Chunking Strategies for RAG (Weaviate)
- Distance Metrics in Vector Search (Weaviate)
- Voyage-3-Large Announcement (Voyage AI)
- The Ultimate RAG Blueprint 2025/2026 (LangWatch)
- Vector Similarity Explained (Pinecone)
- Retrieval-Augmented Generation: A Comprehensive Survey (arXiv)
- VectorDBBench: Open-Source Benchmark Suite (GitHub)
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

