About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
I built my first vector search system with a flat numpy array and brute-force cosine similarity. Three hundred fifty chunks, 1024 dimensions, under 2MB. Search completed in microseconds. That works fine for a few hundred documents. It stops working when you hit millions of vectors, need sub-10ms latency at thousands of queries per second, and your index no longer fits in memory on a single node. That is where vector databases earn their place: they solve the hard problem of approximate nearest neighbor search at scale, and they form the retrieval backbone of every serious RAG (Retrieval-Augmented Generation) system in production today.
This article is not a getting-started tutorial. It is an architecture reference covering how vector search actually works under the hood: the indexing algorithms, the distance metrics, the embedding models, the database landscape, and the RAG pipeline patterns that tie everything together. If you are evaluating vector databases for a production system or debugging why your RAG pipeline returns irrelevant results, this is the reference you want.

What Vector Databases Actually Do
A vector database stores, indexes, and searches high-dimensional numerical representations of data. Every piece of content (a document, an image, a code snippet, an audio clip) gets converted into a dense vector through an embedding model. That vector captures the semantic meaning of the content in a fixed-length array of floating-point numbers, typically 256 to 3072 dimensions.
The core operation is similarity search: given a query vector, find the k vectors in the database that are closest to it according to some distance metric. This is fundamentally different from what traditional databases do.
The Embedding Representation Problem
Traditional databases search for exact matches or pattern matches. You query for WHERE status = 'active' or WHERE title LIKE '%architecture%'. These operations work on discrete, structured data. They answer the question: "does this record match my criteria?"
Vector search answers a different question: "what is most similar to this?" A user searching for "how do I reduce my AWS bill" should find documents about cost optimization, reserved instances, savings plans, and right-sizing, even if none of those documents contain the exact phrase "reduce my AWS bill." Embedding models encode that semantic relationship into vector space. Documents about similar topics cluster together; unrelated documents land far apart.
Why B-Trees and Hash Indexes Fail
Traditional database indexes exploit the structure of their data types. B-trees work because numbers and strings have a natural ordering. Hash indexes work because you can compute a deterministic hash for exact lookups. Neither property holds in high-dimensional vector space.
| Property | B-Tree / Hash Index | Vector Index |
|---|---|---|
| Data type | Scalars, strings | Dense float arrays (256-3072 dims) |
| Query type | Exact match, range scan | Nearest neighbor similarity |
| Ordering | Natural total order | No meaningful total order |
| Dimensionality | 1-dimensional keys | Hundreds to thousands of dimensions |
| Curse of dimensionality | Not applicable | Dominates performance above ~20 dims |
| Result guarantee | Exact | Approximate (with recall guarantees) |
The curse of dimensionality is the central challenge. In low dimensions, you can partition space efficiently with tree structures (kd-trees work well up to about 20 dimensions). Above that threshold, the volume of space grows so fast that spatial partitioning loses its advantage. Every partition contains too many points, or you need to search too many partitions to maintain recall. The entire field of approximate nearest neighbor (ANN) search exists to solve this problem.
Distance Metrics: Choosing the Right Similarity Measure
Three distance metrics dominate vector search. Each measures "closeness" differently, and the choice directly affects retrieval quality.
| Metric | Formula | Range | Best For | Key Property |
|---|---|---|---|---|
| Cosine Similarity | dot(A,B) / (norm(A) * norm(B)) | [-1, 1] | Text similarity, semantic search | Ignores vector magnitude |
| Dot Product | sum(A[i] * B[i]) | (-inf, inf) | Recommendation systems, MaxSim | Considers both direction and magnitude |
| Euclidean (L2) | sqrt(sum((A[i]-B[i])^2)) | [0, inf) | Clustering, anomaly detection | Measures absolute spatial distance |
Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their lengths entirely. Two documents about AWS Lambda will have high cosine similarity regardless of whether one is 500 words and the other is 5,000 words, because the embedding model encodes topic similarity into direction, not magnitude.
This is the default choice for text search and the metric most embedding models are trained with. If you are building a RAG system over documents, start here.
Dot Product
Dot product combines directional similarity with magnitude. If your embedding model produces vectors where magnitude carries meaning (user engagement scores, document importance weights), dot product captures both signals. When vectors are L2-normalized (which most text embedding models do by default), dot product and cosine similarity produce identical rankings.
Euclidean Distance
Euclidean distance measures straight-line distance in vector space. It works well for spatial data and clustering applications where absolute position matters. For text similarity, it performs worse than cosine in most benchmarks because it penalizes magnitude differences that carry no semantic meaning.
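The three metrics, and the normalization identity noted above, fit in a few lines of numpy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle only; vector magnitude is divided out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    # Direction and magnitude together.
    return float(np.dot(a, b))

def euclidean_distance(a, b):
    # Straight-line distance; lower means closer.
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(42)
a = rng.standard_normal(1024)
b = rng.standard_normal(1024)

# After L2 normalization, dot product equals cosine similarity,
# which is why normalized embedding models can use either interchangeably.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
assert np.isclose(dot_product(a_n, b_n), cosine_similarity(a, b))
```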
Matching Your Metric to Your Embedding Model
The single most important rule: use the distance metric your embedding model was trained with. OpenAI's text-embedding-3 models were trained with cosine similarity. Voyage AI's models support cosine by default. Using Euclidean distance with a model trained on cosine loss will produce subtly wrong rankings that are painful to debug because the results look plausible but are not optimal.
Every major embedding provider documents the intended metric. Check it. Match it. Do not guess.
Indexing Algorithms: How Vector Search Gets Fast
Brute-force search (compare the query against every vector in the database) gives perfect results but scales linearly. At 1 million vectors with 1024 dimensions, a single query requires 1 million dot products of 1024-element arrays. A modern CPU handles this in about 100ms. At 100 million vectors, that becomes 10 seconds per query. Indexing algorithms trade a small amount of recall for dramatic speedups.
| Algorithm | Search Complexity | Build Time | Memory Overhead | Supports Updates | Best Scale |
|---|---|---|---|---|---|
| Flat (brute-force) | O(n * d) | O(n) | None | Yes | < 100K vectors |
| IVF | O(nprobe * n/nlist * d) | O(n * d * iters) | Centroids only | Rebuild required | 1M-100M |
| HNSW | O(log n * d) | O(n * log n * d) | Graph edges (significant) | Yes | 1M-100M |
| Product Quantization | O(n * m) | O(n * d * iters) | Compressed vectors | Rebuild required | 100M+ |
| IVF-PQ | O(nprobe * n/nlist * m) | O(n * d * iters) | Centroids + codebooks | Rebuild required | 100M+ |
Brute-Force (Flat) Search
Flat search compares the query against every stored vector. Perfect recall. Linear time. No training step. This is the right choice when your dataset is small enough (under 100,000 vectors) that linear scan completes within your latency budget. I used it for my CMS search index with 350 chunks and it works perfectly. Do not let "approximate nearest neighbor" hype push you into complexity you do not need.
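The entire algorithm is a matrix-vector product and a sort. A minimal sketch over row-normalized vectors, where cosine similarity reduces to a dot product:

```python
import numpy as np

def flat_search(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by cosine similarity over an (n, d) matrix of
    L2-normalized vectors. O(n * d) per query, perfect recall."""
    scores = index @ query              # one dot product per stored vector
    return np.argsort(-scores)[:k]      # indices of the k highest scores

rng = np.random.default_rng(0)
vecs = rng.standard_normal((350, 1024)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)     # normalize rows

# A query that is a slightly perturbed copy of vector 42
query = vecs[42] + 0.01 * rng.standard_normal(1024).astype(np.float32)
query /= np.linalg.norm(query)

top5 = flat_search(vecs, query, k=5)    # vector 42 ranks first
```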
IVF (Inverted File Index)
IVF partitions the vector space into clusters using k-means. During index construction, it groups all vectors by their nearest cluster centroid. During search, it identifies the closest centroids to the query vector, then searches only within those clusters.
The key parameter is n_probe: how many clusters to search. Low n_probe (1-5) gives fast but potentially inaccurate results. High n_probe (50-100) approaches brute-force recall at reduced speed. The number of clusters (n_list) controls the granularity; a common heuristic is sqrt(n) clusters for n vectors.
IVF's weakness: it requires a training phase on representative data, and adding new vectors may require rebuilding the index if the data distribution shifts significantly. For static or slowly-changing corpora, this trade-off is acceptable.
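The mechanics can be sketched in numpy with plain Lloyd k-means. This is an illustration only; it skips everything a production IVF implementation handles (training sample sizing, residual encoding, empty-cluster repair):

```python
import numpy as np

def build_ivf(vectors, n_list, iters=10, seed=0):
    """Train k-means centroids, then bucket every vector by nearest centroid."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_list, replace=False)]
    for _ in range(iters):                      # plain Lloyd iterations
        dists = np.linalg.norm(vectors[:, None] - centroids[None], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_list):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    lists = {c: np.flatnonzero(assign == c) for c in range(n_list)}
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, k, n_probe):
    """Scan only the n_probe clusters whose centroids are closest."""
    order = np.linalg.norm(centroids - query, axis=1).argsort()[:n_probe]
    cand = np.concatenate([lists[c] for c in order])
    d = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[d.argsort()[:k]]

rng = np.random.default_rng(1)
vecs = rng.standard_normal((2000, 32))
centroids, lists = build_ivf(vecs, n_list=45, iters=5)   # ~sqrt(n) clusters
hits = ivf_search(vecs, centroids, lists, vecs[7], k=3, n_probe=45)
# With n_probe == n_list this degrades to exact search, so vector 7 is first.
```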
HNSW (Hierarchical Navigable Small World)
HNSW is the dominant algorithm in production vector databases today. Pinecone, Qdrant, Weaviate, Milvus, pgvector (as of version 0.5.0), and OpenSearch all support HNSW as their primary or default index type. The next section covers it in detail.
Product Quantization
Product quantization (PQ) compresses vectors to dramatically reduce memory consumption. It splits each vector into m sub-vectors, then quantizes each sub-vector to its nearest centroid in a learned codebook. A 1024-dimensional float32 vector (4KB) can compress to 128 bytes with 8 sub-vectors and 256 centroids per codebook.
The compression is lossy. Retrieval accuracy decreases as compression increases. PQ works best as a second stage: use IVF or HNSW to narrow candidates, then use PQ for the final distance computations. This is the IVF-PQ and HNSW-PQ combination that large-scale systems rely on.
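A toy PQ encoder makes the mechanics concrete. This sketch trains one small k-means codebook per sub-vector; real systems (FAISS's IVF-PQ, for instance) add distance tables, residual encoding, and far better training:

```python
import numpy as np

def pq_train(vectors, m, n_centroids=256, iters=8, seed=0):
    """One k-means codebook per sub-vector; returns shape (m, n_centroids, d//m)."""
    rng = np.random.default_rng(seed)
    n, d = vectors.shape
    sub = vectors.reshape(n, m, d // m)
    books = []
    for j in range(m):
        x = sub[:, j]
        cb = x[rng.choice(n, n_centroids, replace=False)]
        for _ in range(iters):                  # plain Lloyd iterations
            assign = np.linalg.norm(x[:, None] - cb[None], axis=2).argmin(1)
            for c in range(n_centroids):
                if (assign == c).any():
                    cb[c] = x[assign == c].mean(0)
        books.append(cb)
    return np.stack(books)

def pq_encode(vectors, books):
    """Store each sub-vector as the id of its nearest centroid.
    uint8 works because there are at most 256 centroids per codebook."""
    n, m = len(vectors), len(books)
    sub = vectors.reshape(n, m, -1)
    codes = np.empty((n, m), dtype=np.uint8)
    for j in range(m):
        codes[:, j] = np.linalg.norm(
            sub[:, j][:, None] - books[j][None], axis=2).argmin(1)
    return codes

vecs = np.random.default_rng(0).standard_normal((512, 64)).astype(np.float32)
books = pq_train(vecs, m=8)          # 8 sub-vectors of 8 dims each
codes = pq_encode(vecs, books)       # (512, 8) uint8: 8 bytes per vector
# 64 floats (256 bytes) compress to 8 bytes per vector: 32x.
```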
Combining Approaches: IVF-PQ and HNSW-PQ
At 100 million vectors and above, memory becomes the binding constraint. A 1024-dimensional float32 index at 100M vectors requires roughly 400GB of RAM for flat storage. IVF-PQ reduces that to under 20GB by compressing vectors within each cluster. HNSW-PQ applies the same compression to the HNSW graph's stored vectors while maintaining the graph structure for navigation.
These combinations allow billion-vector indexes on commodity hardware. Milvus and FAISS both support IVF-PQ natively, and it is the standard approach for indexes that exceed available RAM.
HNSW Deep Dive: The Algorithm That Powers Most Production Systems
HNSW combines two ideas: navigable small-world graphs (where any node can reach any other node in a small number of hops) and skip lists (where hierarchical layers allow logarithmic search). The result is an algorithm with O(log n) search complexity, no training phase, and support for incremental updates.
Layer Construction and the Probability Function
HNSW builds a multi-layer graph. Each layer is a subset of the layer below it, with the bottom layer (layer 0) containing all vectors. Higher layers contain exponentially fewer vectors with longer-range connections that enable fast coarse navigation.
When inserting a vector, the algorithm assigns it a maximum layer using an exponential probability function:
layer = floor(-ln(uniform_random(0,1)) * (1/ln(M)))
Where M is the number of connections per node. This produces an exponential distribution: with M=32, approximately 96.9% of vectors exist only in layer 0, about 3% reach layer 1, roughly 0.1% reach layer 2, and so on. A 1-million-vector index typically has 4-5 layers, with a single entry point at the top.
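You can verify the distribution by sampling the assignment function directly. A quick simulation (M=32, one million inserts):

```python
import math
import random

def assign_layer(M: int, rng: random.Random) -> int:
    # mL = 1/ln(M), as in the HNSW paper; 1 - random() keeps u in (0, 1].
    u = 1.0 - rng.random()
    return math.floor(-math.log(u) / math.log(M))

rng = random.Random(1)
M = 32
layers = [assign_layer(M, rng) for _ in range(1_000_000)]

frac_layer0 = layers.count(0) / len(layers)  # expected 1 - 1/M, about 96.9%
num_layers = max(layers) + 1                 # a handful of layers for 1M inserts
```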
Search: Greedy Routing Through Layers
Search starts at the entry point in the top layer. The algorithm greedily moves to the neighbor closest to the query vector, repeating until no closer neighbor exists (a local minimum). It then drops to the next layer, using the same node as the starting point, and repeats. In the bottom layer, it expands the search to maintain a dynamic candidate list of size efSearch, returning the top-k results.
This hierarchical approach is analogous to zooming in on a map. The top layers provide coarse geographic navigation (continent to country), middle layers narrow the region (country to city), and the bottom layer finds the exact neighborhood.
Tuning Parameters: M, efConstruction, efSearch
Three parameters control the recall/speed/memory trade-off:
| Parameter | What It Controls | Low Value | High Value | Typical Range |
|---|---|---|---|---|
| M | Connections per node | Faster build, less memory, lower recall | Slower build, more memory, higher recall | 12-48 |
| efConstruction | Candidate pool during build | Faster build, lower recall | Slower build, higher recall | 100-500 |
| efSearch | Candidate pool during query | Faster search, lower recall | Slower search, higher recall | 50-300 |
M directly controls memory usage. At M=16, each vector stores up to 16 bidirectional edges in the upper layers and up to 32 in layer 0 (the M_max0 = 2*M convention). Doubling M roughly doubles the graph's memory footprint. For text search with 1024-dimensional embeddings, M=16 provides a good balance. For high-recall requirements, increase to 32 or 48.
efConstruction affects build-time quality. Higher values produce better-connected graphs but take longer to build. For a 1M-vector index, efConstruction=200 typically builds in 2-5 minutes on an 8-core machine; efConstruction=500 may take 10-15 minutes. Set it once and forget it.
efSearch is the only parameter you tune at query time. It directly controls the recall/latency trade-off. Start at efSearch=100 and increase until your recall target is met. For most RAG applications, 95%+ recall at efSearch=128-256 is achievable with sub-5ms latency on million-scale indexes.
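Tuning efSearch against a recall target requires ground truth: run exact brute-force search on a sample of queries, then measure how much of the true top-k the ANN index returns. The metric itself is one line:

```python
def recall_at_k(exact_ids, ann_ids) -> float:
    """Fraction of the true top-k neighbors the ANN search also returned."""
    exact = set(map(int, exact_ids))
    return len(exact & set(map(int, ann_ids))) / len(exact)

# The exact top-5 for a query vs. what an ANN index returned:
exact = [4, 9, 1, 7, 3]
approx = [4, 1, 7, 8, 3]        # missed doc 9, surfaced doc 8 instead
recall_at_k(exact, approx)      # 0.8
```

Average this over a few hundred sampled queries at each efSearch setting and stop raising efSearch once the curve flattens at your target.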
Memory and Performance Trade-offs
HNSW's primary cost is memory. The graph edges alone, independent of the stored vectors, consume significant RAM:
For 1M vectors with M=16: graph overhead is approximately 1M * 16 * 2 * 4 bytes (int32 node IDs, 2*M edges at layer 0) = 128MB. Add the vectors themselves (1M * 1024 * 4 bytes for float32 = 4GB), and you need about 4.1GB total. At 100M vectors, that scales to 410GB, which is why PQ compression matters at scale.
HNSW's advantage over IVF: it supports incremental insertion and deletion without rebuilding the entire index. Insert a new document, embed it, add the vector to the graph. Delete a document, mark the node as deleted. The graph stays functional. IVF requires periodic retraining of centroids when the data distribution changes, which means scheduled downtime or a shadow index swap.
The Vector Database Landscape
The market has stratified into three categories, each serving different operational needs.
Purpose-Built Vector Databases
| Database | Language | Hosting | Index Types | Max Scale | Hybrid Search | Notable Strength |
|---|---|---|---|---|---|---|
| Pinecone | Rust/Python | Managed only | Proprietary | Billions | Yes (sparse-dense) | Zero-ops managed service |
| Qdrant | Rust | Self-hosted + Cloud | HNSW | Billions | Yes (payload filtering) | Filtering performance |
| Weaviate | Go | Self-hosted + Cloud | HNSW | Billions | Yes (BM25 + vector) | Built-in hybrid search |
| Milvus | Go/C++ | Self-hosted + Zilliz Cloud | IVF, HNSW, DiskANN | Billions | Yes | Most index type options |
| Chroma | Python | Self-hosted + Cloud | HNSW | Millions | No | Developer experience |
Pinecone removes all operational burden. No index tuning, no infrastructure management, no capacity planning. You send vectors; it handles everything. The cost is vendor lock-in and a pricing model that can surprise you at scale: $0.33/GB/month storage plus per-operation charges that add up with high query volumes.
Qdrant (written in Rust) excels at filtered vector search. When your query combines similarity search with metadata predicates ("find similar articles tagged with 'AWS' published after 2025"), Qdrant's architecture handles both in a single pass rather than filtering after retrieval. This matters operationally because post-retrieval filtering can discard most of your top-k results, leaving you with poor candidates.
Milvus offers the broadest set of indexing algorithms (IVF, HNSW, DiskANN, GPU-accelerated indexes) and scales to billions of vectors on Kubernetes. It is the heaviest operationally but the most flexible for workloads that need fine-grained index tuning.
Weaviate bakes hybrid search (BM25 keyword + dense vector) directly into the query API. No external search engine required. For RAG applications that benefit from combining keyword precision with semantic recall, this reduces architectural complexity.
Database Extensions
pgvector adds vector search to PostgreSQL. If your application already runs on Postgres and your vector count stays under 10-50 million, pgvector avoids introducing a new database into your stack. It supports both IVF and HNSW indexes. The trade-off: it shares resources with your transactional workload, and at scale (50M+ vectors), purpose-built systems outperform it by 5-10x on throughput.
OpenSearch and Elasticsearch both support vector search through k-NN plugins (see my AWS OpenSearch Service: An Architecture Deep-Dive for OpenSearch internals). These make sense when you already run a search cluster and want to add semantic search alongside full-text search. They are not competitive on pure vector search performance at high scale, but the operational simplicity of one fewer system to manage has real value.
In-Memory Libraries
FAISS (Facebook AI Similarity Search) is a library, not a database. No persistence layer, no query API, no distributed architecture. You load vectors into memory, build an index, and search. It powers the vector search inside several databases (Milvus uses FAISS indexes internally). Use FAISS directly when you need maximum control, your index fits in memory, and you are willing to build your own persistence and serving layer.
ScaNN (Google) and Annoy (Spotify) fill similar niches. ScaNN excels on quantized searches; Annoy provides memory-mapped indexes that load fast from disk.
When to Use What
- Under ~100K vectors: flat brute-force search. No index, no tuning, exact results.
- Already on Postgres, under ~10-50M vectors: pgvector. One fewer system to run.
- Already running OpenSearch or Elasticsearch: the k-NN plugin, if pure vector performance is not critical.
- Heavy metadata filtering alongside similarity: Qdrant.
- Hybrid keyword + vector search out of the box: Weaviate.
- Zero operational burden and budget for it: Pinecone.
- Billions of vectors with Kubernetes expertise in house: Milvus.
- Full control over an in-memory index with your own persistence and serving layer: FAISS.
Embedding Models: The Foundation of Your Vector Pipeline
The embedding model determines the quality ceiling of your entire search system. A perfect index with a mediocre embedding model will return mediocre results. Choose the model first, then choose the database.
| Model | Provider | Dimensions | Context Window | MTEB Score | Pricing (per 1M tokens) | Key Advantage |
|---|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 (or 256-3072) | 8,191 tokens | 64.6 | $0.13 | Flexible dimensionality |
| voyage-3-large | Voyage AI | 1024 (default) | 32,000 tokens | Top across 8 domains | $0.18 | 32K context, Matryoshka support |
| embed-v4 | Cohere | 1024 | 512 tokens | 65.2 | $0.10 | Multilingual strength |
| voyage-3-lite | Voyage AI | 512 | 32,000 tokens | Competitive | $0.02 | Cost-efficient for large corpora |
| mxbai-embed-large | Mixedbread | 1024 | 512 tokens | 64.7 | Free (open-source) | No API dependency |
Dimensionality and Storage Trade-offs
Higher dimensions capture more semantic nuance but cost more to store and search. The relationship is not linear in value: going from 256 to 1024 dimensions produces a measurable quality improvement; going from 1024 to 3072 dimensions produces a smaller improvement at 3x the storage cost.
For most RAG applications, 1024 dimensions hits the sweet spot. (I used voyage-3-lite at 1024 dimensions for the semantic search system I built into this site's CMS; see AWS Lambda Container Images: An Architecture Deep-Dive for the Lambda container architecture behind it.) At 1M vectors with float32 storage: 1024 dims = 4GB, 3072 dims = 12GB. The quality difference rarely justifies the 3x storage and memory cost.
Matryoshka Embeddings and Quantization
Voyage AI's voyage-3-large supports Matryoshka representation learning, which means you can truncate embeddings to lower dimensions (512, 256, even 128) after generation with minimal quality loss. The first N dimensions of a Matryoshka embedding carry the most information; you are not randomly discarding signal.
Combined with quantization (float32 to int8 or binary), the storage savings compound. A 512-dimensional binary embedding from voyage-3-large outperforms OpenAI's full 3072-dimensional float32 embedding while requiring 200x less storage. This is not a typo. The combination of better model architecture, Matryoshka learning, and quantization-aware training produces genuinely dramatic efficiency gains.
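The storage arithmetic is easy to verify. This sketch shows the mechanics (truncate, re-normalize, take signs, pack bits); note that the quality claims depend entirely on the model being trained for Matryoshka truncation and binary quantization, which the random vectors here are not:

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize. Only meaningful
    for models trained with Matryoshka representation learning."""
    cut = emb[..., :dims]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

def binary_quantize(emb: np.ndarray) -> np.ndarray:
    """One bit per dimension (the sign), packed eight per byte."""
    return np.packbits(emb > 0, axis=-1)

full = np.random.default_rng(0).standard_normal((1000, 3072)).astype(np.float32)
small = truncate_matryoshka(full, 512)    # 512 float32 = 2,048 bytes/vector
packed = binary_quantize(small)           # 512 bits    = 64 bytes/vector
# 12,288 bytes down to 64 bytes per vector: the ~200x figure above.
```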
RAG Architecture: From Query to Answer
RAG (Retrieval-Augmented Generation) solves a fundamental LLM limitation: language models know only what was in their training data. RAG gives them access to your private data at query time by retrieving relevant context and injecting it into the prompt.
The Core Pipeline
The RAG pipeline has two phases: offline indexing and online retrieval.
Offline indexing happens at build time or on a schedule. Documents are chunked into segments, each segment is embedded into a vector, and the vectors are stored in the database with metadata (source document, section heading, page number, tags). This runs once per document update.
Online retrieval happens at query time. The user's question is embedded with the same model (using the query input type if the model supports it), the vector database returns the top-k most similar chunks, and those chunks are assembled into an LLM prompt along with the original question. The LLM synthesizes an answer grounded in the retrieved context.
The critical constraint: the embedding model used for queries must be the same model used for indexing. Mixing models (embedding documents with OpenAI but queries with Voyage) produces vectors in incompatible spaces. The similarity scores become meaningless.
Chunking Strategies
How you split documents into chunks determines retrieval precision. Chunk too large and you dilute relevant content with irrelevant context. Chunk too small and you lose the coherence needed for the LLM to synthesize a useful answer.
| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-size | Split every N tokens | Simple, predictable | Breaks mid-sentence, ignores structure | Uniform documents |
| Recursive | Split by section, then paragraph, then sentence | Preserves document structure | Uneven chunk sizes | Structured documents (articles, docs) |
| Semantic | Embed sentences, split at semantic boundaries | Meaning-aware boundaries | Requires embedding every sentence | Varied content types |
| Heading-based | Split at H2/H3 boundaries | Preserves topical coherence | Depends on document having headings | Technical documentation |
| Sliding window | Fixed size with N% overlap | Reduces boundary artifacts | Duplicate content increases index size | Dense technical content |
For most RAG systems, recursive chunking at heading boundaries with a 400-800 token target and 10-15% overlap provides the best baseline. Research from Chroma shows recursive splitting delivers 85-90% recall at 400 tokens, while semantic chunking reaches 91-92%. That 2-3% improvement costs embedding every individual sentence during indexing, which is significant at scale.
My recommendation: start with heading-based recursive chunking. Prepend the document title and section heading to each chunk for context. Only move to semantic chunking if your recall metrics show clear deficiencies with the simpler approach.
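A minimal version of that recommendation: split at H2/H3 headings, window long sections with overlap, and prepend the title and heading to every chunk. This sketch counts words as a stand-in for tokens; a real pipeline uses the embedding model's tokenizer:

```python
import re

def chunk_markdown(doc: str, title: str, max_words: int = 400,
                   overlap: float = 0.125) -> list[str]:
    """Heading-based chunking with a sliding window inside long sections."""
    chunks = []
    for sec in re.split(r"\n(?=#{2,3} )", doc):
        lines = sec.splitlines()
        if not lines:
            continue
        if lines[0].startswith("#"):
            heading = lines[0].lstrip("# ").strip()
            words = " ".join(lines[1:]).split()
        else:
            heading, words = "", sec.split()
        step = max(1, int(max_words * (1 - overlap)))   # 12.5% overlap
        for start in range(0, max(len(words), 1), step):
            body = " ".join(words[start:start + max_words])
            if body:
                # Prepend title and heading so each chunk stands alone.
                chunks.append(f"{title} | {heading}\n{body}")
    return chunks

doc = "## Intro\n" + "alpha " * 100 + "\n## Setup\n" + "beta " * 500
chunks = chunk_markdown(doc, title="Vector DB Guide")
# One chunk for the short section, two overlapping chunks for the long one.
```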
Hybrid Search: Combining Dense and Sparse Retrieval
Dense vector search captures semantic similarity but can miss exact keyword matches. A search for "ECS task definition" might return results about "container configuration" (semantically similar) while missing a document that uses the exact phrase "ECS task definition" in a table.
Hybrid search combines BM25 keyword search (sparse retrieval) with vector similarity search (dense retrieval), then merges the results. This catches both semantic matches and exact keyword matches. In practice, hybrid search improves recall by 5-15% over pure vector search for technical documentation where specific terminology matters.
Weaviate and OpenSearch support hybrid search natively. For other databases, you can implement it by running both searches in parallel and applying reciprocal rank fusion (RRF) to merge results.
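RRF itself is small enough to implement inline. Each ranked list contributes 1/(k + rank) per document, with k=60 as the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; documents high in multiple lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["d3", "d1", "d7", "d2"]   # sparse (keyword) ranking
vector_hits = ["d1", "d5", "d3", "d9"]   # dense (embedding) ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# d1 and d3 rank high in both lists, so they fuse to the top.
```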
Re-ranking for Precision
Vector search retrieves candidates; re-ranking sorts them by actual relevance. A re-ranker (like Cohere Rerank or a cross-encoder model) takes the query and each candidate chunk, processes them together (not independently like bi-encoder embeddings), and produces a fine-grained relevance score.
The pattern: retrieve 20-50 candidates with vector search (fast, approximate), then re-rank to find the top 5 (slow, precise). This two-stage approach gets you cross-encoder accuracy at vector-search speed.
Re-ranking adds 50-200ms of latency and an additional API call per query. For interactive applications where answer quality matters more than raw speed, the trade-off is worth it. For high-throughput batch processing, skip it.
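The two-stage shape looks like this. The token-overlap scorer below is purely a placeholder so the sketch runs standalone; in production that function is a call to a cross-encoder such as Cohere Rerank, which reads the query and chunk together:

```python
def score_pair(query: str, doc: str) -> float:
    # Placeholder relevance score (token overlap). A real re-ranker is a
    # cross-encoder model call, not string matching.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Stage two: re-sort the 20-50 vector-search candidates by pair score."""
    return sorted(candidates, key=lambda c: score_pair(query, c),
                  reverse=True)[:top_n]

candidates = [                      # pretend these came from vector search
    "kubernetes networking deep dive",
    "reduce your aws bill with cost optimization",
    "a history of databases",
]
rerank("reduce aws bill", candidates, top_n=1)
# ["reduce your aws bill with cost optimization"]
```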
Production Failure Modes and Operational Lessons
Vector databases fail differently from traditional databases. The failures are subtle: instead of throwing errors, the system returns results that look plausible but are wrong. Debugging requires different tools and different intuitions.
Index Drift and Stale Embeddings
When you update your embedding model (switching from text-embedding-ada-002 to text-embedding-3-large, or upgrading from voyage-2 to voyage-3), every vector in your index becomes stale. The new model produces vectors in a different space. Old embeddings and new query embeddings are incompatible, even though they have the same dimensionality.
The fix is a full re-index. There is no shortcut. Partial re-indexing with mixed model versions produces an index where some vectors are close for the wrong reasons. I have seen teams spend weeks debugging "degraded relevance" before discovering they had a mix of two embedding model versions in the same index.
Plan for model migrations from the start. Store the embedding model name and version as metadata on every vector. Build your indexing pipeline to support full re-indexing on demand.
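The payload shape is simple; the field names below are illustrative (adapt them to whatever metadata API your database exposes), but the principle is that a mixed-version index should be detectable at read time:

```python
from dataclasses import dataclass, field

@dataclass
class VectorRecord:
    """Every vector carries the model that produced it."""
    id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)

def make_record(doc_id, vector, source,
                model="voyage-3-lite", model_version="1"):
    return VectorRecord(
        id=doc_id,
        vector=vector,
        metadata={
            "source": source,
            "embedding_model": model,            # enables safe migrations
            "embedding_model_version": model_version,
        },
    )

def check_index_consistency(records) -> bool:
    """Fail fast if two embedding model versions are mixed in one index."""
    models = {(r.metadata["embedding_model"],
               r.metadata["embedding_model_version"]) for r in records}
    return len(models) <= 1
```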
The Silent Relevance Degradation Problem
Traditional database failures are loud: queries fail, connections time out, transactions abort. Vector search degrades silently. As your corpus grows, as data distributions shift, as user query patterns change, retrieval quality erodes without any metric alerting you.
Monitor these signals:
- Average similarity score of top-k results. A downward trend indicates growing irrelevance.
- Click-through rate on retrieved documents (if applicable). Users voting with their clicks is your most reliable relevance signal.
- LLM hallucination rate in RAG answers. When the retrieved context is poor, the LLM hallucinates more. Track the ratio of answers grounded in retrieved context vs. answers the LLM generates from its own knowledge.
- Query-to-result embedding distance distribution. Plot the distribution monthly. A rightward shift (increasing distances) means your index is drifting from your query distribution.
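The last signal is straightforward to compute from logged query embeddings. A minimal sketch (both matrices assumed row-normalized, so dot product equals cosine similarity); here on-distribution queries are simulated by sampling from the corpus itself:

```python
import numpy as np

def topk_similarity_stats(queries: np.ndarray, index: np.ndarray,
                          k: int = 5) -> float:
    """Mean top-k cosine similarity for a query batch. Log this on a
    schedule: a downward trend is the silent-degradation signal."""
    sims = queries @ index.T                 # (n_queries, n_index)
    topk = np.sort(sims, axis=1)[:, -k:]     # best k scores per query
    return float(topk.mean())

rng = np.random.default_rng(3)
index = rng.standard_normal((1000, 256)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)

on_distribution = index[:50]                 # queries that match the corpus
off_distribution = rng.standard_normal((50, 256)).astype(np.float32)
off_distribution /= np.linalg.norm(off_distribution, axis=1, keepdims=True)

healthy = topk_similarity_stats(on_distribution, index)
drifted = topk_similarity_stats(off_distribution, index)
# `healthy` is noticeably higher; a sustained shift toward `drifted`
# means your queries no longer match your index.
```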
Scaling Inflection Points
Vector databases hit performance cliffs at specific scale thresholds:
- 10M vectors: pgvector starts showing latency spikes under concurrent load. Purpose-built databases handle this comfortably.
- 50M vectors: Single-node deployments hit memory limits with float32 HNSW indexes. You need either PQ compression or a distributed architecture.
- 500M vectors: Distributed architectures are mandatory. Shard management, cross-shard queries, and rebalancing become operational concerns.
- 1B+ vectors: DiskANN or IVF-PQ on specialized hardware. Only Milvus/Zilliz, Pinecone, and Qdrant credibly operate at this scale.
The common mistake: choosing infrastructure for your current scale without considering the next 10x growth. A pgvector instance works great at 5M vectors. At 50M, you face a migration to a different database under production pressure.
Cost Curve Management
Vector database costs grow with three axes: storage (GB of vectors), compute (QPS and indexing throughput), and memory (working set that must fit in RAM).
At small scale (under 10M vectors), managed services cost $25-100/month and are clearly worth it. At medium scale (10-100M vectors), costs range from $200-2,000/month depending on provider and query volume. At large scale (100M+), the managed vs. self-hosted decision becomes significant:
- Pinecone at 1B vectors, 100 QPS: approximately $3,500/month
- Weaviate Cloud at 1B vectors, 100 QPS: approximately $2,200/month
- Self-hosted Qdrant or Milvus at 1B vectors, 100 QPS: approximately $800/month plus operational costs
The break-even point for self-hosting typically lands around 60-80 million queries per month. Below that, the operational cost of running your own cluster exceeds the managed service premium.
Key Patterns and Recommendations
After building several vector search systems at different scales, these are the patterns that consistently matter:
- Start with brute-force if you can. Under 100K vectors, flat search is fast, exact, and has zero tuning parameters. Do not over-engineer.
- Match your embedding model to your use case. For English text RAG, voyage-3-large or text-embedding-3-large. For cost-sensitive large corpora, voyage-3-lite at 512 dimensions. For multilingual, Cohere embed-v4.
- Chunk at semantic boundaries. Heading-based recursive chunking with document title prepended to each chunk. 400-800 tokens per chunk. 10-15% overlap.
- Use hybrid search for technical content. Dense vector search alone misses exact terminology. BM25 + vector with reciprocal rank fusion catches both.
- Add re-ranking when answer quality matters. Retrieve 20-50 candidates, re-rank to top 5. The latency cost (100-200ms) is worth the precision gain for interactive applications.
- Monitor similarity scores, not just uptime. Silent relevance degradation is the primary failure mode. Set alerts on average top-k similarity score trends.
- Plan for embedding model migrations. Store model version metadata on every vector. Build re-indexing capability from day one.
- Choose your database for your next 10x, not your current scale. Migrating vector databases under production pressure is significantly harder than migrating relational databases, because you are also migrating your indexing pipeline, your embedding model configuration, and your metadata schema.
- Use cosine similarity unless you have a specific reason not to. It is the default metric for text embedding models, it is invariant to document length, and it is what most models are trained with.
- Keep your chunks self-contained. A chunk that requires context from adjacent chunks to be understood will produce poor retrieval results. Prepend section headings and document titles. Include enough context in each chunk to stand alone.
Additional Resources
- HNSW Algorithm Deep Dive (Pinecone)
- Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (Original Paper)
- Vector Database Benchmarks (Qdrant)
- Chunking Strategies for RAG (Weaviate)
- Distance Metrics in Vector Search (Weaviate)
- Voyage-3-Large Announcement (Voyage AI)
- The Ultimate RAG Blueprint 2025/2026 (LangWatch)
- Vector Similarity Explained (Pinecone)
- Retrieval-Augmented Generation: A Comprehensive Survey (arXiv)
- VectorDBBench: Open-Source Benchmark Suite (GitHub)
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

