The Production RAG Stack We Run Today
If you’re landing here from “RAG in production,” skip the preamble — here’s the stack we’ve converged on after shipping RAG for document Q&A, compliance analysis, knowledge bases, and content search. The detailed reasoning is below; this table is the short version.
| Layer | Our default | Why |
|---|---|---|
| Embeddings | text-embedding-3-small (1536 dim) | Best cost/quality ratio at $0.02 per million tokens |
| Chunking | Document-structure, 300–600 tokens, heading prepended | Dominates fixed-size and semantic on real-world docs |
| Vector DB | pgvector on Supabase | $25/mo handles ~500K vectors, transactional with your app data |
| Retrieval | Top-30 vector search → Cohere rerank → top 5–8 to LLM | +15–25% precision lift for ~$0.10 per 1,000 queries |
| Generation | The smallest LLM that passes your eval (see LLM selection for production) | Retrieval quality, not the model, is usually the ceiling |
| Eval | 50–100 golden Q&A pairs built before the retrieval code | Without this you cannot tell which change helped |
Four decisions determine quality: embedding model, chunking strategy, whether you rerank, and how you measure correctness. Get those four right and the rest is tuning. RAG is also increasingly the retrieval backbone inside AI agent systems, where the same four choices apply inside the agent’s search tool.
The rest of this post is the why behind every line in that table — the benchmarks, the trade-offs, the things that broke in production, and why we stopped using Pinecone for new projects.
Embedding Model Selection
This is the first decision you’ll make and one of the hardest to change later. Switching embedding models means re-embedding your entire corpus, rebuilding your index, and re-running your evaluation suite.
What We’ve Tested
| Model | Dimensions | Speed | Multilingual | Our Verdict |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Decent | Default choice. Best cost/quality ratio |
| text-embedding-3-large | 3072 | Medium | Decent | Marginal quality gain, 2× cost. Rarely worth it |
| BGE-M3 | 1024 | Slow | Excellent | Best for multilingual content |
| Cohere embed-v3 | 1024 | Fast | Good | Best when paired with Cohere reranker |
| Voyage AI voyage-2 | 1024 | Medium | Good | Strong on code and technical content |
The Decision Framework
Start with text-embedding-3-small. It handles 90% of use cases well enough, it’s cheap ($0.02 per million tokens), and it’s fast. You can always upgrade later, but in practice, we’ve rarely needed to.
Switch to BGE-M3 when: Your corpus is multilingual or your users query in multiple languages. In multilingual corpora, text-embedding-3-small’s retrieval accuracy drops significantly on non-English queries. BGE-M3 closes that gap substantially.
Switch to Cohere embed-v3 when: You’re using a reranking step (and you probably should be). Cohere’s embedding + reranking pipeline is the most coherent end-to-end system we’ve used.
The Counterintuitive Finding
Embedding model quality matters less than you think. Based on publicly available benchmarks and our own testing, the difference between the best and worst model on the same corpus is typically about 8–12% in retrieval accuracy. The difference between good and bad chunking on the same model can be 20–35%.
Chunking strategy dominates embedding model choice. Fix your chunking first.
Chunking: The Unsexy Foundation
Nobody writes blog posts about chunking. It’s not glamorous. But it’s where most RAG pipelines succeed or fail.
Strategies We’ve Tried
Fixed-size chunking (500 tokens, 100-token overlap)
The default everyone starts with. Simple, predictable, easy to implement.
When it works: Homogeneous content where every section is roughly equally important. API documentation, FAQ pages, standardised reports.
When it fails: Content with variable structure: legal documents where one clause is 50 words and another is 2,000 words. The 50-word clause gets padded with irrelevant context from surrounding chunks. The 2,000-word clause gets split mid-sentence.
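For reference, fixed-size chunking is only a few lines. A minimal sketch that operates on a pre-tokenised list (the tokeniser itself is out of scope here):

```python
def fixed_size_chunks(tokens, size=500, overlap=100):
    """Split a token list into fixed-size chunks with overlap.

    Overlap means the last `overlap` tokens of one chunk are repeated
    at the start of the next, so an idea that straddles a boundary
    survives intact in at least one chunk.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # reached the end; avoid a tiny trailing sliver
    return chunks
```

The overlap is exactly the mechanism that pads a short clause with surrounding context, which is why this strategy struggles on variable-structure documents.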
Semantic chunking
Split at topic boundaries using sentence similarity. Chunks represent coherent ideas rather than arbitrary token counts.
When it works: Long-form content with distinct sections: research papers, blog posts, narrative documents.
When it fails: Structured data, tables, lists. The semantic boundary detector doesn’t understand that a bullet-point list is one logical unit.
Implementation cost: Higher. You need an additional embedding pass to detect boundaries. Adds 30-40% to indexing time.
Document-structure chunking (our default)
Split on the document’s own structure: headings, sections, paragraphs. Respect the author’s information architecture.
def token_count(text):
    # Rough whitespace approximation; swap in tiktoken for exact counts
    return len(text.split())

def split_large_chunk(chunk, max_tokens=500):
    # Split oversized content into word windows, keeping heading/metadata
    words = chunk["content"].split()
    for i in range(0, len(words), max_tokens):
        yield {**chunk, "content": " ".join(words[i:i + max_tokens])}

def chunk_by_structure(doc):
    chunks = []
    current_chunk = {"heading": "", "content": "", "metadata": {}}
    for element in doc.elements:
        if element.type == "heading":
            if current_chunk["content"]:
                chunks.append(current_chunk)
            current_chunk = {
                "heading": element.text,
                "content": "",
                "metadata": {"level": element.level},
            }
        else:
            current_chunk["content"] += element.text + "\n"
    if current_chunk["content"]:
        chunks.append(current_chunk)  # don't drop the final section
    # Handle chunks that are too large
    for chunk in chunks:
        if token_count(chunk["content"]) > 800:
            yield from split_large_chunk(chunk, max_tokens=500)
        else:
            yield chunk
When it works: Almost everything with structure: documents, articles, manuals, policies. The document’s existing structure is usually the best chunking strategy because the author already organised information into logical units.
When it fails: Unstructured content: chat transcripts, raw notes, OCR’d documents with no formatting.
The Chunk Size Sweet Spot
Based on our experience and widely reported best practices:
- Too small (< 200 tokens): Not enough context for meaningful retrieval. The embedding captures a fragment, not an idea.
- Too large (> 1000 tokens): The embedding becomes a blurry average of multiple ideas. Retrieval precision drops.
- Sweet spot: 300-600 tokens with the heading/section title prepended.
The heading prepend trick is significant. A chunk that says “Employees are entitled to 21 days of paid leave per year” retrieves poorly for the query “what’s the leave policy?” But prepend the section heading (“Leave Policy: Employees are entitled to 21 days…”) and retrieval accuracy jumps 15-20%.
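In code, the trick is a one-line transform applied just before embedding. A sketch, assuming chunks shaped like the dicts produced by the structure-based chunker above:

```python
def embedding_input(chunk):
    """Prepend the section heading so the embedding vector carries
    topical context that the chunk body alone lacks."""
    heading = chunk.get("heading", "").strip()
    content = chunk["content"].strip()
    return f"{heading}: {content}" if heading else content
```

Embed the transformed text, but store and display the original chunk content; the heading only needs to exist in vector space.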
pgvector vs Pinecone: The Real Trade-offs
We started with Pinecone. It’s the obvious choice: purpose-built for vector search, managed infrastructure, good documentation.
After 3 projects on Pinecone, we moved to pgvector for new projects. Here’s why.
Where Pinecone Wins
Query latency at scale. At 5M+ vectors, Pinecone’s p95 latency is 15-30ms. pgvector with HNSW indexing is 40-80ms at the same scale. For real-time applications where every millisecond matters, Pinecone is faster.
Managed infrastructure. Zero ops overhead. No index tuning, no vacuum operations, no connection pooling to manage.
Metadata filtering. Pinecone’s metadata filtering is fast and doesn’t degrade performance significantly. pgvector’s filtering with WHERE clauses on large tables can be slower.
Where pgvector Wins
Cost. A Supabase Pro instance ($25/month) handles 500K vectors comfortably. Pinecone’s equivalent is $70/month on the starter plan, and scales to $200+/month quickly. For a startup running multiple RAG systems, the difference compounds.
Operational simplicity. Your data is already in Postgres. Your application already connects to Postgres. Adding vector search to an existing database is one extension and one column. No new service to manage, no new SDK to learn, no new auth system to configure.
Transactional consistency. When you update a document, you can update the embedding and the metadata in the same transaction. With Pinecone, you have two systems to keep in sync, and they will drift.
SQL-native queries. You can combine vector similarity search with traditional SQL in the same query. “Find the most relevant policy documents from the HR department uploaded in the last 6 months” is one SQL query with pgvector. With Pinecone, it’s a vector search followed by application-level filtering.
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE department = 'HR'
  AND uploaded_at > NOW() - INTERVAL '6 months'
ORDER BY embedding <=> $1
LIMIT 5;
For more on this, read our guide on Vector Databases Compared.
Our Decision Matrix
| Factor | pgvector | Pinecone |
|---|---|---|
| Vectors < 2M | ✅ | Overkill |
| Vectors > 5M | Workable | ✅ |
| Budget-conscious | ✅ | ❌ |
| Need < 20ms p95 | ❌ | ✅ |
| Already using Postgres | ✅ | Redundant |
| Multi-tenant SaaS | ✅ (row-level security) | ✅ (namespaces) |
| Team knows SQL | ✅ | Neutral |
Our default: pgvector for most projects. Pinecone when the use case specifically demands sub-20ms latency at millions of vectors.
Retrieval: The Part Everyone Gets Wrong
You’ve picked your embedding model, chunked your documents, set up your vector database. Now you run a query and… the results are mediocre.
This is where most teams start tweaking the embedding model or adjusting chunk sizes. That’s usually the wrong move. The problem is almost always in the retrieval pipeline.
The Three-Stage Retrieval Pipeline
Stage 1: Initial retrieval (vector search)
Pull the top 20-30 candidates by cosine similarity. Cast a wide net. You want high recall here, not high precision.
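Your vector database does this step for you, but it helps to remember what "top 30 by cosine similarity" actually computes. A brute-force sketch:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def top_k(query_vec, candidates, k=30):
    """candidates: list of (chunk_id, vector) pairs.
    Returns the k best-scoring candidates, highest similarity first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

An HNSW index replaces the linear scan with an approximate graph search, but the scoring function is the same.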
Stage 2: Reranking
Run the 20-30 candidates through a cross-encoder reranker (we use Cohere Rerank or a fine-tuned BGE reranker). This re-scores each candidate based on its relevance to the query with much higher accuracy than embedding similarity alone.
This single step typically improves precision by 15-25% based on published research and our own testing. It’s the highest-ROI improvement you can make to a RAG pipeline.
Stage 3: Context assembly
Take the top 5-8 reranked results and assemble them into the LLM context. Order matters: put the most relevant chunks first. Include the source metadata (document name, section, page number) so the LLM can cite its sources. The generation model you choose has a measurable effect on answer quality — for the full framework on LLM selection for production RAG systems, we’ve covered that decision separately.
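Assembly itself is mechanical. A sketch, assuming the reranker returns chunks sorted most-relevant-first and that each chunk carries `document` and `section` metadata (the field names here are illustrative, not a fixed schema):

```python
def assemble_context(reranked_chunks, max_chunks=8):
    """Build the LLM context: most relevant chunks first, each
    prefixed with its source so the model can cite it."""
    blocks = []
    for chunk in reranked_chunks[:max_chunks]:
        source = f"[{chunk['document']} / {chunk['section']}]"
        blocks.append(f"{source}\n{chunk['content']}")
    return "\n\n".join(blocks)
```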
The Reranking Tax
Reranking adds latency (100–300ms per query) and cost ($0.10 per 1,000 queries for Cohere). For most applications, this is worth it. The alternative is serving bad answers fast.
In our experience, adding reranking typically improves answer accuracy by 15–25%. The 200ms latency increase is invisible to users. The accuracy increase is not.
Evaluation: The Non-Negotiable
Here’s the lesson that cost us the most pain: build your evaluation pipeline before your RAG pipeline.
The Golden Dataset
For every RAG project, we build a dataset of 50-100 question-answer pairs before writing any retrieval code.
{
"question": "What is the company's policy on remote work?",
"expected_answer": "Employees may work remotely up to 3 days per week with manager approval.",
"source_document": "employee-handbook-2026.pdf",
"source_section": "Section 4.2: Remote Work Policy"
}
These golden pairs become the ground truth for every experiment. Change the chunking strategy? Run the eval. Switch embedding models? Run the eval. Add reranking? Run the eval.
What We Measure
| Metric | What It Tells You | Target |
|---|---|---|
| Retrieval recall@5 | Do the right chunks appear in the top 5 results? | > 85% |
| Retrieval precision@5 | What fraction of top 5 results are actually relevant? | > 60% |
| Answer accuracy | Does the LLM’s answer match the expected answer? | > 85% |
| Faithfulness | Does the answer come from the retrieved context (not hallucination)? | > 95% |
| Latency (p95) | How long does the full pipeline take? | < 3s |
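The two retrieval metrics fall out of the golden dataset directly. A sketch, where `retrieve` stands in for your pipeline's search function and each golden item records which chunk IDs count as relevant (field names are illustrative):

```python
def evaluate_retrieval(golden, retrieve, k=5):
    """Returns (recall@k, precision@k) averaged over the golden set.

    recall@k here is hit rate: the fraction of questions for which
    at least one relevant chunk appears in the top k results.
    """
    hits = 0
    precision_sum = 0.0
    for item in golden:
        top = retrieve(item["question"])[:k]
        relevant = item["relevant_ids"]
        if any(cid in relevant for cid in top):
            hits += 1
        precision_sum += sum(1 for cid in top if cid in relevant) / k
    n = len(golden)
    return hits / n, precision_sum / n
```

Run this after every pipeline change; the numbers only mean something relative to the previous run on the same golden set.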
The Accuracy Ceiling
Every RAG system has an accuracy ceiling determined by the retrieval quality. If your retrieval recall@5 is 70%, your answer accuracy will never exceed ~70% no matter how good your LLM is. The LLM can’t answer correctly from the wrong context.
This is why we benchmark retrieval separately from generation. If retrieval recall is low, fix the retrieval pipeline. If retrieval is good but answers are wrong, fix the prompt or the model.
Production Monitoring
The RAG pipeline that works perfectly in development will degrade in production. Here’s what we monitor.
Query patterns: Are users asking questions you didn’t anticipate? New query patterns often reveal gaps in your corpus.
Retrieval scores: Track the average similarity score of retrieved chunks. If it drops over time, your corpus may have shifted or your users are asking about topics you haven’t indexed.
Answer feedback: Even simple thumbs up/down feedback creates a dataset for continuous improvement.
Latency distribution: Not just average, but p50, p95, and p99. A p95 of 4 seconds means 1 in 20 users waits 4+ seconds. That’s noticeable.
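Percentiles are cheap to compute from a window of recent request timings. A sketch using the standard library (a stand-in for whatever your monitoring stack provides):

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 from a list of per-request latencies in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```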
Cost per query: Track your embedding API, reranking API, and LLM API costs per query. We’ve seen costs range from $0.002 to $0.15 per query depending on the pipeline complexity.
FAQ
When should I use RAG vs fine-tuning?
Use RAG when your application needs to query a specific corpus of documents that changes over time, or when you need source attribution for answers. Fine-tuning is better when you need the model to adopt a particular style, follow a specific output format, or perform a task type the base model handles poorly. For most enterprise document Q&A use cases, RAG is the right starting point because the corpus is easier to update and debug than a fine-tuned model.
What vector database should I use for my first RAG system?
Start with pgvector if you are already running Postgres. It handles up to 2 million vectors without meaningful performance issues and keeps your architecture simple by avoiding a separate managed service. If you have no existing database or need sub-20ms query latency at 5 million or more vectors, Pinecone is the faster path to production.
How much data do I need to build a useful RAG system?
There is no minimum document count, but you need enough coverage that the corpus can actually answer the questions users will ask. A 50-document corpus with well-structured, high-quality content will outperform a 10,000-document corpus with noisy, poorly formatted text. Before indexing, audit your documents for completeness: if the answer is not in the corpus, no retrieval strategy will find it.
How do I know if my RAG system is accurate enough to ship?
Build a golden dataset of 50 to 100 representative question-answer pairs before writing any retrieval code, then measure retrieval recall at 5 and answer accuracy against that dataset. A retrieval recall above 85% and answer accuracy above 85% on the golden set is a reasonable bar for most production use cases. Below those thresholds, shipping creates a poor user experience that is difficult to recover from once users lose confidence in the system.
What should I budget for a production RAG system?
A typical setup using pgvector on Supabase ($25 per month), text-embedding-3-small for embeddings (under $5 per month at moderate volume), Cohere reranking ($0.10 per 1,000 queries), and GPT-4o for generation can run $100 to $500 per month for a product with a few thousand daily queries. The biggest cost variable is the generation model: switching from GPT-4o to GPT-4o-mini typically cuts generation costs by 15 to 20 times with manageable quality trade-offs for most document Q&A tasks. Always benchmark quality on your golden dataset before switching models to confirm the trade-off is acceptable.
Building a RAG system? Book a technical call and I’ll walk through how we’d architect yours, including the chunking strategy, model selection, and evaluation pipeline.