The Production RAG Stack We Run Today
If you’re landing here from “RAG in production,” skip the preamble — here’s the stack we’ve converged on after shipping RAG for document Q&A, compliance analysis, knowledge bases, and content search. The detailed reasoning is below; this table is the short version.
| Layer | Our default | Why |
|---|---|---|
| Embeddings | text-embedding-3-small (1536 dim) | Best cost/quality ratio at $0.02 per million tokens |
| Chunking | Document-structure, 300–600 tokens, heading prepended | Dominates fixed-size and semantic on real-world docs |
| Vector DB | pgvector on Supabase | $25/mo handles ~500K vectors, transactional with your app data |
| Retrieval | Top-30 vector search → Cohere rerank → top 5–8 to LLM | +15–25% precision lift for ~$0.10 per 1,000 queries |
| Generation | The smallest LLM that passes your eval (see LLM selection for production) | Retrieval quality, not the model, is usually the ceiling |
| Eval | 50–100 golden Q&A pairs built before the retrieval code | Without this you cannot tell which change helped |
Four decisions determine quality: embedding model, chunking strategy, whether you rerank, and how you measure correctness. Get those four right and the rest is tuning. RAG is also increasingly the retrieval backbone inside AI agent systems, where the same four choices apply inside the agent’s search tool.
The rest of this post is the why behind every line in that table — the benchmarks, the trade-offs, the things that broke in production, and why we stopped using Pinecone for new projects.
Embedding Model Selection
This is the first decision you’ll make and one of the hardest to change later. Switching embedding models means re-embedding your entire corpus, rebuilding your index, and re-running your evaluation suite.
What We’ve Tested
| Model | Dimensions | Speed | Multilingual | Our Verdict |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Decent | Default choice. Best cost/quality ratio |
| text-embedding-3-large | 3072 | Medium | Decent | Marginal quality gain, 2× cost. Rarely worth it |
| BGE-M3 | 1024 | Slow | Excellent | Best for multilingual content |
| Cohere embed-v3 | 1024 | Fast | Good | Best when paired with Cohere reranker |
| Voyage AI voyage-2 | 1024 | Medium | Good | Strong on code and technical content |
The Decision Framework
Start with text-embedding-3-small. It handles 90% of use cases well enough, it’s cheap ($0.02 per million tokens), and it’s fast. You can always upgrade later, but in practice, we’ve rarely needed to.
Switch to BGE-M3 when: Your corpus is multilingual or your users query in multiple languages. In multilingual corpora, text-embedding-3-small’s retrieval accuracy drops significantly on non-English queries. BGE-M3 closes that gap substantially.
Switch to Cohere embed-v3 when: You’re using a reranking step (and you probably should be). Cohere’s embedding + reranking pipeline is the most coherent end-to-end system we’ve used.
The Counterintuitive Finding
Embedding model quality matters less than you think. Based on publicly available benchmarks and our own testing, the difference between the best and worst model on the same corpus is typically about 8–12% in retrieval accuracy. The difference between good and bad chunking on the same model can be 20–35%.
Chunking strategy dominates embedding model choice. Fix your chunking first.
Chunking: The Unsexy Foundation
Nobody writes blog posts about chunking. It’s not glamorous. But it’s where most RAG pipelines succeed or fail.
Strategies We’ve Tried
Fixed-size chunking (500 tokens, 100-token overlap)
The default everyone starts with. Simple, predictable, easy to implement.
When it works: Homogeneous content where every section is roughly equally important. API documentation, FAQ pages, standardised reports.
When it fails: Content with variable structure: legal documents where one clause is 50 words and another is 2,000 words. The 50-word clause gets padded with irrelevant context from surrounding chunks. The 2,000-word clause gets split mid-sentence.
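For reference, fixed-size chunking is only a few lines. A minimal sketch that operates on a pre-tokenised list (the tokeniser itself is out of scope here):

```python
def fixed_size_chunks(tokens, size=500, overlap=100):
    """Split a token list into fixed-size chunks with overlap.

    Overlap means the last `overlap` tokens of one chunk are repeated
    at the start of the next, so an idea that straddles a boundary
    survives intact in at least one chunk.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # reached the end; avoid a tiny trailing sliver
    return chunks
```

The overlap is exactly the mechanism that pads a short clause with surrounding context, which is why this strategy struggles on variable-structure documents.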
Semantic chunking
Split at topic boundaries using sentence similarity. Chunks represent coherent ideas rather than arbitrary token counts.
When it works: Long-form content with distinct sections: research papers, blog posts, narrative documents.
When it fails: Structured data, tables, lists. The semantic boundary detector doesn’t understand that a bullet-point list is one logical unit.
Implementation cost: Higher. You need an additional embedding pass to detect boundaries. Adds 30-40% to indexing time.
Document-structure chunking (our default)
Split on the document’s own structure: headings, sections, paragraphs. Respect the author’s information architecture.
def token_count(text):
    # Rough whitespace approximation; swap in tiktoken for exact counts
    return len(text.split())

def split_large_chunk(chunk, max_tokens=500):
    # Split oversized content into word windows, keeping heading/metadata
    words = chunk["content"].split()
    for i in range(0, len(words), max_tokens):
        yield {**chunk, "content": " ".join(words[i:i + max_tokens])}

def chunk_by_structure(doc):
    chunks = []
    current_chunk = {"heading": "", "content": "", "metadata": {}}
    for element in doc.elements:
        if element.type == "heading":
            if current_chunk["content"]:
                chunks.append(current_chunk)
            current_chunk = {
                "heading": element.text,
                "content": "",
                "metadata": {"level": element.level},
            }
        else:
            current_chunk["content"] += element.text + "\n"
    if current_chunk["content"]:
        chunks.append(current_chunk)  # don't drop the final section
    # Handle chunks that are too large
    for chunk in chunks:
        if token_count(chunk["content"]) > 800:
            yield from split_large_chunk(chunk, max_tokens=500)
        else:
            yield chunk
When it works: Almost everything with structure: documents, articles, manuals, policies. The document’s existing structure is usually the best chunking strategy because the author already organised information into logical units.
When it fails: Unstructured content: chat transcripts, raw notes, OCR’d documents with no formatting.
The Chunk Size Sweet Spot
Based on our experience and widely reported best practices:
- Too small (< 200 tokens): Not enough context for meaningful retrieval. The embedding captures a fragment, not an idea.
- Too large (> 1000 tokens): The embedding becomes a blurry average of multiple ideas. Retrieval precision drops.
- Sweet spot: 300-600 tokens with the heading/section title prepended.
The heading prepend trick is significant. A chunk that says “Employees are entitled to 21 days of paid leave per year” retrieves poorly for the query “what’s the leave policy?” But prepend the section heading (“Leave Policy: Employees are entitled to 21 days…”) and retrieval accuracy jumps 15-20%.
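In code, the trick is a one-line transform applied just before embedding. A sketch, assuming chunks shaped like the dicts produced by the structure-based chunker above:

```python
def embedding_input(chunk):
    """Prepend the section heading so the embedding vector carries
    topical context that the chunk body alone lacks."""
    heading = chunk.get("heading", "").strip()
    content = chunk["content"].strip()
    return f"{heading}: {content}" if heading else content
```

Embed the transformed text, but store and display the original chunk content; the heading only needs to exist in vector space.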
pgvector vs Pinecone: The Real Trade-offs
We started with Pinecone. It’s the obvious choice: purpose-built for vector search, managed infrastructure, good documentation.
After 3 projects on Pinecone, we moved to pgvector for new projects. Here’s why.
Where Pinecone Wins
Query latency at scale. At 5M+ vectors, Pinecone’s p95 latency is 15-30ms. pgvector with HNSW indexing is 40-80ms at the same scale. For real-time applications where every millisecond matters, Pinecone is faster.
Managed infrastructure. Zero ops overhead. No index tuning, no vacuum operations, no connection pooling to manage.
Metadata filtering. Pinecone’s metadata filtering is fast and doesn’t degrade performance significantly. pgvector’s filtering with WHERE clauses on large tables can be slower.
Where pgvector Wins
Cost. A Supabase Pro instance ($25/month) handles 500K vectors comfortably. Pinecone’s equivalent is $70/month on the starter plan, and scales to $200+/month quickly. For a startup running multiple RAG systems, the difference compounds.
Operational simplicity. Your data is already in Postgres. Your application already connects to Postgres. Adding vector search to an existing database is one extension and one column. No new service to manage, no new SDK to learn, no new auth system to configure.
Transactional consistency. When you update a document, you can update the embedding and the metadata in the same transaction. With Pinecone, you have two systems to keep in sync, and they will drift.
SQL-native queries. You can combine vector similarity search with traditional SQL in the same query. “Find the most relevant policy documents from the HR department uploaded in the last 6 months” is one SQL query with pgvector. With Pinecone, it’s a vector search followed by application-level filtering.
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE department = 'HR'
  AND uploaded_at > NOW() - INTERVAL '6 months'
ORDER BY embedding <=> $1
LIMIT 5;
For more on this, read our guide on Vector Databases Compared.
Our Decision Matrix
| Factor | pgvector | Pinecone |
|---|---|---|
| Vectors < 2M | ✅ | Overkill |
| Vectors > 5M | Workable | ✅ |
| Budget-conscious | ✅ | ❌ |
| Need < 20ms p95 | ❌ | ✅ |
| Already using Postgres | ✅ | Redundant |
| Multi-tenant SaaS | ✅ (row-level security) | ✅ (namespaces) |
| Team knows SQL | ✅ | Neutral |
Our default: pgvector for most projects. Pinecone when the use case specifically demands sub-20ms latency at millions of vectors.
Retrieval: The Part Everyone Gets Wrong
You’ve picked your embedding model, chunked your documents, set up your vector database. Now you run a query and… the results are mediocre.
This is where most teams start tweaking the embedding model or adjusting chunk sizes. That’s usually the wrong move. The problem is almost always in the retrieval pipeline.
The Three-Stage Retrieval Pipeline
Stage 1: Initial retrieval (vector search)
Pull the top 20-30 candidates by cosine similarity. Cast a wide net. You want high recall here, not high precision.
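Your vector database does this step for you, but it helps to remember what "top 30 by cosine similarity" actually computes. A brute-force sketch:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def top_k(query_vec, candidates, k=30):
    """candidates: list of (chunk_id, vector) pairs.
    Returns the k best-scoring candidates, highest similarity first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

An HNSW index replaces the linear scan with an approximate graph search, but the scoring function is the same.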
Stage 2: Reranking
Run the 20-30 candidates through a cross-encoder reranker (we use Cohere Rerank or a fine-tuned BGE reranker). This re-scores each candidate based on its relevance to the query with much higher accuracy than embedding similarity alone.
This single step typically improves precision by 15-25% based on published research and our own testing. It’s the highest-ROI improvement you can make to a RAG pipeline.
Stage 3: Context assembly
Take the top 5-8 reranked results and assemble them into the LLM context. Order matters: put the most relevant chunks first. Include the source metadata (document name, section, page number) so the LLM can cite its sources. The generation model you choose has a measurable effect on answer quality — for the full framework on LLM selection for production RAG systems, we’ve covered that decision separately.
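Assembly itself is mechanical. A sketch, assuming the reranker returns chunks sorted most-relevant-first and that each chunk carries `document` and `section` metadata (the field names here are illustrative, not a fixed schema):

```python
def assemble_context(reranked_chunks, max_chunks=8):
    """Build the LLM context: most relevant chunks first, each
    prefixed with its source so the model can cite it."""
    blocks = []
    for chunk in reranked_chunks[:max_chunks]:
        source = f"[{chunk['document']} / {chunk['section']}]"
        blocks.append(f"{source}\n{chunk['content']}")
    return "\n\n".join(blocks)
```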
The Reranking Tax
Reranking adds latency (100–300ms per query) and cost ($0.10 per 1,000 queries for Cohere). For most applications, this is worth it. The alternative is serving bad answers fast.
In our experience, adding reranking typically improves answer accuracy by 15–25%. The 200ms latency increase is invisible to users. The accuracy increase is not.
Evaluation: The Non-Negotiable
Here’s the lesson that cost us the most pain: build your evaluation pipeline before your RAG pipeline.
The Golden Dataset
For every RAG project, we build a dataset of 50-100 question-answer pairs before writing any retrieval code.
{
"question": "What is the company's policy on remote work?",
"expected_answer": "Employees may work remotely up to 3 days per week with manager approval.",
"source_document": "employee-handbook-2026.pdf",
"source_section": "Section 4.2: Remote Work Policy"
}
These golden pairs become the ground truth for every experiment. Change the chunking strategy? Run the eval. Switch embedding models? Run the eval. Add reranking? Run the eval.
What We Measure
| Metric | What It Tells You | Target |
|---|---|---|
| Retrieval recall@5 | Do the right chunks appear in the top 5 results? | > 85% |
| Retrieval precision@5 | What fraction of top 5 results are actually relevant? | > 60% |
| Answer accuracy | Does the LLM’s answer match the expected answer? | > 85% |
| Faithfulness | Does the answer come from the retrieved context (not hallucination)? | > 95% |
| Latency (p95) | How long does the full pipeline take? | < 3s |
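The two retrieval metrics fall out of the golden dataset directly. A sketch, where `retrieve` stands in for your pipeline's search function and each golden item records which chunk IDs count as relevant (field names are illustrative):

```python
def evaluate_retrieval(golden, retrieve, k=5):
    """Returns (recall@k, precision@k) averaged over the golden set.

    recall@k here is hit rate: the fraction of questions for which
    at least one relevant chunk appears in the top k results.
    """
    hits = 0
    precision_sum = 0.0
    for item in golden:
        top = retrieve(item["question"])[:k]
        relevant = item["relevant_ids"]
        if any(cid in relevant for cid in top):
            hits += 1
        precision_sum += sum(1 for cid in top if cid in relevant) / k
    n = len(golden)
    return hits / n, precision_sum / n
```

Run this after every pipeline change; the numbers only mean something relative to the previous run on the same golden set.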
The Accuracy Ceiling
Every RAG system has an accuracy ceiling determined by the retrieval quality. If your retrieval recall@5 is 70%, your answer accuracy will never exceed ~70% no matter how good your LLM is. The LLM can’t answer correctly from the wrong context.
This is why we benchmark retrieval separately from generation. If retrieval recall is low, fix the retrieval pipeline. If retrieval is good but answers are wrong, fix the prompt or the model.
Production Monitoring
The RAG pipeline that works perfectly in development will degrade in production. Here’s what we monitor.
Query patterns: Are users asking questions you didn’t anticipate? New query patterns often reveal gaps in your corpus.
Retrieval scores: Track the average similarity score of retrieved chunks. If it drops over time, your corpus may have shifted or your users are asking about topics you haven’t indexed.
Answer feedback: Even simple thumbs up/down feedback creates a dataset for continuous improvement.
Latency distribution: Not just average, but p50, p95, and p99. A p95 of 4 seconds means 1 in 20 users waits 4+ seconds. That’s noticeable.
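Percentiles are cheap to compute from a window of recent request timings. A sketch using the standard library (a stand-in for whatever your monitoring stack provides):

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 from a list of per-request latencies in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```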
Cost per query: Track your embedding API, reranking API, and LLM API costs per query. We’ve seen costs range from $0.002 to $0.15 per query depending on the pipeline complexity.
FAQ
When should I use RAG vs fine-tuning?
Use RAG when your application needs to query a specific corpus of documents that changes over time, or when you need source attribution for answers. Fine-tuning is better when you need the model to adopt a particular style, follow a specific output format, or perform a task type the base model handles poorly. For most enterprise document Q&A use cases, RAG is the right starting point because the corpus is easier to update and debug than a fine-tuned model.
What vector database should I use for my first RAG system?
Start with pgvector if you are already running Postgres. It handles up to 2 million vectors without meaningful performance issues and keeps your architecture simple by avoiding a separate managed service. If you have no existing database or need sub-20ms query latency at 5 million or more vectors, Pinecone is the faster path to production.
How much data do I need to build a useful RAG system?
There is no minimum document count, but you need enough coverage that the corpus can actually answer the questions users will ask. A 50-document corpus with well-structured, high-quality content will outperform a 10,000-document corpus with noisy, poorly formatted text. Before indexing, audit your documents for completeness: if the answer is not in the corpus, no retrieval strategy will find it.
How do I know if my RAG system is accurate enough to ship?
Build a golden dataset of 50 to 100 representative question-answer pairs before writing any retrieval code, then measure retrieval recall at 5 and answer accuracy against that dataset. A retrieval recall above 85% and answer accuracy above 85% on the golden set is a reasonable bar for most production use cases. Below those thresholds, shipping creates a poor user experience that is difficult to recover from once users lose confidence in the system.
What should I budget for a production RAG system?
A typical setup using pgvector on Supabase ($25 per month), text-embedding-3-small for embeddings (under $5 per month at moderate volume), Cohere reranking ($0.10 per 1,000 queries), and GPT-4o for generation can run $100 to $500 per month for a product with a few thousand daily queries. The biggest cost variable is the generation model: switching from GPT-4o to GPT-4o-mini typically cuts generation costs by 15 to 20 times with manageable quality trade-offs for most document Q&A tasks. Always benchmark quality on your golden dataset before switching models to confirm the trade-off is acceptable.
Building a RAG system? Book a technical call and I’ll walk through how we’d architect yours, including the chunking strategy, model selection, and evaluation pipeline.