The Question I Keep Getting
“Can we just plug ChatGPT into our app?”
I hear this at least once a week. A founder has seen ChatGPT do something impressive, and they want that same magic inside their product. Makes sense; the demo is genuinely mind-blowing.
Here’s the thing, though. The gap between “ChatGPT answering a question” and “a chatbot your users actually trust” is enormous. And most teams underestimate it by about 10x.
I’ve built chatbots for knowledge bases, customer support, and data analytics. Every single one started with the same assumption (“this should be straightforward”) and every single one hit the same walls. Here are the walls, and what we do about them.
Wall #1: Hallucination Is the Default
Out of the box, an LLM will confidently make things up. Ask it about your company’s refund policy and it’ll generate a perfectly reasonable-sounding answer that has nothing to do with your actual policy.
For a fun side project? Fine. For a chatbot handling customer queries about their account? Catastrophic.
This is why RAG exists. Retrieval-Augmented Generation grounds the LLM’s responses in your actual data. Instead of generating purely from its training data, the model retrieves relevant documents first, then generates an answer grounded in what it found.
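The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not a production pipeline: the word-overlap retriever stands in for a real vector or keyword search, and `build_prompt` stands in for the actual LLM call. The documents and function names are all hypothetical.

```python
import re

# Toy document store; in production these would be chunks in a vector store.
DOCS = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Our API rate limit is 100 requests per minute per key.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by word overlap with the query (stand-in for real search)."""
    q = tokens(query)
    return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the model: instruct it to answer only from the retrieved context."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the context below. If the context does not "
        f"contain the answer, say you don't know.\n\nContext:\n{ctx}\n\n"
        f"Question: {query}"
    )

query = "What is the refund policy?"
prompt = build_prompt(query, retrieve(query, DOCS))
```

The key property is that the generation step never sees anything except what retrieval returned, which is exactly why retrieval quality becomes the bottleneck.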
But here’s what most tutorials skip. RAG has its own failure modes:
Retrieval failures
Your chatbot is only as good as its retrieval. If the search step returns irrelevant chunks, the LLM will either hallucinate on top of bad context (worse than no RAG) or give a vague non-answer.
What we actually do:
- Chunking strategy matters more than embedding model. We’ve seen teams obsess over which embedding model to use while chunking their documents at arbitrary 512-token boundaries. The chunk boundaries should follow the logical structure of your content: by section, by paragraph, by FAQ entry. Not by token count.
- Benchmark retrieval before you touch the LLM. Build a set of 50 test queries with expected source documents. Measure retrieval accuracy independently. If retrieval is returning the wrong chunks, no amount of prompt engineering will save you.
- Hybrid search beats pure vector search. For most business use cases, combining vector similarity with keyword matching (BM25) gives better results than either alone. A user searching for “invoice #4521” needs exact keyword match, not semantic similarity.
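One common way to combine the two rankings is reciprocal rank fusion (RRF), which needs only the rank positions from each search, not comparable scores. The sketch below uses made-up document IDs and canned rankings; in production the two input lists would come from your keyword and vector search backends.

```python
# Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank).
# k=60 is the conventional default; it damps the influence of any single list.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# An "invoice #4521" style query: keyword search nails the exact match,
# vector search surfaces semantically related billing docs.
keyword_hits = ["invoice-4521", "invoice-4520"]
vector_hits = ["billing-faq", "invoice-4521", "refund-policy"]
fused = rrf([keyword_hits, vector_hits])
```

A document that appears in both rankings (the exact invoice match here) accumulates score from each list and rises to the top, which is the behavior you want from hybrid search.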
The embedding model decision
We’ve benchmarked several embedding models across client projects. The short version:
- text-embedding-3-small (OpenAI): Fast, cheap, good enough for English-only content. Our default starting point.
- BGE-M3: Best for multilingual content. Higher latency (340ms vs 120ms for OpenAI’s small model) but significantly better cross-lingual retrieval.
- Domain-specific fine-tuned models: Almost never worth it unless you have 100K+ domain-specific query-document pairs for training.
The decision isn’t “which is best.” It’s “which is best for your specific query patterns and latency requirements.” The same thinking applies to your primary generation model — for a deeper look at choosing LLMs for production, we’ve covered that decision in full.
Wall #2: No Guardrails = No Trust
This one hit me hard on a real project. We built a knowledge base chatbot that worked beautifully in testing. The client deployed it. Within two days, a user asked a question outside the knowledge base scope, and the chatbot generated a plausible but completely wrong answer about a compliance topic.
That’s when I learned: guardrails aren’t a feature you add in sprint 3. They’re the first thing you design.
What guardrails actually look like
1. Scope boundaries. Define what the chatbot should and shouldn’t answer. Sounds obvious, but most teams skip this.
System prompt:

```text
You are a support assistant for [Company]. You answer questions about
[specific topics]. If a question is outside these topics, say: "I can only
help with [topics]. For other questions, please contact support@company.com."
```
This isn’t enough on its own. LLMs can be prompt-injected past system instructions. But it’s the starting layer.
2. Retrieval confidence thresholds. If the retrieval step returns documents with low similarity scores, the chatbot should say “I don’t have enough information to answer that” instead of guessing.
We typically set a similarity threshold of 0.7 for cosine similarity. Below that, the chatbot declines to answer. This catches most out-of-scope questions.
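That threshold check is just a small gate in front of the generation step. A minimal sketch, assuming retrieval hands back (chunk, cosine similarity) pairs; the function name, fallback message, and stubbed answer are illustrative, and real scores would come from your vector store.

```python
SIMILARITY_THRESHOLD = 0.7  # cosine similarity floor, per the text above
FALLBACK = "I don't have enough information to answer that."

def answer_or_decline(retrieved: list[tuple[str, float]]) -> str:
    """retrieved: (chunk, cosine_similarity) pairs, best first."""
    confident = [(c, s) for c, s in retrieved if s >= SIMILARITY_THRESHOLD]
    if not confident:
        # Nothing scored above the floor: decline instead of guessing.
        return FALLBACK
    # Only the confident chunks reach the LLM (generation stubbed here).
    return f"Answering from {len(confident)} source chunk(s)."

reply = answer_or_decline([("loosely related doc", 0.42)])
```

Tune the floor per corpus: 0.7 is a reasonable default for this setup, but the right value depends on your embedding model and how your content clusters.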
3. Output validation. Before sending the response to the user, validate it:
- Does it reference source documents? (If not, it might be hallucinating)
- Does it contain any content from a blocklist? (Pricing, legal advice, medical guidance: whatever your domain requires)
- Is the response length reasonable? (Extremely short or extremely long responses are often failure modes)
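The three checks above fit in one validation pass. This is a sketch under assumptions: the blocklist terms, length bounds, and function name are placeholders for whatever your domain requires.

```python
BLOCKLIST = ["legal advice", "medical"]   # illustrative; domain-specific in practice
MIN_LEN, MAX_LEN = 20, 4000               # characters; tune to your use case

def validate_response(response: str, cited_sources: list[str]) -> list[str]:
    """Return a list of validation failures; an empty list means the response passes."""
    failures = []
    # Check 1: an answer with no cited sources may be hallucinated.
    if not cited_sources:
        failures.append("no source documents cited (possible hallucination)")
    # Check 2: blocklisted topics must never reach the user.
    lowered = response.lower()
    for term in BLOCKLIST:
        if term in lowered:
            failures.append(f"blocklisted content: {term!r}")
    # Check 3: extreme lengths are a common failure mode.
    if not MIN_LEN <= len(response) <= MAX_LEN:
        failures.append("response length outside expected range")
    return failures
```

Returning a list of failures rather than a boolean makes it easy to log exactly which guardrail fired, which you will want when reviewing production traffic.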
4. Fallback to human. Always, always, always have a handoff path. “I’m not confident in this answer. Let me connect you with a human” is infinitely better than a confident wrong answer.
Wall #3: The “It Works on My Machine” Problem
The demo works. Your five test questions get perfect answers. You ship it.
Then real users ask questions you never thought of. They phrase things differently. They ask follow-up questions that require context from three messages ago. They paste in long documents and ask “summarize this.” They type in Hindi when your knowledge base is in English.
This is why you need an evaluation pipeline before you build the chatbot.
What we build before the chatbot
- A test suite of 100+ query-answer pairs. Not 10, not 20. 100 minimum. Covering happy paths, edge cases, out-of-scope questions, adversarial inputs, and multilingual queries if relevant.
- Automated retrieval evaluation. For each test query: does retrieval return the right source document? Measure recall@5 and recall@10.
- Automated answer evaluation. For each test query: is the generated answer correct? We use an LLM-as-judge approach: a separate model evaluates whether the answer is factually consistent with the source documents.
- Regression testing on every change. Changed the chunking strategy? Swapped the embedding model? Updated the prompt? Run the full test suite. Every time.
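The recall@k metric from the list above is simple to compute once you have the test suite. A minimal sketch: the queries, document IDs, and canned result rankings are made up so the metric logic is the focus; in practice `results` would come from running each test query through your real retriever.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of test queries whose expected doc appears in the top-k results."""
    hits = sum(1 for query, doc in expected.items() if doc in results[query][:k])
    return hits / len(expected)

# Tiny stand-in test suite: query -> the doc that should be retrieved.
expected = {"refund policy?": "doc-refunds", "api limits?": "doc-api"}
# Canned retriever output: query -> ranked doc IDs.
results = {
    "refund policy?": ["doc-refunds", "doc-billing"],
    "api limits?": ["doc-support", "doc-billing", "doc-api"],
}
r5 = recall_at_k(results, expected, k=5)  # both expected docs in the top 5
r1 = recall_at_k(results, expected, k=1)  # only the refund query hits at rank 1
```

Run this on every chunking, embedding, or prompt change and track the numbers over time; a silent drop in recall@5 is exactly the kind of regression that otherwise only surfaces as user complaints.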
This sounds like a lot of work. It is. But it’s dramatically less work than debugging production issues with angry users.
Wall #4: Multi-Turn Conversations Are Hard
Most chatbot tutorials show single-turn interactions: user asks, bot answers, done. Real conversations aren’t like that.
“Show me orders from last month” → “Which ones were refunded?” → “Cancel the largest one.”
Each message depends on the previous context. The chatbot needs to:
- Maintain conversation history
- Resolve references (“the largest one” = the largest refunded order from last month)
- Handle context window limits (long conversations exceed token limits)
What actually works
- Session-based memory with a sliding window. Keep the last N messages in context. For most use cases, 10-15 turns is enough.
- Summarization for long conversations. If the conversation exceeds the context window, summarize earlier messages and keep recent ones verbatim.
- Don’t over-engineer memory. I’ve seen teams build vector-store-backed conversation memory with semantic retrieval over chat history. For 95% of use cases, a simple array of messages works fine. Add complexity only when simple fails.
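The sliding window plus summarization approach can be sketched in a few lines. Everything here is illustrative: the window size, message format, and `summarize` stub (which in production would be a cheap LLM call) are assumptions, not a prescribed implementation.

```python
WINDOW = 4  # keep the last 4 messages verbatim; tune per use case

def summarize(messages: list[dict]) -> str:
    """Stub: a real implementation would ask a small, cheap model to summarize."""
    return f"[summary of {len(messages)} earlier messages]"

def build_context(history: list[dict]) -> list[dict]:
    """Messages to send to the model: an optional summary plus the recent window."""
    if len(history) <= WINDOW:
        return list(history)
    # Fold everything older than the window into a single summary message.
    summary = {"role": "system", "content": summarize(history[:-WINDOW])}
    return [summary] + history[-WINDOW:]

history = [{"role": "user", "content": f"message {i}"} for i in range(7)]
context = build_context(history)
```

This is the “simple array of messages” baseline with one escape hatch for long conversations; reach for vector-store-backed memory only after this demonstrably fails.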
The Stack That Actually Ships
After building several production chatbots, here’s what we’ve settled on:
| Layer | What we use | Why |
|---|---|---|
| Embedding | text-embedding-3-small (default) or BGE-M3 (multilingual) | Cost-performance sweet spot |
| Vector store | pgvector | Runs alongside your existing Postgres. No extra service to manage |
| LLM | Claude 3.5 Sonnet (complex reasoning) or Haiku (high volume, simple queries) | Best instruction following for chatbot use cases |
| Framework | Custom: no LangChain in production | We need control over retry logic, streaming, error handling |
| Search | Hybrid (pgvector + pg_trgm for keyword) | Best of both worlds |
| Evaluation | Custom test suite + LLM-as-judge | Non-negotiable before shipping |
When NOT to Build an AI Chatbot
This might be the most valuable section. Not every problem needs an AI chatbot.
Don’t build one if:
- Your FAQ has fewer than 50 entries. A search bar and a well-organized help page will outperform any chatbot.
- The answers require real-time data from systems you can’t connect. A chatbot that says “I don’t have access to that” is worse than no chatbot.
- The domain is high-stakes (medical, legal, financial advice) and you can’t guarantee accuracy. The liability isn’t worth it.
- Your users are technical and prefer searching documentation. Engineers don’t want to chat. They want Ctrl+F.
Build one if:
- You have a large knowledge base (100+ documents) and users struggle to find answers.
- Your support team answers the same 50 questions repeatedly. A chatbot handles the common ones, humans handle the exceptions.
- Users need to query structured data in natural language — agentic SQL chatbots handle these queries naturally without requiring users to know SQL.
- You need multilingual support without translating your entire knowledge base.
The Real Cost
Most people think the cost of an AI chatbot is the API bill. The API bill is the smallest cost.
The real costs:
- Building the evaluation pipeline: 30% of the total effort
- Curating and maintaining the knowledge base: ongoing, never done
- Handling edge cases and failures gracefully: this is where the engineering time goes
- Monitoring and improving: post-launch, you need someone watching the logs, catching failures, updating the test suite
The API cost? For most business chatbots on Haiku, it’s $50-200/month. The engineering cost to build it right is 10-100x that.
That’s the real answer to “can we just plug ChatGPT into our app?” You can. But what you’ll get is a demo, not a product. And the distance between those two things is where the actual work lives.
FAQ
How long does it take to build a production AI chatbot?
A working prototype with RAG and a basic interface takes roughly 72 hours. A production-ready chatbot with proper evaluation pipelines, guardrails, monitoring, and edge-case handling typically takes 4 to 8 weeks. The timeline depends more on knowledge base quality and how clearly scope is defined than on the technical stack itself.
Do I need RAG for my chatbot?
If the chatbot needs to answer questions based on your specific documents, policies, or internal data, then yes. Without RAG, the model generates from its training data, which has nothing to do with your business and will produce confident but wrong answers. If you only need a general-purpose assistant with no business-specific knowledge requirements, RAG may not be necessary.
What does it cost to build a production AI chatbot?
API running costs are usually modest, often $50 to $200 per month for a business chatbot using a fast, lower-cost model. The build cost covers the evaluation pipeline, knowledge base curation, guardrail design, and post-launch monitoring. A well-scoped chatbot built by a specialist team typically runs $5,000 to $25,000 depending on complexity and integration requirements.
How do I know if my knowledge base is ready for a chatbot?
The minimum bar is roughly 50 to 100 documents or FAQ entries that cover the questions your users actually ask. If your content is poorly organized, outdated, or scattered across disconnected systems, plan to spend time on cleanup before the chatbot can use it reliably. The quality ceiling of any RAG-based chatbot is set directly by the quality of the documents underneath it.
Can I switch LLM providers later without rebuilding everything?
Yes, if the system is architected correctly from the start with the LLM call behind an abstraction layer. We design chatbots so that swapping Claude for GPT-4o, or any future model, requires changing one configuration file rather than rewriting the codebase. This matters because model pricing and performance shift frequently, and staying provider-flexible protects the investment over time.
Building an AI chatbot for your product? Book a 30-minute call: we’ll tell you honestly whether a chatbot is the right solution, and if so, what a 72-hour prototype looks like.