The Question I Keep Getting
“Can we just plug ChatGPT into our app?”
I hear this at least once a week. A founder has seen ChatGPT do something impressive, and they want that same magic inside their product. Makes sense; the demo is genuinely mind-blowing.
Here’s the thing, though. The gap between “ChatGPT answering a question” and “a chatbot your users actually trust” is enormous. And most teams underestimate it by about 10x.
I’ve built chatbots for knowledge bases, customer support, and data analytics. Every single one started with the same assumption (“this should be straightforward”) and every single one hit the same walls. Here are the walls, and what we do about them.
Wall #1: Hallucination Is the Default
Out of the box, an LLM will confidently make things up. Ask it about your company’s refund policy and it’ll generate a perfectly reasonable-sounding answer that has nothing to do with your actual policy.
For a fun side project? Fine. For a chatbot handling customer queries about their account? Catastrophic.
This is why RAG exists. Retrieval-Augmented Generation grounds the LLM’s responses in your actual data. Instead of generating purely from its training data, the model retrieves relevant documents first, then generates an answer grounded in what it found.
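The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not a production pipeline: the word-overlap retriever stands in for a real vector or keyword search, and `build_prompt` stands in for the actual LLM call. The documents and function names are all hypothetical.

```python
import re

# Toy document store; in production these would be chunks in a vector store.
DOCS = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Our API rate limit is 100 requests per minute per key.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by word overlap with the query (stand-in for real search)."""
    q = tokens(query)
    return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the model: instruct it to answer only from the retrieved context."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the context below. If the context does not "
        f"contain the answer, say you don't know.\n\nContext:\n{ctx}\n\n"
        f"Question: {query}"
    )

query = "What is the refund policy?"
prompt = build_prompt(query, retrieve(query, DOCS))
```

The key property is that the generation step never sees anything except what retrieval returned, which is exactly why retrieval quality becomes the bottleneck.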
But here’s what most tutorials skip. RAG has its own failure modes:
Retrieval failures
Your chatbot is only as good as its retrieval. If the search step returns irrelevant chunks, the LLM will either hallucinate on top of bad context (worse than no RAG) or give a vague non-answer.
What we actually do:
- Chunking strategy matters more than embedding model. We’ve seen teams obsess over which embedding model to use while chunking their documents at arbitrary 512-token boundaries. The chunk boundaries should follow the logical structure of your content: by section, by paragraph, by FAQ entry. Not by token count.
- Benchmark retrieval before you touch the LLM. Build a set of 50 test queries with expected source documents. Measure retrieval accuracy independently. If retrieval is returning the wrong chunks, no amount of prompt engineering will save you.
- Hybrid search beats pure vector search. For most business use cases, combining vector similarity with keyword matching (BM25) gives better results than either alone. A user searching for “invoice #4521” needs exact keyword match, not semantic similarity.
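One common way to combine the two rankings is reciprocal rank fusion (RRF), which needs only the rank positions from each search, not comparable scores. The sketch below uses made-up document IDs and canned rankings; in production the two input lists would come from your keyword and vector search backends.

```python
# Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank).
# k=60 is the conventional default; it damps the influence of any single list.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# An "invoice #4521" style query: keyword search nails the exact match,
# vector search surfaces semantically related billing docs.
keyword_hits = ["invoice-4521", "invoice-4520"]
vector_hits = ["billing-faq", "invoice-4521", "refund-policy"]
fused = rrf([keyword_hits, vector_hits])
```

A document that appears in both rankings (the exact invoice match here) accumulates score from each list and rises to the top, which is the behavior you want from hybrid search.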
The embedding model decision
We’ve benchmarked several embedding models across client projects. The short version:
- text-embedding-3-small (OpenAI): Fast, cheap, good enough for English-only content. Our default starting point.
- BGE-M3: Best for multilingual content. Higher latency (340ms vs 120ms for OpenAI’s small model) but significantly better cross-lingual retrieval.
- Domain-specific fine-tuned models: Almost never worth it unless you have 100K+ domain-specific query-document pairs for training.
The decision isn’t “which is best.” It’s “which is best for your specific query patterns and latency requirements.” The same thinking applies to your primary generation model — for a deeper look at choosing LLMs for production, we’ve covered that decision in full.
Wall #2: No Guardrails = No Trust
This one hit me hard on a real project. We built a knowledge base chatbot that worked beautifully in testing. The client deployed it. Within two days, a user asked a question outside the knowledge base scope, and the chatbot generated a plausible but completely wrong answer about a compliance topic.
That’s when I learned: guardrails aren’t a feature you add in sprint 3. They’re the first thing you design.
What guardrails actually look like
1. Scope boundaries. Define what the chatbot should and shouldn’t answer. Sounds obvious, but most teams skip this.
System prompt:

```text
You are a support assistant for [Company]. You answer questions about
[specific topics]. If a question is outside these topics, say: "I can only
help with [topics]. For other questions, please contact support@company.com."
```
This isn’t enough on its own. LLMs can be prompt-injected past system instructions. But it’s the starting layer.
2. Retrieval confidence thresholds. If the retrieval step returns documents with low similarity scores, the chatbot should say “I don’t have enough information to answer that” instead of guessing.
We typically set a similarity threshold of 0.7 for cosine similarity. Below that, the chatbot declines to answer. This catches most out-of-scope questions.
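That threshold check is just a small gate in front of the generation step. A minimal sketch, assuming retrieval hands back (chunk, cosine similarity) pairs; the function name, fallback message, and stubbed answer are illustrative, and real scores would come from your vector store.

```python
SIMILARITY_THRESHOLD = 0.7  # cosine similarity floor, per the text above
FALLBACK = "I don't have enough information to answer that."

def answer_or_decline(retrieved: list[tuple[str, float]]) -> str:
    """retrieved: (chunk, cosine_similarity) pairs, best first."""
    confident = [(c, s) for c, s in retrieved if s >= SIMILARITY_THRESHOLD]
    if not confident:
        # Nothing scored above the floor: decline instead of guessing.
        return FALLBACK
    # Only the confident chunks reach the LLM (generation stubbed here).
    return f"Answering from {len(confident)} source chunk(s)."

reply = answer_or_decline([("loosely related doc", 0.42)])
```

Tune the floor per corpus: 0.7 is a reasonable default for this setup, but the right value depends on your embedding model and how your content clusters.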
3. Output validation. Before sending the response to the user, validate it:
- Does it reference source documents? (If not, it might be hallucinating)
- Does it contain any content from a blocklist? (Pricing, legal advice, medical guidance: whatever your domain requires)
- Is the response length reasonable? (Extremely short or extremely long responses are often failure modes)
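The three checks above fit in one validation pass. This is a sketch under assumptions: the blocklist terms, length bounds, and function name are placeholders for whatever your domain requires.

```python
BLOCKLIST = ["legal advice", "medical"]   # illustrative; domain-specific in practice
MIN_LEN, MAX_LEN = 20, 4000               # characters; tune to your use case

def validate_response(response: str, cited_sources: list[str]) -> list[str]:
    """Return a list of validation failures; an empty list means the response passes."""
    failures = []
    # Check 1: an answer with no cited sources may be hallucinated.
    if not cited_sources:
        failures.append("no source documents cited (possible hallucination)")
    # Check 2: blocklisted topics must never reach the user.
    lowered = response.lower()
    for term in BLOCKLIST:
        if term in lowered:
            failures.append(f"blocklisted content: {term!r}")
    # Check 3: extreme lengths are a common failure mode.
    if not MIN_LEN <= len(response) <= MAX_LEN:
        failures.append("response length outside expected range")
    return failures
```

Returning a list of failures rather than a boolean makes it easy to log exactly which guardrail fired, which you will want when reviewing production traffic.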
4. Fallback to human. Always, always, always have a handoff path. “I’m not confident in this answer. Let me connect you with a human” is infinitely better than a confident wrong answer.
Wall #3: The “It Works on My Machine” Problem
The demo works. Your five test questions get perfect answers. You ship it.
Then real users ask questions you never thought of. They phrase things differently. They ask follow-up questions that require context from three messages ago. They paste in long documents and ask “summarize this.” They type in Hindi when your knowledge base is in English.
This is why you need an evaluation pipeline before you build the chatbot.
What we build before the chatbot
- A test suite of 100+ query-answer pairs. Not 10, not 20. 100 minimum. Covering happy paths, edge cases, out-of-scope questions, adversarial inputs, and multilingual queries if relevant.
- Automated retrieval evaluation. For each test query: does retrieval return the right source document? Measure recall@5 and recall@10.
- Automated answer evaluation. For each test query: is the generated answer correct? We use an LLM-as-judge approach: a separate model evaluates whether the answer is factually consistent with the source documents.
- Regression testing on every change. Changed the chunking strategy? Swapped the embedding model? Updated the prompt? Run the full test suite. Every time.
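The recall@k metric from the list above is simple to compute once you have the test suite. A minimal sketch: the queries, document IDs, and canned result rankings are made up so the metric logic is the focus; in practice `results` would come from running each test query through your real retriever.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of test queries whose expected doc appears in the top-k results."""
    hits = sum(1 for query, doc in expected.items() if doc in results[query][:k])
    return hits / len(expected)

# Tiny stand-in test suite: query -> the doc that should be retrieved.
expected = {"refund policy?": "doc-refunds", "api limits?": "doc-api"}
# Canned retriever output: query -> ranked doc IDs.
results = {
    "refund policy?": ["doc-refunds", "doc-billing"],
    "api limits?": ["doc-support", "doc-billing", "doc-api"],
}
r5 = recall_at_k(results, expected, k=5)  # both expected docs in the top 5
r1 = recall_at_k(results, expected, k=1)  # only the refund query hits at rank 1
```

Run this on every chunking, embedding, or prompt change and track the numbers over time; a silent drop in recall@5 is exactly the kind of regression that otherwise only surfaces as user complaints.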
This sounds like a lot of work. It is. But it’s dramatically less work than debugging production issues with angry users.
Wall #4: Multi-Turn Conversations Are Hard
Most chatbot tutorials show single-turn interactions: user asks, bot answers, done. Real conversations aren’t like that.
“Show me orders from last month” → “Which ones were refunded?” → “Cancel the largest one.”
Each message depends on the previous context. The chatbot needs to:
- Maintain conversation history
- Resolve references (“the largest one” = the largest refunded order from last month)
- Handle context window limits (long conversations exceed token limits)
What actually works
- Session-based memory with a sliding window. Keep the last N messages in context. For most use cases, 10-15 turns is enough.
- Summarization for long conversations. If the conversation exceeds the context window, summarize earlier messages and keep recent ones verbatim.
- Don’t over-engineer memory. I’ve seen teams build vector-store-backed conversation memory with semantic retrieval over chat history. For 95% of use cases, a simple array of messages works fine. Add complexity only when simple fails.
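The sliding window plus summarization approach can be sketched in a few lines. Everything here is illustrative: the window size, message format, and `summarize` stub (which in production would be a cheap LLM call) are assumptions, not a prescribed implementation.

```python
WINDOW = 4  # keep the last 4 messages verbatim; tune per use case

def summarize(messages: list[dict]) -> str:
    """Stub: a real implementation would ask a small, cheap model to summarize."""
    return f"[summary of {len(messages)} earlier messages]"

def build_context(history: list[dict]) -> list[dict]:
    """Messages to send to the model: an optional summary plus the recent window."""
    if len(history) <= WINDOW:
        return list(history)
    # Fold everything older than the window into a single summary message.
    summary = {"role": "system", "content": summarize(history[:-WINDOW])}
    return [summary] + history[-WINDOW:]

history = [{"role": "user", "content": f"message {i}"} for i in range(7)]
context = build_context(history)
```

This is the “simple array of messages” baseline with one escape hatch for long conversations; reach for vector-store-backed memory only after this demonstrably fails.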
The Stack That Actually Ships
After building several production chatbots, here’s what we’ve settled on:
| Layer | What we use | Why |
|---|---|---|
| Embedding | text-embedding-3-small (default) or BGE-M3 (multilingual) | Cost-performance sweet spot |
| Vector store | pgvector | Runs alongside your existing Postgres. No extra service to manage |
| LLM | Claude 3.5 Sonnet (complex reasoning) or Haiku (high volume, simple queries) | Best instruction following for chatbot use cases |
| Framework | Custom: no LangChain in production | We need control over retry logic, streaming, error handling |
| Search | Hybrid (pgvector + pg_trgm for keyword) | Best of both worlds |
| Evaluation | Custom test suite + LLM-as-judge | Non-negotiable before shipping |
When NOT to Build an AI Chatbot
This might be the most valuable section. Not every problem needs an AI chatbot.
Don’t build one if:
- Your FAQ has fewer than 50 entries. A search bar and a well-organized help page will outperform any chatbot.
- The answers require real-time data from systems you can’t connect. A chatbot that says “I don’t have access to that” is worse than no chatbot.
- The domain is high-stakes (medical, legal, financial advice) and you can’t guarantee accuracy. The liability isn’t worth it.
- Your users are technical and prefer searching documentation. Engineers don’t want to chat. They want Ctrl+F.
Build one if:
- You have a large knowledge base (100+ documents) and users struggle to find answers.
- Your support team answers the same 50 questions repeatedly. A chatbot handles the common ones, humans handle the exceptions.
- Users need to query structured data in natural language — agentic SQL chatbots handle these queries naturally without requiring users to know SQL.
- You need multilingual support without translating your entire knowledge base.
The Real Cost
Most people think the cost of an AI chatbot is the API bill. The API bill is the smallest cost.
The real costs:
- Building the evaluation pipeline: 30% of the total effort
- Curating and maintaining the knowledge base: ongoing, never done
- Handling edge cases and failures gracefully: this is where the engineering time goes
- Monitoring and improving: post-launch, you need someone watching the logs, catching failures, updating the test suite
The API cost? For most business chatbots on Haiku, it’s $50-200/month. The engineering cost to build it right is 10-100x that.
That’s the real answer to “can we just plug ChatGPT into our app?” You can. But what you’ll get is a demo, not a product. And the distance between those two things is where the actual work lives.
FAQ
How long does it take to build a production AI chatbot?
A working prototype with RAG and a basic interface takes roughly 72 hours. A production-ready chatbot with proper evaluation pipelines, guardrails, monitoring, and edge-case handling typically takes 4 to 8 weeks. The timeline depends more on knowledge base quality and how clearly scope is defined than on the technical stack itself.
Do I need RAG for my chatbot?
If the chatbot needs to answer questions based on your specific documents, policies, or internal data, then yes. Without RAG, the model generates from its training data, which has nothing to do with your business and will produce confident but wrong answers. If you only need a general-purpose assistant with no business-specific knowledge requirements, RAG may not be necessary.
What does it cost to build a production AI chatbot?
API running costs are usually modest, often $50 to $200 per month for a business chatbot using a fast, lower-cost model. The build cost covers the evaluation pipeline, knowledge base curation, guardrail design, and post-launch monitoring. A well-scoped chatbot built by a specialist team typically runs $5,000 to $25,000 depending on complexity and integration requirements.
How do I know if my knowledge base is ready for a chatbot?
The minimum bar is roughly 50 to 100 documents or FAQ entries that cover the questions your users actually ask. If your content is poorly organized, outdated, or scattered across disconnected systems, plan to spend time on cleanup before the chatbot can use it reliably. The quality ceiling of any RAG-based chatbot is set directly by the quality of the documents underneath it.
Can I switch LLM providers later without rebuilding everything?
Yes, if the system is architected correctly from the start with the LLM call behind an abstraction layer. We design chatbots so that swapping Claude for GPT-4o, or any future model, requires changing one configuration file rather than rewriting the codebase. This matters because model pricing and performance shift frequently, and staying provider-flexible protects the investment over time.
Building an AI chatbot for your product? Book a 30-minute call: we’ll tell you honestly whether a chatbot is the right solution, and if so, what a 72-hour prototype looks like.