The Agent Ecosystem Is Noisy. Here’s What Actually Works.
Every week, a new “agent framework” drops on Twitter. AutoGen, CrewAI, LangGraph, custom orchestrators: the ecosystem moves fast and the marketing moves faster.
We’ve built agent-based systems for clients across compliance, analytics, content generation, and customer support — including an AI data analyst you can query in plain English. Here’s what we’ve learned, not from tutorials, but from production.
Framework Choice: LangChain vs Custom
Our Journey with LangChain
We started with LangChain. Most teams do. It’s the obvious choice: large community, lots of examples, abstractions for everything.
By our third production agent, we stopped using it for new projects.
The problem isn’t that LangChain is bad. The problem is that LangChain optimises for getting a demo working in 30 minutes. Production agents need different things: observability, error recovery, deterministic tool routing, and the ability to debug why the agent made a specific decision at 3am when it handled a customer query incorrectly.
Where LangChain Breaks Down
1. Debugging is painful. When an agent chain fails, the error trace passes through multiple abstraction layers. A simple “the model returned JSON in the wrong format” becomes a stack trace 40 lines deep.
2. Version instability. LangChain’s API has changed significantly across versions. Code that worked on 0.1.x doesn’t work on 0.2.x. In a production system, dependency instability is a liability.
3. Abstraction cost. LangChain wraps LLM APIs in classes that add indirection without adding reliability. For a demo, this doesn’t matter. For a system handling 500 requests/day, you want to know exactly what’s being sent to the API.
What We Use Instead
Our production agents use a minimal custom framework:
```python
# Simplified version of our agent loop
async def run_agent(task: str, tools: list[Tool], max_steps: int = 10):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.append({"role": "user", "content": task})
    for step in range(max_steps):
        response = await llm.chat(messages, tools=tools)
        if response.stop_reason == "end_turn":
            return response.content
        if response.stop_reason == "tool_use":
            # (a full implementation also appends the assistant's
            # tool_use turn to messages before the tool results)
            for tool_call in response.tool_calls:
                result = await execute_tool(tool_call)
                messages.append(tool_call_message(tool_call, result))
                log_step(step, tool_call, result)  # observability
    raise MaxStepsExceeded(messages)  # explicit failure
```
Why this works better:
- Every tool call is logged with input, output, and latency
- Error handling is explicit: we control what happens when a tool fails
- No hidden abstractions: what you see is what the API receives
- Easy to add guardrails, rate limiting, and cost tracking
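As an illustration, the `log_step` helper referenced in the loop can be a thin wrapper that records input, output, latency, and a rough cost estimate per tool call. This is a sketch; the field names and per-token prices are hypothetical, not our actual pricing.

```python
import json
import time

# Hypothetical per-1K-token prices (USD); real prices depend on your provider.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate for one model call."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K["output"]

def log_step(step: int, tool_name: str, tool_input: dict,
             result: str, latency_ms: float,
             input_tokens: int = 0, output_tokens: int = 0) -> dict:
    """Emit one structured log line per tool call; returns the record."""
    record = {
        "step": step,
        "tool": tool_name,
        "input": tool_input,
        "result_preview": result[:200],  # truncate large tool outputs
        "latency_ms": round(latency_ms, 1),
        "cost_usd": round(estimate_cost(input_tokens, output_tokens), 6),
        "ts": time.time(),
    }
    print(json.dumps(record))  # in production, ship to your log pipeline instead
    return record
```

One structured log line per tool call is what makes the 3am debugging session tractable: you can replay exactly what the agent saw and did.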
When we still use LangChain: Internal tools, rapid prototyping, and situations where the agent logic is simple enough that the abstraction overhead doesn’t matter.
Model Selection for Agents
This is where opinions get strong. Here’s ours, based on production data.
Claude 3.5 Sonnet: Our Default
For most agent use cases, Claude 3.5 Sonnet wins on:
- Instruction following: Handles complex multi-step instructions more reliably than GPT-4o
- Tool-calling accuracy: Fewer malformed tool calls, better parameter extraction (see Anthropic’s tool use docs for how the API works)
- Context utilisation: Better at using information from earlier in the conversation
- Cost: Competitive with GPT-4o at similar or better quality
GPT-4o: Structured Output
When we need guaranteed JSON output (e.g., structured data extraction, form filling), GPT-4o’s structured output mode is unbeatable. You define a JSON schema, and the model will return valid JSON matching that schema. Every time.
This matters for agents that need to populate databases, generate reports, or interface with typed APIs.
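To make this concrete, here is a hypothetical JSON Schema of the kind you would pass to a structured-output request, plus a minimal local conformance check. The schema and field names are illustrative, not from a real project; the exact request wrapper follows OpenAI’s structured-output docs.

```python
import json

# Hypothetical extraction schema for a form-filling agent.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "order_id": {"type": "string"},
        "refund_amount": {"type": "number"},
    },
    "required": ["customer_name", "order_id", "refund_amount"],
    "additionalProperties": False,
}

def conforms(raw: str, schema: dict) -> bool:
    """Minimal local check: parses and verifies required keys and basic types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    type_map = {"string": str, "number": (int, float)}
    for key in schema["required"]:
        if key not in data:
            return False
        expected = type_map[schema["properties"][key]["type"]]
        if not isinstance(data[key], expected):
            return False
    return True
```

With structured output mode the model is constrained to the schema, so the local check becomes a safety net rather than a correction step.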
Open-Source (Llama 3.1 70B, Mixtral): Batch Work
For cost-sensitive batch processing (e.g., processing 10,000 documents overnight), we use open-source models on vLLM or Together AI. The quality is 80-90% of Claude/GPT-4o at 10-20% of the cost.
We don’t use open-source for interactive agents. The latency and reliability gap still matters for real-time user-facing systems.
Model Selection Matrix
| Use Case | Model | Why |
|---|---|---|
| Interactive agent (default) | Claude 3.5 Sonnet | Best instruction following + tool use |
| Structured data extraction | GPT-4o (structured output) | Guaranteed valid JSON |
| Batch document processing | Llama 3.1 70B | 10× cost reduction |
| Simple classification | Claude Haiku / GPT-4o-mini | Fast + cheap for simple tasks |
| Code generation within agents | Claude 3.5 Sonnet | Superior code quality |
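The matrix above is easy to encode as a single routing function so the choice lives in one place. Model identifiers here are placeholders; substitute whatever your provider names them.

```python
# Maps task categories from the selection matrix to a default model.
# Model name strings are placeholders, not exact provider identifiers.
MODEL_ROUTES = {
    "interactive": "claude-3-5-sonnet",
    "structured_extraction": "gpt-4o",
    "batch_documents": "llama-3.1-70b",
    "simple_classification": "claude-haiku",
    "code_generation": "claude-3-5-sonnet",
}

def pick_model(task_type: str) -> str:
    """Return the default model for a task category, falling back to interactive."""
    return MODEL_ROUTES.get(task_type, MODEL_ROUTES["interactive"])
```

Centralising the routing also gives you one place to swap models when you re-run the eval suite after a provider update.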
Tool Design: The Most Underrated Decision
Here’s what most teams get wrong: they spend weeks picking the model and 30 minutes designing the tools. It should be the opposite.
A well-designed tool with a mediocre model outperforms a poorly-designed tool with the best model. Every time.
Principles We Follow
1. One tool = one action. A tool called manage_database that can create, read, update, and delete records will confuse the model. Four separate tools (create_record, get_record, update_record, delete_record) work better.
2. Descriptive names and parameters. The model reads the tool name and description to decide when to use it. search_knowledge_base(query: str, max_results: int) is better than search(q: str, n: int). When the tool is a retrieval function, the same RAG principles that govern standalone knowledge bases apply — chunking strategy, embedding choice, and reranking all affect what the agent can actually find.
3. Return structured errors. When a tool fails, return a clear error message the model can use to recover. Not a stack trace, but a sentence: “No records found matching that query. Try broadening the search terms.”
4. Limit the tool set. More tools = more confusion. We’ve found that agents with 5-8 tools significantly outperform agents with 15-20 tools, even when the larger set is technically more capable.
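Applied to principle 2, a tool definition in the common JSON-schema style might look like this. The tool name, description, and parameters are illustrative, not from a specific client project.

```python
# One tool, one action, with a descriptive name and typed, documented parameters.
SEARCH_TOOL = {
    "name": "search_knowledge_base",
    "description": (
        "Search the product knowledge base and return the most relevant "
        "articles. Use this before answering any product question."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language search query",
            },
            "max_results": {
                "type": "integer",
                "description": "Maximum number of articles to return (1-10)",
                "minimum": 1,
                "maximum": 10,
            },
        },
        "required": ["query"],
    },
}
```

The description is not documentation for humans; it is the text the model reads when deciding whether to call the tool, so it should state when to use it, not just what it does.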
Example: A Compliance Agent
For a sales call compliance agent we built, the final tool set was:
1. search_call_transcripts(query, date_range, agent_name)
2. get_compliance_rules(category)
3. score_compliance(transcript_id, rule_ids)
4. generate_report(agent_name, date_range, format)
5. flag_violation(transcript_id, rule_id, severity, notes)
Five tools. The agent handles compliance analysis across thousands of calls. Adding more tools (like summarize_call or compare_agents) actually degraded performance because the model would use them unnecessarily.
Evaluation: Build This Before the Agent
This is the lesson that cost us the most time to learn.
Build your evaluation pipeline before you build the agent. Not after. Not “when we have time.” Before.
What We Measure
For every agent, we track:
- Task completion rate: Did the agent accomplish what the user asked?
- Tool accuracy: Did it call the right tools with the right parameters?
- Step efficiency: How many steps did it take vs the minimum required?
- Latency (p50 and p95): How long did the user wait?
- Cost per task: Total API cost including all model calls and tool executions
- Failure modes: When it fails, why? (model error, tool error, ambiguous input, max steps exceeded)
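From the per-task logs, the headline numbers reduce to a few lines of arithmetic. A sketch over a list of run records, with the field names assumed rather than prescribed:

```python
def summarize_runs(runs: list[dict]) -> dict:
    """Compute completion rate, p50/p95 latency, and mean cost from run logs.

    Each run record is assumed to contain: completed (bool),
    latency_ms (float), cost_usd (float).
    """
    n = len(runs)
    latencies = sorted(r["latency_ms"] for r in runs)

    def percentile(p: float) -> float:
        # Nearest-rank percentile; adequate for a monitoring dashboard.
        idx = min(n - 1, max(0, round(p / 100 * n) - 1))
        return latencies[idx]

    return {
        "completion_rate": sum(r["completed"] for r in runs) / n,
        "latency_p50_ms": percentile(50),
        "latency_p95_ms": percentile(95),
        "mean_cost_usd": sum(r["cost_usd"] for r in runs) / n,
    }
```

Tracking p95 alongside p50 matters because agents fail at the tail: a handful of runaway multi-step runs can dominate both latency and cost.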
The Eval Dataset
We build a dataset of 50-100 test cases per agent before deployment:
```json
{
  "task": "Find all compliance violations for Agent Smith in March 2026",
  "expected_tools": ["search_call_transcripts", "get_compliance_rules", "score_compliance"],
  "expected_output_contains": ["violation", "Smith", "March"],
  "max_acceptable_steps": 5,
  "max_acceptable_latency_ms": 8000
}
```
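Checking one agent run against a case like this is straightforward. The sketch below assumes the run produces its tool calls, final output, step count, and latency; those field names are illustrative.

```python
def check_case(case: dict, run: dict) -> list[str]:
    """Compare one agent run against one eval case; return a list of failures.

    `run` is assumed to contain: tools_called (list[str]), output (str),
    steps (int), latency_ms (float).
    """
    failures = []
    missing = set(case["expected_tools"]) - set(run["tools_called"])
    if missing:
        failures.append(f"missing tool calls: {sorted(missing)}")
    for needle in case["expected_output_contains"]:
        if needle.lower() not in run["output"].lower():
            failures.append(f"output missing: {needle!r}")
    if run["steps"] > case["max_acceptable_steps"]:
        failures.append(f"too many steps: {run['steps']}")
    if run["latency_ms"] > case["max_acceptable_latency_ms"]:
        failures.append(f"too slow: {run['latency_ms']}ms")
    return failures
```

An empty list means the case passed; anything else tells you exactly which dimension regressed, which is what makes “run the eval” a one-command habit.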
This dataset becomes the ground truth for every iteration. Change the prompt? Run the eval. Switch models? Run the eval. Add a new tool? Run the eval.
Without this, you’re guessing. And in production AI, guessing is expensive.
Common Mistakes (We Made All of Them)
Mistake 1: Overloading the System Prompt
A 3,000-word system prompt that tries to cover every edge case will underperform a 500-word prompt that clearly states the agent’s role, available tools, and constraints.
What works: Role statement (2 sentences) + tool usage guidelines (1 paragraph) + output format (1 example) + explicit constraints (“Never do X”).
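That structure is easy to enforce in code rather than in a sprawling document. A hypothetical template assembled from the four parts:

```python
def build_system_prompt(role: str, tool_guidelines: str,
                        output_example: str, constraints: list[str]) -> str:
    """Assemble a short system prompt: role + tool guidance + format + constraints."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"{role}\n\n"
        f"Tool usage:\n{tool_guidelines}\n\n"
        f"Output format example:\n{output_example}\n\n"
        f"Constraints:\n{constraint_lines}"
    )
```

Keeping the parts as separate arguments also makes prompt changes diffable: a constraint added or removed shows up as one line in code review, not a buried edit in a 3,000-word blob.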
Mistake 2: Not Handling Partial Failures
When one tool call in a multi-step sequence fails, what happens? Most teams let the agent figure it out. In production, the agent will often hallucinate a tool result or get stuck in a retry loop.
What works: Explicit fallback logic. If search_call_transcripts returns no results, the agent should tell the user, not try 5 different query reformulations.
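One way to make that explicit is to wrap tool execution so failures come back as structured, model-readable results with no silent retries. Function and field names here are illustrative.

```python
def execute_tool_safely(tool_fn, **kwargs) -> dict:
    """Run a tool once; convert failures into structured results the model can relay."""
    try:
        result = tool_fn(**kwargs)
    except Exception as exc:  # no retry loop: surface the failure once
        return {"ok": False,
                "error": f"Tool failed: {exc}. Report this to the user."}
    if not result:  # an empty result is a signal to relay, not an error to retry
        return {"ok": False,
                "error": "No records found matching that query. "
                         "Tell the user; do not reformulate repeatedly."}
    return {"ok": True, "result": result}
```

Because the error text is written for the model, it doubles as an instruction: the agent reads “tell the user” instead of inferring that it should try five query variations.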
Mistake 3: Ignoring Cost
A complex agent that makes 8 tool calls per query, each requiring a model call, can cost $0.50–$2.00 per interaction. At 1,000 queries/day, that’s $500–$2,000/day in API costs alone.
What works: Set cost budgets per task. Track cost per query from day one. Use cheaper models for simple routing decisions and reserve expensive models for complex reasoning.
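A per-task budget can be enforced with a small accumulator that the agent loop checks before each model call. The default limit below is made up; set yours from your own cost-per-task data.

```python
class CostBudget:
    """Track spend for one task and stop the loop before it overruns."""

    def __init__(self, limit_usd: float = 1.00):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        """Record the cost of one model call or tool execution."""
        self.spent_usd += amount_usd

    def exhausted(self) -> bool:
        return self.spent_usd >= self.limit_usd
```

In the agent loop, check `budget.exhausted()` before each model call and fail explicitly, the same way the max-steps guard does, so an overrun produces a clean error rather than a surprise invoice.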
FAQ
When should we use LangChain versus building a custom agent framework?
LangChain is a reasonable choice for prototypes, internal tools, and agent logic that is simple enough that the abstraction overhead does not matter. For production systems handling real user requests, the abstraction cost shows up as debugging friction, version instability, and limited visibility into what the model actually receives. If you need full observability, deterministic tool routing, and traceable decision logs, a thin custom framework is the better starting point.
How long does it take to build a production-ready AI agent?
A working prototype with a defined tool set and a baseline eval dataset takes 3 to 5 days. Getting to production quality, with error handling, cost tracking, latency monitoring, and a passing eval suite, typically takes 3 to 6 weeks depending on integration complexity. The biggest variable is the quality of the APIs and data sources the agent depends on, not the agent code itself.
Which model should we use for our first production agent?
Start with Claude 3.5 Sonnet as the default for most agent tasks. It handles complex multi-step instructions reliably and produces well-formed tool calls without heavy prompt engineering overhead. If guaranteed JSON output is a hard requirement for database writes or typed API integrations, GPT-4o with structured output mode removes that uncertainty.
How do we measure whether our AI agent is performing correctly in production?
You need an eval dataset of 50 to 100 test cases covering expected tool calls, acceptable step counts, and required output characteristics before you ship anything to users. Run the eval on every prompt change, model update, or tool modification. Without that baseline, you have no way to tell whether a change improved or degraded agent behavior.
What does it actually cost to run an AI agent at scale?
A complex agent making 6 to 8 model calls per task can cost between $0.20 and $2.00 per interaction depending on model choice and context size. At 1,000 queries per day, that is $200 to $2,000 in daily API costs, so cost tracking from day one is not optional. The most effective approach is to route simple classification and decision steps through cheaper models and reserve your primary model only for the steps that require genuine multi-step reasoning.
Building an AI agent? We prototype agent systems in 72 hours. Book a technical call and I’ll walk through how we’d architect yours.