The Agent Ecosystem Is Noisy. Here’s What Actually Works.
Every week, a new “agent framework” drops on Twitter. AutoGen, CrewAI, LangGraph, custom orchestrators: the ecosystem moves fast and the marketing moves faster.
We’ve built agent-based systems for clients across compliance, analytics, content generation, and customer support — including an AI data analyst you can query in plain English. Here’s what we’ve learned, not from tutorials, but from production.
Framework Choice: LangChain vs Custom
Our Journey with LangChain
We started with LangChain. Most teams do. It’s the obvious choice: large community, lots of examples, abstractions for everything.
By our third production agent, we stopped using it for new projects.
The problem isn’t that LangChain is bad. The problem is that LangChain optimises for getting a demo working in 30 minutes. Production agents need different things: observability, error recovery, deterministic tool routing, and the ability to debug why the agent made a specific decision at 3am when it handled a customer query incorrectly.
Where LangChain Breaks Down
1. Debugging is painful. When an agent chain fails, the error trace passes through multiple abstraction layers. A simple “the model returned JSON in the wrong format” becomes a stack trace 40 lines deep.
2. Version instability. LangChain’s API has changed significantly across versions. Code that worked on 0.1.x doesn’t work on 0.2.x. In a production system, dependency instability is a liability.
3. Abstraction cost. LangChain wraps LLM APIs in classes that add indirection without adding reliability. For a demo, this doesn’t matter. For a system handling 500 requests/day, you want to know exactly what’s being sent to the API.
What We Use Instead
Our production agents use a minimal custom framework:
```python
# Simplified version of our agent loop
async def run_agent(task: str, tools: list[Tool], max_steps: int = 10):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.append({"role": "user", "content": task})
    for step in range(max_steps):
        response = await llm.chat(messages, tools=tools)
        if response.stop_reason == "end_turn":
            return response.content
        if response.stop_reason == "tool_use":
            # (a full implementation also appends the assistant's
            # tool_use turn to messages before the tool results)
            for tool_call in response.tool_calls:
                result = await execute_tool(tool_call)
                messages.append(tool_call_message(tool_call, result))
                log_step(step, tool_call, result)  # observability
    raise MaxStepsExceeded(messages)  # explicit failure
```
Why this works better:
- Every tool call is logged with input, output, and latency
- Error handling is explicit: we control what happens when a tool fails
- No hidden abstractions: what you see is what the API receives
- Easy to add guardrails, rate limiting, and cost tracking
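As an illustration, the `log_step` helper referenced in the loop can be a thin wrapper that records input, output, latency, and a rough cost estimate per tool call. This is a sketch; the field names and per-token prices are hypothetical, not our actual pricing.

```python
import json
import time

# Hypothetical per-1K-token prices (USD); real prices depend on your provider.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate for one model call."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K["output"]

def log_step(step: int, tool_name: str, tool_input: dict,
             result: str, latency_ms: float,
             input_tokens: int = 0, output_tokens: int = 0) -> dict:
    """Emit one structured log line per tool call; returns the record."""
    record = {
        "step": step,
        "tool": tool_name,
        "input": tool_input,
        "result_preview": result[:200],  # truncate large tool outputs
        "latency_ms": round(latency_ms, 1),
        "cost_usd": round(estimate_cost(input_tokens, output_tokens), 6),
        "ts": time.time(),
    }
    print(json.dumps(record))  # in production, ship to your log pipeline instead
    return record
```

One structured log line per tool call is what makes the 3am debugging session tractable: you can replay exactly what the agent saw and did.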
When we still use LangChain: Internal tools, rapid prototyping, and situations where the agent logic is simple enough that the abstraction overhead doesn’t matter.
Model Selection for Agents
This is where opinions get strong. Here’s ours, based on production data.
Claude 3.5 Sonnet: Our Default
For most agent use cases, Claude 3.5 Sonnet wins on:
- Instruction following: Handles complex multi-step instructions more reliably than GPT-4o
- Tool-calling accuracy: Fewer malformed tool calls, better parameter extraction (see Anthropic’s tool use docs for how the API works)
- Context utilisation: Better at using information from earlier in the conversation
- Cost: Competitive with GPT-4o at similar or better quality
GPT-4o: Structured Output
When we need guaranteed JSON output (e.g., structured data extraction, form filling), GPT-4o’s structured output mode is unbeatable. You define a JSON schema, and the model will return valid JSON matching that schema. Every time.
This matters for agents that need to populate databases, generate reports, or interface with typed APIs.
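To make this concrete, here is a hypothetical JSON Schema of the kind you would pass to a structured-output request, plus a minimal local conformance check. The schema and field names are illustrative, not from a real project; the exact request wrapper follows OpenAI’s structured-output docs.

```python
import json

# Hypothetical extraction schema for a form-filling agent.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "order_id": {"type": "string"},
        "refund_amount": {"type": "number"},
    },
    "required": ["customer_name", "order_id", "refund_amount"],
    "additionalProperties": False,
}

def conforms(raw: str, schema: dict) -> bool:
    """Minimal local check: parses and verifies required keys and basic types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    type_map = {"string": str, "number": (int, float)}
    for key in schema["required"]:
        if key not in data:
            return False
        expected = type_map[schema["properties"][key]["type"]]
        if not isinstance(data[key], expected):
            return False
    return True
```

With structured output mode the model is constrained to the schema, so the local check becomes a safety net rather than a correction step.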
Open-Source (Llama 3.1 70B, Mixtral): Batch Work
For cost-sensitive batch processing (e.g., processing 10,000 documents overnight), we use open-source models on vLLM or Together AI. The quality is 80-90% of Claude/GPT-4o at 10-20% of the cost.
We don’t use open-source for interactive agents. The latency and reliability gap still matters for real-time user-facing systems.
Model Selection Matrix
| Use Case | Model | Why |
|---|---|---|
| Interactive agent (default) | Claude 3.5 Sonnet | Best instruction following + tool use |
| Structured data extraction | GPT-4o (structured output) | Guaranteed valid JSON |
| Batch document processing | Llama 3.1 70B | 10× cost reduction |
| Simple classification | Claude Haiku / GPT-4o-mini | Fast + cheap for simple tasks |
| Code generation within agents | Claude 3.5 Sonnet | Superior code quality |
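The matrix above is easy to encode as a single routing function so the choice lives in one place. Model identifiers here are placeholders; substitute whatever your provider names them.

```python
# Maps task categories from the selection matrix to a default model.
# Model name strings are placeholders, not exact provider identifiers.
MODEL_ROUTES = {
    "interactive": "claude-3-5-sonnet",
    "structured_extraction": "gpt-4o",
    "batch_documents": "llama-3.1-70b",
    "simple_classification": "claude-haiku",
    "code_generation": "claude-3-5-sonnet",
}

def pick_model(task_type: str) -> str:
    """Return the default model for a task category, falling back to interactive."""
    return MODEL_ROUTES.get(task_type, MODEL_ROUTES["interactive"])
```

Centralising the routing also gives you one place to swap models when you re-run the eval suite after a provider update.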
Tool Design: The Most Underrated Decision
Here’s what most teams get wrong: they spend weeks picking the model and 30 minutes designing the tools. It should be the opposite.
A well-designed tool with a mediocre model outperforms a poorly-designed tool with the best model. Every time.
Principles We Follow
1. One tool = one action. A tool called manage_database that can create, read, update, and delete records will confuse the model. Four separate tools (create_record, get_record, update_record, delete_record) work better.
2. Descriptive names and parameters. The model reads the tool name and description to decide when to use it. search_knowledge_base(query: str, max_results: int) is better than search(q: str, n: int). When the tool is a retrieval function, the same RAG principles that govern standalone knowledge bases apply — chunking strategy, embedding choice, and reranking all affect what the agent can actually find.
3. Return structured errors. When a tool fails, return a clear error message the model can use to recover. Not a stack trace, but a sentence: “No records found matching that query. Try broadening the search terms.”
4. Limit the tool set. More tools = more confusion. We’ve found that agents with 5-8 tools significantly outperform agents with 15-20 tools, even when the larger set is technically more capable.
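Applied to principle 2, a tool definition in the common JSON-schema style might look like this. The tool name, description, and parameters are illustrative, not from a specific client project.

```python
# One tool, one action, with a descriptive name and typed, documented parameters.
SEARCH_TOOL = {
    "name": "search_knowledge_base",
    "description": (
        "Search the product knowledge base and return the most relevant "
        "articles. Use this before answering any product question."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language search query",
            },
            "max_results": {
                "type": "integer",
                "description": "Maximum number of articles to return (1-10)",
                "minimum": 1,
                "maximum": 10,
            },
        },
        "required": ["query"],
    },
}
```

The description is not documentation for humans; it is the text the model reads when deciding whether to call the tool, so it should state when to use it, not just what it does.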
Example: A Compliance Agent
For a sales call compliance agent we built, the final tool set was:
1. search_call_transcripts(query, date_range, agent_name)
2. get_compliance_rules(category)
3. score_compliance(transcript_id, rule_ids)
4. generate_report(agent_name, date_range, format)
5. flag_violation(transcript_id, rule_id, severity, notes)
Five tools. The agent handles compliance analysis across thousands of calls. Adding more tools (like summarize_call or compare_agents) actually degraded performance because the model would use them unnecessarily.
Evaluation: Build This Before the Agent
This is the lesson that cost us the most time to learn.
Build your evaluation pipeline before you build the agent. Not after. Not “when we have time.” Before.
What We Measure
For every agent, we track:
- Task completion rate: Did the agent accomplish what the user asked?
- Tool accuracy: Did it call the right tools with the right parameters?
- Step efficiency: How many steps did it take vs the minimum required?
- Latency (p50 and p95): How long did the user wait?
- Cost per task: Total API cost including all model calls and tool executions
- Failure modes: When it fails, why? (model error, tool error, ambiguous input, max steps exceeded)
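From the per-task logs, the headline numbers reduce to a few lines of arithmetic. A sketch over a list of run records, with the field names assumed rather than prescribed:

```python
def summarize_runs(runs: list[dict]) -> dict:
    """Compute completion rate, p50/p95 latency, and mean cost from run logs.

    Each run record is assumed to contain: completed (bool),
    latency_ms (float), cost_usd (float).
    """
    n = len(runs)
    latencies = sorted(r["latency_ms"] for r in runs)

    def percentile(p: float) -> float:
        # Nearest-rank percentile; adequate for a monitoring dashboard.
        idx = min(n - 1, max(0, round(p / 100 * n) - 1))
        return latencies[idx]

    return {
        "completion_rate": sum(r["completed"] for r in runs) / n,
        "latency_p50_ms": percentile(50),
        "latency_p95_ms": percentile(95),
        "mean_cost_usd": sum(r["cost_usd"] for r in runs) / n,
    }
```

Tracking p95 alongside p50 matters because agents fail at the tail: a handful of runaway multi-step runs can dominate both latency and cost.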
The Eval Dataset
We build a dataset of 50-100 test cases per agent before deployment:
```json
{
  "task": "Find all compliance violations for Agent Smith in March 2026",
  "expected_tools": ["search_call_transcripts", "get_compliance_rules", "score_compliance"],
  "expected_output_contains": ["violation", "Smith", "March"],
  "max_acceptable_steps": 5,
  "max_acceptable_latency_ms": 8000
}
```
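Checking one agent run against a case like this is straightforward. The sketch below assumes the run produces its tool calls, final output, step count, and latency; those field names are illustrative.

```python
def check_case(case: dict, run: dict) -> list[str]:
    """Compare one agent run against one eval case; return a list of failures.

    `run` is assumed to contain: tools_called (list[str]), output (str),
    steps (int), latency_ms (float).
    """
    failures = []
    missing = set(case["expected_tools"]) - set(run["tools_called"])
    if missing:
        failures.append(f"missing tool calls: {sorted(missing)}")
    for needle in case["expected_output_contains"]:
        if needle.lower() not in run["output"].lower():
            failures.append(f"output missing: {needle!r}")
    if run["steps"] > case["max_acceptable_steps"]:
        failures.append(f"too many steps: {run['steps']}")
    if run["latency_ms"] > case["max_acceptable_latency_ms"]:
        failures.append(f"too slow: {run['latency_ms']}ms")
    return failures
```

An empty list means the case passed; anything else tells you exactly which dimension regressed, which is what makes “run the eval” a one-command habit.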
This dataset becomes the ground truth for every iteration. Change the prompt? Run the eval. Switch models? Run the eval. Add a new tool? Run the eval.
Without this, you’re guessing. And in production AI, guessing is expensive.
Common Mistakes (We Made All of Them)
Mistake 1: Overloading the System Prompt
A 3,000-word system prompt that tries to cover every edge case will underperform a 500-word prompt that clearly states the agent’s role, available tools, and constraints.
What works: Role statement (2 sentences) + tool usage guidelines (1 paragraph) + output format (1 example) + explicit constraints (“Never do X”).
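That structure is easy to enforce in code rather than in a sprawling document. A hypothetical template assembled from the four parts:

```python
def build_system_prompt(role: str, tool_guidelines: str,
                        output_example: str, constraints: list[str]) -> str:
    """Assemble a short system prompt: role + tool guidance + format + constraints."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"{role}\n\n"
        f"Tool usage:\n{tool_guidelines}\n\n"
        f"Output format example:\n{output_example}\n\n"
        f"Constraints:\n{constraint_lines}"
    )
```

Keeping the parts as separate arguments also makes prompt changes diffable: a constraint added or removed shows up as one line in code review, not a buried edit in a 3,000-word blob.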
Mistake 2: Not Handling Partial Failures
When one tool call in a multi-step sequence fails, what happens? Most teams let the agent figure it out. In production, the agent will often hallucinate a tool result or get stuck in a retry loop.
What works: Explicit fallback logic. If search_call_transcripts returns no results, the agent should tell the user, not try 5 different query reformulations.
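One way to make that explicit is to wrap tool execution so failures come back as structured, model-readable results with no silent retries. Function and field names here are illustrative.

```python
def execute_tool_safely(tool_fn, **kwargs) -> dict:
    """Run a tool once; convert failures into structured results the model can relay."""
    try:
        result = tool_fn(**kwargs)
    except Exception as exc:  # no retry loop: surface the failure once
        return {"ok": False,
                "error": f"Tool failed: {exc}. Report this to the user."}
    if not result:  # an empty result is a signal to relay, not an error to retry
        return {"ok": False,
                "error": "No records found matching that query. "
                         "Tell the user; do not reformulate repeatedly."}
    return {"ok": True, "result": result}
```

Because the error text is written for the model, it doubles as an instruction: the agent reads “tell the user” instead of inferring that it should try five query variations.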
Mistake 3: Ignoring Cost
A complex agent that makes 8 tool calls per query, each requiring a model call, can cost $0.50–$2.00 per interaction. At 1,000 queries/day, that’s $500–$2,000/day in API costs alone.
What works: Set cost budgets per task. Track cost per query from day one. Use cheaper models for simple routing decisions and reserve expensive models for complex reasoning.
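A per-task budget can be enforced with a small accumulator that the agent loop checks before each model call. The default limit below is made up; set yours from your own cost-per-task data.

```python
class CostBudget:
    """Track spend for one task and stop the loop before it overruns."""

    def __init__(self, limit_usd: float = 1.00):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        """Record the cost of one model call or tool execution."""
        self.spent_usd += amount_usd

    def exhausted(self) -> bool:
        return self.spent_usd >= self.limit_usd
```

In the agent loop, check `budget.exhausted()` before each model call and fail explicitly, the same way the max-steps guard does, so an overrun produces a clean error rather than a surprise invoice.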
FAQ
When should we use LangChain versus building a custom agent framework?
LangChain is a reasonable choice for prototypes, internal tools, and agent logic that is simple enough that the abstraction overhead does not matter. For production systems handling real user requests, the abstraction cost shows up as debugging friction, version instability, and limited visibility into what the model actually receives. If you need full observability, deterministic tool routing, and traceable decision logs, a thin custom framework is the better starting point.
How long does it take to build a production-ready AI agent?
A working prototype with a defined tool set and a baseline eval dataset takes 3 to 5 days. Getting to production quality, with error handling, cost tracking, latency monitoring, and a passing eval suite, typically takes 3 to 6 weeks depending on integration complexity. The biggest variable is the quality of the APIs and data sources the agent depends on, not the agent code itself.
Which model should we use for our first production agent?
Start with Claude 3.5 Sonnet as the default for most agent tasks. It handles complex multi-step instructions reliably and produces well-formed tool calls without heavy prompt engineering overhead. If guaranteed JSON output is a hard requirement for database writes or typed API integrations, GPT-4o with structured output mode removes that uncertainty.
How do we measure whether our AI agent is performing correctly in production?
You need an eval dataset of 50 to 100 test cases covering expected tool calls, acceptable step counts, and required output characteristics before you ship anything to users. Run the eval on every prompt change, model update, or tool modification. Without that baseline, you have no way to tell whether a change improved or degraded agent behavior.
What does it actually cost to run an AI agent at scale?
A complex agent making 6 to 8 model calls per task can cost between $0.20 and $2.00 per interaction depending on model choice and context size. At 1,000 queries per day, that is $200 to $2,000 in daily API costs, so cost tracking from day one is not optional. The most effective approach is to route simple classification and decision steps through cheaper models and reserve your primary model only for the steps that require genuine multi-step reasoning.
Building an AI agent? We prototype agent systems in 72 hours. Book a technical call and I’ll walk through how we’d architect yours.