Technical
· 13 min read

Vibe Coding in Production: How We Use AI to Build AI

Our team ships AI products using AI coding tools every day. Here's what actually works, what breaks, and the workflows we've settled on after 6 months.

Abraham Jeron
AI products & system architecture — from prototype to production
TL;DR
  • We use AI coding tools on every project. Agent mode in Cursor handles about 60% of boilerplate and integration code. The other 40% needs human judgment.
  • Vibe coding works for scaffolding, tests, and data transformations. It fails for architecture decisions, security-critical code, and anything that touches billing.
  • Our biggest productivity gain: using Claude Code for code review, not code generation. Catching bugs before they ship saves more time than writing code faster.
  • The engineers who get the most out of AI tools are the ones who can read and evaluate the output critically. The tool amplifies skill, it doesn't replace it.
  • We measured a 35-40% reduction in time-to-first-commit across our team after standardizing on AI-assisted workflows. Most of that gain came from less time on boilerplate.

Six months ago, I told our engineers to stop treating AI coding tools as autocomplete. Start treating them as a junior developer sitting next to you. One that’s fast, tireless, and occasionally confidently wrong about important things.

That shift changed how we build. Not because the tools got dramatically better (though they did, incrementally). Because we figured out where to trust them and where to double-check everything.

This is the honest version of how a team of 200+ engineers uses AI to build AI products. What we use, where it works, where it doesn’t, and the specific workflows we’ve landed on after iterating for six months across dozens of client projects.

What “Vibe Coding” Actually Means to Us

The term gets thrown around loosely. Some people mean “let the AI write everything and hope for the best.” That’s not what we do. That’s how you ship bugs to production at scale.

For us, vibe coding means this: describe what you want in natural language, let the AI generate a first pass, then review, test, and refine. The AI handles the syntax and boilerplate. The engineer handles the thinking.

A concrete example from last week. We were building an API endpoint for a client’s document processing pipeline. I described the requirements to Cursor’s agent mode: “FastAPI endpoint, accepts a PDF upload, extracts text with PyMuPDF, chunks it into 500-token segments with 50-token overlap, returns a JSON array of chunks with metadata.”

Cursor generated a working endpoint in about 90 seconds. The code was correct for the happy path. But it didn’t handle encrypted PDFs, didn’t validate file size before processing, and used a naive character-count chunking approach instead of token-aware splitting.

Those gaps are exactly the kind of thing an experienced engineer catches and a junior developer misses. The AI is the junior developer. Fast, productive, needs review.
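The chunking gap in particular is small but easy to get wrong. Here's a minimal sketch of overlap-aware chunking, assuming the text has already been tokenized (in production you'd want a real tokenizer such as tiktoken; here the tokenizer is abstracted away and the function name is illustrative):

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token sequence into fixed-size chunks with overlap.

    Each chunk shares its first `overlap` tokens with the tail of the
    previous chunk, so no boundary context is lost between segments.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append({"start": start, "tokens": tokens[start:start + chunk_size]})
        if start + chunk_size >= len(tokens):
            break  # the final chunk already covers the tail
    return chunks
```

The encrypted-PDF and file-size checks belong in the endpoint itself, before any chunking runs; PyMuPDF exposes `doc.is_encrypted` for the former.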

The Tool Stack We’ve Settled On

After trying basically everything over the past year, here’s what stuck.

Cursor (with Claude 3.5 Sonnet or GPT-4o) for daily coding. This is the primary tool for about 80% of our engineers. Agent mode is the key feature. You describe what you want at a high level, it generates multi-file changes, and you review the diff. For building AI products for startups, where speed matters and architectures are relatively standard, this workflow handles most of the implementation.

Claude Code (CLI) for complex refactoring and code review. When a task involves understanding a large codebase, tracing dependencies across files, or reviewing a pull request for subtle bugs, Claude Code’s ability to read entire directory structures makes it stronger than an editor-based tool. We use this mostly for review, not generation.

GitHub Copilot for inline completions. Some engineers still prefer Copilot for line-by-line autocomplete, especially in languages where the training data is dense (TypeScript, Python). It’s less useful for domain-specific code where the patterns aren’t in the training set.

No AI for SQL migrations, auth flows, or payment integration. Full stop. These are areas where a subtle bug has outsized consequences. We write this code manually, review it manually, and test it manually. The time saved by AI generation is not worth the risk of a billing error or a security hole that passes code review because it looked plausible.

Where AI Coding Tools Actually Save Time

We tracked this. Not scientifically with a controlled study, but practically across our project management data. After standardizing on AI-assisted workflows in October 2025, we measured time-to-first-commit (from task assignment to first PR) across comparable task types.

The numbers: 35-40% reduction in time-to-first-commit for implementation tasks. That’s meaningful. But it’s not evenly distributed.

High gain (50-60% faster):

  • CRUD endpoints and API boilerplate
  • Test generation from existing code (describe the function, generate the test suite)
  • Data transformation scripts (parse this CSV, reshape into this schema)
  • React component scaffolding with Tailwind styling
  • Documentation and README generation
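The "parse this CSV, reshape into this schema" class of task shows why the gains are so large here: the spec fits in one sentence and the solution is mechanical. A sketch of the shape these scripts take (the column names and target schema are hypothetical, not from a real project):

```python
import csv
import io

def reshape(csv_text):
    """Parse flat CSV rows and nest them into a target schema
    (hypothetical: id at the top level, user fields grouped)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "id": row["id"],
            "user": {"name": row["name"], "email": row["email"]},
        }
        for row in reader
    ]
```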

Moderate gain (20-30% faster):

  • Integration code connecting two well-documented APIs
  • Bug fixes where the error message is clear and the fix is mechanical
  • Database query optimization when you describe the current query and the performance issue

No gain (sometimes slower):

  • Architecture decisions. The AI will happily generate an architecture. It just won’t be the right one for your specific constraints, scale requirements, and team capabilities.
  • Debugging complex state management issues. The AI suggests plausible fixes that often mask the root cause instead of fixing it.
  • Performance optimization at the system level. The AI optimizes individual functions. It doesn’t see the architectural bottleneck three services upstream.

The pattern is clear: AI coding tools save the most time on tasks where the solution is well-defined and the implementation is repetitive. They save the least time, and sometimes cost time, on tasks that require judgment about tradeoffs.

The Code Review Workflow That Changed Everything

This one surprised me. The biggest productivity gain from AI tools in our team wasn’t code generation. It was code review.

Here’s the workflow. An engineer opens a PR. Before requesting human review, they run Claude Code on the diff with a specific prompt: “Review this PR for bugs, security issues, performance problems, and deviation from our coding standards. Be specific about line numbers. Ignore style preferences.”

Claude Code catches things like:

  • Unhandled null cases that would throw in production
  • SQL injection vectors in raw query construction
  • Race conditions in async code
  • API keys accidentally logged to stdout
  • Missing error handling on external API calls
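The last item on that list is the one we see most often. A minimal sketch of the kind of wrapper an automated review typically asks for, using only the standard library (the function name and error policy here are illustrative, not a prescribed pattern):

```python
import json
import urllib.error
import urllib.request

def fetch_json(url, timeout=10):
    """Call an external API and fail loudly with context, instead of
    letting a raw URLError propagate from deep inside business logic."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"API call to {url} failed: {exc}") from exc
```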

It doesn’t catch everything. Architectural issues, business logic correctness, and whether the feature actually solves the user’s problem still need human eyes. But the mechanical bugs, the ones that slip through human review because the reviewer’s eyes glazed over at line 247? Those get caught consistently.

We estimated that this single workflow prevents 2-3 production bugs per week across the team. Each production bug that would have required a hotfix, investigation, and client communication probably costs 4-8 hours of engineering time. So the AI code review is saving us roughly 8-24 hours per week in prevented incident response. That’s a better ROI than any code generation feature.

What Goes Wrong (And What We’ve Learned to Watch For)

Honest accounting. These are the failure modes we’ve hit and now actively guard against.

Hallucinated APIs. The AI generates code that calls a library function that doesn’t exist in the version we’re using. This happens most often with rapidly evolving libraries. LangChain is the worst offender in our experience, because the API changed so frequently between versions that the AI’s training data is a mix of v0.1 and v0.2 patterns. We now include the specific library version in every prompt.
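Pinning the version in the prompt helps at generation time; a cheap runtime guard makes a wrong version assumption fail fast in CI rather than in production. A sketch of such a guard (the helper name and error messages are ours, not a standard pattern):

```python
import importlib.metadata

def require_version(package, prefix):
    """Fail fast if an installed package doesn't match the major/minor
    version the surrounding code was written (or generated) against."""
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError as exc:
        raise RuntimeError(f"{package} is not installed") from exc
    if not version.startswith(prefix):
        raise RuntimeError(
            f"{package} {version} installed, but this module assumes {prefix}.x"
        )
```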

Plausible but wrong logic. The code runs without errors. The tests pass (because the AI wrote the tests to match its own output). But the business logic is subtly wrong. We caught one case where an AI-generated billing calculation rounded incorrectly on currency conversions, producing amounts that were off by 1-3 cents per transaction. At volume, that compounds. This is why we don’t use AI for anything touching billing.
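That class of bug is why our money-handling convention is integer cents plus `Decimal`, never floats. A sketch of the pattern (the function is illustrative, not our actual billing code):

```python
from decimal import ROUND_HALF_UP, Decimal

def convert_cents(amount_cents, rate):
    """Convert an integer-cent amount at a string exchange rate,
    rounding half-up exactly once at the end."""
    result = Decimal(amount_cents) * Decimal(rate)
    return int(result.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Passing the rate as a string matters: `Decimal(0.2665)` would inherit the float's representation error, and Python's built-in `round()` uses banker's rounding, which is one source of the off-by-a-cent drift.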

Copy-paste architecture. When you generate code fast, you tend to duplicate patterns instead of abstracting them. Three months of vibe coding without discipline produces a codebase with 14 slightly different versions of the same API client wrapper. We run periodic “deduplication sprints” where we consolidate these into shared utilities.

False confidence in generated tests. An AI that writes the implementation and the tests is grading its own homework. The tests verify the code does what the AI intended, not what the business requires. We require human-written test cases for all business logic. AI can generate the test scaffolding and assertion syntax, but the actual test scenarios come from the spec.

Context window overflow on large codebases. When the codebase exceeds what fits in the model’s context window, AI tools start making suggestions based on incomplete information. For our larger projects (50,000+ lines), we’ve learned to scope AI interactions to specific directories or modules rather than asking it to reason about the entire codebase.

The “AI Won’t Replace Engineers” Evidence

I hear this debate constantly. Let me share what I’ve actually observed across our team.

The engineers who produce the best work with AI tools are the same engineers who produced the best work without them. Their code was well-structured before. Their architectural decisions were sound before. AI tools made them faster at the parts that were already easy for them.

The engineers who struggled with code quality before AI tools produce more code now, but not better code. They generate implementations faster but spend more time debugging issues that a more experienced engineer would have anticipated. The AI amplifies whatever skill level you bring to it.

Specific observation from our project data: senior engineers (3+ years of experience) saw a 40-50% productivity improvement with AI tools. Junior engineers (less than 1 year) saw a 15-25% improvement. The gap isn’t about tool proficiency. It’s about the ability to evaluate whether the AI’s output is correct, and to know when the generated code needs fundamental restructuring vs. minor tweaks.

Anil’s post on building AI agents gets into the technical decision-making that AI tools can’t do for you. Architecture is still a human skill. The AI can implement your architectural decisions faster, but it can’t make them for you.

Our Standard Workflow: Step by Step

For anyone looking to adopt a similar approach, here’s the specific workflow we use across projects.

Step 1: Architecture and design (no AI). The technical lead outlines the system architecture, API contracts, data models, and component boundaries. This is done in a design document, not in code. AI tools don’t participate in this step.

Step 2: Scaffold with AI. Using Cursor’s agent mode, generate the project structure, boilerplate, and skeleton implementations. This usually takes 30-60 minutes for what would be a full day of manual setup.

Step 3: Implement with AI assist. Engineers work through tasks using AI for first-pass implementation. Each task follows the pattern: describe the requirement in the prompt, generate code, review the diff carefully, fix issues, commit. Average cycle time per task: 2-4 hours for what used to take 4-8 hours.

Step 4: AI code review. Before opening a PR, run Claude Code on the changes. Fix everything it flags. This takes 10-15 minutes per PR and catches 70-80% of the mechanical bugs that would otherwise reach human review.

Step 5: Human code review. A senior engineer reviews the PR for architectural fit, business logic correctness, and anything the AI review missed. This review is faster because the mechanical issues are already resolved.

Step 6: Manual testing of critical paths. Auth flows, payment processing, data deletion, and anything security-sensitive gets manual testing regardless of AI involvement in the implementation.

The Tools Are Getting Better, But the Principles Stay

Six months ago, Cursor’s agent mode needed 2-3 attempts to get a moderately complex function right. Now it nails it on the first try about 70% of the time. Claude Code’s codebase understanding has improved noticeably since we started using it. GitHub’s own research on Copilot shows similar trends in developer productivity gains.

But the fundamental principle hasn’t changed: AI coding tools are productivity multipliers, not skill replacements. An engineer who doesn’t understand concurrency patterns will generate concurrent code with race conditions, regardless of which AI tool they use. An engineer who does understand them will use the AI to write the boilerplate and focus their attention on the tricky synchronization points.

The bet we’ve made as a team: invest in engineer skill development alongside AI tool adoption. Every engineer on our team learns prompt engineering, model evaluation, and AI system design as part of their training. Not because those are buzzwords. Because an engineer who understands how LLMs work writes better prompts, catches more AI mistakes, and builds better AI products for our clients.

That combination, strong engineering fundamentals plus AI-assisted workflows, is what lets us ship at the speed we do. The tools are the accelerator. The engineers are the engine.

FAQ

What AI coding tools does your team use?

We primarily use Cursor with Claude 3.5 Sonnet or GPT-4o for daily development, Claude Code (CLI) for complex refactoring and code review, and GitHub Copilot for inline completions. The choice depends on the task: Cursor for multi-file generation, Claude Code for codebase-wide review, Copilot for quick autocomplete. We avoid AI tools entirely for security-critical code, payment logic, and database migrations.

How much faster is development with AI coding tools?

We measured a 35-40% reduction in time-to-first-commit across our team after standardizing on AI-assisted workflows. The gain varies by task type: boilerplate and scaffolding see 50-60% improvement, integration code sees 20-30%, and architecture or complex debugging sees no improvement. The biggest single gain came from AI-assisted code review, which prevents an estimated 2-3 production bugs per week.

Is vibe coding safe for production code?

With the right guardrails, yes. The key is knowing where AI generation is safe (boilerplate, tests, data transformations) and where it’s not (auth flows, billing logic, security-critical paths). Every AI-generated code change goes through both AI code review and human code review before merging. We don’t use AI-generated code in production without review, and we don’t use AI for code categories where subtle errors have outsized consequences.

Does AI coding replace the need for experienced engineers?

No. Our data shows that senior engineers (3+ years experience) get 40-50% productivity gains from AI tools, while junior engineers get 15-25%. The gap exists because experienced engineers can evaluate whether AI-generated code is correct, recognize architectural issues the AI misses, and know when to reject the AI’s suggestion entirely. AI tools amplify existing skill. They don’t substitute for it.

How do you prevent AI-generated bugs from reaching production?

Three layers. First, engineers review all AI-generated code before committing, same as any code they write. Second, Claude Code runs an automated review on every PR before human review, catching mechanical bugs like null handling, injection vectors, and missing error handling. Third, critical paths (auth, billing, data deletion) get manual testing regardless of how the code was written. This combination catches the majority of issues before they reach staging, and almost all of them before production.


Want to see how an AI-native engineering team ships? Book a 30-minute call. We’ll show you our actual workflows and talk through how they’d apply to your project.

Tags: ai development team, vibe coding, ai coding tools, cursor, claude code, copilot, developer productivity, engineering workflow

Written by Abraham Jeron

Abraham works closely with founders to design, prototype, and ship software products and agentic AI solutions. He converts product ideas into technical execution — architecting systems, planning sprints, and getting teams to deliver fast. He's built RAG chatbots, multi-agent content engines, agentic analytics layers with Claude Agent SDK and MCP, and scaled assessment platforms to thousands of users.
