Ship Features First, Build AI Infrastructure Later
The biggest engineering mistake I see at seed-stage AI startups: building shared infrastructure before shipping a single AI feature to users.
Teams spend months assembling orchestration frameworks, custom evaluation pipelines, and elaborate model-serving infrastructure. Meanwhile, no user has touched the product. The infrastructure is technically sound and answers a question nobody has asked yet.
The rule is simple: build the differentiator, buy the commodity. Your retrieval logic, your prompt architecture, your domain-specific evaluation dataset — those are worth building. Your vector database, your LLM API wrapper, your CI/CD pipeline — buy those. They’re solved problems.
I’ve taken over projects where teams spent three to four months building the wrong architecture because they confused infrastructure work with product work. A custom feature flag system, a bespoke model gateway, an internal eval dashboard — all built before a single user tried the core AI feature. When users finally did, they found a workflow problem in the first hour that would have survived another year of platform building.
Ship the smallest AI feature that solves a real problem. Then let what breaks tell you what infrastructure to build next. That’s the sequencing that works.
This framework covers the six decisions that actually matter when building AI products for startups — in the order you should make them.
The Wrong Question Most Startup CTOs Ask
Most startup CTOs plan AI product development by asking: “How do we add AI to our product?”
The right question is: “What problem costs us the most to solve manually, and can AI solve it reliably enough to matter?”
That reframe changes every decision downstream. I’ve built AI products for 10+ startups across compliance, EdTech, analytics, and customer support. The failures share a pattern: teams added AI because competitors were adding it, not because it solved a specific problem. The wins share a different pattern: one clear problem, one targeted solution, accuracy requirements defined before writing a line of code.
This is the framework I use when a founder brings me an AI product idea. Six decisions, in order. Skip any of them and you’ll pay for it later.
Decision 1: Should You Build AI At All?
This sounds obvious. It isn’t.
Ask four questions before committing to any AI product development investment:
1. Is there a manual process with enough volume to automate? If your team is processing documents, analyzing content, answering repetitive questions, or classifying inputs at scale, you have a candidate. If you’re building a standard CRUD product with light data operations, AI probably adds cost and complexity without meaningful user value.
2. Can you tolerate imperfect outputs? AI makes mistakes. A customer support bot that’s wrong 5% of the time is a liability if users trust it blindly. It’s fine if you’ve designed the UX around review and correction. Know your acceptable error rate before starting. Write it down.
3. Does the volume justify the build cost? A feature used 100 times/day at $0.04 per AI call costs about $1,460/year in API fees. The same feature at 10,000 calls/day costs $146,000/year. Volume changes the economics of build and operation by orders of magnitude. Model both scenarios before committing.
4. Is the data actually accessible? The most common blocker we hit: “we thought we had the data.” PDFs locked in SharePoint, transcripts in a proprietary system, labels that don’t exist yet. Audit your data situation before committing to a timeline. This check alone saves weeks.
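The volume math in question 3 is worth making explicit. A minimal sketch, using the article's illustrative $0.04-per-call figure (real per-call cost depends on your model and token counts):

```python
# Annual API cost at two usage levels. The $0.04/call figure is the
# article's example, not a quoted price.
def annual_api_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Annual API spend for a given daily call volume."""
    return calls_per_day * cost_per_call * 365

low = annual_api_cost(100, 0.04)      # modest internal feature
high = annual_api_cost(10_000, 0.04)  # consumer-scale feature

print(f"100 calls/day:    ${low:,.0f}/year")    # ~$1,460/year
print(f"10,000 calls/day: ${high:,.0f}/year")   # ~$146,000/year
```

Two lines of arithmetic, but running both scenarios before committing is exactly the modeling step most teams skip.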
If you can’t answer all four questions clearly, you need a discovery week, not a development sprint.
Decision 2: Define the Accuracy Floor
This is the most skipped step in AI product development. It causes the most pain later.
Every AI feature needs a number: the minimum accuracy at which it’s useful rather than harmful.
| Feature Type | Accuracy Floor | Rationale |
|---|---|---|
| Customer support triage | 90%+ | Wrong routing frustrates customers directly |
| Document Q&A (internal tool) | 80-85% | Humans verify before acting |
| Content classification | 85%+ | Depends heavily on downstream use |
| Code suggestion | 70%+ | Developer reviews before accepting |
| Compliance scoring | 95%+ | Errors carry legal or financial risk |
| Sales call analysis | 88%+ | Feeding into performance reviews |
Define this number before you build. Then define how you’ll measure it. If you can’t measure accuracy, you can’t ship with confidence. And you can’t improve what you can’t measure.
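What "define it, then measure it" looks like in practice can be sketched in a few lines: a golden dataset, an accuracy function, and a gate that refuses to ship below the floor. The golden pairs and the 90% floor here are illustrative, not from a real project:

```python
# Minimal accuracy-floor gate: score predictions against a golden dataset
# and block shipping below the floor. Data and floor are hypothetical.
GOLDEN = [
    {"input": "Where is my refund?",      "expected": "billing"},
    {"input": "The app crashes on login", "expected": "bug"},
    {"input": "How do I export my data?", "expected": "how-to"},
]
ACCURACY_FLOOR = 0.90  # written down before building, per Decision 2

def accuracy(predict, golden) -> float:
    hits = sum(1 for case in golden if predict(case["input"]) == case["expected"])
    return hits / len(golden)

def ship_gate(predict, golden, floor=ACCURACY_FLOOR) -> bool:
    score = accuracy(predict, golden)
    print(f"accuracy {score:.0%} vs floor {floor:.0%}")
    return score >= floor
```

A real golden dataset is 50-100 pairs, not three, and `predict` wraps your actual pipeline. The structure stays this simple.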
For how we approach evaluation in practice, particularly for retrieval-based systems, see our post on RAG in production.
Decision 3: Build vs Buy
This decision is simpler than most teams make it.
One question: is this differentiated IP or commodity infrastructure?
Build the differentiator. Buy the commodity.
What to Buy
If you’re building a document intelligence product and you need:
- A vector database: use pgvector (Supabase) or Pinecone
- An LLM: use Claude 3.5 Sonnet or GPT-4o via API
- An eval framework: use Braintrust or LangSmith
- An embedding model: use text-embedding-3-small
These are commodities. They’re cheap, reliable, and maintained by teams who do nothing else. You gain nothing by rebuilding them.
What to Build
The things worth building are specific to your data and use case:
- Your retrieval logic (chunking strategy, reranking, metadata filtering)
- Your prompt architecture (system prompts, few-shot examples, output validation)
- Your evaluation dataset (the 50-100 golden Q&A pairs that define “good” for your context)
- Your domain-specific tool set, if you’re building an agent
The trap most teams fall into: spending months building orchestration frameworks they could find on GitHub, then spending two weeks on the actual differentiator. It should be the opposite. I’ve seen this mistake cost teams three to four months on their first AI product. This is the infrastructure-vs-features tradeoff in practice — and for AI startups, the answer is almost always: ship the feature first, build the shared infrastructure after you know what your users actually need.
The Platform SaaS Trap
A specific variant of the build trap: buying an AI platform that wraps all the primitives (LLMs, vector stores, orchestration) in a proprietary interface.
These platforms look compelling at demo time. At production time, you’ve traded debugging complexity you understand for debugging complexity you don’t. And you’re locked in.
My rule: use the primitives directly. Know what your application sends to the LLM. Know what the LLM returns. If you can’t inspect that at 3am when something breaks in production, you have a problem.
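A sketch of what "inspect it at 3am" implies structurally: a thin wrapper that records exactly what goes to the model and exactly what comes back. `raw_llm_call` is a hypothetical stand-in for your provider's SDK call, stubbed here so the sketch runs offline:

```python
# Thin, transparent wrapper around a (stubbed) LLM call. In production the
# log entries would go to structured logging, not an in-memory list.
import time

CALL_LOG = []

def raw_llm_call(prompt: str) -> str:
    return f"stub response to: {prompt}"  # placeholder for the real API call

def call_llm(prompt: str) -> str:
    started = time.time()
    response = raw_llm_call(prompt)
    CALL_LOG.append({
        "ts": started,
        "prompt": prompt,          # the exact text the model saw
        "response": response,      # the exact text it returned
        "latency_s": round(time.time() - started, 3),
    })
    return response
```

The point is not the wrapper itself but the property it guarantees: nothing sits between your application and the model that you can't print.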
Decision 4: Model Selection
The goal isn’t the “best” model. It’s the model that’s good enough for your use case at the lowest cost that meets your latency requirements. Those are three separate constraints, and they trade off against each other.
The Decision Matrix
| Use Case | Model | Approx. Cost (per 1M tokens) | Notes |
|---|---|---|---|
| Interactive product (chat, Q&A) | Claude 3.5 Sonnet | ~$3 in / $15 out | Best instruction following |
| Structured data extraction | GPT-4o (structured output) | ~$5 in / $15 out | Guaranteed valid JSON |
| Batch document processing | Llama 3.1 70B (Together AI) | ~$0.88 in+out | 10-15x cheaper |
| Simple classification | Claude Haiku / GPT-4o-mini | ~$0.15-0.25 in | Fast and cheap enough |
| Code within your product | Claude 3.5 Sonnet | ~$3 in / $15 out | Strongest code quality |
| Cost-critical production scale | Gemini 1.5 Flash | ~$0.075 in | Lowest cost at scale |
A few notes worth calling out:
Claude 3.5 Sonnet is my default for anything interactive. It follows complex multi-step instructions more reliably than GPT-4o, produces well-formed tool calls without heavy prompt engineering, and uses context from earlier in conversations more consistently. For AI product features where the model is making decisions rather than just retrieving text, instruction following matters more than raw benchmark scores.
GPT-4o structured output mode is unbeatable for data extraction. You define a JSON schema. The model returns valid JSON matching that schema every time. For any feature that populates a database or drives a downstream typed API, this removes an entire class of production bugs.
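The schema-first pattern behind structured output can be sketched without the API: the same JSON Schema you'd hand the provider doubles as a local check before the extracted record touches your database. The schema and validation below are a minimal hand-rolled illustration, not a full JSON Schema implementation and not the provider's own validation:

```python
# Hypothetical invoice-extraction schema; the validator covers only the
# two keywords used here (required, type), as a belt-and-suspenders check
# on whatever JSON the model returns.
import json

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["vendor", "total", "currency"],
    "properties": {
        "vendor":   {"type": "string"},
        "total":    {"type": "number"},
        "currency": {"type": "string"},
    },
}

TYPES = {"string": str, "number": (int, float)}

def validate(record: dict, schema: dict) -> list[str]:
    errors = [f"missing field: {f}" for f in schema["required"] if f not in record]
    for field, spec in schema["properties"].items():
        if field in record and not isinstance(record[field], TYPES[spec["type"]]):
            errors.append(f"wrong type for {field}")
    return errors

model_output = json.loads('{"vendor": "Acme", "total": 129.5, "currency": "USD"}')
assert validate(model_output, INVOICE_SCHEMA) == []
```

With structured output mode the provider enforces the schema at generation time; keeping a local check as well costs almost nothing and catches drift when you switch models.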
Don’t pick a model once and never revisit. Model quality and pricing shift fast. We audited model choices across projects in Q1 2026 and found three where switching models would cut costs by 40-60% with no quality regression, purely from improvements in cheaper models over 12 months. Build model selection as a configuration concern, not a hardcoded decision.
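"Model selection as a configuration concern" can be as simple as a lookup table keyed by task type. Model IDs here mirror the article's matrix and are illustrative; they will drift, which is exactly why they belong in config:

```python
# Route each task type through config instead of hardcoding model names
# at every call site. Swapping a model after an audit becomes a one-line
# config change.
MODEL_CONFIG = {
    "interactive_qa": "claude-3-5-sonnet",
    "classification": "gpt-4o-mini",
    "batch_docs":     "llama-3.1-70b",
}

def model_for(task: str) -> str:
    return MODEL_CONFIG[task]
```

In practice this table often lives in a config file or environment variables rather than source code, so a quarterly pricing audit never requires a deploy of application logic.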
The Context Window Trap
More context doesn’t always mean better answers. GPT-4o and Claude both degrade on very long contexts (100K+ tokens) because the model loses precision on information in the middle of the window. This effect is well-documented in research.
If your use case requires reasoning over long documents, test your model’s accuracy at different context lengths. Don’t assume more tokens in equals better output.
Decision 5: RAG vs Fine-Tuning vs Agents
This is the question I get most often from startup CTOs. The answer is usually simpler than expected.
Start with RAG
RAG (Retrieval-Augmented Generation) is the right starting point for roughly 80% of AI product features.
Use RAG when:
- Your knowledge base changes over time
- You need source attribution for answers
- You’re answering questions over a specific corpus: documents, transcripts, product data
- You want to inspect and fix what goes wrong
RAG’s debugging story is its biggest advantage. When it gives a wrong answer, you can see exactly what context was retrieved and why. That observability is worth a lot in production.
Cost to ship a working RAG system: $8,000-20,000 for most use cases, depending on corpus size and query volume.
Fine-Tune When the Base Model Fails at the Task
Fine-tuning isn’t about making the model smarter. It’s about changing behavior: output format, domain-specific terminology, or task consistency.
Fine-tune when:
- The base model consistently produces the wrong output format and structured output mode doesn’t apply
- You need the model to use specific domain vocabulary or follow a house style
- You’re doing a highly repetitive task where consistency matters more than reasoning
Fine-tuning costs: $1,000-10,000 for data preparation and training. The real bottleneck is data labeling. You need 500-5,000 high-quality labeled examples. That’s what takes time, not the training run itself.
One critical note: don’t fine-tune to inject knowledge. That’s what RAG is for. Fine-tuned models don’t reliably recall facts from training data the way people expect. Use fine-tuning for behavior, not knowledge.
Agents When You Need Multi-Step Reasoning
Agents are powerful and expensive to debug. Use them when the task genuinely requires multiple decisions, tool calls, and conditional logic that can’t be handled by a single prompt.
Agents make sense when:
- The task has multiple steps that depend on intermediate results
- Different paths lead to different outcomes based on what the agent finds
- You need to integrate with multiple external systems in a single workflow
Agents require a clear tool set (in our experience, 5-8 tools max), an evaluation dataset built before shipping, and explicit cost budgets per task. An agent making 6-10 model calls per interaction at Claude 3.5 Sonnet pricing costs $0.30-$1.50 per task. At 1,000 tasks/day, that’s $300-$1,500/day.
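What an "explicit cost budget per task" means mechanically: track spend inside the agent loop and abort before a run blows past its ceiling. A minimal sketch with illustrative figures; in practice per-call cost comes from token counts and the provider's price sheet:

```python
# Per-task budget guard for an agent loop. Ceiling and per-call cost
# below are hypothetical examples.
class BudgetExceeded(Exception):
    pass

class TaskBudget:
    def __init__(self, ceiling_usd: float):
        self.ceiling = ceiling_usd
        self.spent = 0.0

    def charge(self, call_cost_usd: float) -> None:
        self.spent += call_cost_usd
        if self.spent > self.ceiling:
            raise BudgetExceeded(f"spent ${self.spent:.2f} > ${self.ceiling:.2f}")

budget = TaskBudget(ceiling_usd=1.50)   # the article's upper per-task figure
for _ in range(8):                      # 8 model calls at ~$0.15 each
    budget.charge(0.15)                 # total ~$1.20, stays under the ceiling
```

The failure mode this prevents is an agent that loops on a bad tool call and silently turns one task into fifty model calls.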
For a deeper look at agent architecture trade-offs, how to design tool sets, and common failure patterns, see our post on building AI agents for production.
The sequencing recommendation: Start with a single-prompt RAG approach. If you hit the quality ceiling, add reranking. If you hit the ceiling there, evaluate whether agents are genuinely required or whether better retrieval design would solve the problem. Agents aren’t always the answer.
Decision 6: Infrastructure Costs at Seed and Series A
AI infrastructure is cheap at small scale and expensive at large scale. The mistake is not modeling the scale cost at design time.
Baseline Stack for a Seed-Stage AI Product
For a typical seed-stage AI product with a few thousand daily users and one or two AI features:
| Component | Option | Monthly Cost |
|---|---|---|
| Vector database | pgvector on Supabase Pro | $25 |
| LLM API (interactive) | Claude 3.5 Sonnet, moderate volume | $200-800 |
| Embedding API | text-embedding-3-small | $5-20 |
| Reranking | Cohere Rerank | $10-50 |
| Hosting | Fly.io or Railway | $20-50 |
| Eval/monitoring | Braintrust or LangSmith | $0-50 |
| Total | | $260-1,000/month |
That’s manageable. But the LLM API cost scales directly with usage. You need to model this before picking your tier.
The Cost Per Query Math
The number that matters: cost per AI query.
A RAG-based Q&A feature with:
- 1 embedding call (text-embedding-3-small): ~$0.00002
- 1 reranking call (Cohere): ~$0.0001
- 1 generation call (Claude 3.5 Sonnet, ~2K tokens out): ~$0.04
Total: roughly $0.04 per query.
At 1,000 queries/day: $40/day, about $1,200/month. At 10,000 queries/day: $400/day, about $12,000/month.
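The per-query math above, as a sketch: sum the component costs, then project daily and monthly spend. Component prices are the article's approximations and will drift with provider pricing:

```python
# Per-query cost model for the RAG pipeline described above.
COMPONENTS = {
    "embedding (text-embedding-3-small)": 0.00002,
    "reranking (Cohere)":                 0.0001,
    "generation (Claude 3.5 Sonnet)":     0.04,
}

cost_per_query = sum(COMPONENTS.values())   # generation dominates
daily = cost_per_query * 10_000             # at 10,000 queries/day
print(f"per query: ${cost_per_query:.4f}")
print(f"daily at 10k: ${daily:,.0f}, monthly: ${daily * 30:,.0f}")
```

The useful property of writing it this way: the breakdown makes obvious that the generation call is the lever, which is why every cost-reduction option below targets it.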
That $12K/month number is real. If you’re building a B2C product with viral potential, model this before you pick your LLM tier. The levers to reduce it:
- Switch from Claude 3.5 Sonnet to Claude Haiku for simpler tasks: 15-20x cost reduction
- Cache frequent queries (common in document Q&A products): 30-50% cost reduction
- Run non-real-time batch processing on open-source models via Together AI: 10-15x cost reduction
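The caching lever is the simplest of the three to sketch: answer repeated questions from a cache keyed on the normalized query, skipping the generation call entirely on a hit. Real systems often key on embedding similarity with a threshold rather than exact text; this exact-match version shows the shape:

```python
# Exact-match query cache. `generate` is whatever function makes the
# paid LLM call; on a cache hit it is never invoked.
_cache: dict[str, str] = {}

def normalize(query: str) -> str:
    return " ".join(query.lower().split())

def answer(query: str, generate) -> str:
    key = normalize(query)
    if key in _cache:
        return _cache[key]          # cache hit: zero API cost
    result = generate(query)        # cache miss: pay for one generation
    _cache[key] = result
    return result
```

Even exact-match caching captures a surprising share of document Q&A traffic, because users of the same corpus ask the same handful of questions.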
What Series A Companies Get Wrong
The most common infrastructure mistake at Series A: running the same seed-stage architecture at 10x the volume without modeling the cost first.
A $0.04/query design that worked at 1,000 queries/day becomes $12,000/month at 10,000 queries/day. If that’s 15% of revenue, survivable. If it’s 80% of revenue, it’s a crisis.
At Series A, audit your AI cost structure before scaling. It’s far easier to re-architect at 10,000 queries/day than at 500,000.
What Doesn’t Work
I’ve shipped enough AI products to have a solid list of failure patterns. Every one of these we’ve either made ourselves or seen on projects we’ve taken over.
Building on preview models. We used a model in preview on one project, it was deprecated in three months, and we had to re-evaluate the entire pipeline. Stick to GA models for production. Preview models are for evaluation, not deployment.
Treating the prompt as a constant. Prompts degrade. Model updates change behavior. A prompt that worked well in November may produce different outputs in March after a silent model update. Test your prompts against your golden dataset after every model update. This isn’t optional; it’s operational hygiene.
Skipping evaluation to move faster. We did this once. We shipped a feature without a formal eval suite, accuracy was 68% on real user queries (not the 85% we’d estimated from qualitative testing), and we pulled it back within two weeks. The eval would have taken five days. The rollback cost us three weeks and damaged user trust. Build the eval first.
Choosing models based on benchmarks, not your data. A model that tops MMLU or HumanEval isn’t necessarily the best model for your specific task. Always test on your own queries with your own data before committing. Benchmark leaderboards are useful for rough comparison, not production decisions.
Underestimating latency requirements. “Fast enough” means different things for different products. An internal analytics tool where users expect a 5-10 second response is different from a customer-facing chat where users expect sub-2-second responses. Define your p95 latency target as a number before designing the pipeline. Then measure it.
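Turning "fast enough" into a number is a few lines once you're logging response times. A sketch using the simple nearest-rank percentile method (the `statistics` module's `quantiles` interpolates differently; nearest-rank is the conservative choice for latency targets); the samples and 2-second target are illustrative:

```python
# Nearest-rank p95 over measured response times, compared against the
# target defined before the pipeline was designed.
import math

def p95(latencies_s: list[float]) -> float:
    ordered = sorted(latencies_s)
    idx = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank percentile
    return ordered[idx]

samples = [0.8, 1.1, 0.9, 1.4, 2.7, 1.0, 1.2, 0.7, 1.3, 5.2]
target = 2.0  # p95 target in seconds, defined up front
verdict = "PASS" if p95(samples) <= target else "FAIL"
print(f"p95 = {p95(samples):.2f}s vs target {target}s: {verdict}")
```

Averages hide exactly the slow tail users complain about, which is why the target is a p95, not a mean.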
FAQ
How do I know when my startup is ready to invest in AI product development?
You’re ready when you can clearly answer: what specific problem does AI solve, what’s the minimum accuracy at which this feature helps rather than harms, and where does the retrieval or training data come from. If any of those answers are “we’ll figure it out,” delay until they’re concrete. Skipping this validation step typically doubles the build cost and produces worse output quality than starting with a defined problem.
What’s the difference between ai app development and just calling an LLM API?
Calling an LLM API is a feature. AI app development is the architecture around that call: the data pipeline feeding it, the retrieval or fine-tuning layer that improves accuracy, the evaluation system that catches degradation, and the cost monitoring that prevents bill shock. In production AI products, most of the work is not the LLM call itself. It’s everything around it.
How long does it take to build a production-ready AI product feature?
A working prototype takes 3-7 days with a defined problem and clean data. Getting to production quality, meaning it passes your accuracy floor on real user queries, has monitoring and alerting, handles edge cases correctly, and has a cost model that doesn’t break at scale, typically takes 4-8 weeks depending on data availability and integration complexity. The biggest variable is almost always data. Structured, accessible, well-labeled data cuts the timeline. Messy, locked, or missing data doubles it.
Should we build the AI features in-house or work with an external team?
Build in-house if AI is your core product differentiator and you have an engineer with production LLM experience on staff. Work with an external team if you’re adding AI features to an existing product, need to prototype fast, or don’t have the internal experience to evaluate model choices and architecture trade-offs confidently. The cost of learning on your first AI product in-house is typically underestimated. We’ve taken over projects where teams spent three to four months building the wrong architecture because they didn’t know what they didn’t know. A technical discovery call with an experienced team costs nothing and usually surfaces the right questions.
What should a seed-stage startup budget for AI product development?
A single focused AI feature (a document Q&A system, a compliance scoring tool, a content generation pipeline) typically costs $8,000-20,000 to build to production quality with an external team. Infrastructure runs $300-1,000/month at seed-stage usage volumes. At Series A, a complete AI product with multiple integrated features is typically $25,000-60,000 to build. These are ranges, not quotes. The actual cost depends on data complexity, required accuracy, and integration depth. Always budget 20-30% beyond the first estimate. AI products almost always surface data quality issues that weren’t visible during scoping.
Working through an AI product decision? Book a 30-minute technical call and I’ll tell you what I’d build, what I’d skip, and whether the timeline you’ve been quoted makes sense.