AI API Pricing: Honest Breakdown (2026)
Published 21 April 2026 · 10 min read
Quick answer: In 2026, closed-frontier APIs (Anthropic, OpenAI, Google) sit in roughly the same band for comparable tiers — $2-$15 per million input tokens, $5-$75 per million output tokens — with large discounts for batched and cached inputs. Open-weight providers (Mistral, Together, Fireworks) undercut by 50-80% for comparable quality on narrow tasks. Your per-user cost depends more on prompt caching and batch tier than on vendor choice.
Pricing shape in 2026
All frontier providers price per input and output token. Input is always cheaper than output. Cached input is 5-10x cheaper than uncached input. Batched (async, non-real-time) input is ~50% cheaper. Fine-tuned models carry per-token premiums plus fixed training and storage costs.
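The pricing shape above can be sketched as a small cost function. The default rates and discount factors here are hypothetical placeholders for illustration, not any vendor's published prices:

```python
def request_cost(input_tokens, output_tokens, *,
                 input_rate=3.00, output_rate=15.00,
                 cached_fraction=0.0, cache_discount=0.10,
                 batched=False):
    """USD cost of one request; rates are $ per million tokens."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached input is billed at a fraction of the normal input rate
    # (the 5-10x discount described above; 0.10 = 10x cheaper).
    in_cost = (uncached * input_rate + cached * input_rate * cache_discount) / 1e6
    if batched:
        in_cost *= 0.5  # async batch tier: ~50% off input
    return in_cost + output_tokens * output_rate / 1e6
```

For example, 1M fully cached input tokens at a $3/M rate cost about $0.30 instead of $3.00, which is where most of the savings in the section below come from.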
Frontier tiers (closed)
- Anthropic Claude Opus 4.7: premium tier. Best for high-stakes reasoning. Priciest input/output.
- Claude Sonnet 4.6: mid-tier. Most production apps use this level.
- Claude Haiku: fastest, cheapest Anthropic tier.
- OpenAI GPT-5 / GPT-5-mini: GPT-5 premium; mini is mid-tier equivalent.
- Google Gemini 2.5 Pro / Flash: Pro competes with Opus/GPT-5. Flash is cheapest major-vendor tier by far.
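A rate card makes the tier structure concrete. The numbers below are placeholders chosen to sit inside the $2-$15 / $5-$75 bands from the quick answer (the cheapest tiers fall below even that band); they are not published vendor prices:

```python
# Hypothetical rate card in $ per million tokens, one row per tier class.
RATES = {
    "premium": {"input": 15.00, "output": 75.00},  # Opus 4.7 / GPT-5 / Gemini 2.5 Pro class
    "mid":     {"input": 3.00,  "output": 15.00},  # Sonnet 4.6 / GPT-5-mini class
    "cheap":   {"input": 0.50,  "output": 2.00},   # Haiku / Gemini Flash class
}

def tier_cost(tier, input_tokens, output_tokens):
    """USD cost of one request at the given (hypothetical) tier."""
    r = RATES[tier]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1e6
```

The roughly 5x step between adjacent tiers is why routing by task (covered below) moves the needle more than haggling within a tier.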
Open-weight providers
Mistral, Together AI, Fireworks, Groq, and Anyscale serve Llama 3/4, Mixtral, Qwen, DeepSeek, and Mistral Large. Prices typically run 50-80% below frontier rates on input and output. Quality is close enough for many production tasks, especially structured extraction and retrieval summarisation, but not yet equal for long-context, multi-hop reasoning.
Cost per real use case
- RAG chatbot (avg 5k input + 400 output): $0.01-$0.05 per turn on mid-tier frontier, $0.002-$0.01 on open-weight.
- Document summarisation (30k input + 2k output): $0.10-$0.40 per doc.
- Agentic workflow (20 turns avg): $0.20-$1.50 per user-task.
- Code-gen assistant (per session): $0.05-$0.30.
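The ranges above are easy to sanity-check from token counts. This sketch assumes the same hypothetical mid-tier rates used earlier ($3/M input, $15/M output):

```python
def turn_cost(input_tokens, output_tokens, input_rate, output_rate):
    """USD per turn; rates are $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1e6

# RAG chatbot turn (5k in + 400 out) at assumed mid-tier rates:
rag_turn = turn_cost(5_000, 400, 3.00, 15.00)  # ≈ $0.021, inside the quoted $0.01-$0.05
# 20-turn agentic workflow at the same per-turn cost:
agent_task = 20 * rag_turn                     # ≈ $0.42, inside the quoted $0.20-$1.50
```

Note that output tokens dominate less than you might expect here: at 5k input vs 400 output, input is still ~70% of the bill, which is why prompt caching is the first lever below.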
How to cut costs 3-10x
- Prompt caching on repeat system prompts — 50-90% savings.
- Batch async requests when user-facing latency is not needed.
- Route by task: small model for extraction / classification, big model for reasoning.
- Truncate retrieval payloads hard: top-3 chunks usually beat top-10.
- Cache final outputs by input-hash for deterministic use cases.
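The last item can be sketched as a simple memoisation layer. `model_fn` is a placeholder for your actual API call; keying on a hash of the full input is only safe when generation is deterministic (e.g. temperature 0):

```python
import hashlib
import json

_output_cache = {}

def cached_generate(model_fn, prompt, **params):
    """Memoise final model outputs by a hash of the full input.

    Only valid for deterministic settings; identical prompt + params
    must always yield an identical output for the cache to be correct.
    """
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, "params": params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _output_cache:
        _output_cache[key] = model_fn(prompt, **params)  # pay for the call once
    return _output_cache[key]
```

In production you would back `_output_cache` with Redis or similar and add a TTL, but the cost mechanics are the same: repeat inputs cost zero tokens.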
Related reading
Tool buyer's guide · 5 mistakes