The tool is what you interact with.
The model is the brain behind it.
Same tool, different model → different results.
| Provider | Models | Known for |
|---|---|---|
| Anthropic (Claude) | Opus 4.6 · Sonnet 4.6 · Haiku 4.5 | Deep reasoning, long context, complex refactors |
| OpenAI (GPT) | GPT-5.4 · mini · nano | Versatile, fast, strong polyglot coding |
| Google (Gemini) | Gemini 3 Pro · Flash · Deep Think | Massive context windows, speed, multimodal |
| Open-source | DeepSeek V3.2 · Llama 4 · Qwen3-Coder | Free, private, self-hostable, rapidly improving |
| Tier | Models | Use when | Trade-off |
|---|---|---|---|
| Frontier “Architects” | Opus 4.6, GPT-5.4, Gemini Pro | Architecture, hard bugs, security reviews | Slow, expensive |
| Mid-tier “Workhorses” | Sonnet 4.6, GPT-5.4-mini, Gemini Flash | Daily coding, features, tests, code review | Best balance |
| Small “Sprinters” | Haiku 4.5, GPT-5.4-nano, local models | Completions, simple Q&A, high-volume | Fast & cheap, limited reasoning |
| Model | Speed | Cost ($/M tokens, input/output) |
|---|---|---|
| Opus 4.6 | Slower | $5 / $25 |
| Sonnet 4.6 | Fast | $3 / $15 |
| Haiku 4.5 | Fastest | $1 / $5 |
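To see what per-million-token rates mean for a single request, here is a quick calculator (the token counts are made-up but typical; rates come from the table above):

```python
def request_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of one request, given per-million-token rates."""
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# A typical coding request: 8,000 tokens in, 2,000 tokens out.
print(f"Opus 4.6:   ${request_cost(8_000, 2_000, 5, 25):.4f}")   # $0.0900
print(f"Sonnet 4.6: ${request_cost(8_000, 2_000, 3, 15):.4f}")   # $0.0540
print(f"Haiku 4.5:  ${request_cost(8_000, 2_000, 1, 5):.4f}")    # $0.0180
```

At these prices, one Opus call costs about five Haiku calls, which is why tier choice matters at volume.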
Opus → deep reasoning, architecture, hard bugs
Sonnet → your daily default — features, refactors, reviews
Haiku → autocomplete, simple edits, high-volume
Strength: best at understanding complex intent and maintaining coherence across large contexts
| Model | Speed | Cost ($/M tokens, input/output) |
|---|---|---|
| GPT-5.4 | Medium | $2.50 / $15 |
| GPT-5.4-mini | Fast | ~$0.40 / $1.60 |
| GPT-5.4-nano | Fastest | ~$0.10 / $0.40 |
GPT-5.4 → strong general-purpose default
mini → everyday coding, cost-efficient
nano → completions, classification
Strength: polyglot projects, broadest tool integration
| Model | Speed | Cost ($/M tokens, input/output) |
|---|---|---|
| Gemini 3 Pro | Medium | $2–4 / $8–16 |
| Gemini 3 Flash | Fast | $0.50 / $2 |
| Deep Think | Slow | Premium |
Pro → 2M token context — feed in entire repos, massive logs
Flash → surprisingly beats Pro on coding (78.0% vs 76.2% on SWE-bench); the best value pick in 2026
Deep Think → hard reasoning, mathematical proofs
Strength: massive context windows, multimodal, fast iteration
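A rough way to check whether a codebase fits in a 2M-token window (the 4-characters-per-token ratio is a common rule of thumb, not an exact tokenizer; the extension list is just an example):

```python
import os

def estimate_repo_tokens(root: str, exts=(".py", ".ts", ".md")) -> int:
    """Very rough token estimate: total source bytes / ~4 chars per token."""
    total_bytes = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(exts):
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes // 4

# e.g. ~6 MB of source ≈ 1.5M tokens -- fits a 2M window with room for prompts
```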
| Model | Highlight |
|---|---|
| DeepSeek V3.2 | Near-frontier, free, 685B params |
| Llama 4 (Meta) | Enterprise, fine-tunable |
| Qwen3-Coder | Outperforms models 10x its size |
| Mistral | European compliance |
Pick open-source when:
- Privacy or compliance requires code to stay on your own infrastructure
- You need offline or self-hosted deployment
- Budget matters more than frontier-level reasoning
Many teams split the difference: local models for autocomplete, cloud models for complex reasoning.
What are you doing?
│
├─ Quick completion while typing
│ → Small: Haiku, nano, local model
│
├─ Implementing a feature / writing tests
│ → Mid-tier: Sonnet, GPT-5.4-mini, Gemini Flash
│
├─ Complex refactor / architecture / hard bug
│ → Frontier: Opus, GPT-5.4, Gemini Pro
│
├─ Analyzing huge codebase or logs
│     → Gemini 3 Pro (2M context) or Opus 4.6 (1M context)
│
├─ Maximum correctness needed
│ → Opus 4.6 or GPT-5.4
│
└─ Privacy / offline / self-hosted
→ DeepSeek, Llama 4, Qwen3
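The decision tree above can be sketched as a simple routing function (the task categories and model names are just labels for illustration, not a real API):

```python
def pick_model(task: str) -> str:
    """Map a task category to a suggested model tier, per the decision tree."""
    routes = {
        "completion":      "Haiku 4.5 / GPT-5.4-nano / local model",
        "feature":         "Sonnet 4.6 / GPT-5.4-mini / Gemini 3 Flash",
        "architecture":    "Opus 4.6 / GPT-5.4 / Gemini 3 Pro",
        "huge-context":    "Gemini 3 Pro (2M) / Opus 4.6 (1M)",
        "max-correctness": "Opus 4.6 / GPT-5.4",
        "private":         "DeepSeek V3.2 / Llama 4 / Qwen3-Coder",
    }
    # Unknown tasks default to the mid-tier workhorses.
    return routes.get(task, routes["feature"])
```

In a real setup this lookup would live in your editor or gateway config rather than application code, but the shape is the same: classify the task, then route.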
| Model | SWE-bench score |
|---|---|
| Claude Opus 4.6 | 80.8% |
| Claude Sonnet 4.6 | 79.6% |
| Gemini 3 Flash | 78.0% |
| Gemini 3 Pro | 76.2% |
| GPT-5.4 | ~75% |
| DeepSeek V3.2-Speciale | ~74% |
Benchmarks ≠ real-world. Test models on your code before deciding.
For most developers: a mid-tier workhorse (Sonnet 4.6, GPT-5.4-mini, or Gemini 3 Flash) as the daily default.
For compliance teams: self-hosted open-source (DeepSeek, Llama 4, Qwen3-Coder) so code never leaves your infrastructure.
For budget-conscious: Gemini 3 Flash or a local model, escalating to a frontier model only for hard problems.
Use the smallest model that gets the job done.
Upgrade only when you need to.
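One way to apply that rule in practice is a cheapest-first cascade: try the small model and escalate only when its answer fails a check. A minimal sketch, where `call_model` and `passes_check` are hypothetical stand-ins for your provider SDK and validation logic:

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical client call -- replace with your provider's SDK."""
    return f"[{model}] answer to: {prompt}"

def passes_check(answer: str) -> bool:
    """Hypothetical validation -- e.g. do the generated tests pass?
    Here we pretend the smallest model's answer always fails."""
    return "Sonnet" in answer or "Opus" in answer

def cascade(prompt, models=("Haiku 4.5", "Sonnet 4.6", "Opus 4.6")):
    """Try models cheapest-first; escalate only on failed validation."""
    answer = ""
    for model in models:
        answer = call_model(model, prompt)
        if passes_check(answer):
            return model, answer
    return models[-1], answer  # strongest model's answer as a last resort
```

With real validation (tests passing, output parsing, a reviewer accepting the diff), most requests stop at the cheap tier and only the hard ones ever pay frontier prices.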