5 Mistakes Prompt Engineers Make in 2026 (And How to Fix Them)
Published 21 April 2026 · 9 min read
Quick answer: the five mistakes are (1) prompt bloat — stuffing in instructions until the model ignores half of them; (2) leaky eval sets — your test prompts cue the answer; (3) ignoring model differences — copying a GPT-4 prompt to Claude without rework; (4) no versioning — last week's "fix" broke production; (5) piling on few-shot examples when you should fine-tune. All five are fixable in hours.
1. Prompt bloat
Every new edge case adds a sentence. After twelve edits the prompt is 4,000 tokens, the model follows only the first 1,000, and quality regresses. Fix: write the prompt as a structured spec (role, constraints, examples, output format) and prune every addition that doesn't measurably improve an eval metric.
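A structured spec can be as simple as a dataclass with a render step and a token-budget check. This is a minimal sketch (the `PromptSpec` class and the 4-chars-per-token estimate are illustrative assumptions, not a real library):

```python
from dataclasses import dataclass, field

@dataclass
class PromptSpec:
    """Structured prompt spec: each section can be pruned independently."""
    role: str
    constraints: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    output_format: str = ""

    def render(self) -> str:
        parts = [f"Role: {self.role}"]
        if self.constraints:
            parts.append("Constraints:\n" + "\n".join(f"- {c}" for c in self.constraints))
        parts.extend(self.examples)
        if self.output_format:
            parts.append(f"Output format: {self.output_format}")
        return "\n\n".join(parts)

    def rough_tokens(self) -> int:
        # Crude estimate: roughly 4 characters per token in English prose.
        return len(self.render()) // 4

spec = PromptSpec(
    role="You are a support-ticket classifier.",
    constraints=["Answer with exactly one label.", "Never invent new labels."],
    output_format='JSON: {"label": <string>}',
)
# Budget check before merging another "just one more edge case" sentence:
assert spec.rough_tokens() < 200
```

The point is that every addition lands in a named section, so when an eval run shows no improvement, you know exactly which line to delete.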
2. Leaky eval sets
Your eval prompt includes phrasing that cues the answer. The model looks great on tests and fails in production. Fix: treat evals like ML test sets — independent, representative, periodically refreshed. Rotate 20% of the eval set monthly.
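The monthly rotation can be a one-function job: swap a fixed fraction of the eval set for fresh cases from a held-out pool. A sketch, assuming you keep a reserve pool of labelled cases the model has never been tuned against (`rotate_eval_set` is a hypothetical helper, not a library call):

```python
import random

def rotate_eval_set(current, reserve_pool, fraction=0.2, seed=None):
    """Replace `fraction` of the eval set with fresh cases from a held-out pool."""
    rng = random.Random(seed)
    n_swap = max(1, int(len(current) * fraction))
    if n_swap > len(reserve_pool):
        raise ValueError("reserve pool too small for requested rotation")
    keep = rng.sample(current, len(current) - n_swap)   # drop n_swap old cases
    fresh = rng.sample(reserve_pool, n_swap)            # pull replacements
    return keep + fresh

evals = [f"case-{i}" for i in range(10)]
pool = [f"fresh-{i}" for i in range(50)]
new_evals = rotate_eval_set(evals, pool, fraction=0.2, seed=42)
assert len(new_evals) == 10
assert sum(c.startswith("fresh-") for c in new_evals) == 2
```

Seeding the RNG keeps each month's rotation reproducible, which matters when you later need to explain a score shift.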
3. Ignoring model differences
Claude rewards structured tags and <thinking> scratchpads. GPT-5 follows numbered imperative steps. Gemini prefers clean markdown. Don't copy cross-model without re-evaluating. Track separate prompt versions per model.
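Tracking separate versions per model can be as blunt as a registry keyed by (model, version) with no cross-model fallback. A minimal sketch (the registry layout and prompt strings are illustrative assumptions):

```python
# One entry per (model, version) pair; styles deliberately differ per model.
PROMPTS = {
    ("claude", "v3"): "<instructions>\nClassify the ticket.\n</instructions>",
    ("gpt", "v2"): "1. Read the ticket.\n2. Choose one label.\n3. Output only the label.",
    ("gemini", "v1"): "## Task\nClassify the ticket.\n\n## Output\nOne label.",
}

def get_prompt(model: str, version: str) -> str:
    try:
        return PROMPTS[(model, version)]
    except KeyError:
        # Fail loudly rather than silently reusing another model's prompt.
        raise KeyError(f"no prompt registered for {model}/{version}")

assert get_prompt("gpt", "v2").startswith("1.")
```

The deliberate absence of a fallback is the feature: porting a prompt to a new model forces a new registry entry, which forces a new eval run.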
4. No versioning
You edited the prompt in a config file. Nobody reviewed. Production regressed. Fix: every prompt in git with PR review; every deploy tagged with the prompt version; rollback is a config flip.
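"Rollback is a config flip" can be made literal: pin the prompt version in a deploy config that lives in git, and make rollback a one-field change. A sketch under assumed conventions (the config layout and `rollback` helper are hypothetical):

```python
# This dict would live as deploy.json in git, changed only via reviewed PRs.
deploy = {
    "service": "ticket-classifier",
    "model": "gpt",
    "prompt_version": "v7",   # every deploy is tagged with the prompt version
}

def rollback(config: dict, previous_version: str) -> dict:
    """Rollback is a config flip: change one field, redeploy."""
    return {**config, "prompt_version": previous_version}

rolled = rollback(deploy, "v6")
assert rolled["prompt_version"] == "v6"
assert deploy["prompt_version"] == "v7"  # original config is untouched
```

Because the version string is the only moving part, git blame on this one file gives you the full audit trail of which prompt was live when.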
5. Few-shot when you should fine-tune
If you have >500 labelled examples and your few-shot prompt is over 8k tokens, you are paying extra inference cost and still shipping inconsistency. Fine-tune. At the right data volume, fine-tuning beats few-shot on cost, latency, and quality simultaneously.
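The cost side of that trade-off is simple arithmetic: an 8k-token few-shot prompt is paid for on every request, while a fine-tuned model carries a short prompt at a higher per-token rate. A back-of-envelope sketch (all prices and volumes below are illustrative assumptions, not real rates):

```python
def monthly_prompt_cost(prompt_tokens, requests_per_month, price_per_mtok):
    """Input-token cost of carrying a fixed prompt on every request."""
    return prompt_tokens * requests_per_month * price_per_mtok / 1_000_000

# Assumed numbers: 100k requests/month, $3 per million input tokens for the
# base model, $6 per million for the fine-tuned one.
few_shot   = monthly_prompt_cost(8_000, 100_000, price_per_mtok=3.00)
fine_tuned = monthly_prompt_cost(500,   100_000, price_per_mtok=6.00)

assert few_shot == 2400.0    # $2,400/month just to re-send the examples
assert fine_tuned == 300.0   # $300/month, even at double the token price
```

Even with the tuned model priced at twice the rate, the shorter prompt wins by 8x at this volume; plug in your own numbers before deciding.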
Related reading
10 patterns · Tool buyer's guide · GeraLearn