Law 23 · Instruction & Output
Confidence Is Not Calibrated
A model's certainty is not evidence.

The principle
Models are routinely confident and wrong, and unconfident and right. Routing decisions on self-reported confidence inherits that miscalibration. Instructions like 'only flag high-confidence issues' or 'be conservative' just move the noise around. They don't reduce it, because the confidence itself is the unreliable signal.
Why it happens
A base language model can be reasonably calibrated, meaning its stated probability of being right tracks how often it actually is. The alignment step that makes models helpful tends to degrade this: the GPT-4 technical report showed that the pre-trained model was well calibrated and that post-training noticeably worsened calibration. One proposed mechanism is that reward models used in preference optimization carry a systematic bias toward high-confidence-sounding answers regardless of correctness, so the tuned model learns to express certainty as a style rather than as a signal. Verbalized confidence in an aligned model is therefore closer to a learned mannerism than to a measured probability, which is why a high self-reported confidence is not evidence of correctness and why routing on it just reshuffles noise.
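Calibration in the sense used above can be measured on a labeled sample. The following is a minimal sketch of expected calibration error (ECE), one common summary of the gap between stated confidence and observed accuracy. The function name, the bin count, and the assumption that confidences are already numbers in [0, 1] are illustrative choices, not something the source prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the model's self-reported confidence.
    correct:     booleans, whether each answer was actually right.
    n_bins:      number of equal-width confidence bins (an arbitrary choice).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its gap, weighted by its share of the sample.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near 0 means stated confidence tracks accuracy on this sample, and a large value means it does not. On a handful of spot-checks the estimate is noisy, so it is better read as a rough diagnostic than as a precise score.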
Watch for
- Your gate is phrased as 'only act on high-confidence outputs' or 'be conservative' rather than as concrete criteria.
- Spot-checks turn up confident wrong answers and hesitant right ones at similar rates.
- Two cases that are equally clear-cut to a human get very different self-reported confidence from the model.
In practice
A content-moderation agent is told to only escalate high-confidence policy violations, and it sails through eval while quietly waving through the borderline harassment cases it felt unsure about. The threshold did nothing but reshuffle the noise, because the model's self-rated confidence was never tied to actual correctness. Rip out the confidence gate and replace it with categorical rules, each with a worked example: escalate if the content names a person and threatens harm; do not escalate generic insults. Decide on observable features of the content, not on how sure the model claims to feel.
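The moderation rules described here could be sketched as a small decision function over observable features. Everything named below (`ContentFeatures`, `Decision`, `decide`, and the two example messages) is hypothetical, and the sketch assumes the boolean features are extracted upstream, for instance by asking the model narrow yes/no questions about the content rather than asking how confident it is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentFeatures:
    names_person: bool    # the message identifies a specific individual
    threatens_harm: bool  # the message states or implies harm to someone
    generic_insult: bool  # the message is insulting but targets no one specific


@dataclass(frozen=True)
class Decision:
    escalate: bool
    rule: str  # which rule fired, so every decision is auditable


def decide(features: ContentFeatures) -> Decision:
    # IN: a named person plus a threat of harm.
    if features.names_person and features.threatens_harm:
        return Decision(True, "named person + threat of harm")
    # OUT: generic insults with no named target and no threat.
    if features.generic_insult:
        return Decision(False, "generic insult")
    # Default: nothing matched an escalation rule.
    return Decision(False, "no escalation rule matched")


# One worked example per rule (illustrative messages, not real data).
# Included: "[name], I know where you work and I will hurt you."
INCLUDED_EXAMPLE = ContentFeatures(names_person=True, threatens_harm=True, generic_insult=False)
# Excluded: "Everyone in this thread is an idiot."
EXCLUDED_EXAMPLE = ContentFeatures(names_person=False, threatens_harm=False, generic_insult=True)
```

Note that no confidence value appears anywhere in the decision path. Each outcome names the rule that produced it, which makes disagreements about a case a discussion of the rule rather than of a threshold.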
Apply it
- Replace confidence thresholds with explicit categorical rules for what counts as in and what counts as out.
- Anchor each rule to observable features of the input, with one worked example of an included and an excluded case.
- If you need a real uncertainty signal, derive it from agreement across independent samples or an external check, not from the model's self-rating.
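For the last point, agreement across independent samples can be sketched as follows. The sketch assumes the same input has been labeled several times by independent model calls and that the resulting labels are passed in as plain strings. The function names, the `needs_review` fallback, and the 0.8 threshold are illustrative assumptions, and the threshold would need tuning against labeled data.

```python
from collections import Counter


def agreement(labels):
    """Return (majority_label, share of samples that agree with it)."""
    if not labels:
        raise ValueError("need at least one sample")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def route(labels, min_agreement=0.8):
    """Accept the majority label only when enough samples agree on it.

    min_agreement is an assumed cutoff, not a recommended value.
    """
    label, share = agreement(labels)
    return label if share >= min_agreement else "needs_review"


# Four of five independent samples agree, so the majority label is accepted.
consistent = route(["escalate", "escalate", "escalate", "escalate", "ignore"])
# A 3-to-2 split falls below the cutoff and goes to a human.
split = route(["escalate", "ignore", "escalate", "ignore", "ignore"])
```

Agreement is an observed behavior rather than a self-report, which is the point of the substitution. It is still not a guarantee: a model can be consistently wrong, so an external check remains the stronger signal where one is available.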
The takeaway
Replace confidence thresholds and vague hedges with explicit, categorical criteria: what specifically counts as in, what specifically counts as out, with an example of each. Specificity beats self-assessed certainty every time.