Law 14 · Retrieval & Memory

Keyword Still Carries Weight

Pure semantic search quietly loses to a decades-old baseline.

The principle

Dense embedding retrievers dominate in-domain but frequently underperform BM25 once you leave the training distribution. Exact-match terms, product codes, names, and rare jargon are where embeddings blur and lexical search shines. In-domain accuracy doesn't predict out-of-domain generalization. Strong systems combine the two so that each retriever catches the queries the other misses.

Why it happens

Dense retrievers compress text into a fixed-length vector where meaning is smeared across dimensions. Exact tokens such as SKUs, error codes, names, and rare jargon lose their distinctiveness and collapse toward similar-looking neighbors. Those are exactly the cases where a lexical method that matches the literal string excels. The BEIR benchmark made the generalization gap concrete: dense models that beat BM25 in-domain frequently underperformed it on out-of-distribution datasets, showing that in-domain accuracy does not predict zero-shot robustness.

The standard remedy is to run both and fuse their ranked lists. Reciprocal rank fusion (RRF) is the canonical method because it combines rankings using only positions and needs no score calibration; in the paper that introduced it, it outperformed Condorcet fusion and the individual rankers it combined. Lexical and semantic retrieval tend to fail on different queries, so combining them recovers many of the queries either alone would miss.
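To make reciprocal rank fusion concrete, here is a minimal sketch. Each document's fused score is the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is the value used in the original RRF paper; the function name and the document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse best-first lists of document IDs using only rank positions.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant that damps the influence of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs from a lexical and a dense retriever.
bm25_ranking = ["doc_sku", "doc_a", "doc_b"]
dense_ranking = ["doc_a", "doc_c", "doc_sku"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['doc_a', 'doc_sku', 'doc_c', 'doc_b']
```

Documents that both retrievers surface rise to the top, while a document found by only one retriever still stays in the candidate list. No raw scores are compared, which is why the two retrievers never need to be calibrated against each other.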

Watch for

Pure embedding search nails paraphrased demo questions but fails on exact codes, IDs, or product names in production.
Out-of-domain or jargon-heavy queries return near-identical-looking but wrong matches.
Retrieval was validated only on in-distribution examples similar to the embedding training data.
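One way to surface these symptoms before production is to report recall separately for each kind of query instead of as a single average. The sketch below assumes you have labeled evaluation examples tagged with a slice name; the function, the slice names, and the stub retriever are all hypothetical.

```python
from collections import defaultdict


def recall_at_k_by_slice(examples, retrieve, k=10):
    """Return recall@k per query slice.

    examples: iterable of (query, relevant_doc_id, slice_name) tuples.
    retrieve: callable mapping a query string to a best-first list of doc IDs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for query, relevant_id, slice_name in examples:
        totals[slice_name] += 1
        if relevant_id in retrieve(query)[:k]:
            hits[slice_name] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Stub retriever with canned results, standing in for a real embedding search.
canned_results = {
    "how do I reset my password": ["kb-reset", "kb-login"],
    "forgot my login credentials": ["kb-reset", "kb-sso"],
    "AX-4400-B": ["part-ax-4400-a", "part-ax-4410-b"],
    "error E1042": ["kb-e1024", "kb-e1042"],
}


def stub_retrieve(query):
    return canned_results.get(query, [])


examples = [
    ("how do I reset my password", "kb-reset", "paraphrase"),
    ("forgot my login credentials", "kb-reset", "paraphrase"),
    ("AX-4400-B", "part-ax-4400-b", "exact_id"),
    ("error E1042", "kb-e1042", "exact_id"),
]

report = recall_at_k_by_slice(examples, stub_retrieve, k=1)
print(report)
```

On this toy data the paraphrase slice scores 1.0 and the exact_id slice scores 0.0 at k = 1. A single blended recall figure would hide exactly that gap.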

In practice

Your pure-embedding search nails paraphrased questions in the demo, then face-plants in production: a user searches for SKU 'AX-4400-B' or an error code, and the dense vectors blur it into a dozen near-identical part numbers. Default to hybrid: run BM25 alongside semantic search, fuse the results, and put a reranker on top. The decades-old lexical baseline is exactly what rescues your out-of-domain and exact-match queries.
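To see why a lexical scorer holds up on the SKU case, here is a small self-contained BM25 sketch. It is an illustration, not a production index: the tokenizer regex, the parameter defaults (k1 = 1.5, b = 0.75), the non-negative IDF variant, and the catalog entries are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Assumption: keep hyphenated codes such as "ax-4400-b" as a single token.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    return TOKEN.findall(text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (dict of doc_id -> text) for a query; return matching IDs."""
    if not docs:
        return []
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs or 1.0
    # Document frequency: number of docs containing each term.
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [d for d in ranked if scores[d] > 0]


# Hypothetical catalog of near-identical part numbers.
catalog = {
    "p1": "AX-4400-A hydraulic pump seal kit",
    "p2": "AX-4400-B hydraulic pump seal kit",
    "p3": "AX-4410-B hydraulic pump gasket",
}

ranked = bm25_rank("AX-4400-B", catalog)
print(ranked)  # ['p2']
```

Only the document containing the literal token matches, so the lookalike part numbers never enter the list. The tokenizer is the design choice that matters here: splitting on hyphens instead would let a partial query like "AX-4400" match, at the cost of also matching every sibling part.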

Apply it

Run lexical and semantic retrieval in parallel and fuse their ranked lists rather than relying on embeddings alone.
Combine ranked results with a position-based fusion method that needs no score calibration between retrievers.
Add a reranker over the fused candidates to compound precision, especially for exact-match and out-of-domain queries.
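The three steps above can be sketched as a single function. This is a minimal outline under stated assumptions: `lexical_search`, `semantic_search`, and `rerank` are hypothetical callables you would supply (for example a BM25 index, a vector store query, and a cross-encoder scorer), and the fusion step uses reciprocal rank fusion as one position-based option.

```python
from concurrent.futures import ThreadPoolExecutor


def hybrid_search(query, lexical_search, semantic_search, rerank,
                  top_n=50, final_k=5, rrf_k=60):
    """Run two retrievers in parallel, fuse by rank, then rerank.

    lexical_search, semantic_search: callables returning best-first doc IDs.
    rerank: callable (query, doc_id) -> relevance score, higher is better.
    """
    # Step 1: run both retrievers concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical_future = pool.submit(lexical_search, query)
        semantic_future = pool.submit(semantic_search, query)
        rankings = [
            lexical_future.result()[:top_n],
            semantic_future.result()[:top_n],
        ]

    # Step 2: position-based fusion, so no score calibration is needed.
    fused_scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    candidates = sorted(fused_scores, key=lambda d: (-fused_scores[d], d))[:top_n]

    # Step 3: rerank the fused candidates; the stable sort keeps fused
    # order among documents the reranker scores equally.
    reranked = sorted(candidates, key=lambda d: rerank(query, d), reverse=True)
    return reranked[:final_k]


# Stubs standing in for real components.
def stub_lexical(query):
    return ["part-ax-4400-b", "part-ax-4400-a"]


def stub_semantic(query):
    return ["part-ax-4410-b", "part-ax-4400-a", "part-ax-4400-b"]


def stub_rerank(query, doc_id):
    # Toy scorer: 1.0 when the query string appears in the doc ID.
    return 1.0 if query.lower() in doc_id else 0.0


results = hybrid_search("AX-4400-B", stub_lexical, stub_semantic, stub_rerank, final_k=2)
print(results)
```

Threads suit this step because retriever calls are typically I/O-bound network requests. The reranker only sees the fused shortlist, which keeps an expensive scorer affordable.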

The takeaway

Default to hybrid (semantic + keyword/BM25) search, not embeddings alone, especially for jargon, IDs, and out-of-domain queries. Add a reranker on top to compound the gains.
