What is re-ranking in a RAG system?

Re-ranking is a second-pass scoring step applied after initial retrieval. A more powerful but slower model — typically a cross-encoder — scores each candidate chunk against the query directly, reorders them by relevance and selects the top N for the language model's context window. It improves precision without requiring a full re-index.

What is hybrid search and why is it better than pure vector search?

Hybrid search combines vector (semantic) search with keyword (lexical) search — typically BM25 — and merges the result sets using a fusion algorithm such as Reciprocal Rank Fusion. It outperforms pure vector search when queries include exact terms like product codes, legal citations or proper nouns that semantic similarity alone may not rank highly.

When should I use re-ranking vs hybrid search?

These are complementary, not competing. Hybrid search improves the initial candidate set by ensuring both semantic and lexical matches are captured. Re-ranking then refines that candidate set before passing results to the language model. Production RAG systems that require high precision typically use both in sequence.

Re-ranking and Hybrid Search for RAG

Quick answer

First-pass retrieval — whether keyword or vector — is fast but imprecise. It retrieves a broad candidate set from which the language model is expected to reason. Re-ranking and hybrid search are two techniques that refine this candidate set before it reaches the model: hybrid search broadens the candidate pool by combining semantic and lexical signals, while re-ranking narrows it by applying a more precise relevance scoring step. Together, they constitute the standard approach for high-quality production RAG pipelines.

What this means

In a standard RAG pipeline, a query is embedded and the vector store returns the top K most semantically similar chunks. This is effective for conceptual queries but has two structural weaknesses: it may miss exact-match content (product codes, legal citations, proper nouns), and it ranks purely by vector distance, which is a proxy for relevance rather than a direct measure of it.

Hybrid search addresses the first problem. Re-ranking addresses the second. Both operate between the initial retrieval step and the language model's context assembly step — they are retrieval refinement layers, not replacements for the underlying index.

Why it matters for business

In enterprise deployments, retrieval quality is the primary lever for answer accuracy. A language model given a relevant, well-ranked context window will produce a better answer than the same model given a noisy, poorly ranked one. The relationship is not subtle — retrieval is frequently the binding constraint on system performance.

For Australian organisations deploying AI over high-stakes corpora — compliance libraries, legal documentation, technical procedures, financial data — the cost of a missed or incorrectly ranked chunk is not just a worse answer; it is a compliance risk or an operational error. Advanced retrieval techniques are not optional enhancements; for these contexts, they are baseline requirements.

How it works technically

Hybrid search combines two retrieval modalities:

Dense retrieval (vector search): Embedding-based semantic matching using ANN search over a vector index.
Sparse retrieval (keyword/lexical search): Term-frequency-based matching using BM25 or TF-IDF over an inverted index.
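
To make the sparse leg concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace tokenisation, a non-empty document list, and the commonly used parameter values k1=1.5 and b=0.75. The function name and sample documents are illustrative only, and a production system would use the BM25 implementation built into its search engine rather than code like this.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (assumes docs is non-empty)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalisation.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical documents: the exact product code only matches the first one.
docs = [
    "part number AX-4410 replacement filter",
    "general guidance on filter maintenance",
    "annual leave policy",
]
scores = bm25_scores("AX-4410 filter", docs)
```

The first document scores highest because it matches both the exact code and the word "filter", which is the kind of exact-term signal that embedding similarity alone can under-rank.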

The two result sets are merged using a fusion algorithm. Reciprocal Rank Fusion (RRF) is the most common: each document's score is the sum of 1/(k + rank) over the result lists in which it appears, where rank is the document's 1-based position in that list and k is a smoothing constant (typically 60). Documents ranked highly in both lists score most strongly. Because RRF uses only ranks, not raw scores, it is robust to score scale differences between the two modalities and needs no score normalisation or calibration of the individual rankers.
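
The fusion step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each retriever returns a best-first list of document IDs. The function name and sample IDs are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Here doc_b comes out on top because it ranks well in both lists, while doc_c and doc_d, each found by only one retriever, fall to the bottom.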

Re-ranking uses a cross-encoder: a model that takes a (query, chunk) pair as joint input and produces a single relevance score. Unlike bi-encoder embeddings, which compute query and document vectors independently, a cross-encoder processes the full query and document together, enabling richer relevance judgement. The trade-off is speed: a cross-encoder must run once per (query, chunk) pair at query time, so it is too slow for initial retrieval over millions of documents but fast enough to re-rank a small candidate set of 20–50 chunks. Common options include hosted services such as Cohere Rerank, a range of open-source cross-encoders, and versions fine-tuned for specific domains.
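
The shape of the re-ranking step can be sketched as below. The scoring function here is a toy term-overlap stand-in, not a cross-encoder. It is an assumption made so the example runs without any model, and in practice `score_fn` would wrap whichever cross-encoder or hosted re-ranker you choose. All names are illustrative.

```python
from typing import Callable, List, Tuple

def rerank(query: str, chunks: List[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 5) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair jointly and keep the top_n chunks."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    # Stable sort: chunks with equal scores keep their first-pass order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query terms found in the chunk.
    NOT a cross-encoder; replace with a real model's pairwise score."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & chunk_terms) / len(query_terms)

candidates = [
    "Leave entitlements are set out in the enterprise agreement.",
    "Clause 14.2 covers termination notice periods.",
    "The office kitchen is cleaned weekly.",
]
top = rerank("termination notice clause 14.2", candidates,
             toy_overlap_score, top_n=2)
```

Keeping the scorer behind a plain callable means a hosted API and a locally hosted model can be swapped without touching the rest of the pipeline.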

The combined pipeline: sparse + dense retrieval → RRF fusion (candidate set of ~50) → cross-encoder re-ranking → top 5–10 chunks → language model context window.
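
One way to wire these stages together is sketched below. Every callable passed in (the two search functions, the pair scorer and the text lookup) is a hypothetical placeholder for whatever vector store, keyword index and re-ranker a given deployment uses. The stub data exists only so the sketch runs end to end.

```python
from typing import Callable, Dict, List

def hybrid_rerank_pipeline(
    query: str,
    dense_search: Callable[[str, int], List[str]],
    sparse_search: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    fetch_text: Callable[[str], str],
    per_leg_k: int = 50,
    candidate_k: int = 50,
    final_n: int = 8,
    rrf_k: int = 60,
) -> List[str]:
    # Stage 1: run both retrieval legs, each returning best-first chunk IDs.
    legs = [dense_search(query, per_leg_k), sparse_search(query, per_leg_k)]
    # Stage 2: Reciprocal Rank Fusion over the two ranked lists.
    fused: Dict[str, float] = {}
    for ranking in legs:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    candidates = sorted(fused, key=lambda c: (-fused[c], c))[:candidate_k]
    # Stage 3: re-rank the fused candidates with the pairwise scorer.
    reranked = sorted(candidates,
                      key=lambda c: score_pair(query, fetch_text(c)),
                      reverse=True)
    # Stage 4: keep only what goes into the model's context window.
    return reranked[:final_n]

# Stub components standing in for real indexes and a real re-ranker.
corpus = {
    "c1": "BM25 is a lexical ranking function",
    "c2": "Embeddings capture semantic similarity",
    "c3": "Reciprocal Rank Fusion merges ranked lists",
}
dense = lambda q, k: ["c2", "c3", "c1"][:k]
sparse = lambda q, k: ["c3", "c1"][:k]
overlap = lambda q, text: len(set(q.lower().split()) & set(text.lower().split()))

result = hybrid_rerank_pipeline(
    "how does reciprocal rank fusion work",
    dense, sparse, overlap, corpus.__getitem__, final_n=2,
)
```

The defaults mirror the candidate and context sizes mentioned in the text and are starting points to benchmark, not recommendations.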

Practical implementation considerations

Implementing hybrid search requires maintaining two indexes: a vector index for dense retrieval and a text index for sparse retrieval. Some platforms (Weaviate, OpenSearch, Elasticsearch with vector search) support both in a single system. Others require a separate BM25 search layer alongside the vector store, with a fusion step at the application layer.

Re-ranking via a hosted API (Cohere Rerank, Jina Reranker) adds a network call to the retrieval pipeline, introducing latency. For most enterprise applications, the additional 100–300ms is acceptable given the quality lift. For latency-sensitive real-time applications, a smaller, locally hosted cross-encoder may be preferable.

The optimal top-K for initial retrieval depends on corpus size and query complexity. A practical starting point is K=20–50 from the hybrid retrieval step, re-ranked down to the top 5–8 chunks for the language model context window. Edison AI's implementation team benchmarks these parameters against domain-specific evaluation sets for each deployment rather than relying on defaults, because optimal configurations vary significantly by corpus type and query distribution.

Common mistakes

Treating re-ranking as optional. In any deployment where answer precision matters, passing the raw top-K vector results directly to the model leaves measurable quality on the table.
Ignoring the sparse retrieval leg of hybrid search. Pure vector search underperforms on queries with exact-match requirements — which are common in enterprise contexts (regulatory references, product identifiers, named individuals).
Not tuning the RRF fusion weights. Standard RRF weights the dense and sparse lists equally, which is a reasonable starting point, not a permanent configuration. Weighted variants multiply each list's contribution by a per-modality weight, and evaluation against a domain test set often reveals that one modality should be weighted more heavily.
Re-ranking too small a candidate set. If the initial hybrid retrieval only returns K=5, the re-ranker has little to work with. The initial retrieval must be generous enough that the correct answer is almost certainly in the candidate pool.
Skipping evaluation after adding re-ranking. Re-ranking generally improves precision but can occasionally lower recall if the cross-encoder has domain gaps. Measure both before and after.
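
On the fusion-weights point above, a weighted RRF variant can be sketched as follows. The weights shown are arbitrary illustrative values, not recommendations, and the function name is made up for the example. Suitable weights would come from evaluation on a domain test set.

```python
def weighted_rrf(rankings, weights, k=60):
    """RRF where each list's contribution is scaled by a per-list weight."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["doc_a", "doc_b"]   # hypothetical vector-search order
sparse_hits = ["doc_b", "doc_a"]  # hypothetical BM25 order

equal = weighted_rrf([dense_hits, sparse_hits], [1.0, 1.0])
sparse_heavy = weighted_rrf([dense_hits, sparse_hits], [1.0, 2.0])
```

With equal weights the two documents tie and fall back to ID order. Doubling the sparse weight moves doc_b, the keyword leg's top hit, into first place.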

What leaders should do next

Assess your current RAG retrieval pipeline. If it uses pure vector search with no lexical component, add a BM25 layer and RRF fusion — this is often the highest-return improvement available. If first-pass retrieval quality is acceptable but answer precision remains inconsistent, add a cross-encoder re-ranker over the top 20–50 candidates. Measure the impact using precision at K and answer faithfulness scores against a domain-specific test set before and after each change.
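
A before-and-after comparison of retrieval quality can be sketched as below, assuming a test set in which each query has a known set of relevant chunk IDs. The IDs and rankings are invented for illustration. Answer faithfulness needs a separate evaluation of the generated answers and is not covered here.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k retrieved IDs that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical labelled query: two chunks are known to be relevant.
relevant = {"c1", "c4"}
before = ["c2", "c1", "c3", "c4", "c5"]  # ranking before the change
after = ["c1", "c4", "c2", "c3", "c5"]   # ranking after the change

before_p = precision_at_k(before, relevant, k=2)
after_p = precision_at_k(after, relevant, k=2)
before_r = recall_at_k(before, relevant, k=2)
after_r = recall_at_k(after, relevant, k=2)
```

Averaging these figures over the whole test set, before and after each pipeline change, shows whether the change helped and whether precision was gained at the expense of recall.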

Edison AI builds bespoke AI systems — including retrieval over your own documents — for Australian businesses.

Take the next step

Ready to put this into practice?

Edison AI helps Australian businesses move from AI curiosity to practical implementation, with workflow design, team training and measurable outcomes. Tell us about your setup and we'll come back with a sequenced plan grounded in the same thinking you just read.
