RAG · February 10, 2026 · 8 min read

Reranking for RAG: How Cross-Encoders Improve Retrieval Quality and LLM Accuracy

Learn how reranking with cross-encoders boosts RAG retrieval quality. Bi-encoders are fast but shallow—discover why retrieving more candidates and reranking to the top few dramatically improves answer accuracy in retrieval augmented generation.

Vector search is fast, but it scores the query and each document independently. Cross-encoder reranking reads them together for a more accurate relevance score. Here's how reranking improves your RAG pipeline, and why "retrieve 50, rerank to 5" almost always beats "retrieve 5".

The Problem: Bi-Encoders Are Fast but Shallow

In retrieval augmented generation (RAG), vector search uses bi-encoders: the query and each document are embedded separately, and similarity is computed as a dot product or cosine. This is efficient—you can search millions of vectors in milliseconds—but the scoring is shallow.
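To make the "independent scoring" point concrete, here is a minimal sketch of bi-encoder scoring once embeddings already exist: the query vector is compared against each precomputed document vector, and the model never sees the two texts together. The vectors and function name below are illustrative placeholders, not output from any particular embedding model.

```typescript
// Minimal sketch: bi-encoder scoring compares precomputed vectors.
// The embeddings below are illustrative stand-ins for real model output.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Query and document were embedded separately; only the vectors interact here.
const queryEmbedding = [0.12, -0.34, 0.56];
const docEmbedding = [0.1, -0.3, 0.6];
console.log(cosineSimilarity(queryEmbedding, docEmbedding));
```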

Consider a query like "Can managers approve their own expense reports?" A chunk about "expense reports must be approved by a direct manager" might score high on vector similarity because it shares tokens and concepts. But the chunk that actually answers the question—"self-approval of expense reports is prohibited"—may rank lower. The bi-encoder cannot fully capture the distinction between these two.

Cross-Encoders: Reading Query and Document Together

A cross-encoder takes the query and document as a single input and produces a relevance score. It sees the full interaction: negation, specificity, and context. The trade-off is cost—you can't pre-compute embeddings for every query-document pair, so cross-encoders are used only at query time, on a small candidate set.
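In code, the difference shows up in the interface: a cross-encoder scores one (query, document) pair per call rather than comparing two precomputed vectors. The sketch below assumes a hypothetical HTTP scoring endpoint and response shape; substitute whatever rerank model or hosted service you actually use.

```typescript
// Sketch of a cross-encoder call: the query and document travel together,
// so the model can weigh negation, specificity, and context across both.
// The endpoint URL and response shape are hypothetical placeholders.
async function crossEncoderScore(query: string, document: string): Promise<number> {
  const response = await fetch("https://example.com/cross-encoder/score", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, document }),
  });
  const data = (await response.json()) as { score: number };
  return data.score; // higher means more relevant
}
```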

The winning pattern: retrieve many candidates cheaply (e.g., top 50 via vector search), then rerank to the top 5 with a cross-encoder. You get the speed of vector search and the accuracy of cross-encoder scoring.

How Reranking Works in Practice

When reranking is enabled in a RAG agent or pipeline node, the flow looks like this:

  1. Retrieve: Fetch a larger candidate set (e.g., 50 chunks) from your vector index using your usual embedding model.
  2. Rerank: Send the query and candidate texts to a rerank API (e.g., Cohere Rerank) and get relevance scores.
  3. Trim: Take the top K (e.g., 5) and pass them to the LLM as context.

The LLM receives only the most relevant chunks, reducing noise and improving answer quality. Rerank scores can also be used for threshold filtering—dropping chunks below a relevance cutoff—which is more meaningful than applying thresholds on bi-encoder scores.
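Putting the three steps together, a retrieve-then-rerank helper might look like the sketch below. The vectorSearch and rerank callbacks are hypothetical stand-ins for your vector index client and rerank provider (such as Cohere Rerank), and the threshold value is illustrative rather than a recommendation.

```typescript
interface Chunk { id: string; text: string; vectorScore: number; }
interface RerankedChunk extends Chunk { rerankScore: number; }

// The search and rerank callbacks are injected so this sketch stays
// independent of any specific vector store or rerank provider.
async function retrieveContext(
  query: string,
  vectorSearch: (q: string, topK: number) => Promise<Chunk[]>,
  rerank: (q: string, docs: string[]) => Promise<number[]>,
  opts = { retrieveTopK: 50, finalTopK: 5, minScore: 0.3 },
): Promise<RerankedChunk[]> {
  // 1. Retrieve: cast a wide, cheap net with the vector index.
  const candidates = await vectorSearch(query, opts.retrieveTopK);

  // 2. Rerank: one cross-encoder score per (query, chunk) pair.
  const scores = await rerank(query, candidates.map((c) => c.text));
  const scored: RerankedChunk[] = candidates.map((c, i) => ({ ...c, rerankScore: scores[i] }));

  // 3. Trim: drop chunks below the relevance cutoff, then keep the top K.
  return scored
    .filter((c) => c.rerankScore >= opts.minScore)
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, opts.finalTopK);
}
```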

When to Use Reranking

Reranking is especially valuable when:

  • Your corpus is large or noisy: Vector search returns many near-matches; reranking cuts through the clutter.
  • Queries are nuanced: Questions involving negation, specificity, or subtle distinctions benefit from cross-encoder understanding.
  • You care about answer accuracy: The extra latency and API cost of reranking is justified by fewer hallucinations and better citations.

Fail-Open Behavior

In production, rerank APIs can fail (rate limits, network issues, key problems). A robust implementation should fail open: if reranking fails, fall back to the original vector-sorted order rather than breaking the query. Users still get answers; they just may be slightly less precise.
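One way to express that fallback, reusing the hypothetical Chunk type and rerank callback shape from the pipeline sketch above:

```typescript
// Fail-open sketch: if the rerank call throws (rate limit, timeout, bad key),
// return the vector-sorted candidates instead of failing the whole query.
async function rerankOrFallback(
  query: string,
  candidates: Chunk[],
  rerank: (q: string, docs: string[]) => Promise<number[]>,
  finalTopK = 5,
): Promise<Chunk[]> {
  try {
    const scores = await rerank(query, candidates.map((c) => c.text));
    return candidates
      .map((c, i) => ({ ...c, rerankScore: scores[i] }))
      .sort((a, b) => b.rerankScore - a.rerankScore)
      .slice(0, finalTopK);
  } catch (error) {
    // Fail open: log the error and keep the original vector order, trimmed to the same size.
    console.warn("Rerank failed, falling back to vector order", error);
    return candidates.slice(0, finalTopK);
  }
}
```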

ShinRAG: Reranking Built In

ShinRAG supports optional reranking at the agent and pipeline node level. Configure enableRerank, rerankRetrieveTopK, and rerankTopK to tune the recall vs. precision trade-off. If the rerank API is unavailable, results degrade gracefully to vector order.
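For a sense of how those options fit together, here is an illustrative configuration. The option names come from this post, but the surrounding object shape is a hypothetical example, not ShinRAG's exact API.

```typescript
// Illustrative shape only: option names are from the post, structure is hypothetical.
const agentConfig = {
  enableRerank: true,     // turn cross-encoder reranking on
  rerankRetrieveTopK: 50, // candidates pulled from the vector index
  rerankTopK: 5,          // chunks kept after reranking and passed to the LLM
};
```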

Improve Your RAG Retrieval with Reranking

Create RAG agents with optional reranking in ShinRAG. Retrieve more, rerank to the best, and pass only the most relevant context to your LLM. No infrastructure required.

Get Started Free