PE‑Rank: compress passages into embeddings to speed LLM listwise reranking

June 21, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Experiments on TREC DL and BEIR show consistent speedups and small quality loss; results use one LLM (Mistral‑7B) and several embedding models, so practical gains likely transfer but depend on your embedding and LLM choices.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Qi Liu, Bo Wang, Nan Wang, Jiaxin Mao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

PE‑Rank makes listwise LLM reranking fast enough for online use by cutting inference time ~4–5× with only minor ranking loss on evaluated benchmarks.

Who Should Care

Summary TLDR

PE‑Rank replaces full passages with their dense embeddings (mapped into LLM token space) and uses a dynamic constrained decoding step to produce listwise rankings. It trains in two stages (alignment using text reconstruction; listwise learning-to-rank with KL distillation). On standard benchmarks PE‑Rank keeps ranking quality close to an uncompressed LLM reranker while cutting inference latency roughly 4–5× and greatly reducing processed/generated tokens. Code: https://github.com/liuqi6777/pe_rank.
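As a rough illustration of the embedding-to-token mapping described above, the sketch below projects toy passage embeddings into a toy LLM embedding dimension and appends one vector per passage to the prompt. The names `PassageProjector` and `build_inputs`, the two-layer GELU design, the random weights, and the dimensions are all assumptions made for illustration; this summary only states that the mapping is an MLP.

```python
import math
import random


def gelu(x: float) -> float:
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    # weight has shape (out_dim, in_dim); returns a vector of length out_dim
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Small random weights; a real projector would be trained in the alignment stage
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class PassageProjector:
    """Hypothetical two-layer MLP: retrieval embedding -> LLM input embedding."""

    def __init__(self, embed_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(embed_dim, llm_dim, rng)
        self.w2, self.b2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, passage_embedding):
        hidden = [gelu(h) for h in linear(passage_embedding, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


def build_inputs(prompt_token_embeddings, passage_embeddings, projector):
    # Each passage contributes exactly one vector, regardless of its text length
    return list(prompt_token_embeddings) + [projector(e) for e in passage_embeddings]


# Toy dimensions: 4-d retrieval embeddings, 6-d LLM embeddings
projector = PassageProjector(embed_dim=4, llm_dim=6)
passages = [[0.1, -0.2, 0.3, 0.4], [0.0, 0.5, -0.1, 0.2], [0.9, 0.1, 0.0, -0.3]]
prompt = [[0.0] * 6, [0.1] * 6]
inputs = build_inputs(prompt, passages, projector)
print(len(inputs), len(inputs[0]))  # 5 6
```

The point of the sketch is the shape of the input: two prompt vectors plus three passage vectors, so the sequence grows by one position per passage.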

Problem Statement

Listwise LLM rerankers are effective but hit practical limits: long passages blow past LLM context windows and cause high inference latency. Existing single‑pass compression methods do not scale to ranking many passages.

Main Contribution

PE‑Rank: represent each passage by its retrieval embedding, map it into LLM token space, and treat it as a special token to compress inputs for listwise reranking.

Dynamic‑Constrained Decoding: constrain the LLM output space to the passage tokens not yet ranked and decode the ranking step by step, which speeds up and stabilizes generation.
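The decoding idea can be sketched as a greedy loop that, at every step, scores only the passages that have not been ranked yet. This is a minimal sketch under stated assumptions: `next_hidden_state` is a hypothetical stand-in for one LLM forward step, the dot-product scoring is an assumption about how passage tokens are scored, and the function name `constrained_rank` is illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def constrained_rank(passage_vectors, next_hidden_state):
    """Greedy listwise decoding restricted to passages not yet ranked.

    passage_vectors: dict mapping passage id -> vector used to score its token
    next_hidden_state: hypothetical callable standing in for one LLM step;
        it receives the ids ranked so far and returns a state vector
    """
    remaining = set(passage_vectors)
    ranking = []
    while remaining:
        state = next_hidden_state(ranking)
        # Only remaining passage tokens are valid outputs at this step;
        # sorting first makes tie-breaking deterministic
        best = max(sorted(remaining), key=lambda pid: dot(state, passage_vectors[pid]))
        ranking.append(best)
        remaining.remove(best)
    return ranking


vectors = {"p1": [1.0, 0.0], "p2": [0.6, 0.8], "p3": [0.0, 1.0]}
fixed_state = lambda ranked: [0.2, 1.0]  # toy state that ignores history
print(constrained_rank(vectors, fixed_state))  # ['p3', 'p2', 'p1']
```

Because the candidate set shrinks by one each step, the output is always a permutation of the inputs, with no duplicates or malformed identifiers to repair afterwards.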

Key Findings

PE‑Rank cuts end‑to‑end reranking latency by about 4–5× while keeping ranking quality close to the uncompressed LLM.

Numbers: DL19 top‑100: latency 16.20 s -> 3.62 s (×0.22, about 4.5× faster); NDCG@10 drop < 2%

Practical Use: If you need listwise LLM reranking in production, PE‑Rank can cut per‑query latency for top‑100 reranking from roughly 16 s to about 3–4 s, with only minor accuracy loss on the evaluated benchmarks.

Evidence Ref: Abstract; Sec 5.2; Table 3

PE‑Rank greatly reduces tokens processed and generated during LLM inference.

Numbers: DL19 top‑100: #processed tokens 19506 -> 2942 (≈6.6× fewer); #generated tokens 910 -> 180 (≈5.1× fewer)

Practical Use: Fewer tokens lower compute cost and memory pressure, which helps scale to longer documents and larger candidate sets.

Evidence Ref: Sec 5.2; Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
NDCG@10 (TREC DL19, rerank top-100, BM25 retrieval) | PE‑Rank 0.7048 | RankMistral_p 0.7173 | -0.0125 | TREC DL19 | Table 2; Sec 5.1 | Table 2
Latency per query (rerank top-100, DL19) | PE‑Rank 3.62 s | RankMistral_p 16.20 s | -12.58 s (≈4.5× faster) | TREC DL19 | Table 3; Fig 4 | Table 3
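The deltas reported above can be rechecked with a few lines of arithmetic. The snippet uses only the figures quoted in this summary (DL19, top‑100); the variable names are illustrative.

```python
# Figures quoted in this summary for TREC DL19, rerank top-100
baseline_latency_s, pe_rank_latency_s = 16.20, 3.62
baseline_ndcg, pe_rank_ndcg = 0.7173, 0.7048
baseline_processed, pe_rank_processed = 19506, 2942
baseline_generated, pe_rank_generated = 910, 180

speedup = baseline_latency_s / pe_rank_latency_s          # how many times faster
latency_ratio = pe_rank_latency_s / baseline_latency_s    # the "x0.22" figure
ndcg_delta = pe_rank_ndcg - baseline_ndcg                 # absolute change
ndcg_relative_drop = -ndcg_delta / baseline_ndcg          # fraction of baseline lost
processed_reduction = baseline_processed / pe_rank_processed
generated_reduction = baseline_generated / pe_rank_generated

print(f"speedup: {speedup:.2f}x, latency ratio: {latency_ratio:.2f}")
print(f"NDCG@10 delta: {ndcg_delta:.4f} ({ndcg_relative_drop:.1%} relative drop)")
print(f"processed tokens: {processed_reduction:.1f}x fewer")
print(f"generated tokens: {generated_reduction:.1f}x fewer")
```

The relative NDCG@10 drop comes out a little under 2%, consistent with the "< 2%" figure in Key Findings.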

What To Try In 7 Days

Prototype PE‑Rank on a small reranking pipeline: encode top‑100 candidates with your embedding model and map them via an MLP into an LLM as special tokens.

Measure NDCG@10 and latency for top‑20 and top‑100 reranks; compare to your current LLM or supervised reranker.

Enable dynamic‑constrained decoding so the LLM only outputs passage tokens, and benchmark decoding time separately from prefilling time.
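For the measurement step, a small self-contained NDCG@10 function and a timing helper are usually enough to get a first comparison. The sketch below assumes graded relevance labels and the exponential-gain form of DCG; evaluation tools differ on the gain function, so match the formula to whichever tool produced your reference numbers. The names `ndcg_at_k` and `timed` are illustrative.

```python
import math
import time


def dcg_at_k(relevances, k=10):
    # Exponential gain with a log2 position discount (rank 0 is the top result)
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """ranked_ids: passage ids in predicted order; qrels: id -> graded relevance."""
    gains = [qrels.get(pid, 0) for pid in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def timed(fn, *args):
    # Wall-clock timing for one call; average over many queries in practice
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


qrels = {"a": 3, "b": 2, "c": 0}
perfect, elapsed = timed(ndcg_at_k, ["a", "b", "c"], qrels)
reversed_order = ndcg_at_k(["c", "b", "a"], qrels)
print(perfect, reversed_order < perfect)
```

To separate prefilling from decoding time, the same `timed` helper can wrap each phase of your own inference code independently.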

Optimization Features

Token Efficiency
Input length scales with number of passages, not passage length (O(n) vs O(nLp))
Generated tokens equal to number of candidates (n) rather than many numeric/text tokens
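A back-of-the-envelope estimate makes the O(n) versus O(nLp) difference concrete. The instruction length, per-passage overhead, and passage length below are made-up illustrative values, not figures from the paper, and the function names are hypothetical.

```python
def text_prompt_tokens(n_passages, avg_passage_tokens,
                       instruction_tokens=50, per_passage_overhead=4):
    # Full-text listwise prompt: every passage contributes its whole token length
    return instruction_tokens + n_passages * (avg_passage_tokens + per_passage_overhead)


def embedding_prompt_tokens(n_passages,
                            instruction_tokens=50, per_passage_overhead=4):
    # Compressed prompt: every passage contributes a single special token
    return instruction_tokens + n_passages * (1 + per_passage_overhead)


n, passage_len = 20, 100  # illustrative values only
print(text_prompt_tokens(n, passage_len))   # 2130
print(embedding_prompt_tokens(n))           # 150
```

Doubling `passage_len` roughly doubles the first number and leaves the second unchanged, which is the scaling property claimed above.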
Infra Optimization
Lower token counts reduce GPU compute and memory pressure at inference
System Optimization
Uses DeepSpeed ZeRO, BFloat16, and FlashAttention during training to save memory
Training Optimization

Two‑stage training: alignment (frozen LLM+encoder, train MLP) then fine‑tune MLP+LLM with listwise l

KL distillation to mimic uncompressed text-based ranking
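The distillation term can be pictured as a KL divergence between two distributions over the same candidates: one produced from full-text input (teacher) and one from compressed input (student). This is a sketch only. The direction KL(teacher || student), the softmax over per-passage scores, and the toy scores are assumptions; the summary does not give the exact loss.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    # KL(teacher || student); eps guards against log(0)
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_probs, student_probs))


teacher_scores = [2.0, 1.0, 0.1]  # hypothetical scores with full-text passages
student_scores = [1.5, 1.2, 0.3]  # hypothetical scores with embedding tokens
loss = kl_divergence(softmax(teacher_scores), softmax(student_scores))
print(loss > 0.0)  # True
```

The loss shrinks toward zero as the student's distribution over candidates approaches the teacher's, which is the sense in which the compressed model is trained to mimic text-based ranking.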

Inference Optimization
Compress passages to single embeddings mapped as LLM special tokens
Dynamic‑Constrained Decoding: restrict outputs to remaining passage tokens per step
Greedy decoding over small token set reduces decode time

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

MS MARCO (public dataset), TREC DL (public), Wikipedia dump Dec 2020 (public)

Risks & Boundaries

Limitations

Relies on a good embedding model; poor embeddings reduce reranking quality.

Alignment requires an extra training stage and access to the LLM weights for fine‑tuning.

When Not To Use

When the highest possible NDCG is required and any drop is unacceptable.

If you cannot fine‑tune the LLM or add the MLP mapping layer.

Failure Modes

The mapping MLP fails to align the embedding and token spaces, so the LLM cannot interpret the embeddings as passage semantics.

Embedding model encodes insufficient detail, causing ranking errors on fine‑grained relevance.

Core Entities

Models

Mistral-7B-Instruct-v0.2, Jina-Embeddings (jina-embeddings-v2-base-en), BGE-base, RankGPT, RankMistral, RankVicuna, RankZephyr, monoBERT, monoT5, MiniLM (used as annotation model)

Metrics

NDCG@10, processed #tokens, generated #tokens, latency (seconds)

Datasets

MS MARCO, TREC DL 2019, TREC DL 2020, BEIR (subset incl. Covid), Wikipedia dump (Dec 2020 sample)

Benchmarks

TREC DL, BEIR, MTEB (referenced for embedding comparison)