Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
PE‑Rank makes listwise LLM reranking fast enough for online use by cutting inference time ~4–5× with only minor ranking loss on evaluated benchmarks.
Summary TLDR
PE‑Rank replaces full passages with their dense embeddings (mapped into LLM token space) and uses a dynamic constrained decoding step to produce listwise rankings. It trains in two stages (alignment using text reconstruction; listwise learning-to-rank with KL distillation). On standard benchmarks PE‑Rank keeps ranking quality close to an uncompressed LLM reranker while cutting inference latency roughly 4–5× and greatly reducing processed/generated tokens. Code: https://github.com/liuqi6777/pe_rank.
Problem Statement
Listwise LLM rerankers are effective but hit practical limits: long passages blow past LLM context windows and cause high inference latency. Existing single‑pass compression methods do not scale to ranking many passages.
Main Contribution
PE‑Rank: represent each passage by its retrieval embedding, map it into LLM token space, and treat it as a special token to compress inputs for listwise reranking.
Dynamic‑Constrained Decoding: constrain the LLM output space to the remaining passage tokens and decode the ranking stepwise to speed up and stabilize generation.
Two‑stage training: (1) alignment via text reconstruction to map embeddings to LLM token space; (2) listwise learning‑to‑rank with KL distillation to transfer ranking behavior.
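A minimal sketch of the embedding-to-token-space mapping described above, in plain Python so it runs without a deep-learning framework. The two-layer shape, the GELU activation, the dimensions, and the names `Projector` and `build_llm_inputs` are assumptions made for illustration; the source only states that an MLP maps passage embeddings into the LLM token space.

```python
import math
import random


def gelu(x: float) -> float:
    # Tanh approximation of GELU; the activation choice is an assumption.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    # weight is a list of out_dim rows, each with in_dim entries.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class Projector:
    """Hypothetical two-layer MLP: retrieval-embedding space -> LLM input-embedding space."""

    def __init__(self, embed_dim, hidden_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(embed_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, llm_dim, rng)

    def __call__(self, passage_embedding):
        hidden = [gelu(h) for h in linear(passage_embedding, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


def build_llm_inputs(prompt_token_vectors, passage_embeddings, projector):
    # Each passage contributes exactly one vector, regardless of its text length.
    return list(prompt_token_vectors) + [projector(e) for e in passage_embeddings]
```

In this sketch the LLM input grows by one position per passage, which is the property the compression relies on. In a real system the projector weights would be learned during the alignment stage rather than randomly initialized.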
Key Findings
PE‑Rank cuts end‑to‑end reranking latency by about 4–5× while keeping ranking quality close to the uncompressed LLM.
PE‑Rank greatly reduces tokens processed and generated during LLM inference.
Ranking effectiveness remains competitive versus similarly trained listwise LLM baselines.
Results
NDCG@10 (TREC DL19, rerank top-100, BM25 retrieval)
Latency per query (rerank top-100, DL19)
Processed tokens (#Proc) (rerank top-100, DL19)
Who Should Care
What To Try In 7 Days
Prototype PE‑Rank on a small reranking pipeline: encode top‑100 candidates with your embedding model and map them via an MLP into an LLM as special tokens.
Measure NDCG@10 and latency for top‑20 and top‑100 reranks; compare to your current LLM or supervised reranker.
Enable dynamic‑constrained decoding so the LLM only outputs passage tokens, and benchmark decoding time separately from prefilling time.
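For the measurement steps above, a small evaluation harness might look like the following sketch. It assumes graded relevance labels listed in ranked order, uses the exponential-gain form of NDCG (some evaluation tools use linear gain instead, so absolute numbers may differ), and computes the ideal ordering only from the labels passed in. The helper names `ndcg_at_k` and `timed` are hypothetical.

```python
import math
import time


def dcg_at_k(relevances, k=10):
    # Exponential gain with a log2 rank discount; rank is 0-based here.
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    # Assumes ranked_relevances holds the labels of all judged candidates.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def timed(fn, *args):
    # Returns (result, elapsed_seconds) so stages can be timed separately.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Wrapping the prefill call and the decode loop in separate `timed` calls gives the per-stage breakdown; on a GPU, the device would need to be synchronized before reading the clock for the timings to be meaningful.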
Optimization Features
Token Efficiency
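- Input length scales with the number of passages n, not with passage length: O(n) instead of O(n·Lp), where Lp is the passage length in tokens
- Generated tokens equal the number of candidates (n), one special token per passage, rather than multi‑token numeric or text identifiers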
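The scaling difference can be made concrete with a rough token count. The function names and the example figures below (100 candidates, 100 tokens per passage, a 50-token prompt) are illustrative assumptions, not measurements from the paper.

```python
def text_input_tokens(num_passages: int, avg_passage_tokens: int, prompt_tokens: int) -> int:
    # Full-text listwise input: every passage contributes all of its tokens.
    return prompt_tokens + num_passages * avg_passage_tokens


def embedding_input_tokens(num_passages: int, prompt_tokens: int) -> int:
    # Embedding-based input: every passage contributes a single special token.
    return prompt_tokens + num_passages


def input_reduction(num_passages: int, avg_passage_tokens: int, prompt_tokens: int) -> float:
    # Ratio of full-text input length to embedding-based input length.
    full = text_input_tokens(num_passages, avg_passage_tokens, prompt_tokens)
    return full / embedding_input_tokens(num_passages, prompt_tokens)
```

Under these assumed figures the full-text input is 10,050 tokens and the embedding-based input is 150. Real savings depend on the prompt template, passage lengths, and any windowing strategy.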
Infra Optimization
- Lower token counts reduce GPU compute and memory pressure at inference
System Optimization
- Uses DeepSpeed ZeRO, BFloat16, and FlashAttention during training to save memory
Training Optimization
- Two‑stage training: alignment (frozen LLM+encoder, train MLP), then fine‑tune MLP+LLM with listwise learning‑to‑rank
- KL distillation to mimic uncompressed text-based ranking
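The distillation term can be sketched as a KL divergence between two distributions over the candidate passages at one decoding step: one from the text-based input (teacher) and one from the embedding-based input (student). The direction KL(teacher || student), the use of raw logits, and the function names are assumptions for illustration; the source does not specify the exact form of the loss.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    # KL(teacher || student) over the same set of candidate passages.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_steps, student_steps):
    # Average the per-step KL over all decoding steps of one ranking.
    pairs = list(zip(teacher_steps, student_steps))
    return sum(kl_divergence(t, s) for t, s in pairs) / len(pairs)
```

The loss is zero when the student reproduces the teacher's distribution at every step and grows as the two diverge. A training implementation would compute this on tensors with automatic differentiation rather than Python lists.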
Inference Optimization
- Compress passages to single embeddings mapped as LLM special tokens
- Dynamic‑Constrained Decoding: restrict outputs to remaining passage tokens per step
- Greedy decoding over small token set reduces decode time
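The decoding loop described above can be sketched as follows. Here `score_step` is a hypothetical callback standing in for one LLM forward pass: given the ranking produced so far, it returns one score per passage. Only passages not yet ranked are eligible at each step, so the output is always a valid permutation.

```python
def constrained_rank(score_step, num_passages):
    """Greedy listwise decoding restricted to passages not yet emitted."""
    remaining = set(range(num_passages))
    ranking = []
    while remaining:
        # One score per passage; scores of already-ranked passages are ignored.
        scores = score_step(ranking)
        # Iterating in sorted order makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=lambda i: scores[i])
        ranking.append(best)
        remaining.remove(best)
    return ranking
```

For example, with a scorer that always returns `[0.2, 0.9, 0.5]`, the loop yields `[1, 2, 0]`. Unconstrained greedy decoding with the same scorer would emit passage 1 at every step, which is the kind of malformed output the constraint rules out.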
Reproducibility
Code Urls
Data Urls
- MS MARCO (public dataset)
- TREC DL (public)
- Wikipedia dump Dec 2020 (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on a good embedding model; poor embeddings reduce reranking quality.
- Alignment requires an extra training stage and access to the LLM weights for fine‑tuning.
- Training conditions each step on gold ranking labels, while inference is autoregressive and conditions on the model's own earlier outputs (a train/inference mismatch).
When Not To Use
- When the highest achievable NDCG is required and no drop is acceptable.
- If you cannot fine‑tune the LLM or add the MLP mapping layer.
- If your embedding model does not capture task‑specific fine details.
Failure Modes
- Mapping MLP fails to align spaces and LLM cannot decode embeddings into passage semantics.
- Embedding model encodes insufficient detail, causing ranking errors on fine‑grained relevance.
- Mismatch between training (teacher uses full text) and inference (uses only embeddings) degrades performance if distillation is weak.
Core Entities
Models
- Mistral-7B-Instruct-v0.2
- Jina-Embeddings (jina-embeddings-v2-base-en)
- BGE-base
- RankGPT
- RankMistral
- RankVicuna
- RankZephyr
- monoBERT
- monoT5
- MiniLM (used as annotation model)
Metrics
- NDCG@10
- processed #tokens
- generated #tokens
- latency (seconds)
Datasets
- MS MARCO
- TREC DL 2019
- TREC DL 2020
- BEIR (subset incl. Covid)
- Wikipedia dump (Dec 2020 sample)
Benchmarks
- TREC DL
- BEIR
- MTEB (referenced for embedding comparison)

