PE‑Rank: compress passages into embeddings to speed LLM listwise reranking

June 21, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Qi Liu, Bo Wang, Nan Wang, Jiaxin Mao

Links

Abstract / PDF

Why It Matters For Business

PE‑Rank makes listwise LLM reranking fast enough for online use by cutting inference time ~4–5× with only minor ranking loss on evaluated benchmarks.

Summary TLDR

PE‑Rank replaces full passages with their dense embeddings (mapped into LLM token space) and uses a dynamic constrained decoding step to produce listwise rankings. It trains in two stages (alignment using text reconstruction; listwise learning-to-rank with KL distillation). On standard benchmarks PE‑Rank keeps ranking quality close to an uncompressed LLM reranker while cutting inference latency roughly 4–5× and greatly reducing processed/generated tokens. Code: https://github.com/liuqi6777/pe_rank.

Problem Statement

Listwise LLM rerankers are effective but hit practical limits: long passages blow past LLM context windows and cause high inference latency. Existing single‑pass compression methods do not scale to ranking many passages.

Main Contribution

PE‑Rank: represent each passage by its retrieval embedding, map it into LLM token space, and treat it as a special token to compress inputs for listwise reranking.
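The mapping step can be pictured with a minimal sketch, assuming a two-layer MLP with a GELU activation. The layer sizes, the random weights, and the names `project_passage_embedding`, `init_linear`, `EMB_DIM`, and `LLM_DIM` are illustrative assumptions, not taken from the paper or its code.

```python
import math
import random

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weight, bias):
    # weight is a list of rows with shape (out_dim, in_dim).
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    # Small random weights; a real projector would be trained, not sampled.
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project_passage_embedding(embedding, layer1, layer2):
    # Map one retrieval embedding to a vector with the LLM's input dimension.
    hidden = [gelu(v) for v in linear(embedding, *layer1)]
    return linear(hidden, *layer2)

rng = random.Random(0)
EMB_DIM, LLM_DIM = 4, 8  # toy sizes, chosen only to keep the example small
layer1 = init_linear(EMB_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)

# Three toy passage embeddings, each becoming one "special token" vector.
passage_embeddings = [[rng.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(3)]
passage_tokens = [project_passage_embedding(e, layer1, layer2) for e in passage_embeddings]
print(len(passage_tokens), len(passage_tokens[0]))
```

Each passage contributes exactly one vector to the LLM input, regardless of how long its text is.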

Dynamic‑Constrained Decoding: constrain the LLM output space to the passage tokens not yet ranked and decode the ranking step by step, which speeds up and stabilizes generation.
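The decoding loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: `toy_score_step` is a hypothetical stand-in for an LLM forward pass that returns one score per passage, and `constrained_greedy_ranking` is a name introduced here.

```python
def constrained_greedy_ranking(score_step, num_passages):
    """Greedily build a ranking, choosing only among passages not yet ranked."""
    remaining = set(range(num_passages))
    ranking = []
    while remaining:
        scores = score_step(ranking)  # one score per passage id
        # Restrict the argmax to the remaining ids; sorted() makes ties deterministic.
        best = max(sorted(remaining), key=lambda pid: scores[pid])
        ranking.append(best)
        remaining.remove(best)
    return ranking

def toy_score_step(ranking_so_far):
    # Hypothetical stand-in for the model: fixed scores, ignoring the prefix.
    return [0.2, 0.9, 0.5, 0.7]

ranking = constrained_greedy_ranking(toy_score_step, 4)
print(ranking)  # [1, 3, 2, 0]
```

Because the candidate set shrinks by one at every step, the output is always a valid permutation of the passages, with no duplicated or missing ids to repair afterwards.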

Two‑stage training: (1) alignment via text reconstruction to map embeddings to LLM token space; (2) listwise learning‑to‑rank with KL distillation to transfer ranking behavior.
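To make the distillation term concrete, here is a minimal sketch of a KL divergence between a teacher's and a student's distribution over candidate passages. The softmax normalization, the KL direction (teacher relative to student), and the toy scores are assumptions made for illustration; the paper's exact loss may differ.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy scores over three candidate passages (illustrative values only).
teacher_scores = [2.0, 1.0, 0.1]  # e.g. a model that reads the full passage text
student_scores = [1.5, 1.2, 0.3]  # e.g. a model that sees only embeddings

teacher_probs = softmax(teacher_scores)
student_probs = softmax(student_scores)
loss = kl_divergence(teacher_probs, student_probs)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the two distributions match and grows as the student's preferences over passages drift from the teacher's.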

Key Findings

PE‑Rank cuts end‑to‑end reranking latency by about 4–5× while keeping ranking quality close to the uncompressed LLM.

Numbers: DL19 top‑100: latency 16.20 s -> 3.62 s (×0.22); NDCG@10 drop < 2%

PE‑Rank greatly reduces tokens processed and generated during LLM inference.

Numbers: DL19 top‑100: #processed tokens 19506 -> 2942 (≈6.6× fewer); #generated tokens 910 -> 180 (≈5.1× fewer)

Ranking effectiveness remains competitive versus similarly trained listwise LLM baselines.

Numbers: DL19 NDCG@10: RankMistral 0.7173 vs PE‑Rank 0.7048 (no significant difference on DL19/DL20 under the paper's tests)

Results

NDCG@10 (TREC DL19, rerank top-100, BM25 retrieval)

Value: PE‑Rank 0.7048

Baseline: RankMistral_p 0.7173

Latency per query (rerank top-100, DL19)

Value: PE‑Rank 3.62 s

Baseline: RankMistral_p 16.20 s

Processed tokens (#Proc) (rerank top-100, DL19)

Value: PE‑Rank 2942.4

Baseline: RankMistral_p 19506.2
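The headline ratios follow directly from the figures listed in this section. The short calculation below uses only those reported values; the variable names are illustrative.

```python
# Reported DL19 top-100 figures from the Results section.
baseline = {"ndcg10": 0.7173, "latency_s": 16.20, "proc_tokens": 19506.2}
pe_rank = {"ndcg10": 0.7048, "latency_s": 3.62, "proc_tokens": 2942.4}

speedup = baseline["latency_s"] / pe_rank["latency_s"]
token_reduction = baseline["proc_tokens"] / pe_rank["proc_tokens"]
relative_ndcg_drop = (baseline["ndcg10"] - pe_rank["ndcg10"]) / baseline["ndcg10"]

print(f"latency speedup:    {speedup:.2f}x")
print(f"token reduction:    {token_reduction:.2f}x")
print(f"relative NDCG drop: {relative_ndcg_drop:.2%}")
```

The results are consistent with the claims above: a speedup between 4× and 5×, roughly 6.6× fewer processed tokens, and a relative NDCG@10 drop under 2%.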

Who Should Care

What To Try In 7 Days

Prototype PE‑Rank on a small reranking pipeline: encode top‑100 candidates with your embedding model and map them via an MLP into an LLM as special tokens.

Measure NDCG@10 and latency for top‑20 and top‑100 reranks; compare to your current LLM or supervised reranker.
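For a quick sanity check while prototyping, NDCG@10 can be computed with a few lines of code. This sketch assumes the exponential gain 2^rel − 1 and computes the ideal ordering from the reranked list only; official evaluation tools may use a different gain and take the ideal ranking from all judged documents, so use a standard tool for any numbers you report.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain with gain 2^rel - 1 and a log2 position discount.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the best possible ordering of the same labels.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance labels of the documents, in the order the reranker returned them.
perfect = ndcg_at_k([3, 2, 1, 0])
imperfect = ndcg_at_k([0, 1, 3])
print(f"perfect: {perfect:.4f}, imperfect: {imperfect:.4f}")
```

Running the same function over the top‑20 and top‑100 reranks of your current system and of the prototype gives a like-for-like comparison.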

Enable dynamic‑constrained decoding so the LLM only outputs passage tokens, and benchmark decoding time separately from prefilling time.
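One way to separate the two phases is a small timing harness like the sketch below. `fake_prefill` and `fake_decode` are hypothetical CPU-only stand-ins for your model's prompt-processing and token-generation calls. On a GPU you would also need to synchronize the device (for example with `torch.cuda.synchronize()` in PyTorch) before reading the clock, otherwise asynchronous kernels make the timings misleading.

```python
import time

def median_seconds(fn, *args, repeats=5):
    # Run fn several times and return the median wall-clock duration.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

def fake_prefill(num_prompt_tokens):
    # Hypothetical stand-in: work grows with prompt length.
    return sum(i * i for i in range(num_prompt_tokens * 10))

def fake_decode(num_steps):
    # Hypothetical stand-in: one small unit of work per generated token.
    total = 0
    for _ in range(num_steps):
        total += sum(range(100))
    return total

prefill_s = median_seconds(fake_prefill, 3000)
decode_s = median_seconds(fake_decode, 100)
print(f"prefill: {prefill_s * 1e3:.2f} ms, decode: {decode_s * 1e3:.2f} ms")
```

Reporting the median over several runs reduces the effect of warm-up and scheduling noise on the comparison.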

Optimization Features

Token Efficiency

  • Input length scales with the number of passages n, not the passage length L_p: O(n) vs O(n·L_p)
  • Generated tokens equal to number of candidates (n) rather than many numeric/text tokens
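The scaling difference is easy to see with a back-of-the-envelope estimate. The passage count, average passage length, and prompt overhead below are illustrative assumptions, not figures from the paper, and the function names are introduced here.

```python
def text_prompt_tokens(num_passages, avg_passage_tokens, overhead_tokens=0):
    # Full-text prompt: every passage contributes all of its tokens, O(n * L_p).
    return num_passages * avg_passage_tokens + overhead_tokens

def embedding_prompt_tokens(num_passages, overhead_tokens=0):
    # Compressed prompt: every passage contributes a single special token, O(n).
    return num_passages + overhead_tokens

n, avg_len, overhead = 20, 100, 50  # assumed values for illustration only
text_len = text_prompt_tokens(n, avg_len, overhead)
compressed_len = embedding_prompt_tokens(n, overhead)
print(text_len, compressed_len)  # 2050 70
```

Doubling the average passage length doubles the passage portion of the full-text prompt but leaves the compressed prompt unchanged.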

Infra Optimization

  • Lower token counts reduce GPU compute and memory pressure at inference

System Optimization

  • Uses DeepSpeed ZeRO, BFloat16, and FlashAttention during training to save memory

Training Optimization

  • Two‑stage training: alignment (frozen LLM+encoder, train MLP), then fine‑tune MLP+LLM with listwise learning‑to‑rank
  • KL distillation to mimic uncompressed text-based ranking

Inference Optimization

  • Compress passages to single embeddings mapped as LLM special tokens
  • Dynamic‑Constrained Decoding: restrict outputs to remaining passage tokens per step
  • Greedy decoding over small token set reduces decode time

Reproducibility

Data Urls

  • MS MARCO (public dataset)
  • TREC DL (public)
  • Wikipedia dump Dec 2020 (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on a good embedding model; poor embeddings reduce reranking quality.
  • Alignment requires an extra training stage and access to the LLM weights for fine‑tuning.
  • Training uses golden sequential labels but inference is autoregressive (a train–inference mismatch, often called exposure bias).

When Not To Use

  • When the highest possible NDCG is required and any drop is unacceptable.
  • If you cannot fine‑tune the LLM or add the MLP mapping layer.
  • If your embedding model does not capture task‑specific fine details.

Failure Modes

  • Mapping MLP fails to align spaces and LLM cannot decode embeddings into passage semantics.
  • Embedding model encodes insufficient detail, causing ranking errors on fine‑grained relevance.
  • Mismatch between training (teacher uses full text) and inference (uses only embeddings) degrades performance if distillation is weak.

Core Entities

Models

  • Mistral-7B-Instruct-v0.2
  • Jina-Embeddings (jina-embeddings-v2-base-en)
  • BGE-base
  • RankGPT
  • RankMistral
  • RankVicuna
  • RankZephyr
  • monoBERT
  • monoT5
  • MiniLM (used as annotation model)

Metrics

  • NDCG@10
  • processed #tokens
  • generated #tokens
  • latency (seconds)

Datasets

  • MS MARCO
  • TREC DL 2019
  • TREC DL 2020
  • BEIR (subset incl. Covid)
  • Wikipedia dump (Dec 2020 sample)

Benchmarks

  • TREC DL
  • BEIR
  • MTEB (referenced for embedding comparison)