Compress prompts by sampling attention-important tokens and sentences with a small RL policy

April 23, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Lizhe Chen, Binjia Zhou, Yuyao Ge, Jiayi Chen, Shiguang NI

Why It Matters For Business

PIS lowers inference time and token costs by using attention-aware compression and a small RL policy, letting teams serve long-context LLM tasks faster while often preserving or improving accuracy.

Summary TLDR

The paper introduces Prompt Importance Sampling (PIS): a two-level prompt compression method that uses LLM attention as a proxy for token importance, removes low-value tokens and redundant sentences, and learns per-sentence compression ratios with a lightweight 9-layer DDQN. On multiple QA and summarization benchmarks, PIS keeps or improves task accuracy while cutting context size and inference latency versus existing compression baselines. The method relies on an encoder-only model for attention approximations, TF–IDF correction, and a Russian-roulette sentence sampler.

Problem Statement

Long prompts increase memory use and latency for LLMs. Current compression techniques either rely on expensive auxiliary models or ignore the model's internal attention signals. What is needed is a method that compresses prompts cheaply while preserving the tokens that actually matter to the LLM's generation.

Main Contribution

A measure-theoretic framing that links token importance to LLM attention scores, formalizing prompt compression as importance sampling.

PIS: a dual-level compression pipeline. At the token level, importance sampling uses attention scores plus TF–IDF, and a 9-layer DDQN picks sentence-wise compression ratios. At the sentence level, Russian-roulette sampling removes redundant sentences.

Comprehensive evaluation showing improved compression quality, lower latency, and better generalization across QA and summarization datasets without auxiliary generative models.
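
As a rough illustration of the token-level step, the sketch below blends a per-token attention score with a TF–IDF weight and keeps the highest-scoring fraction of tokens in their original order. The attention values are made up by hand, and the function names, the blend weight `alpha`, and the smoothed IDF formula are assumptions for illustration only; the paper's actual scoring rule may differ.

```python
import math
from collections import Counter


def tfidf_weights(tokens, corpus):
    """TF-IDF weight for each token of one sentence against a list of token lists."""
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = []
    for tok in tokens:
        df = sum(1 for doc in corpus if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed form)
        weights.append(tf[tok] / len(tokens) * idf)
    return weights


def normalize(values):
    """Min-max scale to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def compress_tokens(tokens, attention, corpus, keep_ratio=0.5, alpha=0.7):
    """Keep the top `keep_ratio` share of tokens by blended score, preserving order."""
    att = normalize(attention)
    tfidf = normalize(tfidf_weights(tokens, corpus))
    scores = [alpha * a + (1 - alpha) * t for a, t in zip(att, tfidf)]
    n_keep = max(1, round(keep_ratio * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top)]


tokens = ["the", "council", "approved", "the", "budget", "today"]
attention = [0.02, 0.30, 0.25, 0.02, 0.35, 0.06]  # hypothetical attention mass per token
corpus = [tokens, ["the", "meeting", "ended"], ["the", "vote", "passed", "today"]]
kept = compress_tokens(tokens, attention, corpus, keep_ratio=0.5)
print(kept)
```

In a real pipeline the `attention` list would come from the encoder-only model mentioned above rather than from hand-written numbers.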

Key Findings

PIS improves task performance at the same compression rate compared to strong baselines.

Numbers: 15% relative performance improvement at equivalent compression ratios (paper claim)

PIS reduces inference overhead versus strong baselines.

Numbers: 38% lower inference overhead compared to a strong baseline (paper claim)

Compressed prompts sometimes improve downstream accuracy versus the original full prompt.

Numbers: 5% accuracy uplift on downstream tasks with compressed prompts (paper claim)

Token-level importance sampling (TIS) is critical to quality.

Numbers: Exact Match drops from 89.05 to 65.21 (−23.84 points) when removing TIS on MeetingBank (Table 5)

Results

MeetingBank QA Exact Match

Value: 89.05 (Ours)

Baseline: 87.19 (LLMLingua-2)

MeetingBank Summary BLEU

Value: 23.98 (Ours)

Baseline: 20.77 (LLMLingua-2)

Compression ratio

Value: ≈3.01× (Ours)

Baseline: ≈2.96–3.04× (baselines varied)

Latency at 1500 tokens, 5× compression

Value: 2.65 s (Ours)

Baseline: 3.85 s (LLMLingua-2)

GSM8K 1-shot EM (out-of-domain)

Value: 80.19 (Ours)

Baseline: 78.75 (LLMLingua-2)

Ablation: remove Token-Level Importance Sampling

Value: EM 65.21 (w/o TIS)

Baseline: EM 89.05 (Full)
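
The gaps between the reported values and their baselines can be recomputed directly. The numbers below are copied from the results in this section; the helper name `deltas` is illustrative.

```python
def deltas(ours, baseline):
    """Return (absolute difference, relative change in percent) versus the baseline."""
    diff = ours - baseline
    return round(diff, 2), round(100.0 * diff / baseline, 1)


rows = {
    "MeetingBank QA Exact Match": (89.05, 87.19),
    "MeetingBank Summary BLEU": (23.98, 20.77),
    "Latency (s), 1500 tokens, 5x compression": (2.65, 3.85),
    "GSM8K 1-shot EM": (80.19, 78.75),
    "Ablation EM, w/o TIS vs full": (65.21, 89.05),
}

for name, (ours, base) in rows.items():
    diff, rel = deltas(ours, base)
    print(f"{name}: {diff:+.2f} ({rel:+.1f}%)")
```

On these values the latency gap works out to roughly 31%. The 38% figure under Key Findings is the paper's own claim and may have been measured under a different setting.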

What To Try In 7 Days

Run PIS as a preprocessing step on one long-input pipeline (meeting transcripts) and measure latency and QA/summarization quality.

Replace current heuristic pruning with token-level attention+TF–IDF ranking and check whether quality improves at the same compression ratio.

A/B test the compact DDQN policy vs fixed compression ratios to confirm stability and cost trade-offs.
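
A small harness along the following lines could support these trials by reporting exact-match rate and median latency per pipeline variant. It is a sketch under stated assumptions: `fake_llm` and `fixed_ratio_compress` are hypothetical stand-ins so the code runs offline, and a real test would swap in the deployed model call and the compressor under evaluation.

```python
import statistics
import time


def exact_match(prediction, reference):
    """Case- and whitespace-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(pipeline, examples):
    """Return (exact-match rate, median seconds per call) for one pipeline."""
    hits, timings = 0, []
    for prompt, reference in examples:
        start = time.perf_counter()
        prediction = pipeline(prompt)
        timings.append(time.perf_counter() - start)
        hits += exact_match(prediction, reference)
    return hits / len(examples), statistics.median(timings)


# Hypothetical stand-ins; replace with the real model call and compressor.
def fake_llm(prompt):
    return prompt.split()[-1]


def fixed_ratio_compress(prompt, keep_every=2):
    words = prompt.split()
    return " ".join(words[::keep_every] + [words[-1]])


examples = [
    ("what is the capital of france paris", "Paris"),
    ("two plus two equals four", "four"),
]
em_full, lat_full = evaluate(fake_llm, examples)
em_fixed, lat_fixed = evaluate(lambda p: fake_llm(fixed_ratio_compress(p)), examples)
print(f"full prompt: EM={em_full:.2f}, median latency={lat_full:.6f}s")
print(f"fixed ratio: EM={em_fixed:.2f}, median latency={lat_fixed:.6f}s")
```

A third arm that calls the learned policy would slot in the same way, which keeps the comparison between fixed and learned ratios on identical examples.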

Agent Features

Tool Use

  • RL
  • importance-sampling-inspired selection

Architectures

  • encoder-only LM for attention proxy
  • 9-layer DDQN (policy network)
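
For readers unfamiliar with DDQN (Double DQN), the sketch below shows the standard Double DQN target: the online network selects the next action and the target network scores it. Treating the actions as a discrete set of candidate compression ratios is an assumption made for illustration, and the Q-values and reward are made-up numbers; the paper's state, reward, and action definitions are not reproduced here.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.9, done=False):
    """Double DQN target: argmax from the online net, value from the target net."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def dqn_target(reward, next_q_target, gamma=0.9, done=False):
    """Vanilla DQN target for comparison: max over the target net's own values."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


# Assumed action space: one candidate compression ratio per action index.
ratios = [0.2, 0.4, 0.6, 0.8]
next_q_online = [0.1, 0.7, 0.5, 0.2]  # hypothetical online-network Q-values
next_q_target = [0.3, 0.4, 0.9, 0.1]  # hypothetical target-network Q-values

y_double = ddqn_target(1.0, next_q_online, next_q_target)
y_vanilla = dqn_target(1.0, next_q_target)
print(f"Double DQN target: {y_double:.2f}, vanilla DQN target: {y_vanilla:.2f}")
```

Decoupling action selection from action evaluation is what reduces the overestimation that a single max over noisy Q-values tends to produce.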

Optimization Features

Token Efficiency

  • ≈3× compression ratios reported while preserving quality

System Optimization

  • lightweight encoder + 9-layer DDQN policy reduces pipeline overhead
  • linear-time compression step (currently sequential, but parallelizable)

Training Optimization

  • small DDQN trained on BERT embeddings to avoid full LLM gradients

Inference Optimization

  • token-level pruning to reduce attention computation
  • sentence-level Russian-roulette to cut redundancy
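
The sentence-level step can be pictured as a stochastic filter: each sentence survives with a probability that falls as it becomes more redundant with what has already been kept. The sketch below is one possible reading, not the paper's algorithm; the Jaccard word-overlap redundancy measure, the `min_survival` floor, and all function names are assumptions.

```python
import random


def jaccard(a, b):
    """Word-set overlap between two sentences, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def roulette_filter(sentences, rng, min_survival=0.1):
    """Keep each sentence with probability 1 - redundancy, floored at min_survival."""
    kept = []
    for sent in sentences:
        redundancy = max((jaccard(sent, k) for k in kept), default=0.0)
        survival = max(min_survival, 1.0 - redundancy)
        if rng.random() < survival:  # rng.random() is in [0, 1)
            kept.append(sent)
    return kept


sentences = [
    "the budget was approved",
    "the budget was approved",
    "parking fees rise in june",
]
print(roulette_filter(sentences, random.Random(0)))
```

Because the exact duplicate still survives with probability `min_survival`, repeated runs can differ. That randomness is the source of the edge-case risk that stochastic deletion occasionally drops a sentence that mattered.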

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Punctuation-based sentence splitting fails on technical or non-standard documents.
  • RL training requires repeated LLM evaluations, which increases development cost.
  • Single-ratio models limit flexibility; trials with ratio-agnostic policies showed 17% higher policy gradient variance.
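
The first limitation is easy to reproduce with a naive splitter. The regex below is an assumed stand-in for punctuation-based splitting, not the rule the paper uses; it handles plain prose but breaks a technical sentence at the abbreviation.

```python
import re


def split_sentences(text):
    """Naive splitter: break at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


prose = "The council met on Monday. The budget passed."
technical = "Set lr to 0.01 e.g. for fine-tuning. Call model.fit(x) next."

print(split_sentences(prose))
print(split_sentences(technical))
```

The technical input contains two logical sentences but comes back as three fragments, because "e.g." is followed by whitespace and is treated as a sentence end.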

When Not To Use

  • When documents have logical units not aligned with punctuation (e.g., code, packed technical sections).
  • When you cannot afford the extra compute, including repeated LLM evaluations, needed to train the DDQN policy.
  • When absolute fidelity of the original uncompressed prompt is required.

Failure Modes

  • Excessive sentence fragmentation can increase processing time and harm quality.
  • Encoder-only attention may misrank tokens compared to the deployed LLM, risking deletion of important tokens.
  • Stochastic sentence deletion could remove rare critical sentences in edge cases.

Core Entities

Models

  • GPT-4O-MINI-2024-07-18
  • GPT-3.5-TURBO-0613
  • GPT-4
  • MISTRAL-7B
  • LLAMA-2-7B
  • BERT-BASE-UNCASED
  • LLMLingua
  • LLMLingua-2
  • SelectiveContext

Metrics

  • Exact Match
  • BLEU
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • BERTScore
  • Latency (s)
  • Compression ratio (1/τ)
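
As a quick worked example of the last metric, assuming τ denotes the fraction of tokens retained after compression (so the compression ratio is 1/τ), the helper names below are illustrative:

```python
def retained_fraction(original_tokens, compressed_tokens):
    """tau: share of the original tokens that survive compression (assumed definition)."""
    return compressed_tokens / original_tokens


def compression_ratio(original_tokens, compressed_tokens):
    """1 / tau: how many times shorter the compressed prompt is."""
    return original_tokens / compressed_tokens


# A 1500-token prompt compressed 5x leaves 300 tokens.
tau = retained_fraction(1500, 300)
ratio = compression_ratio(1500, 300)
print(f"tau = {tau:.2f}, compression ratio = {ratio:.1f}x")
```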

Datasets

  • MeetingBank
  • GSM8K
  • BBH
  • LongBench-GovReport