Compress prompts by sampling attention-important tokens and sentences with a small RL policy

Overview

Decision Snapshot: Needs Validation

The approach is practical and lightweight and shows repeated empirical gains, but relies on an encoder proxy and RL training; expect moderate engineering work to integrate and tune.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Lizhe Chen, Binjia Zhou, Yuyao Ge, Jiayi Chen, Shiguang NI

Why It Matters For Business

PIS lowers inference time and token costs by using attention-aware compression and a small RL policy, letting teams serve long-context LLM tasks faster while often preserving or improving accuracy.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces Prompt Importance Sampling (PIS): a two-level prompt compression method that uses LLM attention as a proxy for token importance, removes low-value tokens and redundant sentences, and learns per-sentence compression ratios with a lightweight 9-layer DDQN. On multiple QA and summarization benchmarks, PIS keeps or improves task accuracy while cutting context size and inference latency versus existing compression baselines. The method relies on an encoder-only model for attention approximations, TF–IDF correction, and a Russian-roulette sentence sampler.
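As a rough illustration of the token-level step summarized above, the sketch below scores each token by the product of an attention weight and a TF-IDF weight, then keeps the highest-scoring fraction in original order. The function names, the multiplicative combination, the smoothed IDF, and the deterministic top-k selection (used here in place of sampling) are assumptions for illustration; the paper's actual scoring rule may differ. The attention values are hard-coded rather than read from an encoder.

```python
import math
from collections import Counter


def tfidf_weights(sentence_tokens, corpus_sentences):
    """TF-IDF weight per token, treating each sentence as one 'document'."""
    n_docs = len(corpus_sentences)
    tf = Counter(sentence_tokens)
    weights = []
    for tok in sentence_tokens:
        df = sum(1 for sent in corpus_sentences if tok in sent)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
        weights.append((tf[tok] / len(sentence_tokens)) * idf)
    return weights


def compress_tokens(tokens, attention, tfidf, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by attention * TF-IDF."""
    if not (len(tokens) == len(attention) == len(tfidf)):
        raise ValueError("tokens, attention and tfidf must have equal length")
    scores = [a * w for a, w in zip(attention, tfidf)]
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]  # restore original word order


# Toy input: attention values are made up, not produced by a real encoder.
sentence = ["the", "budget", "was", "approved", "on", "monday"]
corpus = [
    sentence,
    ["the", "council", "met", "on", "tuesday"],
    ["the", "vote", "was", "close"],
]
attention = [0.05, 0.30, 0.05, 0.35, 0.05, 0.20]

kept = compress_tokens(sentence, attention, tfidf_weights(sentence, corpus), 0.5)
print(kept)
```

With a `keep_ratio` of 0.5, the kept fraction corresponds to τ in the "Compression ratio (1/τ)" metric, so this toy run compresses by roughly 2×.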

Problem Statement

Long prompts increase LLM memory use and latency, and current compression techniques either rely on expensive auxiliary models or ignore the model's internal attention signals. The goal is a method that compresses prompts cheaply while preserving the tokens that actually matter to the LLM's generation.

Main Contribution

A measure-theoretic framing that links token importance to LLM attention scores, formalizing prompt compression as importance sampling.

PIS: a dual-level compression pipeline — token-level importance sampling using attention+TF–IDF and a 9-layer DDQN to pick sentence-wise compression ratios, plus sentence-level Russian-roulette sampling.
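For readers unfamiliar with the term, the importance-sampling framing in the first contribution rests on a standard identity: an expectation under a target distribution p can be estimated from samples drawn under a proposal q, provided each sample is reweighted by p/q. Reading normalized attention scores as the proposal q over tokens is an assumption made here for illustration; the paper's own definitions and notation may differ.

```latex
\mathbb{E}_{x \sim p}\left[ f(x) \right]
  = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
  = \mathbb{E}_{x \sim q}\left[ f(x)\, \frac{p(x)}{q(x)} \right],
\qquad q(x) > 0 \text{ wherever } f(x)\, p(x) \neq 0 .
```

The practical consequence is that the closer q tracks where f(x) p(x) is large, the fewer samples (here, retained tokens) are needed for a given estimation error.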

Key Findings

PIS improves task performance at the same compression rate compared to strong baselines.

Numbers: 15% relative performance improvement at equivalent compression ratios (paper claim)

Practical Use: If you replace heuristic or generative compression with PIS, you can expect better-quality compressed prompts for the same token budget on the evaluated benchmarks.

Evidence Ref: Abstract; Introduction

PIS reduces inference overhead versus strong baselines.

Numbers: 38% lower inference overhead compared to a strong baseline (paper claim)

Practical Use: Use PIS to cut per-request latency and cost when serving long-context LLM workloads.

Evidence Ref: Abstract; Latency Comparison (Section 5.3.3)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MeetingBank QA Exact Match	89.05 (Ours)	87.19 (LLMLingua-2)	+1.86 EM	MeetingBank (Table 1)	Table 1 reports 89.05 EM for PIS vs 87.19 for LLMLingua-2	Table 1
MeetingBank Summary BLEU	23.98 (Ours)	20.77 (LLMLingua-2)	+3.21 BLEU	MeetingBank (Table 1)	Table 1 BLEU 23.98 vs 20.77	Table 1
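The Delta column can be rechecked with a few lines of arithmetic. The sketch below recomputes the absolute deltas from the two rows above and adds a relative delta, which the table does not report. The `results` dictionary and the `delta` helper are hypothetical names introduced for this check.

```python
# (ours, baseline) pairs copied from the two table rows above.
results = {
    "MeetingBank QA Exact Match": (89.05, 87.19),
    "MeetingBank Summary BLEU": (23.98, 20.77),
}


def delta(ours, baseline):
    """Return (absolute delta, relative delta in percent of the baseline)."""
    absolute = ours - baseline
    return round(absolute, 2), round(100.0 * absolute / baseline, 1)


for metric, (ours, baseline) in results.items():
    absolute, relative = delta(ours, baseline)
    print(f"{metric}: +{absolute} absolute, +{relative}% relative")
```

On these two rows the absolute deltas match the table (+1.86 EM, +3.21 BLEU). In relative terms the BLEU gain is much larger than the EM gain (roughly 15% versus 2%), which is worth keeping in mind when reading headline percentages.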

What To Try In 7 Days

Run PIS as a preprocessing step on one long-input pipeline (meeting transcripts) and measure latency and QA/summarization quality.

Replace current heuristic pruning with token-level attention+TF–IDF ranking to see immediate quality gains.

A/B test the compact DDQN policy vs fixed compression ratios to confirm stability and cost trade-offs.

Agent Features

Tool Use

RL

importance-sampling-inspired selection

Architectures

encoder-only LM for attention proxy

9-layer DDQN (policy network)

Optimization Features

Token Efficiency

≈3× compression ratios reported while preserving quality

System Optimization

lightweight encoder + 9-layer RL reduces pipeline overhead

linear-time compression step (sequential currently, parallelizable)

Training Optimization

small DDQN trained on BERT embeddings to avoid full LLM gradients
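For context, the defining step of Double DQN is its bootstrap target: the online network chooses the next action and a separate target network evaluates it, which reduces the overestimation bias of plain Q-learning. The sketch below shows only that target computation, using plain lists in place of network outputs. Treating the action set as a handful of discrete per-sentence compression ratios is an assumption here, as are the function and variable names; the paper's state, reward, and action definitions are not reproduced.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: select the action with the online net, evaluate with the target net."""
    if done:
        return reward  # no bootstrapping past the end of an episode
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Hypothetical action set: candidate keep-ratios for one sentence.
keep_ratios = [0.25, 0.50, 0.75]

# Made-up Q-values for the next sentence's state.
q_online = [0.2, 0.9, 0.5]   # online net prefers action 1 (keep 50%)
q_target = [0.4, 0.3, 0.8]   # target net's value for action 1 is used

target = double_dqn_target(reward=1.0, next_q_online=q_online,
                           next_q_target=q_target, gamma=0.5)
print(target)
```

Note that a plain DQN target would take the maximum of `q_target` directly (0.8 here), whereas the Double DQN target uses the value at the online network's choice (0.3).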

Inference Optimization

token-level pruning to reduce attention computation

sentence-level Russian-roulette to cut redundancy
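The sentence-level idea can be sketched as probabilistic survival: each sentence is kept with a probability that falls as it becomes more redundant with earlier sentences. This is a minimal sketch under stated assumptions. The Jaccard word-overlap redundancy measure, the `1 - overlap` survival rule, and all function names are hypothetical; the paper's sampler may be defined differently.

```python
import random


def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def survival_probs(sentences):
    """Survival probability = 1 - highest overlap with any earlier sentence."""
    probs = []
    for i, sent in enumerate(sentences):
        overlap = max((jaccard(sent, prev) for prev in sentences[:i]), default=0.0)
        probs.append(1.0 - overlap)
    return probs


def russian_roulette(sentences, probs, seed=0):
    """Keep each sentence independently with its survival probability."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    return [s for s, p in zip(sentences, probs) if rng.random() < p]


sentences = [
    "the budget was approved",
    "the budget was approved",      # exact repeat: survival probability 0
    "parking fees rise in june",    # no overlap: survival probability 1
]
probs = survival_probs(sentences)
kept = russian_roulette(sentences, probs)
print(probs, kept)
```

In Monte Carlo rendering, Russian roulette also reweights survivors by 1/p to keep the estimator unbiased. Plain text cannot carry such a weight, so this sketch omits that step.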

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Punctuation-based sentence splitting fails on technical or non-standard documents.

RL training requires repeated LLM evaluations, which increases development cost.

When Not To Use

When documents have logical units not aligned with punctuation (e.g., code, packed technical sections).

When you cannot afford extra compute to train the DDQN reward policy.
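The punctuation-alignment caveat above is easy to reproduce. The sketch below uses a deliberately naive splitter that breaks on `.`, `!`, and `?`. It is a hypothetical stand-in, not the paper's splitter. It behaves sensibly on prose but fragments a single line of code into several pseudo-sentences.

```python
import re


def split_on_punctuation(text):
    """Naive splitter: cut after every '.', '!' or '?' (hypothetical stand-in)."""
    pieces = re.findall(r"[^.!?]+[.!?]?", text)
    return [p.strip() for p in pieces if p.strip()]


prose = "The budget passed. Fees rise in June."
code = "x = cfg.model.layers[0].weight"

prose_parts = split_on_punctuation(prose)
code_parts = split_on_punctuation(code)

print(prose_parts)  # two sentences
print(code_parts)   # one expression split into four fragments
```

Each fragment would then be scored and compressed as if it were a sentence, which is the fragmentation risk noted under Failure Modes. A quick check of fragment counts on a sample of your own documents is a cheap way to see whether this applies.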

Failure Modes

Excessive sentence fragmentation can increase processing time and harm quality.

Encoder-only attention may misrank tokens compared to the deployed LLM, risking deletion of important tokens.

Core Entities

Models

GPT-4O-MINI-2024-07-18, GPT-3.5-TURBO-0613, GPT-4, MISTRAL-7B, LLAMA-2-7B, BERT-BASE-UNCASED, LLMLingua, LLMLingua-2, SelectiveContext

Metrics

Exact Match, BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, Latency (s), Compression ratio (1/τ)

Datasets

MeetingBank, GSM8K, BBH, LongBench-GovReport
