Compress prompts by sampling attention-important tokens and sentences with a small RL policy

April 23, 2025 · 7 min read

Overview

Decision Snapshot: Needs Validation

The approach is practical and lightweight and shows repeated empirical gains, but it relies on an encoder-only model as a proxy for the LLM's attention and on RL training; expect moderate engineering work to integrate and tune.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Lizhe Chen, Binjia Zhou, Yuyao Ge, Jiayi Chen, Shiguang NI

Why It Matters For Business

PIS lowers inference time and token costs by using attention-aware compression and a small RL policy, letting teams serve long-context LLM tasks faster while often preserving or improving accuracy.

Summary TLDR

The paper introduces Prompt Importance Sampling (PIS): a two-level prompt compression method that uses LLM attention as a proxy for token importance, removes low-value tokens and redundant sentences, and learns per-sentence compression ratios with a lightweight 9-layer DDQN. On multiple QA and summarization benchmarks, PIS keeps or improves task accuracy while cutting context size and inference latency versus existing compression baselines. The method relies on an encoder-only model for attention approximations, TF–IDF correction, and a Russian-roulette sentence sampler.

Problem Statement

Long prompts increase memory use and latency for LLMs, and current compression techniques either rely on expensive auxiliary models or ignore the model's internal attention signals. What is needed is a method that compresses prompts cheaply while preserving the tokens that actually matter to the LLM's generation.

Main Contribution

A measure-theoretic framing that links token importance to LLM attention scores, formalizing prompt compression as importance sampling.

PIS: a dual-level compression pipeline. At the token level, importance sampling uses attention plus TF–IDF, with a 9-layer DDQN choosing a compression ratio for each sentence. At the sentence level, Russian-roulette sampling removes redundant sentences.
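The token-level step can be illustrated with a minimal sketch. It assumes per-token attention scores are already available (here they are made-up numbers standing in for an encoder's output), blends them with a smoothed TF-IDF weight using a hypothetical mixing parameter `alpha`, and keeps the top-scoring tokens deterministically. The paper's actual correction formula and sampling procedure are not given in this summary, so the function names, the blend, and the top-k selection are all assumptions.

```python
import math
from collections import Counter


def tfidf_weights(tokens, corpus):
    """Smoothed TF-IDF weight for each token position.

    `corpus` is a list of token lists used only to estimate document frequency.
    """
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = []
    for tok in tokens:
        df = sum(1 for doc in corpus if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights.append((tf[tok] / len(tokens)) * idf)
    return weights


def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def compress_tokens(tokens, attention, corpus, keep_ratio, alpha=0.5):
    """Keep the highest-scoring tokens, preserving their original order.

    Assumption: score = alpha * attention + (1 - alpha) * TF-IDF.
    """
    att = normalize(attention)
    tfidf = normalize(tfidf_weights(tokens, corpus))
    scores = [alpha * a + (1 - alpha) * t for a, t in zip(att, tfidf)]
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


tokens = ["the", "budget", "vote", "is", "the", "main", "item"]
attention = [0.05, 0.30, 0.25, 0.05, 0.05, 0.10, 0.20]  # made-up scores
corpus = [tokens, ["the", "weather", "is", "nice"], ["the", "meeting", "is", "over"]]
print(compress_tokens(tokens, attention, corpus, keep_ratio=0.5))
```

In this toy input the function words receive low attention and low IDF, so they are the first to be dropped. A per-sentence `keep_ratio` is where a learned policy would plug in.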

Key Findings

PIS improves task performance at the same compression rate compared to strong baselines.

Numbers: 15% relative performance improvement at equivalent compression ratios (paper claim)

Practical Use: If you replace heuristic or generative compression with PIS, you can expect better-quality compressed prompts for the same token budget on the evaluated benchmarks.

Evidence Ref: Abstract; Introduction

PIS reduces inference overhead versus strong baselines.

Numbers: 38% lower inference overhead compared to a strong baseline (paper claim)

Practical Use: Use PIS to cut per-request latency and cost when serving long-context LLM workloads.

Evidence Ref: Abstract; Latency Comparison (Section 5.3.3)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MeetingBank QA Exact Match | 89.05 (Ours) | 87.19 (LLMLingua-2) | +1.86 EM | MeetingBank (Table 1) | Table 1 reports 89.05 EM for PIS vs 87.19 for LLMLingua-2 | Table 1 |
| MeetingBank Summary BLEU | 23.98 (Ours) | 20.77 (LLMLingua-2) | +3.21 BLEU | MeetingBank (Table 1) | Table 1 BLEU 23.98 vs 20.77 | Table 1 |

What To Try In 7 Days

Run PIS as a preprocessing step on one long-input pipeline (meeting transcripts) and measure latency and QA/summarization quality.

Replace current heuristic pruning with token-level attention+TF–IDF ranking to see immediate quality gains.

A/B test the compact DDQN policy vs fixed compression ratios to confirm stability and cost trade-offs.
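For the A/B comparison above, a small scoring helper may be enough to start. The sketch below averages exact match, latency, and compression ratio (original tokens divided by kept tokens) for one arm of the test. The record fields (`pred`, `gold`, `latency_s`, `orig_tokens`, `kept_tokens`) are hypothetical names, and the sample records are toy values for illustration, not results from the paper.

```python
def exact_match(prediction, gold):
    """1.0 if the answers match after lowercasing and whitespace cleanup, else 0.0."""
    clean = lambda s: " ".join(s.lower().split())
    return float(clean(prediction) == clean(gold))


def summarize_arm(records):
    """Average EM, latency, and compression ratio for one arm of an A/B test."""
    n = len(records)
    return {
        "exact_match": sum(exact_match(r["pred"], r["gold"]) for r in records) / n,
        "latency_s": sum(r["latency_s"] for r in records) / n,
        # Compression ratio here means original tokens / kept tokens.
        "compression_ratio": sum(r["orig_tokens"] / r["kept_tokens"] for r in records) / n,
    }


# Toy records only; replace with logged requests from each arm.
toy_arm = [
    {"pred": "Yes", "gold": "yes", "latency_s": 1.0, "orig_tokens": 300, "kept_tokens": 100},
    {"pred": "no", "gold": "yes", "latency_s": 2.0, "orig_tokens": 200, "kept_tokens": 100},
]
print(summarize_arm(toy_arm))
```

Running the same helper on the fixed-ratio arm and the DDQN arm gives directly comparable numbers for quality, latency, and token budget.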

Agent Features

Tool Use: RL; importance-sampling-inspired selection
Architectures: encoder-only LM for attention proxy; 9-layer DDQN (policy network)
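For readers unfamiliar with DDQN (Double DQN), the sketch below shows the standard Double DQN target: the online network picks the next action and the target network supplies its value. Treating each action as one candidate compression ratio is an assumption made for illustration; the paper's state, action, and reward definitions are not described in this summary.

```python
def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Compute y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    Each row of `next_q_online` / `next_q_target` holds Q-values for the next
    state, one entry per action (hypothetically, one per compression ratio).
    """
    targets = []
    for r, q_on, q_tg, done in zip(rewards, next_q_online, next_q_target, dones):
        best_action = max(range(len(q_on)), key=q_on.__getitem__)  # chosen by online net
        bootstrap = 0.0 if done else gamma * q_tg[best_action]     # valued by target net
        targets.append(r + bootstrap)
    return targets


targets = double_dqn_targets(
    rewards=[1.0, 0.0],
    next_q_online=[[1.0, 2.0, 0.0], [0.5, 0.1, 0.3]],
    next_q_target=[[10.0, 20.0, 30.0], [4.0, 5.0, 6.0]],
    dones=[False, True],
    gamma=0.9,
)
print(targets)
```

Decoupling action selection from action evaluation is what distinguishes Double DQN from plain DQN, which would take the maximum of the target network's own values (30.0 in the first row above) and tend to overestimate.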

Optimization Features

Token Efficiency: ≈3× compression ratios reported while preserving quality
System Optimization: lightweight encoder + 9-layer RL reduces pipeline overhead; linear-time compression step (sequential currently, parallelizable)
Training Optimization: small DDQN trained on BERT embeddings to avoid full LLM gradients
Inference Optimization: token-level pruning to reduce attention computation; sentence-level Russian-roulette to cut redundancy
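The sentence-level step can be pictured with a Russian-roulette-style filter: each sentence survives a random draw with a probability that shrinks as it overlaps more with sentences already kept. This is a sketch under stated assumptions, not the paper's sampler. The Jaccard word-overlap measure, the `floor` parameter, and the function names are all hypothetical choices.

```python
import random


def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def russian_roulette_filter(sentences, rng, floor=0.05):
    """Keep each sentence with probability max(floor, 1 - redundancy).

    Redundancy is the highest similarity to any sentence kept so far, so
    near-duplicates are likely (but not certain) to be dropped.
    """
    kept = []
    for sentence in sentences:
        redundancy = max((jaccard(sentence, k) for k in kept), default=0.0)
        survive_p = max(floor, 1.0 - redundancy)
        if rng.random() < survive_p:
            kept.append(sentence)
    return kept


sentences = [
    "The council approved the budget.",
    "The council approved the budget.",
    "Parking fees rise in May.",
]
# With floor=0.0 an exact duplicate can never survive, so this call is deterministic.
print(russian_roulette_filter(sentences, random.Random(0), floor=0.0))
```

A non-zero `floor` gives every sentence some chance of surviving, which keeps the filter stochastic rather than a hard deduplication rule.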

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Punctuation-based sentence splitting fails on technical or non-standard documents.

RL training requires repeated LLM evaluations, which increases development cost.
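The sentence-splitting limitation is easy to reproduce. The sketch below uses a naive punctuation-based splitter, which is an assumed stand-in because the paper's exact splitter is not described in this summary, and shows how an abbreviation in technical text produces extra fragments.

```python
import re


def split_on_punctuation(text):
    """Naive splitter: break at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Set lr to 0.1, e.g. for warmup. Then call fit()."
for piece in split_on_punctuation(text):
    print(repr(piece))
```

The two intended sentences come back as three pieces because the period in "e.g." is treated as a sentence boundary, so a per-sentence compression ratio would be applied to fragments rather than to logical units.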

When Not To Use

When documents have logical units not aligned with punctuation (e.g., code, packed technical sections).

When you cannot afford the extra compute (repeated LLM evaluations) needed to train the DDQN policy.

Failure Modes

Excessive sentence fragmentation can increase processing time and harm quality.

Encoder-only attention may misrank tokens compared to the deployed LLM, risking deletion of important tokens.

Core Entities

Models

GPT-4O-MINI-2024-07-18, GPT-3.5-TURBO-0613, GPT-4, MISTRAL-7B, LLAMA-2-7B, BERT-BASE-UNCASED, LLMLingua, LLMLingua-2, SelectiveContext

Metrics

Exact Match, BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, Latency (s), Compression ratio (1/τ)

Datasets

MeetingBank, GSM8K, BBH, LongBench-GovReport