Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
PIS lowers inference time and token costs by using attention-aware compression and a small RL policy, letting teams serve long-context LLM tasks faster while often preserving or improving accuracy.
Summary TLDR
The paper introduces Prompt Importance Sampling (PIS): a two-level prompt compression method that uses LLM attention as a proxy for token importance, removes low-value tokens and redundant sentences, and learns per-sentence compression ratios with a lightweight 9-layer DDQN. On multiple QA and summarization benchmarks, PIS keeps or improves task accuracy while cutting context size and inference latency versus existing compression baselines. The method relies on an encoder-only model for attention approximations, TF–IDF correction, and a Russian-roulette sentence sampler.
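The token-level step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-token attention scores are already available from some encoder, and it assumes the TF–IDF correction is a simple multiplication by a smoothed IDF weight. The function names `idf_table` and `compress_tokens` and the `keep_ratio` parameter are hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed inverse document frequency over a list of tokenized documents."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def compress_tokens(tokens, attention, idf, keep_ratio):
    """Keep the highest-scoring tokens and preserve their original order.

    Assumption: score = attention * IDF. The paper's exact correction
    may differ; this only illustrates the ranking idea.
    """
    if len(tokens) != len(attention):
        raise ValueError("need one attention score per token")
    default_idf = max(idf.values(), default=1.0)  # unseen tokens count as rare
    scores = [a * idf.get(t, default_idf) for t, a in zip(tokens, attention)]
    n_keep = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:n_keep])]
```

Multiplying by IDF pushes down frequent function words that happen to receive attention, while rare content words keep their rank.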
Problem Statement
Long prompts increase memory use and latency for LLMs, and current compression techniques either rely on expensive auxiliary models or ignore the model's internal attention signals. What is needed is a method that compresses prompts cheaply while preserving the tokens that actually matter to the LLM's generation.
Main Contribution
A measure-theoretic framing that links token importance to LLM attention scores, formalizing prompt compression as importance sampling.
PIS: a dual-level compression pipeline. At the token level, importance sampling ranks tokens by attention plus TF–IDF, and a 9-layer DDQN picks a compression ratio for each sentence. At the sentence level, Russian-roulette sampling removes redundant sentences.
Comprehensive evaluation showing improved compression quality, lower latency, and better generalization across QA and summarization datasets without auxiliary generative models.
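For orientation, the textbook importance-sampling identity that such a framing builds on is given below. Reading p as an attention-derived token-importance distribution and q as the compressor's sampling distribution is an assumption made here for illustration; the paper's exact measure is not reproduced.

```latex
\mathbb{E}_{x \sim p}\big[f(x)\big]
  = \int f(x)\, p(x)\, dx
  = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
  = \mathbb{E}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right],
\qquad q(x) > 0 \ \text{wherever}\ f(x)\, p(x) \neq 0 .
```

The identity says an expectation under p can be estimated from samples drawn under q, provided each sample is reweighted by p(x)/q(x).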
Key Findings
PIS improves task performance at the same compression rate compared to strong baselines.
PIS reduces inference overhead versus strong baselines.
Compressed prompts sometimes improve downstream accuracy versus the original full prompt.
Token-level importance sampling (TIS) is critical to quality.
Results
MeetingBank QA Exact Match
MeetingBank Summary BLEU
Compression ratio
Latency at 1500 tokens, 5× compression
GSM8K 1-shot EM (out-of-domain)
Ablation: remove Token-Level Importance Sampling
Who Should Care
What To Try In 7 Days
Run PIS as a preprocessing step on one long-input pipeline (meeting transcripts) and measure latency and QA/summarization quality.
Replace current heuristic pruning with token-level attention+TF–IDF ranking to see immediate quality gains.
A/B test the compact DDQN policy vs fixed compression ratios to confirm stability and cost trade-offs.
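A small harness along these lines could drive the experiments above. It is a sketch under stated assumptions: `compressors`, `run_model`, and `score` are hypothetical callables you would supply (for example, a fixed-ratio pruner versus a learned policy, your LLM call, and your quality metric); none of them comes from the paper.

```python
import time
from statistics import mean


def ab_test(prompts, compressors, run_model, score):
    """Compare compressors on kept-token fraction, latency, and quality.

    prompts:     list of (prompt_tokens, reference) pairs
    compressors: dict mapping a name to callable(tokens) -> tokens
    run_model:   callable(tokens) -> output
    score:       callable(output, reference) -> float
    """
    report = {}
    for name, compress in compressors.items():
        kept, latency, quality = [], [], []
        for tokens, reference in prompts:
            start = time.perf_counter()
            short = compress(tokens)   # compression time counts toward latency
            output = run_model(short)
            latency.append(time.perf_counter() - start)
            kept.append(len(short) / len(tokens))
            quality.append(score(output, reference))
        report[name] = {
            "kept_fraction": mean(kept),
            "latency_s": mean(latency),
            "quality": mean(quality),
        }
    return report
```

Timing the compression step together with the model call keeps the comparison honest: a compressor that saves tokens but is itself slow will show up in `latency_s`.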
Agent Features
Tool Use
- RL
- importance-sampling-inspired selection
Architectures
- encoder-only LM for attention proxy
- 9-layer DDQN (policy network)
Optimization Features
Token Efficiency
- ≈3× compression ratios reported while preserving quality
System Optimization
- lightweight encoder + 9-layer RL reduces pipeline overhead
- linear-time compression step (currently sequential, but parallelizable)
Training Optimization
- small DDQN trained on BERT embeddings to avoid full LLM gradients
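For readers unfamiliar with Double DQN, the target it trains against can be sketched in a few lines. This shows the standard Double DQN update rule only; treating each action as a candidate per-sentence compression ratio is an assumption about how it would apply here, and the function name and arguments are hypothetical.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target value for one transition.

    The online network chooses the next action; the target network
    evaluates it. Each list holds one Q-value per action (assumed here
    to be one per candidate compression ratio).
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

Decoupling action selection from action evaluation is what separates Double DQN from plain DQN, which would take the maximum of the target network's values directly and tends to overestimate.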
Inference Optimization
- token-level pruning to reduce attention computation
- sentence-level Russian-roulette to cut redundancy
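The sentence-level step can be illustrated with a stochastic filter in which each sentence survives with some probability. This is a sketch only: deriving the survival probability from Jaccard word overlap with earlier sentences is an assumption chosen for simplicity, and `redundancy_survival`, `russian_roulette`, and `floor` are hypothetical names, not the paper's.

```python
import random


def redundancy_survival(sentences, floor=0.1):
    """Survival probability = 1 - max Jaccard overlap with any earlier sentence."""
    probs, seen = [], []
    for sentence in sentences:
        words = set(sentence.lower().split())
        overlap = max(
            (len(words & prev) / len(words | prev) for prev in seen if words | prev),
            default=0.0,
        )
        probs.append(max(floor, 1.0 - overlap))
        seen.append(words)
    return probs


def russian_roulette(sentences, survival_probs, rng=None):
    """Keep each sentence independently with its survival probability."""
    if len(sentences) != len(survival_probs):
        raise ValueError("need one probability per sentence")
    rng = rng or random.Random()
    return [s for s, p in zip(sentences, survival_probs) if rng.random() < p]
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing compressed prompts across experiments.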
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Punctuation-based sentence splitting fails on technical or non-standard documents.
- RL training requires repeated LLM evaluations, which increases development cost.
- Single-ratio models limit flexibility; trials with ratio-agnostic policies showed 17% higher policy-gradient variance.
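The first limitation is easy to reproduce with a naive splitter. The regular expression below is an assumed stand-in for punctuation-based splitting, not the paper's actual splitter.

```python
import re


def split_on_punctuation(text):
    """Naive splitter: end a sentence at every '.', '!' or '?'."""
    return [part.strip()
            for part in re.findall(r"[^.!?]+[.!?]?", text)
            if part.strip()]


prose = "The vote passed. Next item?"
code = "total = price * 1.08; log.info(total)"

print(split_on_punctuation(prose))  # two clean sentences
print(split_on_punctuation(code))   # one statement broken into fragments
```

On ordinary prose the split is clean, but the decimal point and the attribute access in the code line are both treated as sentence boundaries, so a single logical unit is cut into pieces that a per-sentence policy would then score separately.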
When Not To Use
- When documents have logical units not aligned with punctuation (e.g., code, packed technical sections).
- When you cannot afford extra compute to train the DDQN reward policy.
- When absolute fidelity of the original uncompressed prompt is required.
Failure Modes
- Excessive sentence fragmentation can increase processing time and harm quality.
- Encoder-only attention may misrank tokens compared to the deployed LLM, risking deletion of important tokens.
- Stochastic sentence deletion could remove rare critical sentences in edge cases.
Core Entities
Models
- GPT-4O-MINI-2024-07-18
- GPT-3.5-TURBO-0613
- GPT-4
- MISTRAL-7B
- LLAMA-2-7B
- BERT-BASE-UNCASED
- LLMLingua
- LLMLingua-2
- SelectiveContext
Metrics
- Exact Match
- BLEU
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BERTScore
- Latency (s)
- Compression ratio (1/τ)
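Two of these metrics are simple enough to sketch directly. The normalization in `exact_match` follows the common QA convention (lowercase, strip punctuation and English articles); whether the paper uses exactly this normalization is an assumption. τ is taken here to be the fraction of tokens kept, so that 1/τ is the compression ratio.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def compression_ratio(original_tokens, compressed_tokens):
    """Return 1/tau, where tau = compressed_tokens / original_tokens."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    return original_tokens / compressed_tokens
```

For example, shrinking a 1500-token prompt to 300 tokens gives τ = 0.2 and a compression ratio of 5.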
Datasets
- MeetingBank
- GSM8K
- BBH
- LongBench-GovReport

