Compress long prompts into short natural-language 'Capsule Prompts' that cut cost and latency while keeping accuracy

Overview

Decision Snapshot: Needs Validation

The method shows consistent cost and latency benefits and reasonable transfer to other LLMs on evaluated datasets, but results are limited to the datasets and LLMs tested and depend on reward design and length tuning.

Citations: 1

Evidence Strength: 0.65

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 50%

Authors

Yu-Neng Chuang, Tianwei Xing, Chia-Yuan Chang, Zirui Liu, Xun Chen, Xia Hu

Links

Abstract / PDF / Data

Why It Matters For Business

Compress prompts into short natural-language pieces to cut API token costs and inference time without retraining target LLMs; this reduces cloud bills and speeds up batch processing.

Who Should Care

ML Engineer, Product Manager, CTO, Founder, Data Scientist

Summary TLDR

This paper introduces Nano-Capsulator, a method that trains a generator to turn long prompts into short, natural-language 'Capsule Prompts' using a semantics-preserving loss plus a reward that enforces downstream utility and length constraints. On four benchmarks (CSQA, GSM8K, MultiRC, TriviaQA-Long), Capsule Prompts reduced input length by about 81%, cut API cost by up to 80.1%, and sped up inference by up to 4.5× while largely retaining accuracy. The compressor is trained once (Vicuna-7B + LoRA), and the resulting natural-language prompts transfer to other LLMs (Vicuna-13B, PaLM, Claude2) and to unseen but similar datasets.

Problem Statement

Soft prompts compress context but fail to transfer across different LLMs (especially API models). Compressing into natural language is desirable for transferability but is hard because (1) text is discrete so you cannot backpropagate easily, and (2) generated NL summaries need reliable length control without losing task utility.

Main Contribution

A training framework (Nano-Capsulator) that produces NL-formatted Capsule Prompts by optimizing a semantics-preserving loss together with a reward that enforces downstream utility and strict length constraints.

Demonstration that Capsule Prompts compress inputs by up to 81.4%, cut API cost up to 80.1%, and reduce inference latency up to 4.5× while generally retaining accuracy on several LLMs and datasets.
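One way to picture a reward that combines downstream utility with a strict length constraint is as a gate: a candidate Capsule Prompt earns credit for utility only if it stays within the length budget. The sketch below is an illustration under assumptions, not the paper's actual reward. The function `capsule_reward`, the `utility_score` input, and the whitespace-based token count are all hypothetical stand-ins.

```python
def capsule_reward(capsule: str, utility_score: float, max_tokens: int) -> float:
    """Hypothetical gated reward: utility if within the length budget, else 0."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    # Assumption: whitespace splitting approximates the token count.
    n_tokens = len(capsule.split())
    if n_tokens > max_tokens:
        # Over-budget capsules receive no reward, regardless of utility.
        return 0.0
    return utility_score
```

A hard gate like this makes the length limit non-negotiable; a softer penalty that grows with the overshoot would be an alternative design, at the cost of allowing some over-length outputs.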

Key Findings

Capsule Prompts reduce input length by about 81.4% on evaluated tasks.

Numbers: 81.4% compression rate (reported in Table 1 and text)

Practical Use: Shorter NL prompts let you fit more cases in memory, enable bigger batches, and reduce token-related costs when sending inputs to LLMs.

Evidence Ref: Section 4.3; Table 1
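To check a figure like this on your own prompts, the compression rate can be computed as the fraction of input tokens removed. This is a minimal sketch assuming that definition, which matches the "input length reduced" wording used here; the function name and the token counts in the example are hypothetical.

```python
def compression_rate(original_tokens: int, capsule_tokens: int) -> float:
    """Fraction of input tokens removed by compression (0.0 means no reduction)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if capsule_tokens < 0:
        raise ValueError("capsule_tokens must be non-negative")
    return 1.0 - capsule_tokens / original_tokens


# Hypothetical counts: a 1,000-token prompt compressed to 186 tokens
# corresponds to a rate of roughly 0.814.
rate = compression_rate(1000, 186)
```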

API costs for input tokens dropped up to 80.1% when using Capsule Prompts.

Numbers: Up to 80.1% cost saved on Claude2/PaLM (Table 2, Appendix B)

Practical Use: If you pay per-token or per-request on API LLMs, compressing prompts into NL Capsule Prompts can produce large direct cost savings.

Evidence Ref: Table 2; Appendix B
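For budgeting, the saving on input tokens can be estimated from request volume, prompt lengths, and the per-token price. The sketch below assumes simple per-1,000-input-token pricing and ignores output tokens; the function name and all numbers in the test values are hypothetical, not taken from the paper or from any provider's price list.

```python
def input_cost_saved(requests: int, original_tokens: int,
                     capsule_tokens: int, price_per_1k_tokens: float) -> float:
    """Estimated input-token spend avoided across `requests` calls."""
    if min(requests, original_tokens, capsule_tokens) < 0:
        raise ValueError("counts must be non-negative")
    # Tokens no longer sent per request after compression.
    tokens_saved = original_tokens - capsule_tokens
    # Assumption: billing is linear in input tokens at a flat per-1k rate.
    return requests * tokens_saved / 1000 * price_per_1k_tokens
```

Under flat per-token pricing the percentage saved on input cost tracks the compression rate, so output-token charges and any per-request fees should be added separately when estimating a full bill.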

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Compression rate	81.4% (input length reduced)	original prompts	-81.4%	reported across CSQA, GSM8K, MultiRC, TriviaQA-Long	Table 1; Section 4.3	Table 1
API cost saved	up to 80.1% saved	original prompts on Claude2/PaLM	-80.1%	CSQA, GSM8K, MultiRC, TriviaQA-Long	Table 2; Appendix B	Table 2

What To Try In 7 Days

Run Nano-Capsulator-style compression on one long-prompt workflow and measure API cost and latency before/after.

Train a small compressor (Vicuna-7B + LoRA) on 1–2k examples and reuse the generated NL prompts on your target API LLM to test transferability.

Tune the Capsule Prompt length constraint for your target LLM to balance speed and accuracy.
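A before/after comparison for the first step can be as simple as timing the same model call on original and compressed prompts and summing input tokens. This is a rough sketch: `call_llm` is a hypothetical callable you supply to wrap your own API client, and the default whitespace token counter is only a proxy for a real tokenizer.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def whitespace_token_count(text: str) -> int:
    # Assumption: a rough proxy; a real tokenizer will give different counts.
    return len(text.split())


def measure(call_llm: Callable[[str], str], prompts: List[str],
            count_tokens: Callable[[str], int] = whitespace_token_count
            ) -> Dict[str, float]:
    """Return mean wall-clock latency and total input tokens for `prompts`."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        call_llm(prompt)  # the response is ignored; only timing matters here
        latencies.append(time.perf_counter() - start)
        total_tokens += count_tokens(prompt)
    return {"mean_latency_s": mean(latencies), "input_tokens": total_tokens}
```

Run `measure` once on the original prompts and once on the Capsule Prompts with the same `call_llm`, then compare the two result dictionaries alongside task accuracy.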

Optimization Features

Token Efficiency

Natural-language prompt compression, Truncation-based length enforcement
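Truncation-based length enforcement can be pictured as cutting a generated prompt at a fixed token budget. The sketch below is a simplified illustration that assumes whitespace-separated tokens; the function name is hypothetical, and a production version would truncate with the target model's own tokenizer.

```python
def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep at most `max_tokens` whitespace-separated tokens from `text`."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = text.split()
    # Slicing past the end is safe, so short texts are returned unchanged
    # (apart from whitespace normalization).
    return " ".join(tokens[:max_tokens])
```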

Infra Optimization

Lower API token usage

System Optimization

Reduced memory footprint enables larger batches

Training Optimization

LoRA

Inference Optimization

Prompt Compression, Token Budgeting, Batch-size enablement

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: No

License: Unknown

Data URLs

CommonsenseQA, GSM8K, MultiRC, TriviaQA-Long, BoolQ

Risks & Boundaries

Limitations

Requires training the compressor on similar-domain prompts for best results; transfer is not guaranteed for very different domains.

Length constraint choice matters and must be tuned per target LLM; too-short prompts can lose critical logic.
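Because the length constraint has to be tuned per target LLM, one practical approach is to evaluate several budgets and keep the shortest one whose accuracy stays within a tolerated drop from the uncompressed baseline. The sketch below assumes you have already measured accuracy at each budget; the function name, the selection rule, and the numbers in the test values are hypothetical and not taken from the paper.

```python
from typing import Dict, Optional


def pick_length_budget(accuracy_by_budget: Dict[int, float],
                       baseline_accuracy: float,
                       max_drop: float) -> Optional[int]:
    """Return the smallest budget within `max_drop` of baseline, or None."""
    acceptable = [
        budget
        for budget, accuracy in accuracy_by_budget.items()
        if baseline_accuracy - accuracy <= max_drop
    ]
    # None signals that every tested budget lost too much accuracy.
    return min(acceptable) if acceptable else None
```

Returning `None` rather than the least-bad budget keeps the over-compression case visible instead of silently accepting a prompt that has lost critical logic.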

When Not To Use

When prompts are already short and token cost is not a concern.

When exact token-level wording must be preserved (e.g., legal text).

Failure Modes

Over-compression that drops essential details and reduces accuracy.

Reward mis-specification that optimizes for surrogate scoring but harms real-task performance.

Core Entities

Models

Vicuna-7B, Vicuna-13B, PaLM, Claude2, OPT-2.7B, Llama-2-7B, GPT-3.5-Turbo

Metrics

Accuracy, compression rate, inference latency, API cost saved, token length

Datasets

CommonsenseQA (CSQA), GSM8K, MultiRC, TriviaQA-Long, BoolQ
