Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
Compress prompts into short natural-language pieces to cut API token costs and inference time without retraining target LLMs; this reduces cloud bills and speeds up batch processing.
Summary TLDR
This paper introduces Nano-Capsulator, a method that trains a generator to turn long prompts into short, natural-language 'Capsule Prompts' using a semantics-preserving loss plus a reward that enforces downstream utility and length constraints. On benchmarks (CSQA, GSM8K, MultiRC, TriviaQA-Long) Capsule Prompts reduced input length by ~81%, cut API cost by up to 80.1%, and sped up inference up to 4.5× while largely retaining accuracy. The compressor is trained once (Vicuna-7B + LoRA) and the resulting NL prompts transfer to other LLMs (Vicuna-13B, PaLM, Claude2) and to unseen but similar datasets.
Problem Statement
Soft prompts compress context but do not transfer across different LLMs, especially API-only models. Compressing into natural language is desirable for transferability but is hard for two reasons: (1) text is discrete, so gradients cannot be backpropagated through the generated tokens directly, and (2) generated NL summaries need reliable length control without losing task utility.
Main Contribution
A training framework (Nano-Capsulator) that produces NL-formatted Capsule Prompts by optimizing a semantics-preserving loss together with a reward that enforces downstream utility and strict length constraints.
Demonstration that Capsule Prompts compress inputs by up to 81.4%, cut API cost up to 80.1%, and reduce inference latency up to 4.5× while generally retaining accuracy on several LLMs and datasets.
Evidence that the generated NL prompts transfer across LLMs and to unseen datasets with similar tasks, and that the approach works for both few-shot chain-of-thought examples and long reading passages.
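The summary does not give the exact form of the loss or the reward, so the following is only a toy sketch of how a semantics-preserving loss and a length-gated utility reward might be combined. The function names, the `weight` parameter, the subtraction, and the hard zero for over-length capsules are all assumptions made for illustration, not the paper's formulation.

```python
def length_gated_reward(utility: float, n_tokens: int, max_tokens: int) -> float:
    """Hypothetical reward: keep the utility score only if the capsule fits the budget."""
    if n_tokens > max_tokens:
        return 0.0  # assumption: over-length capsules earn no reward
    return utility


def capsule_objective(semantic_loss: float, utility: float,
                      n_tokens: int, max_tokens: int,
                      weight: float = 1.0) -> float:
    """Hypothetical training signal (lower is better): semantic loss minus weighted reward."""
    reward = length_gated_reward(utility, n_tokens, max_tokens)
    return semantic_loss - weight * reward


# A capsule within budget is rewarded; one over budget is not.
within = capsule_objective(semantic_loss=1.0, utility=0.5, n_tokens=50, max_tokens=100)
over = capsule_objective(semantic_loss=1.0, utility=0.5, n_tokens=150, max_tokens=100)
print(within, over)
```

In this toy version, a capsule that exceeds the budget gets no credit for downstream utility, which pushes the generator toward short outputs that still help the target model.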
Key Findings
Capsule Prompts reduce input length by about 81.4% on evaluated tasks.
API costs for input tokens dropped by up to 80.1% when using Capsule Prompts.
Inference latency decreased by up to 4.5× with compressed prompts on the evaluated models and batch sizes.
Capsule Prompts generally preserve task accuracy and transfer across LLMs and similar unseen datasets.
Training cost is modest: about 8 hours for few-shot CoT and 4 hours for reading-comprehension compression on two A40 GPUs using LoRA.
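To make the compression and cost figures concrete, here is a small arithmetic sketch. The token counts, request volume, and per-1k-token price are made-up illustrative inputs, and the helper names are hypothetical; only the 81.4% figure comes from the findings above.

```python
def compression_rate(original_tokens: int, capsule_tokens: int) -> float:
    """Fraction of input tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - capsule_tokens / original_tokens


def input_cost(tokens_per_request: int, requests: int, price_per_1k: float) -> float:
    """Total input-token cost for a batch of requests at a flat per-1k-token price."""
    return tokens_per_request * requests / 1000 * price_per_1k


# Illustrative numbers only: a 1000-token prompt compressed to 186 tokens.
rate = compression_rate(1000, 186)
before = input_cost(1000, requests=10_000, price_per_1k=0.001)
after = input_cost(186, requests=10_000, price_per_1k=0.001)
print(f"compression rate: {rate:.1%}")
print(f"input cost before: {before:.2f}, after: {after:.2f}")
```

Under flat per-token pricing the relative cost saving on the compressed portion equals the compression rate, so any gap between the two in practice would come from input tokens that are not compressed.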
Results
Compression rate
API cost saved
Inference latency
Accuracy
Training time
Who Should Care
What To Try In 7 Days
Run Nano-Capsulator-style compression on one long-prompt workflow and measure API cost and latency before/after.
Train a small compressor (Vicuna-7B + LoRA) on 1–2k examples and reuse the generated NL prompts on your target API LLM to test transferability.
Tune the Capsule Prompt length constraint for your target LLM to balance speed and accuracy.
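For the before/after latency measurement, a minimal timing harness could look like the sketch below. `call_llm` is a placeholder for whatever client function you already use, and `fake_llm` is a stand-in that merely sleeps in proportion to prompt length; neither is part of the paper.

```python
import time
from typing import Callable


def mean_latency(call_llm: Callable[[str], str], prompt: str, repeats: int = 5) -> float:
    """Average wall-clock seconds per call over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        call_llm(prompt)
    return (time.perf_counter() - start) / repeats


def speedup(call_llm: Callable[[str], str], original_prompt: str,
            capsule_prompt: str, repeats: int = 5) -> float:
    """Ratio of original-prompt latency to capsule-prompt latency."""
    before = mean_latency(call_llm, original_prompt, repeats)
    after = mean_latency(call_llm, capsule_prompt, repeats)
    return before / after


def fake_llm(prompt: str) -> str:
    """Stand-in model: pretends latency grows with the number of words."""
    time.sleep(0.001 * len(prompt.split()))
    return "ok"


ratio = speedup(fake_llm, "word " * 40, "word " * 8, repeats=2)
print(f"speedup: {ratio:.1f}x")
```

Swap `fake_llm` for a real client call and run the same harness on a representative sample of prompts, since latency on hosted APIs varies from call to call.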
Optimization Features
Token Efficiency
- Natural-language prompt compression
- Truncation-based length enforcement
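Truncation-based length enforcement can be pictured as a hard cap applied to the generated capsule. The sketch below splits on whitespace as a simplifying assumption; a real setup would count tokens with the target model's tokenizer, and `truncate_to_budget` is a hypothetical helper name.

```python
def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep at most `max_tokens` whitespace-separated tokens from the start of `text`."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = text.split()  # assumption: whitespace tokens stand in for model tokens
    return " ".join(tokens[:max_tokens])


capsule = "Tom has 3 apples and buys 2 more so he has 5 apples"
print(truncate_to_budget(capsule, 6))
```

A hard cap guarantees the length limit but can cut a capsule mid-thought, which is why the limit is paired with a reward on downstream utility.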
Infra Optimization
- Lower API token usage
System Optimization
- Reduced memory footprint enables larger batches
Training Optimization
- LoRA
Inference Optimization
- Prompt Compression
- Token Budgeting
- Batch-size enablement
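As a back-of-the-envelope illustration of batch-size enablement, the sketch below assumes a fixed per-batch input-token budget and ignores generated tokens, model weights, and padding. Both the budget and the helper name `max_batch_size` are hypothetical.

```python
def max_batch_size(token_budget: int, tokens_per_prompt: int) -> int:
    """Number of equal-length prompts that fit in a fixed input-token budget."""
    if tokens_per_prompt <= 0:
        raise ValueError("tokens_per_prompt must be positive")
    return token_budget // tokens_per_prompt


# Illustrative budget of 8000 input tokens per batch.
print(max_batch_size(8000, 1000))  # original 1000-token prompts
print(max_batch_size(8000, 186))   # 186-token capsules
```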
Reproducibility
Data Urls
- CommonsenseQA
- GSM8K
- MultiRC
- TriviaQA-Long
- BoolQ
Data Available
Open Source Status
- no
Risks & Boundaries
Limitations
- Requires training the compressor on similar-domain prompts for best results; transfer is not guaranteed for very different domains.
- Length constraint choice matters and must be tuned per target LLM; too-short prompts can lose critical logic.
- Reward function depends on frozen LLM scoring and may bias compression toward that model's behavior.
- No public code or deployment recipes are provided in the text.
When Not To Use
- When prompts are already short and token cost is not a concern.
- When exact token-level wording must be preserved (e.g., legal text).
- When you cannot run the reward/evaluation step on a representative frozen LLM.
Failure Modes
- Over-compression that drops essential details and reduces accuracy.
- Reward mis-specification that optimizes for surrogate scoring but harms real-task performance.
- Poor out-of-domain transfer when dataset or task differs from training data.
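One simple guard against over-compression is to sweep several length budgets and keep the shortest one whose accuracy stays within a tolerance of the uncompressed baseline. The sketch below assumes you have already measured accuracy per budget; the numbers, the `max_drop` tolerance, and the name `pick_budget` are illustrative.

```python
from typing import Dict, Optional


def pick_budget(accuracy_by_budget: Dict[int, float],
                baseline_accuracy: float,
                max_drop: float = 0.01) -> Optional[int]:
    """Return the smallest token budget whose accuracy drop is within `max_drop`."""
    acceptable = [
        budget
        for budget, accuracy in accuracy_by_budget.items()
        if baseline_accuracy - accuracy <= max_drop
    ]
    return min(acceptable) if acceptable else None


# Made-up measurements: token budget -> accuracy on a held-out set.
measured = {50: 0.60, 100: 0.71, 200: 0.72}
print(pick_budget(measured, baseline_accuracy=0.72, max_drop=0.02))
```

Returning `None` when no budget qualifies makes the failure explicit, so the pipeline can fall back to the original prompt.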
Core Entities
Models
- Vicuna-7B
- Vicuna-13B
- PaLM
- Claude2
- OPT-2.7B
- Llama-2-7B
- GPT-3.5-Turbo
Metrics
- Accuracy
- Compression rate
- Inference latency
- API cost saved
- Token length
Datasets
- CommonsenseQA (CSQA)
- GSM8K
- MultiRC
- TriviaQA-Long
- BoolQ

