Compress long prompts into short natural-language 'Capsule Prompts' that cut cost and latency while keeping accuracy

February 28, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent cost and latency benefits and reasonable transfer to other LLMs on evaluated datasets, but results are limited to the datasets and LLMs tested and depend on reward design and length tuning.

Citations: 1

Evidence Strength: 0.65

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 50%

Authors

Yu-Neng Chuang, Tianwei Xing, Chia-Yuan Chang, Zirui Liu, Xun Chen, Xia Hu

Links

Abstract / PDF / Data

Why It Matters For Business

Compress prompts into short natural-language pieces to cut API token costs and inference time without retraining target LLMs; this reduces cloud bills and speeds up batch processing.

Summary TLDR

This paper introduces Nano-Capsulator, a method that trains a generator to turn long prompts into short, natural-language 'Capsule Prompts' using a semantics-preserving loss plus a reward that enforces downstream utility and length constraints. On four benchmarks (CSQA, GSM8K, MultiRC, TriviaQA-Long), Capsule Prompts reduced input length by ~81%, cut API cost by up to 80.1%, and sped up inference by up to 4.5× while largely retaining accuracy. The compressor is trained once (Vicuna-7B + LoRA), and the resulting NL prompts transfer to other LLMs (Vicuna-13B, PaLM, Claude2) and to unseen but similar datasets.

Problem Statement

Soft prompts compress context but fail to transfer across different LLMs (especially API models). Compressing into natural language is desirable for transferability but is hard because (1) text is discrete so you cannot backpropagate easily, and (2) generated NL summaries need reliable length control without losing task utility.

Main Contribution

A training framework (Nano-Capsulator) that produces NL-formatted Capsule Prompts by optimizing a semantics-preserving loss together with a reward that enforces downstream utility and strict length constraints.

Demonstration that Capsule Prompts compress inputs by up to 81.4%, cut API cost by up to 80.1%, and reduce inference latency by up to 4.5× while generally retaining accuracy on several LLMs and datasets.
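To make the first contribution above more concrete, here is a minimal sketch of how a semantic loss might be combined with a utility reward that is gated by a length budget. This summary does not give the paper's formula, so everything here is an assumption made for illustration: the function names `length_gated_reward` and `capsule_objective`, the hard length cutoff, the `penalty_weight` parameter, and the multiplicative form of the combination.

```python
def length_gated_reward(utility: float, n_tokens: int, max_tokens: int) -> float:
    """Hypothetical reward: keep the downstream utility score (assumed in [0, 1])
    only when the capsule fits the length budget; otherwise return 0."""
    if n_tokens > max_tokens:
        return 0.0
    return utility


def capsule_objective(semantic_loss: float, reward: float,
                      penalty_weight: float = 1.0) -> float:
    """Hypothetical combined objective (not the paper's definition).

    The semantic loss is left unchanged when reward == 1 and is scaled up
    as the reward falls toward 0, so low-utility or over-length capsules
    are penalised more heavily.
    """
    return semantic_loss * (1.0 + penalty_weight * (1.0 - reward))


if __name__ == "__main__":
    r_ok = length_gated_reward(utility=0.8, n_tokens=50, max_tokens=64)
    r_long = length_gated_reward(utility=0.8, n_tokens=70, max_tokens=64)
    print(capsule_objective(2.0, r_ok), capsule_objective(2.0, r_long))
```

The point of the sketch is only the shape of the trade-off: a capsule that exceeds the budget earns no reward regardless of how useful it is, which pushes the generator toward outputs that are both short and useful.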

Key Findings

Capsule Prompts reduce input length by about 81.4% on evaluated tasks.

Numbers: 81.4% compression rate (reported in Table 1 and text)

Practical Use: Shorter NL prompts let you fit more cases in memory, enable bigger batches, and reduce token-related costs when sending inputs to LLMs.

Evidence Ref: Section 4.3; Table 1
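For reference, a compression rate like the one above can be computed from token counts as one minus the ratio of compressed to original length. The sketch below assumes you already have token counts from your target model's tokenizer; the function names and the example counts are illustrative and not taken from the paper.

```python
from typing import Sequence


def compression_rate(original_tokens: int, capsule_tokens: int) -> float:
    """Fraction of input tokens removed: 1 - capsule / original."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - capsule_tokens / original_tokens


def corpus_compression_rate(originals: Sequence[int],
                            capsules: Sequence[int]) -> float:
    """Compression rate over a whole set of prompts, weighted by length."""
    if len(originals) != len(capsules):
        raise ValueError("need one capsule count per original count")
    return compression_rate(sum(originals), sum(capsules))


if __name__ == "__main__":
    # Illustrative counts only: a 1000-token prompt compressed to 186 tokens.
    print(round(compression_rate(1000, 186), 3))
```

Averaging per-prompt rates and computing one rate over summed token counts can give different numbers, so it is worth stating which one you report.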

API costs for input tokens dropped by up to 80.1% when using Capsule Prompts.

Numbers: Up to 80.1% cost saved on Claude2/PaLM (Table 2, Appendix B)

Practical Use: If you pay per-token or per-request on API LLMs, compressing prompts into NL Capsule Prompts can produce large direct cost savings.

Evidence Ref: Table 2; Appendix B
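A rough way to estimate the input-side saving for your own workload is sketched below. It assumes simple per-token input pricing; the function names, the price, and the token counts are placeholders rather than any provider's actual rates.

```python
def input_cost(n_tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of sending n_tokens of input at a flat per-1k-token price."""
    return n_tokens / 1000 * usd_per_1k_tokens


def cost_saved_fraction(original_tokens: int, capsule_tokens: int,
                        usd_per_1k_tokens: float) -> float:
    """Fraction of input cost saved by sending the capsule instead."""
    before = input_cost(original_tokens, usd_per_1k_tokens)
    after = input_cost(capsule_tokens, usd_per_1k_tokens)
    if before <= 0:
        raise ValueError("original cost must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # Placeholder price and token counts, for illustration only.
    print(round(cost_saved_fraction(1000, 199, usd_per_1k_tokens=0.008), 3))
```

Under flat per-token pricing the price cancels out, so the saved fraction equals the reduction in input tokens. Output tokens are billed separately and are not reduced by compressing the prompt.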

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Compression rate | 81.4% (input length reduced) | original prompts | -81.4% | reported across CSQA, GSM8K, MultiRC, TriviaQA-Long | Table 1; Section 4.3 | Table 1 |
| API cost saved | up to 80.1% saved | original prompts on Claude2/PaLM | -80.1% | CSQA, GSM8K, MultiRC, TriviaQA-Long | Table 2; Appendix B | Table 2 |

What To Try In 7 Days

Run Nano-Capsulator-style compression on one long-prompt workflow and measure API cost and latency before/after.

Train a small compressor (Vicuna-7B + LoRA) on 1–2k examples and reuse the generated NL prompts on your target API LLM to test transferability.

Tune the Capsule Prompt length constraint for your target LLM to balance speed and accuracy.
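The length-tuning step above could be run as a small sweep: evaluate each candidate length budget, record accuracy and wall-clock time, and keep the smallest budget whose accuracy stays within a tolerance of the uncompressed baseline. This is a sketch under assumptions: `evaluate` is a hypothetical callable you supply (it should compress your evaluation set to the given budget, query the target LLM, and return accuracy), and the 0.02 tolerance is an arbitrary default.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

Row = Tuple[int, float, float]  # (budget, accuracy, seconds)


def sweep_length_budgets(
    budgets: Iterable[int],
    evaluate: Callable[[int], float],
    baseline_accuracy: float,
    max_drop: float = 0.02,
) -> Tuple[Optional[int], List[Row]]:
    """Time `evaluate` at each budget and pick the smallest acceptable one.

    Returns (chosen_budget, rows); chosen_budget is None when no budget
    keeps accuracy within `max_drop` of `baseline_accuracy`.
    """
    rows: List[Row] = []
    for budget in sorted(budgets):
        start = time.perf_counter()
        accuracy = evaluate(budget)
        rows.append((budget, accuracy, time.perf_counter() - start))

    threshold = baseline_accuracy - max_drop
    for budget, accuracy, _ in rows:  # rows are in ascending budget order
        if accuracy >= threshold:
            return budget, rows
    return None, rows


if __name__ == "__main__":
    # Stand-in for a real evaluation run; these accuracies are made up.
    fake_scores = {32: 0.60, 64: 0.71, 128: 0.73}
    chosen, table = sweep_length_budgets(fake_scores, fake_scores.__getitem__, 0.72)
    print(chosen, [(b, a) for b, a, _ in table])
```

Keeping the full table of rows, not just the chosen budget, makes it easier to see how quickly accuracy degrades as the budget shrinks for a given target LLM.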

Optimization Features

Token Efficiency: Natural-language prompt compression, Truncation-based length enforcement
Infra Optimization: Lower API token usage
System Optimization: Reduced memory footprint enables larger batches
Training Optimization: LoRA
Inference Optimization: Prompt Compression, Token Budgeting, Batch-size enablement
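Truncation-based length enforcement, as listed above, can be illustrated with a hard cutoff at a token budget. The sketch below splits on whitespace purely for simplicity; that tokenization and the function name `truncate_to_budget` are assumptions, and a real pipeline would count tokens with the target LLM's own tokenizer.

```python
def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep at most `max_tokens` whitespace-separated tokens of `text`.

    Note: runs of whitespace are collapsed to single spaces as a side
    effect of splitting and re-joining.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


if __name__ == "__main__":
    capsule = "Tom has 12 apples and gives 5 to Ann; how many remain?"
    print(truncate_to_budget(capsule, 6))
```

A hard cutoff guarantees the budget is met but can drop whatever sits at the end of the text, which is one reason the length constraint has to be tuned rather than set arbitrarily low.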

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: No
License: Unknown

Data URLs

CommonsenseQA, GSM8K, MultiRC, TriviaQA-Long, BoolQ

Risks & Boundaries

Limitations

Requires training the compressor on similar-domain prompts for best results; transfer is not guaranteed for very different domains.

Length constraint choice matters and must be tuned per target LLM; too-short prompts can lose critical logic.

When Not To Use

When prompts are already short and token cost is not a concern.

When exact token-level wording must be preserved (e.g., legal text).

Failure Modes

Over-compression that drops essential details and reduces accuracy.

Reward mis-specification that optimizes for surrogate scoring but harms real-task performance.
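One cheap guard against the over-compression failure mode is to check that specific details from the original prompt still appear in the capsule. The sketch below checks only numeric literals, which matter for arithmetic tasks such as GSM8K. It is a crude heuristic introduced here for illustration, not part of the paper's method, and the function name `missing_numbers` is hypothetical.

```python
import re
from typing import Set

# Matches integers and simple decimals such as "12" or "3.5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def missing_numbers(original: str, capsule: str) -> Set[str]:
    """Return numeric literals present in `original` but absent from `capsule`."""
    return set(_NUMBER.findall(original)) - set(_NUMBER.findall(capsule))


if __name__ == "__main__":
    original = "Tom has 12 apples and buys 3.5 kg more."
    capsule = "Tom: 12 apples, buys more."
    print(missing_numbers(original, capsule))
```

A non-empty result does not prove the capsule is wrong, since a number may be legitimately irrelevant, but it is a useful signal for flagging capsules to review or for falling back to the original prompt.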

Core Entities

Models

Vicuna-7B, Vicuna-13B, PaLM, Claude2, OPT-2.7B, Llama-2-7B, GPT-3.5-Turbo

Metrics

Accuracy, compression rate, inference latency, API cost saved, token length

Datasets

CommonsenseQA (CSQA), GSM8K, MultiRC, TriviaQA-Long, BoolQ