Compress long prompts into short natural-language 'Capsule Prompts' that cut cost and latency while keeping accuracy

February 28, 2024 · 8 min

Overview

Production Readiness: 0.6

Novelty Score: 0.5

Cost Impact Score: 0.8

Citation Count: 1

Authors

Yu-Neng Chuang, Tianwei Xing, Chia-Yuan Chang, Zirui Liu, Xun Chen, Xia Hu

Why It Matters For Business

Compress prompts into short natural-language pieces to cut API token costs and inference time without retraining target LLMs; this reduces cloud bills and speeds up batch processing.

Summary TLDR

This paper introduces Nano-Capsulator, a method that trains a generator to turn long prompts into short, natural-language 'Capsule Prompts' using a semantics-preserving loss plus a reward that enforces downstream utility and length constraints. On benchmarks (CSQA, GSM8K, MultiRC, TriviaQA-Long), Capsule Prompts reduced input length by about 81%, cut API cost by up to 80.1%, and sped up inference by up to 4.5× while largely retaining accuracy. The compressor is trained once (Vicuna-7B with LoRA), and the resulting natural-language prompts transfer to other LLMs (Vicuna-13B, PaLM, Claude2) and to unseen but similar datasets.

Problem Statement

Soft prompts compress context but fail to transfer across different LLMs, especially API-based models. Compressing into natural language is attractive for transferability but hard for two reasons: (1) text is discrete, so gradients cannot be backpropagated through the generated tokens directly, and (2) generated NL summaries need reliable length control without losing task utility.

Main Contribution

A training framework (Nano-Capsulator) that produces NL-formatted Capsule Prompts by optimizing a semantics-preserving loss together with a reward that enforces downstream utility and strict length constraints.

Demonstration that Capsule Prompts compress inputs by up to 81.4%, cut API cost by up to 80.1%, and speed up inference by up to 4.5× while generally retaining accuracy on several LLMs and datasets.

Evidence that the generated NL prompts transfer across LLMs and to unseen datasets with similar tasks, and that the approach works for both few-shot chain-of-thought examples and long reading passages.
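How the two training signals might fit together can be sketched in a few lines. This is a minimal illustration, not the paper's formulation: the subtraction-based combination, the `reward_weight` parameter, whitespace tokenization, and the toy `keyword_utility` scorer are all assumptions introduced here. In the actual framework the utility signal would come from a frozen LLM rather than keyword matching.

```python
from typing import Callable, List


def truncate(tokens: List[str], budget: int) -> List[str]:
    """Hard length constraint: keep at most `budget` tokens."""
    return tokens[:budget]


def keyword_utility(tokens: List[str], required: List[str]) -> float:
    """Toy stand-in for a downstream-utility score in [0, 1].

    Assumption: a real setup would score the capsule with a frozen LLM.
    """
    if not required:
        return 1.0
    present = set(tokens)
    return sum(word in present for word in required) / len(required)


def capsule_objective(
    semantic_loss: float,
    capsule: str,
    budget: int,
    utility_fn: Callable[[List[str]], float],
    reward_weight: float = 1.0,
) -> float:
    """Hypothetical objective: semantic loss minus weighted utility reward.

    The reward is computed on the truncated capsule, so content that
    falls outside the length budget earns nothing.
    """
    kept = truncate(capsule.split(), budget)
    return semantic_loss - reward_weight * utility_fn(kept)


capsule = "add the two prices then subtract the discount"
required = ["add", "subtract", "discount"]


def score(tokens: List[str]) -> float:
    return keyword_utility(tokens, required)


# Same capsule, two budgets: the tighter budget cuts off key steps.
full = capsule_objective(0.5, capsule, budget=8, utility_fn=score)
tight = capsule_objective(0.5, capsule, budget=4, utility_fn=score)
```

In this toy setup the tighter budget yields a worse (higher) objective because the truncated capsule loses two of the three required steps, which is the kind of pressure that would push a generator to keep essential content within the length limit.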

Key Findings

Capsule Prompts reduce input length by about 81.4% on evaluated tasks.

Numbers: 81.4% compression rate (reported in Table 1 and text)

API costs for input tokens dropped by up to 80.1% when using Capsule Prompts.

Numbers: Up to 80.1% cost saved on Claude2/PaLM (Table 2, Appendix B)

Inference ran up to 4.5× faster with compressed prompts on evaluated models and batch sizes.

Numbers: 2.1×–4.5× speedup (Figures 8–9; Section 4.6)

Capsule Prompts generally preserve task accuracy and transfer across LLMs and similar unseen datasets.

Training cost is modest: about 8 hours for few-shot CoT and 4 hours for reading-comprehension compression on two A40 GPUs using LoRA.

Numbers: ≈8h (CoT), ≈4h (reading); 2× NVIDIA A40 GPUs with LoRA (Appendix D)
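The compression, cost, and latency figures above are simple ratios, and a short sketch makes the relationships explicit. The token counts, price, and latencies below are made-up illustrative inputs, not measurements from the paper, and the function names are hypothetical.

```python
def compression_rate(original_tokens: int, capsule_tokens: int) -> float:
    """Fraction of input length removed by compression."""
    return 1.0 - capsule_tokens / original_tokens


def input_cost(tokens: int, price_per_1k_tokens: float) -> float:
    """Input-side cost under flat per-token pricing (an assumption)."""
    return tokens / 1000.0 * price_per_1k_tokens


def cost_saved(
    original_tokens: int, capsule_tokens: int, price_per_1k_tokens: float
) -> float:
    """Fraction of input cost saved by sending the capsule instead."""
    before = input_cost(original_tokens, price_per_1k_tokens)
    after = input_cost(capsule_tokens, price_per_1k_tokens)
    return (before - after) / before


def speedup(latency_before_s: float, latency_after_s: float) -> float:
    """How many times faster the compressed prompt runs."""
    return latency_before_s / latency_after_s


# Illustrative inputs only: a 1000-token prompt compressed to 186 tokens.
rate = compression_rate(1000, 186)
saved = cost_saved(1000, 186, price_per_1k_tokens=0.008)
gain = speedup(9.0, 2.0)
```

Under flat per-token pricing the fraction of input cost saved equals the compression rate regardless of the price. The reported 80.1% cost saving is close to, but not identical to, the 81.4% compression rate; this summary does not explain the gap.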

Results

Compression rate
Value: 81.4% (input length reduced)
Baseline: original prompts

API cost saved
Value: up to 80.1% saved
Baseline: original prompts on Claude2/PaLM

Inference latency
Value: 2.1×–4.5× speedup
Baseline: original prompts

Accuracy
Value: mostly similar to original prompts
Baseline: original prompts

Training time
Value: ≈8h (CoT), ≈4h (reading)
Baseline: n/a

Who Should Care

What To Try In 7 Days

Run Nano-Capsulator-style compression on one long-prompt workflow and measure API cost and latency before/after.

Train a small compressor (Vicuna-7B + LoRA) on 1–2k examples and reuse the generated NL prompts on your target API LLM to test transferability.

Tune the Capsule Prompt length constraint for your target LLM to balance speed and accuracy.
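Tuning the length constraint can be framed as a small sweep: evaluate a few budgets and keep the smallest one whose accuracy stays within a tolerance of the uncompressed baseline. The sketch below assumes a user-supplied `evaluate(budget)` callable that returns task accuracy for capsules of that length; that callable, the `pick_budget` helper, and the sample numbers are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def pick_budget(
    budgets: Iterable[int],
    evaluate: Callable[[int], float],
    baseline_accuracy: float,
    tolerance: float = 0.02,
) -> Optional[int]:
    """Return the smallest budget whose accuracy is within `tolerance`
    of the baseline, or None if no budget qualifies."""
    for budget in sorted(budgets):
        if evaluate(budget) >= baseline_accuracy - tolerance:
            return budget
    return None


# Made-up accuracies per token budget, standing in for real evaluations.
fake_results = {50: 0.61, 100: 0.70, 200: 0.74, 400: 0.75}

chosen = pick_budget(
    fake_results,
    fake_results.__getitem__,
    baseline_accuracy=0.75,
    tolerance=0.02,
)
```

In practice `evaluate` would run the target LLM on a held-out set with capsules truncated to the given budget, so the sweep should be repeated for each target model.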

Optimization Features

Token Efficiency

  • Natural-language prompt compression
  • Truncation-based length enforcement

Infra Optimization

  • Lower API token usage

System Optimization

  • Reduced memory footprint enables larger batches

Training Optimization

  • LoRA

Inference Optimization

  • Prompt Compression
  • Token Budgeting
  • Batch-size enablement

Reproducibility

Data URLs

  • CommonsenseQA
  • GSM8K
  • MultiRC
  • TriviaQA-Long
  • BoolQ

Data Available

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Requires training the compressor on similar-domain prompts for best results; transfer is not guaranteed for very different domains.
  • Length constraint choice matters and must be tuned per target LLM; too-short prompts can lose critical logic.
  • Reward function depends on frozen LLM scoring and may bias compression toward that model's behavior.
  • No public code or deployment recipes are provided in the text.

When Not To Use

  • When prompts are already short and token cost is not a concern.
  • When exact token-level wording must be preserved (e.g., legal text).
  • When you cannot run the reward/evaluation step on a representative frozen LLM.

Failure Modes

  • Over-compression that drops essential details and reduces accuracy.
  • Reward mis-specification that optimizes for surrogate scoring but harms real-task performance.
  • Poor out-of-domain transfer when dataset or task differs from training data.

Core Entities

Models

  • Vicuna-7B
  • Vicuna-13B
  • PaLM
  • Claude2
  • OPT-2.7B
  • Llama-2-7B
  • GPT-3.5-Turbo

Metrics

  • Accuracy
  • compression rate
  • inference latency
  • API cost saved
  • token length

Datasets

  • CommonsenseQA (CSQA)
  • GSM8K
  • MultiRC
  • TriviaQA-Long
  • BoolQ