Use the frozen LLM itself to compress over-limit prompts into memory tokens at 1/12 the original length

Overview

Decision Snapshot: Ready For Pilot

The method shows consistent ROUGE gains on multiple datasets and keeps the large model frozen, which lowers training cost. Evidence comes from in-domain and out-of-domain comparisons, but all results are from experiments on 7B backbones and public datasets.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Jun Gao, Ziqiang Cao, Wenjie Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SelfCP lets you fit much longer context into an existing LLM by training a tiny adapter instead of re-training the whole model, cutting memory and inference costs while often improving output quality on summarization and QA.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO

Summary TLDR

SelfCP uses a frozen target LLM to compress long prompts into compact 'memory tokens' via a small trainable connector and a learnable memory tag. The method trains only ~17M parameters (connector + special embedding), keeps the LLM frozen, and replaces the over-limit portion of a prompt with dense memory tokens at up to a 12× compression ratio. Across English and Chinese benchmarks (summarization, QA, legal verdict generation), SelfCP raises ROUGE scores over naively truncated inputs and matches or beats other prompt-compression baselines, while reducing GPU memory needs and enabling larger few-shot contexts via caching.

Problem Statement

Transformer LLMs choke when prompts exceed their context window. Long inputs (documents to summarize, many demonstrations) must either be truncated (losing information) or require costly model changes. We need a cheap, general way to compress over-limit prompts so LLMs can read more context without retraining or heavy compute.

Main Contribution

Introduce SelfCP: use the frozen target LLM as both compressor and generator, and train only a small connector and a memory-tag embedding.

Compress over-limit prompts into dense memory tokens at up to a 12× ratio (on average, each memory token stands in for about 12 original tokens), preserving or improving generation quality.

Key Findings

SelfCP compresses long prompts to 1/12 of original tokens and uses those memory tokens in generation.

Numbers: 12× compression ratio used by default

Practical Use: You can replace the over-limit part of a prompt with memory tokens at 1/12 of its length, so the prompt fits within the model window and keeps more context without retraining the LLM.

Evidence Ref: abstract; section 5.1
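The token arithmetic behind this finding can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `keep_budget` parameter (how many tokens are left uncompressed), and the choice to round the memory-token count up are all assumptions made for the example.

```python
import math


def memory_token_count(over_limit_tokens: int, ratio: int = 12) -> int:
    """Memory tokens needed to stand in for the over-limit span.

    Rounding up is an assumption of this sketch, so that a partial
    group of tokens still gets a memory token.
    """
    if over_limit_tokens < 0 or ratio < 1:
        raise ValueError("over_limit_tokens must be >= 0 and ratio >= 1")
    return math.ceil(over_limit_tokens / ratio)


def generator_input_length(prompt_tokens: int, keep_budget: int, ratio: int = 12) -> int:
    """Tokens the generator reads: uncompressed tokens plus memory tokens."""
    kept = min(prompt_tokens, keep_budget)
    over_limit = prompt_tokens - kept
    return kept + memory_token_count(over_limit, ratio)


# A 1536-token prompt with 512 tokens kept verbatim: the remaining
# 1024 tokens become ceil(1024 / 12) = 86 memory tokens.
print(generator_input_length(1536, keep_budget=512))  # 598
```

A prompt that already fits inside `keep_budget` passes through unchanged, because the over-limit span is empty.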

Small additional training cost: only 17M new parameters while keeping a 7B backbone frozen.

Numbers: 17M trainable params (~0.24% of 7B)

Practical Use: Train a tiny adapter instead of fine-tuning the whole model to get prompt-compression gains with low GPU/time cost.

Evidence Ref: section 4.6; efficiency analysis
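One way to see where a figure near 17M could come from is to count the parameters of a single square linear layer at the hidden width of a Llama-style 7B model (4096), plus one embedding vector for the memory tag. The layer shape and the bias term are assumptions of this sketch; the summary only states that the connector is small and that the total is about 17M.

```python
HIDDEN_SIZE = 4096                 # hidden width of Llama-style 7B models
BACKBONE_PARAMS = 7_000_000_000    # nominal "7B" backbone size


def connector_params(hidden: int, bias: bool = True) -> int:
    """Parameters of one hidden x hidden linear layer (assumed connector shape)."""
    return hidden * hidden + (hidden if bias else 0)


def trainable_params(hidden: int) -> int:
    """Assumed connector plus a single memory-tag embedding vector."""
    return connector_params(hidden) + hidden


total = trainable_params(HIDDEN_SIZE)
fraction = total / BACKBONE_PARAMS
print(total)                      # 16785408
print(f"{fraction * 100:.2f}%")   # 0.24%
```

Under these assumptions the count lands at roughly 16.8M, which is consistent with the reported ~17M and ~0.24% of a 7B backbone.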

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-1	30.5	Vicuna 19.9	+10.6	XSUM (in-domain) Vicuna-7b	SelfCP R-1 30.5 vs Vicuna 19.9	Table 2
ROUGE-1	33.3	Vicuna 17.3	+16.0	CICERO (in-domain) Vicuna-7b	SelfCP R-1 33.3 vs Vicuna 17.3	Table 2
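For readers unfamiliar with the metric in this table, ROUGE-1 measures unigram overlap between a generated text and a reference. The sketch below is a simplified F1 variant with lowercase whitespace tokenization; standard ROUGE toolkits apply their own tokenization and optional stemming, so their scores will differ, and the table's values appear to be on a 0 to 100 scale.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of the four reference words are matched: precision 1.0, recall 0.5.
score = rouge1_f1("the cat", "the cat sat down")
print(round(score * 100, 1))  # 66.7
```

The Delta column is the difference between two such scores, for example 30.5 - 19.9 = +10.6 points on XSUM.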

What To Try In 7 Days

Prototype SelfCP: freeze your 7B model, add a linear connector + memory-tag embedding (~17M params), and train on mixed long-text samples.

Use the 12× compression default. Measure ROUGE on a held-out summarization or QA set to compare against truncation.

Add caching (MDB) for repeated demonstrations to speed up few-shot pipelines.
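The caching step in the last item can be sketched as a content-addressed lookup, so each demonstration is compressed only once. The class name `DemonstrationCache` and the `fake_compress` stand-in are hypothetical; in a real pipeline the compress function would run the frozen LLM and the connector and return memory-token embeddings.

```python
import hashlib
from typing import Callable, Dict, List


class DemonstrationCache:
    """Caches compressed demonstrations keyed by a hash of their text."""

    def __init__(self, compress_fn: Callable[[str], List[float]]) -> None:
        self._compress = compress_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times compression actually ran

    def get(self, demonstration: str) -> List[float]:
        key = hashlib.sha256(demonstration.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compress(demonstration)
        return self._store[key]


def fake_compress(text: str) -> List[float]:
    """Placeholder for the real compressor (assumption for this sketch)."""
    return [float(len(text))]


cache = DemonstrationCache(fake_compress)
cache.get("Q: 2+2? A: 4")
cache.get("Q: 2+2? A: 4")   # served from the cache, no recompression
print(cache.misses)          # 1
```

Keying on a hash of the text means identical demonstrations reused across queries share one compressed entry.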

Optimization Features

Token Efficiency

Achieves up to 12× token compression while preserving key content

Infra Optimization

Lower GPU memory pressure enables more few-shot demonstrations on limited GPUs

Model Optimization

Keep backbone frozen; train small adapter (17M params)

System Optimization

Parallelize compression and projection steps; caching reduces per-query cost

Training Optimization

Train only connector and memory-tag embedding under LM objective

Inference Optimization

Replace 12× tokens with memory tokens to reduce generator input length

Cache compressed demonstrations (MDB) to avoid repeated compression

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/jungao1106/SelfCP

Data URLs

XSUM, CICERO, DUC 2007, ARXIV, CoLA (public datasets referenced in paper)

Risks & Boundaries

Limitations

Compression beyond 12× degrades quality; 16× shows significant drops in experiments.

Method was evaluated on 7B-class backbones; behavior on much larger models is untested.

When Not To Use

If you need lossless, token-level original text (legal verbatim text) where any compression-induced change is unacceptable.

When you plan to fully fine-tune the backbone anyway, since full fine-tuning may address long-context needs differently.

Failure Modes

Over-compression (≫12×) strips critical facts and hurts generation.

Connector misalignment: if connector is poorly trained, memory tokens can be unreadable to the frozen generator.
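A simple guard against the over-compression failure mode is to refuse any ratio above the 12× default. The sketch below is an illustrative check, not part of SelfCP: the function name, the `memory_slots` parameter, and the decision to raise an error rather than truncate are assumptions.

```python
import math

MAX_SAFE_RATIO = 12  # default ratio in the source; 16x is reported to degrade quality


def pick_ratio(over_limit_tokens: int, memory_slots: int,
               max_ratio: int = MAX_SAFE_RATIO) -> int:
    """Smallest integer ratio that fits the over-limit span into memory_slots.

    Raises ValueError when fitting would require compressing harder
    than max_ratio.
    """
    if memory_slots < 1:
        raise ValueError("memory_slots must be at least 1")
    needed = max(1, math.ceil(over_limit_tokens / memory_slots))
    if needed > max_ratio:
        raise ValueError(
            f"needs {needed}x compression, above the {max_ratio}x limit"
        )
    return needed


print(pick_ratio(1200, memory_slots=100))  # 12
print(pick_ratio(600, memory_slots=100))   # 6
```

When the check fails, the caller can allocate more memory slots or shorten the input instead of compressing past the tested range.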

Core Entities

Models

Vicuna-7b, BlueLM-7b, Llama2-7b (comparison baselines)

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, Accuracy, GPU hours / TFLOPs / TMACs, Throughput (iter/s), Memory (GB)

Datasets

XSUM, CICERO, DUC 2007, CLCV (Chinese verdict dataset), ARXIV, CoLA, SUPER-NI / instruction mix (training pool)

Benchmarks

ROUGE-1, ROUGE-2, ROUGE-L, Accuracy
