Use the frozen LLM itself to compress over-limit prompts into memory tokens at 1/12 the length

May 27, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method shows consistent ROUGE gains on multiple datasets and keeps the large model frozen, which lowers training cost. Evidence comes from in-domain and out-of-domain comparisons, but all results are from experiments on 7B backbones and public datasets.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Jun Gao, Ziqiang Cao, Wenjie Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SelfCP lets you fit much longer context into an existing LLM by training a tiny adapter instead of re-training the whole model, cutting memory and inference costs while often improving output quality on summarization and QA.

Who Should Care

Summary TLDR

SelfCP uses a frozen target LLM to compress long prompts into compact 'memory tokens' via a small trainable connector and a learnable memory tag. The method trains only ~17M parameters (connector + special embedding), keeps the LLM frozen, and replaces over-limit prompt tokens with dense memory tokens at up to 12× compression. Across English and Chinese benchmarks (summarization, QA, legal verdict generation), SelfCP raises ROUGE scores over naively truncated inputs and matches or beats other prompt-compression baselines, while reducing GPU memory needs and enabling larger few-shot contexts via caching.

Problem Statement

Transformer LLMs choke when prompts exceed their context window. Long inputs (documents to summarize, many few-shot demonstrations) must either be truncated (losing information) or require costly model changes. We need a cheap, general way to compress over-limit prompts so LLMs can read more context without retraining or heavy compute.

Main Contribution

Introduce SelfCP: use the frozen target LLM as both compressor and generator, and train only a small connector and a memory-tag embedding.

Compress over-limit prompts into dense memory tokens, each standing in for up to 12 original tokens, while preserving or improving generation quality.

Key Findings

SelfCP compresses long prompts to 1/12 of original tokens and uses those memory tokens in generation.

Numbers: 12× compression ratio used by default

Practical Use: You can replace the over-limit part of a prompt with memory tokens one twelfth its length, so the prompt fits within the model window and keeps more context without retraining the LLM.

Evidence Ref: abstract; section 5.1
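To make the 12× figure concrete, here is a minimal sketch of the token arithmetic. The function names and the example sizes (512 kept tokens, 1,536 over-limit tokens) are illustrative assumptions, not values from the paper; the sketch also assumes the memory-token count rounds up.

```python
import math


def memory_token_count(over_limit_tokens: int, ratio: int = 12) -> int:
    """Memory tokens needed for the over-limit part of a prompt.

    Assumes each memory token stands in for up to `ratio` original tokens
    and that a partial group still costs one memory token (round up).
    """
    if over_limit_tokens < 0 or ratio <= 0:
        raise ValueError("token count must be >= 0 and ratio must be > 0")
    return math.ceil(over_limit_tokens / ratio)


def generator_input_length(kept_tokens: int, over_limit_tokens: int, ratio: int = 12) -> int:
    """Tokens the generator sees: uncompressed part plus memory tokens."""
    return kept_tokens + memory_token_count(over_limit_tokens, ratio)


# Hypothetical example: 512 tokens stay as-is, 1536 tokens are over the limit.
print(memory_token_count(1536))           # 128
print(generator_input_length(512, 1536))  # 640
```

In this hypothetical case a 2,048-token prompt reaches the generator as 640 positions.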

Small additional training cost: only 17M new parameters while keeping a 7B backbone frozen.

Numbers: 17M trainable params (~0.24% of 7B)

Practical Use: Train a tiny adapter instead of fine-tuning the whole model to get prompt-compression gains with low GPU/time cost.

Evidence Ref: section 4.6; efficiency analysis
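The ~0.24% figure follows directly from the two parameter counts. A quick check, assuming a nominal 7,000,000,000-parameter backbone (real 7B checkpoints differ slightly in exact size):

```python
TRAINABLE_PARAMS = 17_000_000      # connector + memory-tag embedding, as reported
BACKBONE_PARAMS = 7_000_000_000    # assumption: nominal size of a "7B" backbone


def trainable_percentage(trainable: int, backbone: int) -> float:
    """Trainable parameters as a percentage of the frozen backbone."""
    return 100.0 * trainable / backbone


pct = trainable_percentage(TRAINABLE_PARAMS, BACKBONE_PARAMS)
print(f"{pct:.2f}%")  # 0.24%
```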

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ROUGE-1 | 30.5 | Vicuna 19.9 | +10.6 | XSUM (in-domain), Vicuna-7b | SelfCP R-1 30.5 vs Vicuna 19.9 | Table 2
ROUGE-1 | 33.3 | Vicuna 17.3 | +16.0 | CICERO (in-domain), Vicuna-7b | SelfCP R-1 33.3 vs Vicuna 17.3 | Table 2
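The Delta values are the SelfCP score minus the truncated-input Vicuna baseline. The short check below recomputes them from the two rows above; the relative-gain figure is derived here for convenience and is not reported in the source.

```python
# (dataset, SelfCP ROUGE-1, Vicuna baseline ROUGE-1) taken from the rows above
ROWS = [
    ("XSUM", 30.5, 19.9),
    ("CICERO", 33.3, 17.3),
]


def delta(value: float, baseline: float) -> float:
    """Absolute gain in ROUGE points, rounded to one decimal place."""
    return round(value - baseline, 1)


def relative_gain(value: float, baseline: float) -> float:
    """Gain as a percentage of the baseline (derived, not from the paper)."""
    return round(100.0 * (value - baseline) / baseline, 1)


for name, value, baseline in ROWS:
    print(name, delta(value, baseline), relative_gain(value, baseline))
```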

What To Try In 7 Days

Prototype SelfCP: freeze your 7B model, add a linear connector + memory-tag embedding (~17M params), and train on mixed long-text samples.

Use the 12× compression default. Measure ROUGE on a held-out summarization or QA set to compare against truncation.

Add caching (MDB) for repeated demonstrations to speed up few-shot pipelines.
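For the measurement step, a simplified ROUGE-1 F1 is enough for a first comparison between compressed and truncated runs. The sketch below is an assumption-heavy stand-in (lowercased whitespace tokens, no stemming), and the example strings are toy data; for numbers comparable to the paper, use an established ROUGE package instead.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_rouge1(candidates: List[str], references: List[str]) -> float:
    """Average simplified ROUGE-1 F1 over paired outputs and references."""
    if not candidates or len(candidates) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    scores = [rouge1_f1(c, r) for c, r in zip(candidates, references)]
    return sum(scores) / len(scores)


# Toy data: one reference, one output per system.
reference = "the court fined the company"
compressed_run = "the court fined the firm"
truncated_run = "the court met"

print(rouge1_f1(compressed_run, reference))
print(rouge1_f1(truncated_run, reference))
```

Run both systems over the same held-out set and compare `mean_rouge1` for each.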

Optimization Features

Token Efficiency
Achieves up to 12× token compression while preserving key content
Infra Optimization
Lower GPU memory pressure enables more few-shot demonstrations on limited GPUs
Model Optimization
Keep backbone frozen; train small adapter (17M params)
System Optimization
Parallelize compression and projection steps; caching reduces per-query cost
Training Optimization
Train only connector and memory-tag embedding under LM objective
Inference Optimization
Replace over-limit tokens with memory tokens at 12× compression to reduce generator input length; cache compressed demonstrations (MDB) to avoid repeated compression
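The caching idea can be sketched as a content-addressed lookup: compress each demonstration once and reuse the result. This is a hypothetical illustration, not the paper's MDB implementation; `CompressionCache` and `fake_compress` are made-up names, and `fake_compress` merely stands in for the frozen LLM plus connector.

```python
import hashlib
from typing import Callable, Dict, List


class CompressionCache:
    """Stores compressed demonstrations keyed by a hash of their text."""

    def __init__(self, compress_fn: Callable[[str], List[float]]) -> None:
        self._compress = compress_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the compressor actually ran

    def get(self, demonstration: str) -> List[float]:
        key = hashlib.sha256(demonstration.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._compress(demonstration)
            self.misses += 1
        return self._store[key]


def fake_compress(text: str) -> List[float]:
    # Placeholder only: a real system would return memory-token vectors.
    return [float(len(text))]


cache = CompressionCache(fake_compress)
demos = ["Q: 2+2? A: 4", "Q: capital of France? A: Paris", "Q: 2+2? A: 4"]
memory = [cache.get(d) for d in demos]
print(cache.misses)  # 2 (the repeated demonstration is served from the cache)
```

Keying on a content hash means identical demonstrations share one entry regardless of which query they appear in.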

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

XSUM, CICERO, DUC 2007, ARXIV, CoLA (public datasets referenced in paper)

Risks & Boundaries

Limitations

Compression beyond 12× degrades quality; 16× shows significant drops in experiments.

Method was evaluated on 7B-class backbones; behavior on much larger models is untested.

When Not To Use

When you need the original text preserved verbatim at the token level (for example, legal text) and any compression-induced change is unacceptable.

When you plan to fully fine-tune the backbone anyway, since full fine-tuning may address long-context needs differently.

Failure Modes

Over-compression (≫12×) strips critical facts and hurts generation.

Connector misalignment: if the connector is poorly trained, memory tokens can be unreadable to the frozen generator.

Core Entities

Models

Vicuna-7b, BlueLM-7b, Llama2-7b (comparison baselines)

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, Accuracy, GPU hours / TFLOPs / TMACs, Throughput (iter/s), Memory (GB)

Datasets

XSUM, CICERO, DUC 2007, CLCV (Chinese verdict dataset), ARXIV, CoLA, SUPER-NI / instruction mix (training pool)

Benchmarks

ROUGE-1, ROUGE-2, ROUGE-L, Accuracy