Overview
The method shows consistent ROUGE gains on multiple datasets and keeps the large model frozen, which lowers training cost. Evidence comes from in-domain and out-of-domain comparisons, but it is limited to experiments on 7B backbones and public datasets.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
SelfCP lets you fit much longer context into an existing LLM by training a tiny adapter instead of re-training the whole model, cutting memory and inference costs while often improving output quality on summarization and QA.
Who Should Care
Summary TLDR
SelfCP uses a frozen target LLM to compress long prompts into compact 'memory tokens' via a small trainable connector and a learnable memory tag. The method trains only ~17M parameters (connector + special embedding), keeps the LLM frozen, and replaces the over-limit portion of a prompt with dense memory tokens at compression ratios of up to 12×. Across English and Chinese benchmarks (summarization, QA, legal verdict generation), SelfCP raises ROUGE scores over naively truncated inputs and matches or beats other prompt-compression baselines, while reducing GPU memory needs and enabling larger few-shot contexts via caching.
Problem Statement
Transformer LLMs choke when prompts exceed their context window. Long inputs (documents to summarize, many demonstrations) must either be truncated, which loses information, or handled through costly model changes. We need a cheap, general way to compress over-limit prompts so LLMs can read more context without retraining or heavy compute.
Main Contribution
Introduce SelfCP: use the frozen target LLM as both compressor and generator, and train only a small connector and a memory-tag embedding.
Compress over-limit prompts into dense memory tokens at compression ratios of up to 12×, preserving or improving generation quality.
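The token arithmetic behind the second contribution can be sketched as follows. This is a minimal illustration, assuming the over-limit segment is replaced by `ceil(over_limit / ratio)` memory tokens while the rest of the prompt stays as plain text. The function name `plan_compression` and the example lengths are hypothetical and do not come from the paper.

```python
import math


def plan_compression(prompt_tokens: int, kept_tokens: int, ratio: int = 12) -> tuple[int, int]:
    """Return (memory_tokens, total_tokens_seen_by_llm).

    Assumption: the first `kept_tokens` tokens are passed through unchanged
    and the remaining over-limit tokens are compressed at `ratio`:1.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    if not 0 <= kept_tokens <= prompt_tokens:
        raise ValueError("kept_tokens must lie between 0 and prompt_tokens")
    over_limit = prompt_tokens - kept_tokens
    # Round up so a partial group of tokens still gets a memory token.
    memory_tokens = math.ceil(over_limit / ratio)
    return memory_tokens, kept_tokens + memory_tokens


# Hypothetical example: a 1024-token prompt with 512 tokens kept as-is.
memory, total = plan_compression(prompt_tokens=1024, kept_tokens=512)
print(memory, total)  # 43 555
```

Under these assumptions, the 512 over-limit tokens shrink to 43 memory tokens, so the model reads 555 tokens instead of 1024.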
Key Findings
SelfCP compresses long prompts to 1/12 of their original token count and uses the resulting memory tokens in generation.
Small additional training cost: only 17M new parameters while keeping a 7B backbone frozen.
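The 17M figure is roughly what a single square linear layer costs at the hidden width of a 7B LLaMA-class model. The sketch below is a back-of-the-envelope check, assuming a hidden size of 4096, a square connector with bias, and one memory-tag embedding vector. The exact connector shape is an assumption here, not something this summary states.

```python
HIDDEN_SIZE = 4096  # hidden width of LLaMA-7B-class models such as Vicuna-7b


def trainable_params(hidden_size: int, bias: bool = True) -> int:
    """Count parameters of an assumed square linear connector plus one tag embedding."""
    connector = hidden_size * hidden_size + (hidden_size if bias else 0)
    memory_tag = hidden_size  # a single learnable embedding vector
    return connector + memory_tag


total = trainable_params(HIDDEN_SIZE)
print(f"{total:,} trainable parameters")      # 16,785,408
print(f"{total / 7e9:.2%} of a 7B backbone")  # 0.24%
```

Under these assumptions the trainable part is about 16.8M parameters, consistent with the reported ~17M and well under 1% of the frozen backbone.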
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ROUGE-1 | 30.5 | Vicuna 19.9 | +10.6 | XSUM (in-domain) Vicuna-7b | SelfCP R-1 30.5 vs Vicuna 19.9 | Table 2 |
| ROUGE-1 | 33.3 | Vicuna 17.3 | +16.0 | CICERO (in-domain) Vicuna-7b | SelfCP R-1 33.3 vs Vicuna 17.3 | Table 2 |
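The ROUGE-1 values in the table are unigram-overlap scores between generated and reference text. A minimal sketch of the F1 variant is shown below, assuming lowercased whitespace tokenization. Standard ROUGE toolkits add stemming and other normalization, so this simplified `rouge1_f1` helper (a hypothetical name) will not reproduce the paper's numbers exactly.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat", "the cat sat on the mat")
print(round(score, 3))  # 0.5
```

Papers usually report the score multiplied by 100, so a value of 0.305 here would correspond to 30.5 in the table.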
What To Try In 7 Days
Prototype SelfCP: freeze your 7B model, add a linear connector + memory-tag embedding (~17M params), and train on mixed long-text samples.
Use the 12× compression default. Measure ROUGE on a held-out summarization or QA set to compare against truncation.
Add caching (MDB) for repeated demonstrations to speed up few-shot pipelines.
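The caching step above can be prototyped as a lookup keyed by a hash of each demonstration's text, so that a repeated demonstration is compressed only once. This is a sketch under stated assumptions: `DemoMemoryCache` and `placeholder_compress` are hypothetical names, and the placeholder stands in for the real compression pass (frozen LLM plus connector), which is not implemented here.

```python
import hashlib
from typing import Callable, Dict, List

MemoryTokens = List[float]  # stand-in for a tensor of memory-token embeddings


class DemoMemoryCache:
    """Caches compressed demonstrations keyed by a hash of their text."""

    def __init__(self, compress_fn: Callable[[str], MemoryTokens]) -> None:
        self._compress_fn = compress_fn
        self._store: Dict[str, MemoryTokens] = {}
        self.misses = 0  # number of times the compressor actually ran

    def get(self, demonstration: str) -> MemoryTokens:
        key = hashlib.sha256(demonstration.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compress_fn(demonstration)
        return self._store[key]


def placeholder_compress(text: str) -> MemoryTokens:
    # Hypothetical stand-in: a real pipeline would run the frozen LLM and connector here.
    return [float(len(text))]


cache = DemoMemoryCache(placeholder_compress)
demos = ["Q: 2+2? A: 4", "Q: capital of France? A: Paris", "Q: 2+2? A: 4"]
memories = [cache.get(d) for d in demos]
print(cache.misses)  # 2: the repeated demonstration is served from the cache
```

Hashing the text keeps keys small and lets the cache be persisted or shared across requests, at the cost of treating any whitespace difference as a new demonstration.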
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Compression beyond 12× degrades quality; 16× shows significant drops in experiments.
Method was evaluated on 7B-class backbones; behavior on much larger models is untested.
When Not To Use
When you need the original text preserved verbatim at the token level (e.g. legal text) and any compression-induced change is unacceptable.
When you plan to fully fine-tune the backbone anyway, since full fine-tuning may address long-context needs differently.
Failure Modes
Over-compression (≫12×) strips critical facts and hurts generation.
Connector misalignment: if the connector is poorly trained, memory tokens can be unreadable to the frozen generator.
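A simple guard against the over-compression failure mode is to compute the ratio a given memory budget implies and flag anything above 12×. The sketch below assumes an integer ratio and a fixed number of memory-token slots. The names `required_ratio`, `is_over_compressed`, and `MAX_SAFE_RATIO` are hypothetical.

```python
import math

MAX_SAFE_RATIO = 12  # ratio beyond which quality is reported to degrade


def required_ratio(over_limit_tokens: int, memory_budget: int) -> int:
    """Smallest integer ratio that fits the over-limit tokens into the budget."""
    if memory_budget < 1:
        raise ValueError("memory_budget must be >= 1")
    return max(1, math.ceil(over_limit_tokens / memory_budget))


def is_over_compressed(over_limit_tokens: int, memory_budget: int) -> bool:
    """True when the implied ratio exceeds the assumed safe limit."""
    return required_ratio(over_limit_tokens, memory_budget) > MAX_SAFE_RATIO


print(required_ratio(6000, 400), is_over_compressed(6000, 400))  # 15 True
print(required_ratio(4800, 400), is_over_compressed(4800, 400))  # 12 False
```

When the check fails, the options are to raise the memory budget, split the input, or accept truncation of the least important text.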

