Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
SelfCP lets you fit much longer context into an existing LLM by training a tiny adapter instead of re-training the whole model, cutting memory and inference costs while often improving output quality on summarization and QA.
Summary TLDR
SelfCP uses a frozen target LLM to compress long prompts into compact 'memory tokens' via a small trainable connector and a learnable memory tag. The method trains only ~17M parameters (connector + special embedding), keeps the LLM frozen, and replaces the over-limit portion of a prompt with dense memory tokens at compression ratios of up to 12×. Across English and Chinese benchmarks (summarization, QA, legal verdict generation), SelfCP raises ROUGE scores over naively truncated inputs and matches or beats other prompt-compression baselines, while reducing GPU memory needs and enabling larger few-shot contexts via caching.
Problem Statement
Transformer LLMs choke when prompts exceed their context window. Long inputs (long source documents, many demonstrations) must either be truncated, which loses information, or handled through costly model changes. We need a cheap, general way to compress over-limit prompts so LLMs can read more context without retraining or heavy compute.
Main Contribution
Introduce SelfCP: use the frozen target LLM as both compressor and generator, and train only a small connector and a memory-tag embedding.
Compress over-limit prompts into dense memory tokens at ratios of up to 12× (one memory token per 12 original tokens), preserving or improving generation quality.
Propose three practical compression strategies (Former, Latter, Concatenated) and a Memory Demonstration Bank (MDB) for caching/retrieving compressed demonstrations.
Show cross-language and out-of-domain gains on multiple benchmarks while adding only ~17M trainable parameters.
Key Findings
SelfCP compresses over-limit prompt segments to as little as 1/12 of their original token count and feeds the resulting memory tokens to the generator.
Small additional training cost: only 17M new parameters while keeping a 7B backbone frozen.
SelfCP increases ROUGE-1 by roughly +9–16 points versus naive truncated inputs on tested tasks.
Compression is robust up to ~12× but degrades at 16×.
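To make the compression ratios above concrete, here is a minimal arithmetic sketch of how many memory tokens stand in for a compressed segment and how long the resulting generator input is. The function names and the example token counts are illustrative assumptions, not values from the paper, and rounding up to a whole token is also an assumption.

```python
import math


def memory_token_count(compressed_tokens: int, ratio: int = 12) -> int:
    """Memory tokens needed to replace a segment of `compressed_tokens` tokens.

    Assumes one memory token per `ratio` original tokens, rounded up.
    """
    if compressed_tokens < 0 or ratio < 1:
        raise ValueError("compressed_tokens must be >= 0 and ratio >= 1")
    return math.ceil(compressed_tokens / ratio)


def compressed_prompt_length(kept_tokens: int, compressed_tokens: int, ratio: int = 12) -> int:
    """Generator input length: uncompressed tokens plus memory tokens."""
    return kept_tokens + memory_token_count(compressed_tokens, ratio)


# Hypothetical prompt: 512 tokens kept verbatim, 1536 tokens compressed.
length_at_12x = compressed_prompt_length(512, 1536, ratio=12)  # 512 + 128
length_at_16x = compressed_prompt_length(512, 1536, ratio=16)  # 512 + 96
print(length_at_12x, length_at_16x)
```

Going from 12× to 16× saves only a few dozen more tokens in this example, which is worth weighing against the quality drop reported at 16×.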
Results
ROUGE-1
Trainable parameters added
Compression robustness
Who Should Care
What To Try In 7 Days
Prototype SelfCP: freeze your 7B model, add a linear connector + memory-tag embedding (~17M params), and train on mixed long-text samples.
Use the 12× compression default. Measure ROUGE on a held-out summarization or QA set to compare against truncation.
Add caching (MDB) for repeated demonstrations to speed up few-shot pipelines.
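For the "measure ROUGE" step above, a quick first-pass comparison between truncated and compressed runs could use a simplified ROUGE-1 F1 such as the sketch below. It is an approximation under stated assumptions (lowercasing, whitespace tokenization, no stemming), so it will not exactly reproduce scores from a standard ROUGE package; the helper names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_rouge1(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Average simplified ROUGE-1 F1 over a held-out set."""
    if len(outputs) != len(references) or not outputs:
        raise ValueError("outputs and references must be non-empty and aligned")
    return sum(rouge1_f1(o, r) for o, r in zip(outputs, references)) / len(outputs)
```

Run `mean_rouge1` once on outputs generated from truncated prompts and once on outputs generated from compressed prompts, using the same references, and compare the two averages.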
Optimization Features
Token Efficiency
- Achieves up to 12× token compression while preserving key content
Infra Optimization
- Lower GPU memory pressure enables more few-shot demonstrations on limited GPUs
Model Optimization
- Keep backbone frozen; train small adapter (17M params)
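As a sanity check on the 17M figure, the sketch below counts parameters under two assumptions that the summary does not confirm: the connector is a single square linear layer with bias, and the backbone's hidden size is 4096 (the hidden size of Llama-style 7B models). The memory tag is counted as one extra embedding vector.

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Parameter count of a dense layer: weight matrix plus optional bias."""
    return in_dim * out_dim + (out_dim if bias else 0)


HIDDEN_SIZE = 4096  # assumption: hidden size of a Llama-style 7B backbone

connector_params = linear_params(HIDDEN_SIZE, HIDDEN_SIZE)  # assumed single linear layer
memory_tag_params = HIDDEN_SIZE                             # one learnable embedding vector
total_trainable = connector_params + memory_tag_params

print(f"{total_trainable:,} trainable parameters (~{total_trainable / 1e6:.1f}M)")
```

Under these assumptions the total is about 16.8M, which is consistent with the ~17M reported, though the actual connector architecture may differ.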
System Optimization
- Parallelize compression and projection steps; caching reduces per-query cost
Training Optimization
- Train only connector and memory-tag embedding under LM objective
Inference Optimization
- Replace over-limit tokens with memory tokens at 12× compression to reduce generator input length
- Cache compressed demonstrations (MDB) to avoid repeated compression
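The caching idea can be sketched as a small keyed store that compresses each demonstration at most once. This is a hedged illustration of the caching half of the MDB only: the class name, the hash-based key, and `fake_compress` (a stand-in for running the frozen LLM plus connector) are all hypothetical, and any retrieval logic the paper uses is not modeled here.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class MemoryDemonstrationBank:
    """Caches compressed demonstrations so each text is compressed only once."""

    def __init__(self, compress: Callable[[str], List[Vector]]) -> None:
        self._compress = compress
        self._store: Dict[str, List[Vector]] = {}
        self.misses = 0  # number of times the compressor was actually called

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, demonstration: str) -> List[Vector]:
        key = self._key(demonstration)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compress(demonstration)
        return self._store[key]


def fake_compress(text: str) -> List[Vector]:
    """Stand-in compressor: one dummy vector per 12 whitespace tokens (assumption)."""
    n_memory = max(1, len(text.split()) // 12)
    return [[0.0] for _ in range(n_memory)]


bank = MemoryDemonstrationBank(fake_compress)
demo = "x " * 24
first = bank.get(demo)   # compressed on first use
second = bank.get(demo)  # served from the cache
```

In a real pipeline the stored values would be the memory-token embeddings produced by the connector, and the per-query saving comes from skipping the compression forward pass on cache hits.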
Reproducibility
Code Urls
Data Urls
- XSUM, CICERO, DUC 2007, ARXIV, CoLA (public datasets referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Compression beyond 12× degrades quality; 16× shows significant drops in experiments.
- Method was evaluated on 7B-class backbones; behavior on much larger models is untested.
- Compression can lose fine-grained details if the compressed segments contain dense factual content.
When Not To Use
- If you need the original text preserved losslessly at the token level (e.g., verbatim legal text), where any compression-induced change is unacceptable.
- When you plan to fully fine-tune the backbone anyway, since full fine-tuning may address long-context needs differently.
- If you require immediate support for extremely high compression ratios (>12×) without quality checks.
Failure Modes
- Over-compression (beyond 12×, e.g., 16×) strips critical facts and hurts generation.
- Connector misalignment: if connector is poorly trained, memory tokens can be unreadable to the frozen generator.
- Domain mismatch between training texts and deployed prompts can cause out-of-domain (OoD) compression errors.
Core Entities
Models
- Vicuna-7b
- BlueLM-7b
- Llama2-7b (comparison baselines)
Metrics
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Accuracy
- GPU hours / TFLOPs / TMACs
- Throughput (iter/s)
- Memory (GB)
Datasets
- XSUM
- CICERO
- DUC 2007
- CLCV (Chinese verdict dataset)
- ARXIV
- CoLA
- SUPER-NI / instruction mix (training pool)
Benchmarks
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Accuracy

