Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
LoCoCo lets you process much longer documents without adding GPU memory or modifying the pre-trained model weights. That lowers infrastructure cost for long‑context applications and speeds up the prefill stage of inference.
Summary TLDR
LoCoCo adds small 1‑D convolutional 'compressor' heads to a pre-trained LLM to merge past key-value (KV) pairs into a fixed-size cache. This keeps peak memory constant (O(MB+M)), cuts KV cache growth, and supports both inference and post-training extension of context windows (e.g., 4K→32K) with modest tuning. Experiments on Llama-2 and ChatGLM show better accuracy on long-context tasks than recent token‑eviction baselines, higher throughput, and workable memory on commodity GPUs. Code is provided.
Problem Statement
Transformer KV caches grow linearly with context length, quickly exhausting GPU memory. Existing token-dropping or local-attention fixes either lose global context or need architecture changes. The field needs a drop-in, data-driven compressor that keeps a constant-size KV cache while preserving useful long-range information.
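To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size. The default shape values (32 layers, 32 KV heads, head dimension 128, 2-byte fp16 values) are assumptions chosen to match Llama-2-7B; the function name is hypothetical and the figures ignore weights, activations, and allocator overhead.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for `seq_len` tokens.

    Defaults are assumed Llama-2-7B shapes with fp16 storage.
    """
    # Factor 2: one tensor for keys and one for values in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
full_4k = kv_cache_bytes(4096) / GIB     # full cache at a 4K context
full_32k = kv_cache_bytes(32768) / GIB   # full cache at a 32K context
fixed_512 = kv_cache_bytes(512) / GIB    # a fixed 512-slot cache

print(f"4K: {full_4k} GiB, 32K: {full_32k} GiB, 512 slots: {fixed_512} GiB")
```

Under these assumptions the full cache grows from 2 GiB at 4K tokens to 16 GiB at 32K tokens per sequence, while a 512-slot cache stays at 0.25 GiB regardless of context length.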
Main Contribution
Propose LoCoCo: learnable 1‑D convolutional heads that merge new tokens and cached KV pairs into a fixed-size cache (M slots).
Provide a drop-in design that preserves pre-trained weights and requires only small extra heads and light calibration or fine-tuning.
Show LoCoCo works for both inference (post-hoc compression) and post-training context extension (4K→32K) with modest memory and good throughput.
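The fusion idea can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function name, the kernel shape, and the choice to feed raw keys or values into the convolution are all assumptions made for illustration. It only shows how a 1-D convolution along the sequence axis can produce mixing weights that fold M cached entries plus B new tokens back into M slots.

```python
import numpy as np


def fuse_into_cache(cache, new_block, kernel):
    """Merge `new_block` into `cache` and return a cache of the same size.

    cache:     (M, d) cached keys (or values)
    new_block: (B, d) keys (or values) of newly arrived tokens
    kernel:    (k, d, M) hypothetical 1-D conv weights, k odd
    """
    M = cache.shape[0]
    x = np.concatenate([cache, new_block], axis=0)   # (M + B, d)
    L, k = x.shape[0], kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))          # zero-pad the sequence axis
    # 1-D convolution along the sequence: d input channels, M output channels.
    logits = np.zeros((L, M))
    for j in range(k):
        logits += padded[j:j + L] @ kernel[j]
    # Softmax over the sequence axis: each slot becomes a convex mix of inputs.
    logits -= logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)     # (M + B, M)
    return weights.T @ x                              # (M, d)


rng = np.random.default_rng(0)
M, B, d, k = 8, 4, 16, 3
kernel = rng.normal(scale=0.1, size=(k, d, M))  # random stand-in for learned weights
cache = rng.normal(size=(M, d))
for _ in range(5):                               # stream five blocks of new tokens
    cache = fuse_into_cache(cache, rng.normal(size=(B, d)), kernel)
print(cache.shape)
```

The cache keeps its (M, d) shape no matter how many blocks are streamed in. In a real system the kernel would be trained while the base model weights stay frozen, as the summary describes.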
Key Findings
LoCoCo can compress long prefill contexts into a much smaller KV cache during inference, with reported compression ratios up to 32:1 (e.g., 4096 tokens into 128 slots).
Post-training tuning with LoCoCo extends Llama-2 context from 4K to 32K while using a fixed 512-slot cache.
LoCoCo improves prefill throughput over a full KV cache while keeping peak GPU memory on par with token‑eviction baselines.
LoCoCo yields more accurate results than recent token‑eviction baselines on long-context benchmarks such as SCROLLS and LongBench.
Results
inference compression
perplexity (language modeling)
throughput (prefill stage)
memory usage (training / tuning)
Who Should Care
What To Try In 7 Days
Drop a LoCoCo convolutional head into an existing Llama-2 model and run the provided repo example on a 2k–16k prefill to measure throughput gains.
Fine-tune only the convolutional heads on a small calibration split (≈104M tokens) and compare perplexity against your current token‑eviction approach.
Replace or augment an existing token‑eviction policy (e.g., heavy hitters) with LoCoCo fusion and measure downstream task accuracy on a long-document benchmark.
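For the comparisons suggested above, two small measurement helpers may be useful. This is a minimal sketch using only the standard library; the function names are hypothetical, and `run_prefill` stands for whatever callable runs your model's prefill, which is not shown here.

```python
import math
import time


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLLs in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def tokens_per_second(run_prefill, n_tokens, repeats=3):
    """Best-of-`repeats` throughput of a callable that processes `n_tokens`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_prefill()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very fast callables.
    return n_tokens / max(best, 1e-12)


# Toy usage: a model that assigns probability 1/4 to every token has perplexity 4.
print(perplexity([math.log(4)] * 10))
print(tokens_per_second(lambda: sum(range(100_000)), n_tokens=2048) > 0)
```

When timing on a GPU, the callable should block until the work has actually finished; otherwise asynchronous kernel launches make the measured throughput look higher than it is.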
Agent Features
Memory
- fixed-size KV cache
Architectures
- Long-context Transformer
Optimization Features
Token Efficiency
- reported compression ratio up to 32:1 (e.g., 4096→128)
Infra Optimization
- improved prefill throughput vs full KV; similar memory to other eviction baselines
System Optimization
- peak memory O(MB+M) instead of O(LB) for segment attention
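The two complexity terms can be compared numerically. In this sketch M is the number of cache slots, as in the summary; reading B as the number of tokens processed per segment and L as the full context length is an assumption, since this section does not define them. The values are unit-less term counts, not bytes.

```python
def peak_terms(M, B, L):
    """Return (M*B + M, L*B): the fixed-cache term and the full-sequence term."""
    fixed = M * B + M   # independent of the context length L
    full = L * B        # grows linearly with L
    return fixed, full


for L in (4096, 32768):
    fixed, full = peak_terms(M=512, B=512, L=L)
    print(L, fixed, full, round(full / fixed, 1))
```

With these illustrative values the fixed-cache term stays at 262,656 while the full-sequence term grows eightfold between a 4K and a 32K context.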
Training Optimization
- LoRA
Inference Optimization
- fixed-size KV cache via learned convolutional fusion
- drop-in heads that keep pre-trained weights unchanged
Reproducibility
Code Urls
Data Urls
- RedPajama
- Proof-Pile-2
- RACE
- TriviaQA
- HellaSwag
- WinoGrande
- ARC
- SCROLLS
- LongBench
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance can degrade under extreme compression or very small convolution kernels (ablation shows small kernels hurt).
- LoCoCo needs calibration or light fine-tuning (the authors used 200 steps for post-hoc compression and LoRA for context extension).
- Compression is learned and may merge fine-grained token signals needed for exact token‑level retrieval.
When Not To Use
- When you can afford the full uncompressed KV cache and want exact full-sequence attention.
- When the application requires strict token-level fidelity (exact retrieval of single tokens).
- When you cannot allocate even the small extra compute for convolutional fusion (though the authors report that the overhead is small).
Failure Modes
- Loss of rare, token-specific signals when multiple tokens are merged into one slot.
- Optimization instability with very large or very small convolution kernels.
- Residual gap versus full sequence at extreme context lengths or very high compression ratios.
Core Entities
Models
- Llama-2-7B
- Llama-2-13B
- ChatGLM3-6B-32k
Metrics
- perplexity
- throughput (tokens/s)
- GPU memory (GB)
- accuracy
Datasets
- RedPajama
- Proof-Pile-2
- RACE
- TriviaQA
- HellaSwag
- WinoGrande
- ARC
- SCROLLS
- LongBench
Benchmarks
- SCROLLS
- LongBench

