Overview
Results span multiple models and benchmarks and include memory and throughput measurements. Some claims, however, are reported only as aggregate gains, and the evaluation is limited to specific datasets and GPU types.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
LoCoCo lets you handle much longer documents without buying more GPU memory or changing the model core. That reduces infrastructure cost for long‑context applications and speeds up inference prefill.
Summary TLDR
LoCoCo adds small 1‑D convolutional 'compressor' heads to a pre-trained LLM to merge past key-value (KV) pairs into a fixed-size cache. This keeps peak memory constant (O(MB+M)), cuts KV cache growth, and supports both inference and post-training extension of context windows (e.g., 4K→32K) with modest tuning. Experiments on Llama-2 and ChatGLM show better accuracy on long-context tasks than recent token‑eviction baselines, higher throughput, and workable memory on commodity GPUs. Code is provided.
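The fixed-size cache bookkeeping described above can be sketched as a streaming loop. This is a minimal illustration, not the authors' code: it assumes the input is processed in chunks (`chunk_size` is a hypothetical parameter), and `average_pool_merge` is a plain averaging stand-in for the learned convolutional compressor, used only to show that the number of cached entries never exceeds the cache size plus one chunk.

```python
def stream_with_fixed_cache(tokens, cache_size, chunk_size, merge):
    """Feed `tokens` chunk by chunk, compressing back to `cache_size` entries.

    Returns the final cache and the peak number of entries held at once.
    """
    cache, peak = [], 0
    for start in range(0, len(tokens), chunk_size):
        working = cache + tokens[start:start + chunk_size]
        peak = max(peak, len(working))
        # Compress only when the working set exceeds the cache budget.
        cache = merge(working, cache_size) if len(working) > cache_size else working
    return cache, peak


def average_pool_merge(entries, cache_size):
    """Placeholder merge (assumption): average contiguous groups of scalars.

    LoCoCo learns its mixing weights; this stand-in only shows the bookkeeping.
    """
    n = len(entries)
    bounds = [i * n // cache_size for i in range(cache_size + 1)]
    return [sum(entries[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


tokens = [float(i) for i in range(3482)]  # scalar stand-ins for KV pairs
cache, peak = stream_with_fixed_cache(tokens, cache_size=128, chunk_size=512,
                                      merge=average_pool_merge)
print(len(cache), peak)  # 128 640
```

With these settings the cache ends at 128 entries and the working set peaks at 640 (128 cached slots plus one 512-token chunk), regardless of how long the input grows.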
Problem Statement
Transformer KV caches grow linearly with context length, quickly exhausting GPU memory. Existing token-dropping or local-attention fixes either lose global context or need architecture changes. The field needs a drop-in, data-driven compressor that keeps a constant-size KV cache while preserving useful long-range information.
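To make the linear growth concrete, here is a back-of-the-envelope calculation. It assumes the standard Llama-2-7B shape (32 layers, 32 attention heads, head dimension 128), fp16 storage, and a single sequence; the function name `kv_cache_bytes` is illustrative only.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed to store keys and values for `n_tokens` cached positions."""
    # Factor 2 accounts for storing both a key and a value per position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


GIB = 1024 ** 3
for n in (4096, 32768, 512):
    print(f"{n:>6} cached positions: {kv_cache_bytes(n) / GIB:.2f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token: about 2 GiB at 4K context and 16 GiB at 32K, versus 0.25 GiB for a fixed 512-slot cache.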
Main Contribution
Propose LoCoCo: learnable 1‑D convolutional heads that merge new tokens and cached KV pairs into a fixed-size cache (M slots).
Provide a drop-in design that preserves pre-trained weights and requires only small extra heads and light calibration or fine-tuning.
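One way to picture a convolutional head that merges cached and new entries into M slots is sketched below in plain Python. This is an assumption-laden illustration rather than the paper's implementation: the kernel layout `kernels[m][j][c]`, the zero padding, the softmax normalization, and the random weights are all hypothetical choices, and an odd kernel width is assumed.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def compress_kv(cache, new_tokens, kernels):
    """Merge an M-slot cache and B new vectors back into M slots.

    cache, new_tokens: lists of d-dimensional vectors (lists of floats).
    kernels[m][j][c]: weight for output slot m, kernel tap j, input channel c
    (hypothetical layout; odd kernel width assumed).
    """
    seq = cache + new_tokens                      # length M + B
    d, k = len(seq[0]), len(kernels[0])
    pad = k // 2
    zero = [0.0] * d
    padded = [zero] * pad + seq + [zero] * pad    # zero-pad the sequence axis

    merged = []
    for kernel in kernels:                        # one output slot per kernel
        # 1-D convolution along the sequence: one logit per input position.
        logits = [
            sum(kernel[j][c] * padded[t + j][c] for j in range(k) for c in range(d))
            for t in range(len(seq))
        ]
        weights = softmax(logits)                 # mixing weights sum to 1
        merged.append([sum(w * vec[c] for w, vec in zip(weights, seq))
                       for c in range(d)])
    return merged


def rand_vec(d):
    return [random.gauss(0.0, 1.0) for _ in range(d)]


random.seed(0)
M, B, d, k = 4, 6, 3, 3
cache = [rand_vec(d) for _ in range(M)]
new_tokens = [rand_vec(d) for _ in range(B)]
kernels = [[rand_vec(d) for _ in range(k)] for _ in range(M)]  # untrained
new_cache = compress_kv(cache, new_tokens, kernels)
print(len(new_cache), len(new_cache[0]))  # 4 3
```

Because each slot is a weighted average of the incoming vectors, the pre-trained attention weights are untouched; only the small kernels would need calibration or fine-tuning.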
Key Findings
LoCoCo can compress very long prefill contexts into a tiny KV cache during inference.
Post-training tuning with LoCoCo extends Llama-2 context from 4K to 32K while using a fixed 512-slot cache.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| inference compression | 3,482 tokens → 128 KV slots | uncompressed full KV | accuracy up to +0.2791 vs token‑eviction baselines | reported across downstream tasks (abstract claim) | Abstract; Intro | Abstract |
| perplexity (language modeling) | 3.4408 | full sequence 3.4012 | gap 0.0396 | Proof-Pile-2, Llama-2-7B @ eval context 32768 | Table 2 (row: 7B, 32768) | Table 2 |
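The figures in the table can be sanity-checked with a few lines of arithmetic. The snippet below only recomputes values already listed (variable names are illustrative); note that the two rows come from different experiments, so the compression ratio and the perplexity gap should not be read as one joint result.

```python
full_ppl = 3.4012      # full-sequence perplexity (Table 2, 7B @ 32768)
lococo_ppl = 3.4408    # LoCoCo perplexity from the same row

gap = lococo_ppl - full_ppl
relative = gap / full_ppl
ratio = 3482 / 128     # prefill tokens per KV slot in the inference row

print(f"absolute gap: {gap:.4f}")       # 0.0396
print(f"relative gap: {relative:.2%}")  # 1.16%
print(f"compression:  {ratio:.1f}x")    # 27.2x
```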
What To Try In 7 Days
Drop a LoCoCo convolutional head into an existing Llama-2 model and run the provided repo example on a 2k–16k prefill to measure throughput gains.
Fine-tune only the convolutional heads on a small calibration split (≈104M tokens) and compare perplexity vs your current token‑eviction approach.
Replace or augment an existing token‑eviction policy (e.g., heavy hitters) with LoCoCo fusion and measure downstream task accuracy on a long-document benchmark.
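For the comparisons suggested above, a small measurement harness may help. The sketch below is generic and makes no assumptions about the LoCoCo repository's API: `perplexity` takes per-token negative log-likelihoods (in nats) that you collect from your own evaluation loop, and `tokens_per_second` times any callable you supply. `fake_prefill` is a hypothetical stand-in workload.

```python
import math
import time


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def tokens_per_second(run, n_tokens, repeats=3):
    """Best-of-`repeats` throughput for a callable that processes `n_tokens`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return n_tokens / best


def fake_prefill():
    """Hypothetical stand-in for a real model prefill call."""
    return sum(i * i for i in range(200_000))


baseline_ppl = perplexity([math.log(2.0)] * 8)   # uniform over 2 choices
print(f"perplexity: {baseline_ppl:.2f}")         # 2.00
print(f"throughput: {tokens_per_second(fake_prefill, n_tokens=2048):.0f} tok/s")
```

When timing real GPU work, synchronize the device before reading the clock, since kernel launches are asynchronous; otherwise the measured time can understate the actual cost.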
Risks & Boundaries
Limitations
Performance can degrade under extreme compression or very small convolution kernels (ablation shows small kernels hurt).
LoCoCo needs calibration or light fine-tuning (authors used 200 steps for post-hoc compression and LoRA for extension).
When Not To Use
When you can afford full uncompressed KV memory and desire exact full-sequence attention.
When the application requires strict token-level fidelity (exact retrieval of single tokens).
Failure Modes
Loss of rare, token-specific signals when multiple tokens are merged into one slot.
Optimization instability with very long convolution kernels or extremely small kernels.

