Use tiny fixed KV caches and learned 1‑D convolutions to compress thousands of tokens with low memory and near-full performance

June 8, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Results cover multiple models and benchmarks and include memory/throughput measurements, but some claims are reported as aggregate gains and evaluation focuses on specific datasets and GPU types.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LoCoCo lets you handle much longer documents without buying more GPU memory or changing the model core. That reduces infrastructure cost for long‑context applications and speeds up inference prefill.

Who Should Care

Summary TLDR

LoCoCo adds small 1‑D convolutional 'compressor' heads to a pre-trained LLM to merge past key-value (KV) pairs into a fixed-size cache. This keeps peak memory constant (O(MB+M)), cuts KV cache growth, and supports both inference and post-training extension of context windows (e.g., 4K→32K) with modest tuning. Experiments on Llama-2 and ChatGLM show better accuracy on long-context tasks than recent token‑eviction baselines, higher throughput, and workable memory on commodity GPUs. Code is provided.

Problem Statement

Transformer KV caches grow linearly with context length, quickly exhausting GPU memory. Existing token-dropping or local-attention fixes either lose global context or need architecture changes. The field needs a drop-in, data-driven compressor that keeps a constant-size KV cache while preserving useful long-range information.
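To make the linear growth concrete, here is a rough back-of-the-envelope sketch. It assumes the publicly documented Llama-2-7B shape (32 layers, hidden size 4096, standard multi-head attention), fp16 storage, and batch size 1; the helper name `kv_cache_bytes` is hypothetical. It counts only the K and V tensors and ignores activations and any extra compressor parameters.

```python
def kv_cache_bytes(tokens, layers=32, hidden=4096, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    Assumed shape: Llama-2-7B (32 layers, hidden size 4096), fp16 values.
    Each layer stores one key and one value vector of width `hidden` per token.
    """
    return 2 * layers * hidden * bytes_per_value * tokens

full_32k = kv_cache_bytes(32768)   # uncompressed cache at a 32K context
fixed_512 = kv_cache_bytes(512)    # fixed 512-slot cache

print(full_32k / 2**30)    # 16.0  (GiB)
print(fixed_512 / 2**20)   # 256.0 (MiB)
```

Under these assumptions the uncompressed cache at 32K tokens is 64 times larger than a 512-slot cache, which is the gap a constant-size compressor is meant to close.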

Main Contribution

Propose LoCoCo: learnable 1‑D convolutional heads that merge new tokens and cached KV pairs into a fixed-size cache (M slots).

Provide a drop-in design that preserves pre-trained weights and requires only small extra heads and light calibration or fine-tuning.
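The merge step can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: the function names, the choice to compute mixing logits from keys alone, the same-padded convolution, and the softmax over sequence positions are all assumptions made for illustration. The idea shown is only that a 1-D convolution over the concatenated cache and new tokens yields one weight per position per slot, and each of the M output slots is a weighted sum of all inputs.

```python
import math
import random

def conv1d_logits(keys, kernel):
    """Same-padded 1-D convolution along the sequence axis.

    keys:   N vectors of dimension D
    kernel: M output channels x K taps x D input channels (K odd)
    returns an N x M table: one logit per position per cache slot
    """
    n, d, k = len(keys), len(keys[0]), len(kernel[0])
    pad = k // 2
    logits = [[0.0] * len(kernel) for _ in range(n)]
    for i in range(n):
        for slot, taps in enumerate(kernel):
            total = 0.0
            for t in range(k):
                src = i + t - pad
                if 0 <= src < n:  # zero padding outside the sequence
                    total += sum(taps[t][c] * keys[src][c] for c in range(d))
            logits[i][slot] = total
    return logits

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    norm = sum(exps)
    return [e / norm for e in exps]

def merge_into_cache(cache_k, cache_v, new_k, new_v, kernel):
    """Fuse M cached KV pairs and B new ones back into M slots (toy version)."""
    keys, values = cache_k + new_k, cache_v + new_v
    logits = conv1d_logits(keys, kernel)
    d = len(keys[0])
    out_k, out_v = [], []
    for slot in range(len(kernel)):
        # Mixing weights for this slot, normalised over all M + B positions.
        w = softmax([row[slot] for row in logits])
        out_k.append([sum(w[i] * keys[i][c] for i in range(len(keys))) for c in range(d)])
        out_v.append([sum(w[i] * values[i][c] for i in range(len(values))) for c in range(d)])
    return out_k, out_v

# Tiny demo: M=4 slots, B=3 new tokens, D=2 dims, kernel width K=3 (random weights).
random.seed(0)
M, B, D, K = 4, 3, 2, 3
rand_vecs = lambda n: [[random.uniform(-1, 1) for _ in range(D)] for _ in range(n)]
cache_k, cache_v, new_k, new_v = rand_vecs(M), rand_vecs(M), rand_vecs(B), rand_vecs(B)
kernel = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(K)] for _ in range(M)]
merged_k, merged_v = merge_into_cache(cache_k, cache_v, new_k, new_v, kernel)
print(len(merged_k), len(merged_v))  # 4 4
```

The cache stays at M slots no matter how many blocks are merged in. In a real setup the kernel would be learned while the pre-trained weights stay frozen; here it is random.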

Key Findings

LoCoCo can compress very long prefill contexts into a tiny KV cache during inference.

Numbers: compressed 3,482 tokens into a 128-size KV cache; accuracy gain vs baseline 0.2791 (reported)

Practical Use: You can run long-context generation with a small fixed KV cache (128 slots) and preserve model accuracy versus token‑eviction baselines.

Evidence Ref: Abstract; Intro

Post-training tuning with LoCoCo extends Llama-2 context from 4K to 32K while using a fixed 512-slot cache.

Numbers: extended 4K→32K with a 512-slot cache; perplexity gap vs full-sequence ≤0.04 on evaluated setups

Practical Use: You can train or fine-tune on much longer sequences without linearly increasing GPU memory by keeping a small KV cache and adding LoCoCo heads.

Evidence Ref: Abstract; Table 2 (7B, 32768: ours 3.4408 vs full 3.4012)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| inference compression | 3,482 tokens → 128 KV slots | uncompressed full KV | accuracy +0.2791 vs baseline token‑eviction | reported across downstream tasks (abstract claim) | Abstract; Intro | Abstract |
| perplexity (language modeling) | 3.4408 | full sequence 3.4012 | gap 0.0396 | Proof-Pile-2, Llama-2-7B @ eval context 32768 | Table 2 (row: 7B, 32768) | Table 2 |
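The deltas in the two results rows can be re-derived from the reported values. The short check below uses only numbers stated above; the relative gap and the compression ratio are derived quantities, not figures reported by the authors.

```python
ppl_compressed = 3.4408   # LoCoCo, Llama-2-7B @ 32768 (Table 2)
ppl_full = 3.4012         # full-sequence baseline (Table 2)

gap = ppl_compressed - ppl_full
relative_gap = gap / ppl_full
compression_ratio = 3482 / 128   # prefill tokens per KV slot

print(f"absolute gap: {gap:.4f}")                  # absolute gap: 0.0396
print(f"relative gap: {relative_gap:.2%}")         # relative gap: 1.16%
print(f"compression: {compression_ratio:.1f}:1")   # compression: 27.2:1
```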

What To Try In 7 Days

Drop a LoCoCo convolutional head into an existing Llama-2 model and run the provided repo example on a 2k–16k prefill to measure throughput gains.

Fine-tune only the convolutional heads on a small calibration split (≈104M tokens) and compare perplexity vs your current token‑eviction approach.

Replace or augment an existing token‑eviction policy (e.g., heavy hitters) with LoCoCo fusion and measure downstream task accuracy on a long-document benchmark.
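For the throughput comparison in the first item, a minimal timing harness could look like the sketch below. The names `measure_prefill_throughput` and `dummy_prefill` are hypothetical and not part of the LoCoCo repository; `dummy_prefill` is a CPU stand-in that you would replace with your own prefill call. On a GPU you would likely also need to synchronize the device before reading the clock, otherwise asynchronous kernels make the timing look faster than it is.

```python
import time

def measure_prefill_throughput(prefill_fn, token_ids, repeats=3):
    """Return tokens per second for the fastest of `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        prefill_fn(token_ids)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(token_ids) / max(best, 1e-9)

def dummy_prefill(token_ids):
    # Stand-in workload; replace with a real model prefill call.
    return sum(token_ids)

results = {}
for n in (2048, 4096, 8192, 16384):
    results[n] = measure_prefill_throughput(dummy_prefill, list(range(n)))
    print(n, f"{results[n]:.0f} tokens/s")
```

Running the same harness against the full-KV model and the compressed variant at each prefill length gives a like-for-like comparison.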

Agent Features

Memory: fixed-size KV cache
Architectures: Long-context Transformer

Optimization Features

Token Efficiency: reported compression ratio up to 32:1 (e.g., 4096→128)
Infra Optimization: improved pre-fill throughput vs full KV; similar memory to other eviction baselines
System Optimization: peak memory O(MB+M) instead of O(LB) for segment attention
Training Optimization: LoRA
Inference Optimization: fixed-size KV cache via learned convolutional fusion; drop-in heads that keep pre-trained weights unchanged

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

RedPajama, Proof-Pile-2, RACE, TriviaQA, HellaSwag, WinoGrande, ARC, SCROLLS, LongBench

Risks & Boundaries

Limitations

Performance can degrade under extreme compression or very small convolution kernels (ablation shows small kernels hurt).

LoCoCo needs calibration or light fine-tuning (authors used 200 steps for post-hoc compression and LoRA for extension).

When Not To Use

When you can afford full uncompressed KV memory and desire exact full-sequence attention.

When the application requires strict token-level fidelity (exact retrieval of single tokens).

Failure Modes

Loss of rare, token-specific signals when multiple tokens are merged into one slot.

Optimization instability with very long convolution kernels or extremely small kernels.

Core Entities

Models

Llama-2-7B, Llama-2-13B, ChatGLM3-6B-32k

Metrics

perplexity, throughput (tokens/s), GPU memory (GB), accuracy

Datasets

RedPajama, Proof-Pile-2, RACE, TriviaQA, HellaSwag, WinoGrande, ARC, SCROLLS, LongBench

Benchmarks

SCROLLS, LongBench