Use tiny fixed KV caches and learned 1‑D convolutions to compress thousands of tokens with low memory and near-full performance

Overview

Decision Snapshot: Needs Validation

Results cover multiple models and benchmarks and include memory and throughput measurements, but some claims are reported only as aggregate gains, and the evaluation is limited to specific datasets and GPU types.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LoCoCo lets you handle much longer documents without buying more GPU memory or changing the model core. That reduces infrastructure cost for long‑context applications and speeds up inference prefill.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, Founder

Summary TLDR

LoCoCo adds small 1‑D convolutional 'compressor' heads to a pre-trained LLM to merge past key-value (KV) pairs into a fixed-size cache. This keeps peak memory constant (O(MB+M)), cuts KV cache growth, and supports both inference and post-training extension of context windows (e.g., 4K→32K) with modest tuning. Experiments on Llama-2 and ChatGLM show better accuracy on long-context tasks than recent token‑eviction baselines, higher throughput, and workable memory on commodity GPUs. Code is provided.

Problem Statement

Transformer KV caches grow linearly with context length, quickly exhausting GPU memory. Existing token-dropping or local-attention fixes either lose global context or need architecture changes. The field needs a drop-in, data-driven compressor that keeps a constant-size KV cache while preserving useful long-range information.
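To make the linear growth concrete, the following back-of-the-envelope sketch estimates KV cache size for a single sequence. The model shape (32 layers, 32 KV heads, head dimension 128) and 16-bit storage are assumptions chosen to resemble a Llama-2-7B-style model; they are not taken from this summary, and `kv_cache_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for `seq_len` tokens.

    Defaults are ASSUMED values for a Llama-2-7B-like model in fp16.
    """
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

# A full cache grows linearly with context length.
for tokens in (4096, 32768):
    print(f"{tokens:>6} tokens: {kv_cache_bytes(tokens) / GIB:.2f} GiB")

# A fixed 512-slot cache stays the same size however long the input is.
print(f"   512 slots: {kv_cache_bytes(512) / GIB:.2f} GiB")
```

Under these assumptions the full cache is 2 GiB at 4K tokens and 16 GiB at 32K tokens, while a 512-slot cache stays at 0.25 GiB.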

Main Contribution

Propose LoCoCo: learnable 1‑D convolutional heads that merge new tokens and cached KV pairs into a fixed-size cache (M slots).

Provide a drop-in design that preserves pre-trained weights and requires only small extra heads and light calibration or fine-tuning.
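The fusion idea can be illustrated with a deliberately simplified sketch: a 1-D convolution along the token axis produces one score per (candidate, slot) pair, a softmax turns those scores into mixing weights, and each of the M slots becomes a weighted average of the old cache plus the new tokens. This is not the authors' implementation; the function names, the softmax normalization, and the toy shapes are assumptions for illustration, and a real attention layer would also need to handle keys and values per head.

```python
import math
import random

def conv1d_scores(seq, kernel):
    """Zero-padded 'same' 1-D convolution along the token axis.

    seq:    T vectors of length D
    kernel: kernel[k][d][m] with K taps, D input channels, M output channels
    Returns T rows of M scores.
    """
    T, D = len(seq), len(seq[0])
    K, M = len(kernel), len(kernel[0][0])
    pad = K // 2
    out = []
    for t in range(T):
        row = [0.0] * M
        for k in range(K):
            src = t + k - pad
            if 0 <= src < T:  # positions outside the sequence count as zeros
                for d in range(D):
                    for m in range(M):
                        row[m] += seq[src][d] * kernel[k][d][m]
        out.append(row)
    return out

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def fuse(cache, new_tokens, kernel):
    """Merge M cached vectors and B new vectors back into M slots."""
    seq = cache + new_tokens             # T = M + B candidates
    scores = conv1d_scores(seq, kernel)  # T x M
    T, D = len(seq), len(seq[0])
    fused = []
    for m in range(len(cache)):
        w = softmax([scores[t][m] for t in range(T)])  # weights over candidates
        fused.append([sum(w[t] * seq[t][d] for t in range(T)) for d in range(D)])
    return fused

# Toy run: the kernel is random here; in the method it would be learned.
random.seed(0)
M, B, D, K = 4, 6, 3, 5
kernel = [[[random.uniform(-1, 1) for _ in range(M)] for _ in range(D)]
          for _ in range(K)]
cache = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]
for _ in range(3):  # three incoming segments of B tokens each
    segment = [[random.gauss(0, 1) for _ in range(D)] for _ in range(B)]
    cache = fuse(cache, segment, kernel)
print(len(cache), len(cache[0]))  # the cache stays at M slots of width D
```

Because every slot is a convex combination of the candidates, nothing is discarded outright, but information from many tokens is blended into each slot.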

Key Findings

LoCoCo can compress very long prefill contexts into a tiny KV cache during inference.

Numbers: compressed 3,482 tokens into a 128-slot KV cache; accuracy gain vs baseline 0.2791 (reported)

Practical Use: You can run long-context generation with a small fixed KV cache (128 slots) and, per the reported results, better accuracy than token‑eviction baselines.

Evidence Ref: Abstract; Intro

Post-training tuning with LoCoCo extends Llama-2 context from 4K to 32K while using a fixed 512-slot cache.

Numbers: extended 4K→32K with a 512-slot cache; perplexity gap vs full-sequence ≤0.04 on evaluated setups

Practical Use: You can train or fine-tune on much longer sequences without linearly increasing GPU memory by keeping a small KV cache and adding LoCoCo heads.

Evidence Ref: Abstract; Table 2 (7B, 32768: ours 3.4408 vs full 3.4012)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
inference compression	3,482 tokens → 128 KV slots	uncompressed full KV	accuracy +0.2791 vs baseline token‑eviction	reported across downstream tasks (abstract claim)	Abstract; Intro	Abstract
perplexity (language modeling)	3.4408	full sequence 3.4012	gap 0.0396	Proof-Pile-2, Llama-2-7B @ eval context 32768	Table 2 (row: 7B, 32768)	Table 2
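The deltas in the table can be rechecked with a few lines of arithmetic. The inputs below are copied from the rows above; the relative gap is a derived figure that the source does not report, and the variable names are illustrative.

```python
# Perplexity on Proof-Pile-2, Llama-2-7B at 32768 tokens (values from Table 2 as quoted).
full_ppl = 3.4012
lococo_ppl = 3.4408

gap = lococo_ppl - full_ppl
relative_gap = gap / full_ppl  # derived here, not reported in the source
print(f"absolute gap: {gap:.4f}")
print(f"relative gap: {relative_gap:.2%}")

# Compression ratios implied by the reported token and slot counts.
prefill_ratio = 3482 / 128   # inference example from the abstract
nominal_ratio = 4096 / 128   # the "up to 32:1" figure quoted elsewhere on this page
print(f"prefill compression: {prefill_ratio:.1f}:1")
print(f"nominal compression: {nominal_ratio:.0f}:1")
```

The absolute gap comes out at 0.0396, matching the table, which is roughly 1.2% of the full-sequence perplexity.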

What To Try In 7 Days

Drop a LoCoCo convolutional head into an existing Llama-2 model and run the provided repo example on a 2k–16k prefill to measure throughput gains.

Fine-tune only the convolutional heads on a small calibration split (≈104M tokens) and compare perplexity vs your current token‑eviction approach.

Replace or augment an existing token‑eviction policy (e.g., heavy hitters) with LoCoCo fusion and measure downstream task accuracy on a long-document benchmark.
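For the comparison in the last item, it helps to have a plain eviction baseline on hand. The sketch below is a simplified heavy-hitter-style policy: keep the most recent entries plus the older entries with the highest accumulated attention scores, and drop the rest. It is an illustration only, not the policy of any particular library or paper; `heavy_hitter_keep`, its parameters, and the sample scores are hypothetical.

```python
def heavy_hitter_keep(acc_scores, budget, recent):
    """Return sorted indices of cache entries to keep under a fixed budget.

    acc_scores: accumulated attention score per cached token (oldest first)
    budget:     total number of entries allowed to remain
    recent:     number of newest entries that are always kept
    """
    if recent > budget:
        raise ValueError("recent must not exceed budget")
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))  # nothing needs to be evicted yet
    older = list(range(n - recent))
    # Rank the older entries by accumulated score, highest first.
    older.sort(key=lambda i: acc_scores[i], reverse=True)
    keep = set(range(n - recent, n)) | set(older[: budget - recent])
    return sorted(keep)

# Hypothetical scores for an 8-token cache squeezed to 4 entries.
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.15, 0.4]
print(heavy_hitter_keep(scores, budget=4, recent=2))  # [0, 3, 6, 7]
```

Entries 1, 2, 4 and 5 are discarded entirely here, whereas a fusion approach would fold their content into the surviving slots. That difference is what the downstream accuracy comparison is meant to expose.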

Agent Features

Memory

fixed-size KV cache

Architectures

Long-context Transformer

Optimization Features

Token Efficiency

reported compression ratio up to 32:1 (e.g., 4096→128)

Infra Optimization

improved pre-fill throughput vs full KV; similar memory to other eviction baselines

System Optimization

peak memory O(MB+M) instead of O(LB) for segment attention

Training Optimization

LoRA

Inference Optimization

fixed-size KV cache via learned convolutional fusion

drop-in heads that keep pre-trained weights unchanged

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/VITA-Group/LoCoCo

Data URLs

RedPajama, Proof-Pile-2, RACE, TriviaQA, HellaSwag, WinoGrande, ARC, SCROLLS, LongBench

Risks & Boundaries

Limitations

Performance can degrade under extreme compression or very small convolution kernels (ablation shows small kernels hurt).

LoCoCo needs calibration or light fine-tuning (authors used 200 steps for post-hoc compression and LoRA for extension).

When Not To Use

When you can afford the full uncompressed KV cache and want exact full-sequence attention.

When the application requires strict token-level fidelity (exact retrieval of single tokens).

Failure Modes

Loss of rare, token-specific signals when multiple tokens are merged into one slot.

Optimization instability with very long convolution kernels or extremely small kernels.

Core Entities

Models

Llama-2-7B, Llama-2-13B, ChatGLM3-6B-32k

Metrics

perplexity, throughput (tokens/s), GPU memory (GB), accuracy

Datasets

RedPajama, Proof-Pile-2, RACE, TriviaQA, HellaSwag, WinoGrande, ARC, SCROLLS, LongBench

Benchmarks

SCROLLS, LongBench

