Use tiny fixed KV caches and learned 1‑D convolutions to compress thousands of tokens with low memory and near-full performance

June 8, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen

Links

Abstract / PDF

Why It Matters For Business

LoCoCo lets you handle much longer documents without buying more GPU memory or changing the model core. That reduces infrastructure cost for long‑context applications and speeds up inference prefill.

Summary TLDR

LoCoCo adds small 1‑D convolutional 'compressor' heads to a pre-trained LLM to merge past key-value (KV) pairs into a fixed-size cache. This keeps peak memory constant (O(MB+M)), cuts KV cache growth, and supports both inference and post-training extension of context windows (e.g., 4K→32K) with modest tuning. Experiments on Llama-2 and ChatGLM show better accuracy on long-context tasks than recent token‑eviction baselines, higher throughput, and workable memory on commodity GPUs. Code is provided.

Problem Statement

Transformer KV caches grow linearly with context length, quickly exhausting GPU memory. Existing token-dropping or local-attention fixes either lose global context or need architecture changes. The field needs a drop-in, data-driven compressor that keeps a constant-size KV cache while preserving useful long-range information.
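To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size. It uses the public Llama-2-7B shape (32 layers, 32 heads, head dimension 128); fp16 storage and batch size 1 are assumptions, and the helper names are illustrative only.

```python
# Rough KV cache size estimate. Shape values follow the public Llama-2-7B
# configuration; fp16 storage and batch size 1 are assumptions.
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to hold keys and values for `num_tokens` positions."""
    # Leading factor 2 accounts for storing both K and V.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return num_tokens * per_token


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


for label, tokens in [("4K context", 4096),
                      ("32K context", 32768),
                      ("512 fixed slots", 512)]:
    print(f"{label:>16}: {to_gib(kv_cache_bytes(tokens)):.2f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token, so a 32K context needs 16 GiB for the cache alone, while a fixed 512-slot cache stays at 0.25 GiB regardless of how much text has been read.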

Main Contribution

Propose LoCoCo: learnable 1‑D convolutional heads that merge new tokens and cached KV pairs into a fixed-size cache (M slots).

Provide a drop-in design that preserves pre-trained weights and requires only small extra heads and light calibration or fine-tuning.

Show LoCoCo works for both inference (post-hoc compression) and post-training context extension (4K→32K) with modest memory and good throughput.
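The fixed-size merge idea can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the per-entry scalar score, the softmax normalization, and the names `conv1d_same`, `fuse`, and `kernels` are assumptions made for illustration, and the kernels are hand-picked constants standing in for learned weights. The only point it demonstrates is that a 1-D convolution can produce mixing weights that fold M cached entries plus B new entries back into M slots.

```python
import math


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output has len(signal) items."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * (k - 1 - pad)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(signal))]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(cache, new_block, kernels):
    """Merge M cached vectors and B new vectors into len(kernels) slots.

    Each kernel is a hypothetical stand-in for one learned convolutional head.
    """
    entries = cache + new_block                      # M + B vectors
    scores = [sum(v) / len(v) for v in entries]      # assumed scalar summary
    dim = len(entries[0])
    fused = []
    for kernel in kernels:
        weights = softmax(conv1d_same(scores, kernel))  # mixing weights, sum to 1
        fused.append([sum(w * v[d] for w, v in zip(weights, entries))
                      for d in range(dim)])
    return fused


M, B, D = 4, 3, 2
cache = [[0.0] * D for _ in range(M)]
kernels = [[0.1 * (i + 1), 0.2, -0.1] for i in range(M)]  # not learned, just constants
for step in range(5):                                # stream 5 blocks of B tokens
    block = [[float(step + t), 1.0] for t in range(B)]
    cache = fuse(cache, block, kernels)
print(len(cache), len(cache[0]))                     # cache size never grows
```

However many blocks are streamed in, the cache keeps M slots, which is the property that bounds memory.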

Key Findings

LoCoCo can compress very long prefill contexts into a tiny KV cache during inference.

Numbers: compressed 3,482 tokens into a 128-slot KV cache; accuracy gain vs. baseline 0.2791 (reported)

Post-training tuning with LoCoCo extends Llama-2 context from 4K to 32K while using a fixed 512-slot cache.

Numbers: extended 4K→32K with a 512-slot cache; perplexity gap vs. full sequence ≤0.04 on evaluated setups

LoCoCo keeps peak GPU memory low and improves inference throughput compared to baselines.

Numbers: memory 50GB (same as H2O); throughput 33 tokens/s vs. 11 tokens/s for full sequence

LoCoCo yields consistent quality gains on standard long-context benchmarks.

Numbers: SCROLLS quality 0.3528 (LoCoCo) vs. 0.3461 (H2O); LongBench 37.4% (LoCoCo) vs. 36.9% (H2O)

Results

inference compression

Value: 3,482 tokens → 128 KV slots

Baseline: uncompressed full KV

perplexity (language modeling)

Value: 3.4408

Baseline: full sequence 3.4012

throughput (pre-fill stage)

Value: 33 tokens/s

Baseline: full sequence 11 tokens/s

memory usage (training / tuning)

Value: 50GB

Baseline: full sequence OOM
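The reported figures above imply a few simple ratios. The snippet below only does arithmetic on those numbers; it assumes nothing about whether they come from the same experimental setup.

```python
# Arithmetic on the reported results; variable names are illustrative.
compression_ratio = 3482 / 128          # tokens per KV slot
ppl_gap = 3.4408 - 3.4012               # LoCoCo minus full sequence
ppl_rel = ppl_gap / 3.4012              # gap relative to full sequence
speedup = 33 / 11                       # pre-fill throughput ratio

print(f"compression ratio: {compression_ratio:.1f}:1")
print(f"perplexity gap: {ppl_gap:.4f} ({ppl_rel:.2%} relative)")
print(f"pre-fill speedup: {speedup:.1f}x")
```

This works out to roughly 27 tokens per slot, a perplexity gap of about 0.04 (a little over 1% relative), and a 3x pre-fill speedup.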

Who Should Care

What To Try In 7 Days

Drop a LoCoCo convolutional head into an existing Llama-2 model and run the provided repo example on a 2k–16k prefill to measure throughput gains.

Fine-tune only the convolutional heads on a small calibration split (≈104M tokens) and compare perplexity vs your current token‑eviction approach.

Replace or augment an existing token‑eviction policy (e.g., heavy hitters) with LoCoCo fusion and measure downstream task accuracy on a long-document benchmark.
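For the comparison in the last item, a heavy-hitter style eviction baseline can be sketched as follows. This is a simplified illustration, not H2O's actual code: the function name, its parameters, and the sample scores are hypothetical. It keeps the most recent tokens plus the older tokens with the highest accumulated attention, and drops the rest.

```python
def heavy_hitter_keep(acc_scores, budget, recent):
    """Return sorted indices of cached tokens to keep.

    acc_scores: accumulated attention each cached token has received
    budget:     total number of slots available
    recent:     number of most-recent tokens that are always kept
    """
    if recent > budget:
        raise ValueError("recent window cannot exceed the budget")
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))            # nothing to evict yet
    recent_idx = list(range(n - recent, n))
    older = range(n - recent)
    # Rank older tokens by accumulated attention, highest first.
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)
    return sorted(heavy[: budget - recent] + recent_idx)


scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.15, 0.4]
print(heavy_hitter_keep(scores, budget=4, recent=2))  # [0, 3, 6, 7]
```

Tokens outside the kept set are discarded entirely, whereas a fusion approach folds their contribution into the remaining slots. That difference is what the accuracy comparison is meant to measure.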

Agent Features

Memory

  • fixed-size KV cache

Architectures

  • Long-context Transformer

Optimization Features

Token Efficiency

  • reported compression ratio up to 32:1 (e.g., 4096→128)

Infra Optimization

  • improved pre-fill throughput vs full KV; similar memory to other eviction baselines

System Optimization

  • peak memory O(MB+M) instead of O(LB) for segment attention

Training Optimization

  • LoRA

Inference Optimization

  • fixed-size KV cache via learned convolutional fusion
  • drop-in heads that keep pre-trained weights unchanged

Reproducibility

Data Urls

  • RedPajama
  • Proof-Pile-2
  • RACE
  • TriviaQA
  • HellaSwag
  • WinoGrande
  • ARC
  • SCROLLS
  • LongBench

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance can degrade under extreme compression or very small convolution kernels (ablation shows small kernels hurt).
  • LoCoCo needs calibration or light fine-tuning (authors used 200 steps for post-hoc compression and LoRA for extension).
  • Compression is learned and may merge fine-grained token signals needed for exact token‑level retrieval.

When Not To Use

  • When you can afford full uncompressed KV memory and need exact full-sequence attention.
  • When the application requires strict token-level fidelity (exact retrieval of single tokens).
  • When you cannot allocate even the small extra compute for convolutional fusion (the authors report that this overhead is small).

Failure Modes

  • Loss of rare, token-specific signals when multiple tokens are merged into one slot.
  • Optimization instability with very long convolution kernels or extremely small kernels.
  • Residual gap versus full sequence at extreme context lengths or very high compression ratios.
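The first failure mode can be seen in a two-token toy example. The vectors and the equal-weight merge are assumptions chosen for clarity, not values from the paper: once a rare token's key is averaged with another key, a query that matched it exactly now matches the merged slot only half as strongly.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


key_rare = [1.0, 0.0]      # key of a rare, token-specific signal
key_common = [0.0, 1.0]    # key of an unrelated token
# Equal-weight merge of the two keys into a single slot (assumed weights).
merged = [0.5 * r + 0.5 * c for r, c in zip(key_rare, key_common)]

query = [1.0, 0.0]         # query looking for the rare token
print(dot(query, key_rare))  # 1.0 before merging
print(dot(query, merged))    # 0.5 after merging
```

Learned mixing weights can reduce this dilution by favoring important entries, but some loss is unavoidable whenever several tokens share one slot.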

Core Entities

Models

  • Llama-2-7B
  • Llama-2-13B
  • ChatGLM3-6B-32k

Metrics

  • perplexity
  • throughput (tokens/s)
  • GPU memory (GB)
  • Accuracy

Datasets

  • RedPajama
  • Proof-Pile-2
  • RACE
  • TriviaQA
  • HellaSwag
  • WinoGrande
  • ARC
  • SCROLLS
  • LongBench

Benchmarks

  • SCROLLS
  • LongBench