Compress long contexts into cached activations (beacons) to cut KV memory 8x and speed inference ~2x while keeping quality

January 7, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

Links

Abstract / PDF

Why It Matters For Business

Cuts KV cache memory by up to 8x and roughly halves latency on long inputs while keeping task quality close to an uncompressed model, letting teams process far larger documents at lower GPU cost.

Summary TLDR

Activation Beacon is a plug-in for transformer LLMs that compresses long inputs by distilling chunks into special 'beacon' token activations (keys/values). The method compresses progressively per chunk, caches beacon activations, and is trained with a compression-aware next-token objective. On tests up to 128K context it keeps generation quality close to an uncompressed model while cutting KV cache by ~8x and halving inference time in high-length settings.

Problem Statement

Transformer LLMs become very slow and memory-heavy when processing long inputs because they must store and attend over per-token key/value activations. Existing context-compression approaches (soft tokens or token deletion) either fail to capture complex long-context information, require re-encoding, or lack flexible compression ratios.

Main Contribution

Introduce a beacon token whose per-layer key/value activations serve as the compressed representation of long context.

Progressive chunked compression: split long inputs into chunks, break each chunk into fine-grained units, interleave beacon tokens, accumulate beacon activations and discard raw-token activations.

Train with compression-based auto-regression and random chunk-wise compression ratios so one model supports many compression settings.

Show empirical wins: comparable quality to uncompressed fine-tuned baselines on long-context benchmarks while reducing KV cache and inference cost significantly.
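The progressive chunked compression described above can be illustrated with a toy sketch. It assumes string tokens and a placeholder beacon marker; `BEACON`, `interleave_beacons`, and `compress` are hypothetical names for illustration, not the authors' code, and no model forward pass is run. It only shows where beacons sit and what would remain in the cache.

```python
from typing import List

BEACON = "<bcn>"  # hypothetical placeholder for the beacon token


def interleave_beacons(chunk: List[str], ratio: int) -> List[str]:
    """Insert one beacon after every `ratio` raw tokens of a chunk."""
    out: List[str] = []
    for start in range(0, len(chunk), ratio):
        unit = chunk[start:start + ratio]  # one fine-grained unit
        out.extend(unit)
        out.append(BEACON)  # the beacon meant to summarize this unit
    return out


def compress(tokens: List[str], chunk_size: int, ratio: int) -> List[str]:
    """Walk the input chunk by chunk, keeping only beacon slots in the cache."""
    cache: List[str] = []  # stands in for accumulated beacon keys/values
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        interleaved = interleave_beacons(chunk, ratio)
        # A real model would attend over the cached beacons plus `interleaved`
        # here. Afterwards only the beacon slots are kept; raw tokens are dropped.
        cache.extend(tok for tok in interleaved if tok == BEACON)
    return cache


tokens = [f"t{i}" for i in range(64)]
cache = compress(tokens, chunk_size=16, ratio=8)
print(len(tokens), len(cache))  # 64 raw tokens leave 8 cached beacon slots
```

With 64 tokens, 16-token chunks, and a ratio of 8, each chunk contributes two beacons, so the cache holds 8 entries instead of 64.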

Key Findings

Compression preserves generation quality on evaluated long-context benchmarks.

Numbers: Single-Doc QA, Ours 34.9 vs Full-FT 34.8 (LongBench, Table 1)

Inference latency halves in high-length settings using beacon compression.

Numbers: ~2x speedup at 128K context (end-to-end latency, §4.3, Table 2)

KV cache memory is reduced by the compression ratio.

Numbers: 8x KV cache reduction reported for x8 compression (§4.3)

Method generalizes beyond training sequence lengths.

Numbers: Trained on ≤20K contexts; evaluated up to 128K (Needle-in-a-Haystack, Figure 4)

Outperforms soft-token and token-deletion compression baselines on long-context tasks.

Numbers: ICAE 12.9 vs Ours 34.9 (Single-Doc QA, Table 1)

Results

Single-Doc QA (LongBench)

Value: Ours 34.9

Baseline: Full-FT 34.8

End-to-end latency

Value: 2.0x faster

Baseline: Full-FT

KV cache size

Value: 8x reduction

Baseline: no compression

FLOPs reduction (projection)

Value: up to >4x at 256K (estimated)

Baseline: full attention

Short-context task retention

Value: negligible drop

Baseline: original LLM

Who Should Care

What To Try In 7 Days

Run Activation Beacon on an existing 7B model and a representative long-document workload to measure latency and KV memory vs your current pipeline.

Start with x8 compression on development data; compare top-1 retrieval/QA accuracy and tail latency.

If you use multi-turn chat, test chunked incremental updates to see savings from not re-encoding previous turns.
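For the latency comparison suggested above, a small timing harness along these lines may be enough to get started. It is a sketch: `measure_latency` and the `run` callable are hypothetical names, and `run` stands in for whatever function invokes your baseline or compressed pipeline. On a GPU, the callable should block until generation has finished, otherwise the timings will be too optimistic.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    run: Callable[[str], object],
    prompts: List[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Time `run` on every prompt and report mean and p95 wall-clock latency."""
    samples: List[float] = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            run(prompt)  # hypothetical: calls the pipeline under test
            samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "n": len(samples),
        "mean_s": statistics.mean(samples),
        "p95_s": samples[p95_index],
    }


# Stand-in workload; replace the lambda with a real generation call.
stats = measure_latency(lambda prompt: prompt.upper(), ["doc one", "doc two"], repeats=2)
print(stats["n"])
```

Running the same harness against both pipelines on identical prompts gives directly comparable mean and tail numbers.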

Agent Features

Memory

  • progressive chunked activation cache

Architectures

  • transformer

Optimization Features

Token Efficiency

  • supports variable compression ratios (x2, x4, x8, x16, x32); x8 recommended

Infra Optimization

  • reduces KV cache memory proportional to compression ratio (e.g., 8x)
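To make the memory claim concrete, here is a back-of-the-envelope calculation. It assumes a Llama-2-7B-shaped model (32 layers, hidden size 4096, full multi-head attention) with fp16 activations; those shape and precision values are assumptions brought in for illustration and are not taken from this summary. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,      # assumed: Llama-2-7B-like depth
    hidden_size: int = 4096,   # assumed: Llama-2-7B-like width
    bytes_per_value: int = 2,  # assumed: fp16 storage
) -> int:
    """Bytes needed to cache keys and values for `num_tokens` positions."""
    # Two tensors (key and value) per layer, each `hidden_size` wide per token.
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


GIB = 1024 ** 3
context = 128 * 1024                       # 128K-token input
full = kv_cache_bytes(context)             # every raw token cached
compressed = kv_cache_bytes(context // 8)  # only beacons cached at x8

print(full / GIB, compressed / GIB)  # 64.0 8.0
```

Under these assumptions a 128K-token input needs about 64 GiB of KV cache uncompressed and about 8 GiB at x8 compression.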

Model Optimization

  • activation compression (keys/values per layer)

System Optimization

  • chunk-local attention: each chunk attends only to its own tokens and the accumulated beacons, which reduces self-attention scope
  • progressive distillation allows inputs longer than model window

Training Optimization

  • compression-based auto-regression objective
  • chunk-wise random compression ratio sampling
  • freeze backbone weights; train beacon projections + embedding
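Chunk-wise random ratio sampling can be pictured as drawing one ratio per chunk for each training sequence. The sketch below assumes the ratio set listed in this summary and a 2048-token chunk size; the chunk size and the name `sample_chunk_ratios` are illustrative assumptions, not values confirmed by the paper.

```python
import random
from typing import List

RATIOS = (2, 4, 8, 16, 32)  # compression ratios listed in this summary


def sample_chunk_ratios(num_tokens: int, chunk_size: int, rng: random.Random) -> List[int]:
    """Draw an independent compression ratio for every chunk of a sequence."""
    num_chunks = -(-num_tokens // chunk_size)  # ceiling division
    return [rng.choice(RATIOS) for _ in range(num_chunks)]


rng = random.Random(0)  # seeded so a run can be reproduced
ratios = sample_chunk_ratios(num_tokens=20_000, chunk_size=2048, rng=rng)
print(len(ratios))  # 10 chunks for a 20K-token sequence
```

Because every chunk can see a different ratio during training, a single set of weights is exposed to all supported compression settings.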

Inference Optimization

  • cache and reuse beacon activations across chunks
  • avoid re-encoding compressed tokens at query time
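The benefit of reusing cached beacon activations is easiest to see in a multi-turn setting. The toy counter below assumes that each turn is encoded once and only its beacons are kept; `BeaconSession` and `tokens_encoded_with_reencoding` are hypothetical names, and no model is involved.

```python
from typing import List


class BeaconSession:
    """Toy bookkeeping for a multi-turn session that reuses cached beacons."""

    def __init__(self, ratio: int) -> None:
        self.ratio = ratio
        self.cached_beacons = 0  # beacon slots kept across turns
        self.tokens_encoded = 0  # raw tokens pushed through the model

    def add_turn(self, new_tokens: int) -> None:
        # Only the new turn is encoded; earlier turns live on as beacons.
        self.tokens_encoded += new_tokens
        self.cached_beacons += -(-new_tokens // self.ratio)  # ceiling division


def tokens_encoded_with_reencoding(turns: List[int]) -> int:
    """Raw tokens encoded if the whole history is re-encoded on every turn."""
    total = 0
    history = 0
    for new_tokens in turns:
        history += new_tokens
        total += history
    return total


turns = [1000, 2000, 3000]
session = BeaconSession(ratio=8)
for new_tokens in turns:
    session.add_turn(new_tokens)

print(session.tokens_encoded, session.cached_beacons, tokens_encoded_with_reencoding(turns))
# 6000 750 10000
```

In this example, incremental updates encode 6,000 tokens in total versus 10,000 when the history is re-encoded each turn, and the cache holds 750 beacon slots instead of 6,000 raw-token entries.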

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Beacon tokens introduce extra per-chunk compute (MLP/projection) that offsets some attention gains at small lengths.
  • Training requires a two-stage pipeline (pretraining, then fine-tuning); both stages contribute to quality.
  • Performance depends on fine-grained interleaving; appending all beacons at chunk end degrades quality (§4.6).
  • Evaluations focus on 7B models and selected benchmarks; larger models and diverse domains untested here.

When Not To Use

  • When you need query-dependent token deletion that relies on the question at hand (some deletion methods are query-aware).
  • If you cannot add a training stage or lack representative long-context data for fine-tuning.

Failure Modes

  • High compression ratios can remove fine-grained details and hurt retrieval/QA.
  • Using non-fine-grained beacon placement (all at chunk end) causes major information loss (§4.6).
  • If beacon projections are poorly trained, downstream generation may degrade.

Core Entities

Models

  • Llama-2-7B
  • Qwen-2-7B

Metrics

  • Accuracy
  • latency (s)
  • KV cache reduction
  • FLOPs

Datasets

  • LongBench
  • Needle-in-a-Haystack
  • LongAlpaca
  • BookSum
  • RedPajama
  • Synthesized QA

Benchmarks

  • LongBench
  • Needle-in-a-Haystack