Train LLMs with compressed KV caches: keep most performance while cutting rollout memory by ~35–53%

January 15, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang

Links

Abstract / PDF

Why It Matters For Business

You can cut rollout memory and enable larger RL batch sizes with minimal accuracy loss, lowering GPU cost and enabling RL experiments on smaller clusters.

Summary TLDR

Sparse-RL lets you run reinforcement learning for large language models with compressed key-value (KV) caches during rollouts. It adds two corrections, sequence-level rejection and token-level importance reweighting, to fix the policy mismatch caused by compression. In experiments on math reasoning benchmarks (Qwen2.5 and Llama models), Sparse-RL cuts KV storage by 35–53% while retaining ~97% of dense performance on the largest tested model (Qwen2.5-7B) and sometimes improving small-model results. It also makes the model more robust when the same compression is used at inference time.

Problem Statement

Rollout generation in RL for LLMs builds a growing KV cache that quickly exhausts GPU memory. Training-free KV compression reduces memory but causes a policy mismatch: trajectories are sampled from a compressed (sparse) view of the context, while gradients are computed with the dense policy. The mismatch produces anomalous trajectories and can lead to catastrophic training collapse. The paper addresses how to use KV compression safely in RL rollouts.
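The mismatch can be written with the standard importance-sampling identity. The notation below is assumed for this summary and is not taken from the paper: \pi_{\text{sparse}} is the compressed-KV sampler, \pi_{\text{dense}} is the full-KV learner policy, y is a response sampled for prompt x, and f is any per-trajectory quantity such as a policy-gradient term.

```latex
\mathbb{E}_{y \sim \pi_{\text{dense}}(\cdot \mid x)}\big[f(y)\big]
  = \mathbb{E}_{y \sim \pi_{\text{sparse}}(\cdot \mid x)}
    \left[\frac{\pi_{\text{dense}}(y \mid x)}{\pi_{\text{sparse}}(y \mid x)}\, f(y)\right],
\qquad
\frac{\pi_{\text{dense}}(y \mid x)}{\pi_{\text{sparse}}(y \mid x)}
  = \prod_{t=1}^{|y|}
    \frac{\pi_{\text{dense}}(y_t \mid x, y_{<t})}{\pi_{\text{sparse}}(y_t \mid x, y_{<t})}
```

The identity holds when \pi_{\text{sparse}} assigns nonzero probability wherever \pi_{\text{dense}} does. Treating sparse samples as if they came from the dense policy drops this ratio, which is the off-policy bias the corrections are meant to address.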

Main Contribution

Identify the policy mismatch between the sparse sampler (compressed KV) and the dense learner as the reason sparse rollouts crash RL training.

Introduce Sparse-RL: combine sequence-level Sparsity-Aware Rejection Sampling and token-level Importance-based Reweighting to correct off-policy bias from compression.

Demonstrate on 4 model sizes and 2 compression algorithms that Sparse-RL preserves most of the dense performance while cutting KV memory usage under a fixed token budget.

Show that training with Sparse-RL produces models that are more robust under the same sparse inference setting.
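One way to picture the two corrections together is the sketch below. It is an illustration under assumptions, not the paper's implementation: the function name, the rejection rule (drop a rollout if any sampled token is near-impossible under the dense policy), and the thresholds `min_dense_prob` and `clip_max` are all hypothetical.

```python
import numpy as np

def sparse_rl_token_weights(logp_dense, logp_sparse,
                            min_dense_prob=1e-4, clip_max=2.0):
    """Return (accepted, weights) for one sampled sequence.

    logp_dense / logp_sparse: per-token log-probs of the sampled tokens under
    the full-KV policy and the compressed-KV sampler.
    min_dense_prob and clip_max are hypothetical thresholds, not paper values.
    """
    logp_dense = np.asarray(logp_dense, dtype=np.float64)
    logp_sparse = np.asarray(logp_sparse, dtype=np.float64)
    if logp_dense.shape != logp_sparse.shape:
        raise ValueError("log-prob arrays must have the same shape")

    # Sequence-level rejection: discard the whole rollout if any sampled token
    # is near-impossible under the dense policy (an anomalous trajectory).
    if np.any(logp_dense < np.log(min_dense_prob)):
        return False, np.zeros_like(logp_dense)

    # Token-level importance reweighting: dense / sparse probability ratio,
    # clipped from above to bound the variance of the update.
    ratio = np.exp(logp_dense - logp_sparse)
    return True, np.minimum(ratio, clip_max)

# Usage: the second token is twice as likely under the sparse sampler,
# so its contribution is down-weighted.
accepted, weights = sparse_rl_token_weights(np.log([0.5, 0.4]), np.log([0.5, 0.8]))
```

In a GRPO-style loss, the returned weights would multiply each token's policy-gradient term, and rejected rollouts would be excluded from the batch.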

Key Findings

Sparse-RL keeps most dense performance while saving KV memory.

Numbers: Qwen2.5-7B retains 96.8% of dense avg score (51.4 vs 53.1)

Memory (KV tokens) reduced substantially across sizes.

Numbers: token savings of 35.1%, 53.3%, 42.0%, and 39.4% across the tested models

Sparse-RL improves sparse inference when training and deployment use same compression.

Numbers: Qwen2.5-3B MATH500: +7.6pp (61.8 vs 54.2)

Rejection and reweighting keep training stable with low trust-region violation.

Numbers: avg rejection rate 0.07; avg clip ratio 0.0005

A KV budget of ~512 tokens suffices for comparable performance on evaluated benchmarks.

Numbers: a 512-token budget matches FullKV in the tested setups

Results

KV token savings

Value: 35.1% / 53.3% / 42.0% / 39.4%

Baseline: Dense GRPO (FullKV)

Accuracy

Value: 96.8% retained

Baseline: Dense GRPO avg (Qwen2.5-7B)

Small-model improvement

Value: 36.2 vs 35.4 avg (+0.8pp / 2.3% relative)

Baseline: Dense GRPO (Qwen2.5-1.5B)

Sparse inference advantage

Value: +7.6 percentage points

Baseline: Dense-trained model under same compression

Training stability stats

Value: avg rejection 0.07; avg clip ratio 0.0005

Baseline: Naive sparse rollout collapsed
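The headline figures can be re-derived from the reported scores with simple arithmetic. The helper names below are made up for this check; the inputs are the scores quoted in this summary.

```python
def retained_pct(sparse_score, dense_score):
    """Share of the dense score kept by the sparse-trained model, in percent."""
    return round(100.0 * sparse_score / dense_score, 1)

def gain_pp(new_score, old_score):
    """Absolute difference in percentage points."""
    return round(new_score - old_score, 1)

def gain_relative_pct(new_score, old_score):
    """Relative improvement over the old score, in percent."""
    return round(100.0 * (new_score - old_score) / old_score, 1)

print(retained_pct(51.4, 53.1))       # Qwen2.5-7B avg: 96.8
print(gain_pp(36.2, 35.4))            # Qwen2.5-1.5B avg: 0.8
print(gain_relative_pct(36.2, 35.4))  # Qwen2.5-1.5B avg: 2.3
print(gain_pp(61.8, 54.2))            # Qwen2.5-3B MATH500: 7.6
```

Keeping percentage points and relative percent as separate helpers avoids mixing the two units when comparing models.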

Who Should Care

What To Try In 7 Days

Run a pilot: train with a KV budget of 512 tokens and monitor rejection rate and clip ratio.

Instrument token-probability ratios between compressed and dense views to implement rejection + importance reweighting.

Measure downstream sparse-inference accuracy to see if sparsity-aware training improves your deployment.
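For the pilot, a small counter is enough to track the two stability numbers. This is a minimal sketch under assumptions: the class name is hypothetical, and "clip ratio" is taken here to mean the fraction of tokens in accepted rollouts whose importance ratio exceeds the clip threshold, which may differ from the paper's exact definition.

```python
from dataclasses import dataclass

@dataclass
class RolloutStats:
    """Running counters for rejection rate and clip ratio (assumed definitions)."""
    sequences: int = 0
    rejected: int = 0
    tokens: int = 0
    clipped: int = 0

    def update(self, accepted, ratios, clip_max=2.0):
        """Record one rollout; ratios are per-token dense/sparse probability ratios."""
        self.sequences += 1
        if not accepted:
            self.rejected += 1
            return
        for r in ratios:
            self.tokens += 1
            if r > clip_max:
                self.clipped += 1

    @property
    def rejection_rate(self):
        return self.rejected / self.sequences if self.sequences else 0.0

    @property
    def clip_ratio(self):
        return self.clipped / self.tokens if self.tokens else 0.0

# Usage: log these once per training step and watch for upward drift.
stats = RolloutStats()
stats.update(True, [1.0, 0.9, 2.5, 1.1])
stats.update(False, [])
print(stats.rejection_rate, stats.clip_ratio)  # 0.5 0.25
```

A rejection rate or clip ratio that climbs steadily during training would suggest the KV budget is too aggressive for the task.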

Agent Features

Memory

  • KV cache compression
  • short-term context eviction

Planning

  • chain-of-thought style reasoning

Frameworks

  • GRPO
  • slime

Is Agentic

true

Architectures

  • autoregressive LLM

Optimization Features

Token Efficiency

  • fixed KV token budget (e.g., 512)

Infra Optimization

  • lower GPU memory per rollout

Model Optimization

  • sparsity-aware training

System Optimization

  • reduced memory footprint enables larger rollout batches
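The link between stored KV tokens and rollout batch size follows from the usual KV cache size formula. The sketch below uses illustrative configuration values (32 layers, 8 KV heads, head dimension 128, 2-byte elements) that are placeholders, not the configuration of any model evaluated in the paper.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold keys and values for `tokens` positions."""
    # The factor 2 counts one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * tokens * batch_size

def max_rollouts(memory_bytes, tokens, **config):
    """How many concurrent rollouts fit in a given KV memory allowance."""
    return memory_bytes // kv_cache_bytes(tokens, **config)

# Placeholder model shape; substitute the values from your model's config.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)
GIB = 2 ** 30

print(kv_cache_bytes(4096, **config) // 2 ** 20)  # 512 (MiB per 4096-token rollout)
print(max_rollouts(16 * GIB, 4096, **config))     # 32
print(max_rollouts(16 * GIB, 2048, **config))     # 64
```

Halving the stored tokens doubles the number of rollouts that fit in the same allowance. Real savings depend on response lengths and on when compression is triggered during generation.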

Training Optimization

  • importance sampling reweighting
  • sequence-level rejection sampling

Inference Optimization

  • KV cache compression (R-KV, SnapKV)
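To make the idea of a fixed KV budget concrete, the sketch below keeps the most recent positions plus the highest-scoring older ones. It is a generic score-plus-recency heuristic written for illustration, not the actual R-KV or SnapKV procedure; the function name, the `recent_window` parameter, and the source of the importance scores are assumptions.

```python
import numpy as np

def select_kv_positions(scores, budget, recent_window):
    """Return sorted indices of cache positions to keep under a token budget.

    scores: one importance score per cached position (for example, accumulated
    attention mass; how scores are computed is outside this sketch).
    """
    scores = np.asarray(scores, dtype=np.float64)
    n = scores.shape[0]
    if budget >= n:
        return np.arange(n)  # nothing to evict

    # Always keep the most recent positions, capped at the budget.
    recent_window = min(recent_window, budget)
    recent = np.arange(n - recent_window, n)

    # Fill the remaining slots with the highest-scoring older positions.
    n_older = budget - recent_window
    older_scores = scores[: n - recent_window]
    top_older = np.argsort(-older_scores, kind="stable")[:n_older]

    return np.sort(np.concatenate([top_older, recent]))

# Usage: 6 cached positions, budget of 4, last 2 always kept.
keep = select_kv_positions([0.1, 0.9, 0.2, 0.8, 0.3, 0.4], budget=4, recent_window=2)
print(keep.tolist())  # [1, 3, 4, 5]
```

Whatever the scoring rule, evicted positions change the distribution the sampler draws from, which is the source of the sparse versus dense mismatch.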

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluated on mathematical reasoning with binary rewards; generalization to open-ended generation or natural rewards is untested.
  • Strict sequence-level rejection can waste samples; rejection rate may rise under aggressive compression.
  • Performance depends on compression preserving core context; very small budgets (e.g., 128 tokens) degrade accuracy.

When Not To Use

  • Open-ended creative or instruction-following tasks where anomalous tokens are hard to define.
  • Extremely small KV budgets (the ablation shows failure at 128 tokens).
  • When you cannot compute dense token probabilities for rejection/reweighting.

Failure Modes

  • Anomalous sequences (e.g., infinite repetition) caused by compression that produce extreme gradients if not filtered.
  • High rejection rates under aggressive budgets that waste compute and slow training.
  • Residual policy mismatch if compression radically changes token support, leading to biased updates.

Core Entities

Models

  • Qwen2.5-1.5B
  • Qwen2.5-3B
  • Qwen2.5-7B
  • Llama-3.2-1B-Instruct
  • GRPO (RL training algorithm used as the dense baseline, not a model)

Metrics

  • Pass@1
  • Avg@32
  • Average reward
  • Token savings (KV tokens)
  • Rejection rate
  • Policy KL mismatch

Datasets

  • SimpleRL-Zoo (hard split)
  • GSM8K
  • MATH500
  • Gaokao
  • Minerva
  • OlympiadBench
  • AIME24
  • AMC23

Benchmarks

  • GSM8K
  • MATH500
  • Gaokao
  • Minerva
  • OlympiadBench
  • AIME24
  • AMC23