Train LLMs with compressed KV caches: keep most performance while cutting rollout memory by ~35–53%

January 15, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical for math-style RL tasks and was validated across multiple model sizes and two compression algorithms, but it was tested on verifiable tasks with binary rewards only.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang

Why It Matters For Business

You can cut rollout memory and enable larger RL batch sizes with minimal accuracy loss, lowering GPU cost and enabling RL experiments on smaller clusters.

Summary TLDR

Sparse-RL lets you run reinforcement learning for large language models with compressed key-value (KV) caches during rollouts. It adds two corrections, sequence-level rejection sampling and token-level importance reweighting, to fix the policy mismatch caused by compression. In experiments on math reasoning benchmarks with Qwen2.5 and Llama models, Sparse-RL cuts KV storage by 35–53% while retaining ~97% of dense performance on large models (96.8% for Qwen2.5-7B) and sometimes improving small-model results. It also makes the model more robust when the same compression is used at inference time.

Problem Statement

Rollout generation in RL for LLMs builds a KV cache that grows with every generated token and quickly exhausts GPU memory. Training-free KV compression reduces memory but introduces a policy mismatch: trajectories are sampled from a compressed (sparse) view of the context while gradients are computed under the dense policy, which produces anomalous trajectories and catastrophic training collapse. The paper addresses how to use KV compression safely in RL rollouts.
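To make the memory pressure concrete, here is a back-of-the-envelope sketch of how a dense KV cache scales with rollout length. The layer count, head count, head dimension, and batch size are hypothetical values chosen for illustration, not the configuration of any model in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# ASSUMPTION: hypothetical model shape and rollout batch, fp16/bf16 storage.
cfg = dict(num_layers=28, num_kv_heads=4, head_dim=128, batch_size=64)

for seq_len in (1024, 4096, 16384):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"seq_len={seq_len:>6}: {gib:.2f} GiB of dense KV cache")
```

The cache grows linearly in both sequence length and batch size, which is why long reasoning rollouts with many parallel samples hit the memory limit first.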

Main Contribution

Identifies the policy mismatch between the sparse sampler (compressed KV) and the dense learner as the reason sparse rollouts crash RL training.

Introduces Sparse-RL, which combines sequence-level Sparsity-Aware Rejection Sampling with token-level Importance-based Reweighting to correct the off-policy bias introduced by compression.
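The two corrections can be pictured with a minimal sketch. This summary does not give the paper's exact rejection criterion or weighting formula, so the rule below (reject if any token is nearly impossible under the dense policy, otherwise use truncated probability ratios) and both thresholds are assumptions, and `correct_rollout` is a hypothetical helper name.

```python
import math


def correct_rollout(dense_logps, sparse_logps, reject_below=1e-4, clip_max=2.0):
    """Return per-token weights, or None if the whole sequence is rejected.

    ASSUMPTION: the rejection rule and both thresholds are illustrative
    placeholders, not the criteria used in the paper.
    """
    if len(dense_logps) != len(sparse_logps):
        raise ValueError("log-prob lists must have the same length")
    # Ratio of dense-policy to sparse-sampler probability for each sampled token.
    ratios = [math.exp(d - s) for d, s in zip(dense_logps, sparse_logps)]
    # Sequence-level rejection: drop the trajectory if any token is one the
    # dense policy would almost never have produced.
    if any(r < reject_below for r in ratios):
        return None
    # Token-level reweighting: truncate large ratios to limit gradient variance.
    return [min(r, clip_max) for r in ratios]


# A trajectory where both views roughly agree is kept and reweighted.
kept = correct_rollout([-0.1, -0.5, -1.0], [-0.1, -0.7, -0.9])
# A trajectory containing a token the dense policy finds implausible is dropped.
dropped = correct_rollout([-0.1, -12.0], [-0.1, -0.2])
print(kept, dropped)
```

In a training loop the returned weights would multiply the per-token policy-gradient terms, and rejected sequences would be excluded from the batch.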

Key Findings

Sparse-RL keeps most dense performance while saving KV memory.

Numbers: Qwen2.5-7B retains 96.8% of the dense average score (51.4 vs 53.1)

Practical Use: You can cut rollout KV storage substantially with little accuracy loss on large models; expect a drop of roughly 3% at most on the evaluated math benchmarks.

Evidence Ref: Section 5.2, Table 1

Memory (KV tokens) reduced substantially across sizes.

Numbers: Token savings of 35.1% (Llama-3.2-1B), 53.3% (Qwen2.5-1.5B), 42.0% (Qwen2.5-3B), and 39.4% (Qwen2.5-7B)

Practical Use: Use Sparse-RL to lower GPU memory for rollouts and increase batch sizes or sequence length without major changes to the training setup.

Evidence Ref: Section 5.2, Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| KV token savings | 35.1% / 53.3% / 42.0% / 39.4% | Dense GRPO (FullKV) | | Llama-3.2-1B / Qwen2.5-1.5B / Qwen2.5-3B / Qwen2.5-7B | Table 1; Section 5.2 | Table 1 |
| Accuracy | 96.8% retained | Dense GRPO avg (Qwen2.5-7B) | 51.4 vs 53.1 avg | 7B average across 7 benchmarks | Section 5.2 | Section 5.2 |
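The reported figures can be checked and turned into a rough capacity estimate with a few lines of arithmetic. The batch-scaling function assumes that KV memory is proportional to stored KV tokens and is the only binding memory limit during rollouts; real gains will be smaller once weights and activations are counted. The function names are hypothetical.

```python
# Figures reported in the Results table above.
dense_avg, sparse_avg = 53.1, 51.4
kv_savings = {
    "Llama-3.2-1B": 0.351,
    "Qwen2.5-1.5B": 0.533,
    "Qwen2.5-3B": 0.420,
    "Qwen2.5-7B": 0.394,
}


def retained_fraction(sparse, dense):
    """Share of the dense score that the sparse-rollout run keeps."""
    return sparse / dense


def max_batch_scale(saving):
    """Upper bound on how many more rollouts fit in the same KV memory.

    ASSUMPTION: KV memory scales with stored tokens and is the only limit.
    """
    if not 0 <= saving < 1:
        raise ValueError("saving must be in [0, 1)")
    return 1.0 / (1.0 - saving)


print(f"retained: {retained_fraction(sparse_avg, dense_avg):.1%}")
for model, saving in kv_savings.items():
    print(f"{model}: up to {max_batch_scale(saving):.2f}x rollouts per KV budget")
```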

What To Try In 7 Days

Run a pilot: train with a KV budget of 512 tokens and monitor rejection rate and clip ratio.

Instrument token-probability ratios between compressed and dense views to implement rejection + importance reweighting.

Measure downstream sparse-inference accuracy to see if sparsity-aware training improves your deployment.
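For the monitoring step in the pilot, a small accumulator is enough to track the two health signals. This is a sketch under assumptions: "clip ratio" is interpreted here as the fraction of kept tokens whose dense-to-sparse probability ratio exceeds a cap, and `RolloutStats`, `update`, and `clip_max` are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass
class RolloutStats:
    """Running counters for rejection rate and clip ratio."""
    sequences: int = 0
    rejected: int = 0
    tokens: int = 0
    clipped: int = 0

    def update(self, ratios, rejected, clip_max=2.0):
        """Record one rollout given its per-token probability ratios."""
        self.sequences += 1
        if rejected:
            # Rejected sequences contribute no tokens to the gradient.
            self.rejected += 1
            return
        self.tokens += len(ratios)
        self.clipped += sum(r > clip_max for r in ratios)

    @property
    def rejection_rate(self):
        return self.rejected / self.sequences if self.sequences else 0.0

    @property
    def clip_ratio(self):
        return self.clipped / self.tokens if self.tokens else 0.0


stats = RolloutStats()
stats.update([1.0, 2.5, 0.9, 3.0], rejected=False)
stats.update([], rejected=True)
print(stats.rejection_rate, stats.clip_ratio)
```

A rejection rate that climbs as the KV budget shrinks is an early sign that compression is too aggressive for the task.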

Agent Features

Memory
KV cache compression, short-term context eviction
Planning
chain-of-thought style reasoning
Frameworks
GRPO, slime
Is Agentic

Yes

Architectures
autoregressive LLM

Optimization Features

Token Efficiency
fixed KV token budget (e.g., 512)
Infra Optimization
lower GPU memory per rollout
Model Optimization
sparsity-aware training
System Optimization
reduced memory footprint enables larger rollout batches
Training Optimization
importance sampling reweighting, sequence-level rejection sampling
Inference Optimization
KV cache compression (R-KV, SnapKV)
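The fixed KV token budget listed under Token Efficiency can be pictured with a toy eviction rule. The sketch below is not R-KV or SnapKV; it is a simplified stand-in that always keeps the most recent tokens and fills the rest of the budget with the highest-scoring older tokens. The function name, the `scores` input (for example, accumulated attention mass per cached token), and the `recent` window are all assumptions.

```python
def evict_to_budget(scores, budget, recent=4):
    """Return the sorted indices of cached tokens to keep under a fixed budget.

    ASSUMPTION: a toy heuristic for illustration, not R-KV or SnapKV.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    recent = min(recent, budget)
    # Always keep the most recent tokens.
    keep = set(range(n - recent, n))
    # Fill the remaining slots with the highest-scoring older tokens.
    older = sorted(range(n - recent), key=lambda i: scores[i], reverse=True)
    keep.update(older[: budget - recent])
    return sorted(keep)


scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.4, 0.6]
print(evict_to_budget(scores, budget=4, recent=2))
```

Because the sampler only sees the kept tokens, its next-token distribution can drift from the dense policy, which is the mismatch Sparse-RL corrects.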

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluated on mathematical reasoning with binary rewards; generalization to open-ended generation or natural rewards is untested.

Strict sequence-level rejection can waste samples; rejection rate may rise under aggressive compression.

When Not To Use

Open-ended creative or instruction-following tasks where anomalous tokens are hard to define.

Very small KV budgets (the ablation shows failure at 128 tokens).

Failure Modes

Compression can cause anomalous sequences (e.g., infinite repetition) that produce extreme gradients if not filtered.

High rejection rates under aggressive budgets that waste compute and slow training.
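One cheap guard against the repetition failure mode is to flag rollouts whose tail is a single n-gram repeated several times. This is a hypothetical diagnostic for logging, not the paper's filtering rule; the function name and default thresholds are assumptions.

```python
def has_repetition_loop(token_ids, ngram=4, min_repeats=3):
    """True if the sequence ends with one n-gram repeated min_repeats times.

    ASSUMPTION: an illustrative diagnostic, not the paper's rejection test.
    """
    span = ngram * min_repeats
    if ngram <= 0 or min_repeats < 2 or len(token_ids) < span:
        return False
    tail = list(token_ids[-span:])
    first = tail[:ngram]
    # Compare every later n-gram in the tail with the first one.
    return all(tail[i:i + ngram] == first for i in range(ngram, span, ngram))


looping = [5, 9, 1, 2, 3, 1, 2, 3, 1, 2, 3]
normal = list(range(20))
print(has_repetition_loop(looping, ngram=3), has_repetition_loop(normal, ngram=3))
```

Running the check over a few n-gram sizes and logging the hit rate alongside the rejection rate helps show whether compression is pushing the sampler into degenerate loops.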

Core Entities

Models

Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, Llama-3.2-1B-Instruct, GRPO

Metrics

Pass@1, Avg@32, Average reward, Token savings (KV tokens), Rejection rate, Policy KL mismatch

Datasets

SimpleRL-Zoo (hard split), GSM8K, MATH500, Gaokao, Minerva, OlympiadBench, AIME24, AMC23

Benchmarks

GSM8K, MATH500, Gaokao, Minerva, OlympiadBench, AIME24, AMC23