Train LLMs with compressed KV caches: keep most performance while cutting rollout memory by ~35–53%

Overview

Decision Snapshot: Needs Validation

The method is practical for math-style RL tasks and was validated across multiple model sizes and two compression algorithms, but it was tested on verifiable tasks with binary rewards only.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang

Why It Matters For Business

You can cut rollout memory and enable larger RL batch sizes with minimal accuracy loss, lowering GPU cost and enabling RL experiments on smaller clusters.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist, Product Manager

Summary TLDR

Sparse-RL lets you run reinforcement learning for large language models with compressed key-value (KV) caches during rollouts. It adds two corrections—sequence-level rejection and token-level importance reweighting—to fix the policy mismatch caused by compression. In experiments on math reasoning benchmarks (Qwen2.5 and Llama), Sparse-RL cuts KV storage by 35–53% while retaining ~97% of dense performance on large models and sometimes improving small-model results. It also makes the model robust when the same compression is used at inference time.

Problem Statement

Rollout generation in RL for LLMs builds a growing KV cache that quickly exhausts GPU memory. Training-free KV compression reduces memory but introduces a policy mismatch: trajectories are sampled from a compressed (sparse) view of the context, while gradients are computed under the dense policy. This mismatch produces anomalous trajectories and can cause catastrophic training collapse. The paper addresses how to use KV compression safely in RL rollouts.
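
To make the memory pressure concrete, here is a minimal sketch of the standard KV cache size estimate: two tensors (keys and values) per layer, each storing one vector per KV head per cached token. The model shape and the rollout batch size of 64 are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to store keys and values for every cached token."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape (assumption, for illustration only).
config = dict(num_layers=28, num_kv_heads=4, head_dim=128)

for seq_len in (1024, 4096, 16384):
    dense = kv_cache_bytes(seq_len=seq_len, batch_size=64, **config)
    print(f"{seq_len:>6} tokens x 64 rollouts: {dense / 2**30:.1f} GiB")
```

The cache grows linearly with both sequence length and the number of concurrent rollouts, which is why long reasoning traces dominate rollout memory.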

Main Contribution

Identifies the policy mismatch between the sparse sampler (compressed KV) and the dense learner as the reason sparse rollouts crash RL training.

Introduces Sparse-RL, which combines sequence-level Sparsity-Aware Rejection Sampling with token-level Importance-based Reweighting to correct the off-policy bias introduced by compression.
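
The two corrections can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's exact formulation: the rejection rule (drop a sequence if any token's dense-to-sparse probability ratio falls below a floor), the truncation of importance weights, and the names `reject_below` and `clip_max` are all hypothetical choices.

```python
import math


def sparse_rl_weights(dense_logps, sparse_logps,
                      reject_below=1e-4, clip_max=2.0):
    """Return per-token loss weights, or None if the sequence is rejected.

    dense_logps:  log-probs of the sampled tokens under the dense policy.
    sparse_logps: log-probs of the same tokens under the compressed-KV sampler.
    """
    if len(dense_logps) != len(sparse_logps):
        raise ValueError("log-prob lists must have the same length")
    if not dense_logps:
        return []

    # Token-level importance ratio: pi_dense / pi_sparse.
    ratios = [math.exp(d - s) for d, s in zip(dense_logps, sparse_logps)]

    # Sequence-level rejection (assumed rule): a token the dense policy
    # considers near-impossible marks the whole trajectory as anomalous.
    if min(ratios) < reject_below:
        return None

    # Token-level reweighting with truncation to bound gradient variance.
    return [min(r, clip_max) for r in ratios]
```

A caller would skip sequences that return `None` and multiply each token's policy-gradient term by the returned weight.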

Key Findings

Sparse-RL keeps most dense performance while saving KV memory.

Numbers: Qwen2.5-7B retains 96.8% of the dense average score (51.4 vs 53.1)

Practical Use: You can cut rollout KV storage substantially with little accuracy loss on large models; expect a drop of roughly 3% at most on the evaluated math benchmarks.

Evidence Ref: Section 5.2, Table 1

Memory (KV tokens) reduced substantially across sizes.

Numbers: Token savings: 35.1%, 53.3%, 42.0%, 39.4% (Llama-3.2-1B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, respectively)

Practical Use: Use Sparse-RL to lower GPU memory for rollouts and increase batch sizes or sequence length without major retraining changes.

Evidence Ref: Section 5.2, Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
KV token savings	35.1% / 53.3% / 42.0% / 39.4%	Dense GRPO (FullKV)	—	Llama-3.2-1B / Qwen2.5-1.5B / Qwen2.5-3B / Qwen2.5-7B	Table 1; Section 5.2	Table 1
Accuracy	96.8% retained	Dense GRPO avg (Qwen2.5-7B)	51.4 vs 53.1 avg	7B average across 7 benchmarks	Section 5.2	Section 5.2
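
The retention figure in the table follows directly from the two reported averages. A short check of the arithmetic:

```python
dense_avg = 53.1   # Dense GRPO average, Qwen2.5-7B
sparse_avg = 51.4  # Sparse-RL average, Qwen2.5-7B

retained_pct = 100 * sparse_avg / dense_avg
absolute_drop = dense_avg - sparse_avg
relative_drop_pct = 100 - retained_pct

print(f"retained: {retained_pct:.1f}%")
print(f"absolute drop: {absolute_drop:.1f} points")
print(f"relative drop: {relative_drop_pct:.1f}%")
```

The 1.7-point absolute gap corresponds to a relative drop of about 3.2%, which is where the "retains 96.8%" figure comes from.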

What To Try In 7 Days

Run a pilot: train with a KV budget of 512 tokens and monitor rejection rate and clip ratio.

Instrument token-probability ratios between compressed and dense views to implement rejection + importance reweighting.

Measure downstream sparse-inference accuracy to see if sparsity-aware training improves your deployment.
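
For the pilot, the two health metrics named above could be computed per batch along these lines. This is a sketch under assumptions: each rollout is represented as a pair of per-token log-prob lists (dense policy, compressed-KV sampler), and the thresholds `reject_below` and `clip_max` are hypothetical values to tune, not settings from the paper.

```python
import math


def rollout_diagnostics(batch, reject_below=1e-4, clip_max=2.0):
    """Compute rejection rate and clip ratio for a batch of rollouts.

    batch: list of (dense_logps, sparse_logps) pairs, one pair per rollout.
    """
    rejected = 0
    kept_tokens = 0
    clipped_tokens = 0

    for dense_logps, sparse_logps in batch:
        ratios = [math.exp(d - s) for d, s in zip(dense_logps, sparse_logps)]
        if ratios and min(ratios) < reject_below:
            rejected += 1  # whole sequence is dropped
            continue
        kept_tokens += len(ratios)
        clipped_tokens += sum(r > clip_max for r in ratios)

    return {
        # Share of rollouts discarded at the sequence level.
        "rejection_rate": rejected / len(batch) if batch else 0.0,
        # Share of surviving tokens whose weight hit the clip ceiling.
        "clip_ratio": clipped_tokens / kept_tokens if kept_tokens else 0.0,
    }
```

Logging both values every training step makes it easier to see whether a given KV budget is too aggressive: a rising rejection rate means wasted rollouts, and a rising clip ratio means the sparse sampler is drifting away from the dense policy.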

Agent Features

Memory

KV cache compression, short-term context eviction

Planning

chain-of-thought style reasoning

Frameworks

GRPO, slime

Is Agentic

Yes

Architectures

autoregressive LLM

Optimization Features

Token Efficiency

fixed KV token budget (e.g., 512)

Infra Optimization

lower GPU memory per rollout

Model Optimization

sparsity-aware training

System Optimization

reduced memory footprint enables larger rollout batches

Training Optimization

importance sampling reweighting, sequence-level rejection sampling

Inference Optimization

KV cache compression (R-KV, SnapKV)
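
To illustrate what a fixed KV token budget means in practice, here is a generic score-based eviction heuristic. It is not the R-KV or SnapKV algorithm; the function name, the `recent_window` parameter, and the use of a single importance score per cached token are assumptions made for this sketch.

```python
def select_kv_to_keep(scores, budget, recent_window):
    """Pick which cached token positions to keep under a fixed budget.

    scores:        one importance score per cached token (e.g. accumulated
                   attention mass), oldest token first.
    budget:        maximum number of tokens to keep.
    recent_window: number of most recent tokens that are always kept.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))  # nothing to evict yet

    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))

    # Fill the remaining slots with the highest-scoring older tokens.
    older = range(n - recent_window)
    slots = budget - recent_window
    top_older = sorted(older, key=lambda i: scores[i], reverse=True)[:slots]

    return sorted(top_older) + recent


kept = select_kv_to_keep([0.1, 0.9, 0.2, 0.8, 0.3, 0.4],
                         budget=4, recent_window=2)
print(kept)  # [1, 3, 4, 5]
```

Because evicted positions are no longer visible to attention, the sampler's next-token distribution can differ from the dense policy's, which is the mismatch the training corrections target.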

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluated on mathematical reasoning with binary rewards; generalization to open-ended generation or natural rewards is untested.

Strict sequence-level rejection can waste samples; rejection rate may rise under aggressive compression.

When Not To Use

Open-ended creative or instruction-following tasks where anomalous tokens are hard to define.

Extremely small KV budgets (the ablation shows failure at 128 tokens).

Failure Modes

Anomalous sequences (e.g., infinite repetition) caused by compression that produce extreme gradients if not filtered.

High rejection rates under aggressive budgets that waste compute and slow training.
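
The compute cost of rejection can be estimated with a simple model. The sketch below assumes rejected rollouts are regenerated until one is accepted and that rejections are independent with a fixed rate; whether Sparse-RL resamples or simply drops rejected sequences is not stated here, so treat this as a back-of-the-envelope bound rather than a description of the method.

```python
def rollouts_per_accepted(rejection_rate):
    """Expected rollouts generated per accepted rollout (geometric model)."""
    if not 0.0 <= rejection_rate < 1.0:
        raise ValueError("rejection_rate must be in [0, 1)")
    return 1.0 / (1.0 - rejection_rate)


for rate in (0.05, 0.2, 0.5):
    cost = rollouts_per_accepted(rate)
    print(f"rejection rate {rate:.0%}: {cost:.2f} rollouts per accepted sample")
```

Comparing this multiplier against the memory saved by a given KV budget gives a rough sense of when tighter compression stops paying off.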

Core Entities

Models

Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, Llama-3.2-1B-Instruct, GRPO

Metrics

Pass@1, Avg@32, Average reward, Token savings (KV tokens), Rejection rate, Policy KL mismatch

Datasets

SimpleRL-Zoo (hard split), GSM8K, MATH500, Gaokao, Minerva, OlympiadBench, AIME24, AMC23

Benchmarks

GSM8K, MATH500, Gaokao, Minerva, OlympiadBench, AIME24, AMC23
