Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can cut rollout memory and run larger RL batches with minimal accuracy loss, which lowers GPU cost and makes RL experiments feasible on smaller clusters.
Summary TLDR
Sparse-RL lets you run reinforcement learning for large language models with compressed key-value (KV) caches during rollouts. It adds two corrections, sequence-level rejection and token-level importance reweighting, to fix the policy mismatch caused by compression. In experiments on math reasoning benchmarks with Qwen2.5 and Llama models, Sparse-RL cuts KV storage by 35–53% while retaining ~97% of dense performance on large models and sometimes improving small-model results. Training this way also makes the model more robust when the same compression is used at inference time.
Problem Statement
Rollout generation in RL for LLMs builds a growing KV cache that quickly exhausts GPU memory. Training-free KV compression reduces memory but causes a policy mismatch: trajectories are sampled from a compressed (sparse) view of the context, while gradients are computed under the dense policy. The mismatch produces anomalous trajectories and can collapse training. The paper addresses how to use KV compression safely in RL rollouts.
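To make the memory pressure concrete, the sketch below estimates KV cache size with the standard per-token formula (keys plus values, for every layer and KV head). The model dimensions and the batch size of 256 are illustrative assumptions, not values taken from the paper.

```python
def kv_cache_bytes(seq_len, batch_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for batch_size sequences of seq_len tokens."""
    # The leading 2 accounts for storing both a key and a value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical configuration, assumed for illustration only:
# 28 layers, 4 KV heads, head dimension 128, 2-byte (fp16/bf16) elements.
CONFIG = dict(num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_elem=2)

if __name__ == "__main__":
    for seq_len in (512, 2048, 8192):
        gib = kv_cache_bytes(seq_len, batch_size=256, **CONFIG) / 2**30
        print(f"{seq_len:>5} cached tokens x 256 rollouts: {gib:.1f} GiB")
```

The cache grows linearly with both the number of cached tokens and the number of concurrent rollouts, which is why capping the per-sequence KV budget leaves room for larger rollout batches.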
Main Contribution
Identify the policy mismatch between the sparse sampler (compressed KV) and the dense learner as the reason sparse rollouts crash RL training.
Introduce Sparse-RL: combine sequence-level Sparsity-Aware Rejection Sampling and token-level Importance-based Reweighting to correct off-policy bias from compression.
Demonstrate on 4 model sizes and 2 compression algorithms that Sparse-RL preserves performance while cutting KV memory usage under a fixed token budget.
Show that training with Sparse-RL produces models that are more robust under the same sparse inference setting.
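The two corrections can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: it assumes the rejection test drops a rollout when any sampled token has a very small dense-to-sparse probability ratio, and that the token weights are truncated importance ratios. The function names and the `min_ratio` and `max_weight` thresholds are hypothetical.

```python
import math


def token_ratios(dense_logprobs, sparse_logprobs):
    """Per-token ratio pi_dense(token) / pi_sparse(token), from log-probabilities."""
    return [math.exp(d - s) for d, s in zip(dense_logprobs, sparse_logprobs)]


def correct_rollout(dense_logprobs, sparse_logprobs, min_ratio=1e-4, max_weight=2.0):
    """Return per-token importance weights, or None if the whole rollout is rejected.

    min_ratio and max_weight are hypothetical thresholds chosen for illustration.
    """
    ratios = token_ratios(dense_logprobs, sparse_logprobs)
    # Sequence-level rejection (assumed rule): drop the rollout if the dense
    # policy considers any sampled token nearly impossible.
    if any(r < min_ratio for r in ratios):
        return None
    # Token-level reweighting: truncate each ratio so a single token cannot
    # dominate the gradient.
    return [min(r, max_weight) for r in ratios]
```

In a training loop, a `None` result would mean the rollout is skipped, and the returned weights would scale each token's policy-gradient term as constants (no gradient flows through them).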
Key Findings
Sparse-RL keeps most dense performance while saving KV memory.
KV storage falls by 35–53% across model sizes.
Sparse-RL improves sparse inference when training and deployment use the same compression.
Rejection and reweighting keep training stable with low trust-region violation.
A KV budget of ~512 tokens suffices for comparable performance on the evaluated benchmarks.
Results
KV token savings
Accuracy
Small-model improvement
Sparse inference advantage
Training stability stats
Who Should Care
What To Try In 7 Days
Run a pilot: train with a KV budget of 512 tokens and monitor rejection rate and clip ratio.
Instrument token-probability ratios between compressed and dense views to implement rejection + importance reweighting.
Measure downstream sparse-inference accuracy to see if sparsity-aware training improves your deployment.
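For the monitoring steps above, a pilot could log two numbers per batch. The sketch below assumes you already have per-token log-probabilities under both the dense and the compressed view for each rollout. The definitions used here (rejection when any ratio falls below `min_ratio`, clip ratio as the fraction of kept tokens whose ratio leaves `1 ± clip_eps`) and both thresholds are assumptions for illustration; match them to whatever your trainer actually uses.

```python
import math


def rollout_stats(batch, min_ratio=1e-4, clip_eps=0.2):
    """Compute rejection rate and clip ratio for one batch.

    batch: list of (dense_logprobs, sparse_logprobs) pairs, one pair per rollout.
    """
    rejected = 0
    clipped = 0
    kept_tokens = 0
    for dense_lp, sparse_lp in batch:
        ratios = [math.exp(d - s) for d, s in zip(dense_lp, sparse_lp)]
        if any(r < min_ratio for r in ratios):
            rejected += 1  # the whole sequence is dropped
            continue
        kept_tokens += len(ratios)
        # Count tokens whose ratio falls outside the assumed trust region.
        clipped += sum(1 for r in ratios if abs(r - 1.0) > clip_eps)
    return {
        "rejection_rate": rejected / len(batch) if batch else 0.0,
        "clip_ratio": clipped / kept_tokens if kept_tokens else 0.0,
    }
```

A rejection rate that climbs as the KV budget shrinks is a sign that compression is discarding context the dense policy relies on.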
Agent Features
Memory
- KV cache compression
- short-term context eviction
Planning
- chain-of-thought style reasoning
Frameworks
- GRPO
- slime
Is Agentic
true
Architectures
- autoregressive LLM
Optimization Features
Token Efficiency
- fixed KV token budget (e.g., 512)
Infra Optimization
- lower GPU memory per rollout
Model Optimization
- sparsity-aware training
System Optimization
- reduced memory footprint enables larger rollout batches
Training Optimization
- importance sampling reweighting
- sequence-level rejection sampling
Inference Optimization
- KV cache compression (R-KV, SnapKV)
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluated on mathematical reasoning with binary rewards; generalization to open-ended generation or natural rewards is untested.
- Strict sequence-level rejection can waste samples; rejection rate may rise under aggressive compression.
- Performance depends on compression preserving core context; very small budgets (e.g., 128 tokens) degrade accuracy.
When Not To Use
- Open-ended creative or instruction-following tasks where anomalous tokens are hard to define.
- Extremely small KV budgets (the ablation shows failure at 128 tokens).
- When you cannot compute dense token probabilities for rejection/reweighting.
Failure Modes
- Anomalous sequences (e.g., infinite repetition) caused by compression that produce extreme gradients if not filtered.
- High rejection rates under aggressive budgets that waste compute and slow training.
- Residual policy mismatch if compression radically changes token support, leading to biased updates.
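As a cheap diagnostic for the repetition failure mode, you could count rollouts that end in a loop. The check below is a simple heuristic for logging, not the filtering rule used by Sparse-RL (which works on probabilities rather than surface patterns); the function name and both thresholds are hypothetical.

```python
def has_repetition_loop(token_ids, max_period=8, min_repeats=3):
    """Return True if the sequence ends with a short pattern repeated back to back.

    Checks every pattern length from 1 to max_period and requires the pattern
    to occur min_repeats times in a row at the end of the sequence.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(token_ids) < needed:
            break  # longer periods need even more tokens
        tail = token_ids[-needed:]
        # The tail is periodic if every token equals the one at the same
        # offset inside the first copy of the pattern.
        if all(tail[i] == tail[i % period] for i in range(needed)):
            return True
    return False
```

Tracking the share of flagged rollouts alongside the rejection rate can show whether rejected sequences are mostly degenerate loops or something subtler.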
Core Entities
Models
- Qwen2.5-1.5B
- Qwen2.5-3B
- Qwen2.5-7B
- Llama-3.2-1B-Instruct
Metrics
- Pass@1
- Avg@32
- Average reward
- Token savings (KV tokens)
- Rejection rate
- Policy KL mismatch
Datasets
- SimpleRL-Zoo (hard split)
- GSM8K
- MATH500
- Gaokao
- Minerva
- OlympiadBench
- AIME24
- AMC23
Benchmarks
- GSM8K
- MATH500
- Gaokao
- Minerva
- OlympiadBench
- AIME24
- AMC23

