Overview
Mustafar pairs a simple, reproducible pruning rule with a working GPU kernel and open-source code. It is ready for testing in GPU-based inference stacks, but gains depend on batch size, model architecture, and per-task sensitivity; use 50% KV sparsity as a safe starting point.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
Mustafar cuts KV memory use and can double token throughput on batch-friendly workloads, enabling longer contexts or lower cloud costs without model fine-tuning.
Who Should Care
Summary TLDR
Mustafar shows that element-wise (unstructured) pruning, applied per token, compresses both Key and Value caches to high sparsity with little accuracy loss. The paper pairs simple per-token magnitude pruning with a bitmap-based compressed format and a custom sparse attention CUDA kernel that computes directly on the compressed KV cache. On LongBench and RULER, Mustafar preserves accuracy at 50% sparsity and often at 70%, shrinks the KV cache to as little as 45% of its dense size at 70% sparsity, and raises throughput by up to 2.23× in tokens/sec (Llama-3-8B, batch size 8). Code is available.
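To make the pruning rule concrete, here is a minimal NumPy sketch of per-token magnitude pruning as described above: for each token vector, the smallest-magnitude entries are zeroed until the target sparsity is reached. The function name `prune_per_token`, the `(num_tokens, head_dim)` layout, and the rounding of the drop count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prune_per_token(cache: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude entries of every token vector (row).

    cache    : array of shape (num_tokens, head_dim), assumed layout
    sparsity : fraction of entries to drop per token, e.g. 0.5
    """
    num_tokens, head_dim = cache.shape
    num_drop = int(round(head_dim * sparsity))
    pruned = cache.copy()
    if num_drop == 0:
        return pruned
    # argpartition places the num_drop smallest |x| of each row first.
    drop_idx = np.argpartition(np.abs(cache), num_drop - 1, axis=1)[:, :num_drop]
    np.put_along_axis(pruned, drop_idx, 0.0, axis=1)
    return pruned

rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 8)).astype(np.float32)  # toy Key cache
pruned_keys = prune_per_token(keys, 0.5)
# Each token keeps its 4 largest-magnitude entries out of 8.
print((pruned_keys != 0).sum(axis=1))
```

Because the rule is applied independently to each token, the positions of the surviving elements differ from row to row, which is why the result is unstructured and needs a dedicated compressed format.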
Problem Statement
KV cache size is the main memory bottleneck for long-context decoding. We need a pruning and runtime strategy that (1) removes a large fraction of KV elements without breaking task accuracy, and (2) compresses and computes over the resulting arbitrary sparsity efficiently enough that overall latency improves.
Main Contribution
Show that per-token magnitude-based unstructured pruning preserves accuracy better than structured pruning for both Key and Value caches
Introduce a bitmap-based compressed KV format and a custom CUDA sparse attention kernel that runs directly on compressed caches
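The idea behind a bitmap-based format can be sketched on the CPU as follows. Each pruned token vector is stored as a packed bitmap of non-zero positions plus the non-zero values in order, and a dot product can then be taken against the stored values without rebuilding the dense vector. This is only an illustration: the helper names are hypothetical, and the actual CUDA kernel's tile layout and metadata are not described in this summary.

```python
import numpy as np

def compress_token(vec: np.ndarray):
    """Return (bitmap, values) for one pruned token vector."""
    mask = vec != 0
    bitmap = np.packbits(mask)   # 1 bit per element, padded to whole bytes
    values = vec[mask]           # non-zero values in position order
    return bitmap, values

def decompress_token(bitmap: np.ndarray, values: np.ndarray, head_dim: int) -> np.ndarray:
    """Rebuild the dense vector from its bitmap and non-zero values."""
    mask = np.unpackbits(bitmap, count=head_dim).astype(bool)
    out = np.zeros(head_dim, dtype=values.dtype)
    out[mask] = values
    return out

def sparse_dot(query: np.ndarray, bitmap: np.ndarray, values: np.ndarray) -> float:
    """Dot product of a dense query with a compressed token vector."""
    mask = np.unpackbits(bitmap, count=query.shape[0]).astype(bool)
    # Only the query entries at non-zero positions contribute.
    return float(np.dot(query[mask], values))

token = np.array([0.0, 1.5, 0.0, -2.0, 0.0, 0.0, 3.0, 0.0], dtype=np.float32)
bitmap, values = compress_token(token)
query = np.arange(8, dtype=np.float32)
assert np.array_equal(decompress_token(bitmap, values, 8), token)
assert sparse_dot(query, bitmap, values) == float(np.dot(query, token))
```

A GPU kernel would do the equivalent gather per tile in registers or shared memory; the point of the sketch is only that the bitmap is enough to locate every stored value.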
Key Findings
Per-token magnitude-based unstructured pruning retains accuracy far better than structured pruning across LongBench
Mustafar compresses KV cache to as low as 45% of dense at 70% joint Key+Value sparsity
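A back-of-the-envelope estimate helps put the 45% figure in context. The sketch below assumes 16-bit stored values and one bitmap bit per element; both are assumptions for illustration, not details taken from this summary. Under those assumptions the idealized footprint at 70% sparsity is roughly 36% of dense, so the measured 45% would plausibly include per-tile metadata and alignment padding, which the `extra_bits_per_elem` parameter (hypothetical) lets you approximate.

```python
def compressed_fraction(sparsity: float,
                        value_bits: int = 16,
                        bitmap_bits_per_elem: float = 1.0,
                        extra_bits_per_elem: float = 0.0) -> float:
    """Estimated compressed size as a fraction of the dense cache.

    value_bits           : bits per stored value (assumed FP16)
    bitmap_bits_per_elem : bitmap cost per original element (assumed 1 bit)
    extra_bits_per_elem  : amortized offsets/padding (unknown, default 0)
    """
    kept_bits = (1.0 - sparsity) * value_bits
    return (kept_bits + bitmap_bits_per_elem + extra_bits_per_elem) / value_bits

for s in (0.5, 0.7):
    print(f"sparsity {s:.0%}: ~{compressed_fraction(s):.1%} of dense (idealized)")
```

The estimate also shows why the bitmap overhead matters more at higher sparsity: it is a fixed cost per original element, so it becomes a larger share of the compressed size as fewer values are kept.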
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| LongBench average (Llama-3-8B-Instruct) | 42.65 (K0.5 V0.5) | 43.19 (dense model) | −0.54 | LongBench avg | Table 3 (LongBench) | Table 3 |
| KV cache size at 70% joint Key+Value sparsity | 45% of dense | Dense KV cache (100%) | −55% size | Measured memory footprint | Figure 6b / Section 4.3 | Figure 6b |
What To Try In 7 Days
Clone the repo and run the provided kernel on a small model (Llama-2-7B) to reproduce K0.5 V0.5 throughput
Measure accuracy on LongBench, or on one representative task, at 50% KV sparsity; use that as a conservative production setting
Combine Mustafar with your existing 4-bit quant pipeline and validate end-to-end latency and accuracy on a representative workload (start with K0.5 V0.0 then K0.5 V0.5)
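Before running full benchmarks, the sweep order suggested above (K0.5 V0.0, then K0.5 V0.5) can be rehearsed with a quick numerical check of how much pruning perturbs a single attention output. The sketch below is an assumption-laden stand-in: it uses random tensors, a single head, and a hypothetical `prune_per_token` helper rather than the repository's kernel. Random data lacks the structure of real KV caches, so the errors it reports say nothing about task accuracy; substitute tensors captured from your own model.

```python
import numpy as np

def prune_per_token(cache, sparsity):
    """Zero the smallest-magnitude entries of each row (illustrative helper)."""
    num_drop = int(round(cache.shape[1] * sparsity))
    pruned = cache.copy()
    if num_drop > 0:
        idx = np.argpartition(np.abs(cache), num_drop - 1, axis=1)[:, :num_drop]
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned

def attention(q, k, v):
    """Single-head scaled dot-product attention with a stable softmax."""
    scores = q @ k.T / np.sqrt(k.shape[1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def relative_error(q, k, v, k_sparsity, v_sparsity):
    """Relative L2 error of attention output after pruning K and V."""
    dense = attention(q, k, v)
    sparse = attention(q, prune_per_token(k, k_sparsity), prune_per_token(v, v_sparsity))
    return float(np.linalg.norm(sparse - dense) / np.linalg.norm(dense))

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))     # one decode-step query
k = rng.standard_normal((256, 64))   # stand-in Key cache
v = rng.standard_normal((256, 64))   # stand-in Value cache

for k_s, v_s in [(0.5, 0.0), (0.5, 0.5), (0.7, 0.7)]:
    err = relative_error(q, k, v, k_s, v_s)
    print(f"K{k_s} V{v_s}: relative error {err:.3f}")
```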
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Kernel currently does not support low-bit precision compute on compressed tiles
Small batch sizes (batch=1) can be slower due to underutilized GPU
When Not To Use
Latency-sensitive single-request workloads (batch size 1)
Models or tasks where key cache is highly sensitive to element removal at high sparsity
Failure Modes
Large accuracy loss when applying >70% key sparsity on models sensitive to key magnitudes
Throughput drop for small batches due to GPU underutilization

