Use per-token unstructured pruning + a bitmap sparse kernel to cut KV cache to ~45% size and speed decoding up to 2.23×

May 28, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Mustafar pairs a simple, reproducible pruning rule with a working GPU kernel and open-source code. It is ready for testing in GPU-based inference stacks, but gains depend on batch size, model architecture, and per-task sensitivity; use 50% KV sparsity as a safe starting point.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari

Why It Matters For Business

Mustafar cuts KV memory use and can double token throughput on batch-friendly workloads, enabling longer contexts or lower cloud costs without model fine-tuning.

Who Should Care

Summary TLDR

Mustafar shows that element-wise (unstructured) pruning, applied per token, compresses both Key and Value caches to high sparsity with little accuracy loss. The paper pairs simple per-token magnitude pruning with a bitmap-based compressed format and a custom sparse attention CUDA kernel that computes directly on the compressed KV cache. On LongBench and RULER, Mustafar preserves accuracy at 50% sparsity and often at 70%, reduces KV memory to as low as 45% of dense at 70% sparsity, and raises tokens/sec by up to 2.23× (Llama-3-8B, batch size 8). Code is available.
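As a rough illustration of the pruning rule described above, the sketch below zeroes the smallest-magnitude entries of each token's vector independently. It is a minimal CPU example in plain Python, not the authors' implementation; the function names `prune_token` and `prune_cache`, the tie-breaking order, and the absence of any per-head handling are assumptions made for illustration.

```python
def prune_token(vec, sparsity):
    """Zero the smallest-magnitude entries of one token's K or V vector."""
    n_drop = int(len(vec) * sparsity)  # how many elements to remove
    if n_drop == 0:
        return list(vec)
    # Indices sorted by ascending magnitude; the first n_drop get pruned.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else x for i, x in enumerate(vec)]


def prune_cache(cache, sparsity):
    """Apply the rule independently to every token (row) of a cache."""
    return [prune_token(row, sparsity) for row in cache]


# Two tokens with a toy hidden size of 4 (hypothetical values).
keys = [
    [0.9, -0.1, 0.05, -1.2],
    [0.02, 0.7, -0.6, 0.01],
]
pruned = prune_cache(keys, sparsity=0.5)
print(pruned)
```

Because the threshold is chosen per token, the surviving positions differ from row to row, which is what makes the resulting sparsity pattern unstructured.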

Problem Statement

KV cache size is the main memory bottleneck for long-context decoding. We need a pruning and runtime strategy that (1) removes a large fraction of KV elements without breaking task accuracy, and (2) compresses and computes over the resulting arbitrary sparsity efficiently enough that overall latency improves.

Main Contribution

Show that per-token magnitude-based unstructured pruning preserves accuracy better than structured pruning for both Key and Value caches

Introduce a bitmap-based compressed KV format and a custom CUDA sparse attention kernel that runs directly on compressed caches
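To make the bitmap idea concrete, here is a minimal sketch of one way such a format could look: each tile of a pruned vector is stored as an integer bitmap marking the non-zero positions plus a packed list of the surviving values. The 64-element tile width and the names `compress_tile` and `compress_row` are assumptions for illustration; the actual on-GPU layout in the released code may differ.

```python
TILE = 64  # assumed tile width: one bit per element fits a 64-bit mask


def compress_tile(tile):
    """Return (bitmap, values); bit i is set when tile[i] is non-zero."""
    assert len(tile) <= TILE
    bitmap = 0
    values = []
    for i, x in enumerate(tile):
        if x != 0.0:
            bitmap |= 1 << i
            values.append(x)  # non-zeros are packed in position order
    return bitmap, values


def compress_row(row):
    """Split one token vector into tiles and compress each tile."""
    return [compress_tile(row[s:s + TILE]) for s in range(0, len(row), TILE)]


# A pruned 128-element vector with three surviving values (hypothetical).
row = [0.0] * 128
row[3], row[64], row[127] = 1.5, -2.0, 0.25
tiles = compress_row(row)
print(tiles)
```

The appeal of a bitmap over index lists is that the position metadata costs a fixed one bit per element no matter where the non-zeros fall.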

Key Findings

Per-token magnitude-based unstructured pruning retains accuracy far better than structured pruning across LongBench

Numbers: Llama-3-8B LongBench avg: dense 43.19 vs K0.5 V0.5 42.65 (−0.54)

Practical Use: Use per-token magnitude pruning as a conservative default; 50% KV sparsity gives near-dense quality on LongBench.

Evidence Ref: Table 3 (LongBench, Llama-3-8B-Instruct)

Mustafar compresses KV cache to as low as 45% of dense at 70% joint Key+Value sparsity

Numbers: KV cache compression ratio 45% at K0.7 V0.7

Practical Use: Expect roughly 2× memory headroom for long contexts when applying 70% joint sparsity and Mustafar format.

Evidence Ref: Figure 6b / Section 4.3
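A back-of-the-envelope estimate helps place the 45% figure. The sketch below counts only the surviving values plus one bitmap bit per element, assuming 16-bit values (an assumption; the summary does not state the precision). The helper name `ideal_ratio` is hypothetical.

```python
def ideal_ratio(sparsity, value_bits=16, bitmap_bits=1):
    """Compressed size / dense size, counting only values and bitmap bits.

    Assumes 16-bit values by default and ignores any per-tile offsets,
    padding, or tokens that are kept dense.
    """
    kept_bits = (1.0 - sparsity) * value_bits  # average bits for survivors
    return (kept_bits + bitmap_bits) / value_bits


for s in (0.5, 0.7):
    print(f"sparsity {s:.0%}: idealized ratio {ideal_ratio(s):.3f}")
```

Under these assumptions, 70% sparsity comes out at about 0.36 of dense, below the reported 45%. The difference is plausibly explained by overheads this estimate leaves out, such as per-tile metadata, padding, and the dense local window, though the summary does not give a breakdown.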

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LongBench average (Llama-3-8B-Instruct) | Dense 43.19; K0.5 V0.5 42.65 | Dense model | −0.54 | LongBench avg | Table 3 (LongBench) | Table 3
KV cache compression ratio | 45% of dense | Dense KV cache (100%) | −55% size | measured memory footprint | Figure 6b / Section 4.3 | Figure 6b

What To Try In 7 Days

Clone the repo and run the provided kernel on a small model (Llama-2-7B) to reproduce K0.5 V0.5 throughput

Measure accuracy on LongBench, or on a single representative task, at 50% KV sparsity; use that as a conservative production setting

Combine Mustafar with your existing 4-bit quantization pipeline and validate end-to-end latency and accuracy on a representative workload (start with K0.5 V0.0, i.e. 50% Key sparsity with a dense Value cache, then move to K0.5 V0.5)

Optimization Features

Token Efficiency
compatible with token eviction (H2O); local dense window of last 32 tokens
Infra Optimization
optimizes global memory traffic to SMs on NVIDIA GPUs; works best with batch sizes that saturate SMs (batch≥4)
Model Optimization
per-token magnitude-based unstructured pruning; compatibility with quantization (KIVI)
System Optimization
Triton GPU compression kernel; tile-wise shared-memory decompression; warp-thread 1×64 thread-tile layout
Inference Optimization
bitmap-based compressed KV format; custom CUDA sparse attention kernel (SpMV on compressed tiles); load-as-compressed, compute-as-dense pipeline
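The "load-as-compressed, compute-as-dense" idea can be sketched on the CPU as follows: each compressed tile is expanded into a small dense buffer (standing in for GPU shared memory) and then multiplied with the matching slice of the query. This is only a conceptual model in plain Python, not the CUDA kernel; the tuple layout `(bitmap, values)` and the names `decompress_tile` and `score` are assumptions for illustration.

```python
def decompress_tile(bitmap, values, width=64):
    """Expand (bitmap, packed values) into a dense list of length `width`."""
    dense = [0.0] * width
    packed = iter(values)
    for i in range(width):
        if (bitmap >> i) & 1:  # bit i set: next packed value lives here
            dense[i] = next(packed)
    return dense


def score(query, compressed_key, width=64):
    """Dot product q.k for one token whose key is stored as bitmap tiles."""
    total = 0.0
    for t, (bitmap, values) in enumerate(compressed_key):
        dense = decompress_tile(bitmap, values, width)  # stand-in for shared memory
        q_tile = query[t * width:(t + 1) * width]
        total += sum(q * k for q, k in zip(q_tile, dense))  # dense compute
    return total


# One token, head dim 8, tile width 4 to keep the example small (toy values).
compressed_key = [(0b0101, [2.0, -1.0]), (0b1000, [0.5])]
query = [1.0, 9.0, 3.0, 9.0, 9.0, 9.0, 9.0, 4.0]
result = score(query, compressed_key, width=4)
print(result)
```

Only the compressed form crosses the slow memory boundary in this model; the arithmetic itself stays dense, which is why the approach targets memory traffic rather than compute.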

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Kernel currently does not support low-bit precision compute on compressed tiles

Small batch sizes (batch=1) can be slower due to underutilized GPU

When Not To Use

Latency-sensitive single-request workloads (batch size 1)

Models or tasks where key cache is highly sensitive to element removal at high sparsity

Failure Modes

Large accuracy loss when applying >70% key sparsity on models sensitive to key magnitudes

Throughput drop for small batches due to GPU underutilization

Core Entities

Models

Llama-3-8B-Instruct, Llama-2-7B, Mistral-7B-Instruct-v0.2, Llama-2-13B-chat, Llama-3.1-8B-Instruct

Metrics

LongBench average score, tokens/sec (throughput), KV cache compression ratio (percent of dense), kernel latency breakdown (cuBLAS normalized)

Datasets

LongBench, RULER

Benchmarks

LongBench, RULER