Use per-token unstructured pruning + a bitmap sparse kernel to cut KV cache to ~45% size and speed decoding up to 2.23×

Overview

Decision Snapshot: Ready For Pilot

Mustafar pairs a simple, reproducible pruning rule with a working GPU kernel and open-source code. It is ready for testing in GPU-based inference stacks, but gains depend on batch size, model architecture, and per-task sensitivity; use 50% KV sparsity as a safe starting point.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari

Links

Abstract / PDF / Code

Why It Matters For Business

Mustafar cuts KV memory use and can double token throughput on batch-friendly workloads, enabling longer contexts or lower cloud costs without model fine-tuning.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Founder

Summary TLDR

Mustafar shows that element-wise (unstructured) pruning applied per token compresses both Key and Value caches to high sparsity with little accuracy loss. The paper pairs simple per-token magnitude pruning with a bitmap compressed format and a custom sparse attention CUDA kernel that computes directly on compressed KV caches. On tests (LongBench, RULER) Mustafar preserves accuracy at 50% and often at 70% sparsity, reduces KV memory to as low as 45% of dense at 70% sparsity, and raises tokens/sec up to 2.23× (Llama-3-8B, batch 8). Code is available.
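The per-token magnitude rule described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the repository's implementation: the function names `prune_token` and `prune_cache` are hypothetical, and the cache is modeled as a list of per-token vectors.

```python
def prune_token(vec, sparsity):
    """Zero the smallest-magnitude entries of one token's Key or Value vector."""
    n_prune = int(len(vec) * sparsity)  # number of elements to drop for this token
    if n_prune == 0:
        return list(vec)
    # Indices ordered by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else x for i, x in enumerate(vec)]


def prune_cache(cache, sparsity):
    """Apply the rule independently to every token (row) of the cache."""
    return [prune_token(row, sparsity) for row in cache]


# Two tokens with four channels each, pruned to 50% sparsity.
keys = [
    [0.9, -0.1, 0.05, -1.2],
    [0.02, 0.7, -0.6, 0.01],
]
pruned = prune_cache(keys, 0.5)
```

Each token keeps its own largest-magnitude elements, so the surviving positions differ from row to row. That irregular pattern is what makes the sparsity unstructured.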

Problem Statement

KV cache size is the main memory bottleneck for long-context decoding. We need a pruning and runtime strategy that (1) removes a large fraction of KV elements without breaking task accuracy, and (2) compresses and computes over the resulting arbitrary sparsity efficiently enough that overall latency improves.

Main Contribution

Show per-token magnitude-based unstructured pruning preserves accuracy better than structured pruning for both Key and Value caches

Introduce a bitmap-based compressed KV format and a custom CUDA sparse attention kernel that runs directly on compressed caches
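To make the bitmap idea concrete, here is a minimal sketch of one way to store a pruned row as a bitmap plus a packed list of surviving values. The layout (one Python integer per row, bit i marking position i) and the names `compress_row` and `decompress_row` are assumptions for illustration; the actual Mustafar format is tile-based and is defined in the paper and repository.

```python
def compress_row(row):
    """Pack one pruned row into (bitmap, values)."""
    bitmap = 0
    values = []
    for i, x in enumerate(row):
        if x != 0.0:
            bitmap |= 1 << i  # bit i set means position i holds a kept value
            values.append(x)
    return bitmap, values


def decompress_row(bitmap, values, width):
    """Rebuild the dense row from the bitmap and the packed values."""
    it = iter(values)
    return [next(it) if (bitmap >> i) & 1 else 0.0 for i in range(width)]


row = [0.9, 0.0, 0.0, -1.2]
bitmap, values = compress_row(row)
restored = decompress_row(bitmap, values, len(row))
```

Because the bitmap costs one bit per position regardless of where the non-zeros fall, this style of format handles arbitrary sparsity patterns without per-element index storage.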

Key Findings

Per-token magnitude-based unstructured pruning retains accuracy far better than structured pruning across LongBench

Numbers: Llama-3-8B LongBench avg: dense 43.19 vs K0.5 V0.5 42.65 (Δ −0.54)

Practical Use: Use per-token magnitude pruning as a conservative default; 50% KV sparsity gives near-dense quality on LongBench.

Evidence Ref: Table 3 (LongBench, Llama-3-8B-Instruct)

Mustafar compresses KV cache to as low as 45% of dense at 70% joint Key+Value sparsity

Numbers: KV cache compression ratio 45% at K0.7 V0.7

Practical Use: Expect roughly 2× memory headroom for long contexts (1/0.45 ≈ 2.2×) when applying 70% joint sparsity and the Mustafar format.

Evidence Ref: Figure 6b / Section 4.3
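The relationship between sparsity, compressed size, and headroom can be checked with simple arithmetic. The sketch below is an idealized estimate, not the paper's accounting: it assumes 16-bit values and exactly one bitmap bit per element, and the function name `ideal_compressed_fraction` is hypothetical. The reported 45% at 70% sparsity is higher than this idealized figure; the gap is presumably format overhead that this estimate does not model.

```python
def ideal_compressed_fraction(sparsity, value_bits=16, bitmap_bits_per_element=1):
    """Idealized compressed size as a fraction of the dense cache."""
    kept = 1.0 - sparsity  # fraction of values that are still stored
    bitmap_overhead = bitmap_bits_per_element / value_bits  # 1/16 for 16-bit values
    return kept + bitmap_overhead


ideal_at_70 = ideal_compressed_fraction(0.7)  # about 0.36 under these assumptions

# Headroom implied by the reported 45% ratio: how many times more KV entries
# fit in the same memory budget.
reported_ratio = 0.45
headroom = 1.0 / reported_ratio
```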

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LongBench average (Llama-3-8B-Instruct)	Dense 43.19; K0.5 V0.5 42.65	Dense model	−0.54	LongBench avg	Table 3 (LongBench)	Table 3
KV cache compression ratio	45% of dense	Dense KV cache (100%)	−55% size	measured memory footprint	Figure 6b / Section 4.3	Figure 6b

What To Try In 7 Days

Clone the repo and run the provided kernel on a small model (Llama-2-7B) to reproduce K0.5 V0.5 throughput

Measure accuracy on LongBench, or on a single representative task, at 50% KV sparsity; use that as a conservative production setting

Combine Mustafar with your existing 4-bit quant pipeline and validate end-to-end latency and accuracy on a representative workload (start with K0.5 V0.0 then K0.5 V0.5)

Optimization Features

Token Efficiency

compatible with token eviction (H2O)

local dense window of last 32 tokens
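A local dense window can be sketched as follows. This assumes that the most recent `window` tokens are left unpruned and that only older tokens go through the magnitude rule; the exact policy for when tokens leave the window is not specified here, and `prune_with_local_window` is a hypothetical name.

```python
def prune_token(vec, sparsity):
    """Zero the smallest-magnitude entries of one token vector."""
    n_prune = int(len(vec) * sparsity)
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else x for i, x in enumerate(vec)]


def prune_with_local_window(cache, sparsity, window=32):
    """Prune every token except the most recent `window`, which stay dense."""
    cutoff = max(len(cache) - window, 0)  # index of the first token inside the window
    out = []
    for t, row in enumerate(cache):
        if t >= cutoff:
            out.append(list(row))  # recent token: kept dense
        else:
            out.append(prune_token(row, sparsity))  # older token: pruned
    return out


cache = [
    [0.9, -0.1, 0.05, -1.2],
    [0.02, 0.7, -0.6, 0.01],
    [0.3, 0.2, 0.1, 0.4],
]
# A window of 1 is used only to keep the example small; the summary lists 32.
mixed = prune_with_local_window(cache, 0.5, window=1)
```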

Infra Optimization

optimizes global memory traffic to SMs on NVIDIA GPUs

works best with batch sizes that saturate SMs (batch ≥ 4)

Model Optimization

per-token magnitude-based unstructured pruning

compatibility with quantization (KIVI)

System Optimization

Triton GPU compression kernel

tile-wise shared-memory decompression

warp-thread 1×64 thread-tile layout

Inference Optimization

bitmap-based compressed KV format

custom CUDA sparse attention kernel (SpMV on compressed tiles)

load-as-compressed, compute-as-dense pipeline
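The "load-as-compressed, compute-as-dense" idea can be illustrated on the CPU: each Key row is read in compressed form, expanded into a small dense buffer, and then used in an ordinary dot product. This is a conceptual sketch only. The real kernel works on tiles in GPU shared memory, and the per-row `(bitmap, values)` layout and the function names here are assumptions.

```python
def decompress_row(bitmap, values, width):
    """Expand a (bitmap, values) pair into a dense row of the given width."""
    it = iter(values)
    return [next(it) if (bitmap >> i) & 1 else 0.0 for i in range(width)]


def scores_from_compressed(query, compressed_keys):
    """Compute one attention score (q . k) per cached token."""
    width = len(query)
    scores = []
    for bitmap, values in compressed_keys:
        # Load as compressed: only the bitmap and the kept values are read.
        dense = decompress_row(bitmap, values, width)
        # Compute as dense: a plain dot product on the expanded row.
        scores.append(sum(q * k for q, k in zip(query, dense)))
    return scores


query = [1.0, 2.0, 3.0, 4.0]
compressed_keys = [
    (0b1001, [0.5, -1.0]),  # dense row: [0.5, 0.0, 0.0, -1.0]
    (0b0110, [2.0, 1.0]),   # dense row: [0.0, 2.0, 1.0, 0.0]
]
scores = scores_from_compressed(query, compressed_keys)
```

The point of this split is that the memory-bound step moves fewer bytes, while the arithmetic step stays a regular dense operation.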

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/dhjoo98/mustafar

Risks & Boundaries

Limitations

Kernel currently does not support low-bit precision compute on compressed tiles

Small batch sizes (batch=1) can be slower due to underutilized GPU

When Not To Use

Latency-sensitive single-request workloads (batch size 1)

Models or tasks where key cache is highly sensitive to element removal at high sparsity

Failure Modes

Large accuracy loss when applying >70% key sparsity on models sensitive to key magnitudes

Throughput drop for small batches due to GPU underutilization

Core Entities

Models

Llama-3-8B-Instruct, Llama-2-7B, Mistral-7B-Instruct-v0.2, Llama-2-13B-chat, Llama-3.1-8B-Instruct

Metrics

LongBench average score, tokens/sec (throughput), KV cache compression ratio (percent of dense), kernel latency breakdown (cuBLAS normalized)

Datasets

LongBench, RULER

Benchmarks

LongBench, RULER

