GVote: per-request KV-cache compression that auto-selects how much to keep, cutting memory ~2× while keeping accuracy

Overview

Decision Snapshot: Needs Validation

The method shows clear empirical gains across several models and datasets and gives practical hyperparameter defaults. The lack of public code and of a full runtime/latency breakdown reduces immediate production confidence.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.65

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang


Why It Matters For Business

GVote can cut the GPU memory used by KV-caches roughly in half without manual tuning. That frees headroom for larger batch sizes, longer contexts, or lower-cost GPUs, and it reduces the engineering time spent tuning budgets per workload.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

GVote is a per-request KV-cache compression method that samples plausible future queries from a Gaussian fitted to transformer hidden states, lets those samples vote on which keys are needed, and sets the cache budget automatically. On several benchmarks it halves average KV memory while keeping accuracy similar to or better than fixed-budget baselines. Key knobs: nucleus threshold p_nuc (recommended 0.95) and number of samples S (recommended S ≥ 8).

Problem Statement

Existing KV-cache compression methods require a fixed, manually chosen global memory budget. That one-size-fits-all budget either wastes memory on simple requests or causes big accuracy drops on hard requests. We need an adaptive, per-request way to pick how many keys to keep.

Main Contribution

Formulate the fixed-budget limitation and argue it fails for heterogeneous workloads.

Introduce GVote: a Monte Carlo, per-request method that samples synthetic future queries from a Gaussian fit to hidden states and unions their top-k keys to set the cache budget.
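The sample, vote, and union idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single attention head, fits a diagonal Gaussian (the summary does not say which covariance form is used), ignores positional encoding, and lets each sample vote for the smallest set of keys whose attention mass reaches p_nuc. The function name `gvote_keep_mask` and all argument names are hypothetical.

```python
import numpy as np

def gvote_keep_mask(hidden, w_q, keys, p_nuc=0.95, num_samples=8, seed=0):
    """Return a boolean keep-mask over cached keys for one attention head.

    hidden: (T, d_model) hidden states observed during prefill
    w_q:    (d_model, d_head) query projection (hypothetical single head)
    keys:   (T, d_head) cached keys
    """
    rng = np.random.default_rng(seed)
    # 1. Fit a diagonal Gaussian to the observed hidden states (assumption).
    mu = hidden.mean(axis=0)
    sigma = hidden.std(axis=0)
    # 2. Sample S synthetic future hidden states and project them to queries.
    synth = mu + sigma * rng.standard_normal((num_samples, hidden.shape[1]))
    queries = synth @ w_q                                  # (S, d_head)
    # 3. Softmax attention of every synthetic query over all cached keys.
    scores = queries @ keys.T / np.sqrt(keys.shape[1])     # (S, T)
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    # 4. Each sample votes for the smallest key set whose mass reaches p_nuc.
    order = np.argsort(-attn, axis=1)
    sorted_attn = np.take_along_axis(attn, order, axis=1)
    mass_before = np.cumsum(sorted_attn, axis=1) - sorted_attn
    votes_sorted = mass_before < p_nuc
    votes = np.zeros_like(votes_sorted)
    np.put_along_axis(votes, order, votes_sorted, axis=1)
    # 5. Union the votes as a boolean mask; the budget is the mask's popcount.
    return votes.any(axis=0)

# Toy usage with random tensors (shapes are illustrative only).
rng = np.random.default_rng(1)
hidden = rng.standard_normal((64, 16))
w_q = rng.standard_normal((16, 8))
keys = rng.standard_normal((64, 8))
mask = gvote_keep_mask(hidden, w_q, keys)
print(int(mask.sum()), "of", mask.size, "keys kept")
```

Because the union is taken on a boolean mask, the number of kept keys is never chosen by hand: it is whatever the votes add up to for that request.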

Key Findings

GVote reduces KV-cache usage roughly twofold on evaluated benchmarks while keeping accuracy similar or better.

Numbers: 2× memory reduction reported across eight datasets (avg)

Practical Use: Deploy per-request adaptive pruning (GVote) to cut KV memory use by about half instead of hand-tuning a single global budget.

Evidence Ref: Abstract; Section 4.2; Figure 4

Synthetic queries correlate well with actual future queries' attention patterns.

Numbers: Pearson r = 0.7759; mean attention overlap = 0.929

Practical Use: Sampling from the hidden-state Gaussian is a viable proxy for future queries; it supports data-driven budget decisions instead of fixed heuristics.

Evidence Ref: Section 3.3; Figure 3
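To run the same kind of sanity check on your own traces, the two metrics can be computed as below. This is a sketch under assumptions: the summary does not define "attention overlap", so it is taken here to mean the share of a real query's attention mass that lands on the keys a synthetic query would keep. The helper names and the toy numbers are made up for illustration and are not data from the paper.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two equal-length 1-D arrays."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def attention_overlap(real_attn, keep_mask):
    """Share of a real query's attention mass that lands on the kept keys
    (assumed definition; the summary does not spell it out)."""
    return float(np.asarray(real_attn)[np.asarray(keep_mask, dtype=bool)].sum())

# Toy attention rows over five cached keys (made-up numbers, not paper data).
synthetic_attn = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
real_attn = np.array([0.45, 0.25, 0.10, 0.15, 0.05])
keep_mask = synthetic_attn >= 0.10   # keys the synthetic query would keep

r = pearson_r(synthetic_attn, real_attn)
overlap = attention_overlap(real_attn, keep_mask)
print(f"r={r:.3f}, overlap={overlap:.2f}")
```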

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average KV memory usage	≈50% of baseline (2× reduction)	fixed-budget methods (varied 10%–50%)	≈2× reduction	avg over 8 datasets	GVote halves memory on average while keeping accuracy	Abstract; Section 4.2; Figure 4
Accuracy	≈0.35 accuracy at 10% memory usage	other methods require ≥20% usage for lower accuracy	GVote uses ~10% where baselines need ≥20%	Multi-Doc QA (Longbench subset)	Paper cites 0.35 accuracy at 10% usage vs baselines needing double memory	Section 1; Figure 1

What To Try In 7 Days

Implement GVote as a prefill step (vectorised sampling + boolean mask union) in your PyTorch inference pipeline.

Start with p_nuc=0.95 and S=8; measure average KV memory and end-to-end latency on representative requests.

Compare quality vs a tuned fixed-budget baseline at the same average memory to validate accuracy trade-offs.
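For the measurement steps above, two small helpers may be useful: one converts a cache shape into bytes, and one derives the fixed budget that matches GVote's average memory for a fair comparison. This is a sketch; the model dimensions in the example (32 layers, 8 KV heads, head size 128, 2-byte elements) are assumed values chosen to resemble an 8B-class model with grouped-query attention, and both function names are hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a dense KV cache for one sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

def matched_fixed_budget(kept_fractions, lengths):
    """Token-weighted average keep ratio across requests.

    Use this ratio as the global budget of a fixed-budget baseline so that
    both methods are compared at the same average memory.
    """
    kept_tokens = sum(f * n for f, n in zip(kept_fractions, lengths))
    return kept_tokens / sum(lengths)

# Assumed configuration, for illustration only.
full = kv_cache_bytes(num_tokens=32_768, num_layers=32, num_kv_heads=8, head_dim=128)
kept_fraction = 0.5  # "about half", as reported in the summary
print(f"full cache: {full / 2**30:.1f} GiB, "
      f"after compression: {full * kept_fraction / 2**30:.1f} GiB")

# Per-request keep ratios measured on two requests of different length.
print(matched_fixed_budget([0.1, 0.9], [3000, 1000]))
```

Weighting by request length matters: a plain mean of per-request ratios would overstate the memory used when long requests compress better than short ones.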

Optimization Features

Token Efficiency

adaptive key selection

Infra Optimization

compatible with FlashAttention varlen interface

System Optimization

prefill-time vectorised sampling

mask-based union to avoid materializing indices

Inference Optimization

KV Cache Optimization, Context Compression, Token Budgeting

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

GVote adds one-time prefill compute and memory for S synthetic queries; cost rises with S and can be heavy for extreme context lengths.

Non-uniform per-head caches produce irregular shapes and require attention kernels that accept variable lengths (e.g., FlashAttention varlen).
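The irregular shapes can be handled by packing every head's surviving keys into one flat buffer plus an array of cumulative offsets, which is the layout that variable-length attention interfaces generally expect. The sketch below only builds that layout in NumPy and does not call any attention kernel; treating each head as its own variable-length "sequence" is an assumption, since the summary does not say how heads are mapped onto the varlen interface. The name `pack_per_head` is hypothetical.

```python
import numpy as np

def pack_per_head(keys, keep_masks):
    """Pack kept keys of all heads into one buffer with cumulative offsets.

    keys:       (H, T, d) cached keys for H heads
    keep_masks: (H, T) boolean, True where a key survives compression
    Returns (packed, cu_seqlens): packed is (total_kept, d) in head-major
    order, and head h owns rows cu_seqlens[h]:cu_seqlens[h + 1].
    """
    lengths = keep_masks.sum(axis=1)
    cu_seqlens = np.zeros(len(lengths) + 1, dtype=np.int32)
    cu_seqlens[1:] = np.cumsum(lengths)
    packed = keys[keep_masks]  # boolean indexing flattens (H, T) row by row
    return packed, cu_seqlens

# Two heads that keep 3 and 1 of 4 keys respectively.
keys = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
keep_masks = np.array([[True, False, True, True],
                       [False, True, False, False]])
packed, cu_seqlens = pack_per_head(keys, keep_masks)
print(packed.shape, cu_seqlens)
```

Values would be packed with the same masks so that keys and values stay aligned row for row.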

When Not To Use

When you cannot afford any extra prefill computation or intermediate memory (tight latency/real-time settings).

On systems that lack attention kernels supporting variable-length per-head caches.

Failure Modes

Picking S too large inflates memory via the union of many noisy samples and can cause out-of-memory errors.

If hidden-state distribution diverges from Gaussian, synthetic queries may miss critical tokens or include many irrelevant ones.
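The first failure mode is easy to see in a toy simulation: when each sample votes for its top-k keys under noisy scores, the union can only grow as samples are added. The sketch below uses synthetic random scores, not model attention, so the numbers it prints say nothing about GVote's real behaviour; the function name and all parameters are hypothetical.

```python
import numpy as np

def union_fraction_by_samples(sample_counts, num_keys=1024, top_k=64,
                              noise=1.0, seed=0):
    """Fraction of keys kept by the union of top-k votes, per sample count."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(num_keys)          # shared "true" relevance
    noisy = base + noise * rng.standard_normal((max(sample_counts), num_keys))
    # Indices of the top_k highest-scoring keys for every sample.
    top = np.argpartition(-noisy, top_k - 1, axis=1)[:, :top_k]
    fractions = {}
    for s in sample_counts:
        keep = np.zeros(num_keys, dtype=bool)
        keep[top[:s].ravel()] = True              # union of the first s votes
        fractions[s] = float(keep.mean())
    return fractions

fractions = union_fraction_by_samples([1, 8, 32, 128])
for s, frac in fractions.items():
    print(f"S={s:>3}: {frac:.1%} of keys kept")
```

Rerunning with a larger `noise` value makes the union grow faster, which is one way to reason about how noisy samples erode the memory savings.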

Core Entities

Models

Llama3.1-8B-Instruct, Llama3.2-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct

Metrics

Accuracy, average memory usage, attention overlap, Pearson correlation

Datasets

GSM8K, RULER (RULER-4K/RULER-CWE), Longbench, Multi-Doc QA, Single-Doc QA

Benchmarks

GSM8K, RULER, Longbench

