GVote: per-request KV-cache compression that auto-selects how much to keep, cutting memory ~2× while keeping accuracy

September 3, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method shows clear empirical gains across several models and datasets and gives practical hyperparameter defaults. The lack of public code and of a full runtime/latency breakdown reduces immediate production confidence.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.65

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang

Links

Abstract / PDF

Why It Matters For Business

GVote can cut the GPU memory used by KV-caches roughly in half without manual tuning. That frees headroom for larger batch sizes, longer contexts, or lower-cost GPUs, and it reduces the engineering time spent tuning budgets per workload.

Who Should Care

Summary TLDR

GVote is a per-request KV-cache compression method that samples plausible future queries from a Gaussian fitted to transformer hidden states, votes on which keys are needed, and sets the cache budget automatically. On several benchmarks it halves average KV memory while keeping accuracy similar to or better than fixed-budget baselines. Key knobs: nucleus threshold p_nuc (recommended 0.95) and number of samples S (recommended S ≥ 8).

Problem Statement

Existing KV-cache compression methods require a fixed, manually chosen global memory budget. That one-size-fits-all budget either wastes memory on simple requests or causes big accuracy drops on hard requests. We need an adaptive, per-request way to pick how many keys to keep.

Main Contribution

Formulate the fixed-budget limitation and argue it fails for heterogeneous workloads.

Introduce GVote: a Monte‑Carlo, per-request method that samples synthetic future queries from a Gaussian fit to hidden states and unions their top-k keys to set the cache budget.
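The sample-vote-union idea can be sketched for a single attention head as below. This is a minimal illustration, not the authors' implementation: the diagonal Gaussian fit, the per-sample nucleus rule driven by p_nuc, and the names `gvote_mask` and `w_q` are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gvote_mask(hidden, keys, w_q, num_samples=8, p_nuc=0.95, seed=0):
    """Boolean keep-mask over cached keys for one head (illustrative sketch).

    hidden: (n, d_model) prefill hidden states
    keys:   (n, d_head) cached keys
    w_q:    (d_model, d_head) query projection (hypothetical name)
    """
    rng = np.random.default_rng(seed)
    # Assumption: fit an independent (diagonal) Gaussian per hidden dimension.
    mu, sigma = hidden.mean(axis=0), hidden.std(axis=0)
    synth_hidden = mu + sigma * rng.standard_normal((num_samples, hidden.shape[1]))
    synth_q = synth_hidden @ w_q                                  # (S, d_head)
    probs = softmax(synth_q @ keys.T / np.sqrt(keys.shape[1]))    # (S, n)
    # Each synthetic query votes for the smallest set of keys whose
    # attention mass reaches p_nuc (assumed selection rule).
    order = np.argsort(-probs, axis=1)
    sorted_p = np.take_along_axis(probs, order, axis=1)
    mass_before = np.cumsum(sorted_p, axis=1) - sorted_p
    keep_sorted = mass_before < p_nuc
    votes = np.zeros_like(keep_sorted)
    np.put_along_axis(votes, order, keep_sorted, axis=1)
    # Union of votes as a boolean mask; no index lists are materialized.
    return votes.any(axis=0)

# Toy usage on random data; the budget falls out of the mask.
rng = np.random.default_rng(1)
hidden = rng.standard_normal((256, 64))
keys = rng.standard_normal((256, 16))
w_q = rng.standard_normal((64, 16)) / 8.0
mask = gvote_mask(hidden, keys, w_q)
budget = int(mask.sum())
```

The number of kept keys is never passed in: it is simply the size of the union, which is what makes the budget per-request.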

Key Findings

GVote reduces KV-cache usage roughly twofold on evaluated benchmarks while keeping accuracy similar or better.

Numbers: ≈2× memory reduction reported across eight datasets (avg)

Practical Use: Deploy per-request adaptive pruning (GVote) to cut KV memory use by about half instead of hand-tuning a single global budget.

Evidence Ref: Abstract; Section 4.2; Figure 4

Synthetic queries correlate well with actual future queries' attention patterns.

Numbers: Pearson r = 0.7759; mean attention overlap = 0.929

Practical Use: Sampling from the hidden-state Gaussian is a viable proxy for future queries; it supports data-driven budget decisions instead of fixed heuristics.

Evidence Ref: Section 3.3; Figure 3
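To check the proxy claim on your own traces, the two metrics above can be computed roughly as follows. The summary does not define "attention overlap", so the definition used here (the share of a real query's attention mass that lands on the kept keys) is an assumption, and both function names are hypothetical.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two attention-score vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

def attention_overlap(true_probs, keep_mask):
    """Assumed definition: attention mass of a real query on the kept keys."""
    true_probs = np.asarray(true_probs, dtype=float)
    keep_mask = np.asarray(keep_mask, dtype=bool)
    return float(true_probs[keep_mask].sum())

# Toy example: a real query's attention and a mask keeping the first two keys.
true_probs = [0.5, 0.3, 0.15, 0.05]
keep_mask = [True, True, False, False]
overlap = attention_overlap(true_probs, keep_mask)
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Averaging `attention_overlap` over held-out real decode queries gives a number comparable in spirit to the reported mean overlap, subject to the definitional assumption above.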

Results

Metric: Average KV memory usage
Value: ≈50% of baseline (≈2× reduction)
Baseline: fixed-budget methods (varied 10%–50%)
Delta: ≈2× reduction
Split / Dataset: avg over 8 datasets
Evidence: GVote halves memory on average while keeping accuracy
Evidence Ref: Abstract; Section 4.2; Figure 4

Metric: Accuracy
Value: ≈0.35 accuracy at 10% memory usage
Baseline: other methods require ≥20% usage for lower accuracy
Delta: GVote uses ~10% where baselines need ≥20%
Split / Dataset: Multi-Doc QA (Longbench subset)
Evidence: Paper cites 0.35 accuracy at 10% usage vs baselines needing double memory
Evidence Ref: Section 1; Figure 1

What To Try In 7 Days

Implement GVote as a prefill step (vectorised sampling + boolean mask union) in your PyTorch inference pipeline.

Start with p_nuc=0.95 and S=8; measure average KV memory and end-to-end latency on representative requests.

Compare quality vs a tuned fixed-budget baseline at the same average memory to validate accuracy trade-offs.
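For the third step, the fixed-budget baseline needs a keep-ratio that spends the same total memory as the adaptive run. One way to derive it, assuming you have logged how many keys were kept per request (the function name and inputs are hypothetical):

```python
def matched_fixed_ratio(kept_counts, context_lengths):
    """Fixed keep-ratio that uses the same total KV entries as the adaptive run."""
    if len(kept_counts) != len(context_lengths) or not kept_counts:
        raise ValueError("need one kept count per request")
    return sum(kept_counts) / sum(context_lengths)

def per_request_ratios(kept_counts, context_lengths):
    """Keep-ratio of each request, useful for seeing how much budgets vary."""
    return [k / n for k, n in zip(kept_counts, context_lengths)]

# Toy log: an easy request keeps 10% of its keys, a hard one keeps 90%.
kept = [100, 900]
lengths = [1000, 1000]
ratio = matched_fixed_ratio(kept, lengths)
spread = per_request_ratios(kept, lengths)
```

In this toy log the matched fixed ratio is 0.5, yet neither request actually wanted 50%, which is the situation the comparison is meant to expose.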

Optimization Features

Token Efficiency
adaptive key selection
Infra Optimization
compatible with FlashAttention varlen interface
System Optimization
prefill-time vectorised sampling, mask-based union to avoid materializing indices
Inference Optimization
KV Cache Optimization, Context Compression, Token Budgeting

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

GVote adds one-time prefill compute and memory for S synthetic queries; cost rises with S and can be heavy for extreme context lengths.

Non-uniform per-head caches produce irregular shapes and require attention kernels that accept variable lengths (e.g., FlashAttention varlen).
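A back-of-the-envelope estimate helps judge the first limitation before running anything. The sketch below assumes the dominant intermediate is the S × n attention-score matrix per head, held for one layer at a time; the source does not give a cost model, so treat the formula and the function name as assumptions.

```python
def synthetic_score_bytes(num_samples, context_len, num_heads, bytes_per_elem=2):
    """Bytes for one layer's synthetic-query score matrices (assumed cost model)."""
    return num_samples * context_len * num_heads * bytes_per_elem

# Example: 128K-token context, 32 heads, fp16 scores.
for s in (8, 32, 128):
    mib = synthetic_score_bytes(s, 131072, 32) / 2**20
    print(f"S={s:>3}: {mib:.0f} MiB per layer")
```

Under these assumptions S = 8 costs about 64 MiB per layer at 128K tokens, and the cost grows linearly with both S and context length.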

When Not To Use

When you cannot afford any extra prefill computation or intermediate memory (tight latency/real-time settings).

On systems that lack attention kernels supporting variable-length per-head caches.
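To see what "variable-length per-head caches" means in practice, the sketch below packs each head's surviving keys into one flat buffer with cumulative offsets, the layout that variable-length attention kernels commonly consume. Treating each head as its own sequence and the name `pack_heads` are assumptions for illustration; the exact arguments a given kernel expects should be checked against its documentation.

```python
import numpy as np

def pack_heads(keys, masks):
    """Pack kept keys of all heads into one buffer.

    keys:  (H, n, d) cached keys per head
    masks: (H, n) boolean keep-masks, one row per head
    Returns (packed, cu_seqlens, max_len) where head h occupies
    packed[cu_seqlens[h]:cu_seqlens[h + 1]].
    """
    masks = np.asarray(masks, dtype=bool)
    lengths = masks.sum(axis=1)
    cu_seqlens = np.zeros(len(lengths) + 1, dtype=np.int32)
    cu_seqlens[1:] = np.cumsum(lengths)
    packed = keys[masks]  # boolean indexing keeps head-major order
    return packed, cu_seqlens, int(lengths.max())

# Toy usage: head 0 keeps three keys, head 1 keeps one.
keys = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
masks = np.array([[True, False, True, True],
                  [False, True, False, False]])
packed, cu_seqlens, max_len = pack_heads(keys, masks)
```

If your serving stack cannot consume offsets like these, each head would have to be padded to the longest kept length, which gives back part of the memory saving.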

Failure Modes

Picking S too large inflates memory via the union of many noisy samples and can cause out-of-memory errors.

If the hidden-state distribution diverges from a Gaussian, synthetic queries may miss critical tokens or include many irrelevant ones.
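A cheap way to watch for the second failure mode is to measure how far each hidden dimension departs from a Gaussian. The sketch below uses per-dimension skewness and excess kurtosis, both of which are zero for a true Gaussian; this diagnostic is a suggestion of this summary, not something the paper is reported to use, and any alert threshold is yours to choose.

```python
import numpy as np

def gaussianity_report(hidden):
    """Worst-case skewness and excess kurtosis across hidden dimensions."""
    x = np.asarray(hidden, dtype=float)
    z = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)
    skew = (z ** 3).mean(axis=0)
    excess_kurtosis = (z ** 4).mean(axis=0) - 3.0
    return {
        "max_abs_skew": float(np.abs(skew).max()),
        "max_abs_excess_kurtosis": float(np.abs(excess_kurtosis).max()),
    }

# Toy comparison: Gaussian data versus heavily skewed exponential data.
rng = np.random.default_rng(0)
gaussian_like = gaussianity_report(rng.standard_normal((20000, 4)))
skewed = gaussianity_report(rng.exponential(size=(20000, 4)))
```

Large values on real prefill hidden states would suggest validating accuracy more carefully before trusting the automatically chosen budget.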

Core Entities

Models

Llama3.1-8B-Instruct, Llama3.2-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct

Metrics

Accuracy, average memory usage, attention overlap, Pearson correlation

Datasets

GSM8K, RULER (RULER-4K/RULER-CWE), Longbench, Multi-Doc QA, Single-Doc QA

Benchmarks

GSM8K, RULER, Longbench