GVote: per-request KV-cache compression that auto-selects how much to keep, cutting memory ~2× while keeping accuracy

September 3, 20257 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang

Links

Abstract / PDF

Why It Matters For Business

GVote can cut GPU memory used by KV-caches about in half without manual tuning. That frees headroom for larger batch sizes, longer contexts, or lower-cost GPUs and reduces engineering time spent tuning budgets per workload.

Summary TLDR

GVote is a per-request KV-cache compression method that samples plausible future queries from a fitted Gaussian over transformer hidden states, votes on which keys are needed, and sets the cache budget automatically. On several benchmarks it halves average KV memory while keeping accuracy similar or better than fixed-budget baselines. Key knobs: nucleus threshold p_nuc (recommend 0.95) and number of samples S (recommend S ≥ 8).

Problem Statement

Existing KV-cache compression methods require a fixed, manually chosen global memory budget. That one-size-fits-all budget either wastes memory on simple requests or causes big accuracy drops on hard requests. We need an adaptive, per-request way to pick how many keys to keep.

Main Contribution

Formulate the fixed-budget limitation and argue it fails for heterogeneous workloads.

Introduce GVote: a Monte‑Carlo, per-request method that samples synthetic future queries from a Gaussian fit to hidden states and unions their top-k keys to set the cache budget.

Show empirical gains across multiple long- and short-context benchmarks: ~2× average KV memory reduction while maintaining comparable or higher accuracy versus fixed-budget baselines.

Key Findings

GVote reduces KV-cache usage roughly twofold on evaluated benchmarks while keeping accuracy similar or better.

Numbers2× memory reduction reported across eight datasets (avg)

Synthetic queries correlate well with actual future queries' attention patterns.

NumbersPearson r = 0.7759; mean attention overlap = 0.929

Hyperparameter guidance: nucleus threshold and sample count control the accuracy/use trade-off.

NumbersRecommend p_nuc = 0.95; recommend S ≥ 8

Higher sample count S increases accuracy but also increases memory and compute and can cause OOM at extreme context sizes.

NumbersS ≥ 8 recommended; extremely large contexts (>128K) with high S can produce huge intermediate logits

Results

Average KV memory usage

Value≈50% of baseline (2× reduction)

Baselinefixed-budget methods (varied 10%–50%)

Accuracy

Value≈0.35 accuracy at 10% memory usage

Baselineother methods require ≥20% usage for lower accuracy

Attention correlation between synthetic and real queries

ValuePearson r = 0.7759

Attention overlap (top mass)

Valuemean overlap = 0.929

Recommended hyperparameters

Valuep_nuc = 0.95, S ≥ 8

Who Should Care

What To Try In 7 Days

Implement GVote as a prefill step (vectorised sampling + boolean mask union) in your PyTorch inference pipeline.

Start with p_nuc=0.95 and S=8; measure average KV memory and end-to-end latency on representative requests.

Compare quality vs a tuned fixed-budget baseline at the same average memory to validate accuracy trade-offs.

Optimization Features

Token Efficiency

  • adaptive key selection

Infra Optimization

  • compatible with FlashAttention varlen interface

System Optimization

  • prefill-time vectorised sampling
  • mask-based union to avoid materializing indices

Inference Optimization

  • KV Cache Optimization
  • Context Compression
  • Token Budgeting

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • GVote adds one-time prefill compute and memory for S synthetic queries; cost rises with S and can be heavy for extreme context lengths.
  • Non-uniform per-head caches produce irregular shapes and require attention kernels that accept variable lengths (e.g., FlashAttention varlen).
  • Sampling relies on hidden states being Gaussian-like; performance may degrade if that assumption fails under some models or domains.
  • Paper does not publish code or exact run-time/latency profiles; implementation engineering required to integrate into production.

When Not To Use

  • When you cannot afford any extra prefill computation or intermediate memory (tight latency/real-time settings).
  • On systems that lack attention kernels supporting variable-length per-head caches.
  • For extremely long contexts (≫128K) where S must be kept small to avoid OOM.

Failure Modes

  • Picking S too large inflates memory via the union of many noisy samples and can cause out-of-memory errors.
  • If hidden-state distribution diverges from Gaussian, synthetic queries may miss critical tokens or include many irrelevant ones.
  • Distribution shift: a budget computed from sampled futures may underperform on adversarial or out-of-distribution requests.

Core Entities

Models

  • Llama3.1-8B-Instruct
  • Llama3.2-3B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen2.5-14B-Instruct

Metrics

  • Accuracy
  • average memory usage
  • attention overlap
  • Pearson correlation

Datasets

  • GSM8K
  • RULER (RULER-4K/RULER-CWE)
  • Longbench
  • Multi-Doc QA
  • Single-Doc QA

Benchmarks

  • GSM8K
  • RULER
  • Longbench