Overview
Method shows clear empirical gains across several models and datasets and gives practical hyperparameter defaults. Missing public code and full runtime/latency breakdown reduce immediate production confidence.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.65
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
GVote can cut the GPU memory used by KV caches roughly in half without manual tuning. That frees headroom for larger batch sizes, longer contexts, or lower-cost GPUs, and it reduces the engineering time spent tuning budgets per workload.
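To make the memory headroom concrete, the standard KV-cache size formula (keys plus values, per layer, per KV head) can be evaluated for a model configuration. The configuration below is hypothetical and not taken from the paper; the 50% figure is the approximate average retention reported above, and actual retention varies per request.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed for keys plus values (the leading factor of 2) in all layers."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration (assumption for illustration, fp16 cache).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
halved = full // 2  # roughly what ~50% average retention would leave

print(f"full cache: {full / 2**30:.2f} GiB, at ~50% retention: {halved / 2**30:.2f} GiB")
```

The freed bytes scale linearly with batch size and context length, which is why the saving translates directly into larger batches or longer contexts.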
Summary TLDR
GVote is a per-request KV-cache compression method that samples plausible future queries from a Gaussian fitted to transformer hidden states, votes on which keys are needed, and sets the cache budget automatically. On several benchmarks it halves average KV memory while keeping accuracy similar to or better than fixed-budget baselines. Key knobs: nucleus threshold p_nuc (recommended 0.95) and number of samples S (recommended S ≥ 8).
Problem Statement
Existing KV-cache compression methods require a fixed, manually chosen global memory budget. That one-size-fits-all budget either wastes memory on simple requests or causes big accuracy drops on hard requests. We need an adaptive, per-request way to pick how many keys to keep.
Main Contribution
Formulate the fixed-budget limitation and argue that a single global budget fails for heterogeneous workloads.
Introduce GVote: a Monte‑Carlo, per-request method that samples synthetic future queries from a Gaussian fit to hidden states and unions their top-k keys to set the cache budget.
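The voting idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles a single head, fits a diagonal Gaussian, treats the sampled vectors directly as queries (no query projection), and selects keys per sample with the nucleus threshold p_nuc. All function names are hypothetical.

```python
import math
import random


def fit_diag_gaussian(hidden_states):
    """Per-dimension mean and standard deviation of the observed vectors."""
    n, d = len(hidden_states), len(hidden_states[0])
    mean = [sum(h[j] for h in hidden_states) / n for j in range(d)]
    std = [math.sqrt(sum((h[j] - mean[j]) ** 2 for h in hidden_states) / n)
           for j in range(d)]
    return mean, std


def nucleus_keys(query, keys, p_nuc):
    """Smallest set of key indices whose softmax attention mass reaches p_nuc."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    order = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += weights[i] / total
        if mass >= p_nuc:
            break
    return kept


def gvote_mask(hidden_states, keys, p_nuc=0.95, num_samples=8, seed=0):
    """Boolean keep-mask: union of the keys voted for by each synthetic query."""
    rng = random.Random(seed)
    mean, std = fit_diag_gaussian(hidden_states)
    voted = set()
    for _ in range(num_samples):
        query = [rng.gauss(m, s) for m, s in zip(mean, std)]  # synthetic query
        voted |= nucleus_keys(query, keys, p_nuc)
    return [i in voted for i in range(len(keys))]


# Toy data (random, for illustration only).
toy_rng = random.Random(1)
hidden = [[toy_rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
keys = [[toy_rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
mask = gvote_mask(hidden, keys)
print(f"kept {sum(mask)} of {len(mask)} keys")
```

Note that the budget is never passed in: the number of kept keys falls out of the union, which is what makes the method per-request adaptive.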
Key Findings
GVote reduces KV-cache usage roughly twofold on evaluated benchmarks while keeping accuracy similar or better.
Synthetic queries correlate well with actual future queries' attention patterns.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average KV memory usage | ≈50% of baseline (2× reduction) | fixed-budget methods (varied 10%–50%) | ≈2× reduction | avg over 8 datasets | GVote halves memory on average while keeping accuracy | Abstract; Section 4.2; Figure 4 |
| Accuracy | ≈0.35 accuracy at 10% memory usage | other methods require ≥20% usage for lower accuracy | GVote uses ~10% where baselines need ≥20% | Multi-Doc QA (LongBench subset) | Paper cites 0.35 accuracy at 10% usage vs baselines needing double memory | Section 1; Figure 1 |
What To Try In 7 Days
Implement GVote as a prefill step (vectorised sampling + boolean mask union) in your PyTorch inference pipeline.
Start with p_nuc=0.95 and S=8; measure average KV memory and end-to-end latency on representative requests.
Compare quality vs a tuned fixed-budget baseline at the same average memory to validate accuracy trade-offs.
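For the last step, one way to set up a fair comparison is to measure the adaptive method's token-weighted average retention and give the fixed-budget baseline that same ratio on every request. The helper names and the toy masks below are hypothetical; the masks stand in for whatever keep-masks your GVote prefill step produces.

```python
def average_retention(masks):
    """Token-weighted fraction of keys kept across requests (masks are lists of bools)."""
    kept = sum(sum(mask) for mask in masks)
    total = sum(len(mask) for mask in masks)
    return kept / total


def matched_fixed_budgets(masks):
    """Per-request key counts for a fixed-ratio baseline with the same average memory."""
    ratio = average_retention(masks)
    return ratio, [round(ratio * len(mask)) for mask in masks]


# Toy keep-masks for three requests: an easy one, a hard one, a medium one.
adaptive_masks = [
    [True] * 2 + [False] * 8,    # 2 of 10 keys kept
    [True] * 14 + [False] * 6,   # 14 of 20 keys kept
    [True] * 4 + [False] * 6,    # 4 of 10 keys kept
]
ratio, baseline_sizes = matched_fixed_budgets(adaptive_masks)
print(f"average retention: {ratio:.0%}, baseline keeps {baseline_sizes} keys")
```

Both configurations then hold the same total number of keys, so any quality gap reflects how the budget is distributed rather than how large it is.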
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
GVote adds one-time prefill compute and memory for S synthetic queries; cost rises with S and can be heavy for extreme context lengths.
Non-uniform per-head caches produce irregular shapes and require attention kernels that accept variable lengths (e.g., FlashAttention varlen).
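The irregular shapes can be illustrated by packing each head's kept keys into one flat buffer with cumulative offsets, the kind of layout variable-length attention kernels commonly expect. This is a plain-Python sketch of the data layout only; it does not call any real kernel, and the function name is hypothetical. Real kernels have their own tensor, dtype, and device requirements.

```python
from itertools import accumulate


def pack_ragged(per_head_keys, masks):
    """Flatten kept keys head by head and record where each head starts.

    per_head_keys[h][t] is the key vector of head h at position t;
    masks[h][t] is True when that key is kept.
    """
    kept_per_head = [
        [key for key, keep in zip(keys, mask) if keep]
        for keys, mask in zip(per_head_keys, masks)
    ]
    lengths = [len(kept) for kept in kept_per_head]
    # Head h occupies flat[cu_seqlens[h]:cu_seqlens[h + 1]].
    cu_seqlens = [0, *accumulate(lengths)]
    flat = [key for kept in kept_per_head for key in kept]
    return flat, cu_seqlens, max(lengths, default=0)


# Two heads, four positions; each toy "key" is just [head, position].
toy_keys = [[[h, t] for t in range(4)] for h in range(2)]
toy_masks = [[True, False, True, True], [False, False, True, False]]
flat, cu_seqlens, max_len = pack_ragged(toy_keys, toy_masks)
print(cu_seqlens, max_len)
```

Because head 0 keeps three keys and head 1 keeps one, a dense rectangular tensor would need padding; the offset array avoids that at the cost of requiring kernel support.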
When Not To Use
When you cannot afford any extra prefill computation or intermediate memory (tight latency/real-time settings).
On systems that lack attention kernels supporting variable-length per-head caches.
Failure Modes
Picking S too large inflates memory via the union of many noisy samples and can cause out-of-memory errors.
If the hidden-state distribution diverges from a Gaussian, synthetic queries may miss critical tokens or include many irrelevant ones.
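The first failure mode can be approximated with a back-of-the-envelope model. Assuming, purely for illustration, that each sample keeps any given key independently with probability r, the expected fraction of keys in the union of S samples is 1 - (1 - r)^S. Real samples come from the same fitted Gaussian and are correlated, so treat this as a rough guide rather than a prediction.

```python
def expected_union_fraction(r, num_samples):
    """Expected kept fraction when each of num_samples independently keeps a fraction r."""
    return 1.0 - (1.0 - r) ** num_samples


# r = 0.10 is an arbitrary illustrative per-sample retention.
for s in (1, 4, 8, 16, 32):
    print(f"S={s:>2}: union keeps about {expected_union_fraction(0.10, s):.1%} of keys")
```

Under this model the union grows quickly at first and then saturates toward the full cache, which is why increasing S past the point where quality stops improving mainly costs memory.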

