Overview
Results are measured across multiple LVLMs and image/video benchmarks. Gains are strongest on long visual inputs, where attention and KV-cache costs dominate. The method approximates attention scores using a small set of probe tokens; robustness to probe selection is shown, but the τ parameter still requires tuning.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/7
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
ZipVL cuts compute and memory for large vision-language generation. That lowers cloud GPU costs for long images/videos, reduces time-to-first-token for interactive apps, and increases decoding throughput so more requests fit a GPU.
Summary TLDR
ZipVL picks a variable number of "important" visual tokens per layer based on attention-score distributions, runs attention only on those tokens, and evicts the rest from the KV cache. This joint, layer-wise adaptive token sparsification speeds up the prefill (attention) phase and reduces KV memory during decoding while keeping accuracy nearly intact (e.g., a 0.5-point drop on VQAv2 with LLaVA-Next-13B, from 80.9% to 80.4%). Reported gains reach up to 2.3× lower prefill latency and up to 2.8× higher decoding throughput on long visual inputs.
Problem Statement
High-resolution images and long videos create very long token sequences for vision-language models. During prefill, attention cost grows quadratically with sequence length, which makes it slow on long inputs. During decoding, every step has to fetch the full KV cache, which consumes a large amount of memory. Visual tokens are often redundant, so selectively dropping them can reduce compute and memory, provided it is done without hurting accuracy.
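To make the scaling concrete, here is a back-of-the-envelope sketch of KV-cache size and attention-matrix size as the sequence grows. The model dimensions (40 layers, 40 KV heads, head dimension 128, 2 bytes per value) are an assumption for a 13B-class decoder and are not taken from the source; `kv_cache_bytes` and `attention_score_entries` are hypothetical helper names.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def attention_score_entries(num_tokens):
    """Entries in one head's causal score matrix (lower triangle incl. diagonal)."""
    return num_tokens * (num_tokens + 1) // 2


# Assumed 13B-class configuration (illustrative, not taken from the source).
LAYERS, KV_HEADS, HEAD_DIM = 40, 40, 128

for n in (4_096, 131_072):
    gib = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{n:>7} tokens: KV cache = {gib:.1f} GiB, "
          f"score entries per head = {attention_score_entries(n):,}")
```

Under these assumed dimensions, a 128K-token sequence needs 100 GiB of fp16 KV cache, while the number of attention scores per head grows roughly with the square of the length. Both figures shrink in proportion to the tokens that are dropped.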
Main Contribution
Adaptive layer-wise ratio assignment: compute how many tokens to keep per layer from attention-score distributions instead of a fixed fraction.
Unified token-use across phases: use the same selected tokens for sparse attention in prefill and for keeping KV entries in decoding.
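The two contributions above can be sketched in a few lines. This is a minimal illustration, assuming the selection rule is "keep the smallest set of tokens whose cumulative attention score reaches a fraction τ of the total", which the summary does not spell out. The function names `select_important_tokens` and `evict_kv` are hypothetical, and the toy scores are made up.

```python
def select_important_tokens(scores, tau=0.975):
    """Return sorted indices of the smallest token set whose attention mass >= tau.

    scores: non-negative per-token attention scores for one layer.
    """
    total = sum(scores)
    if total <= 0:
        return list(range(len(scores)))  # no signal: keep everything
    # Visit tokens from highest to lowest score and stop once enough mass is kept.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i]
        if mass >= tau * total:
            break
    return sorted(kept)


def evict_kv(keys, values, kept):
    """Keep only the KV entries of the selected tokens (same set as in prefill)."""
    return [keys[i] for i in kept], [values[i] for i in kept]


# Two hypothetical layers: one with peaked attention, one with flat attention.
peaked = [0.70, 0.20, 0.04, 0.03, 0.02, 0.01]
flat = [1 / 6] * 6
for name, layer_scores in (("peaked", peaked), ("flat", flat)):
    kept = select_important_tokens(layer_scores, tau=0.85)
    print(name, "keeps", len(kept), "of", len(layer_scores), "tokens:", kept)
```

With the same threshold, the peaked layer keeps far fewer tokens than the flat one, which is the point of computing the ratio per layer rather than fixing it. Reusing `kept` for both sparse attention and `evict_kv` mirrors the unified token use across prefill and decoding.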
Key Findings
Prefill (attention) latency is reduced by up to 2.3× on long inputs.
Decoding throughput improves by up to 2.8× thanks to a smaller KV cache.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 80.4% (ZipVL, τ=0.975) | 80.9% (Full) | -0.5 points | VQAv2 | Table 1: LLaVA-Next-13B rows | Table 1 |
| Prefill latency reduction | 2.3× faster | FlashAttention (full attention) | n/a | Long visual sequences (128K tokens) | Abstract / Table 4: prefill numbers at long input lengths | Abstract / Table 4 |
What To Try In 7 Days
Measure attention-map sparsity on your LVLM and confirm redundancy in visual tokens.
Implement token selection from attention scores and test a conservative τ (~0.975) on validation tasks.
Combine selected-token sparse attention with FlashAttention; measure TTFT and throughput on long inputs.
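For the measurement step, a small timing harness is enough to compare a dense baseline against a sparsified run. The sketch below assumes your model exposes generation as a Python iterator that yields one token at a time; `fake_generate` is a hypothetical stand-in for that call and only sleeps, so the numbers it prints are not real model timings.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator; return (ttft_seconds, decode_tokens_per_second).

    TTFT covers prefill plus the first decoding step. Throughput is computed
    over the tokens that follow the first one.
    """
    start = clock()
    first = None
    count = 0
    for _ in token_stream:
        count += 1
        if first is None:
            first = clock()  # timestamp of the first generated token
    end = clock()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    throughput = (count - 1) / decode_time if decode_time > 0 and count > 1 else 0.0
    return first - start, throughput


def fake_generate(prefill_s, per_token_s, n_tokens):
    """Hypothetical stand-in for a streaming LVLM generate call."""
    time.sleep(prefill_s)  # simulated prefill
    for t in range(n_tokens):
        time.sleep(per_token_s)  # simulated decoding step
        yield t


ttft, tps = measure_stream(fake_generate(0.05, 0.01, 20))
print(f"TTFT = {ttft * 1000:.0f} ms, decode throughput = {tps:.1f} tokens/s")
```

Run the same harness on identical long inputs with and without token selection, and repeat each measurement several times, since single runs on a shared GPU can be noisy.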
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Only sparsifies attention; MLP layers remain dense and still cost compute.
Requires a good probe-token design; poor probe selection (e.g., random-only) breaks performance.
When Not To Use
Short input sequences where FlashAttention or other semi-structured methods are already faster.
When absolute maximum accuracy is required and any token eviction risk is unacceptable.
Failure Modes
Evicting tokens that later become important for complex reasoning, causing answer errors.
Using only randomly sampled probe tokens leads to catastrophic accuracy loss (shown in ablation).
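The probe-token idea behind the last failure mode can be illustrated as follows: instead of computing the full attention matrix, only a few query positions (the probes) attend to all earlier keys, and their attention weights are accumulated per token. This is a sketch under assumptions, not the method's actual implementation. In particular, mixing the most recent positions with a few randomly sampled earlier ones is an assumed probe design chosen to avoid the random-only case, and `choose_probes` and `approx_token_scores` are hypothetical names.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def choose_probes(seq_len, n_recent, n_random, seed=0):
    """Hybrid probe set (assumed design): last n_recent positions plus random earlier ones."""
    recent = list(range(max(0, seq_len - n_recent), seq_len))
    earlier = list(range(0, seq_len - len(recent)))
    rng = random.Random(seed)
    sampled = rng.sample(earlier, min(n_random, len(earlier)))
    return sorted(sampled + recent)


def approx_token_scores(queries, keys, probes):
    """Accumulate causal attention from probe queries onto every key position."""
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for p in probes:
        # Causal mask: query p can only attend to keys 0..p.
        logits = [sum(q * k for q, k in zip(queries[p], keys[j])) / math.sqrt(dim)
                  for j in range(p + 1)]
        for j, w in enumerate(softmax(logits)):
            scores[j] += w
    return scores


# Toy single-head example with random vectors (illustrative data only).
rng = random.Random(1)
seq_len, dim = 12, 4
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
probes = choose_probes(seq_len, n_recent=3, n_random=2)
scores = approx_token_scores(queries, keys, probes)
print("probes:", probes)
print("scores:", [round(s, 3) for s in scores])
```

Note that a probe placed early in the sequence cannot score any later token because of the causal mask, so a probe set that happens to contain no late positions leaves late tokens unscored. Setting `n_recent=0` in this sketch is a simple way to reproduce that situation on your own data.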

