Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.8
Citation Count: 1
Why It Matters For Business
ZipVL cuts the compute and memory cost of generation with large vision-language models (LVLMs). That lowers cloud GPU costs for high-resolution images and long videos, reduces time-to-first-token (TTFT) for interactive apps, and raises decoding throughput so each GPU can serve more requests.
Summary TLDR
ZipVL picks a variable number of "important" tokens per layer based on that layer's attention-score distribution, runs attention only on those tokens, and evicts the rest from the KV cache. This joint, layer-wise adaptive token sparsification speeds up the prefill (attention) phase and reduces KV memory during decoding while keeping accuracy nearly intact (e.g., −0.5% on VQAv2 with LLaVA-Next-13B). Reported gains: up to 2.3× lower prefill latency and up to 2.8× higher decoding throughput on long visual inputs.
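The selection rule can be illustrated with a minimal sketch. It assumes each token already has a non-negative aggregated attention score and that "important" means the smallest set of tokens whose scores cover a fraction τ of the total; the function name `select_important_tokens` and that exact reading of the rule are assumptions for illustration, not the authors' code.

```python
def select_important_tokens(scores, tau=0.975):
    """Return sorted indices of the smallest token set covering `tau` of total score.

    `scores` is a list of non-negative per-token attention scores for one layer
    (hypothetical input; how the scores are aggregated is not specified here).
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")

    # Visit tokens from highest to lowest score and stop once coverage is reached.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, running = [], 0.0
    for i in order:
        kept.append(i)
        running += scores[i]
        if running >= tau * total:
            break
    return sorted(kept)


if __name__ == "__main__":
    peaked = [50, 30, 10, 5, 5]   # attention concentrated on few tokens
    flat = [1] * 10               # attention spread evenly
    print(select_important_tokens(peaked, tau=0.75))
    print(len(select_important_tokens(flat, tau=0.75)))
```

Because the count of kept tokens falls out of the score distribution, a layer with peaked attention keeps few tokens and a layer with flat attention keeps many, which is the behaviour a fixed ratio cannot express.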
Problem Statement
High-resolution images and long videos create very long token sequences for vision-language models. Attention computation during prefill scales quadratically with sequence length, so it becomes slow on long inputs. During decoding, the full KV cache must be stored and fetched at every step, and its size grows linearly with sequence length. Visual tokens are often redundant, so selectively dropping them can reduce compute and memory, provided it is done without hurting accuracy.
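The scaling argument can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the standard estimates for dense attention (about 4·n²·d FLOPs per head per layer for the score and value products, ignoring causal-mask savings) and for KV cache size (keys plus values for every layer). The model configuration in the example is an illustrative assumption, not a figure from the paper.

```python
def attention_flops(seq_len, num_layers, num_heads, head_dim):
    """Approximate dense attention FLOPs for one prefill pass.

    QK^T and the attention-weighted sum over V each cost about
    2 * seq_len^2 * head_dim FLOPs per head (causal savings ignored).
    """
    return 4 * num_layers * num_heads * head_dim * seq_len ** 2


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one sequence: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Illustrative configuration (assumed): 32 layers, 32 query heads,
    # 8 KV heads, head dimension 128, fp16 cache entries.
    for n in (4_000, 16_000, 64_000):
        flops = attention_flops(n, 32, 32, 128)
        cache = kv_cache_bytes(n, 32, 8, 128)
        print(f"n={n}: attention ~{flops:.2e} FLOPs, KV cache ~{cache / 2**30:.2f} GiB")
```

Doubling the sequence length quadruples the attention estimate and doubles the cache estimate, which is why dropping redundant tokens pays off most on long inputs.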
Main Contribution
Adaptive layer-wise ratio assignment: compute how many tokens to keep per layer from attention-score distributions instead of a fixed fraction.
Unified token-use across phases: use the same selected tokens for sparse attention in prefill and for keeping KV entries in decoding.
Practical integration: implement token-level sparsity on top of a standard fast attention implementation (FlashAttention), using a cheap probe-token approximation of the attention scores so that no custom kernels are needed.
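The probe-token idea can be sketched as follows: instead of materializing the full attention map, only a small set of probe queries (some recent, some randomly sampled) attends to all earlier keys, and the resulting rows are summed into a per-token score. This is a pure-Python, single-head illustration under stated assumptions; the helper names, the causal masking, and the plain sum over probe rows are assumptions, and the paper's exact aggregation may differ.

```python
import math
import random


def pick_probe_indices(seq_len, num_recent=64, num_random=64, seed=0):
    """Choose probe query positions: the most recent tokens plus a random sample."""
    recent_start = max(0, seq_len - num_recent)
    recent = list(range(recent_start, seq_len))
    rng = random.Random(seed)
    # Sample the random probes only from positions not already in the recent set.
    sampled = rng.sample(range(recent_start), min(num_random, recent_start))
    return sorted(sampled) + recent


def approx_token_scores(queries, keys, probe_idx):
    """Sum causal softmax attention rows of the probe queries into per-key scores."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q_pos in probe_idx:
        q = queries[q_pos]
        # Causal attention: a query only sees keys at positions <= its own.
        logits = [scale * sum(a * b for a, b in zip(q, keys[k]))
                  for k in range(q_pos + 1)]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        norm = sum(weights)
        for k, w in enumerate(weights):
            scores[k] += w / norm
    return scores


if __name__ == "__main__":
    rng = random.Random(1)
    seq_len, dim = 40, 4
    q = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    k = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    probes = pick_probe_indices(seq_len, num_recent=4, num_random=4)
    print(probes)
    print(approx_token_scores(q, k, probes)[:5])
```

With p probe tokens the scoring cost is roughly p·n rather than n², and the dense attention over the selected tokens can then run through an unmodified fast attention kernel.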
Key Findings
Prefill (attention) latency reduced by up to 2.3× on long inputs.
Decoding throughput improved by up to 2.8× thanks to a smaller KV cache.
Accuracy is nearly preserved on standard VQA benchmarks (e.g., −0.5% on VQAv2 with LLaVA-Next-13B).
Large savings on attention FLOPs and KV cache size for long video inputs.
Layer-wise adaptive ratio gives better compression and accuracy than fixed ratios for KV cache.
Results
Accuracy
Prefill latency reduction
Decoding throughput
Attention FLOPs reduction (video)
KV cache reduction (video)
KV cache compression (LLaMA3-8B)
Accuracy
Who Should Care
What To Try In 7 Days
Measure attention-map sparsity on your LVLM and confirm redundancy in visual tokens.
Implement token selection from attention scores and test a conservative τ (~0.975) on validation tasks.
Combine selected-token sparse attention with FlashAttention; measure TTFT and throughput on long inputs.
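For the measurement step, a small timing harness is enough to compare a baseline against a sparsified run. The sketch below assumes the model exposes generated tokens as a Python iterator (streaming generation); `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` merely simulates a slow prefill followed by faster decode steps so the harness can be tried without a model.

```python
import time


def measure_stream(token_stream):
    """Return (ttft_seconds, decode_tokens_per_second) for a token iterator.

    TTFT is the delay until the first token arrives; decode throughput is
    computed over the tokens that follow the first one.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    throughput = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, throughput


def fake_stream(prefill_s=0.05, step_s=0.01, num_tokens=6):
    """Stand-in for a model: sleep for 'prefill', then emit tokens with a delay."""
    time.sleep(prefill_s)
    for i in range(num_tokens):
        if i:
            time.sleep(step_s)
        yield i


if __name__ == "__main__":
    ttft, tput = measure_stream(fake_stream())
    print(f"TTFT: {ttft:.3f} s, decode throughput: {tput:.1f} tokens/s")
```

On a GPU, the real stream should be wrapped so that device work is finished before each timestamp is taken; otherwise asynchronous kernel launches can make TTFT look smaller than it is.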
Optimization Features
Token Efficiency
- adaptive token budgeting per layer and task
- probe-token approximate attention (64 recent + 64 random)
Infra Optimization
- no custom GPU kernels required; uses existing fast attention implementation
Model Optimization
- token-level sparsity based on attention scores
- layer-wise adaptive retention ratio
System Optimization
- reduces TTFT on long sequences
- enables larger batch sizes by shrinking KV cache
Inference Optimization
- sparse attention only on selected tokens (works with FlashAttention)
- evict less-important tokens from KV cache during decoding
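The eviction step amounts to keeping, for each layer, only the key/value entries at the positions selected during prefill. A minimal sketch follows, assuming the cache is a list of per-layer `(keys, values)` pairs stored as plain Python lists; `compress_cache` and this layout are hypothetical and chosen for readability, not taken from any particular framework.

```python
def compress_cache(cache, keep_per_layer):
    """Keep only the selected token positions in each layer's KV cache.

    cache: list of (keys, values) per layer, each a list indexed by token position.
    keep_per_layer: list of index lists, one per layer (may differ in length,
    which is what a layer-wise adaptive ratio produces).
    """
    if len(cache) != len(keep_per_layer):
        raise ValueError("need one index list per layer")
    compressed = []
    for (keys, values), keep in zip(cache, keep_per_layer):
        keep = sorted(set(keep))  # preserve original token order, drop duplicates
        compressed.append(([keys[i] for i in keep], [values[i] for i in keep]))
    return compressed


if __name__ == "__main__":
    # Two layers, four cached tokens each; strings stand in for tensors.
    cache = [
        (["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"]),
        (["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"]),
    ]
    small = compress_cache(cache, [[0, 3], [1, 2, 3]])
    for layer, (keys, values) in enumerate(small):
        print(layer, keys, values)
```

Since the same indices drive both the sparse prefill attention and the cache that decoding reads from, no second selection pass is needed once generation starts.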
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Only sparsifies attention; MLP layers remain dense and still cost compute.
- Requires a good probe-token design; poor probe selection (e.g., random-only) breaks performance.
- Performance depends on threshold τ; τ < 0.97 can cause significant accuracy drops.
When Not To Use
- Short input sequences where FlashAttention or other semi-structured methods are already faster.
- When absolute maximum accuracy is required and any token eviction risk is unacceptable.
- When you cannot compute or approximate attention scores cheaply (e.g., constrained runtime).
Failure Modes
- Evicting tokens that later become important for complex reasoning, causing answer errors.
- Using only randomly sampled probe tokens leads to catastrophic accuracy loss (shown in ablation).
- Approximation overhead can negate gains on short sequences or small batch sizes.
Core Entities
Models
- LLaVA
- LLaVA-Next
- Qwen-VL
- LongVA
- LLaMA3-8B
Metrics
- Accuracy
- Attention FLOPs Reduction (%)
- KV Cache Reduction (%)
- KV Compression Ratio (×)
- Prefill TTFT (s)
- Decoding Throughput (tokens/s)
Datasets
- VQAv2
- ChartQA
- TextVQA
- GQA
- MME
- Video-MME
- GSM8K
Benchmarks
- VQAv2
- ChartQA
- TextVQA
- GQA
- MME
- Video-MME
- GSM8K

