Speed up vision-language inference by keeping only the attention-heavy tokens per layer.

October 11, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Results are measured across multiple LVLMs and image/video benchmarks. Gains are strongest on long visual inputs, where attention and the KV cache dominate cost. The method approximates attention with probe tokens; robustness to probe selection is shown, but the threshold τ still requires tuning.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/7

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

Why It Matters For Business

ZipVL cuts compute and memory for large vision-language generation. That lowers cloud GPU costs for long images/videos, reduces time-to-first-token for interactive apps, and increases decoding throughput so more requests fit a GPU.

Summary TLDR

ZipVL picks a variable number of "important" visual tokens per layer based on attention-score distributions, runs attention only on those tokens, and evicts the rest from the KV cache. This joint, layer-wise adaptive token sparsification speeds up the prefill (attention) phase and reduces KV memory during decoding while keeping accuracy nearly intact (e.g., a 0.5-point drop on VQAv2 with LLaVA-Next-13B, from 80.9% to 80.4%). The paper reports up to a 2.3× reduction in prefill latency and a 2.8× improvement in decoding throughput on long visual inputs.

Problem Statement

High-resolution images and long videos produce very long token sequences for vision-language models. Attention cost during prefill grows quadratically with sequence length, and decoding must read the full KV cache at every step, which consumes memory and bandwidth. Visual tokens are often redundant, so dropping them selectively can reduce compute and memory, provided accuracy is preserved.
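To make the scaling concrete, the sketch below estimates KV cache size and attention FLOPs as the sequence grows. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical 7B-class configuration chosen for illustration, not a figure from the paper, and the FLOP estimate ignores causal masking.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for one sequence."""
    # The leading factor 2 covers the separate key and value tensors.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


def attention_flops(seq_len, num_layers, num_heads, head_dim):
    """Rough FLOPs for QK^T plus (scores)V over a full prefill, no causal mask."""
    # Each of the two matmuls costs about 2 * seq_len^2 * head_dim FLOPs per head.
    return 2 * 2 * seq_len * seq_len * head_dim * num_heads * num_layers


# Hypothetical model shape (assumption, not taken from the paper).
NUM_LAYERS, NUM_HEADS, HEAD_DIM = 32, 32, 128

for seq_len in (4_096, 16_384, 131_072):
    cache_gib = kv_cache_bytes(seq_len, NUM_LAYERS, NUM_HEADS, HEAD_DIM) / 2**30
    tflops = attention_flops(seq_len, NUM_LAYERS, NUM_HEADS, HEAD_DIM) / 1e12
    print(f"{seq_len:>7} tokens: KV cache {cache_gib:6.1f} GiB, attention {tflops:10.1f} TFLOPs")
```

Under these assumptions the cache grows linearly with sequence length while attention work grows with its square, which is why long visual inputs are where token pruning pays off most.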

Main Contribution

Adaptive layer-wise ratio assignment: compute how many tokens to keep per layer from attention-score distributions instead of a fixed fraction.

Unified token-use across phases: use the same selected tokens for sparse attention in prefill and for keeping KV entries in decoding.
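One way to read the two contributions above is as a cumulative-mass rule: keep the smallest set of tokens whose attention scores add up to a fraction τ of the layer's total, then reuse that same set for sparse attention and for KV retention. The sketch below assumes that reading; the function name `select_important_tokens` and the synthetic score lists are hypothetical, and the paper's exact scoring and normalization may differ.

```python
def select_important_tokens(scores, tau=0.975):
    """Return sorted indices of the smallest token set covering `tau` of total score."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have positive total mass")
    # Visit tokens from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, running = [], 0.0
    for idx in order:
        kept.append(idx)
        running += scores[idx]
        if running >= tau * total:
            break
    return sorted(kept)


# Synthetic per-layer score profiles (illustrative only, not measured data).
peaked_layer = [50, 30, 10, 5, 3, 2]   # attention concentrated on a few tokens
flat_layer = [1] * 10                  # attention spread evenly

for name, layer_scores in (("peaked", peaked_layer), ("flat", flat_layer)):
    kept = select_important_tokens(layer_scores, tau=0.85)
    # The same `kept` list would drive both sparse prefill attention and KV retention.
    print(name, "keeps", len(kept), "of", len(layer_scores), "tokens:", kept)
```

With the same τ, the peaked layer keeps far fewer tokens than the flat one, so the retention ratio adapts per layer instead of being fixed in advance.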

Key Findings

Prefill (attention) latency is reduced by up to 2.3× on long inputs.

Numbers: 2.3× prefill latency reduction (128K tokens)

Practical Use: Use ZipVL for long visual inputs to cut time-to-first-token and attention cost when sequence length is very large.

Evidence Ref: Abstract / Table 4

Decoding throughput improves by up to 2.8× thanks to a smaller KV cache.

Numbers: 2.8× decoding throughput (16K input length)

Practical Use: Smaller KV caches let you run larger batch sizes and get more tokens per second during generation.

Evidence Ref: Abstract / Table 4
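The link between a smaller KV cache and larger batches can be sketched with simple arithmetic. The helper names, the 40 GiB cache budget, the 512 KiB per-token footprint, and the 25% retention ratio below are all hypothetical values picked for illustration; none of them come from the paper, and real per-layer ratios would vary.

```python
def evict_kv(keys, values, kept_indices):
    """Keep only the cache rows for the selected token positions."""
    kept = sorted(set(kept_indices))
    return [keys[i] for i in kept], [values[i] for i in kept]


def max_batch_size(memory_budget_bytes, bytes_per_token, tokens_per_sequence):
    """How many sequences fit if each one stores `tokens_per_sequence` cache rows."""
    return memory_budget_bytes // (bytes_per_token * tokens_per_sequence)


# Toy cache: string placeholders stand in for key/value tensors.
keys = [f"k{i}" for i in range(8)]
values = [f"v{i}" for i in range(8)]
small_keys, small_values = evict_kv(keys, values, kept_indices=[0, 3, 7])
print(small_keys, small_values)

# Hypothetical budget and footprint (assumptions for illustration).
budget = 40 * 2**30          # 40 GiB reserved for KV cache
per_token = 512 * 1024       # 512 KiB of cache per token across all layers
full = max_batch_size(budget, per_token, tokens_per_sequence=16_384)
pruned = max_batch_size(budget, per_token, tokens_per_sequence=4_096)  # 25% retained
print("batch size with full cache:", full, "| with pruned cache:", pruned)
```

Whether extra batch capacity turns into a proportional throughput gain depends on the serving stack, so this is a capacity estimate only.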

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 80.4% (ZipVL, τ=0.975) | 80.9% (Full) | -0.5% | VQAv2 | Table 1: LLaVA-Next-13B rows | Table 1 |
| Prefill latency reduction | 2.3× faster | FlashAttention Full | | Long visual sequences (128K tokens) | Abstract / Table 4 prefill numbers at long input lengths | Abstract / Table 4 |

What To Try In 7 Days

Measure attention-map sparsity on your LVLM and confirm redundancy in visual tokens.

Implement token selection from attention scores and test a conservative τ (~0.975) on validation tasks.

Combine selected-token sparse attention with FlashAttention; measure TTFT and throughput on long inputs.
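For the measurement step, a small timing harness is enough to compare a baseline against a sparsified run. The sketch below times any lazy token stream; `measure_generation` and `fake_model` are hypothetical names, and the fake model only sleeps to stand in for prefill and decode work.

```python
import time


def measure_generation(stream_tokens):
    """Time a lazy token stream; return (ttft_seconds, decode_tokens_per_second, n_tokens)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream_tokens:
        if first is None:
            first = time.perf_counter()  # first token arrived: prefill is done
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    # Throughput counts only tokens produced after the first one.
    throughput = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, throughput, count


def fake_model(prefill_s, per_token_s, n_tokens):
    """Stand-in generator: sleeps for 'prefill', then yields tokens with a delay."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield i


ttft, tokens_per_s, n = measure_generation(fake_model(0.05, 0.01, 5))
print(f"TTFT {ttft * 1000:.0f} ms, decode {tokens_per_s:.0f} tokens/s over {n} tokens")
```

On a GPU, asynchronous kernels usually need a device synchronization before each timestamp (for example `torch.cuda.synchronize()` in PyTorch), which this sketch omits. Averaging over several runs after a warm-up pass also tends to give steadier numbers.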

Optimization Features

Token Efficiency
adaptive token budgeting per layer and task
probe-token approximate attention (64 recent + 64 random)
Infra Optimization
no custom GPU kernels required; uses existing fast attention implementation
Model Optimization
token-level sparsity based on attention scores
layer-wise adaptive retention ratio
System Optimization
reduces TTFT on long sequences
enables larger batch sizes by shrinking KV cache
Inference Optimization
sparse attention only on selected tokens (works with FlashAttention)
evict less-important tokens from KV cache during decoding
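The probe-token idea listed under Token Efficiency (64 recent + 64 random) can be sketched as follows: compute attention only for a small set of probe queries and accumulate how much each key position receives. This is a pure-Python illustration under stated assumptions; the helper names are hypothetical, the random vectors are synthetic, and summing softmax rows across probes is one plausible aggregation, not necessarily the paper's.

```python
import math
import random


def choose_probe_indices(seq_len, n_recent=64, n_random=64, seed=0):
    """Pick the most recent positions plus a random sample of earlier ones."""
    n_recent = min(n_recent, seq_len)
    recent = list(range(seq_len - n_recent, seq_len))
    earlier = range(seq_len - n_recent)
    rng = random.Random(seed)
    sampled = rng.sample(earlier, min(n_random, len(earlier)))
    return sorted(sampled) + recent


def approx_token_scores(queries, keys, probe_indices):
    """Sum causal softmax attention from each probe query onto every key position."""
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q_idx in probe_indices:
        q = queries[q_idx]
        # Causal mask: a query only sees keys at positions <= its own.
        logits = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(q_idx + 1)]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        norm = sum(exps)
        for j, e in enumerate(exps):
            scores[j] += e / norm
    return scores


# Synthetic single-head data (illustrative only).
rng = random.Random(1)
seq_len, dim = 200, 8
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

probes = choose_probe_indices(seq_len)
scores = approx_token_scores(queries, keys, probes)
print(len(probes), "probe queries instead of", seq_len)
```

Scoring with 128 probe queries costs far less than the full attention map on long inputs, and the resulting per-token scores can feed whatever selection rule is in use.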

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Only sparsifies attention; MLP layers remain dense and still cost compute.

Requires a good probe-token design; poor probe selection (e.g., random-only) breaks performance.

When Not To Use

Short input sequences where FlashAttention or other semi-structured methods are already faster.

When absolute maximum accuracy is required and any token eviction risk is unacceptable.

Failure Modes

Evicting tokens that later become important for complex reasoning, causing answer errors.

Using only randomly sampled probe tokens leads to catastrophic accuracy loss (shown in ablation).

Core Entities

Models

LLaVA, LLaVA-Next, Qwen-VL, LongVA, LLaMA3-8B

Metrics

Accuracy, Attention FLOPs Reduction (%), KV Cache Reduction (%), KV Compression Ratio (×), Prefill TTFT (s), Decoding Throughput (tokens/s)

Datasets

VQAv2, ChartQA, TextVQA, GQA, MME, Video-MME, GSM8k

Benchmarks

VQAv2, ChartQA, TextVQA, GQA, MME, Video-MME, GSM8k