Speed up vision-language inference by keeping only the attention-heavy tokens per layer.

Overview

Decision Snapshot: Ready For Pilot

Results are measured across multiple LVLMs and image/video benchmarks. Gains are strongest on long visual inputs, where attention and KV cache costs dominate. The method approximates attention using probe tokens; it is shown to be robust to probe selection, but the τ parameter still requires tuning.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/7

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

Links

Abstract / PDF

Why It Matters For Business

ZipVL cuts compute and memory for large vision-language generation. That lowers cloud GPU costs for long images/videos, reduces time-to-first-token for interactive apps, and increases decoding throughput so more requests fit a GPU.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

ZipVL picks a variable number of "important" visual tokens per layer based on attention-score distributions, runs attention only on those tokens, and evicts the rest from the KV cache. This joint, layer-wise adaptive token sparsification speeds up the prefill (attention) phase and reduces KV memory during decoding while keeping accuracy nearly intact (e.g., a 0.5-point drop on VQAv2 with LLaVA-Next-13B, from 80.9% to 80.4%). Reported: up to 2.3× prefill latency reduction and 2.8× decoding throughput improvement on long visual inputs.

Problem Statement

High-resolution images and long videos produce very long token sequences for vision-language models. Attention cost during prefill grows quadratically with sequence length, which makes prefill slow. During decoding, the full KV cache must be fetched at every step, which consumes a lot of memory. Visual tokens are often redundant, so selectively dropping tokens can reduce compute and memory, provided it does not hurt accuracy.
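To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It counts only the two matrix multiplications inside attention (the query-key scores and the weighted sum over values) at 2 FLOPs per multiply-add. The function name and the model dimensions are illustrative assumptions, not figures from the paper.

```python
def attention_flops(seq_len: int, hidden_dim: int, num_layers: int) -> int:
    """Approximate FLOPs for the attention matmuls during prefill.

    Counts the score matmul and the value matmul, each roughly
    2 * seq_len^2 * hidden_dim FLOPs per layer. Projections and MLP
    layers are ignored. All dimensions used below are hypothetical.
    """
    per_layer = 2 * (2 * seq_len * seq_len * hidden_dim)
    return num_layers * per_layer


# Hypothetical model: hidden size 4096, 32 layers.
base = attention_flops(seq_len=4_096, hidden_dim=4096, num_layers=32)
doubled = attention_flops(seq_len=8_192, hidden_dim=4096, num_layers=32)
print(doubled / base)  # doubling the sequence length quadruples the cost
```

The same reasoning explains why dropping redundant tokens pays off most on long inputs: the savings scale with the square of the number of tokens removed from attention.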

Main Contribution

Adaptive layer-wise ratio assignment: compute how many tokens to keep per layer from attention-score distributions instead of a fixed fraction.

Unified token-use across phases: use the same selected tokens for sparse attention in prefill and for keeping KV entries in decoding.
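One plausible reading of "from attention-score distributions" is sketched below: for each layer, keep the smallest set of tokens whose scores add up to at least a fraction τ of the layer's total. This selection rule, the function name `select_tokens`, and the example scores are assumptions for illustration; the summary does not spell out the exact criterion.

```python
def select_tokens(scores: list[float], tau: float) -> list[int]:
    """Return indices of the fewest tokens whose scores cover `tau` of the total.

    `scores` holds one non-negative attention score per token, for example
    the attention a token receives summed over heads and queries.
    The cumulative-mass criterion is an assumed interpretation.
    """
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, covered = [], 0.0
    for i in order:
        if covered >= tau * total:
            break
        kept.append(i)
        covered += scores[i]
    return sorted(kept)


# Two hypothetical layers: one with peaked attention, one with flat attention.
peaked = [90, 5, 2, 1, 1, 1]
flat = [10, 10, 10, 10, 10, 10]
print(len(select_tokens(peaked, tau=0.96)))  # a few tokens suffice
print(len(select_tokens(flat, tau=0.96)))    # every token is needed
```

Because the count falls out of each layer's own distribution, layers with concentrated attention keep few tokens and layers with diffuse attention keep many, which is the behavior a fixed fraction cannot provide.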

Key Findings

Prefill (attention) latency reduced up to 2.3× on long inputs.

Numbers: 2.3× prefill latency reduction (128K tokens)

Practical Use: Use ZipVL for long visual inputs to cut time-to-first-token and attention cost when sequence length is very large.

Evidence Ref: Abstract/Table4

Decoding throughput improved up to 2.8× thanks to a smaller KV cache.

Numbers: 2.8× decoding throughput (16K input length)

Practical Use: Smaller KV caches let you run larger batch sizes and get more tokens per second during generation.

Evidence Ref: Abstract/Table4
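The link between KV cache size and batch size can be checked with simple arithmetic. The sketch below assumes fp16 storage and a hypothetical model configuration and memory budget; none of these numbers come from the paper, and `keep_ratio` is an assumed retention fraction.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def max_batch_size(budget_bytes, per_seq_bytes):
    """How many sequences fit in the memory reserved for KV caches."""
    return budget_bytes // per_seq_bytes


# Hypothetical setup: 32 layers, 8 KV heads of dim 128, a 16K-token input,
# and 16 GiB of GPU memory reserved for KV caches.
full = kv_cache_bytes(16_384, num_layers=32, num_kv_heads=8, head_dim=128)
budget = 16 * 1024**3
keep_ratio = 0.5  # assumed fraction of tokens kept after eviction
compressed = int(full * keep_ratio)
print(max_batch_size(budget, full), max_batch_size(budget, compressed))
```

Under these assumptions, halving the cache doubles the number of sequences that fit, which is where the throughput gain during decoding comes from.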

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	80.4% (ZipVL, τ=0.975)	80.9% (Full)	-0.5%	VQAv2	Table1: LLaVA-Next-13B rows	Table1
Prefill latency reduction	2.3× faster	FlashAttention Full	—	Long visual sequences (128K tokens)	Abstract/Table4 prefill numbers at long input lengths	Abstract/Table4

What To Try In 7 Days

Measure attention-map sparsity on your LVLM and confirm redundancy in visual tokens.

Implement token selection from attention scores and test a conservative τ (~0.975) on validation tasks.

Combine selected-token sparse attention with FlashAttention; measure TTFT and throughput on long inputs.
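For the measurement step, a small timing harness is enough to compare a dense baseline against a sparsified run. The sketch below assumes your model exposes some streaming generation API; `stream_fn` and `fake_stream` are hypothetical stand-ins, not part of any specific library.

```python
import time
from typing import Callable, Iterable


def measure_generation(stream_fn: Callable[[], Iterable[object]]) -> dict:
    """Time a streaming generation call.

    `stream_fn` is a hypothetical zero-argument callable that starts
    generation and yields tokens one by one.
    """
    start = time.perf_counter()
    first_at = last_at = None
    count = 0
    for _ in stream_fn():
        last_at = time.perf_counter()
        if first_at is None:
            first_at = last_at
        count += 1
    if count == 0:
        return {"ttft_s": float("nan"), "tokens": 0, "decode_tokens_per_s": float("nan")}
    decode_time = last_at - first_at
    # Throughput covers tokens after the first one, i.e. the decoding phase only.
    tps = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return {"ttft_s": first_at - start, "tokens": count, "decode_tokens_per_s": tps}


def fake_stream():
    """Stand-in for a model: slow 'prefill', then faster 'decoding'."""
    time.sleep(0.05)  # pretend prefill
    for token in range(5):
        yield token
        time.sleep(0.01)  # pretend per-token decode step


result = measure_generation(fake_stream)
print(result)
```

When timing a real GPU model, synchronize the device before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch); otherwise asynchronous kernel launches can make TTFT look smaller than it is.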

Optimization Features

Token Efficiency

adaptive token budgeting per layer and task

probe-token approximate attention (64 recent + 64 random)
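A minimal sketch of the probe-token idea follows: instead of computing the full attention map, only a small set of query positions (the most recent 64 plus 64 sampled at random from earlier positions) attends to all keys, and their attention is summed per key to estimate token importance. How the scores are aggregated, along with the names `probe_indices` and `approx_token_scores`, are assumptions made for illustration.

```python
import math
import random


def probe_indices(seq_len, num_recent=64, num_random=64, seed=0):
    """Pick the last `num_recent` positions plus `num_random` earlier ones."""
    recent_start = max(0, seq_len - num_recent)
    recent = list(range(recent_start, seq_len))
    earlier = list(range(recent_start))
    rng = random.Random(seed)
    sampled = rng.sample(earlier, min(num_random, len(earlier)))
    return sorted(sampled) + recent


def approx_token_scores(queries, keys, probes):
    """Sum causal softmax attention from the probe queries onto every key."""
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q in probes:
        # Causal mask: query q only attends to keys at positions <= q.
        logits = [
            sum(a * b for a, b in zip(queries[q], keys[k])) / math.sqrt(dim)
            for k in range(q + 1)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        norm = sum(weights)
        for k, w in enumerate(weights):
            scores[k] += w / norm
    return scores


# Random toy inputs: 256 tokens with 8-dimensional queries and keys.
rng = random.Random(1)
n, d = 256, 8
q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
probes = probe_indices(n)
scores = approx_token_scores(q, k, probes)
print(len(probes), len(scores))
```

The cost is proportional to the number of probes times the sequence length rather than the sequence length squared, which is what makes the importance estimate cheap enough to run in every layer.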

Infra Optimization

no custom GPU kernels required; uses existing fast attention implementation

Model Optimization

token-level sparsity based on attention scores

layer-wise adaptive retention ratio

System Optimization

reduces TTFT on long sequences

enables larger batch sizes by shrinking KV cache

Inference Optimization

sparse attention only on selected tokens (works with FlashAttention)

evict less-important tokens from KV cache during decoding
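The two inference-side steps can share one index set, as sketched below: causal attention is computed only among the kept positions, and the KV cache is built from those same positions. This is a simplified single-head illustration in plain Python. The function name is hypothetical, and how the method handles outputs for skipped tokens is not described in this summary, so the sketch simply omits them.

```python
import math


def sparse_prefill_and_evict(queries, keys, values, keep):
    """Run causal attention among kept tokens only and return a compact KV cache.

    `keep` is a sorted list of token positions chosen upstream. Tokens outside
    `keep` are skipped in attention and dropped from the cache.
    """
    dim = len(keys[0])
    outputs = {}
    for qi, q_pos in enumerate(keep):
        visible = keep[: qi + 1]  # causal: kept tokens at or before q_pos
        logits = [
            sum(a * b for a, b in zip(queries[q_pos], keys[k])) / math.sqrt(dim)
            for k in visible
        ]
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        norm = sum(weights)
        outputs[q_pos] = [
            sum(w * values[k][j] for w, k in zip(weights, visible)) / norm
            for j in range(len(values[0]))
        ]
    kv_cache = {"keys": [keys[k] for k in keep], "values": [values[k] for k in keep]}
    return outputs, kv_cache


# Toy example: 4 tokens, 2-dimensional heads, token 1 judged unimportant.
toy_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
toy_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
toy_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
outputs, kv_cache = sparse_prefill_and_evict(toy_q, toy_k, toy_v, keep=[0, 2, 3])
print(sorted(outputs), len(kv_cache["keys"]))
```

In a real system the gathered queries, keys, and values would be passed as dense tensors to an existing fast attention kernel, which is why no custom GPU kernel is needed.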

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Only sparsifies attention; MLP layers remain dense and still cost compute.

Requires a good probe-token design; poor probe selection (e.g., random-only) breaks performance.
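The first limitation can be quantified with Amdahl's law: if only the attention share of prefill time gets faster, the dense remainder caps the end-to-end gain. The attention shares and the 4× attention speedup below are hypothetical values chosen for illustration, not measurements.

```python
def end_to_end_speedup(attention_fraction: float, attention_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the attention share gets faster."""
    return 1.0 / ((1.0 - attention_fraction) + attention_fraction / attention_speedup)


# Hypothetical shares of prefill time spent in attention.
for share in (0.3, 0.6, 0.9):
    print(share, round(end_to_end_speedup(share, attention_speedup=4.0), 2))
```

Since attention's share of runtime grows with sequence length, the overall benefit is expected to be larger on long inputs and modest on short ones.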

When Not To Use

Short input sequences where FlashAttention or other semi-structured methods are already faster.

When absolute maximum accuracy is required and any token eviction risk is unacceptable.

Failure Modes

Evicting tokens that later become important for complex reasoning, causing answer errors.

Using only randomly sampled probe tokens leads to catastrophic accuracy loss (shown in ablation).

Core Entities

Models

LLaVA, LLaVA-Next, QWen-VL, LongVA, LLaMA3-8B

Metrics

Accuracy, Attention FLOPs Reduction (%), KV Cache Reduction (%), KV Compression Ratio (×), Prefill TTFT (s), Decoding Throughput (tokens/s)

Datasets

VQAv2, ChartQA, TextVQA, GQA, MME, Video-MME, GSM8k

Benchmarks

VQAv2, ChartQA, TextVQA, GQA, MME, Video-MME, GSM8k

