Speed up vision-language inference by keeping only the attention-heavy tokens per layer.

October 11, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

Links

Abstract / PDF

Why It Matters For Business

ZipVL cuts the compute and memory needed for generation with large vision-language models. That lowers cloud GPU costs for high-resolution images and long videos, reduces time-to-first-token for interactive apps, and raises decoding throughput so more requests fit on a GPU.

Summary TLDR

ZipVL picks a variable number of "important" visual tokens per layer based on attention-score distributions, runs attention only on those tokens, and evicts the rest from the KV cache. This joint, layer-wise adaptive token sparsification speeds up the prefill (attention) phase and reduces KV cache memory during decoding while keeping accuracy nearly intact (e.g., a 0.5-point drop on VQAv2 with LLaVA-Next-13B). Reported gains: up to 2.3× lower prefill latency and up to 2.8× higher decoding throughput on long visual inputs.

Problem Statement

High-resolution images and long videos create very long token sequences for vision-language models. Attention computation during prefill scales quadratically with sequence length, so it becomes slow on long inputs. During decoding, the full KV cache must be stored and fetched at every step, which consumes a lot of memory. Visual tokens are often redundant, so selectively dropping tokens can reduce compute and memory, provided it is done without hurting accuracy.
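To make the memory pressure concrete, here is a rough sizing sketch for a dense KV cache. The layer count, KV-head count, and head dimension are illustrative assumptions for an 8B-class model with grouped-query attention, not figures taken from the paper, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size in bytes of a dense KV cache for one sequence (keys plus values)."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed configuration: 32 layers, 8 KV heads, head_dim 128, fp16 values.
size = kv_cache_bytes(seq_len=128 * 1024, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB for a single 128K-token sequence
```

Under these assumptions the cache grows linearly with sequence length, so evicting a given fraction of tokens frees the same fraction of cache memory.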

Main Contribution

Adaptive layer-wise ratio assignment: compute how many tokens to keep per layer from attention-score distributions instead of a fixed fraction.

Unified token-use across phases: use the same selected tokens for sparse attention in prefill and for keeping KV entries in decoding.

Practical integration: implements token-level sparsity using standard fast attention (FlashAttention) with a cheap probe-token approximation so no custom kernels are needed.
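A minimal sketch of the layer-wise ratio idea, under one plausible reading of the summary: keep the smallest set of tokens whose attention scores add up to at least a fraction τ of the layer's total. The function name `tokens_to_keep` and the toy score vectors are hypothetical, and the paper's exact scoring and normalization may differ.

```python
def tokens_to_keep(scores, tau=0.975):
    """Indices of the smallest token set whose score mass reaches tau * total."""
    total = sum(scores)
    # Visit tokens from most to least attended.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i]
        if mass >= tau * total:
            break
    return sorted(kept)  # restore positional order

# Toy layers: concentrated attention keeps few tokens, flat attention keeps all.
peaked = [0.90, 0.05, 0.02, 0.01, 0.01, 0.01]
flat = [1 / 6] * 6
print(len(tokens_to_keep(peaked)), len(tokens_to_keep(flat)))  # 4 6
```

Because the count falls out of each layer's own score distribution, no per-layer ratio has to be tuned by hand; τ is the single knob.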

Key Findings

Prefill (attention) latency reduced up to 2.3× on long inputs.

Numbers: 2.3× prefill latency reduction (128K tokens)

Decoding throughput improved up to 2.8× thanks to a smaller KV cache.

Numbers: 2.8× decoding throughput (16K input length)

Accuracy is nearly preserved on standard VQA benchmarks.

Numbers: VQAv2 80.9% (Full) → 80.4% (ZipVL, τ=0.975), a drop of 0.5 points

Large savings on attention FLOPs and KV cache size for long video inputs.

Numbers: Attn FLOPs −82.3%, KV cache −57.9% (128-frame video)

Layer-wise adaptive ratio gives better compression and accuracy than fixed ratios for KV cache.

Numbers: KV compression 6.18× vs 4.69×; GSM8k acc 54.06% vs 53.75%
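The findings above mix two units: percentage reductions (−57.9%) and compression ratios (6.18×). The short sketch below only converts between the two so the figures can be read on one scale; the helper names are hypothetical, and the numbers come from different models and inputs, so the converted values should not be compared across rows.

```python
def reduction_to_ratio(reduction_pct):
    """A 57.9% reduction leaves 42.1% of the original, i.e. 1 / 0.421 times smaller."""
    return 1.0 / (1.0 - reduction_pct / 100.0)

def ratio_to_reduction(ratio):
    """A 6.18x compression leaves 1 / 6.18 of the original."""
    return 100.0 * (1.0 - 1.0 / ratio)

print(f"KV cache -57.9%   -> {reduction_to_ratio(57.9):.2f}x smaller")
print(f"Attn FLOPs -82.3% -> {reduction_to_ratio(82.3):.2f}x fewer")
print(f"6.18x compression -> {ratio_to_reduction(6.18):.1f}% reduction")
print(f"4.69x compression -> {ratio_to_reduction(4.69):.1f}% reduction")
```

Read this way, the 128-frame video result corresponds to a KV cache roughly 2.4× smaller, while the 6.18× figure on LLaMA3-8B corresponds to a reduction of roughly 84%.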

Results

Accuracy (VQAv2)

Value: 80.4% (ZipVL, τ=0.975)

Baseline: 80.9% (Full)

Prefill latency reduction

Value: 2.3× faster

Baseline: FlashAttention Full

Decoding throughput

Value: 2.8× higher

Baseline: FlashAttention baseline

Attention FLOPs reduction (video)

Value: 82.3%

Baseline: Full attention

KV cache reduction (video)

Value: 57.9%

Baseline: Full KV cache

KV cache compression (LLaMA3-8B)

Value: 6.18×

Baseline: Fixed-ratio method, 4.69×

Accuracy (GSM8k)

Value: 54.06%

Baseline: Fixed [15], 53.75%

Who Should Care

What To Try In 7 Days

Measure attention-map sparsity on your LVLM and confirm redundancy in visual tokens.

Implement token selection from attention scores and test a conservative τ (~0.975) on validation tasks.

Combine selected-token sparse attention with FlashAttention; measure TTFT and throughput on long inputs.
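For the measurement step, a small timing harness is enough to compare a dense baseline against a sparsified run. This is a sketch under stated assumptions: `measure_generation` is a hypothetical helper, and `fake_stream` is a stand-in for whatever streaming generation call your serving stack exposes.

```python
import time

def measure_generation(stream_fn, prompt):
    """Time a token stream: TTFT in seconds, token count, decode tokens/s."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_fn(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # end of prefill + first step
        n_tokens += 1
    end = time.perf_counter()
    if first_token_at is None:
        return {"ttft_s": None, "tokens": 0, "decode_tok_per_s": None}
    decode_time = end - first_token_at
    # Throughput counts only tokens produced after the first one.
    rate = (n_tokens - 1) / decode_time if n_tokens > 1 and decode_time > 0 else None
    return {"ttft_s": first_token_at - start, "tokens": n_tokens,
            "decode_tok_per_s": rate}

def fake_stream(prompt):
    """Stand-in generator: a slow 'prefill', then quick 'decode' steps."""
    time.sleep(0.05)       # pretend prefill
    for i in range(5):
        yield f"tok{i}"
        time.sleep(0.005)  # pretend per-token decode

stats = measure_generation(fake_stream, "describe the video")
print(stats)
```

Run it on the same long inputs with and without token selection, and repeat each measurement several times. If the model runs on a GPU, make sure each token is only yielded once the device work behind it has finished, otherwise the timings will be optimistic.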

Optimization Features

Token Efficiency

  • adaptive token budgeting per layer and task
  • probe-token approximate attention (64 recent + 64 random)
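The probe-token idea can be sketched as follows: rather than materializing the full attention map, compute attention rows only for a small probe set (the most recent positions plus a random sample of earlier ones) and sum those rows to score every token. The function names, the fixed seed, and the toy random data are assumptions for illustration; the paper's exact aggregation may differ.

```python
import math
import random

def select_probe_indices(seq_len, n_recent=64, n_random=64, seed=0):
    """Pick the last n_recent positions plus n_random sampled earlier positions."""
    n_recent = min(n_recent, seq_len)
    recent = list(range(seq_len - n_recent, seq_len))
    earlier = range(seq_len - n_recent)
    rng = random.Random(seed)
    sampled = rng.sample(earlier, min(n_random, len(earlier)))
    return sorted(sampled) + recent

def probe_importance(queries, keys, probe_idx):
    """Sum causal softmax attention from each probe query onto every key."""
    dim = len(keys[0])
    importance = [0.0] * len(keys)
    for q_pos in probe_idx:
        q = queries[q_pos]
        # Causal mask: a probe at position q_pos only sees keys 0..q_pos.
        logits = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(q_pos + 1)]
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]  # stable softmax
        norm = sum(weights)
        for j, w in enumerate(weights):
            importance[j] += w / norm
    return importance

# Toy data: 256 tokens with 8-dimensional random queries and keys.
rng = random.Random(1)
seq_len, dim = 256, 8
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
probes = select_probe_indices(seq_len)
scores = probe_importance(queries, keys, probes)
print(len(probes), round(sum(scores), 6))
```

Each probe row sums to 1, so the scores total the number of probes. The cost is proportional to the probe count times the sequence length instead of the sequence length squared, which is why the approximation stays cheap on long inputs.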

Infra Optimization

  • no custom GPU kernels required; uses existing fast attention implementation

Model Optimization

  • token-level sparsity based on attention scores
  • layer-wise adaptive retention ratio

System Optimization

  • reduces TTFT on long sequences
  • enables larger batch sizes by shrinking KV cache

Inference Optimization

  • sparse attention only on selected tokens (works with FlashAttention)
  • evict less-important tokens from KV cache during decoding
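Eviction itself amounts to gathering the retained positions out of each layer's key and value tensors. The sketch below uses plain Python lists and a hypothetical per-layer selection (`keep_per_layer`) to show the bookkeeping; a real implementation would index framework tensors instead.

```python
def evict_kv(keys, values, keep_idx):
    """Return new key/value lists holding only the retained token positions."""
    keep = sorted(set(keep_idx))  # deduplicate and preserve token order
    return [keys[i] for i in keep], [values[i] for i in keep]

# Toy per-layer cache: 8 tokens with 2-dimensional keys and values.
cache = {layer: ([[float(t), 0.0] for t in range(8)],
                 [[0.0, float(t)] for t in range(8)])
         for layer in range(2)}

# Hypothetical selections: layers may retain different numbers of tokens.
keep_per_layer = {0: [0, 3, 7], 1: [0, 1, 2, 6, 7]}

compact = {layer: evict_kv(k, v, keep_per_layer[layer])
           for layer, (k, v) in cache.items()}
kept = sum(len(k) for k, _ in compact.values())
total = sum(len(k) for k, _ in cache.values())
print(f"kept {kept}/{total} entries")  # kept 8/16 entries
```

Because the same indices drive both sparse attention in prefill and cache retention in decoding, the selection only has to be computed once per layer.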

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Only sparsifies attention; MLP layers remain dense and still cost compute.
  • Requires a good probe-token design; poor probe selection (e.g., random-only) breaks performance.
  • Performance depends on threshold τ; τ < 0.97 can cause significant accuracy drops.

When Not To Use

  • Short input sequences, where dense FlashAttention or semi-structured sparse methods are already faster.
  • When absolute maximum accuracy is required and any token eviction risk is unacceptable.
  • When you cannot compute or approximate attention scores cheaply (e.g., constrained runtime).

Failure Modes

  • Evicting tokens that later become important for complex reasoning, causing answer errors.
  • Using only randomly sampled probe tokens leads to catastrophic accuracy loss (shown in ablation).
  • Approximation overhead can negate gains on short sequences or small batch sizes.

Core Entities

Models

  • LLaVA
  • LLaVA-Next
  • Qwen-VL
  • LongVA
  • LLaMA3-8B

Metrics

  • Accuracy
  • Attention FLOPs Reduction (%)
  • KV Cache Reduction (%)
  • KV Compression Ratio (×)
  • Prefill TTFT (s)
  • Decoding Throughput (tokens/s)

Datasets

  • VQAv2
  • ChartQA
  • TextVQA
  • GQA
  • MME
  • Video-MME
  • GSM8k

Benchmarks

  • VQAv2
  • ChartQA
  • TextVQA
  • GQA
  • MME
  • Video-MME
  • GSM8k