Training-free token pruning that fixes attention bias to cut Video-LLM FLOPs while keeping accuracy

May 26, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding

Why It Matters For Business

AdaTP cuts inference FLOPs by up to ~73% on evaluated Video LLMs without losing task accuracy, lowering compute cost for production video understanding.

Summary TLDR

Video LLMs generate many visual tokens, and the attention scores commonly used to prune them are biased toward the ends of the sequence and a few fixed spatial locations. AdaTP is a training-free, plug-in pipeline that (1) splits a video into segments by frame similarity, (2) boosts token retention in text-relevant segments using the text encoder, and (3) removes repeated spatial tokens inside each segment. On LLaVA-OneVision-7B, AdaTP cuts compute to ~27% of the original FLOPs with no loss of average score on the evaluated benchmarks.
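Step (1) can be illustrated with a minimal sketch. It assumes each frame is already summarized by one feature vector and that consecutive frames are compared with cosine similarity. The function names, the cosine measure, and the default `tau_s=0.8` are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_segments(frame_feats, tau_s=0.8):
    """Group consecutive frame indices into segments.

    A new segment starts whenever the similarity between a frame and the
    previous frame drops below tau_s (hypothetical default).
    """
    if not frame_feats:
        return []
    segments = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < tau_s:
            segments.append([i])      # scene change: open a new segment
        else:
            segments[-1].append(i)    # same scene: extend current segment
    return segments

# Toy features: frames 0-1 look alike, frames 2-3 look alike.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(split_segments(feats))  # [[0, 1], [2, 3]]
```

A higher `tau_s` produces more, shorter segments; a lower one merges visually different frames into the same segment.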

Problem Statement

Video LLMs are slow because they produce many visual tokens. Attention-based pruning often selects tokens that reflect attention bias rather than content (a global bias toward sequence ends and a local bias toward fixed spatial positions), which leads to poor compression and performance drops when token counts are reduced.

Main Contribution

Identify two attention biases in Video LLMs: global (high-attention tokens cluster at sequence ends) and local (a few fixed spatial positions dominate).

Propose AdaTP: a training-free pipeline combining segment-aware pruning, text-guided global debiasing, and intra-segment spatial deduplication.

Show AdaTP reduces FLOPs dramatically while preserving or slightly improving benchmark scores across multiple Video LLMs and compression settings.

Key Findings

Attention scores in early layers concentrate at sequence ends (global bias).

Numbers: 86.8% of top-10% attention tokens lie in the last 4 of 32 frames (Layer 1)
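A statistic of this kind could be computed as in the sketch below. It assumes you have one attention score per visual token plus the frame index of each token; the function name `tail_share` and the toy data are hypothetical, and the scores here are synthetic rather than taken from a model.

```python
def tail_share(scores, frame_ids, num_frames, tail_frames=4, top_frac=0.10):
    """Fraction of the top `top_frac` highest-scoring tokens whose frame
    index falls within the last `tail_frames` frames."""
    k = max(1, int(len(scores) * top_frac))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    cutoff = num_frames - tail_frames
    return sum(1 for i in top if frame_ids[i] >= cutoff) / k

# Synthetic example: 32 frames x 10 tokens, scores grow with frame index,
# so every top-10% token sits in the final frames.
frame_ids = [f for f in range(32) for _ in range(10)]
biased = [f + 0.01 * p for f in range(32) for p in range(10)]
print(tail_share(biased, frame_ids, num_frames=32))  # 1.0
```

With unbiased attention over 32 frames, the expected share for the last 4 frames would be about 4/32 = 12.5%, which is what makes the reported 86.8% a strong sign of positional bias.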

A few spatial locations receive disproportionate attention (local bias).

Numbers: Top spatial patch gets 5.77× the average attention (Layer 1)
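A peak-to-average ratio like this one can be measured with a few lines. The sketch assumes a nested list `attn[f][p]` holding the attention received by spatial position `p` in frame `f`; the layout and the name `peak_to_mean` are assumptions for illustration.

```python
def peak_to_mean(attn):
    """Return (position with the highest mean attention across frames,
    ratio of that position's mean attention to the all-position average)."""
    num_pos = len(attn[0])
    per_pos = [sum(frame[p] for frame in attn) / len(attn)
               for p in range(num_pos)]
    mean = sum(per_pos) / num_pos
    best = max(range(num_pos), key=lambda p: per_pos[p])
    return best, per_pos[best] / mean

# Toy data: position 2 attracts most attention in every frame.
attn = [[1.0, 1.0, 6.0, 0.0],
        [1.0, 1.0, 6.0, 0.0]]
print(peak_to_mean(attn))  # (2, 3.0)
```

A ratio near 1 means attention is spread evenly over positions; a large ratio at the same position in every frame points to a positional artifact rather than content.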

AdaTP retains task performance while cutting compute.

Numbers: LLaVA-OneVision-7B: AdaTP at 27.3% FLOPs → avg score 59.51 vs vanilla at 100% FLOPs → 59.34 (evaluated benchmarks)

AdaTP outperforms other training-free pruning baselines across models and compression rates.

Numbers: Across models and compression rates, AdaTP yields higher average scores in Table 1 (multiple rows)

Results

Average score (VideoMME/MLVU/LongVideoBench combined)

Value: 59.51 (AdaTP at 27.30% FLOPs)

Baseline: 59.34 (vanilla at 100% FLOPs)

Average score

Value: 46.53 (AdaTP at 26.43% FLOPs)

Baseline: 45.94 (vanilla at 100% FLOPs)

Average score

Value: 59.62 (AdaTP at 36.63% FLOPs)

Baseline: 59.34 (vanilla at 100% FLOPs)
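To read these rows as savings, the arithmetic is simply 100 minus the remaining FLOPs percentage, and the score change is AdaTP minus vanilla. The snippet below only restates the three rows above; `summarize` is a helper name introduced here.

```python
# (AdaTP FLOPs %, AdaTP avg score, vanilla avg score), copied from the rows above
results = [
    (27.30, 59.51, 59.34),
    (26.43, 46.53, 45.94),
    (36.63, 59.62, 59.34),
]

def summarize(flops_pct, score, baseline):
    """Return (FLOPs saved in %, score change in points), rounded to 2 dp."""
    return round(100.0 - flops_pct, 2), round(score - baseline, 2)

for row in results:
    saved, delta = summarize(*row)
    print(f"{row[0]:.2f}% FLOPs -> {saved:.2f}% saved, score {delta:+.2f}")
```

The largest saving among these rows is 73.57%, which is where the "up to ~73%" figure comes from.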

Who Should Care

What To Try In 7 Days

Profile a Video LLM on representative workloads with torch.profiler to measure current FLOPs.

Plug in AdaTP (training-free) on a dev instance for one model (e.g., LLaVA-OneVision-7B) and compare accuracy and FLOPs on a sample of your videos.

Tune segment similarity (τs) and text threshold (τt) on a small validation set; keep a copy of original answers to detect regressions.
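One way to organize the tuning step is a small grid search that keeps only settings whose accuracy stays close to the unpruned baseline, plus a diff against the saved original answers. This is a sketch under assumptions: `evaluate` is a hypothetical callback you would write to run your model with given thresholds and return `(accuracy, flops_pct)`, and the numbers in the toy table are made up.

```python
from itertools import product

def tune(evaluate, tau_s_grid, tau_t_grid, baseline_acc, max_drop=0.5):
    """Return the cheapest (flops_pct, acc, tau_s, tau_t) whose accuracy is
    within `max_drop` points of the baseline, or None if none qualifies."""
    ok = []
    for tau_s, tau_t in product(tau_s_grid, tau_t_grid):
        acc, flops_pct = evaluate(tau_s, tau_t)
        if acc >= baseline_acc - max_drop:
            ok.append((flops_pct, acc, tau_s, tau_t))
    return min(ok) if ok else None

def changed_answers(original, pruned):
    """IDs of samples whose answer differs from the saved unpruned run."""
    return sorted(k for k in original if pruned.get(k) != original[k])

# Toy stand-in for a real evaluation run (values are invented).
table = {
    (0.7, 0.2): (59.0, 25.0), (0.7, 0.4): (57.0, 20.0),
    (0.9, 0.2): (59.4, 35.0), (0.9, 0.4): (59.1, 30.0),
}
best = tune(lambda s, t: table[(s, t)], [0.7, 0.9], [0.2, 0.4], baseline_acc=59.3)
print(best)  # (25.0, 59.0, 0.7, 0.2)
print(changed_answers({"q1": "A", "q2": "B"}, {"q1": "A", "q2": "C"}))  # ['q2']
```

Reviewing the changed answers by hand is often more informative than the aggregate score, since a flat average can hide regressions on specific video types.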

Agent Features

Architectures

  • vision-language

Optimization Features

Token Efficiency

  • adaptive per-segment token budgets
  • deduplicate same spatial patch across frames
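The deduplication idea can be sketched as follows: within one segment, a token is dropped when it is nearly identical to the last kept token at the same spatial position. The cosine criterion, the `sim_thresh=0.9` default, and the function names are assumptions for illustration; the summary does not specify the authors' exact rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dedup_segment(tokens, sim_thresh=0.9):
    """tokens[f][p] is the feature vector at frame f, spatial position p,
    for the frames of ONE segment. Returns the (f, p) pairs to keep."""
    keep = []
    last_kept = {}  # spatial position -> feature vector of last kept token
    for f, frame in enumerate(tokens):
        for p, vec in enumerate(frame):
            ref = last_kept.get(p)
            if ref is None or cosine(ref, vec) < sim_thresh:
                keep.append((f, p))
                last_kept[p] = vec
    return keep

# Two frames, two positions: position 0 is static, position 1 changes.
segment = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
print(dedup_segment(segment))  # [(0, 0), (0, 1), (1, 1)]
```

Restricting the comparison to a single segment keeps a patch from being discarded just because it resembles a patch from an unrelated scene.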

Infra Optimization

  • reduces attention compute; FLOPs reductions measured with torch.profiler

Inference Optimization

  • progressive layer-by-layer token pruning
  • segment-aware retention allocation
  • text-guided segment prioritization
  • intra-segment spatial deduplication
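Segment-aware allocation with text-guided prioritization can be pictured as splitting a global token budget across segments, with extra weight for segments judged relevant to the text. The sketch below is one plausible reading, not the paper's formula: the proportional weighting, the `boost` multiplier, and the defaults are assumptions, and no mapping to the paper's own hyperparameters is claimed.

```python
def allocate_budgets(seg_tokens, relevance, total_budget, tau_t=0.5, boost=2.0):
    """Split `total_budget` kept tokens across segments.

    seg_tokens[i]: number of visual tokens in segment i.
    relevance[i]:  text-relevance score of segment i (hypothetical scale 0..1).
    Segments with relevance >= tau_t have their weight multiplied by `boost`.
    Each budget is floored and capped at the segment's own token count.
    """
    weights = [n * (boost if r >= tau_t else 1.0)
               for n, r in zip(seg_tokens, relevance)]
    total_w = sum(weights)
    if total_w == 0:
        return [0] * len(seg_tokens)
    return [min(n, int(total_budget * w / total_w))
            for n, w in zip(seg_tokens, weights)]

# Three segments; only the first is relevant to the question.
print(allocate_budgets([100, 100, 200], [0.9, 0.1, 0.2], total_budget=100))
# [40, 20, 40]  (a purely length-proportional split would be [25, 25, 50])
```

Because of the flooring and capping, the budgets can sum to slightly less than `total_budget`; a fuller implementation would redistribute the remainder.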

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not validated on very large Video LLMs due to compute limits (authors note this).
  • Pipeline has multiple hyperparameters (τs, τt, αboost, γcap, p) that need tuning for best results.
  • Relies on the availability of aligned visual and text encoders to compute segment similarity and text relevance.

When Not To Use

  • If you cannot run the visual/text encoders needed to compute segment similarity and text relevance.
  • If you lack resources to tune hyperparameters on a validation set for your video distribution.
  • On untested, much larger Video LLMs until further validation.

Failure Modes

  • Over-pruning important mid-sequence frames if τt or segmentation thresholds are mis-set (sensitivity shown in ablations).
  • Dropping small but critical spatial cues when spatial deduplication removes the only informative patch.
  • Performance regressions on video types not covered by evaluated benchmarks.

Core Entities

Models

  • LLaVA-OneVision-0.5B
  • LLaVA-OneVision-7B
  • LLaVA-Video-7B

Metrics

  • FLOPs
  • Average benchmark score (shown in percentage points)

Datasets

  • VideoMME
  • MLVU
  • LongVideoBench

Benchmarks

  • VideoMME
  • MLVU
  • LongVideoBench