Training-free token pruning that fixes attention bias to cut Video-LLM FLOPs while keeping accuracy

May 26, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is plug-and-play and training-free and was tested on three Video LLM variants and three public benchmarks, but it has not been validated on larger-scale Video LLMs and needs hyperparameter tuning.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding

Why It Matters For Business

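AdaTP cuts inference FLOPs by roughly 73% on the evaluated Video LLMs (to 27.30% of baseline on LLaVA-OneVision-7B and 26.43% on LLaVA-OneVision-0.5B) without lowering the average benchmark score, which reduces compute cost for production video understanding.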

Who Should Care

Summary TLDR

Video LLMs generate many visual tokens, and the attention scores commonly used to prune them are biased toward the ends of the sequence and toward a few fixed spatial locations. AdaTP is a training-free, plug-in pipeline that (1) splits a video into segments by frame similarity, (2) boosts token retention in text-relevant segments using the text encoder, and (3) removes repeated spatial tokens inside each segment. On LLaVA-OneVision-7B, AdaTP reduces compute to about 27% of the original FLOPs with no loss in average score across the evaluated benchmarks.
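
Step (1) can be illustrated with a minimal sketch. It assumes one feature vector per frame and starts a new segment whenever the cosine similarity between adjacent frames drops below a threshold. The function names, the adjacent-frame rule, and the toy features are illustrative assumptions, not the authors' implementation; only the idea of a segment similarity threshold (τs) comes from the summary above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def split_into_segments(frame_feats, tau_s):
    """Group consecutive frame indices into segments.

    Assumption: a new segment starts when the similarity between a frame
    and its predecessor falls below tau_s.
    """
    if not frame_feats:
        return []
    segments = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < tau_s:
            segments.append([i])      # scene change: open a new segment
        else:
            segments[-1].append(i)    # same scene: extend the current segment
    return segments

# Toy features: frames 0-1 look alike, frames 2-3 look alike.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(split_into_segments(frames, tau_s=0.8))  # [[0, 1], [2, 3]]
```

A higher `tau_s` produces more, shorter segments; a lower value merges visually different frames into one segment.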

Problem Statement

Video LLMs are slow because they produce many visual tokens. Attention-based pruning often selects tokens that reflect attention bias rather than content: a global bias toward the ends of the sequence and a local bias toward fixed spatial positions. The result is poor compression and a drop in performance when the token count is reduced.

Main Contribution

Identify two attention biases in Video LLMs: global (tokens cluster at sequence ends) and local (few fixed spatial positions dominate).

Propose AdaTP: a training-free pipeline combining segment-aware pruning, text-guided global debiasing, and intra-segment spatial deduplication.
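
The text-guided part of the pipeline can be pictured as a per-segment token budget that grows for segments relevant to the query. The sketch below is a hypothetical allocation rule, not the one in the paper: it weights each segment by its token count, multiplies the weight by `alpha_boost` when the segment's text relevance reaches `tau_t`, and never assigns more tokens than the segment contains. The parameter names echo the hyperparameters listed under Limitations (τt, αboost), but their exact roles here are assumptions, and γcap and p are not modeled.

```python
def allocate_budgets(seg_tokens, relevance, total_budget, tau_t, alpha_boost):
    """Split total_budget across segments (hypothetical rule).

    seg_tokens:  number of visual tokens in each segment
    relevance:   text-relevance score of each segment (assumed to be in [0, 1])
    Segments with relevance >= tau_t get their weight multiplied by alpha_boost.
    """
    weights = [
        n * (alpha_boost if r >= tau_t else 1.0)
        for n, r in zip(seg_tokens, relevance)
    ]
    total_w = sum(weights)
    if total_w == 0:
        return [0] * len(seg_tokens)
    # Floor each share and never exceed the tokens a segment actually has.
    return [
        min(n, int(total_budget * w / total_w))
        for n, w in zip(seg_tokens, weights)
    ]

# Three equal-length segments; only the middle one matches the question.
print(allocate_budgets([100, 100, 100], [0.1, 0.6, 0.2],
                       total_budget=120, tau_t=0.5, alpha_boost=2.0))
# [30, 60, 30]
```

Because shares are floored, the budgets can sum to slightly less than `total_budget`; a real implementation would need a rule for distributing the remainder.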

Key Findings

Attention scores in early layers concentrate at sequence ends (global bias).

Numbers: 86.8% of top-10% attention tokens lie in the last 4 of 32 frames (Layer 1)

Practical Use: Do not prune purely by raw attention scores; they over-select end frames and miss middle-frame content.

Evidence Ref: Section 3.2, Fig. 2
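
A statistic of this kind can be checked on your own model with a few lines of code. The sketch below assumes a flat list of per-token attention scores ordered frame by frame with a fixed number of tokens per frame; it reports the fraction of the top-scoring tokens that fall in the last few frames. The function name and the toy scores are illustrative, and the paper's exact measurement protocol is not reproduced here.

```python
def end_frame_share(scores, tokens_per_frame, last_k_frames, top_frac=0.10):
    """Fraction of the top `top_frac` tokens (by score) that sit in the last k frames."""
    n = len(scores)
    k = max(1, int(n * top_frac))
    # Indices of the k highest-scoring tokens.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    num_frames = n // tokens_per_frame
    first_end_frame = num_frames - last_k_frames
    in_end = sum(1 for i in top if i // tokens_per_frame >= first_end_frame)
    return in_end / k

# Toy example: 8 frames x 5 tokens, with most high scores in the final frame.
tokens_per_frame = 5
scores = [0.01] * 40
scores[35] = scores[36] = scores[37] = 0.9   # last frame
scores[10] = 0.8                             # one mid-sequence token
print(end_frame_share(scores, tokens_per_frame, last_k_frames=1))  # 0.75
```

A value far above `last_k_frames / num_frames` (the share expected if top tokens were spread evenly) suggests the global bias described above.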

A few spatial locations receive disproportionate attention (local bias).

Numbers: Top spatial patch gets 5.77× the average attention (Layer 1)

Practical Use: Pruning must deduplicate repeated spatial positions across frames, or you keep redundant patches and lose diversity.

Evidence Ref: Section 3.2, Fig. 3
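
Deduplicating spatial positions inside a segment can be sketched as follows. The sketch assumes each candidate token carries its frame index, its patch (spatial) index, and a score, and keeps only the highest-scoring token per patch index. The keep-the-best rule is an assumption made for illustration; the summary only states that repeated spatial tokens inside each segment are removed.

```python
def dedup_spatial(candidates):
    """Keep one token per spatial position within a single segment.

    candidates: list of (frame_idx, patch_idx, score) tuples.
    Assumption: when several frames contribute the same patch_idx,
    only the highest-scoring one is kept.
    """
    best = {}
    for frame, patch, score in candidates:
        if patch not in best or score > best[patch][2]:
            best[patch] = (frame, patch, score)
    # Return survivors ordered by frame, then patch.
    return sorted(best.values())

segment = [
    (0, 3, 0.90),  # patch 3 is selected in three frames ...
    (1, 3, 0.80),
    (2, 3, 0.95),  # ... but only the best-scoring copy survives
    (1, 7, 0.40),
]
print(dedup_spatial(segment))  # [(1, 7, 0.4), (2, 3, 0.95)]
```

Note that this rule can discard a lower-scoring patch that happens to be the only informative one, which is the failure mode listed under Risks & Boundaries.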

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average score (VideoMME/MLVU/LongVideoBench combined) | 59.51 (AdaTP at 27.30% FLOPs) | 59.34 (vanilla at 100% FLOPs) | +0.17 | LLaVA-OneVision-7B; evaluated benchmarks (Table 1) | Table 1 reports AdaTP 27.30% FLOPs avg 59.51 vs vanilla 59.34 | Table 1 |
| Average score | 46.53 (AdaTP at 26.43% FLOPs) | 45.94 (vanilla at 100% FLOPs) | +0.59 | LLaVA-OneVision-0.5B; evaluated benchmarks (Table 1) | Table 1 reports AdaTP 26.43% FLOPs avg 46.53 vs vanilla 45.94 | Table 1 |
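
The FLOPs reductions and score deltas follow directly from the figures in the rows above. The short calculation below recomputes them; the dictionary layout and function name are just for illustration.

```python
# Figures copied from the results rows above.
rows = {
    "LLaVA-OneVision-7B":   {"flops_pct": 27.30, "adatp": 59.51, "vanilla": 59.34},
    "LLaVA-OneVision-0.5B": {"flops_pct": 26.43, "adatp": 46.53, "vanilla": 45.94},
}

def summarize(row):
    """Return (FLOPs reduction in percent, score delta in points), rounded to 2 decimals."""
    reduction = round(100.0 - row["flops_pct"], 2)
    delta = round(row["adatp"] - row["vanilla"], 2)
    return reduction, delta

for model, row in rows.items():
    reduction, delta = summarize(row)
    print(f"{model}: {reduction}% fewer FLOPs, {delta:+.2f} points")
# LLaVA-OneVision-7B: 72.7% fewer FLOPs, +0.17 points
# LLaVA-OneVision-0.5B: 73.57% fewer FLOPs, +0.59 points
```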

What To Try In 7 Days

Profile a Video LLM on representative workloads with torch.profiler to measure current FLOPs.

Plug in AdaTP (training-free) on a dev instance for one model (e.g., LLaVA-OneVision-7B) and compare accuracy and FLOPs on a sample of your videos.

Tune segment similarity (τs) and text threshold (τt) on a small validation set; keep a copy of original answers to detect regressions.
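
The tuning step can be organized as a small grid search plus a regression check against the saved original answers. The sketch below is a generic harness: `evaluate` is a hypothetical callback that you would implement to run the pruned model and return `(accuracy, flops_pct)` for a given (τs, τt) pair, and the numbers in the usage example are made up purely to exercise the code.

```python
from itertools import product

def pick_config(evaluate, tau_s_grid, tau_t_grid, baseline_acc, tol=0.0):
    """Return the cheapest (tau_s, tau_t, acc, flops_pct) that stays within tol of baseline."""
    best = None
    for tau_s, tau_t in product(tau_s_grid, tau_t_grid):
        acc, flops_pct = evaluate(tau_s, tau_t)
        if acc + tol < baseline_acc:
            continue  # accuracy dropped too far; reject this setting
        if best is None or flops_pct < best[3]:
            best = (tau_s, tau_t, acc, flops_pct)
    return best

def find_regressions(original, pruned, gold):
    """Question ids the unpruned model answered correctly but the pruned model did not."""
    return [q for q in gold if original[q] == gold[q] and pruned[q] != gold[q]]

# Usage with fabricated placeholder numbers (not measurements).
fake_results = {
    (0.7, 0.3): (0.60, 30.0),
    (0.7, 0.5): (0.58, 25.0),
    (0.9, 0.3): (0.61, 40.0),
    (0.9, 0.5): (0.60, 28.0),
}
best = pick_config(lambda s, t: fake_results[(s, t)],
                   tau_s_grid=[0.7, 0.9], tau_t_grid=[0.3, 0.5],
                   baseline_acc=0.60)
print(best)  # (0.9, 0.5, 0.6, 28.0)

gold = {"q1": "A", "q2": "B", "q3": "C"}
original = {"q1": "A", "q2": "B", "q3": "D"}
pruned = {"q1": "A", "q2": "C", "q3": "D"}
print(find_regressions(original, pruned, gold))  # ['q2']
```

Tracking per-question regressions alongside the aggregate score helps catch cases where the average holds steady while specific question types get worse.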

Agent Features

Architectures
vision-language

Optimization Features

Token Efficiency
adaptive per-segment token budgets
deduplicate same spatial patch across frames
Infra Optimization
reduces attention compute; measured FLOPs cuts via torch.profiler
Inference Optimization
progressive layer-by-layer token pruning
segment-aware retention allocation
text-guided segment prioritization
intra-segment spatial deduplication

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Not validated on very large Video LLMs due to compute limits (authors note this).

Pipeline has multiple hyperparameters (τs, τt, αboost, γcap, p) that need tuning for best results.

When Not To Use

If you cannot run the visual/text encoders needed to compute segment similarity and text relevance.

If you lack resources to tune hyperparameters on a validation set for your video distribution.

Failure Modes

Over-pruning important mid-sequence frames if τt or segmentation thresholds are mis-set (sensitivity shown in ablations).

Dropping small but critical spatial cues when spatial deduplication removes the only informative patch.

Core Entities

Models

LLaVA-OneVision-0.5B, LLaVA-OneVision-7B, LLaVA-Video-7B

Metrics

FLOPs, Average benchmark score (percentage points shown)

Datasets

VideoMME, MLVU, LongVideoBench

Benchmarks

VideoMME, MLVU, LongVideoBench