Training-free token pruning that fixes attention bias to cut Video-LLM FLOPs while keeping accuracy

Overview

Decision Snapshot: Ready For Pilot

Method is plug-and-play and training-free, tested on three Video LLM variants and three public benchmarks, but not validated on larger-scale Video LLMs and needs hyperparameter tuning.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, Guiguang Ding

Why It Matters For Business

AdaTP cuts inference FLOPs by up to ~73% on evaluated Video LLMs without losing task accuracy, lowering compute cost for production video understanding.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

Video LLMs generate many visual tokens, and the attention scores commonly used to prune them are biased toward sequence ends and a few fixed spatial locations. AdaTP is a training-free, plug-in pipeline that (1) splits a video into segments by frame similarity, (2) boosts token retention in text-relevant segments using the text encoder, and (3) removes repeated spatial tokens inside each segment. On LLaVA-OneVision-7B, AdaTP cuts compute to ~27% of the original FLOPs with no loss of average task score on the evaluated benchmarks.
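
Step (1) can be illustrated with a minimal sketch. It assumes each frame is represented by one feature vector and that a new segment starts whenever the cosine similarity between consecutive frames drops below τs. This summary does not state the paper's exact boundary criterion, so `split_segments`, `cosine`, and the toy features are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def split_segments(frame_feats, tau_s):
    """Return (start, end) frame ranges, end exclusive.

    Assumption: a boundary is placed wherever consecutive-frame
    similarity falls below tau_s. The paper's rule may differ.
    """
    if not frame_feats:
        return []
    segments, start = [], 0
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < tau_s:
            segments.append((start, i))
            start = i
    segments.append((start, len(frame_feats)))
    return segments

# Toy features: frames 0-1 look alike, frames 2-3 look alike.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = split_segments(feats, tau_s=0.8)
print(segments)
```

A higher τs yields more, shorter segments; a lower τs merges visually different frames into one segment.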

Problem Statement

Video LLMs are slow because they produce many visual tokens, and attention-based pruning often selects tokens that reflect attention bias rather than content (a global bias toward sequence ends and a local bias toward fixed spatial positions). This leads to poor compression and performance drops when token counts are reduced.

Main Contribution

Identify two attention biases in Video LLMs: global (tokens cluster at sequence ends) and local (few fixed spatial positions dominate).

Propose AdaTP: a training-free pipeline combining segment-aware pruning, text-guided global debiasing, and intra-segment spatial deduplication.
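
One way to picture segment-aware pruning with text-guided debiasing is as a per-segment token budget. The sketch below is not the paper's formula. It assumes that segments whose text-relevance score reaches τt get their weight multiplied by αboost, and that γcap limits the fraction of a segment's tokens that may be kept. The function name and these parameter semantics are assumptions made for illustration.

```python
def allocate_budgets(seg_lengths, text_scores, total_budget,
                     tau_t, alpha_boost, gamma_cap):
    """Split total_budget tokens across segments.

    Assumed semantics (not confirmed by the source):
      - base weight is the segment's token count,
      - segments with text score >= tau_t are up-weighted by alpha_boost,
      - no segment keeps more than gamma_cap of its own tokens.
    """
    weights = []
    for n, score in zip(seg_lengths, text_scores):
        w = float(n)
        if score >= tau_t:
            w *= alpha_boost
        weights.append(w)
    total_w = sum(weights)
    if total_w == 0:
        return [0] * len(seg_lengths)
    budgets = []
    for n, w in zip(seg_lengths, weights):
        share = total_budget * w / total_w
        budgets.append(int(min(share, gamma_cap * n)))
    return budgets

# Three equal segments; only the middle one is relevant to the question.
budgets = allocate_budgets(
    seg_lengths=[100, 100, 100],
    text_scores=[0.1, 0.7, 0.2],
    total_budget=90,
    tau_t=0.5, alpha_boost=2.0, gamma_cap=0.5,
)
print(budgets)
```

The point of the design is that the budget follows text relevance rather than raw attention, so a middle segment is not starved just because attention piles up at the sequence ends.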

Key Findings

Attention scores in early layers concentrate at sequence ends (global bias).

Numbers: 86.8% of top-10% attention tokens lie in the last 4 of 32 frames (Layer 1)

Practical Use: Do not prune purely by raw attention scores; they over-select end frames and miss middle-frame content.

Evidence Ref: Section 3.2, Fig. 2
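
To check for this global bias on your own model, the statistic can be computed from per-token attention scores. The sketch below assumes you have already extracted one score per visual token along with its frame index. The helper name and the synthetic scores are illustrative and do not reproduce the paper's data.

```python
def end_frame_share(scores, frame_ids, num_frames, top_frac=0.10, last_k=4):
    """Fraction of the top-`top_frac` scoring tokens that sit in the last `last_k` frames."""
    k = max(1, round(len(scores) * top_frac))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    in_tail = sum(1 for i in top if frame_ids[i] >= num_frames - last_k)
    return in_tail / k

# Synthetic example: 32 frames x 10 tokens, tail frames score high,
# plus two strong tokens elsewhere in the sequence.
num_frames = 32
frame_ids = [f for f in range(num_frames) for _ in range(10)]
scores = [1.0 if f >= 28 else 0.1 for f in frame_ids]
scores[0] = 2.0
scores[155] = 2.0

share = end_frame_share(scores, frame_ids, num_frames)
print(f"{share:.2%} of top-10% tokens fall in the last 4 frames")
```

A share far above the uniform expectation (4/32 = 12.5% here) suggests that raw attention ranking would mostly keep end-of-sequence tokens.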

A few spatial locations receive disproportionate attention (local bias).

Numbers: Top spatial patch gets 5.77× the average attention (Layer 1)

Practical Use: Pruning must deduplicate repeated spatial positions across frames, or you keep redundant patches and lose diversity.

Evidence Ref: Section 3.2, Fig. 3
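
The local bias can be quantified in a similar way: average attention per spatial position across frames, then compare the strongest position to the mean. This is a minimal sketch assuming attention is available as a frames × positions grid. The function name and toy values are hypothetical.

```python
def top_position_ratio(attn):
    """attn[frame][position] -> (index of strongest position, its ratio to the mean)."""
    num_pos = len(attn[0])
    # Mean attention received by each spatial position across all frames.
    per_pos = [sum(frame[p] for frame in attn) / len(attn) for p in range(num_pos)]
    mean = sum(per_pos) / num_pos
    top = max(range(num_pos), key=lambda p: per_pos[p])
    return top, per_pos[top] / mean

# Toy grid: position 2 dominates in every frame.
attn = [
    [1.0, 1.0, 5.0, 1.0],
    [1.0, 1.0, 7.0, 1.0],
]
top, ratio = top_position_ratio(attn)
print(f"position {top} receives {ratio:.2f}x the average attention")
```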

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average score (VideoMME/MLVU/LongVideoBench combined)	59.51 (AdaTP at 27.30% FLOPs)	59.34 (vanilla at 100% FLOPs)	+0.17	LLaVA-OneVision-7B; evaluated benchmarks (Table 1)	Table 1 reports AdaTP 27.30% FLOPs avg 59.51 vs vanilla 59.34	Table 1
Average score	46.53 (AdaTP at 26.43% FLOPs)	45.94 (vanilla at 100% FLOPs)	+0.59	LLaVA-OneVision-0.5B; evaluated benchmarks (Table 1)	Table 1 reports AdaTP 26.43% FLOPs avg 46.53 vs vanilla 45.94	Table 1
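
The deltas and the implied FLOPs reduction follow directly from the table. As a quick arithmetic check using only the reported numbers:

```python
# (model, AdaTP avg score, vanilla avg score, AdaTP FLOPs as % of vanilla)
rows = [
    ("LLaVA-OneVision-7B", 59.51, 59.34, 27.30),
    ("LLaVA-OneVision-0.5B", 46.53, 45.94, 26.43),
]

summary = {}
for model, pruned, vanilla, flops_pct in rows:
    summary[model] = {
        "delta": round(pruned - vanilla, 2),
        "flops_reduction_pct": round(100.0 - flops_pct, 2),
    }

for model, stats in summary.items():
    print(model, stats)
```

Both rows correspond to a reduction of roughly 73%, which matches the "up to ~73%" figure quoted in the business summary.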

What To Try In 7 Days

Profile a Video LLM on representative workloads with torch.profiler to measure current FLOPs.

Plug in AdaTP (training-free) on a dev instance for one model (e.g., LLaVA-OneVision-7B) and compare accuracy and FLOPs on a sample of your videos.

Tune segment similarity (τs) and text threshold (τt) on a small validation set; keep a copy of original answers to detect regressions.
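
The tuning and regression check can be organised as a small grid sweep. The sketch below assumes you supply an `evaluate(tau_s, tau_t)` callable that runs your pruned model on the validation set and returns `(score, flops_pct)`. That callable, the `max_drop` tolerance, and the toy stand-in used in the demo are all hypothetical and not part of AdaTP.

```python
from itertools import product

def sweep(evaluate, tau_s_grid, tau_t_grid, baseline_score, max_drop=0.5):
    """Return acceptable settings as (flops_pct, score, tau_s, tau_t), cheapest first."""
    results = []
    for tau_s, tau_t in product(tau_s_grid, tau_t_grid):
        score, flops_pct = evaluate(tau_s, tau_t)
        if baseline_score - score <= max_drop:
            results.append((flops_pct, score, tau_s, tau_t))
    return sorted(results)

def regressions(original, pruned):
    """IDs of questions whose answer changed after pruning."""
    return [qid for qid, ans in original.items() if pruned.get(qid) != ans]

def fake_evaluate(tau_s, tau_t):
    # Toy stand-in for a real evaluation run; the numbers are made up.
    flops_pct = 100.0 - 60.0 * tau_s - 20.0 * tau_t
    score = 59.0 - 2.0 * tau_t
    return score, flops_pct

results = sweep(fake_evaluate, [0.7, 0.9], [0.1, 0.5], baseline_score=59.0)
best = results[0]
changed = regressions({"q1": "A", "q2": "B"}, {"q1": "A", "q2": "C"})
print(best, changed)
```

Reviewing the changed question IDs by hand is useful because an unchanged average score can hide per-question flips.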

Agent Features

Architectures

vision-language

Optimization Features

Token Efficiency

adaptive per-segment token budgets

deduplicate same spatial patch across frames
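
As an illustration of spatial deduplication, the sketch below keeps only the highest-scoring token for each spatial position within a segment. The token layout `(frame, position, score)` and the keep-the-best rule are assumptions made for this example; the summary does not describe the paper's selection rule.

```python
def dedup_spatial(tokens, segments):
    """Keep one token per (segment, spatial position): the one with the highest score.

    tokens:   list of (frame, position, score)
    segments: list of (start, end) frame ranges, end exclusive
    """
    def segment_of(frame):
        for idx, (start, end) in enumerate(segments):
            if start <= frame < end:
                return idx
        raise ValueError(f"frame {frame} is outside all segments")

    best = {}
    for tok in tokens:
        frame, pos, score = tok
        key = (segment_of(frame), pos)
        if key not in best or score > best[key][2]:
            best[key] = tok
    return sorted(best.values())

segments = [(0, 2), (2, 4)]
tokens = [
    (0, 5, 0.90), (1, 5, 0.40),  # same patch twice in segment 0
    (1, 7, 0.30),                # only occurrence of patch 7
    (2, 5, 0.80), (3, 5, 0.95),  # same patch twice in segment 1
]
kept = dedup_spatial(tokens, segments)
print(kept)
```

Because deduplication is scoped to a segment, patch 5 survives once per segment rather than once per video, so a position can still be retained after a scene change.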

Infra Optimization

reduces attention compute; measured FLOPs cuts via torch.profiler

Inference Optimization

progressive layer-by-layer token pruning

segment-aware retention allocation

text-guided segment prioritization

intra-segment spatial deduplication

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Not validated on very large Video LLMs due to compute limits (authors note this).

Pipeline has multiple hyperparameters (τs, τt, αboost, γcap, p) that need tuning for best results.

When Not To Use

If you cannot run the visual/text encoders needed to compute segment similarity and text relevance.

If you lack resources to tune hyperparameters on a validation set for your video distribution.

Failure Modes

Over-pruning important mid-sequence frames if τt or segmentation thresholds are mis-set (sensitivity shown in ablations).

Dropping small but critical spatial cues when spatial deduplication removes the only informative patch.

Core Entities

Models

LLaVA-OneVision-0.5B

LLaVA-OneVision-7B

LLaVA-Video-7B

Metrics

FLOPs

Average benchmark score (percentage points)

Datasets

VideoMME

MLVU

LongVideoBench

Benchmarks

VideoMME

MLVU

LongVideoBench

