Overview
The method is plug-and-play and training-free and was tested on three Video LLM variants and three public benchmarks, but it has not been validated on larger-scale Video LLMs and requires hyperparameter tuning.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
AdaTP cuts inference FLOPs by roughly 73% on the evaluated Video LLMs without lowering the average benchmark score, which reduces compute cost for production video understanding.
Summary TLDR
Video LLMs generate many visual tokens, and the attention scores commonly used to prune them are biased toward sequence ends and a few fixed spatial locations. AdaTP is a training-free, plug-in pipeline that (1) splits a video into segments by frame similarity, (2) boosts token retention in text-relevant segments using the text encoder, and (3) removes repeated spatial tokens inside each segment. On LLaVA-OneVision-7B, AdaTP cuts compute to about 27% of the original FLOPs with no loss in average score across the evaluated benchmarks.
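Step (1) can be illustrated with a minimal sketch. It assumes that each frame is summarized by one feature vector and that a segment boundary is placed wherever the cosine similarity between consecutive frames falls below a threshold. The function name, the toy features, and the threshold value are hypothetical; the summary does not specify how AdaTP computes frame similarity.

```python
import numpy as np

def split_into_segments(frame_feats: np.ndarray, tau_s: float) -> list[tuple[int, int]]:
    """Split frames into contiguous segments (assumed rule, not the authors' code).

    A new segment starts whenever the cosine similarity between a frame and
    its predecessor drops below tau_s. Returns (start, end) pairs, end exclusive.
    """
    # Normalize each frame feature so a dot product equals cosine similarity.
    norms = np.linalg.norm(frame_feats, axis=1, keepdims=True)
    unit = frame_feats / np.clip(norms, 1e-8, None)
    # sims[i] is the similarity between frame i and frame i + 1.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    # Frame i + 1 opens a new segment when sims[i] < tau_s.
    boundaries = [0] + [i + 1 for i, s in enumerate(sims) if s < tau_s]
    boundaries.append(len(frame_feats))
    return list(zip(boundaries[:-1], boundaries[1:]))

# Toy input: two frames pointing one way, then two frames pointing another.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
segments = split_into_segments(feats, tau_s=0.8)
print(segments)
```

With these toy features the scene change between the second and third frame yields two segments of two frames each.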
Problem Statement
Video LLMs are slow because they produce many visual tokens, and attention-based pruning often selects tokens that reflect attention bias (a global bias toward sequence ends and a local bias toward fixed spatial positions). The result is poor compression and performance drops when token counts are reduced.
Main Contribution
Identify two attention biases in Video LLMs: global (tokens cluster at sequence ends) and local (few fixed spatial positions dominate).
Propose AdaTP: a training-free pipeline combining segment-aware pruning, text-guided global debiasing, and intra-segment spatial deduplication.
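The text-guided component can be pictured as a per-segment budget rule. The sketch below is an assumption, not the paper's definition: it treats τt as a relevance threshold, αboost as a multiplier on the keep ratio of relevant segments, and γcap as an upper bound on that ratio. The summary names these hyperparameters but does not define them, so the function, its arguments, and all numeric values are hypothetical.

```python
def segment_keep_ratios(relevance, base_ratio, tau_t, alpha_boost, gamma_cap):
    """Return a per-segment token keep ratio under ASSUMED hyperparameter semantics."""
    ratios = []
    for score in relevance:
        ratio = base_ratio
        # Assumption: segments whose text relevance reaches tau_t get a boost.
        if score >= tau_t:
            ratio = base_ratio * alpha_boost
        # Assumption: gamma_cap bounds how many tokens any one segment may keep.
        ratios.append(min(ratio, gamma_cap))
    return ratios

# Hypothetical relevance scores for three segments (e.g. text-to-segment similarity).
ratios = segment_keep_ratios(
    [0.10, 0.45, 0.80], base_ratio=0.25, tau_t=0.4, alpha_boost=3.0, gamma_cap=0.6
)
print(ratios)
```

Under this rule a mid-sequence segment that matches the query keeps more tokens than its attention scores alone would grant it, which is the debiasing effect the contribution describes. The sketch does not renormalize the ratios to a fixed total budget.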
Key Findings
Attention scores in early layers concentrate at sequence ends (global bias).
A few spatial locations receive disproportionate attention (local bias).
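Both biases can be checked on your own model with simple diagnostics. The sketch below assumes you have already extracted attention mass per visual token, as a 1-D array in sequence order for the global check and as a (frames, positions) array for the local check. The function names and toy arrays are hypothetical, and the extraction step depends on the model and is not shown.

```python
import numpy as np

def end_mass_fraction(attn: np.ndarray, tail: int) -> float:
    """Fraction of total attention mass on the last `tail` visual tokens."""
    if tail <= 0:
        raise ValueError("tail must be positive")
    return float(attn[-tail:].sum() / attn.sum())

def top_position_share(attn_grid: np.ndarray, k: int) -> float:
    """Share of attention held by the k most-attended spatial positions.

    attn_grid has shape (frames, positions); mass is summed over frames first.
    """
    per_position = attn_grid.sum(axis=0)
    top_k = np.sort(per_position)[-k:]
    return float(top_k.sum() / per_position.sum())

# Toy data: the last two of ten tokens receive most of the attention.
attn = np.array([1, 1, 1, 1, 1, 1, 1, 1, 6, 6], dtype=float)
global_bias = end_mass_fraction(attn, tail=2)

# Toy data: one of three spatial positions dominates in every frame.
attn_grid = np.array([[1, 1, 8], [1, 1, 8]], dtype=float)
local_bias = top_position_share(attn_grid, k=1)
```

A value far above `tail / len(attn)` for the first metric, or far above `k / positions` for the second, suggests the corresponding bias is present.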
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average score (VideoMME/MLVU/LongVideoBench combined) | 59.51 (AdaTP at 27.30% FLOPs) | 59.34 (vanilla at 100% FLOPs) | +0.17 | LLaVA-OneVision-7B; evaluated benchmarks (Table 1) | Table 1 reports AdaTP 27.30% FLOPs avg 59.51 vs vanilla 59.34 | Table 1 |
| Average score | 46.53 (AdaTP at 26.43% FLOPs) | 45.94 (vanilla at 100% FLOPs) | +0.59 | LLaVA-OneVision-0.5B; evaluated benchmarks (Table 1) | Table 1 reports AdaTP 26.43% FLOPs avg 46.53 vs vanilla 45.94 | Table 1 |
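
The deltas and the implied FLOPs savings follow directly from the table values. The short check below recomputes them; the dictionary layout and key names are only illustrative.

```python
# Values copied from the Results table (Table 1 of the paper).
rows = {
    "LLaVA-OneVision-7B": {"adatp": 59.51, "vanilla": 59.34, "flops_pct": 27.30},
    "LLaVA-OneVision-0.5B": {"adatp": 46.53, "vanilla": 45.94, "flops_pct": 26.43},
}

summary = {}
for model, r in rows.items():
    # Score change relative to the unpruned model.
    delta = round(r["adatp"] - r["vanilla"], 2)
    # Percentage of FLOPs removed relative to the 100% baseline.
    flops_saved = round(100.0 - r["flops_pct"], 2)
    summary[model] = (delta, flops_saved)

for model, (delta, flops_saved) in summary.items():
    print(f"{model}: delta {delta:+.2f}, FLOPs saved {flops_saved:.2f}%")
```

The savings come out at 72.70% for the 7B model and 73.57% for the 0.5B model, which is where the roughly 73% figure comes from.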
What To Try In 7 Days
Profile a Video LLM on representative workloads with torch.profiler to measure current FLOPs.
Plug in AdaTP (training-free) on a dev instance for one model (e.g., LLaVA-OneVision-7B) and compare accuracy and FLOPs on a sample of your videos.
Tune segment similarity (τs) and text threshold (τt) on a small validation set; keep a copy of original answers to detect regressions.
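The tuning and regression-check steps can be sketched as a small grid search plus an answer comparison. Everything here is a hypothetical scaffold: `evaluate` stands for whatever function runs your pruned model on a validation set and returns accuracy, and `fake_evaluate` is a stand-in so the example runs on its own.

```python
from itertools import product

def tune_thresholds(evaluate, tau_s_grid, tau_t_grid):
    """Return (best_score, best_tau_s, best_tau_t) over a small grid.

    `evaluate(tau_s, tau_t)` is a user-supplied callable returning validation accuracy.
    """
    best = None
    for tau_s, tau_t in product(tau_s_grid, tau_t_grid):
        score = evaluate(tau_s, tau_t)
        if best is None or score > best[0]:
            best = (score, tau_s, tau_t)
    return best

def find_regressions(original_answers, pruned_answers, gold):
    """Question IDs the unpruned model answered correctly but the pruned model does not."""
    return sorted(
        qid for qid, answer in gold.items()
        if original_answers[qid] == answer and pruned_answers[qid] != answer
    )

# Stand-in evaluator for illustration only; replace with a real validation run.
def fake_evaluate(tau_s, tau_t):
    return 1.0 - abs(tau_s - 0.8) - abs(tau_t - 0.3)

best = tune_thresholds(fake_evaluate, [0.7, 0.8, 0.9], [0.2, 0.3, 0.4])

regressions = find_regressions(
    original_answers={"q1": "A", "q2": "B", "q3": "C"},
    pruned_answers={"q1": "A", "q2": "D", "q3": "C"},
    gold={"q1": "A", "q2": "B", "q3": "D"},
)
```

Tracking regressions per question, rather than only the aggregate score, shows whether pruning trades some correct answers for others even when the average is unchanged.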
Risks & Boundaries
Limitations
Not validated on very large Video LLMs due to compute limits (authors note this).
Pipeline has multiple hyperparameters (τs, τt, αboost, γcap, p) that need tuning for best results.
When Not To Use
If you cannot run the visual/text encoders needed to compute segment similarity and text relevance.
If you lack resources to tune hyperparameters on a validation set for your video distribution.
Failure Modes
Over-pruning important mid-sequence frames if τt or segmentation thresholds are mis-set (sensitivity shown in ablations).
Dropping small but critical spatial cues when spatial deduplication removes the only informative patch.
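The spatial-deduplication failure mode is easier to see with a concrete rule in hand. The sketch below assumes a simple scheme: within one segment, a token is dropped when it is nearly identical to the most recently kept token at the same spatial position. The function name and `sim_threshold` are hypothetical, and the paper's actual deduplication criterion may differ.

```python
import numpy as np

def dedup_segment(tokens: np.ndarray, sim_threshold: float) -> np.ndarray:
    """Boolean keep-mask for tokens of shape (frames, positions, dim).

    Assumed rule: at each spatial position, drop a token whose cosine similarity
    to the most recently kept token at that position is >= sim_threshold.
    The first frame of the segment is always kept.
    """
    frames, positions, _ = tokens.shape
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    keep = np.zeros((frames, positions), dtype=bool)
    keep[0] = True
    last_kept = unit[0].copy()
    for f in range(1, frames):
        sims = np.sum(unit[f] * last_kept, axis=-1)
        changed = sims < sim_threshold
        keep[f] = changed
        # Update the reference token only where a new token was kept.
        last_kept[changed] = unit[f][changed]
    return keep

# Toy segment: position 0 is static; position 1 changes in the last frame.
tokens = np.array(
    [[[1, 0], [1, 0]],
     [[1, 0], [1, 0]],
     [[1, 0], [0, 1]]],
    dtype=float,
)
keep = dedup_segment(tokens, sim_threshold=0.9)
```

Here the changed patch in the last frame survives because its similarity to the reference is low. A small visual cue that shifts the patch embedding only slightly would stay above the threshold and be removed, which is the failure mode described above.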

