Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
AdaTP cuts inference FLOPs by up to ~73% on the evaluated Video LLMs with no loss of average benchmark score, lowering compute cost for production video understanding.
Summary TLDR
Video LLMs generate many visual tokens, and the attention scores commonly used to prune them are biased toward sequence ends and a few fixed spatial locations. AdaTP is a training-free, plug-in pipeline that (1) splits a video into segments by frame similarity, (2) boosts token retention in text-relevant segments using the text encoder, and (3) removes repeated spatial tokens inside each segment. On LLaVA-OneVision-7B, AdaTP cuts compute to ~27% of the original FLOPs with no loss of average score on the evaluated benchmarks.
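Step (1) above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each frame is already reduced to one feature vector, that similarity is cosine similarity between consecutive frames, and that a new segment starts when similarity falls below τs. The function names and the default `tau_s=0.8` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_into_segments(frame_feats, tau_s=0.8):
    """Group consecutive frame indices into segments.

    Assumption: a new segment starts whenever the similarity between a frame
    and the previous frame drops below tau_s.
    """
    if not frame_feats:
        return []
    segments = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < tau_s:
            segments.append([i])      # scene change: open a new segment
        else:
            segments[-1].append(i)    # same scene: extend the current segment
    return segments


# Toy features: frames 0-1 look alike, frames 2-3 look alike.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = split_into_segments(feats, tau_s=0.8)
print(segments)
```

Lowering `tau_s` merges more frames into each segment; raising it produces more, shorter segments.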
Problem Statement
Video LLMs are slow because they produce many visual tokens. Attention-based pruning often selects tokens that reflect attention bias rather than content: a global bias toward sequence ends and a local bias toward a few fixed spatial positions. The result is poor compression and performance drops when the token count is reduced.
Main Contribution
Identify two attention biases in Video LLMs: global (tokens cluster at sequence ends) and local (few fixed spatial positions dominate).
Propose AdaTP: a training-free pipeline combining segment-aware pruning, text-guided global debiasing, and intra-segment spatial deduplication.
Show that AdaTP reduces FLOPs by up to ~73% while preserving or slightly improving benchmark scores across multiple Video LLMs and compression settings.
Key Findings
Attention scores in early layers concentrate at sequence ends (global bias).
A few spatial locations receive disproportionate attention (local bias).
AdaTP retains task performance while cutting compute (~27% of the original FLOPs on LLaVA-OneVision-7B with no loss of average score).
AdaTP outperforms other training-free pruning baselines across models and compression rates.
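The two biases listed above can be checked on your own model with simple diagnostics. The sketch below is an assumption about how one might quantify them, not the paper's measurement protocol: `tail_mass` reports how much attention lands on the last fraction of the token sequence, and `top_position_mass` reports how much lands on the few most-attended spatial positions. Both function names and the toy numbers are hypothetical.

```python
def tail_mass(attn, tail_frac=0.1):
    """Share of total attention received by the last tail_frac of tokens.

    With no global bias and uniform attention this is about tail_frac.
    """
    n = len(attn)
    k = max(1, int(n * tail_frac))
    total = sum(attn)
    return sum(attn[-k:]) / total if total else 0.0


def top_position_mass(attn_per_frame, top_k=1):
    """Share of attention captured by the top_k spatial positions.

    attn_per_frame[f][p] is the attention on patch position p in frame f.
    """
    n_pos = len(attn_per_frame[0])
    per_pos = [sum(frame[p] for frame in attn_per_frame) for p in range(n_pos)]
    total = sum(per_pos)
    top = sorted(per_pos, reverse=True)[:top_k]
    return sum(top) / total if total else 0.0


# Toy example: the last token absorbs most of the attention (global bias).
seq_attn = [0.05] * 9 + [0.55]
global_share = tail_mass(seq_attn, tail_frac=0.1)

# Toy example: patch position 0 dominates in every frame (local bias).
frames = [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]]
local_share = top_position_mass(frames, top_k=1)
```

Values far above the uniform baseline (`tail_frac` for the first metric, `top_k / n_pos` for the second) suggest that raw attention scores would steer pruning toward position rather than content.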
Results
Average score (VideoMME/MLVU/LongVideoBench combined)
Who Should Care
What To Try In 7 Days
Profile a Video LLM on representative workloads with torch.profiler to measure baseline FLOPs.
Plug in AdaTP (training-free) on a dev instance for one model (e.g., LLaVA-OneVision-7B) and compare accuracy and FLOPs on a sample of your videos.
Tune segment similarity (τs) and text threshold (τt) on a small validation set; keep a copy of original answers to detect regressions.
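The comparison in the steps above can be kept honest with a small report that tracks both the FLOPs ratio and per-video regressions. This is a sketch under assumptions: answers are stored as dicts keyed by a video id, FLOPs totals come from whatever profiler you use, and the numbers below are placeholders, not results from the paper. The function name `compare_runs` is hypothetical.

```python
def compare_runs(gold, baseline, pruned, baseline_flops, pruned_flops):
    """Compare a pruned run against the unpruned baseline.

    gold, baseline, pruned: dicts mapping video id -> answer string.
    Returns the FLOPs ratio, both accuracies, and the ids that the baseline
    answered correctly but the pruned model got wrong.
    """
    ids = sorted(gold)
    base_acc = sum(baseline[i] == gold[i] for i in ids) / len(ids)
    pruned_acc = sum(pruned[i] == gold[i] for i in ids) / len(ids)
    regressions = [
        i for i in ids if baseline[i] == gold[i] and pruned[i] != gold[i]
    ]
    return {
        "flops_ratio": pruned_flops / baseline_flops,
        "baseline_acc": base_acc,
        "pruned_acc": pruned_acc,
        "regressions": regressions,
    }


# Placeholder data: same average accuracy, but one video regressed.
gold = {"v1": "A", "v2": "C", "v3": "B", "v4": "D"}
baseline = {"v1": "A", "v2": "C", "v3": "D", "v4": "D"}
pruned = {"v1": "A", "v2": "B", "v3": "B", "v4": "D"}
report = compare_runs(gold, baseline, pruned,
                      baseline_flops=100.0, pruned_flops=27.0)
print(report)
```

The toy data shows why keeping the original answers matters: the average score is unchanged, yet `regressions` still flags a video whose answer flipped from right to wrong.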
Agent Features
Architectures
- vision-language
Optimization Features
Token Efficiency
- adaptive per-segment token budgets
- deduplicate same spatial patch across frames
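The cross-frame deduplication idea can be illustrated as below. This is a minimal sketch, not the authors' rule: it assumes tokens are indexed by frame and patch position, that a token is a duplicate when its cosine similarity to the last kept token at the same position is at least a threshold, and that the first frame of a segment is always kept. The name `dedup_segment` and the `dup_thresh` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def dedup_segment(tokens, dup_thresh=0.95):
    """Return (frame, patch) indices of tokens to keep within one segment.

    tokens[f][p] is the feature vector of patch position p in frame f.
    A token is dropped when it is nearly identical to the most recently kept
    token at the same patch position.
    """
    kept = []
    last_kept = {}  # patch position -> feature vector of last kept token
    for f, frame in enumerate(tokens):
        for p, vec in enumerate(frame):
            ref = last_kept.get(p)
            if ref is None or cosine(ref, vec) < dup_thresh:
                kept.append((f, p))
                last_kept[p] = vec
    return kept


# Two frames, two patches: patch 0 is static, patch 1 changes.
segment_tokens = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0
    [[1.0, 0.0], [1.0, 0.0]],  # frame 1
]
kept = dedup_segment(segment_tokens)
print(kept)
```

Comparing against the last kept token, rather than only the first frame, lets a patch that changes slowly still contribute a fresh token once it has drifted far enough.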
Infra Optimization
- reduces attention compute; FLOPs reductions measured with torch.profiler
Inference Optimization
- progressive layer-by-layer token pruning
- segment-aware retention allocation
- text-guided segment prioritization
- intra-segment spatial deduplication
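Segment-aware allocation and text-guided prioritization can be pictured as splitting a total token budget across segments, with extra weight for segments that match the text query. The sketch below is an assumed formulation, not the paper's: it weights each segment by its length, multiplies the weight by `alpha_boost` when the text relevance score reaches `tau_t`, and never assigns more tokens than a segment contains. The source names τt and αboost but does not define them here, so their use below is a guess; γcap and p are not modeled.

```python
def allocate_budgets(seg_lengths, text_scores, total_budget,
                     tau_t=0.5, alpha_boost=2.0):
    """Split total_budget tokens across segments.

    seg_lengths[i]: number of visual tokens in segment i.
    text_scores[i]: assumed text relevance of segment i, in [0, 1].
    Segments with score >= tau_t get their weight multiplied by alpha_boost.
    """
    weights = [
        n * (alpha_boost if s >= tau_t else 1.0)
        for n, s in zip(seg_lengths, text_scores)
    ]
    total_w = sum(weights)
    if total_w == 0:
        return [0] * len(seg_lengths)
    # Floor each share and never exceed the tokens the segment actually has.
    return [
        min(n, int(total_budget * w / total_w))
        for n, w in zip(seg_lengths, weights)
    ]


# Three segments; only the middle one is relevant to the text query.
budgets = allocate_budgets(
    seg_lengths=[100, 100, 200],
    text_scores=[0.2, 0.8, 0.3],
    total_budget=100,
)
print(budgets)
```

In this toy case the relevant middle segment keeps twice as many tokens per frame as its equally long neighbor, which is the intended effect of text-guided debiasing: retention follows the query rather than sequence position.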
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Not validated on very large Video LLMs due to compute limits (authors note this).
- Pipeline has multiple hyperparameters (τs, τt, αboost, γcap, p) that need tuning for best results.
- Relies on the availability of aligned visual and text encoders to compute segment similarity and text relevance.
When Not To Use
- If you cannot run the visual/text encoders needed to compute segment similarity and text relevance.
- If you lack resources to tune hyperparameters on a validation set for your video distribution.
- On untested, much larger Video LLMs until further validation.
Failure Modes
- Over-pruning important mid-sequence frames if τt or segmentation thresholds are mis-set (sensitivity shown in ablations).
- Dropping small but critical spatial cues when spatial deduplication removes the only informative patch.
- Performance regressions on video types not covered by evaluated benchmarks.
Core Entities
Models
- LLaVA-OneVision-0.5B
- LLaVA-OneVision-7B
- LLaVA-Video-7B
Metrics
- FLOPs
- Average benchmark score (shown in percentage points)
Datasets
- VideoMME
- MLVU
- LongVideoBench
Benchmarks
- VideoMME
- MLVU
- LongVideoBench

