Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
DOP cuts prefilling compute and real GPU latency substantially while keeping task accuracy near-original, enabling cheaper inference and higher throughput for multimodal deployments.
Summary TLDR
The paper proposes Depth-wise Operation Pruning (DOP), a practical method that prunes per-module "operations" (one module processing one token group), so critical modules receive more tokens while redundant module computations are skipped. DOP uses a depth-wise constraint and an additive KL-divergence proxy, evaluated on a small validation set, to keep the policy search fast. Across 6 MLLMs and 13 benchmarks, DOP outperforms token-pruning baselines, reduces TFLOPs by up to 86% with ~1% average performance loss on LLaVA-NeXT-7B, and translates to real GPU latency gains. It needs only small validation runs (25–100 samples) and transfers well between models.
Problem Statement
Existing token reduction methods give the same visual tokens to all decoder modules and ignore that different modules and layers vary in importance. This wastes compute in the decoder. We need a fine-grained way to skip specific module×token computations so important modules can process more tokens under a computation budget.
Main Contribution
Formulate operation pruning: treat each module×token-group computation as an atomic operation and allow selective skipping.
Introduce DOP: depth-wise pruning + additive KL-divergence proxy to make policy search fast and data-light.
Show broad gains: better performance-efficiency tradeoffs across 6 MLLMs and 13 benchmarks, with low optimization cost and transferability.
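The additive proxy mentioned above can be illustrated with a minimal sketch. It assumes the divergence is the KL divergence between the unpruned and pruned first-token output distributions, and that the proxy score of a policy is the sum of divergences measured with one module pruned at a time. The function names, module names, and the lookup table are hypothetical and not taken from the paper's code.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def additive_proxy(policy, single_divergence):
    """Sum the divergences measured when each module is pruned in isolation.

    policy: dict mapping module name -> number of visual tokens it keeps.
    single_divergence: dict mapping (module name, kept tokens) -> divergence
        measured with only that module pruned (hypothetical lookup table).
    """
    return sum(single_divergence[(m, n)] for m, n in policy.items())

# Toy first-token distributions over a 3-word vocabulary (made-up numbers).
full = [0.7, 0.2, 0.1]
pruned = [0.6, 0.3, 0.1]
d = kl_divergence(full, pruned)

# Made-up single-module divergences for two modules.
table = {
    ("layer0.mha", 576): 0.0, ("layer0.mha", 288): 0.02,
    ("layer1.mlp", 576): 0.0, ("layer1.mlp", 144): 0.05,
}
score = additive_proxy({"layer0.mha": 288, "layer1.mlp": 144}, table)
```

Because the table is filled once per (module, token count) pair, scoring a candidate policy under this assumption costs only a handful of lookups rather than a fresh model run.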
Key Findings
DOP can cut theoretical FLOPs by 86% while incurring ~1% average performance loss on LLaVA-NeXT-7B.
Under no average performance loss, DOP still reduces TFLOPs by 77% on LLaVA-NeXT-7B.
DOP preserves up to 7% more task performance than the prior state-of-the-art token-pruning baseline on evaluated benchmarks.
The additive proxy (sum of single-parameter divergences) ranks pruning policies similarly to the true joint divergence.
Optimization is fast and data-light: good policies found in minutes with 25–100 samples.
DOP policies transfer across models with small loss (≤0.7% Rel. Avg. difference).
Results
TFLOPs reduction (LLaVA-NeXT-7B)
TFLOPs reduction without perf loss (LLaVA-NeXT-7B)
Real GPU prefilling latency reduction
Performance advantage vs baseline
Optimization cost
Who Should Care
What To Try In 7 Days
Run DOP optimization on a representative model with 25–100 validation samples to get a one-time pruning policy.
Apply DOP on top of your existing visual token reducer (VisPruner or CDPruner) and measure prefilling latency and end-task metrics.
Test policy transfer: optimize once on a dev model and apply the divergences to sibling models to save time.
Optimization Features
Token Efficiency
- integrates with token pruning methods (VisPruner, CDPruner) and resizing
- searches for per-module token allocations (n_v)
Infra Optimization
- one-time lightweight validation runs (minutes) on GPU
Model Optimization
- operation-level pruning (module×token-group)
- depth-wise pruning (prune deeper operations first)
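A rough sketch of how a depth-wise constraint shrinks the search space follows. It assumes "prune deeper operations first" means the number of visual tokens a module processes never increases with depth, that compute cost is linear in the token count, and that a table of additive proxy divergences is already available. The brute-force search and all names are illustrative only and are not the paper's search procedure.

```python
from itertools import product

def is_depthwise(policy):
    """True if token counts never increase with depth (policy ordered shallow -> deep)."""
    return all(a >= b for a, b in zip(policy, policy[1:]))

def search_policy(candidates, divergence, cost_per_token, budget):
    """Brute-force the lowest-proxy depth-wise policy within a compute budget.

    candidates: allowed token counts for every module.
    divergence: divergence[i][n] = proxy divergence when module i keeps n tokens.
    cost_per_token: cost_per_token[i] = cost of module i processing one token
        (a linear simplification; attention cost is not linear in practice).
    """
    best, best_score = None, float("inf")
    for policy in product(candidates, repeat=len(cost_per_token)):
        if not is_depthwise(policy):
            continue  # violates the depth-wise constraint
        cost = sum(c * n for c, n in zip(cost_per_token, policy))
        if cost > budget:
            continue  # over the compute budget
        score = sum(divergence[i][n] for i, n in enumerate(policy))
        if score < best_score:
            best, best_score = policy, score
    return best, best_score

# Made-up numbers: three modules, the deepest one is the most redundant.
candidates = [0, 2, 4]
divergence = [
    {0: 0.9, 2: 0.3, 4: 0.0},
    {0: 0.5, 2: 0.1, 4: 0.0},
    {0: 0.05, 2: 0.02, 4: 0.0},
]
best, best_score = search_policy(candidates, divergence, [1.0, 1.0, 1.0], budget=8.0)
```

In this toy setup the search keeps all tokens in the two shallow modules and skips the deepest one entirely. Exhaustive enumeration is only workable for a few modules; it is used here to make the constraint and budget explicit.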
System Optimization
- reduces prefilling computational cost and real GPU latency
Training Optimization
- no training; uses inference divergence for policy search
Inference Optimization
- skips MHA/MLP operations for pruned visual tokens during prefilling
- compatible with FlashAttention2 and other attention implementations
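One way to picture skipping an operation is shown below, as a toy sketch rather than the paper's implementation. It assumes a standard residual block, so a token excluded from a module simply carries its hidden state forward unchanged. It does not model how pruned tokens are treated as attention keys and values. `apply_module` and `toy_mlp` are hypothetical names.

```python
def apply_module(hidden, module, keep):
    """Run `module` on the kept token positions only; other tokens pass through.

    hidden: list of per-token vectors (lists of floats).
    module: function mapping a list of token vectors to same-shaped updates.
    keep: sorted list of token indices the module is allowed to process.
    """
    updates = module([hidden[i] for i in keep])
    out = [list(h) for h in hidden]  # skipped tokens keep their residual-stream value
    for i, u in zip(keep, updates):
        out[i] = [h + d for h, d in zip(hidden[i], u)]  # residual add for processed tokens
    return out

# Stand-in for an MLP: returns an update of 2x the input for each token.
toy_mlp = lambda toks: [[2.0 * x for x in t] for t in toks]

hidden = [[1.0], [2.0], [3.0]]
out = apply_module(hidden, toy_mlp, keep=[0, 2])  # token 1 skips this module
```

The module only ever sees the kept tokens, which is where the compute saving would come from under this assumption.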
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only prunes visual-token operations in experiments; other token groups not evaluated extensively
- Targets prefilling stage (compute-bound); decoding-stage (memory-bandwidth-bound) gains may be small
- Additive proxy largely preserves policy ranking but can swap the order of a few lower-ranked policies
- Requires access to a small validation set representative of target tasks
When Not To Use
- If your bottleneck is decoding memory bandwidth rather than prefilling compute
- When you cannot run any validation samples on your target model
- When you need guaranteed exact output distributions for safety-critical cases
Failure Modes
- Over-aggressive deep-layer pruning on tasks that require deep visual processing (SeedBench example)
- Suboptimal transfer if source and target architectures have very different redundancy patterns
- Conservative policies that reduce tokens instead of operations when minimum-token constraints are mis-set
Core Entities
Models
- LLaVA-1.5-7B
- LLaVA-1.5-13B
- LLaVA-NeXT-7B
- LLaVA-NeXT-13B
- Qwen2.5-VL-7B
- InternVL3-8B
Metrics
- TFLOPs
- prefilling latency (ms)
- KL divergence on first-token output
- relative average performance (Rel. Avg.)
Datasets
- TextVQA
- VQA V2
- GQA
- VizWiz
- ScienceQA
- SeedBench
- MME
- POPE
- MMBench
- MMBench-CN
- AI2D
- ChartQA
- OCRBench
Benchmarks
- 13 multimodal VQA and OCR benchmarks (see datasets list)

