Overview
DOP is a practical, inference-only approach that needs only a small validation set and one-time GPU runs. Gains are demonstrated on multiple models and in real A100 timings, but the method targets only the prefilling stage and visual-token operations.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
DOP cuts prefilling compute and real GPU latency substantially while keeping task accuracy near-original, enabling cheaper inference and higher throughput for multimodal deployments.
Summary TLDR
The paper proposes Depth-wise Operation Pruning (DOP), a method that prunes per-module "operations" (one module processing one token group) so that critical modules get more tokens while redundant ones are skipped. DOP uses a depth-wise constraint and an additive KL-divergence proxy, evaluated on a small validation set, to keep the policy search fast. Across 6 MLLMs and 13 benchmarks, DOP outperforms token-pruning baselines, reduces TFLOPs by up to 86% with ~1% average performance loss on a flagship model, and translates to real GPU latency gains. It needs only small validation runs (25–100 samples) and transfers well between models.
Problem Statement
Existing token reduction methods feed the same visual tokens to every decoder module, ignoring that modules and layers differ in importance. This wastes compute in the decoder. A finer-grained mechanism is needed: one that skips specific module×token computations so that important modules can process more tokens under a fixed compute budget.
Main Contribution
Formulate operation pruning: treat each module×token-group computation as an atomic operation and allow selective skipping.
Introduce DOP: depth-wise pruning + additive KL-divergence proxy to make policy search fast and data-light.
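The two ideas above can be illustrated with a small sketch. This summary does not spell out the exact constraint or search procedure, so the code below is an assumed reading, not the authors' implementation: the depth-wise constraint is taken to mean that each module type processes visual tokens only up to a cutoff layer, and the additive proxy is taken to mean that the divergence of a full policy is approximated by summing per-operation KL divergences measured one at a time. The names `kl_divergence`, `search_depth_policy`, the module labels `attn` and `ffn`, and all numbers are hypothetical.

```python
from itertools import product
from math import log


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def search_depth_policy(divergence, cost, budget):
    """Pick one cutoff depth per module type under a compute budget.

    divergence[m][l]: KL measured when module m skips visual tokens at layer l
    cost[m][l]:       compute spent when module m processes visual tokens at layer l
    Assumption: module m keeps layers [0, k_m) and skips layers [k_m, L).
    Returns (cutoffs, proxy_divergence, used_cost), or None if nothing fits.
    """
    modules = sorted(divergence)
    num_layers = len(divergence[modules[0]])
    best = None
    # The search space is only (L + 1) ** len(modules), so brute force is cheap.
    for cutoffs in product(range(num_layers + 1), repeat=len(modules)):
        used = sum(sum(cost[m][:k]) for m, k in zip(modules, cutoffs))
        if used > budget:
            continue
        # Additive proxy: sum the divergences of every skipped operation.
        proxy = sum(sum(divergence[m][k:]) for m, k in zip(modules, cutoffs))
        if best is None or proxy < best[1]:
            best = (dict(zip(modules, cutoffs)), proxy, used)
    return best


# Toy inputs (made up): deeper layers are cheaper to skip in terms of divergence.
toy_divergence = {"attn": [0.9, 0.4, 0.1, 0.05], "ffn": [0.5, 0.3, 0.2, 0.02]}
toy_cost = {"attn": [1, 1, 1, 1], "ffn": [2, 2, 2, 2]}
policy = search_depth_policy(toy_divergence, toy_cost, budget=6)
```

In practice each `divergence[m][l]` entry would come from comparing the original model's output distribution with the output when that single operation is skipped, averaged over the validation samples. Because the proxy is additive, those measurements are taken once and reused for any budget.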
Key Findings
DOP can cut theoretical FLOPs by 86% while incurring ~1% average performance loss on LLaVA-NeXT-7B.
With no average performance loss, DOP still reduces TFLOPs by 77% on LLaVA-NeXT-7B.
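To put these percentages in perspective, a reduction figure can be converted into the fraction of compute that remains and the corresponding ideal speedup. The helper names below are hypothetical, and the speedup is a theoretical upper bound derived from FLOPs alone; measured GPU latency gains may differ.

```python
def remaining_fraction(reduction_pct):
    """Fraction of the original compute left after a given percentage reduction."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 - reduction_pct / 100.0


def ideal_speedup(reduction_pct):
    """Upper-bound speedup if latency scaled exactly with FLOPs (an assumption)."""
    return 1.0 / remaining_fraction(reduction_pct)


for pct in (86, 77):
    print(f"{pct}% reduction: {remaining_fraction(pct):.2f} of compute remains, "
          f"ideal speedup about {ideal_speedup(pct):.1f}x")
```

An 86% reduction leaves 14% of the original prefilling compute (about 7.1x fewer FLOPs), while a 77% reduction leaves 23% (about 4.3x fewer).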
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| TFLOPs reduction (LLaVA-NeXT-7B) | 86% TFLOPs reduction | original model | ≈1% average performance loss | aggregated across 13 benchmarks | Table 2; Abstract | Table 2 |
| TFLOPs reduction without perf loss (LLaVA-NeXT-7B) | 77% TFLOPs reduction | original model | 0% average performance change | aggregated across 13 benchmarks | Abstract; Results | Abstract |
What To Try In 7 Days
Run DOP optimization on a representative model with 25–100 validation samples to get a one-time pruning policy.
Layer DOP on top of your existing visual token reducer (VisPruner or CDPruner) and measure prefilling latency and end-task metrics.
Test policy transfer: optimize once on a dev model and reuse its measured divergences on sibling models to save time.
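For the latency measurement step above, a minimal timing harness might look like the following. It is a generic sketch, not tied to any DOP release: `run_prefill` stands for whatever callable executes one prefilling pass in your stack, and `time_prefill` and `compare` are hypothetical helper names. On a GPU, kernels run asynchronously, so pass a synchronization function as `sync` (with PyTorch, `torch.cuda.synchronize`) to make the timings meaningful.

```python
import statistics
import time


def time_prefill(run_prefill, warmup=3, iters=10, sync=None):
    """Median wall-clock seconds for one call of run_prefill."""
    for _ in range(warmup):          # warm caches and lazy initialisation
        run_prefill()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_prefill()
        if sync is not None:         # wait for asynchronous device work to finish
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline_s, pruned_s, baseline_score, pruned_score):
    """Summarise the latency saving and the change in the end-task metric."""
    return {
        "latency_reduction_pct": 100.0 * (1.0 - pruned_s / baseline_s),
        "score_delta": pruned_score - baseline_score,
    }
```

Time the baseline and the pruned model on the same inputs and batch size, then feed both medians and both task scores to `compare` so that speed and accuracy are always reported together.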
Risks & Boundaries
Limitations
Only visual-token operations are pruned in the experiments; other token groups are not evaluated extensively
Targets prefilling stage (compute-bound); decoding-stage (memory-bandwidth-bound) gains may be small
When Not To Use
If your bottleneck is decoding memory bandwidth rather than prefilling compute
When you cannot run any validation samples on your target model
Failure Modes
Over-aggressive deep-layer pruning on tasks that require deep visual processing (SeedBench example)
Suboptimal transfer if source and target architectures have very different redundancy patterns

