Prune module-level operations to reallocate tokens and cut MLLM compute by up to 86% with small accuracy loss

June 24, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

DOP is a practical, inference-only approach that needs a small validation set and one-time GPU runs; gains are demonstrated on multiple models and real A100 timings, but it focuses on prefilling and visual-token ops only.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer

Links

Abstract / PDF / Data

Why It Matters For Business

DOP cuts prefilling compute and real GPU latency substantially while keeping task accuracy near-original, enabling cheaper inference and higher throughput for multimodal deployments.

Who Should Care

Summary TLDR

The paper proposes Depth-wise Operation Pruning (DOP), a practical method that prunes per-module "operations" (a module processing a token group) so critical modules get more tokens while redundant modules are skipped. DOP uses a depth-wise constraint and an additive KL-divergence proxy on a small validation set to search fast. Across 6 MLLMs and 13 benchmarks, DOP outperforms token-pruning baselines, reduces TFLOPs up to 86% with ~1% average performance loss on a flagship model, and translates to real GPU latency wins. It needs only small validation runs (25–100 samples) and transfers well between models.

Problem Statement

Existing token reduction methods give the same visual tokens to all decoder modules and ignore that different modules and layers vary in importance. This wastes compute in the decoder. We need a fine-grained way to skip specific module×token computations so important modules can process more tokens under a computation budget.

Main Contribution

Formulate operation pruning: treat each module×token-group computation as an atomic operation and allow selective skipping.

Introduce DOP: depth-wise pruning + additive KL-divergence proxy to make policy search fast and data-light.
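A minimal sketch of how such a search could look, under clearly stated assumptions: a policy is a list of visual-token counts (n_v) per module in depth order, the depth-wise constraint is read as "a deeper module never receives more tokens than a shallower one", and the additive proxy scores a policy as the sum of per-module divergences measured in isolation. The divergence table, the cost model, and every function name here are hypothetical. The exhaustive loop only suits toy sizes and is not the paper's search procedure.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

def additive_proxy(policy: Sequence[int], div_table: List[Dict[int, float]]) -> float:
    """Assumed proxy: total divergence is the sum of per-module divergences."""
    return sum(div_table[m][n] for m, n in enumerate(policy))

def policy_cost(policy: Sequence[int], cost_per_token: Sequence[float]) -> float:
    """Toy cost model: each module costs (tokens processed) x (cost per token)."""
    return sum(n * c for n, c in zip(policy, cost_per_token))

def is_depthwise(policy: Sequence[int]) -> bool:
    """Assumed depth-wise constraint: token counts never increase with depth."""
    return all(a >= b for a, b in zip(policy, policy[1:]))

def search_policy(levels: Sequence[int], div_table: List[Dict[int, float]],
                  cost_per_token: Sequence[float], budget: float) -> Tuple[Tuple[int, ...], float]:
    """Brute-force the lowest-proxy policy that is depth-wise and within budget."""
    best = None
    for policy in product(levels, repeat=len(div_table)):
        if not is_depthwise(policy):
            continue
        if policy_cost(policy, cost_per_token) > budget:
            continue
        score = additive_proxy(policy, div_table)
        if best is None or score < best[1]:
            best = (policy, score)
    if best is None:
        raise ValueError("no policy satisfies the budget")
    return best

# Made-up divergences for three modules (shallow to deep); not measured values.
DIV_TABLE = [
    {100: 0.0, 50: 0.40, 0: 0.90},  # shallow module: pruning hurts a lot
    {100: 0.0, 50: 0.10, 0: 0.50},
    {100: 0.0, 50: 0.02, 0: 0.05},  # deep module: mostly redundant
]
best_policy, best_score = search_policy([0, 50, 100], DIV_TABLE, [1.0, 1.0, 1.0], budget=150.0)
print(best_policy, round(best_score, 3))
```

With these toy numbers the search keeps all tokens in the shallow module and drops them entirely in the deepest one, which is the reallocation behaviour the formulation is meant to allow.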

Key Findings

DOP can cut theoretical FLOPs by 86% while incurring ~1% average performance loss on LLaVA-NeXT-7B.

Numbers: 86% TFLOPs reduction; ~1% perf loss

Practical Use: You can sharply reduce compute for prefilling with minor accuracy drop; use DOP when prefilling compute dominates.

Evidence Ref: Abstract; Table 2

Under no average performance loss, DOP still reduces TFLOPs by 77% on LLaVA-NeXT-7B.

Numbers: 77% TFLOPs reduction; 0% perf loss

Practical Use: At many budgets you can get big compute savings without measurable accuracy change; test DOP first before more invasive optimizations.

Evidence Ref: Abstract; Results section
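To put the two reduction figures in perspective, the arithmetic below converts a fractional FLOPs reduction into the remaining compute and the implied theoretical compute ratio. The function name is hypothetical, and the ratio covers theoretical prefilling FLOPs only; measured latency depends on hardware and kernels and need not match it.

```python
def theoretical_compute_ratio(reduction: float) -> float:
    """Ratio of original to remaining FLOPs for a fractional reduction in [0, 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

for reduction in (0.86, 0.77):  # the two figures quoted in the findings
    remaining = 1.0 - reduction
    ratio = theoretical_compute_ratio(reduction)
    print(f"{reduction:.0%} reduction -> {remaining:.0%} of FLOPs remain, ~{ratio:.1f}x fewer FLOPs")
```

An 86% reduction leaves 14% of the original FLOPs (about 7.1x fewer), and a 77% reduction leaves 23% (about 4.3x fewer).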

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
TFLOPs reduction (LLaVA-NeXT-7B) | 86% TFLOPs reduction | original model | ≈1% average performance loss | aggregated across 13 benchmarks | Table 2; Abstract | Table 2
TFLOPs reduction without perf loss (LLaVA-NeXT-7B) | 77% TFLOPs reduction | original model | 0% average performance change | aggregated across 13 benchmarks | Abstract; Results | Abstract

What To Try In 7 Days

Run DOP optimization on a representative model with 25–100 validation samples to get a one-time pruning policy.

Plug DOP over your existing visual token reducer (VisPruner or CDPruner) and measure prefilling latency and end-task metrics.

Test policy transfer: optimize once on a dev model and apply the divergences to sibling models to save time.
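For the validation step, a minimal sketch of the divergence measurement could look like the following. It assumes you can extract first-token logits from both the original and the pruned model for each validation sample, and it assumes the direction KL(original || pruned); neither the function names nor the direction are confirmed by the summary above.

```python
import math
from typing import List, Sequence

def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def first_token_kl(ref_logits: Sequence[float], pruned_logits: Sequence[float]) -> float:
    """KL(ref || pruned) between first-token distributions (assumed direction)."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(pruned_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_divergence(ref_batch: Sequence[Sequence[float]],
                    pruned_batch: Sequence[Sequence[float]]) -> float:
    """Average the per-sample divergence over a small validation set."""
    if len(ref_batch) != len(pruned_batch) or not ref_batch:
        raise ValueError("batches must be non-empty and aligned")
    total = sum(first_token_kl(r, p) for r, p in zip(ref_batch, pruned_batch))
    return total / len(ref_batch)

# Toy logits standing in for real model outputs (hypothetical values).
ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
pruned = [[1.8, 1.1, 0.2], [0.4, 0.9, 2.5]]
print(mean_divergence(ref, pruned))
```

A divergence near zero suggests the pruned configuration barely changes the first-token distribution on your validation samples; end-task metrics should still be checked separately.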

Optimization Features

Token Efficiency
integrates with token pruning methods (VisPruner, CDPruner) and resizing
searches for per-module token allocations (n_v)
Infra Optimization
one-time lightweight validation runs (minutes) on GPU
Model Optimization
operation-level pruning (module×token-group)
depth-wise pruning (prune deeper operations first)
System Optimization
reduces prefilling computational cost and real GPU latency
Training Optimization
no training; uses inference divergence for policy search
Inference Optimization
skips MHA/MLP operations for pruned visual tokens during prefilling
compatible with FlashAttention2 and other attention implementations
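As an illustration of what skipping an operation for pruned tokens can mean, the sketch below applies a token-wise module only to kept tokens and passes the rest through unchanged. It assumes a residual block in which a skipped token simply keeps its hidden state; the function name and the toy module are hypothetical stand-ins for an MLP. Attention needs extra care with keys and values and is not shown.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def apply_with_skips(hidden: Sequence[Vector],
                     module: Callable[[List[Vector]], List[Vector]],
                     keep: Sequence[bool]) -> List[Vector]:
    """Run `module` on kept tokens only; skipped tokens bypass it via the residual path."""
    kept_idx = [i for i, k in enumerate(keep) if k]
    # Only the kept rows are gathered, so the module's cost scales with len(kept_idx).
    updates = module([list(hidden[i]) for i in kept_idx])
    out = [list(row) for row in hidden]  # residual stream, copied
    for i, delta in zip(kept_idx, updates):
        out[i] = [h + d for h, d in zip(hidden[i], delta)]
    return out

def toy_mlp(rows: List[Vector]) -> List[Vector]:
    """Token-wise stand-in for an MLP: doubles every feature."""
    return [[2.0 * x for x in row] for row in rows]

hidden_states = [[1.0], [2.0], [3.0]]
keep_mask = [True, False, True]  # the middle (visual) token is pruned for this module
print(apply_with_skips(hidden_states, toy_mlp, keep_mask))
```

Because only the kept rows are passed to the module, the work done in this sketch scales with the number of kept tokens rather than the full sequence length.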

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only prunes visual-token operations in experiments; other token groups not evaluated extensively

Targets prefilling stage (compute-bound); decoding-stage (memory-bandwidth-bound) gains may be small

When Not To Use

If your bottleneck is decoding memory bandwidth rather than prefilling compute

When you cannot run any validation samples on your target model
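One way to check the first condition is to estimate how much of your end-to-end latency is prefilling and apply Amdahl's law. The sketch below is generic arithmetic, not something taken from the paper; the function name and the example fractions are hypothetical, and the prefill share should come from your own profiling.

```python
def end_to_end_speedup(prefill_fraction: float, prefill_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the prefill share gets faster."""
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    if prefill_speedup <= 0.0:
        raise ValueError("prefill_speedup must be positive")
    return 1.0 / ((1.0 - prefill_fraction) + prefill_fraction / prefill_speedup)

# Hypothetical workloads: prefill-heavy (short answers) vs decode-heavy (long generations).
for fraction in (0.9, 0.5, 0.1):
    overall = end_to_end_speedup(fraction, prefill_speedup=7.0)
    print(f"prefill share {fraction:.0%} -> overall speedup ~{overall:.2f}x")
```

If prefilling is only 10% of total latency, even a 7x faster prefill yields roughly a 1.09x end-to-end gain, which is the situation the first bullet warns about.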

Failure Modes

Over-aggressive deep-layer pruning on tasks that require deep visual processing (SeedBench example)

Suboptimal transfer if source and target architectures have very different redundancy patterns

Core Entities

Models

LLaVA-1.5-7B, LLaVA-1.5-13B, LLaVA-NeXT-7B, LLaVA-NeXT-13B, Qwen2.5-VL-7B, InternVL3-8B

Metrics

TFLOPs, prefilling latency (ms), KL divergence on first-token output, relative average performance (Rel. Avg.)

Datasets

TextVQA, VQA V2, GQA, VizWiz, ScienceQA, SeedBench, MME, POPE, MMBench, MMBench-CN, AI2D, ChartQA, OCRBench

Benchmarks

13 multimodal VQA and OCR benchmarks (see datasets list)
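The Rel. Avg. metric is not defined in this summary. A common reading, assumed here, is the mean over benchmarks of the pruned model's score divided by the original model's score, expressed as a percentage. The sketch below uses that assumed definition with placeholder benchmark names and made-up scores; check the paper for the exact formula before comparing numbers.

```python
from typing import Dict

def relative_average(pruned: Dict[str, float], original: Dict[str, float]) -> float:
    """Assumed Rel. Avg.: mean of per-benchmark pruned/original ratios, in percent."""
    if set(pruned) != set(original) or not original:
        raise ValueError("both score tables must cover the same non-empty benchmark set")
    ratios = [pruned[name] / original[name] for name in sorted(original)]
    return 100.0 * sum(ratios) / len(ratios)

# Placeholder benchmarks and scores, for illustration only.
original_scores = {"bench_a": 50.0, "bench_b": 80.0}
pruned_scores = {"bench_a": 49.0, "bench_b": 80.0}
print(f"Rel. Avg. = {relative_average(pruned_scores, original_scores):.1f}%")
```

Under this reading, "~1% average performance loss" would correspond to a Rel. Avg. of about 99%.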