Prune module-level operations to reallocate tokens and cut MLLM compute by up to 86% with small accuracy loss

June 24, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer

Links

Abstract / PDF

Why It Matters For Business

DOP cuts prefilling compute and real GPU latency substantially while keeping task accuracy near-original, enabling cheaper inference and higher throughput for multimodal deployments.

Summary TLDR

The paper proposes Depth-wise Operation Pruning (DOP), a practical method that prunes per-module "operations" (one module processing one token group) so that critical modules get more tokens while redundant modules are skipped. DOP combines a depth-wise constraint with an additive KL-divergence proxy, measured on a small validation set, to keep the policy search fast. Across 6 MLLMs and 13 benchmarks, DOP outperforms token-pruning baselines, reduces TFLOPs by up to 86% with ~1% average performance loss on LLaVA-NeXT-7B, and translates to real GPU latency wins. It needs only small validation runs (25–100 samples) and transfers well between models.

Problem Statement

Existing token reduction methods give the same visual tokens to all decoder modules and ignore that different modules and layers vary in importance. This wastes compute in the decoder. We need a fine-grained way to skip specific module×token computations so important modules can process more tokens under a computation budget.

Main Contribution

Formulate operation pruning: treat each module×token-group computation as an atomic operation and allow selective skipping.

Introduce DOP: depth-wise pruning + additive KL-divergence proxy to make policy search fast and data-light.

Show broad gains: better performance-efficiency tradeoffs across 6 MLLMs and 13 benchmarks, with low optimization cost and transferability.
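The second contribution can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes a policy is one cutoff depth per module type (visual-token operations run only in layers below the cutoff), and the divergence table, cost weights, and function names are hypothetical stand-ins for values that would be measured on the small validation set.

```python
from itertools import product

NUM_LAYERS = 4  # toy decoder depth (assumption)

# Hypothetical single-parameter divergences: SINGLE_DIV[m][d] is the KL
# divergence observed when only module m is restricted to process visual
# tokens in the first d layers (d == NUM_LAYERS means nothing is pruned).
SINGLE_DIV = {
    "mha": {0: 0.90, 1: 0.40, 2: 0.10, 3: 0.02, 4: 0.0},
    "mlp": {0: 1.20, 1: 0.70, 2: 0.30, 3: 0.05, 4: 0.0},
}

# Hypothetical relative cost of one layer of each module on the visual tokens.
COST = {"mha": 1.0, "mlp": 2.0}


def policy_cost(policy):
    """Compute cost of a policy: each module pays for the layers it still runs."""
    return sum(COST[m] * depth for m, depth in policy.items())


def additive_proxy(policy):
    """Approximate the joint divergence by summing single-parameter divergences."""
    return sum(SINGLE_DIV[m][depth] for m, depth in policy.items())


def search(budget):
    """Exhaustively pick the cutoff pair with the lowest proxy under the budget."""
    best = None
    for d_mha, d_mlp in product(range(NUM_LAYERS + 1), repeat=2):
        policy = {"mha": d_mha, "mlp": d_mlp}
        if policy_cost(policy) > budget:
            continue
        score = additive_proxy(policy)
        if best is None or score < best[0]:
            best = (score, policy)
    return best


score, policy = search(budget=6)
print(policy, round(score, 2))
```

Because the proxy is additive, each table entry is measured once and every candidate policy is then scored by lookup, with no further model runs. With a budget of 6 out of a full cost of 12, the toy search keeps both modules for the first 2 layers, with a proxy score of 0.40.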

Key Findings

DOP can cut theoretical FLOPs by 86% while incurring ~1% average performance loss on LLaVA-NeXT-7B.

Numbers: 86% TFLOPs reduction; ~1% perf loss

Under no average performance loss, DOP still reduces TFLOPs by 77% on LLaVA-NeXT-7B.

Numbers: 77% TFLOPs reduction; 0% perf loss

DOP preserves up to 7% more task performance than the prior state-of-the-art token-pruning baseline on evaluated benchmarks.

Numbers: up to 7% Rel. Avg. improvement vs CDPruner

The additive proxy (the sum of single-parameter divergences) ranks pruning policies similarly to the true joint divergence.

Numbers: Spearman ρ ≥ 0.799 (p < 1e-48)

Optimization is fast and data-light: good policies found in minutes with 25–100 samples.

Numbers: 2 min with 25 samples; 5–18 min for 100 samples across models

DOP policies transfer across models with small loss (≤0.7% Rel. Avg. difference).

Numbers: transfer vs direct within 0.7% Rel. Avg.
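The rank-agreement finding for the additive proxy can be checked with a Spearman correlation between proxy scores and true joint divergences. The sketch below uses only the standard library; the five policy scores are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy scores for five candidate policies (illustrative values only).
proxy_scores = [0.10, 0.25, 0.40, 0.70, 0.95]
joint_divergences = [0.12, 0.22, 0.55, 0.50, 1.10]

print(round(spearman(proxy_scores, joint_divergences), 3))  # 0.9
```

In this toy case two neighbouring policies swap places, which lowers ρ from 1.0 to 0.9 while leaving the best and worst policies in the same positions.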

Results

TFLOPs reduction (LLaVA-NeXT-7B)

Value: 86% TFLOPs reduction

Baseline: original model

TFLOPs reduction without perf loss (LLaVA-NeXT-7B)

Value: 77% TFLOPs reduction

Baseline: original model

Real GPU prefilling latency reduction

Value: 83% prefilling CUDA latency reduction

Baseline: CDPruner under same budget

Performance advantage vs baseline

Value: up to 7% relative average performance preserved

Baseline: CDPruner / state-of-the-art token pruners

Optimization cost

Value: 2–18 minutes (1 A100) depending on sample count

Baseline: one-time optimization

Who Should Care

What To Try In 7 Days

Run DOP optimization on a representative model with 25–100 validation samples to get a one-time pruning policy.

Layer DOP on top of your existing visual token reducer (VisPruner or CDPruner) and measure prefilling latency and end-task metrics.

Test policy transfer: optimize once on a dev model and apply the divergences to sibling models to save time.

Optimization Features

Token Efficiency

  • integrates with token pruning methods (VisPruner, CDPruner) and resizing
  • searches for per-module token allocations (n_v)

Infra Optimization

  • one-time lightweight validation runs (minutes) on GPU

Model Optimization

  • operation-level pruning (module×token-group)
  • depth-wise pruning (prune deeper operations first)

System Optimization

  • reduces prefilling computational cost and real GPU latency

Training Optimization

  • no training; uses inference divergence for policy search

Inference Optimization

  • skips MHA/MLP operations for pruned visual tokens during prefilling
  • compatible with FlashAttention2 and other attention implementations
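Skipping an operation for a token group can be pictured as letting those tokens ride the residual path unchanged. The sketch below is a simplified, assumed model of that idea: it covers only a token-wise module (a stand-in for the MLP), uses plain Python lists instead of tensors, and all names (`toy_mlp`, `apply_operation`, `prefill`, `mlp_cutoff`) are hypothetical. How skipped tokens interact with attention is not covered here.

```python
def toy_mlp(vec):
    """Stand-in for an MLP block: any token-wise function would do."""
    return [2.0 * v for v in vec]


def apply_operation(hidden, module, active):
    """Residual update where `module` runs only on tokens flagged active."""
    out = []
    for tok, is_active in zip(hidden, active):
        if is_active:
            delta = module(tok)
            out.append([h + d for h, d in zip(tok, delta)])
        else:
            out.append(list(tok))  # skipped: identity through the residual
    return out


def prefill(hidden, is_visual, num_layers, mlp_cutoff):
    """Run the toy layers; visual tokens use the MLP only below `mlp_cutoff`."""
    module_calls = 0
    for layer in range(num_layers):
        active = [(not vis) or layer < mlp_cutoff for vis in is_visual]
        hidden = apply_operation(hidden, toy_mlp, active)
        module_calls += sum(active)
    return hidden, module_calls


hidden = [[1.0], [1.0], [1.0]]
is_visual = [True, True, False]  # two visual tokens, one text token
out, calls = prefill(hidden, is_visual, num_layers=3, mlp_cutoff=1)
print(out, calls)  # [[3.0], [3.0], [27.0]] 5
```

In this toy run the text token passes through all three layers while the visual tokens are processed only in the first, so the module is called 5 times instead of 9.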

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only prunes visual-token operations in experiments; other token groups not evaluated extensively
  • Targets prefilling stage (compute-bound); decoding-stage (memory-bandwidth-bound) gains may be small
  • Additive proxy preserves ranking but can swap a few low-rank policies
  • Requires access to a small validation set representative of target tasks

When Not To Use

  • If your bottleneck is decoding memory bandwidth rather than prefilling compute
  • When you cannot run any validation samples on your target model
  • When you need guaranteed exact output distributions for safety-critical cases

Failure Modes

  • Over-aggressive deep-layer pruning on tasks that require deep visual processing (SeedBench example)
  • Suboptimal transfer if source and target architectures have very different redundancy patterns
  • Conservative policies that reduce tokens instead of operations when minimal token constraints are mis-set

Core Entities

Models

  • LLaVA-1.5-7B
  • LLaVA-1.5-13B
  • LLaVA-NeXT-7B
  • LLaVA-NeXT-13B
  • Qwen2.5-VL-7B
  • InternVL3-8B

Metrics

  • TFLOPs
  • prefilling latency (ms)
  • KL divergence on first-token output
  • relative average performance (Rel. Avg.)
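Two of the metrics above are easy to make concrete. The sketch below assumes that the KL divergence is taken from the original model's first-token distribution to the pruned model's, and that Rel. Avg. is the mean of per-benchmark pruned/original score ratios expressed as a percentage; both conventions are assumptions, since the summary does not spell them out, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def rel_avg(pruned_scores, original_scores):
    """Mean per-benchmark ratio of pruned to original score, in percent."""
    ratios = [p / o for p, o in zip(pruned_scores, original_scores)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy first-token logits over a 3-word vocabulary (illustrative values only).
original = softmax([2.0, 1.0, 0.1])
pruned = softmax([1.8, 1.1, 0.2])

print(kl_divergence(original, pruned) > 0.0)     # True
print(rel_avg([50.0, 80.0], [100.0, 80.0]))      # 75.0
```

A divergence near zero means the pruned model's first-token distribution is close to the original's, which is the signal the policy search relies on without needing task labels.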

Datasets

  • TextVQA
  • VQA V2
  • GQA
  • VizWiz
  • ScienceQA
  • SeedBench
  • MME
  • POPE
  • MMBench
  • MMBench-CN
  • AI2D
  • ChartQA
  • OCRBench

Benchmarks

  • 13 multimodal VQA and OCR benchmarks (see datasets list)