Prune module-level operations to reallocate tokens and cut MLLM compute by up to 86% with small accuracy loss

Overview

Decision Snapshot: Ready For Pilot

DOP is a practical, inference-only approach that needs a small validation set and one-time GPU runs; gains are demonstrated on multiple models and real A100 timings, but it focuses on prefilling and visual-token ops only.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer

Links

Abstract / PDF / Data

Why It Matters For Business

DOP cuts prefilling compute and real GPU latency substantially while keeping task accuracy near-original, enabling cheaper inference and higher throughput for multimodal deployments.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

The paper proposes Depth-wise Operation Pruning (DOP), a practical method that prunes per-module "operations" (a module processing a token group) so critical modules get more tokens while redundant modules are skipped. DOP uses a depth-wise constraint and an additive KL-divergence proxy on a small validation set to search fast. Across 6 MLLMs and 13 benchmarks, DOP outperforms token-pruning baselines, reduces TFLOPs up to 86% with ~1% average performance loss on a flagship model, and translates to real GPU latency wins. It needs only small validation runs (25–100 samples) and transfers well between models.

Problem Statement

Existing token reduction methods give the same visual tokens to all decoder modules and ignore that different modules and layers vary in importance. This wastes compute in the decoder. We need a fine-grained way to skip specific module×token computations so important modules can process more tokens under a computation budget.

Main Contribution

Formulate operation pruning: treat each module×token-group computation as an atomic operation and allow selective skipping.

Introduce DOP: depth-wise pruning + additive KL-divergence proxy to make policy search fast and data-light.
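The policy search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `search_policy`, the toy numbers, and the exact form of the depth-wise constraint (a deeper module never keeps more visual tokens than a shallower one) are assumptions. The additive proxy is modeled by summing per-module KL values that were each measured in isolation, and the exhaustive loop is only practical for toy sizes.

```python
from itertools import product

def search_policy(kl, flops, levels, budget):
    """Pick a token count per module that minimizes the summed KL proxy.

    kl[m][i]    : KL divergence measured when only module m keeps levels[i] tokens
    flops[m][i] : compute cost of module m when it keeps levels[i] tokens
    levels      : candidate token counts (hypothetical grid)
    budget      : maximum total cost allowed
    """
    best, best_score = None, float("inf")
    for choice in product(range(len(levels)), repeat=len(kl)):
        counts = [levels[i] for i in choice]
        # Assumed depth-wise constraint: token counts never increase with depth.
        if any(a < b for a, b in zip(counts, counts[1:])):
            continue
        cost = sum(flops[m][i] for m, i in enumerate(choice))
        if cost > budget:
            continue
        # Additive proxy: total divergence approximated by the per-module sum.
        score = sum(kl[m][i] for m, i in enumerate(choice))
        if score < best_score:
            best, best_score = counts, score
    return best, best_score

# Toy example: three modules ordered shallow to deep, made-up measurements.
levels = [0, 288, 576]
kl = [[5.0, 1.0, 0.0], [2.0, 0.5, 0.0], [0.1, 0.05, 0.0]]
flops = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
print(search_policy(kl, flops, levels, budget=3))
```

In this toy setup the shallow, sensitive module keeps all of its tokens while the deepest module is skipped entirely, which is the kind of reallocation the summary describes.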

Key Findings

DOP can cut theoretical FLOPs by 86% while incurring ~1% average performance loss on LLaVA-NeXT-7B.

Numbers: 86% TFLOPs reduction; ~1% perf loss

Practical Use: You can sharply reduce compute for prefilling with a minor accuracy drop; use DOP when prefilling compute dominates.

Evidence Ref: Abstract; Table 2

Under no average performance loss, DOP still reduces TFLOPs by 77% on LLaVA-NeXT-7B.

Numbers: 77% TFLOPs reduction; 0% perf loss

Practical Use: At many budgets you can get large compute savings without a measurable accuracy change, so test DOP before more invasive optimizations.

Evidence Ref: Abstract; Results section

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TFLOPs reduction (LLaVA-NeXT-7B)	86% TFLOPs reduction	original model	≈1% average performance loss	aggregated across 13 benchmarks	Table 2; Abstract	Table 2
TFLOPs reduction without perf loss (LLaVA-NeXT-7B)	77% TFLOPs reduction	original model	0% average performance change	aggregated across 13 benchmarks	Abstract; Results	Abstract
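
To put the two reductions in the table into perspective, the short calculation below converts a percentage TFLOPs reduction into a theoretical compute ratio. The helper name `theoretical_speedup` is hypothetical, and the ratio is an upper bound on compute savings only; measured GPU latency gains will generally be smaller than this figure.

```python
def theoretical_speedup(reduction_pct: float) -> float:
    """Ratio of original to remaining compute for a given % reduction."""
    remaining = 1.0 - reduction_pct / 100.0
    if remaining <= 0:
        raise ValueError("reduction must be below 100%")
    return 1.0 / remaining

for pct in (86, 77):
    print(f"{pct}% fewer TFLOPs -> {theoretical_speedup(pct):.1f}x less prefill compute")
```

An 86% reduction leaves 14% of the original prefilling compute (about 7.1x less), and a 77% reduction leaves 23% (about 4.3x less).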

What To Try In 7 Days

Run DOP optimization on a representative model with 25–100 validation samples to get a one-time pruning policy.

Layer DOP on top of your existing visual token reducer (VisPruner or CDPruner) and measure prefilling latency and end-task metrics.

Test policy transfer: optimize once on a dev model and apply the divergences to sibling models to save time.
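
For the validation runs above, the quantity to track is the KL divergence between the original and pruned model on the first generated token. The sketch below shows one way to compute it from raw logits using only the standard library. The function names and the direction of the divergence (original relative to pruned) are assumptions, and obtaining the logits from a real model is left out.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, pruned_logits):
    """KL(P_ref || P_pruned) for one sample's first-token logits."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(pruned_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_first_token_kl(ref_batch, pruned_batch):
    """Average the per-sample KL over a small validation set."""
    pairs = list(zip(ref_batch, pruned_batch))
    return sum(kl_divergence(r, p) for r, p in pairs) / len(pairs)

# Toy logits standing in for 2 validation samples over a 3-token vocabulary.
ref = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
pruned = [[1.1, 1.9, 3.0], [0.5, 0.4, 0.6]]
print(mean_first_token_kl(ref, pruned))
```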

Optimization Features

Token Efficiency

integrates with token pruning methods (VisPruner, CDPruner) and resizing

searches for per-module token allocations (n_v)

Infra Optimization

one-time lightweight validation runs (minutes) on GPU

Model Optimization

operation-level pruning (module×token-group)

depth-wise pruning (prune deeper operations first)

System Optimization

reduces prefilling computational cost and real GPU latency

Training Optimization

no training; uses inference divergence for policy search

Inference Optimization

skips MHA/MLP operations for pruned visual tokens during prefilling

compatible with FlashAttention2 and other attention implementations
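
As a rough illustration of what skipping an operation means, the sketch below runs a module only on the kept tokens and lets the skipped tokens pass through the residual path unchanged. The function name `run_module_on_kept_tokens` and the toy module are hypothetical. How skipped visual tokens are treated inside attention (for example, whether they remain visible as keys and values) is not specified in this summary, so the sketch covers only a per-token, MLP-style module.

```python
def run_module_on_kept_tokens(hidden, keep_idx, module):
    """Apply `module` to the kept tokens only; skipped tokens are left as-is.

    hidden   : list of per-token vectors (lists of floats)
    keep_idx : indices of tokens this module should process
    module   : callable mapping a list of vectors to a list of update vectors
    """
    kept = [hidden[i] for i in keep_idx]
    updates = module(kept)                      # compute only for kept tokens
    out = [list(h) for h in hidden]             # skipped tokens pass through
    for i, upd in zip(keep_idx, updates):
        out[i] = [h + u for h, u in zip(hidden[i], upd)]  # residual add
    return out

# Toy module: the update for each token is twice its input vector.
double = lambda xs: [[2 * v for v in x] for x in xs]
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(run_module_on_kept_tokens(hidden, keep_idx=[0, 2], module=double))
```

The compute saved is proportional to the number of tokens left out of `keep_idx`, since the module never sees them.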

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/open-compass/VLMEvalKit

https://github.com/Theia-4869/VisPruner

https://github.com/Theia-4869/CDPruner

Risks & Boundaries

Limitations

Only prunes visual-token operations in experiments; other token groups not evaluated extensively

Targets prefilling stage (compute-bound); decoding-stage (memory-bandwidth-bound) gains may be small

When Not To Use

If your bottleneck is decoding memory bandwidth rather than prefilling compute

When you cannot run any validation samples on your target model

Failure Modes

Over-aggressive deep-layer pruning on tasks that require deep visual processing (SeedBench example)

Suboptimal transfer if source and target architectures have very different redundancy patterns

Core Entities

Models

LLaVA-1.5-7B, LLaVA-1.5-13B, LLaVA-NeXT-7B, LLaVA-NeXT-13B, Qwen2.5-VL-7B, InternVL3-8B

Metrics

TFLOPs, prefilling latency (ms), KL divergence on first-token output, relative average performance (Rel. Avg.)

Datasets

TextVQA, VQA V2, GQA, VizWiz, ScienceQA, SeedBench, MME, POPE, MMBench, MMBench-CN, AI2D, ChartQA, OCRBench

Benchmarks

13 multimodal VQA and OCR benchmarks (see datasets list)

