A simple, efficient multimodal LLM that boosts high‑res image and long‑video handling with token merging and visual experts

June 26, 2024 · 8 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Qi She, Junwen Pan, Xin Wan, Rui Zhang, Dawei Lu, Kai Huang

Why It Matters For Business

MammothModa gives competitive multimodal accuracy while cutting visual token compute and inference time, making it practical for products needing high‑res image, OCR, document VQA, or long‑video understanding.

Summary TLDR

MammothModa is a multimodal LLM designed to handle high‑resolution images and long videos without hurting language skills. Key ideas: dynamic global/local image splitting (GLHR), a minimalist Visual Merger that mean‑pools visual features to cut visual token counts, shared Frame Position IDs for long videos, and small Visual Expert (VE) modules inside the LLM so vision fine‑tuning does not degrade text ability. Ablations show large gains on OCR and document VQA, a ~1.3–1.4x inference speedup from Visual Merger, and competitive leaderboard scores (average ~61.2 on major multimodal suites).

Problem Statement

Multimodal LLMs struggle to combine detailed high‑resolution visual inputs and long videos with complex language understanding. High resolution and many frames create huge numbers of visual tokens, causing big compute costs and risking degradation of the LLM's text skills during vision fine‑tuning.

Main Contribution

Global‑Local High‑Resolution splitting (GLHR) to preserve fine detail by dynamically splitting images into 336×336 patches.

Visual Merger: a lightweight mean‑pooling module that reduces visual token count and speeds up inference with minimal model changes.

Shared Frame Position IDs (FPID) to compress positional embeddings for long videos and avoid positional interpolation.

Visual Expert (VE) modules inserted into LLM layers to process visual tokens and protect textual capabilities during vision fine‑tuning.

A three‑phase training recipe (vision‑language alignment, multi‑task pretraining, supervised fine‑tuning) and curated bilingual multimodal data to reduce hallucinations.
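The global/local split in the first item can be sketched as a tile-grid computation. This is only an illustration under stated assumptions: the 336-pixel tile size comes from the summary, while `choose_grid`, `glhr_views`, the `max_tiles=12` cap, and the rule for shrinking the grid are hypothetical choices, not the authors' procedure (the summary names a DS-12 setting without defining it).

```python
import math

TILE = 336  # patch side length given in the summary


def choose_grid(width, height, max_tiles=12):
    """Return (cols, rows) of TILE-sized patches for an image, capped at max_tiles."""
    if max_tiles < 1:
        raise ValueError("max_tiles must be at least 1")
    cols = max(1, math.ceil(width / TILE))
    rows = max(1, math.ceil(height / TILE))
    # Drop one column or row at a time (whichever count is larger,
    # columns on ties) until the grid fits the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows


def glhr_views(width, height, max_tiles=12):
    """Describe one global view plus the local tile boxes on a resized canvas."""
    cols, rows = choose_grid(width, height, max_tiles)
    # Boxes are (left, top, right, bottom) on a canvas of cols*TILE by rows*TILE.
    local = [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]
    return {
        "global": (TILE, TILE),                # whole image resized to one tile
        "canvas": (cols * TILE, rows * TILE),  # size to resize the image to before cropping
        "local": local,
    }
```

In this sketch the image would be resized once to `canvas` and cropped into the `local` boxes, while the `global` view keeps a low-resolution copy of the whole image for context.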

Key Findings

Dynamic splitting at high equivalent resolution (DS-12) substantially improves fine‑grained and document tasks.

Numbers: Avg +45.0; OCRBench +105; DocVQA +28.83 (vs. Resize)

Visual Merger (mean pooling) reduces test time at the cost of a small drop in the average score.

Numbers: Test time 398s -> 298s (window 3), speed-up 1.34; Avg 59.3 -> 56.78
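As a quick check of the reported figures, the speed-up is the ratio of the two test times, and the accuracy cost is the difference of the two averages:

```latex
\[
\text{speed-up} = \frac{398\,\text{s}}{298\,\text{s}} \approx 1.34,
\qquad
\Delta\text{Avg} = 56.78 - 59.3 = -2.52
\]
```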

Shared Frame Position IDs collapse the position IDs of all tokens in a frame into one ID per frame, with minimal performance loss overall and small gains on some metrics.

Numbers: Pos IDs 4320 -> 30 (30 frames); MME +2.68; Avg -0.48
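The position-ID reduction can be illustrated with a short sketch. The function names are hypothetical, and the figure of 144 tokens per frame is inferred here from 4320 / 30 rather than stated in the summary.

```python
def sequential_position_ids(num_frames, tokens_per_frame, start=0):
    """Baseline: every visual token gets its own position ID."""
    return [start + i for i in range(num_frames * tokens_per_frame)]


def shared_frame_position_ids(num_frames, tokens_per_frame, start=0):
    """Shared scheme: all tokens of a frame reuse that frame's single position ID."""
    return [
        start + frame
        for frame in range(num_frames)
        for _ in range(tokens_per_frame)
    ]


# 30 frames with an assumed 144 tokens each (4320 / 30).
baseline = sequential_position_ids(30, 144)
shared = shared_frame_position_ids(30, 144)
distinct_baseline = len(set(baseline))  # 4320 distinct IDs
distinct_shared = len(set(shared))      # 30 distinct IDs
```

The token count is unchanged in both cases; only the range of position IDs shrinks, which is what lets long videos stay inside the position range the LLM was trained on.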

Fine‑tuning on vision data degrades text benchmarks; Visual Experts mitigate that and boost vision scores.

Numbers: FT drops MMLU -2.8, CMMLU -7.6, GSM8K -12.0; FT w/ VE gains MME +131.9, MMVet +6.2

MammothModa ranks among top multimodal models on public leaderboards.

Numbers: Average score 61.2; MMBench 81.04; MMStar 56.27

Results

Average (leaderboards)

Value: 61.2

Baseline: GPT-4o 69.9 (top listed)

MMBench

Value: 81.04

MMStar

Value: 56.27

OCRBench (ablation)

Value: Increase +105

Baseline: Resize

DocVQA (ablation)

Value: Increase +28.83

Baseline: Resize

Test Time Cost (Visual Merger)

Value: 398s -> 298s

Baseline: no merge

FT language degradation (text-only -> FT)

Value: MMLU 63.2 -> 60.4; GSM8K 42.5 -> 30.5

Baseline: text-only

FT w/ VE visual gain

Value: MME +131.9; MMVet +6.2

Baseline: FT without VE

Who Should Care

What To Try In 7 Days

Add a simple mean‑pool Visual Merger to your vision→LLM pipeline to reduce visual tokens and test latency.

Implement GLHR: split large images into 336×336 patches for better OCR and document understanding.

Use per‑frame shared position IDs when ingesting many frames to avoid costly positional interpolation.
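For the first item, a windowed mean pool is small enough to prototype in pure Python. This is a minimal sketch, assuming the visual features arrive as an H×W grid of equal-length vectors; `mean_pool_tokens` and the nested-list layout are illustrative choices, not the authors' implementation.

```python
def mean_pool_tokens(grid, window):
    """Average non-overlapping window x window blocks of a token grid.

    grid is a list of rows, each row a list of feature vectors (lists of floats).
    The output grid has (H / window) x (W / window) tokens.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid sides must be divisible by the window size")
    dim = len(grid[0][0])
    area = window * window
    pooled = []
    for top in range(0, height, window):
        pooled_row = []
        for left in range(0, width, window):
            acc = [0.0] * dim
            # Sum every feature vector inside the current window.
            for y in range(top, top + window):
                for x in range(left, left + window):
                    for c in range(dim):
                        acc[c] += grid[y][x][c]
            pooled_row.append([value / area for value in acc])
        pooled.append(pooled_row)
    return pooled


# A 4x4 grid of 1-dimensional tokens pooled with window 2 gives a 2x2 grid.
example_grid = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]
example_pooled = mean_pool_tokens(example_grid, 2)
```

A window of size w cuts the visual token count by a factor of w², so the window size trades latency directly against spatial detail.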

Optimization Features

Token Efficiency

  • GLHR splits to increase effective resolution without exploding tokens
  • Visual Merger reduces token count via windowed mean pooling

Model Optimization

  • Visual Merger: spatial mean pooling to reduce tokens
  • Visual Expert modules isolate vision processing
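The isolation idea behind Visual Experts can be sketched as per-token routing by modality. This is a hedged illustration only: the summary does not say which sublayer the experts replace, so the choice of a feed-forward function here, and the names `apply_with_visual_expert`, `text_ffn`, and `visual_ffn`, are assumptions.

```python
def apply_with_visual_expert(hidden_states, is_visual, text_ffn, visual_ffn):
    """Send visual tokens through the expert and text tokens through the original path.

    hidden_states: list of per-token feature vectors.
    is_visual: list of booleans, True where the token came from the vision encoder.
    text_ffn / visual_ffn: callables mapping one feature vector to another.
    """
    if len(hidden_states) != len(is_visual):
        raise ValueError("hidden_states and is_visual must have the same length")
    return [
        visual_ffn(h) if visual else text_ffn(h)
        for h, visual in zip(hidden_states, is_visual)
    ]


# Toy stand-ins for the two paths (hypothetical, for illustration only).
def toy_text_ffn(h):
    return [2.0 * x for x in h]


def toy_visual_ffn(h):
    return [x + 1.0 for x in h]


routed = apply_with_visual_expert(
    [[1.0], [2.0], [3.0]],
    [False, True, False],
    toy_text_ffn,
    toy_visual_ffn,
)
```

Because text tokens never pass through the expert, the text path's weights can be left untouched during vision fine-tuning, which is the mechanism the summary credits for preserving language scores.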

System Optimization

  • Frame-feature stitching combined with shared FPID to compress temporal position IDs

Training Optimization

  • Three‑phase training: alignment, multi‑task pretraining, supervised fine‑tuning
  • Layer‑wise LR decay on ViT to preserve pretrained features
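Layer-wise learning-rate decay is commonly implemented by scaling the rate geometrically from the last encoder layer down to the first. The sketch below shows that common form; the function name and the example values are assumptions, since the summary does not give the authors' decay factor or schedule.

```python
def layerwise_learning_rates(base_lr, decay, num_layers):
    """Return one learning rate per layer, lowest for the earliest layer.

    The last layer trains at base_lr; each earlier layer is scaled by
    another factor of decay (0 < decay <= 1).
    """
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    return [
        base_lr * decay ** (num_layers - 1 - layer)
        for layer in range(num_layers)
    ]


# Example with hypothetical values: 3 layers, base rate 1e-4, decay 0.5.
example_lrs = layerwise_learning_rates(1e-4, 0.5, 3)
```

Early layers hold the most general pretrained features, so giving them the smallest updates is what protects those features during fine-tuning.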

Inference Optimization

  • Dynamic pooling: the pooling window used at test time can differ from the one used in training, to speed up inference
  • Shared FPID to avoid positional interpolation overhead

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No code or dataset release mentioned, making reproduction hard.
  • Large pooling windows in the Visual Merger reduce accuracy; choose the window size carefully.
  • Shared FPID has small negative effects on some benchmarks (e.g., MMVet -1.9).
  • Leaderboard comparisons span many models; some listed baselines (GPT-4 variants) are closed services, so their results serve only as reference points.

When Not To Use

  • When exact per‑pixel spatial information is critical and pooling would lose needed detail.
  • If you require open datasets or code for compliance or replication.

Failure Modes

  • Over‑pooling in Visual Merger can harm fine visual detail and some benchmark scores.
  • Shared FPID may slightly degrade some specialized vision metrics.
  • VE modules could add complexity and parameters; improper placement might not fully prevent language degradation.

Core Entities

Models

  • MammothModa
  • ViT (vision transformer)
  • Visual Merger
  • Visual Expert (VE)
  • LLaVA (baseline referenced)

Metrics

  • Average
  • MME
  • MMB-EN
  • MMB-CN
  • OCRBench
  • DocVQA
  • MMVet
  • MMLU
  • CMMLU
  • CEVAL
  • GSM8K
  • Test Time Cost (s)
  • Speed up

Datasets

  • curated bilingual multimodal dataset (authors)
  • caption datasets (used in vision-language alignment)

Benchmarks

  • MMBench
  • MMStar
  • MMMU
  • MathVista
  • HallusionBench
  • AI2D
  • OCRBench
  • MMVet
  • MME
  • MMB-EN
  • MMB-CN
  • DocVQA
  • MMLU
  • CMMLU
  • CEVAL
  • GSM8K

Context Entities

Models

  • GPT-4v
  • Gemini-1.5-Pro
  • GPT-4o
  • InternVLChatV1.5
  • GLM-4v
  • Step-1V
  • MiniCPM-L3-V2.5
  • Intern-XC2-VL
  • WeMM

Benchmarks

  • public multimodal leaderboards cited in Table 1