A simple, efficient multimodal LLM that boosts high‑res image and long‑video handling with token merging and visual experts

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

Authors

Qi She, Junwen Pan, Xin Wan, Rui Zhang, Dawei Lu, Kai Huang

Links

Abstract / PDF

Why It Matters For Business

MammothModa gives competitive multimodal accuracy while cutting visual token compute and inference time, making it practical for products needing high‑res image, OCR, document VQA, or long‑video understanding.

Summary TLDR

MammothModa is a multimodal LLM designed to handle high‑resolution images and long videos without hurting language skills. Key ideas: dynamic global/local image splitting (GLHR), a minimalist Visual Merger that mean‑pools visual features to cut visual token counts, shared Frame Position IDs for long videos, and small Visual Expert (VE) modules inside the LLM so vision fine‑tuning does not degrade text ability. Ablations show large gains on OCR and document VQA, a ~1.3–1.4x inference speedup from Visual Merger, and competitive leaderboard scores (average ~61.2 on major multimodal suites).

Problem Statement

Multimodal LLMs struggle to combine detailed high‑resolution visual inputs and long videos with complex language understanding. High resolution and many frames create huge numbers of visual tokens, causing big compute costs and risking degradation of the LLM's text skills during vision fine‑tuning.

Main Contribution

Global‑Local High‑Resolution splitting (GLHR) to preserve fine detail by dynamically splitting images into 336×336 patches.
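A minimal sketch of the tiling step, under assumptions this summary does not state: the grid is chosen to best match the image's aspect ratio within a hypothetical budget of `max_tiles` local tiles, and the global view is the whole image resized to one 336×336 tile. Only the 336×336 tile size comes from the text. `choose_grid`, `plan_glhr_views`, and `max_tiles=12` are illustrative names and values.

```python
import math

TILE = 336  # tile side length stated in the summary


def choose_grid(width, height, tile=TILE, max_tiles=12):
    """Pick (cols, rows) whose aspect ratio is closest to the image's.

    Assumed rule, not confirmed by the summary: never use more tiles
    than the image needs, and never exceed max_tiles in total.
    """
    target = width / height
    max_cols = max(1, math.ceil(width / tile))
    max_rows = max(1, math.ceil(height / tile))
    best, best_err = (1, 1), abs(target - 1.0)
    for cols in range(1, min(max_cols, max_tiles) + 1):
        for rows in range(1, min(max_rows, max_tiles // cols) + 1):
            err = abs(target - cols / rows)
            more_tiles = cols * rows > best[0] * best[1]
            # Prefer the closer aspect ratio; on ties prefer more tiles.
            if err < best_err or (err == best_err and more_tiles):
                best, best_err = (cols, rows), err
    return best


def plan_glhr_views(width, height, tile=TILE, max_tiles=12):
    """Return the global view size, the resize target, and local tile boxes."""
    cols, rows = choose_grid(width, height, tile, max_tiles)
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)  # left, top, right, bottom
        for r in range(rows)
        for c in range(cols)
    ]
    return {
        "global_size": (tile, tile),
        "resize_to": (cols * tile, rows * tile),
        "local_boxes": boxes,
    }
```

With these assumptions, a 4000×3000 image maps to a 4×3 grid: the image would be resized to 1344×1008 and cut into 12 local tiles, plus one global view.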

Visual Merger: a lightweight mean‑pooling module that reduces visual token count and speeds up inference with minimal model changes.
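The pooling idea can be sketched in plain Python. This assumes the visual tokens sit on a 2-D grid and are averaged over non-overlapping square windows. The summary does not specify the exact window shape, so `mean_pool_tokens` and its arguments are hypothetical.

```python
def mean_pool_tokens(grid, window):
    """Average feature vectors over non-overlapping window x window blocks.

    grid: H x W nested list, each entry a feature vector (list of floats).
    Returns an (H // window) x (W // window) grid of averaged vectors.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid sides must be divisible by the window size")
    dim = len(grid[0][0])
    pooled = []
    for r0 in range(0, height, window):
        pooled_row = []
        for c0 in range(0, width, window):
            block = [
                grid[r][c]
                for r in range(r0, r0 + window)
                for c in range(c0, c0 + window)
            ]
            pooled_row.append(
                [sum(vec[d] for vec in block) / len(block) for d in range(dim)]
            )
        pooled.append(pooled_row)
    return pooled
```

A window of size w cuts the token count by a factor of w². The sketch has no learned parameters, which is in the spirit of the summary's "minimal model changes".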

Shared Frame Position IDs (FPID) to compress positional embeddings for long videos and avoid positional interpolation.
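As a rough illustration, the difference between ordinary per-token position IDs and shared per-frame IDs can be sketched as below. The function names are hypothetical. The figure of 144 tokens per frame is inferred from the ablation numbers reported in Key Findings (4320 IDs for 30 frames), not stated directly.

```python
def sequential_position_ids(num_frames, tokens_per_frame, start=0):
    """Baseline: every visual token gets its own position ID."""
    return [start + i for i in range(num_frames * tokens_per_frame)]


def shared_frame_position_ids(num_frames, tokens_per_frame, start=0):
    """Shared FPID: all tokens in one frame share that frame's position ID."""
    return [
        start + frame
        for frame in range(num_frames)
        for _ in range(tokens_per_frame)
    ]


baseline_ids = sequential_position_ids(30, 144)
shared_ids = shared_frame_position_ids(30, 144)
print(len(set(baseline_ids)), "->", len(set(shared_ids)))
```

Because the number of distinct IDs grows with the frame count rather than the token count, a long video is less likely to exceed the position range the LLM was trained on. This is presumably why positional interpolation can be avoided.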

Visual Expert (VE) modules inserted into LLM layers to process visual tokens and protect textual capabilities during vision fine‑tuning.
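One way to picture a Visual Expert is a layer that routes each token to one of two weight sets by modality. The sketch below is an assumption-laden toy: the summary does not say which sublayers receive experts or which weights are frozen, and `ModalityRoutedProjection` is a hypothetical name.

```python
def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class ModalityRoutedProjection:
    """Toy layer: visual tokens use expert weights, text tokens use the LLM's."""

    def __init__(self, text_weights, visual_weights):
        self.text_weights = text_weights      # original LLM weights (assumed frozen)
        self.visual_weights = visual_weights  # Visual Expert weights (assumed trainable)

    def __call__(self, tokens, is_visual):
        if len(tokens) != len(is_visual):
            raise ValueError("tokens and is_visual must have the same length")
        return [
            linear(self.visual_weights if visual else self.text_weights, token)
            for token, visual in zip(tokens, is_visual)
        ]


identity = [[1, 0], [0, 1]]
doubled = [[2, 0], [0, 2]]
layer = ModalityRoutedProjection(identity, doubled)
outputs = layer([[1, 2], [3, 4]], [False, True])
```

In this toy setup a text-only input never touches the expert weights, so its outputs match the original layer exactly. That is the intuition behind protecting text ability during vision fine-tuning.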

A three‑phase training recipe (vision‑language alignment, multi‑task pretraining, supervised fine‑tuning) and curated bilingual multimodal data to reduce hallucinations.

Key Findings

Dynamic splitting at high equivalent resolution (DS-12) substantially improves fine‑grained and document tasks.

Numbers: Avg +45.0; OCRBench +105; DocVQA +28.83 (vs. Resize)

Visual Merger (mean pooling) reduces test time at the cost of a small drop in some scores.

Numbers: Test time 398 s -> 298 s (window 3), speed-up 1.34x; Avg 59.3 -> 56.78
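The trade-off in these figures can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the numbers reported above.

```python
baseline_seconds = 398.0  # test time without merging
merged_seconds = 298.0    # test time with pooling window 3

speedup = baseline_seconds / merged_seconds
time_saved = 1.0 - merged_seconds / baseline_seconds
avg_drop = 59.3 - 56.78  # average benchmark score before and after

print(f"speed-up: {speedup:.2f}x")
print(f"time saved: {time_saved:.1%}")
print(f"average score drop: {avg_drop:.2f} points")
```

The reported configuration saves roughly a quarter of the test time in exchange for about 2.5 points of average score.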

Shared Frame Position IDs collapse the position IDs of all tokens in a frame into a single ID, with minimal performance loss overall and small gains on some metrics.

Numbers: Pos IDs 4320 -> 30 (30 frames); MME +2.68; Avg -0.48

Fine‑tuning on vision data degrades text benchmarks; Visual Experts mitigate that and boost vision scores.

Numbers: FT drops MMLU -2.8, CMMLU -7.6, GSM8K -12.0; FT w/ VE: MME +131.9, MMVet +6.2

MammothModa ranks among top multimodal models on public leaderboards.

Numbers: Average score 61.2; MMBench 81.04; MMStar 56.27

Results

Average (leaderboards)

Value: 61.2

Baseline: GPT-4o 69.9 (top listed)

MMBench

Value: 81.04

MMStar

Value: 56.27

OCRBench (ablation)

Value: Increase +105

Baseline: Resize

DocVQA (ablation)

Value: Increase +28.83

Baseline: Resize

Test Time Cost (Visual Merger)

Value: 398 s -> 298 s

Baseline: no merge

FT language degradation (text-only -> FT)

Value: MMLU 63.2 -> 60.4; GSM8K 42.5 -> 30.5

Baseline: text-only

FT w/ VE visual gain

Value: MME +131.9; MMVet +6.2

Baseline: FT without VE

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

What To Try In 7 Days

Add a simple mean‑pool Visual Merger to your vision→LLM pipeline to reduce visual tokens and test latency.

Implement GLHR: split large images into 336×336 patches for better OCR and document understanding.

Use per‑frame shared position IDs when ingesting many frames to avoid costly positional interpolation.

Optimization Features

Token Efficiency

GLHR splitting increases effective resolution without exploding the token count
Visual Merger reduces token count via windowed mean pooling
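A back-of-the-envelope token budget shows how the two ideas interact. The sketch assumes a ViT patch size of 14 pixels, which this summary does not state, so each 336×336 view yields a 24×24 token grid. `visual_token_budget` is a hypothetical helper.

```python
def visual_token_budget(num_views, tile=336, vit_patch=14, window=1):
    """Count visual tokens for num_views tiles after square mean pooling.

    vit_patch=14 is an assumed value; window=1 means no pooling.
    """
    side = tile // vit_patch
    if side % window:
        raise ValueError("token grid side must be divisible by the window")
    pooled_side = side // window
    return num_views * pooled_side * pooled_side


views = 12 + 1  # e.g. twelve local tiles plus one global view (illustrative)
for window in (1, 2, 3):
    print(window, visual_token_budget(views, window=window))
```

Under these assumptions, 13 views cost 7488 tokens unpooled, 1872 with a 2×2 window, and 832 with a 3×3 window. The 2×2 case gives 144 tokens per view, which equals the 4320 / 30 per-frame figure implied by the FPID ablation, though that match may be coincidental.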

Model Optimization

Visual Merger: spatial mean pooling to reduce tokens
Visual Expert modules isolate vision processing

System Optimization

Stitching frame features and FPID to compress temporal positional ids

Training Optimization

Three‑phase training: alignment, multi‑task pretraining, supervised fine‑tuning
Layer‑wise LR decay on ViT to preserve pretrained features
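Layer-wise learning-rate decay is commonly implemented by shrinking the rate geometrically from the last ViT layer back to the first. The sketch below follows that common pattern. The summary gives no base rate or decay factor, so the values in the usage line are placeholders and `layerwise_learning_rates` is a hypothetical helper.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """Return one learning rate per layer, index 0 being the earliest layer.

    The last layer trains at base_lr; each earlier layer is scaled by
    another factor of `decay`, so early pretrained features move the least.
    """
    return [
        base_lr * decay ** (num_layers - 1 - layer)
        for layer in range(num_layers)
    ]


# Placeholder values for illustration only.
rates = layerwise_learning_rates(base_lr=1.0, num_layers=3, decay=0.5)
```

In a training framework, each rate would typically be attached to the matching layer's parameter group in the optimizer.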

Inference Optimization

Dynamic pooling: the pooling window can differ between training and testing, so a larger window can be used at test time to speed up inference
Shared FPID to avoid positional interpolation overhead

Reproducibility

Open Source Status

unknown

Risks & Boundaries

Limitations

No code or dataset release mentioned, making reproduction hard.
Large pooling windows in the Visual Merger reduce accuracy; choose the window size carefully.
Shared FPID has small negative effects on some benchmarks (e.g., MMVet -1.9).
Leaderboards compare many models; some listed baselines (GPT-4 variants) are closed‑service references.

When Not To Use

When exact per‑pixel spatial information is critical and pooling would lose needed detail.
If you require open datasets or code for compliance or replication.

Failure Modes

Over‑pooling in Visual Merger can harm fine visual detail and some benchmark scores.
Shared FPID may slightly degrade some specialized vision metrics.
VE modules could add complexity and parameters; improper placement might not fully prevent language degradation.

Core Entities

Models

MammothModa
ViT (vision transformer)
Visual Merger
Visual Expert (VE)
LLaVA (baseline referenced)

Metrics

Average
MME
MMB-EN
MMB-CN
OCRBench
DocVQA
MMVet
MMLU
CMMLU
CEVAL
GSM8K
Test Time Cost (s)
Speed up

Datasets

curated bilingual multimodal dataset (authors)
caption datasets (used in vision-language alignment)

Benchmarks

MMBench
MMStar
MMMU
MathVista
HallusionBench
AI2D
OCRBench
MMVet
MME
MMB-EN
MMB-CN
DocVQA
MMLU
CMMLU
CEVAL
GSM8K

Context Entities

Models

GPT-4v
Gemini-1.5-Pro
GPT-4o
InternVLChatV1.5
GLM-4v
Step-1V
MiniCPM-L3-V2.5
Intern-XC2-VL
WeMM

Benchmarks

public multimodal leaderboards cited in Table 1
