Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
MammothModa gives competitive multimodal accuracy while cutting visual token compute and inference time, making it practical for products needing high‑res image, OCR, document VQA, or long‑video understanding.
Summary TLDR
MammothModa is a multimodal LLM designed to handle high‑resolution images and long videos without hurting language skills. Key ideas: dynamic global/local image splitting (GLHR), a minimalist Visual Merger that mean‑pools visual features to cut visual token counts, shared Frame Position IDs for long videos, and small Visual Expert (VE) modules inside the LLM so vision fine‑tuning does not degrade text ability. Ablations show large gains on OCR and document VQA, a ~1.3–1.4x inference speedup from Visual Merger, and competitive leaderboard scores (average ~61.2 on major multimodal suites).
Problem Statement
Multimodal LLMs struggle to combine detailed high‑resolution visual inputs and long videos with complex language understanding. High resolution and many frames create huge numbers of visual tokens, causing big compute costs and risking degradation of the LLM's text skills during vision fine‑tuning.
Main Contribution
Global‑Local High‑Resolution splitting (GLHR) to preserve fine detail by dynamically splitting images into 336×336 patches.
Visual Merger: a lightweight mean‑pooling module that reduces visual token count and speeds up inference with minimal model changes.
Shared Frame Position IDs (FPID) to compress positional embeddings for long videos and avoid positional interpolation.
Visual Expert (VE) modules inserted into LLM layers to process visual tokens and protect textual capabilities during vision fine‑tuning.
A three‑phase training recipe (vision‑language alignment, multi‑task pretraining, supervised fine‑tuning) and curated bilingual multimodal data to reduce hallucinations.
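The GLHR idea above can be sketched as a tile-grid computation. This is a minimal illustration, not the authors' procedure: the 336-pixel tile side comes from the summary, and the 12-tile cap is read from the DS-12 label, but the grid-selection rule and the names `choose_grid` and `local_tile_boxes` are assumptions made here.

```python
import math
from typing import List, Tuple

TILE = 336  # tile side length stated in the summary


def choose_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Return (cols, rows) of 336x336 local tiles for an image.

    Assumed rule (not from the source): cover the image with tiles,
    then shrink the longer grid side until the tile budget is met.
    """
    cols = max(1, math.ceil(width / TILE))
    rows = max(1, math.ceil(height / TILE))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows


def local_tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) in an image resized to
    (cols * TILE, rows * TILE), listed row by row."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(1344, 1008)
boxes = local_tile_boxes(cols, rows)
```

In this sketch the global view would be the whole image resized to a single 336×336 tile and encoded alongside the local tiles.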
Key Findings
Dynamic splitting at high equivalent resolution (DS-12) substantially improves fine‑grained and document tasks.
Visual Merger (mean pooling) reduces test time at the cost of a small drop in some scores.
Shared Frame Position IDs let the tokens of each frame share one position ID, with minimal loss on some metrics and small gains on others.
Fine‑tuning on vision data degrades text benchmarks; Visual Experts mitigate that and boost vision scores.
MammothModa ranks among top multimodal models on public leaderboards.
Results
Average (leaderboards)
MMBench
MMStar
OCRBench (ablation)
DocVQA (ablation)
Test Time Cost (Visual Merger)
FT language degradation (text-only -> FT)
FT w/ VE visual gain
Who Should Care
What To Try In 7 Days
Add a simple mean‑pool Visual Merger to your vision→LLM pipeline to reduce visual tokens and test latency.
Implement GLHR: split large images into 336×336 patches for better OCR and document understanding.
Use per‑frame shared position IDs when ingesting many frames to avoid costly positional interpolation.
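As a starting point for the first experiment, windowed mean pooling over a grid of visual tokens might look like the sketch below. It uses plain Python lists so it runs without dependencies. The function name `mean_pool_tokens` and the non-overlapping window layout are assumptions, and the source does not state the window sizes used.

```python
from typing import List

Grid = List[List[List[float]]]  # H x W x D grid of token features


def mean_pool_tokens(grid: Grid, window: int) -> Grid:
    """Average non-overlapping window x window blocks of tokens.

    An H x W grid becomes (H / window) x (W / window), so the token
    count drops by a factor of window ** 2.
    """
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid size must be divisible by the window")
    d = len(grid[0][0])
    pooled = []
    for r0 in range(0, h, window):
        row = []
        for c0 in range(0, w, window):
            acc = [0.0] * d
            for r in range(r0, r0 + window):
                for c in range(c0, c0 + window):
                    for k in range(d):
                        acc[k] += grid[r][c][k]
            row.append([v / (window * window) for v in acc])
        pooled.append(row)
    return pooled


# Toy 4x4 grid with 1-dimensional features 0..15.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled = mean_pool_tokens(grid, 2)  # 16 tokens -> 4 tokens
```

In a real pipeline the same operation would usually be a single tensor pooling call; the loop form is only meant to make the token arithmetic explicit.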
Optimization Features
Token Efficiency
- GLHR splits images into patches to raise effective resolution without exploding the token count
- Visual Merger reduces token count via windowed mean pooling
Model Optimization
- Visual Merger: spatial mean pooling to reduce tokens
- Visual Expert modules isolate vision processing
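One way to picture how Visual Expert modules isolate vision processing is per-token routing: visual tokens go through their own weights while text tokens keep the original LLM weights. The sketch below is an assumption about the mechanism, not the authors' design; the source does not say which sublayers carry experts, and `routed_projection` is a hypothetical name.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def routed_projection(
    tokens: List[Vector],
    is_visual: List[bool],
    w_text: Matrix,
    w_visual: Matrix,
) -> List[Vector]:
    """Project each token with the weights matching its modality.

    Text tokens only ever touch w_text, so training w_visual on vision
    data leaves the text path unchanged (the assumed motivation).
    """
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have equal length")
    return [
        matvec(w_visual if visual else w_text, tok)
        for tok, visual in zip(tokens, is_visual)
    ]


w_text = [[1.0, 0.0], [0.0, 1.0]]    # stands in for frozen LLM weights
w_visual = [[2.0, 0.0], [0.0, 2.0]]  # stands in for trainable expert weights
out = routed_projection([[1.0, 2.0], [3.0, 4.0]], [False, True], w_text, w_visual)
```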
System Optimization
- Stitching frame features and FPID to compress temporal positional ids
Training Optimization
- Three‑phase training: alignment, multi‑task pretraining, supervised fine‑tuning
- Layer‑wise LR decay on ViT to preserve pretrained features
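Layer-wise learning-rate decay is commonly implemented by scaling the rate geometrically from the last layer back to the first, so early ViT layers change least. The sketch below shows that common form. The decay factor and base rate are placeholders, since the source gives neither, and `layerwise_lrs` is a hypothetical helper.

```python
from typing import List


def layerwise_lrs(base_lr: float, num_layers: int, decay: float) -> List[float]:
    """Learning rate per layer, index 0 being the first (earliest) layer.

    The last layer trains at base_lr; each earlier layer is scaled by
    one more factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Placeholder values chosen only to make the pattern easy to read.
lrs = layerwise_lrs(base_lr=1.0, num_layers=3, decay=0.5)
```

Each value would then be attached to the matching layer's parameter group in the optimizer.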
Inference Optimization
- Dynamic pooling: a different pooling window at test time than in training, to speed up inference
- Shared FPID to avoid positional interpolation overhead
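The shared FPID idea can be illustrated by building position IDs for an interleaved sequence of text and video frames. This is one reading of the summary, in which every token of a frame reuses a single ID; the segment encoding and the name `build_position_ids` are assumptions for illustration.

```python
from typing import List, Tuple


def build_position_ids(segments: List[Tuple[str, int]]) -> List[int]:
    """Assign position IDs to a sequence of ("text" | "frame", n_tokens).

    Text tokens get consecutive IDs. All tokens of one frame share a
    single ID, so a frame advances the position counter by 1 rather
    than by its token count.
    """
    ids: List[int] = []
    next_id = 0
    for kind, n_tokens in segments:
        if kind == "frame":
            ids.extend([next_id] * n_tokens)
            next_id += 1
        elif kind == "text":
            ids.extend(range(next_id, next_id + n_tokens))
            next_id += n_tokens
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ids


ids = build_position_ids([("text", 2), ("frame", 3), ("frame", 3), ("text", 2)])
```

Here ten tokens span positions 0 to 5 instead of 0 to 9, which is why long videos can stay inside the position range seen in training without interpolation.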
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- No code or dataset release mentioned, making reproduction hard.
- Large pooling windows in the Visual Merger reduce accuracy; choose the window size carefully.
- Shared FPID has small negative effects on some benchmarks (e.g., MMVet -1.9).
- Leaderboards compare many models; some listed baselines (GPT-4 variants) are closed‑service references.
When Not To Use
- When exact per‑pixel spatial information is critical and pooling would lose needed detail.
- If you require open datasets or code for compliance or replication.
Failure Modes
- Over‑pooling in Visual Merger can harm fine visual detail and some benchmark scores.
- Shared FPID may slightly degrade some specialized vision metrics.
- VE modules could add complexity and parameters; improper placement might not fully prevent language degradation.
Core Entities
Models
- MammothModa
- ViT (vision transformer)
- Visual Merger
- Visual Expert (VE)
- LLaVA (baseline referenced)
Metrics
- Average
- MME
- MMB-EN
- MMB-CN
- OCRBench
- DocVQA
- MMVet
- MMLU
- CMMLU
- CEVAL
- GSM8K
- Test Time Cost (s)
- Speed up
Datasets
- curated bilingual multimodal dataset (authors)
- caption datasets (used in vision-language alignment)
Benchmarks
- MMBench
- MMStar
- MMMU
- MathVista
- HallusionBench
- AI2D
- OCRBench
- MMVet
- MME
- MMB-EN
- MMB-CN
- DocVQA
- MMLU
- CMMLU
- CEVAL
- GSM8K
Context Entities
Models
- GPT-4V
- Gemini-1.5-Pro
- GPT-4o
- InternVLChatV1.5
- GLM-4v
- Step-1V
- MiniCPM-L3-V2.5
- Intern-XC2-VL
- WeMM
Benchmarks
- public multimodal leaderboards cited in Table 1

