Make visual tokens look like text tokens: learn a visual embedding table and probabilistic visual words.

May 31, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

4

Authors

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye

Links

Abstract / PDF

Why It Matters For Business

Ovis improves multimodal understanding without larger LLM backbones, letting teams get better vision+language performance by changing the visual tokenization architecture rather than scaling model size.

Summary TLDR

Ovis replaces the usual continuous visual embedding pipeline with a learnable visual embedding table and probabilistic visual tokens. Each image patch is mapped to a probability over a large visual vocabulary and the final patch embedding is the expected embedding from that table. Trained in three stages with only text-generation loss, Ovis improves multimodal benchmark scores over connector-based MLLMs of similar size and often beats the proprietary Qwen-VL-Plus on evaluated benchmarks. Code and datasets are released.

Problem Statement

Current MLLMs feed continuous visual encoder outputs through a connector (MLP/linear) into the LLM, but visual tokens and text tokens use different tokenization and embedding strategies, causing suboptimal fusion of vision and language.

Main Contribution

Introduce a learnable visual embedding look-up table so visual patches are represented like textual tokens.

Map each visual patch to a probabilistic token (distribution over K visual words) and use the expectation over indexed embeddings as the patch embedding.

Train Ovis in three stages (caption pretrain, visual description, multimodal instruction) using only text-generation loss.

Show consistent benchmark gains over connector-based MLLMs and open-source peers; Ovis-14B often outperforms Qwen-VL-Plus on evaluated benchmarks.

Openly provide training code and datasets to support reproduction.
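
The probabilistic-token idea in the contributions above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released Ovis code: the function names, the single linear head `head_weight`, and the toy sizes are hypothetical, and a real setup would take patch features from a pretrained ViT and use K = 131,072.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def probabilistic_visual_embedding(patch_features, head_weight, embedding_table):
    """Map patch features to expected embeddings from a visual embedding table.

    patch_features:  (num_patches, d_vit)  visual encoder outputs (assumed given)
    head_weight:     (d_vit, K)            hypothetical linear visual head
    embedding_table: (K, d_llm)            learnable visual embedding table
    """
    logits = patch_features @ head_weight        # (num_patches, K)
    probs = softmax(logits, axis=-1)             # probabilistic visual token per patch
    embeddings = probs @ embedding_table         # expectation over the K table rows
    return embeddings, probs

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
num_patches, d_vit, K, d_llm = 4, 16, 32, 8
patch_features = rng.normal(size=(num_patches, d_vit))
head_weight = rng.normal(size=(d_vit, K))
embedding_table = rng.normal(size=(K, d_llm))

embeddings, probs = probabilistic_visual_embedding(
    patch_features, head_weight, embedding_table
)
print(embeddings.shape)  # (4, 8)
```

When a row of `probs` is one-hot, the expectation reduces to an ordinary table lookup, which is how text tokens are embedded; the softmax relaxes that lookup so a patch can blend several visual words.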

Key Findings

Ovis architecture beats an otherwise-identical connector-based MLLM.

Numbers: avg +8.8% across benchmarks (Table 3)

Ovis-14B outperforms the proprietary Qwen-VL-Plus on many evaluated benchmarks.

Numbers: MMStar 48.5 vs 39.7 (Ovis-Qwen1.5-14B vs Qwen-VL-Plus, Table 1)

Ovis uses a very large visual vocabulary and produces sparse probabilistic tokens.

Numbers: visual vocab K = 131,072; only 0.22% of probability values > 1e-4 (Fig. 8, Sec. 4.1)
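
The sparsity figure above is a fraction of probability entries that exceed a threshold. A rough sketch of that measurement follows, assuming you already have a matrix of per-patch probabilities; the helper name `fraction_above` and the random logits are hypothetical stand-ins, so the printed value says nothing about Ovis itself.

```python
import numpy as np

def fraction_above(probs, threshold=1e-4):
    """Fraction of entries in a probability matrix that exceed the threshold."""
    return float((probs > threshold).mean())

# Stand-in data: peaked random logits over a small vocabulary (illustration only).
rng = np.random.default_rng(0)
K = 1024
logits = rng.normal(scale=8.0, size=(16, K))
logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

frac = fraction_above(probs)
print(f"{100 * frac:.2f}% of probability values exceed 1e-4")
```

For scale: a uniform distribution over 131,072 visual words puts about 7.6e-6 on each word, below the 1e-4 threshold, so any entry above the threshold carries noticeably more than uniform mass.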

Results

Average improvement vs connector architecture

Value: avg +8.8%

Baseline: connector-based MLLM (same backbones & params)

MMStar score (Ovis-Qwen1.5-14B)

Value: 48.5

Baseline: Qwen-VL-Plus 39.7

Visual tokenizer sparsity

Value: 0.22% of probability values > 1e-4

Who Should Care

What To Try In 7 Days

Prototype replacing your connector with a visual embedding table and probabilistic token head using a pretrained ViT and your LLM.

Train just the visual head and embedding table on a caption subset (stage-1 style) to gauge immediate gains.

Run your standard multimodal evaluation suite (e.g., MMBench or task-specific tests) to measure lift before full retrain.

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No high-resolution boosting: performance on native high-resolution images is limited unless the model is extended with extra techniques.
  • Trained only on single-image samples — not validated for multi-image or cross-image reasoning.
  • Model still vulnerable to hallucination and biases common to generative MLLMs.

When Not To Use

  • When your application needs native multi-image fusion or cross-image reasoning without further engineering.
  • If your workload requires native high-resolution vision and you cannot add high-res techniques.
  • If you cannot afford the memory/compute for a very large visual vocabulary (K≈131k) or its indexing.
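
To put the vocabulary cost in the last item into numbers, here is a back-of-the-envelope estimate of the embedding table's weight memory. The hidden sizes (4096 and 5120) and the 2-byte parameter width are assumptions chosen for illustration, not figures from the summary, and the estimate ignores the visual head, gradients, and optimizer state.

```python
def table_params(vocab_size, hidden_dim):
    """Number of parameters in a (vocab_size x hidden_dim) embedding table."""
    return vocab_size * hidden_dim

def table_size_gib(vocab_size, hidden_dim, bytes_per_param=2):
    """Weight memory of the table in GiB (2 bytes per parameter assumes fp16/bf16)."""
    return table_params(vocab_size, hidden_dim) * bytes_per_param / 2**30

K = 131_072  # visual vocabulary size reported in the summary

# Hidden sizes below are assumed examples of LLM embedding widths.
for hidden_dim in (4096, 5120):
    params = table_params(K, hidden_dim)
    size = table_size_gib(K, hidden_dim)
    print(f"hidden_dim={hidden_dim}: {params:,} params, {size:.2f} GiB")
```

Under these assumptions the table alone comes to about 1 GiB at width 4096 and 1.25 GiB at width 5120; training it costs a multiple of that once gradients and optimizer state are included.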

Failure Modes

  • Hallucination: generating incorrect facts from visual prompts despite improved perception.
  • Biases inherited from training data leading to unfair or harmful outputs.
  • Missing rare visual patterns if the probabilistic token distribution or vocabulary fails to cover them.

Core Entities

Models

  • Ovis-Qwen1.5-7B
  • Ovis-Qwen1.5-14B
  • Ovis-Llama3-8B
  • Ovis-Llama3-8B (variant reported)

Metrics

  • MMStar score
  • MMBench score
  • MMMU score
  • MathVista-Mini score
  • MME sum (perception+cognition)
  • HallusionBench QuestionAcc
  • RealWorldQA score

Datasets

  • COYO-10M
  • ShareGPT4V-Pretrain
  • LLaVA-Finetune
  • COYO (filtered subset)
  • ImageNet-1K (sparsity test)
  • In-house visual description & instruction datasets (AIDC-AI/Ovis-dataset)

Benchmarks

  • MMStar
  • MMBench-EN
  • MMBench-CN
  • MMMU
  • MathVista-Mini
  • MME
  • HallusionBench
  • RealWorldQA

Context Entities

Models

  • Qwen-VL-Plus
  • Qwen-VL-Max
  • GPT-4V
  • LLaVA
  • InstructBLIP
  • Mini-Gemini
  • Monkey
  • DeepSeek-VL

Metrics

  • Benchmark leaderboard scores reported in Tables 1-2

Datasets

  • LAION
  • CC12M
  • ScienceQA
  • TextVQA

Benchmarks

  • MMBench
  • MMMU