Make visual tokens look like text tokens: learn a visual embedding table and probabilistic visual words.

May 31, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Ovis presents a clear architectural change with repeated empirical gains on standard multimodal benchmarks; results hold when backbones, parameter counts, and data are controlled.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Ovis improves multimodal understanding without larger LLM backbones, letting teams get better vision+language performance by changing the visual tokenization architecture rather than scaling model size.

Summary TLDR

Ovis replaces the usual continuous visual embedding pipeline with a learnable visual embedding table and probabilistic visual tokens. Each image patch is mapped to a probability over a large visual vocabulary and the final patch embedding is the expected embedding from that table. Trained in three stages with only text-generation loss, Ovis improves multimodal benchmark scores over connector-based MLLMs of similar size and often beats the proprietary Qwen-VL-Plus on evaluated benchmarks. Code and datasets are released.
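
The expected-embedding step described above can be written compactly. The notation here is assumed for illustration and is not taken from the paper: $K$ is the visual vocabulary size, $\mathbf{e}_k$ is row $k$ of the visual embedding table, and $p_{i,k}$ is the probability that patch $i$ assigns to visual word $k$.

```latex
\[
  \mathbf{v}_i \;=\; \sum_{k=1}^{K} p_{i,k}\,\mathbf{e}_k,
  \qquad p_{i,k} \ge 0,
  \qquad \sum_{k=1}^{K} p_{i,k} = 1 .
\]
```

When $p_i$ puts all its mass on a single index, $\mathbf{v}_i$ reduces to one row of the table, which is the same look-up used for text tokens.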

Problem Statement

Current MLLMs feed continuous visual encoder outputs through a connector (MLP/linear) into the LLM, but visual tokens and text tokens use different tokenization and embedding strategies, causing suboptimal fusion of vision and language.
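
To make the mismatch concrete, the sketch below contrasts the two embedding paths in a connector-based model: text tokens are looked up by index in a table, while visual features are projected by a learned map. All names, shapes, and values are hypothetical, and a single linear layer stands in for the connector (an MLP would add further layers).

```python
def embed_text_token(token_id, text_embedding_table):
    """Text path: a discrete index selects one row of the embedding table."""
    return text_embedding_table[token_id]


def connector_project(patch_feature, weight, bias):
    """Visual path: a continuous feature is projected by a linear connector.

    weight has shape d_llm x d_vis; the output is not tied to any table row.
    """
    return [
        sum(w * x for w, x in zip(row, patch_feature)) + b
        for row, b in zip(weight, bias)
    ]


# Toy values, chosen only for illustration.
text_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 tokens, d_llm = 2
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]         # d_llm x d_vis
bias = [0.0, 0.1]

text_vec = embed_text_token(1, text_table)
visual_vec = connector_project([0.2, 0.4, 0.6], weight, bias)
```

Both vectors end up in the same LLM input space, but only the text vector is drawn from a learned vocabulary of embeddings.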

Main Contribution

Introduce a learnable visual embedding look-up table so visual patches are represented like textual tokens.

Map each visual patch to a probabilistic token (distribution over K visual words) and use the expectation over indexed embeddings as the patch embedding.
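
The two contributions can be sketched together in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear head, the toy sizes, and every function name are hypothetical, and a real system would use a tensor library.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probabilistic_visual_token(patch_feature, head_weights):
    """Map one patch feature to a distribution over K visual words.

    head_weights is a hypothetical K x d_in linear head; the paper's
    actual head may differ.
    """
    logits = [sum(w * x for w, x in zip(row, patch_feature)) for row in head_weights]
    return softmax(logits)


def expected_embedding(token_probs, embedding_table):
    """Return the probability-weighted average of the table rows."""
    dim = len(embedding_table[0])
    out = [0.0] * dim
    for p, row in zip(token_probs, embedding_table):
        for j in range(dim):
            out[j] += p * row[j]
    return out


# Toy sizes (assumptions for illustration only).
K, D_IN, D_MODEL = 8, 4, 3
rng = random.Random(0)
head = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(K)]
table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(K)]

patch = [0.5, -0.2, 0.1, 0.9]
probs = probabilistic_visual_token(patch, head)
embedding = expected_embedding(probs, table)

# A one-hot distribution reduces to an ordinary table look-up,
# which is how text tokens are embedded.
one_hot = [1.0 if k == 2 else 0.0 for k in range(K)]
lookup = expected_embedding(one_hot, table)
```

The one-hot case shows why this design mirrors text embedding: a text token is the special case in which all probability mass sits on a single vocabulary index.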

Key Findings

The Ovis architecture beats an otherwise-identical connector-based MLLM.

Numbers: avg +8.8% across benchmarks (Table 3)

Practical Use: In the paper's controlled comparison (same backbones and data), replacing the connector with Ovis' visual embedding table gave an average benchmark gain of 8.8%. Validate on your own tasks before assuming a similar lift.

Evidence Ref: Table 3

Ovis-14B outperforms the proprietary Qwen-VL-Plus on many evaluated benchmarks.

Numbers: MMStar 48.5 vs 39.7 (Ovis-Qwen1.5-14B vs Qwen-VL-Plus, Table 1)

Practical Use: Ovis' structured visual tokens can match or exceed the performance of some high-resource proprietary models on common multimodal tasks.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average improvement vs connector architecture | avg +8.8% | connector-based MLLM (same backbones & params) | +8.8% | aggregated benchmarks in Table 3 | Ovis vs connector-based model trained on same data and backbones (Table 3) | Table 3
MMStar score (Ovis-Qwen1.5-14B) | 48.5 | Qwen-VL-Plus 39.7 | +8.8 points | MMStar (general multimodal benchmark) | Reported Table 1 comparison | Table 1
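
The two deltas above share the figure 8.8 but are stated in different units ("%" for the Table 3 average, "points" for MMStar). The short check below recomputes the MMStar delta from the reported scores and shows how an absolute difference in points differs from a relative change in percent. The helper names are hypothetical, and whether the Table 3 average is absolute or relative is not stated in this summary.

```python
def absolute_delta(value, baseline):
    """Difference in score points."""
    return value - baseline


def relative_delta_percent(value, baseline):
    """Change relative to the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


ovis_mmstar = 48.5          # Ovis-Qwen1.5-14B, as reported above (Table 1)
qwen_vl_plus_mmstar = 39.7  # Qwen-VL-Plus, as reported above (Table 1)

points = absolute_delta(ovis_mmstar, qwen_vl_plus_mmstar)
relative = relative_delta_percent(ovis_mmstar, qwen_vl_plus_mmstar)
print(f"{points:.1f} points, {relative:.1f}% relative")
# prints "8.8 points, 22.2% relative"
```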

What To Try In 7 Days

Prototype replacing your connector with a visual embedding table and probabilistic token head using a pretrained ViT and your LLM.

Train just the visual head and embedding table on a caption subset (stage-1 style) to gauge immediate gains.

Run your standard multimodal evaluation suite (e.g., MMBench or task-specific tests) to measure lift before full retrain.
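
The trial above comes down to two bookkeeping tasks: deciding which parameters to update, and comparing scores before and after. The sketch below shows one way to organize both. The parameter-name prefixes, module names, and scores are placeholders invented for illustration; they are not taken from the Ovis code base or the paper.

```python
# Hypothetical parameter-name prefixes; real names depend on your code base.
STAGE1_TRAINABLE_PREFIXES = ("visual_head.", "visual_embedding_table.")


def select_trainable(param_names, trainable_prefixes=STAGE1_TRAINABLE_PREFIXES):
    """Return {name: bool} marking which parameters to update in the trial run."""
    return {name: name.startswith(trainable_prefixes) for name in param_names}


def benchmark_lift(baseline_scores, candidate_scores):
    """Per-benchmark difference (candidate minus baseline) over shared keys."""
    shared = sorted(set(baseline_scores) & set(candidate_scores))
    return {name: candidate_scores[name] - baseline_scores[name] for name in shared}


# Made-up parameter names standing in for a ViT + head + table + LLM stack.
param_names = [
    "vit.blocks.0.attn.qkv.weight",
    "visual_head.proj.weight",
    "visual_embedding_table.weight",
    "llm.layers.0.mlp.up_proj.weight",
]
plan = select_trainable(param_names)

# Placeholder scores, not measured results.
lift = benchmark_lift({"my_eval": 50.0}, {"my_eval": 52.0})
```

In a PyTorch code base, flags like these would typically be applied by iterating over `model.named_parameters()` and setting each parameter's `requires_grad` attribute accordingly.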

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://huggingface.co/datasets/AIDC-AI/Ovis-dataset
COYO-700M (public dataset referenced)
ShareGPT4V-Pretrain (public dataset referenced)

Risks & Boundaries

Limitations

No high-resolution boosting: performance on native high-resolution inputs is limited unless the model is extended with additional techniques.

Trained only on single-image samples — not validated for multi-image or cross-image reasoning.

When Not To Use

When your application needs native multi-image fusion or cross-image reasoning without further engineering.

If your workload requires native high-resolution vision and you cannot add high-res techniques.

Failure Modes

Hallucination: generating incorrect facts from visual prompts despite improved perception.

Biases inherited from training data leading to unfair or harmful outputs.

Core Entities

Models

Ovis-Qwen1.5-7B, Ovis-Qwen1.5-14B, Ovis-Llama3-8B, Ovis-Llama3-8B (variant reported)

Metrics

MMStar score, MMBench score, MMMU score, MathVista-Mini score, MME sum (perception+cognition), HallusionBench QuestionAcc, RealWorldQA score

Datasets

COYO-10M, ShareGPT4V-Pretrain, LLaVA-Finetune, COYO (filtered subset), ImageNet-1K (sparsity test), In-house visual description & instruction datasets (AIDC-AI/Ovis-dataset)

Benchmarks

MMStar, MMBench-EN, MMBench-CN, MMMU, MathVista-Mini, MME, HallusionBench, RealWorldQA

Context Entities

Models

Qwen-VL-Plus, Qwen-VL-Max, GPT4V, LLaVA, InstructBLIP, Mini-Gemini, Monkey, DeepSeek-VL

Metrics

Benchmark leaderboard scores reported in Tables 1-2

Datasets

Laion, CC12M, ScienceQA, TextVQA

Benchmarks

MMBench, MMMU