Make visual tokens look like text tokens: learn a visual embedding table and probabilistic visual words.

Overview

Decision Snapshot: Needs Validation

Ovis presents a clear architectural change with repeated empirical gains on standard multimodal benchmarks; results hold when backbones, parameter counts, and data are controlled.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Ovis improves multimodal understanding without larger LLM backbones, letting teams get better vision+language performance by changing the visual tokenization architecture rather than scaling model size.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

Ovis replaces the usual continuous visual embedding pipeline with a learnable visual embedding table and probabilistic visual tokens. Each image patch is mapped to a probability over a large visual vocabulary and the final patch embedding is the expected embedding from that table. Trained in three stages with only text-generation loss, Ovis improves multimodal benchmark scores over connector-based MLLMs of similar size and often beats the proprietary Qwen-VL-Plus on evaluated benchmarks. Code and datasets are released.

Problem Statement

Current MLLMs feed continuous visual encoder outputs through a connector (MLP/linear) into the LLM, but visual tokens and text tokens use different tokenization and embedding strategies, causing suboptimal fusion of vision and language.

Main Contribution

Introduce a learnable visual embedding look-up table so visual patches are represented like textual tokens.

Map each visual patch to a probabilistic token (distribution over K visual words) and use the expectation over indexed embeddings as the patch embedding.
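
The two contributions above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the toy vocabulary size (K = 4), the embedding width, the `patch_logits` values, and the helper names `softmax` and `expected_embedding` are all hypothetical and chosen only for readability.

```python
import math

def softmax(logits):
    """Turn raw scores over the visual vocabulary into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_embedding(probs, table):
    """Probability-weighted average of the rows of the embedding table."""
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]

# Hypothetical toy sizes: K = 4 visual words, embedding width 3.
embedding_table = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Assumed output of a visual head for one image patch (one score per visual word).
patch_logits = [2.0, 0.5, 0.5, -1.0]

patch_probs = softmax(patch_logits)  # the probabilistic visual token
patch_embedding = expected_embedding(patch_probs, embedding_table)

print([round(p, 3) for p in patch_probs])
print([round(v, 3) for v in patch_embedding])
```

With a one-hot distribution, `expected_embedding` reduces to an ordinary table look-up, which is how text tokens are embedded. A softer distribution blends several visual words into one patch embedding.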

Key Findings

Ovis architecture beats an otherwise-identical connector-based MLLM.

Numbers: avg +8.8% across benchmarks (Table 3)

Practical Use: If you replace a connector with Ovis' visual embedding table, expect ≈9% average benchmark gain using the same backbones and data.

Evidence Ref: Table 3

Ovis-14B outperforms the proprietary Qwen-VL-Plus on many evaluated benchmarks.

Numbers: MMStar: 48.5 vs 39.7 (Ovis-Qwen1.5-14B vs Qwen-VL-Plus, Table 1)

Practical Use: Using Ovis' structural visual tokens can close or exceed performance of some high-resource proprietary models on common multimodal tasks.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average improvement vs connector architecture	avg +8.8%	connector-based MLLM (same backbones & params)	+8.8%	aggregated benchmarks in Table 3	Ovis vs connector-based model trained on same data and backbones (Table 3)	Table 3
MMStar score (Ovis-Qwen1.5-14B)	48.5	Qwen-VL-Plus 39.7	+8.8 points	MMStar (general multimodal benchmark)	Reported Table 1 comparison	Table 1
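
The MMStar delta in the second row is an absolute difference in score points, not a relative change. A small check using only the two MMStar scores quoted above shows both views; the helper names are illustrative.

```python
def absolute_delta(score, baseline):
    """Difference in benchmark points."""
    return score - baseline

def relative_delta_pct(score, baseline):
    """Change relative to the baseline, in percent."""
    return 100.0 * (score - baseline) / baseline

ovis_mmstar = 48.5   # Ovis-Qwen1.5-14B, as quoted from Table 1
qwen_vl_plus = 39.7  # Qwen-VL-Plus, as quoted from Table 1

points = round(absolute_delta(ovis_mmstar, qwen_vl_plus), 1)
percent = round(relative_delta_pct(ovis_mmstar, qwen_vl_plus), 1)
print(points, percent)
```

The 8.8-point gap corresponds to a relative improvement of roughly 22% over Qwen-VL-Plus on MMStar, so the two "8.8" figures in the table should not be read as the same quantity.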

What To Try In 7 Days

Prototype replacing your connector with a visual embedding table and probabilistic token head using a pretrained ViT and your LLM.

Train just the visual head and embedding table on a caption subset (stage-1 style) to gauge immediate gains.

Run your standard multimodal evaluation suite (e.g., MMBench or task-specific tests) to measure lift before full retrain.
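
One way to organise the stage-1 style experiment is to decide, per parameter, whether it is trainable before any training code runs. The sketch below is framework-agnostic and rests on assumptions: the module prefixes (`vit.`, `visual_head.`, `visual_embedding_table.`, `llm.`) and the parameter names are hypothetical, and the exact stage-1 recipe should be taken from the released code rather than from this example.

```python
# Hypothetical prefixes for the modules trained in the stage-1 style experiment.
TRAINABLE_PREFIXES = ("visual_head.", "visual_embedding_table.")

def plan_trainable(param_names, trainable_prefixes=TRAINABLE_PREFIXES):
    """Map each parameter name to True (train) or False (freeze)."""
    return {name: name.startswith(trainable_prefixes) for name in param_names}

# Illustrative parameter names, not taken from the Ovis code base.
param_names = [
    "vit.blocks.0.attn.weight",
    "visual_head.proj.weight",
    "visual_embedding_table.weight",
    "llm.layers.0.mlp.weight",
]

plan = plan_trainable(param_names)
for name, trainable in plan.items():
    print(f"{'train ' if trainable else 'freeze'}  {name}")
```

In PyTorch, a plan like this would typically be applied by setting `requires_grad` on each entry of `model.named_parameters()`, leaving the pretrained ViT and the LLM frozen.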

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AIDC-AI/Ovis

Data URLs

https://huggingface.co/datasets/AIDC-AI/Ovis-dataset

COYO-700M (public dataset referenced)

ShareGPT4V-Pretrain (public dataset referenced)

Risks & Boundaries

Limitations

No high-resolution boosting — limited performance on native high-res vision unless extended with extra techniques.

Trained only on single-image samples — not validated for multi-image or cross-image reasoning.

When Not To Use

When your application needs native multi-image fusion or cross-image reasoning without further engineering.

If your workload requires native high-resolution vision and you cannot add high-res techniques.

Failure Modes

Hallucination: generating incorrect facts from visual prompts despite improved perception.

Biases inherited from training data leading to unfair or harmful outputs.

Core Entities

Models

Ovis-Qwen1.5-7B, Ovis-Qwen1.5-14B, Ovis-Llama3-8B, Ovis-Llama3-8B (variant reported)

Metrics

MMStar score, MMBench score, MMMU score, MathVista-Mini score, MME sum (perception+cognition), HallusionBench QuestionAcc, RealWorldQA score

Datasets

COYO-10M, ShareGPT4V-Pretrain, LLaVA-Finetune, COYO (filtered subset), ImageNet-1K (sparsity test), In-house visual description & instruction datasets (AIDC-AI/Ovis-dataset)

Benchmarks

MMStar, MMBench-EN, MMBench-CN, MMMU, MathVista-Mini, MME, HallusionBench, RealWorldQA

Context Entities

Models

Qwen-VL-Plus, Qwen-VL-Max, GPT4V, LLaVA, InstructBLIP, Mini-Gemini, Monkey, DeepSeek-VL

Metrics

Benchmark leaderboard scores reported in Tables 1-2

Datasets

Laion, CC12M, ScienceQA, TextVQA

Benchmarks

MMBench, MMMU
