Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
4
Why It Matters For Business
Ovis improves multimodal understanding without larger LLM backbones, letting teams get better vision+language performance by changing the visual tokenization architecture rather than scaling model size.
Summary TLDR
Ovis replaces the usual continuous visual embedding pipeline with a learnable visual embedding table and probabilistic visual tokens. Each image patch is mapped to a probability over a large visual vocabulary and the final patch embedding is the expected embedding from that table. Trained in three stages with only text-generation loss, Ovis improves multimodal benchmark scores over connector-based MLLMs of similar size and often beats the proprietary Qwen-VL-Plus on evaluated benchmarks. Code and datasets are released.
Problem Statement
Current MLLMs feed continuous visual encoder outputs through a connector (MLP/linear) into the LLM, but visual tokens and text tokens use different tokenization and embedding strategies, causing suboptimal fusion of vision and language.
Main Contribution
Introduce a learnable visual embedding look-up table so visual patches are represented like textual tokens.
Map each visual patch to a probabilistic token (distribution over K visual words) and use the expectation over indexed embeddings as the patch embedding.
Train Ovis in three stages (caption pretrain, visual description, multimodal instruction) using only text-generation loss.
Show consistent benchmark gains over connector-based MLLMs and open-source peers; Ovis-14B often outperforms Qwen-VL-Plus on evaluated benchmarks.
Openly provide training code and datasets to support reproduction.
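The probabilistic-token idea in the contributions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the released Ovis code: the toy sizes `K` and `DIM`, the random logits standing in for the visual head output, and the helper names `softmax` and `expected_embedding` are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_embedding(probs, table):
    """Probability-weighted sum of the rows of the embedding table."""
    dim = len(table[0])
    out = [0.0] * dim
    for p, row in zip(probs, table):
        for j in range(dim):
            out[j] += p * row[j]
    return out


# Toy sizes (assumptions); the summary cites a vocabulary of roughly 131k.
K, DIM = 8, 4
rng = random.Random(0)

# Visual embedding table: one row per visual word (learnable in a real model).
table = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]

# Stand-in for the visual head's logits for a single image patch.
patch_logits = [rng.gauss(0, 1) for _ in range(K)]

probs = softmax(patch_logits)                    # probabilistic visual token
patch_embedding = expected_embedding(probs, table)  # embedding fed to the LLM
```

If `probs` were one-hot, `expected_embedding` would reduce to an ordinary table lookup, which is the sense in which visual patches are treated like textual tokens.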
Key Findings
Ovis architecture beats an otherwise-identical connector-based MLLM.
Ovis-14B outperforms the proprietary Qwen-VL-Plus on many evaluated benchmarks.
Ovis uses a very large visual vocabulary and produces sparse probabilistic tokens.
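One way to check the sparsity claim on your own model is to measure how concentrated each probabilistic token is. The sketch below is an assumption about how such a measurement could look; the summary does not state how the authors define sparsity, and the threshold `eps`, the `mass` cutoff, and both function names are hypothetical.

```python
def sparsity(probs, eps=1e-6):
    """Fraction of vocabulary entries whose probability is below eps."""
    return sum(1 for p in probs if p < eps) / len(probs)


def support_size(probs, mass=0.99):
    """Smallest number of top entries whose probabilities sum to at least mass."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= mass:
            return count
    return len(probs)


# Hypothetical token over a toy vocabulary of 8 visual words.
token = [0.7, 0.25, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0]

token_sparsity = sparsity(token)          # 5 of 8 entries are (near) zero
token_support = support_size(token, 0.9)  # top 2 entries cover 90% of the mass
```

Averaging either statistic over the patches of a held-out image set gives a single number to track while training.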
Results
Average improvement vs connector architecture
MMStar score (Ovis-Qwen1.5-14B)
Visual tokenizer sparsity
Who Should Care
What To Try In 7 Days
Prototype a replacement for your connector: a visual embedding table plus a probabilistic token head, built on a pretrained ViT and your existing LLM.
Train just the visual head and embedding table on a caption subset (stage-1 style) to gauge immediate gains.
Run your standard multimodal evaluation suite (e.g., MMBench or task-specific tests) to measure the lift before committing to a full retrain.
Reproducibility
Code Urls
Data Urls
- https://huggingface.co/datasets/AIDC-AI/Ovis-dataset
- COYO-700M (public dataset referenced)
- ShareGPT4V-Pretrain (public dataset referenced)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No high-resolution boosting, so performance on native high-resolution inputs is limited unless the model is extended with extra techniques.
- Trained only on single-image samples — not validated for multi-image or cross-image reasoning.
- Model still vulnerable to hallucination and biases common to generative MLLMs.
When Not To Use
- When your application needs native multi-image fusion or cross-image reasoning without further engineering.
- If your workload requires native high-resolution vision and you cannot add high-res techniques.
- If you cannot afford the memory/compute for a very large visual vocabulary (K≈131k) or its indexing.
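To size the visual-vocabulary cost mentioned above, a back-of-the-envelope estimate of the embedding table's weight memory may help. The numbers here are assumptions for illustration: `K` is taken as exactly 2**17 (the summary only says K≈131k), the hidden size of 4096 is a guess at a typical LLM embedding width, and 2 bytes per parameter assumes 16-bit weights.

```python
def embedding_table_bytes(vocab_size, hidden_dim, bytes_per_param):
    """Weight memory of a dense embedding table, in bytes."""
    return vocab_size * hidden_dim * bytes_per_param


K = 131_072      # assumed exact value for K≈131k (2**17)
HIDDEN = 4096    # assumed LLM embedding width, not taken from the source
BYTES_FP16 = 2   # 16-bit weights

table_bytes = embedding_table_bytes(K, HIDDEN, BYTES_FP16)
table_gib = table_bytes / 2**30  # 1.0 GiB under these assumptions
```

This covers weights only; training would also need room for gradients and optimizer state on the table, and the visual head that produces K logits per patch adds its own parameters.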
Failure Modes
- Hallucination: generating incorrect facts from visual prompts despite improved perception.
- Biases inherited from training data leading to unfair or harmful outputs.
- Missing rare visual patterns if the probabilistic token distribution or vocabulary fails to cover them.
Core Entities
Models
- Ovis-Qwen1.5-7B
- Ovis-Qwen1.5-14B
- Ovis-Llama3-8B
- Ovis-Llama3-8B (variant reported)
Metrics
- MMStar score
- MMBench score
- MMMU score
- MathVista-Mini score
- MME sum (perception+cognition)
- HallusionBench QuestionAcc
- RealWorldQA score
Datasets
- COYO-10M
- ShareGPT4V-Pretrain
- LLaVA-Finetune
- COYO (filtered subset)
- ImageNet-1K (sparsity test)
- In-house visual description & instruction datasets (AIDC-AI/Ovis-dataset)
Benchmarks
- MMStar
- MMBench-EN
- MMBench-CN
- MMMU
- MathVista-Mini
- MME
- HallusionBench
- RealWorldQA
Context Entities
Models
- Qwen-VL-Plus
- Qwen-VL-Max
- GPT4V
- LLaVA
- InstructBLIP
- Mini-Gemini
- Monkey
- DeepSeek-VL
Metrics
- Benchmark leaderboard scores reported in Tables 1-2
Datasets
- LAION
- CC12M
- ScienceQA
- TextVQA
Benchmarks
- MMBench
- MMMU

