Ferret-UI: a multimodal LLM tuned to find, name, and act on mobile UI elements using an 'any-resolution' image split

Overview

Decision Snapshot: Needs Validation

Ferret-UI improves grounded reasoning tied to pixel regions and reaches strong element-level accuracy; however, evaluator bias (GPT-4 preference for verbosity) and dependence on detected elements limit general-purpose deployment.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

Links

Abstract / PDF / Data

Why It Matters For Business

Specialized multimodal models like Ferret-UI give more accurate reads and grounded actions on mobile screens than generalist VLMs, reducing errors in automation, accessibility features, and UI testing workflows.

Who Should Care

ML Engineer, Data Scientist, Product Manager, Engineering Lead, CTO

Summary TLDR

Ferret-UI adapts a referring-and-grounding multimodal LLM (Ferret) to mobile screens by adding an “any-resolution” input (the screen is split into sub-images) and training on a curated mix of 250K UI examples. It improves detection, OCR, element classification, and grounded conversation on single screens. On elementary UI tasks (referring and grounding), Ferret-UI beats the base Ferret and outperforms GPT-4V on many benchmarks. On advanced, open-ended tasks it scores well on iPhone screens but trails GPT-4V, in part because the GPT-4 evaluator prefers more verbose answers. The system depends on a pixel-based detection pipeline, so UI cues the detector misses are not covered.

Problem Statement

Generalist multimodal LLMs lose small but critical UI details on mobile screens (elongated aspect ratios and tiny icons/text). The paper asks: how to adapt a referring-and-grounding MLLM to reliably detect, name, and reason about UI elements from raw pixels across phone aspect ratios?

Main Contribution

Ferret-UI: a Ferret-based MLLM fine-tuned for mobile UI tasks with referring, grounding, and reasoning abilities.

Any-resolution (anyres) input design: choose 1x2 or 2x1 grid by aspect ratio, encode sub-images separately to magnify small UI details.
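The aspect-ratio-based split described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `anyres_boxes`, the rule "portrait screens are cut into top and bottom halves, landscape screens into left and right halves", and the midpoint cut are all assumptions made for the example.

```python
def anyres_boxes(width, height):
    """Return two crop boxes (left, top, right, bottom) covering the screen.

    Hypothetical helper: the split rule below is an assumption, chosen so
    that each sub-image is closer to square than the original screenshot.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait screen: cut along the long (vertical) axis into top/bottom.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape screen: cut into left/right halves.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    # A portrait phone screenshot yields two stacked sub-images.
    for box in anyres_boxes(1170, 2532):
        print(box)
```

The boxes use the (left, top, right, bottom) order that common image libraries expect for cropping, so each sub-image can be cropped and passed to the image encoder separately.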

Key Findings

Ferret-UI substantially improves elementary referring and grounding accuracy compared to base Ferret and often surpasses GPT-4V on mobile UI primitives.

Numbers: Referring (iPhone): 82.4 vs GPT-4V 61.3; Grounding (iPhone): 81.4 vs GPT-4V 70.3

Practical Use: For tasks that need reliable element-level reads (OCR, icon/widget ID, pointing to UI parts), deploy a UI-specialized MLLM like Ferret-UI instead of a generalist VLM.

Evidence Ref: Table 2 (Elementary Tasks averages, Ref-i, Grd-i)

Adding anyres (sub-image encoding) boosts iPhone advanced-task scores by about 20 points but can reduce Android advanced-task scores when Android advanced data is absent.

Numbers: Advanced (iPhone) Ferret-UI-base 73.4 → anyres 93.9 (+20.5); Android advanced drops 80.5 → 70.1 (-10.4)

Practical Use: Use anyres when prioritizing high-fidelity single-screen iPhone understanding; validate cross-platform performance before deploying on Android-heavy fleets.

Evidence Ref: Table 3 (Advanced task comparison)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	82.4%	GPT-4V 61.3%	+21.1 pts	Elementary tasks (Table 2, Ref-i)	Ferret-UI-anyres Ref-i 82.4 vs GPT-4V 61.3 (Table 2)	Table 2
Accuracy	83.8%	GPT-4V 4.7%	+79.1 pts	Elementary tasks (Table 2, Grd-A)	Ferret-UI-anyres Grd-A 83.8 vs GPT-4V 4.7 (Table 2)	Table 2

What To Try In 7 Days

Run Ferret-UI-style sub-image (anyres) preprocessing in your UI pipeline and compare element detection accuracy on a small app set.

Add elementary referring tasks (OCR, icon and widget classification) as instruction-tuning data to a vision-language model and measure gains on element-level QA.

Audit your UI detector: measure which UI elements are consistently missed and prioritize detector fixes before LLM tuning.
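The detector audit in the last item can start from a simple per-type miss rate. The sketch below assumes you already have labeled records of the form `(element_type, detected)`; the record format, the function name `miss_rates`, and the example element types are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def miss_rates(records):
    """Return (element_type, miss_rate) pairs, worst-detected type first.

    `records` is an iterable of (element_type, detected) tuples, where
    `detected` is True when the UI detector found that labeled element.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for element_type, detected in records:
        totals[element_type] += 1
        if not detected:
            misses[element_type] += 1
    rates = {t: misses[t] / totals[t] for t in totals}
    # Sort so the element types the detector misses most often come first.
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = [("icon", True), ("icon", False), ("text", True), ("toggle", False)]
    for element_type, rate in miss_rates(sample):
        print(f"{element_type}: {rate:.0%} missed")
```

Element types at the top of the list are the ones to fix in the detector before spending effort on LLM tuning.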

Agent Features

Frameworks

Ferret

CLIP-ViT-L/14 encoder + Vicuna-style decoder

Architectures

Ferret-based MLLM

Any-resolution (1x2 or 2x1) sub-image encoding

Hybrid region continuous features + full-image features

Optimization Features

System Optimization

Anyres trades compute for finer visual detail (3× more train time for anyres)

Training Optimization

Vision encoder frozen during fine-tuning

Only the decoder and projection layers are updated

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

RICO dataset (cited as [11])

AMP dataset (cited as [58])

Spotlight public tasks (cited as [30])

Risks & Boundaries

Limitations

Relies on a pixel-based UI detector; elements missed by the detector are not learned or described.

Anyres tuning improved iPhone performance but reduced Android advanced-task scores when Android advanced data were absent.

When Not To Use

If you need design-level judgments (color theme, aesthetic) beyond detected elements.

For multi-screen navigation or long-horizon task planning without extra agent logic.

Failure Modes

Composite widgets are classified as their largest sub-element rather than the whole (e.g., button seen as picture).

Small or nearby text may be read as neighboring text instead of the targeted region.

Core Entities

Models

Ferret-UI-anyres, Ferret-UI-base, Ferret, GPT-4V, Fuyu, CogAgent, Spotlight

Metrics

CIDEr, F1, Accuracy, IoU (>0.5 threshold), GPT-4 scoring for open-ended responses
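The IoU (>0.5 threshold) metric listed above can be illustrated with a short sketch. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and that each prediction is paired with exactly one target box; the function names `iou` and `grounding_accuracy` are illustrative, and the paper's exact evaluation protocol may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predictions, targets, threshold=0.5):
    """Fraction of paired boxes whose IoU exceeds the threshold."""
    if not targets:
        return 0.0
    hits = sum(1 for p, t in zip(predictions, targets) if iou(p, t) > threshold)
    return hits / len(targets)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    golds = [(0, 0, 10, 10), (5, 0, 15, 10)]
    print(grounding_accuracy(preds, golds))
```

Under this sketch a prediction counts as correct only when it overlaps the target by more than half of their combined area, which is why a box shifted by half its width is scored as a miss.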

Datasets

RICO

AMP (iPhone subset)

Spotlight tasks (screen2words, widgetcaptions, taperception)

Curated Ferret-UI training mixture (250K samples)

Benchmarks

Ferret-UI 14-task mobile UI benchmark

Spotlight public benchmark

