Ferret-UI: a multimodal LLM tuned to find, name, and act on mobile UI elements using an 'any-resolution' image split

April 8, 2024 | 9 min

Overview

Decision Snapshot: Needs Validation

Ferret-UI improves grounded reasoning tied to pixel regions and reaches strong element-level accuracy; however, evaluator bias (GPT-4 preference for verbosity) and dependence on detected elements limit general-purpose deployment.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

Links

Abstract / PDF / Data

Why It Matters For Business

Specialized multimodal models like Ferret-UI give more accurate reads and grounded actions on mobile screens than generalist VLMs, reducing errors in automation, accessibility features, and UI testing workflows.

Who Should Care

Summary TLDR

Ferret-UI adapts a referring-and-grounding multimodal LLM (Ferret) to mobile screens by adding an “any-resolution” input (split into sub-images) and training on a curated mix of 250K UI examples. It improves detection, OCR, element classification, and grounded conversation on single screens. On elementary UI tasks (referring and grounding) Ferret-UI beats the base Ferret and outperforms GPT-4V on many benchmarks; on advanced, open-ended tasks it scores well on iPhone screens but lags GPT-4V in evaluator-preferred verbosity. The system depends on a pixel-based detection pipeline, which limits coverage of undetected UI cues.

Problem Statement

Generalist multimodal LLMs lose small but critical UI details on mobile screens (elongated aspect ratios and tiny icons/text). The paper asks: how to adapt a referring-and-grounding MLLM to reliably detect, name, and reason about UI elements from raw pixels across phone aspect ratios?

Main Contribution

Ferret-UI: a Ferret-based MLLM fine-tuned for mobile UI tasks with referring, grounding, and reasoning abilities.

Any-resolution (anyres) input design: choose 1x2 or 2x1 grid by aspect ratio, encode sub-images separately to magnify small UI details.
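The grid choice described above can be sketched as a small helper that returns crop boxes for the two sub-images. This is a minimal illustration, not the authors' implementation: the function name `anyres_boxes`, the rule that portrait screens are split into top and bottom halves while landscape screens are split into left and right halves, and the handling of square images are all assumptions made for the example.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def anyres_boxes(width: int, height: int) -> List[Box]:
    """Return crop boxes for two sub-images, chosen by aspect ratio.

    Hypothetical helper: the split direction per orientation is an
    assumption for illustration, not taken from the paper's code.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait (square is treated as portrait here, by assumption):
        # one column, two rows, i.e. a top half and a bottom half.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape: two columns, one row, i.e. a left half and a right half.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


portrait = anyres_boxes(1170, 2532)
landscape = anyres_boxes(2532, 1170)
```

The boxes use the (left, upper, right, lower) ordering that Pillow's `Image.crop` accepts, so each one could be passed to an image library to produce the sub-image that is then encoded separately.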

Key Findings

Ferret-UI substantially improves elementary referring and grounding accuracy compared to base Ferret and often surpasses GPT-4V on mobile UI primitives.

Numbers: Referring (iPhone): 82.4 vs GPT-4V 61.3; Grounding (iPhone): 81.4 vs GPT-4V 70.3

Practical Use: For tasks that need reliable element-level reads (OCR, icon/widget ID, pointing to UI parts), deploy a UI-specialized MLLM like Ferret-UI instead of a generalist VLM.

Evidence Ref: Table 2 (Elementary Tasks averages, Ref-i, Grd-i)

Adding anyres (sub-image encoding) boosts iPhone advanced-task scores by about 20 points but can reduce Android advanced-task scores when Android advanced data is absent.

Numbers: Advanced (iPhone) Ferret-UI-base 73.4 → anyres 93.9 (+20.5); Android advanced drops 80.5 → 70.1 (-10.4)

Practical Use: Use anyres when prioritizing high-fidelity single-screen iPhone understanding; validate cross-platform performance before deploying on Android-heavy fleets.

Evidence Ref: Table 3 (Advanced task comparison)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 82.4% | GPT-4V 61.3% | +21.1 pts | Elementary tasks (Table 2, Ref-i) | Ferret-UI-anyres Ref-i 82.4 vs GPT-4V 61.3 (Table 2) | Table 2
Accuracy | 83.8% | GPT-4V 4.7% | +79.1 pts | Elementary tasks (Table 2, Grd-A) | Ferret-UI-anyres Grd-A 83.8 vs GPT-4V 4.7 (Table 2) | Table 2

What To Try In 7 Days

Run Ferret-UI-style sub-image (anyres) preprocessing in your UI pipeline and compare element detection accuracy on a small app set.

Add elementary referring tasks (OCR, icon and widget classification) as instruction-tuning data to a vision-language model and measure gains on element-level QA.

Audit your UI detector: measure which UI elements are consistently missed and prioritize detector fixes before LLM tuning.
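The detector audit in the last item can start from a simple per-type miss rate. The sketch below assumes you already have, for each annotated element, its type and a flag saying whether the detector found it; the function name `miss_rate_by_type` and the record format are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def miss_rate_by_type(records: Iterable[Tuple[str, bool]]) -> List[Tuple[str, float]]:
    """Rank element types by the fraction of instances the detector missed.

    Each record is (element_type, was_detected). This record format is an
    assumption for illustration, not part of any Ferret-UI tooling.
    """
    total: Counter = Counter()
    missed: Counter = Counter()
    for element_type, was_detected in records:
        total[element_type] += 1
        if not was_detected:
            missed[element_type] += 1
    rates = [(t, missed[t] / total[t]) for t in total]
    # Worst-covered types first; ties are broken alphabetically.
    return sorted(rates, key=lambda item: (-item[1], item[0]))


audit = miss_rate_by_type([
    ("icon", True),
    ("icon", False),
    ("text", True),
    ("toggle", False),
])
for element_type, rate in audit:
    print(f"{element_type}: {rate:.0%} missed")
```

Types at the top of the ranking are the ones to fix in the detector first, since the model has no training signal for elements the detector never surfaces.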

Agent Features

Frameworks
Ferret
CLIP-ViT-L/14 encoder + Vicuna-style decoder
Architectures
Ferret-based MLLM
any-resolution (1x2 or 2x1) sub-image encoding
Hybrid region continuous features + full-image features

Optimization Features

System Optimization
Anyres trades compute for finer visual detail (3× more train time for anyres)
Training Optimization
Vision encoder frozen during fine-tuning
Decoder and projection layers updated only

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

RICO dataset (cited as [11])
AMP dataset (cited as [58])
Spotlight public tasks (cited as [30])

Risks & Boundaries

Limitations

Relies on a pixel-based UI detector; elements missed by the detector are not learned or described.

Anyres tuning improved iPhone performance but reduced Android advanced-task scores when Android advanced data were absent.

When Not To Use

If you need design-level judgments (color theme, aesthetic) beyond detected elements.

For multi-screen navigation or long-horizon task planning without extra agent logic.

Failure Modes

Composite widgets are classified as their largest sub-element rather than the whole (e.g., button seen as picture).

Small or nearby text may be read as neighboring text instead of the targeted region.

Core Entities

Models

Ferret-UI-anyres
Ferret-UI-base
Ferret
GPT-4V
Fuyu
CogAgent
Spotlight

Metrics

CIDEr
F1
Accuracy
IoU (>0.5 threshold)
GPT-4 scoring for open-ended responses
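For readers unfamiliar with the IoU criterion listed above, the following sketch shows one common way to compute it for axis-aligned boxes and apply the 0.5 threshold. The function names and the (left, top, right, bottom) box format are assumptions for the example; the paper's own evaluation code may differ in details such as coordinate normalization.

```python
from typing import Tuple

# (left, top, right, bottom)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height are clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_grounding(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count a predicted box as correct when its IoU exceeds the threshold."""
    return iou(pred, truth) > threshold


close_match = is_correct_grounding((0, 0, 10, 10), (2, 0, 12, 10))  # IoU = 80/120
loose_match = is_correct_grounding((0, 0, 10, 10), (5, 0, 15, 10))  # IoU = 50/150
```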

Datasets

RICO
AMP (iPhone subset)
Spotlight tasks (screen2words, widgetcaptions, taperception)
Curated Ferret-UI training mixture (250K samples)

Benchmarks

Ferret-UI 14-task mobile UI benchmark
Spotlight public benchmark