Overview
Ferret-UI improves grounded reasoning tied to pixel regions and reaches strong element-level accuracy; however, evaluator bias (GPT-4 preference for verbosity) and dependence on detected elements limit general-purpose deployment.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 45%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
Specialized multimodal models like Ferret-UI give more accurate reads and grounded actions on mobile screens than generalist VLMs, reducing errors in automation, accessibility features, and UI testing workflows.
Summary TLDR
Ferret-UI adapts a referring-and-grounding multimodal LLM (Ferret) to mobile screens by adding an “any-resolution” input (the screen is split into sub-images) and training on a curated mix of 250K UI examples. It improves detection, OCR, element classification, and grounded conversation on single screens. On elementary UI tasks (referring and grounding), Ferret-UI beats the base Ferret and outperforms GPT-4V on many benchmarks. On advanced, open-ended tasks it scores well on iPhone screens but lags GPT-4V, in part because the GPT-4 evaluator prefers more verbose answers. The system depends on a pixel-based detection pipeline, so UI cues the detector misses are not covered.
Problem Statement
Generalist multimodal LLMs lose small but critical UI details on mobile screens (elongated aspect ratios and tiny icons/text). The paper asks: how to adapt a referring-and-grounding MLLM to reliably detect, name, and reason about UI elements from raw pixels across phone aspect ratios?
Main Contribution
Ferret-UI: a Ferret-based MLLM fine-tuned for mobile UI tasks with referring, grounding, and reasoning abilities.
Any-resolution (anyres) input design: choose 1x2 or 2x1 grid by aspect ratio, encode sub-images separately to magnify small UI details.
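The anyres split can be sketched as a small helper that picks two crop boxes from the screen's aspect ratio. This is a minimal illustration, not the authors' code: the function name `anyres_boxes`, the even half-split, and the choice to cut along the longer side (portrait screens into top and bottom halves, landscape screens into left and right halves) are assumptions made here. Each resulting crop would then be passed to the image encoder separately.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def anyres_boxes(width: int, height: int) -> List[Box]:
    """Return two crop boxes chosen by aspect ratio (hypothetical helper).

    Assumption: the screen is cut in half along its longer side, so each
    sub-image is closer to square and small elements appear larger
    after resizing to the encoder's input size.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait: top half and bottom half.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape: left half and right half.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


portrait = anyres_boxes(1170, 2532)
landscape = anyres_boxes(2532, 1170)
```

The boxes use the (left, top, right, bottom) convention, so they can be passed directly to a typical image-cropping call.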
Key Findings
Ferret-UI substantially improves elementary referring and grounding accuracy compared to base Ferret and often surpasses GPT-4V on mobile UI primitives.
Adding anyres (sub-image encoding) boosts iPhone advanced-task scores by about 20 points but can reduce Android advanced-task scores when Android advanced data is absent.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 82.4% | GPT-4V 61.3% | +21.1 pts | Elementary tasks (Table 2, Ref-i) | Ferret-UI-anyres Ref-i 82.4 vs GPT-4V 61.3 (Table 2) | Table 2 |
| Accuracy | 83.8% | GPT-4V 4.7% | +79.1 pts | Elementary tasks (Table 2, Grd-A) | Ferret-UI-anyres Grd-A 83.8 vs GPT-4V 4.7 (Table 2) | Table 2 |
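The Delta column is an absolute difference in percentage points, not a relative improvement. A short check of that arithmetic, using only the values in the table (the helper name `delta_points` is illustrative):

```python
def delta_points(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


# (split, Ferret-UI-anyres accuracy %, GPT-4V accuracy %) from the table above
rows = [("Ref-i", 82.4, 61.3), ("Grd-A", 83.8, 4.7)]
deltas = {split: delta_points(value, baseline) for split, value, baseline in rows}
```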
What To Try In 7 Days
Run Ferret-UI-style sub-image (anyres) preprocessing in your UI pipeline and compare element detection accuracy on a small app set.
Add elementary referring tasks (OCR, icon and widget classification) as instruction-tuning data to a vision-language model and measure gains on element-level QA.
Audit your UI detector: measure which UI elements are consistently missed and prioritize detector fixes before LLM tuning.
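For the detector audit, one possible starting point is a per-element-type miss rate computed by matching labeled boxes against detector output with intersection-over-union (IoU). The sketch below is an assumption-laden illustration: the function names, the `(type, box)` annotation format, and the 0.5 IoU threshold are all choices made here, not details from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def miss_rate_by_type(
    ground_truth: List[Tuple[str, Box]],
    detections: List[Box],
    threshold: float = 0.5,  # assumed cutoff; tune for your data
) -> Dict[str, float]:
    """Fraction of labeled elements of each type with no overlapping detection.

    Simplification: a detection may match more than one labeled element;
    a stricter audit would enforce one-to-one matching.
    """
    total: Counter = Counter()
    missed: Counter = Counter()
    for kind, gt_box in ground_truth:
        total[kind] += 1
        if not any(iou(gt_box, det) >= threshold for det in detections):
            missed[kind] += 1
    return {kind: missed[kind] / total[kind] for kind in total}


# Toy example: two icons and one text label, with one icon undetected.
labels = [
    ("icon", (0, 0, 10, 10)),
    ("icon", (100, 100, 110, 110)),
    ("text", (0, 20, 50, 30)),
]
detected = [(0, 0, 10, 10), (0, 20, 50, 31)]
rates = miss_rate_by_type(labels, detected)
```

Sorting the resulting dictionary by miss rate gives a rough priority list of element types to fix in the detector before any LLM tuning.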
Risks & Boundaries
Limitations
Relies on a pixel-based UI detector; elements missed by the detector are not learned or described.
Anyres tuning improved iPhone performance but reduced Android advanced-task scores when Android advanced data were absent.
When Not To Use
If you need design-level judgments (color theme, aesthetic) beyond detected elements.
For multi-screen navigation or long-horizon task planning without extra agent logic.
Failure Modes
Composite widgets are classified as their largest sub-element rather than as the whole (e.g., a button is labeled as a picture).
Small or nearby text may be read as neighboring text instead of the targeted region.

