Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.45
Citation Count
3
Why It Matters For Business
Specialized multimodal models like Ferret-UI give more accurate reads and grounded actions on mobile screens than generalist VLMs, reducing errors in automation, accessibility features, and UI testing workflows.
Summary TLDR
Ferret-UI adapts a referring-and-grounding multimodal LLM (Ferret) to mobile screens by adding an “any-resolution” input (each screen is split into sub-images) and training on a curated mix of 250K UI examples. It improves detection, OCR, element classification, and grounded conversation on single screens. On elementary UI tasks (referring and grounding), Ferret-UI beats the base Ferret and outperforms GPT-4V on many benchmarks. On advanced, open-ended tasks it scores well on iPhone screens but trails GPT-4V, whose more verbose answers the evaluator tends to prefer. The system depends on a pixel-based detection pipeline, so UI cues the detector misses are not covered.
Problem Statement
Generalist multimodal LLMs lose small but critical UI details on mobile screens, which have elongated aspect ratios and tiny icons and text. The paper asks how to adapt a referring-and-grounding MLLM so that it reliably detects, names, and reasons about UI elements from raw pixels across phone aspect ratios.
Main Contribution
Ferret-UI: a Ferret-based MLLM fine-tuned for mobile UI tasks with referring, grounding, and reasoning abilities.
Any-resolution (anyres) input design: choose a 1x2 or 2x1 grid based on the screen's aspect ratio, then encode each sub-image separately to magnify small UI details.
A curated training mixture (250K instruction-formatted samples) that covers elementary tasks (OCR, icon/widget classification, find/list) and advanced tasks (detailed description, grounded conversations, function inference).
A comprehensive 14-task benchmark covering Spotlight tasks plus 11 elementary/advanced UI tasks on iPhone and Android for evaluation.
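The anyres split described above can be sketched as a small helper that computes crop boxes from the screenshot size. This is a minimal sketch, not the paper's implementation: the function name `anyres_crop_boxes` is hypothetical, and it assumes portrait screens are cut into top and bottom halves and landscape screens into left and right halves. Each box would then be cropped and passed to the image encoder alongside the full image.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def anyres_crop_boxes(width: int, height: int) -> List[Box]:
    """Return two sub-image crop boxes chosen by aspect ratio.

    Assumption (hypothetical helper): portrait or square screens are
    split into top/bottom halves, landscape screens into left/right halves.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait: cut along the horizontal midline.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape: cut along the vertical midline.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


# Example: a portrait phone screenshot of 1170 x 2532 pixels.
portrait_boxes = anyres_crop_boxes(1170, 2532)
```

Splitting along the longer side keeps each sub-image closer to square, so fewer pixels are lost when it is resized to the encoder's fixed input resolution.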
Key Findings
Ferret-UI substantially improves elementary referring and grounding accuracy compared to base Ferret and often surpasses GPT-4V on mobile UI primitives.
Adding anyres (sub-image encoding) boosts iPhone advanced-task scores by about 20 points but can reduce Android advanced-task scores when Android advanced data is absent.
Training data and compute: the model is trained on ~250K instruction samples; Ferret-UI-base trains in ~1 day and Ferret-UI-anyres in ~3 days on 8 A100 GPUs.
Model performance is constrained by the underlying UI detection pipeline; undetected elements cannot be learned or described.
Results
Accuracy
Accuracy
Advanced tasks average (iPhone)
Training mixture size: ~250K instruction samples
Training time (8 × A100): ~1 day (Ferret-UI-base), ~3 days (Ferret-UI-anyres)
Who Should Care
What To Try In 7 Days
Run Ferret-UI-style sub-image (anyres) preprocessing in your UI pipeline and compare element detection accuracy on a small app set.
Add elementary referring tasks (OCR, icon and widget classification) as instruction-tuning data to a vision-language model and measure gains on element-level QA.
Audit your UI detector: measure which UI elements are consistently missed and prioritize detector fixes before LLM tuning.
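For the detector audit, one simple starting point is a per-type miss rate. The sketch below assumes you already have, for each labeled UI element, its type and whether your detector found it; the record format and the function name `miss_rate_by_type` are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def miss_rate_by_type(records: Iterable[Tuple[str, bool]]) -> List[Tuple[str, float]]:
    """Compute the fraction of missed elements per UI element type.

    Each record is (element_type, was_detected). The result is sorted
    from most-missed to least-missed, with ties broken alphabetically.
    """
    total: Counter = Counter()
    missed: Counter = Counter()
    for element_type, detected in records:
        total[element_type] += 1
        if not detected:
            missed[element_type] += 1
    rates = [(t, missed[t] / total[t]) for t in total]
    return sorted(rates, key=lambda item: (-item[1], item[0]))


# Hypothetical audit records from a small app set.
audit = miss_rate_by_type(
    [("icon", False), ("icon", True), ("text", True), ("toggle", False)]
)
```

Element types at the top of the list are the ones the model will never see during tuning, so they are the first candidates for detector fixes.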
Agent Features
Frameworks
- Ferret
- CLIP-ViT-L/14 encoder + Vicuna-style decoder
Architectures
- Ferret-based MLLM
- any-resolution (1x2 or 2x1) sub-image encoding
- Hybrid region continuous features + full-image features
Optimization Features
System Optimization
- Anyres trades compute for finer visual detail (about 3× the training time of the base model)
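Using the training times reported above (about 1 day for Ferret-UI-base and about 3 days for Ferret-UI-anyres on 8 A100 GPUs), the compute gap can be expressed in GPU-days. This is a back-of-the-envelope sketch; the helper name `gpu_days` is hypothetical and the inputs are the approximate figures from the summary.

```python
def gpu_days(num_gpus: int, days: float) -> float:
    """Total compute as number of GPUs times wall-clock days."""
    return num_gpus * days


base_cost = gpu_days(8, 1.0)      # Ferret-UI-base, approximate
anyres_cost = gpu_days(8, 3.0)    # Ferret-UI-anyres, approximate
cost_ratio = anyres_cost / base_cost
```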
Training Optimization
- Vision encoder frozen during fine-tuning
- Only the decoder and projection layers are updated
Reproducibility
Data Urls
- RICO dataset (cited as [11])
- AMP dataset (cited as [58])
- Spotlight public tasks (cited as [30])
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relies on a pixel-based UI detector; elements missed by the detector are not learned or described.
- Anyres tuning improved iPhone performance but reduced Android advanced-task scores when Android advanced data were absent.
- The paper makes no claim that the curated Ferret-UI training corpus or code will be released.
- Model focuses on single-screen understanding; multi-step navigation or cross-screen state tracking is not addressed.
When Not To Use
- If you need design-level judgments (color theme, aesthetic) beyond detected elements.
- For multi-screen navigation or long-horizon task planning without extra agent logic.
- If your production stack cannot support the extra compute/storage cost of anyres-like feature extraction.
Failure Modes
- Composite widgets are classified as their largest sub-element rather than the whole (e.g., a button classified as a picture).
- Small or nearby text may be read as neighboring text instead of the targeted region.
- Small icons surrounded by text can be misclassified as text in lower-resolution inputs.
- Grounding boxes can be imprecise; IoU errors occur on tightly packed UI regions.
Core Entities
Models
- Ferret-UI-anyres
- Ferret-UI-base
- Ferret
- GPT-4V
- Fuyu
- CogAgent
- Spotlight
Metrics
- CIDEr
- F1
- Accuracy
- IoU (>0.5 threshold)
- GPT-4 scoring for open-ended responses
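The IoU criterion listed above can be sketched as follows: a predicted box counts as correct when its intersection-over-union with the ground-truth box exceeds 0.5. This is a minimal illustration; the function names and the (left, top, right, bottom) box format are assumptions, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the paired ground truth exceeds the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(preds)
```

Because the threshold is fixed, a box that is only slightly offset on a tightly packed screen can drop below 0.5 and count as a miss, which is consistent with the grounding failure mode noted earlier.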
Datasets
- RICO
- AMP (iPhone subset)
- Spotlight tasks (screen2words, widgetcaptions, taperception)
- Curated Ferret-UI training mixture (250K samples)
Benchmarks
- Ferret-UI 14-task mobile UI benchmark
- Spotlight public benchmark

