Ferret-UI: a multimodal LLM tuned to find, name, and act on mobile UI elements using an 'any-resolution' image split

April 8, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

3

Authors

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

Links

Abstract / PDF

Why It Matters For Business

Specialized multimodal models like Ferret-UI read mobile screens more accurately and ground actions more reliably than generalist VLMs, reducing errors in automation, accessibility features, and UI testing workflows.

Summary TLDR

Ferret-UI adapts Ferret, a referring-and-grounding multimodal LLM, to mobile screens by adding an “any-resolution” input (the screen is split into sub-images) and training on a curated mix of 250K UI examples. It improves detection, OCR, element classification, and grounded conversation on single screens. On elementary UI tasks (referring and grounding), Ferret-UI beats the base Ferret and outperforms GPT-4V on many benchmarks. On advanced, open-ended tasks it scores well on iPhone screens but still trails GPT-4V, whose more verbose answers the evaluator tends to prefer. The system depends on a pixel-based detection pipeline, so UI cues the detector misses are not covered.

Problem Statement

Generalist multimodal LLMs lose small but critical UI details on mobile screens because of their elongated aspect ratios and tiny icons and text. The paper asks: how can a referring-and-grounding MLLM be adapted to reliably detect, name, and reason about UI elements from raw pixels across phone aspect ratios?

Main Contribution

Ferret-UI: a Ferret-based MLLM fine-tuned for mobile UI tasks with referring, grounding, and reasoning abilities.

Any-resolution (anyres) input design: split the screen into a 1x2 or 2x1 grid chosen by aspect ratio, and encode the sub-images separately so that small UI details are magnified.

A curated training mixture (250K instruction-formatted samples) that covers elementary tasks (OCR, icon/widget classification, find/list) and advanced tasks (detailed description, grounded conversations, function inference).

A comprehensive 14-task benchmark covering Spotlight tasks plus 11 elementary/advanced UI tasks on iPhone and Android for evaluation.
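The anyres split described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the mapping of orientation to grid (portrait screens cut into top and bottom halves, landscape screens into left and right halves, ties treated as portrait) is an assumption about how "1x2 or 2x1 by aspect ratio" is applied.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def choose_grid(width: int, height: int) -> Tuple[int, int]:
    """Return (columns, rows) for the sub-image grid.

    Assumption: portrait (or square) screens get 1 column x 2 rows,
    landscape screens get 2 columns x 1 row.
    """
    return (1, 2) if height >= width else (2, 1)


def sub_image_boxes(width: int, height: int) -> List[Box]:
    """Compute crop boxes that tile the screen according to the chosen grid."""
    cols, rows = choose_grid(width, height)
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


portrait = sub_image_boxes(1170, 2532)
landscape = sub_image_boxes(2532, 1170)
```

Each box uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so the crops could be produced with `image.crop(box)` before each sub-image is resized and encoded separately.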

Key Findings

Ferret-UI substantially improves elementary referring and grounding accuracy compared to base Ferret and often surpasses GPT-4V on mobile UI primitives.

Numbers: Referring (iPhone): 82.4 vs GPT-4V 61.3; Grounding (iPhone): 81.4 vs GPT-4V 70.3

Adding anyres (sub-image encoding) boosts iPhone advanced-task scores by about 20 points but can reduce Android advanced-task scores when Android advanced data is absent.

Numbers: Advanced (iPhone): Ferret-UI-base 73.4 → anyres 93.9 (+20.5); Advanced (Android): 80.5 → 70.1 (-10.4)

Training mixture size and compute: model trained on ~250K instruction samples; Ferret-UI-base trains ~1 day and Ferret-UI-anyres ~3 days on 8 A100 GPUs.

Numbers: 250K samples; train time 1 day (base) vs 3 days (anyres) on 8 A100s

Model performance is constrained by the underlying UI detection pipeline; undetected elements cannot be learned or described.

Numbers: Examples and analysis show that undetected elements (such as the time, WiFi, and battery indicators) and design cues are not learned.

Results

Accuracy

Value: 82.4%

Baseline: GPT-4V 61.3%

Accuracy

Value: 83.8%

Baseline: GPT-4V 4.7%

Advanced tasks average (iPhone)

Value: 93.9 (percent of GPT-4 scoring)

Baseline: Ferret-UI-base 73.4

Training mixture size

Value: 250K samples

Training time (8 × A100)

Value: Ferret-UI-base: ~1 day; anyres: ~3 days
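For budgeting, the training times above can be turned into GPU-days. The sketch below is plain arithmetic on the approximate figures reported here (about 1 day and about 3 days on 8 A100s); the helper name is hypothetical and the results are only as precise as those rounded inputs.

```python
def gpu_days(num_gpus: int, days: float) -> float:
    """Total accelerator time, in GPU-days."""
    return num_gpus * days


# Approximate figures from the summary: 8 x A100, ~1 day vs ~3 days.
base_cost = gpu_days(8, 1)     # Ferret-UI-base
anyres_cost = gpu_days(8, 3)   # Ferret-UI-anyres
ratio = anyres_cost / base_cost

print(f"base: {base_cost} GPU-days, anyres: {anyres_cost} GPU-days, ratio: {ratio:.1f}x")
```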

Who Should Care

What To Try In 7 Days

Run Ferret-UI-style sub-image (anyres) preprocessing in your UI pipeline and compare element detection accuracy on a small app set.

Add elementary referring tasks (OCR, icon and widget classification) as instruction-tuning data to a vision-language model and measure gains on element-level QA.

Audit your UI detector: measure which UI elements are consistently missed and prioritize detector fixes before LLM tuning.
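The detector audit in the last item can start as a per-type miss-rate table. The sketch below assumes you already have ground-truth annotations as `(element_id, element_type)` pairs and the set of element IDs your detector matched; the function name, the data layout, and the example type labels are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def miss_rate_by_type(
    ground_truth: Iterable[Tuple[str, str]],
    detected_ids: Set[str],
) -> List[Tuple[str, float]]:
    """Return (element_type, miss_rate) pairs, worst-covered types first."""
    total: Counter = Counter()
    missed: Counter = Counter()
    for element_id, element_type in ground_truth:
        total[element_type] += 1
        if element_id not in detected_ids:
            missed[element_type] += 1
    rates = {t: missed[t] / total[t] for t in total}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: the detector found elements "a" and "c" only.
annotations = [("a", "icon"), ("b", "icon"), ("c", "text"), ("d", "status_bar")]
report = miss_rate_by_type(annotations, {"a", "c"})
```

Element types at the top of the report are the ones a downstream model never gets to see, so they are the natural place to spend detector effort first.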

Agent Features

Frameworks

  • Ferret
  • CLIP-ViT-L/14 encoder + Vicuna-style decoder

Architectures

  • Ferret-based MLLM
  • any-resolution (1x2 or 2x1) sub-image encoding
  • Hybrid region continuous features + full-image features

Optimization Features

System Optimization

  • Anyres trades compute for finer visual detail (roughly 3× the training time of the base model)

Training Optimization

  • Vision encoder frozen during fine-tuning
  • Only the decoder and projection layers are updated

Reproducibility

Data Urls

  • RICO dataset (cited as [11])
  • AMP dataset (cited as [58])
  • Spotlight public tasks (cited as [30])

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on a pixel-based UI detector; elements missed by the detector are not learned or described.
  • Anyres tuning improved iPhone performance but reduced Android advanced-task scores when Android advanced data were absent.
  • The paper does not state that the curated Ferret-UI training corpus or code will be released.
  • Model focuses on single-screen understanding; multi-step navigation or cross-screen state tracking is not addressed.

When Not To Use

  • If you need design-level judgments (color theme, aesthetic) beyond detected elements.
  • For multi-screen navigation or long-horizon task planning without extra agent logic.
  • If your production stack cannot support the extra compute/storage cost of anyres-like feature extraction.

Failure Modes

  • Composite widgets are classified as their largest sub-element rather than as the whole widget (e.g., a button classified as a picture).
  • Small or nearby text may be read as neighboring text instead of the targeted region.
  • Small icons surrounded by text can be misclassified as text in lower-resolution inputs.
  • Grounding boxes can be imprecise; IoU errors occur on tightly packed UI regions.

Core Entities

Models

  • Ferret-UI-anyres
  • Ferret-UI-base
  • Ferret
  • GPT-4V
  • Fuyu
  • CogAgent
  • Spotlight

Metrics

  • CIDEr
  • F1
  • Accuracy
  • IoU (>0.5 threshold)
  • GPT-4 scoring for open-ended responses
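The IoU criterion in the list above can be made concrete as follows. This is a minimal sketch: boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels, a prediction counts as correct when its IoU with the ground-truth box exceeds 0.5, and the function names are hypothetical rather than taken from the paper's evaluation code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the paired target exceeds the threshold."""
    if not targets:
        return 0.0
    hits = sum(iou(p, t) > threshold for p, t in zip(preds, targets))
    return hits / len(targets)


same = iou((0, 0, 10, 10), (0, 0, 10, 10))
shifted = iou((0, 0, 10, 10), (5, 0, 15, 10))
acc = grounding_accuracy(
    [(0, 0, 10, 10), (0, 0, 10, 10)],
    [(0, 0, 10, 10), (5, 0, 15, 10)],
)
```

In the toy example the second prediction overlaps its target by only one third, so it falls below the threshold and the accuracy is 0.5. On tightly packed UI regions, small offsets like this are enough to turn a near-hit into a miss.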

Datasets

  • RICO
  • AMP (iPhone subset)
  • Spotlight tasks (screen2words, widgetcaptions, taperception)
  • Curated Ferret-UI training mixture (250K samples)

Benchmarks

  • Ferret-UI 14-task mobile UI benchmark
  • Spotlight public benchmark