Overview
The paper reports strong, multi-pronged evidence (expert annotations, ablations, and a user study), but its evaluation is limited to English-language posts on X, and the system depends on commercial LLMs and MBFC ratings.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
MUSE can automate credible, auditable misinformation corrections at scale, reducing dependence on slow, expensive human fact-checking and improving user belief accuracy for platforms and publishers.
Summary TLDR
MUSE combines a large language model with image captioning and credibility-aware web retrieval to identify and explain misinformation on social media, backing its corrections with grounded links. In an expert evaluation on X Community Notes posts, MUSE outperforms GPT-4 (mean quality 8.1 vs 5.9) and high-helpfulness lay responses, reduces hallucinated links, works for both images and text, and improves end users' ability to spot misinformation by 9.8%. Runtime is about 2 minutes per post, and cost is ≈ $0.5/post (now ≈ $0.2).
Problem Statement
Social media posts often mix true, false, and misleading elements and can include images. Manual corrections scale poorly. Off-the-shelf LLMs are fluent but hallucinate, lack current knowledge, and struggle with images. We need a practical system that finds timely, credible evidence, handles visuals, and produces auditable corrections.
Main Contribution
MUSE, a system that augments an LLM with informative image captioning and credibility-aware web retrieval to produce grounded corrections.
A 13-dimension expert evaluation rubric covering identification, explanation, text quality, and references.
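The credibility-aware retrieval step can be illustrated with a minimal sketch. It assumes a lookup table from publisher domain to a credibility rating; the table contents, the rating labels, and the `filter_credible` helper are all hypothetical and are not taken from the paper. The sketch also shows one way to make the handling of unrated publishers an explicit choice.

```python
from urllib.parse import urlparse

# Hypothetical credibility table (assumption): in practice the ratings would
# come from a publisher-rating source such as MBFC.
CREDIBILITY = {
    "trusted-news.example": "high",
    "tabloid.example": "low",
}


def domain_of(url):
    """Return the host of a URL without a leading 'www.' prefix."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_credible(urls, ratings=CREDIBILITY, allowed=("high",), keep_unrated=False):
    """Keep URLs whose publisher rating is in `allowed`.

    Unrated publishers are dropped unless keep_unrated is True.
    """
    kept = []
    for url in urls:
        rating = ratings.get(domain_of(url))
        if rating is None:
            if keep_unrated:
                kept.append(url)
        elif rating in allowed:
            kept.append(url)
    return kept
```

Dropping unrated publishers by default is the conservative choice; setting `keep_unrated=True` trades precision for coverage when the rating table is sparse.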
Key Findings
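MUSE achieves higher overall expert-rated quality (mean 8.1) than GPT-4 (5.9) and high-helpfulness lay responses (6.3).
MUSE identifies and explains inaccuracies more often and more completely, with an explicit identification rate of 89%, 16 percentage points above GPT-4.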
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Expert overall quality (mean ± SD) | MUSE 8.1 ± 2.0 | GPT-4 5.9 ± 2.7; lay (high helpfulness) 6.3 ± 2.0 | MUSE +37% vs GPT-4; +29% vs lay (high helpfulness) | 232 tweets; expert ratings | Fig.2a; Methods | Fig.2a |
| Explicit identification rate | MUSE 89% | GPT-4 lower by 16 percentage points | MUSE +16pp vs GPT-4 | expert-evaluated responses | Fig.2b-e | Fig.2b |
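The relative deltas in the first row follow directly from the reported means. A short check, using only the mean scores from the table (the `relative_gain` helper is an illustrative name):

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Mean expert quality scores as reported in the results table.
muse, gpt4, lay_high = 8.1, 5.9, 6.3

gain_vs_gpt4 = round(relative_gain(muse, gpt4))      # 37
gain_vs_lay = round(relative_gain(muse, lay_high))   # 29
print(f"MUSE vs GPT-4: +{gain_vs_gpt4}%")
print(f"MUSE vs lay (high helpfulness): +{gain_vs_lay}%")
```

Note that the second row reports a difference in percentage points (pp), which is an absolute gap between two rates rather than a relative gain.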
What To Try In 7 Days
Prototype a retrieval-augmented LLM pipeline: generate search queries from posts, retrieve top pages, and feed extracted evidence into your LLM.
Add an informative image captioning step (OCR + celebrity tags) to make image content LLM-readable.
Run a small A/B user test (100–1,000 users) comparing raw LLM responses vs retrieval-grounded responses and measure belief change.
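The first two steps above can be sketched as a small pipeline skeleton. This is a sketch under stated assumptions, not the MUSE implementation: `correct_post` and its parameters are hypothetical names, and the query generator, search backend, and LLM are passed in as plain callables so that any provider can be plugged in. The image caption is assumed to be produced upstream by a captioning and OCR step.

```python
from typing import Callable, List


def correct_post(
    post_text: str,
    image_caption: str,
    make_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    ask_llm: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Build an evidence-grounded prompt for a post and return the LLM's reply.

    All three callables are placeholders (assumptions) for real components.
    """
    # Make image content readable to a text-only LLM by appending the caption.
    context = post_text
    if image_caption:
        context = f"{post_text}\n[Image: {image_caption}]"

    # Collect the top_k evidence snippets per query, skipping duplicates.
    evidence: List[str] = []
    for query in make_queries(context):
        for snippet in search(query)[:top_k]:
            if snippet not in evidence:
                evidence.append(snippet)

    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(evidence, 1))
    prompt = (
        "Post:\n" + context + "\n\n"
        "Evidence:\n" + numbered + "\n\n"
        "Explain what, if anything, in the post is inaccurate. "
        "Cite evidence by its bracketed number and do not cite anything else."
    )
    return ask_llm(prompt)
```

Numbering the evidence and restricting citations to those numbers makes it easier to audit each response against the retrieved pages. For the A/B test, the same function can be run once with a `search` that returns an empty list (raw LLM) and once with real retrieval.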
Risks & Boundaries
Limitations
No video input support; only text and images handled.
Evaluation is English-only and focused on X Community Notes posts.
When Not To Use
For video-based misinformation without an image snapshot.
In non-English deployments without retraining captioning and retrieval filters.
Failure Modes
Bad or biased retrieval sources lead to incorrect explanations.
Publisher credibility labels (MBFC) may be incomplete or contested.

