MUSE: an LLM + vision + credibility-aware web retrieval system that corrects social media misinformation

March 17, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

4

Authors

Xinyi Zhou, Ashish Sharma, Amy X. Zhang, Tim Althoff

Why It Matters For Business

MUSE can automate credible, auditable misinformation corrections at scale, reducing dependence on slow, expensive human fact-checking and improving user belief accuracy for platforms and publishers.

Summary TLDR

MUSE combines a large language model with image captioning and credibility-aware web retrieval to identify and explain misinformation on social media, backing each correction with grounded links. In an expert evaluation on X Community Notes posts, MUSE outperforms GPT-4 (mean quality 8.1 vs 5.9) and even high-helpfulness lay responses (6.3), nearly eliminates hallucinated links, handles both images and text, and improves end users' ability to spot misinformation by 9.8%. Average runtime is about 2 minutes per post; cost was ≈ $0.5/post at evaluation time (now ≈ $0.2).

Problem Statement

Social media posts often mix true, false, and misleading elements and can include images. Manual corrections scale poorly. Off-the-shelf LLMs are fluent but hallucinate, lack current knowledge, and struggle with images. We need a practical system that finds timely, credible evidence, handles visuals, and produces auditable corrections.

Main Contribution

MUSE, a system that augments an LLM with informative image captioning and credibility-aware web retrieval to produce grounded corrections.

A 13-dimension expert evaluation rubric covering identification, explanation, text quality, and references.

A large expert study on X Community Notes (232 tweets, with comparative responses) plus a 988-person end-user study showing measurable impact.
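The architecture named in the first contribution can be pictured as a short chain of stages. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Post`, `Evidence`, `correct_post`, and the three injected callables (`caption_image`, `search_web`, `generate`) are hypothetical placeholders for whatever captioner, search backend, and LLM client you use, and the prompt wording is illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Evidence:
    url: str
    text: str
    credible: bool  # assumption: set upstream from some publisher rating


@dataclass
class Post:
    text: str
    image_path: Optional[str] = None


def correct_post(
    post: Post,
    caption_image: Callable[[str], str],         # hypothetical captioner
    search_web: Callable[[str], List[Evidence]],  # hypothetical retriever
    generate: Callable[[str], str],               # hypothetical LLM client
    max_evidence: int = 3,
) -> str:
    """Chain the stages: caption -> retrieve -> filter -> generate."""
    # 1. Turn the image (if any) into text the LLM can read.
    description = post.text
    if post.image_path is not None:
        description += "\n[Image] " + caption_image(post.image_path)

    # 2. Retrieve candidate pages and keep only those marked credible.
    candidates = search_web(description)
    evidence = [e for e in candidates if e.credible][:max_evidence]

    # 3. Ask the LLM for a correction grounded in the kept evidence only.
    sources = "\n".join(f"- {e.url}: {e.text}" for e in evidence)
    prompt = (
        "Post:\n" + description
        + "\n\nEvidence:\n" + sources
        + "\n\nIdentify and explain any inaccuracies, citing only the URLs above."
    )
    return generate(prompt)
```

Passing the stages in as callables keeps the sketch testable with stubs and makes it easy to swap the captioner or the LLM without touching the control flow.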

Key Findings

MUSE achieves higher overall expert-rated quality than baselines.

Numbers: Mean overall quality: MUSE 8.1, GPT-4 5.9, laypeople (high) 6.3

MUSE identifies and explains inaccuracies more often and more completely.

Numbers: 89% explicit identification; 61% full identification of all inaccuracies vs GPT-4 38%

MUSE provides far fewer broken or hallucinated reference links than GPT-4.

Numbers: GPT-4: 49% of links page-not-found; MUSE: nearly 0% page-not-found, with 96% of reachable links relevant

MUSE improves real users' ability to recognize misinformation.

Numbers: End-user belief correction improved by 9.8% (from 4.5 to 4.9 on a 1–7 scale)

MUSE is practical to run at scale today.

Numbers: Average runtime ≈ 2 minutes; cost ≈ $0.5/post at evaluation time (now ≈ $0.2)

Results

Expert overall quality (mean ± SD)

Value: MUSE 8.1 ± 2.0; GPT-4 5.9 ± 2.7; lay (high) 6.3 ± 2.0

Explicit identification rate

Value: MUSE 89% explicit identification

Baseline: GPT-4 lower by 16 percentage points

Full identification of all inaccuracies

Value: MUSE 61% vs GPT-4 38% vs lay (high) 26%

Baseline: GPT-4 38%

Reference reachability and relevance

Value: MUSE ≈100% reachable; 96% of reachable links relevant

Baseline: GPT-4 49% page-not-found; 76% of reachable links relevant

End-user belief change

Value: Correct belief score 4.5→4.9 (1–7 scale)

Baseline: Other approaches showed no significant change
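The reference reachability and relevance rows above are simple ratios over the links a system cites. As a rough sketch of how such figures could be tallied (the `LinkCheck` record, the `link_metrics` helper, and the choice to count any 2xx status as reachable are assumptions for illustration, not the evaluation protocol of the paper):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LinkCheck:
    url: str
    status: int                      # HTTP status observed when fetching the link
    relevant: Optional[bool] = None  # human judgment, only for reachable links


def link_metrics(checks: List[LinkCheck]) -> Dict[str, float]:
    """Summarize reachability, page-not-found rate, and relevance."""
    total = len(checks)
    reachable = [c for c in checks if 200 <= c.status < 300]
    not_found = [c for c in checks if c.status == 404]
    # Relevance is measured only over reachable links that were judged.
    judged = [c for c in reachable if c.relevant is not None]
    relevant = [c for c in judged if c.relevant]
    return {
        "reachable_rate": len(reachable) / total if total else 0.0,
        "not_found_rate": len(not_found) / total if total else 0.0,
        "relevant_among_reachable": len(relevant) / len(judged) if judged else 0.0,
    }
```

Note that relevance is conditioned on reachability, which matches how the rows above phrase it ("reachable links relevant"); a system can score well on relevance while still citing many dead links.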

Who Should Care

What To Try In 7 Days

Prototype a retrieval-augmented LLM pipeline: generate search queries from posts, retrieve top pages, and feed extracted evidence into your LLM.

Add an informative image captioning step (OCR + celebrity tags) to make image content LLM-readable.

Run a small A/B user test (100–1,000 users) comparing raw LLM responses vs retrieval-grounded responses and measure belief change.
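For the A/B test in the last step, a first-pass analysis can be done with the Python standard library alone. The sketch below assumes each participant gives one belief rating on a 1–7 scale and that the two arms are independent; the helper names `relative_change` and `welch_t` are illustrative, and Welch's t-statistic is one reasonable choice here, not necessarily the test used in the paper.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def relative_change(control: Sequence[float], treatment: Sequence[float]) -> float:
    """Relative difference in mean rating, treatment vs control."""
    return (mean(treatment) - mean(control)) / mean(control)


def welch_t(control: Sequence[float], treatment: Sequence[float]) -> float:
    """Welch's t-statistic for two independent samples with unequal variances."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    return (mean(treatment) - mean(control)) / standard_error
```

Call both helpers with the per-user ratings from the raw-LLM arm as `control` and the retrieval-grounded arm as `treatment`. A large absolute t-statistic suggests the difference is unlikely to be noise; for a p-value you would still need the t distribution, for example from a statistics package.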

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No video input support; only text and images handled.
  • Evaluation is English-only and focused on X Community Notes posts.
  • System depends on external publisher ratings (MBFC) and commercial LLMs, which affects transparency and cost.
  • Possible selection biases in Community Notes data; not fully representative of all misinformation.

When Not To Use

  • For video-based misinformation without an image snapshot.
  • In non-English deployments without retraining captioning and retrieval filters.
  • When sub-minute response is required and two-minute runtime is too slow.

Failure Modes

  • Bad or biased retrieval sources lead to incorrect explanations.
  • Publisher credibility labels (MBFC) may be incomplete or contested.
  • Hallucinated evidence if low-relevance pages are included or filters fail.
  • Performance may drop for niche or obscure claims not covered by retrieved sources.
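Several of these failure modes come down to what the retrieval filter lets through. A minimal sketch of such a filter follows, assuming a domain-to-label lookup table and a per-page relevance score in [0, 1]; the label values (`"high"`, `"low"`), the 0.5 threshold, and the names `Page` and `filter_pages` are hypothetical and do not reflect the MBFC schema or the thresholds used by MUSE.

```python
from typing import Dict, List, NamedTuple, Sequence
from urllib.parse import urlparse


class Page(NamedTuple):
    url: str
    relevance: float  # similarity score from some retriever, assumed in [0, 1]


def filter_pages(
    pages: Sequence[Page],
    credibility: Dict[str, str],          # domain -> label (hypothetical labels)
    allowed: Sequence[str] = ("high",),
    min_relevance: float = 0.5,
) -> List[Page]:
    """Keep pages from allowed publishers that clear the relevance bar."""
    kept = []
    for page in pages:
        domain = urlparse(page.url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        # Unknown publishers are dropped, since labels may be incomplete.
        if credibility.get(domain) not in allowed:
            continue
        # Low-relevance pages are dropped to limit off-topic evidence.
        if page.relevance < min_relevance:
            continue
        kept.append(page)
    # Most relevant evidence first.
    return sorted(kept, key=lambda p: p.relevance, reverse=True)
```

Dropping unrated publishers is the conservative choice, but it worsens coverage for niche claims; logging what was filtered out, and why, makes it easier to audit contested labels later.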

Core Entities

Models

  • GPT-4
  • Llama-3 (70B)
  • BLIP-2

Metrics

  • Expert overall quality score (0-10)
  • Identification/comprehensiveness percentages
  • Reference reachability and relevance
  • End-user belief change (1-7 scale)

Datasets

  • X Community Notes (tweets and lay responses)

Context Entities

Models

  • msmarco-distilbert-base-tas-b
  • facebook/dino-vitb8
  • BLIP-2

Datasets

  • MBFC publisher ratings